-- __Clustering__
+- __Clustering__
+
+ ______________________________________________________________________
- ---
Clustering/grouping of similar data points:
- Customer segmentation in marketing (like in the example above)
@@ -264,11 +260,12 @@ complex data, speed up algorithms and improve model performance.
- Product recommendations
- ...
-- __Dimensionality Reduction__
+- __Dimensionality Reduction__
+
+ ______________________________________________________________________
- ---
Reducing data complexity:
-
+
- Feature extraction from high-dimensional data
- Visualization of complex datasets
- Noise reduction in signals
@@ -283,26 +280,26 @@ complex data, speed up algorithms and improve model performance.
answers to compare against. The value of the results often depends on how
meaningful the discovered patterns are for the specific application.
----
+______________________________________________________________________
???+ tip "Domain knowledge"
- No matter if you're dealing with supervised or unsupervised learning,
- domain knowledge is crucial. Understanding the data and the problem you're
- trying to solve will help you select the right algorithms, features and
- interpret the results.
+ No matter if you're dealing with supervised or unsupervised learning, domain
+ knowledge is crucial. Understanding the data and the problem you're trying to
+ solve will help you select the right algorithms and features and interpret
+ the results.
## Recap
-This chapter introduced two fundamental concepts in machine learning,
+This chapter introduced two fundamental concepts in machine learning,
supervised and unsupervised learning:
| Concept | Data | Task | Goal |
-|---------------------------|------------------------|--------------------------|---------------------------|
+| ------------------------- | ---------------------- | ------------------------ | ------------------------- |
| **Supervised Learning** | Labeled (\(X\), \(y\)) | Regression | Predict continuous values |
| | | Classification | Predict categories |
| **Unsupervised Learning** | Unlabeled (\(X\)) | Clustering | Group similar data |
| | | Dimensionality Reduction | Reduce data complexity |
-The following chapters will cover algorithms for each task with theory and
-practical examples.
\ No newline at end of file
+The following chapters will cover algorithms for each task with theory and
+practical examples.
diff --git a/docs/data-science/algorithms/supervised/classification.md b/docs/data-science/algorithms/supervised/classification.md
index a4b9a818..58d11582 100644
--- a/docs/data-science/algorithms/supervised/classification.md
+++ b/docs/data-science/algorithms/supervised/classification.md
@@ -4,10 +4,10 @@
While linear regression helps us predict continuous values, other real-world
problems require predicting categorical outcomes: Will a customer subscribe to
-a term deposit? Is an email spam? Is a transaction fraudulent?
-Logistic regression addresses these binary classification problems by extending
-the concepts we learned in linear regression to predict probabilities between
-0 and 1.
+a term deposit? Is an email spam? Is a transaction fraudulent? Logistic
+regression addresses these binary classification problems by extending the
+concepts we learned in linear regression to predict probabilities between 0 and
+1\.
We will cover the theory and apply logistic regression to the breast cancer
dataset to predict whether a tumor is malignant or benign.
@@ -18,22 +18,21 @@ dataset to predict whether a tumor is malignant or benign.
The theoretical part is adapted from:
- ^^Daniel Jurafsky and James H. Martin. 2025. Speech and Language
- Processing: *An Introduction to Natural Language Processing, Computational
- Linguistics, and Speech Recognition with Language Models*[^1]^^
+ ^^Daniel Jurafsky and James H. Martin. 2025. Speech and Language Processing:
+ *An Introduction to Natural Language Processing, Computational Linguistics, and
+ Speech Recognition with Language Models*[^1]^^
- [^1]:
- 3rd edition. Online manuscript released January 12, 2025.
- [https://web.stanford.edu/~jurafsky/slp3](https://web.stanford.edu/~jurafsky/slp3)
+ [^1]: 3rd edition. Online manuscript released January 12, 2025.
+ [https://web.stanford.edu/~jurafsky/slp3](https://web.stanford.edu/~jurafsky/slp3)
#### Deja vu: Linear regression
-Just like in linear regression, we have a set of features \(x_1, x_2, ..., x_n\)
-describing an outcome \(y\). But instead of predicting a continuous value, \(y\)
-is binary: 0 or 1.
+Just like in linear regression, we have a set of features
+\(x_1, x_2, ..., x_n\) describing an outcome \(y\). But instead of predicting a
+continuous value, \(y\) is binary: 0 or 1.
Similar to linear regression, logistic regression uses a linear combination of
-the features to predict the outcome. I.e., each feature is assigned a
+the features to predict the outcome. I.e., each feature is assigned a
**weight**, and a **bias term** is added at the end.
???+ defi "Linear combination"
@@ -42,18 +41,18 @@ the features to predict the outcome. I.e., each feature is assigned a
z = b_1 \cdot x_1 + b_2 \cdot x_2 + ... + b_n \cdot x_n + a
\]
-with \(a\) being the bias term and \(b_1, b_2, ..., b_n\) the weights.
-"The resulting single number \(z\) expresses the weighted sum
-of the evidence for the class." (Jurafsky & Martin, 2025 p. 79)
-Bias, weights and the intercept are all real numbers.
+with \(a\) being the bias term and \(b_1, b_2, ..., b_n\) the weights. "The
+resulting single number \(z\) expresses the weighted sum of the evidence for
+the class." (Jurafsky & Martin, 2025, p. 79) The bias and the weights are all
+real numbers.
-So far, logistic regression is the same as linear regression with the sole
-difference that in [linear regression](regression.md#linear-regression) we
-referred to the bias \(a\) as the intercept, and the
-weights \(b_1, b_2, ..., b_n\) as coefficients or slope.
+So far, logistic regression is the same as linear regression with the sole
+difference that in [linear regression](regression.md#linear-regression) we
+referred to the bias \(a\) as the intercept, and the weights
+\(b_1, b_2, ..., b_n\) as coefficients or slope.
-However, \(z\) is not the final prediction, since it can take real values
-and in fact ranges from \(-\infty\) to \(+\infty\). Thus, \(z\) needs to be
+However, \(z\) is not the final prediction, since it can take any real value
+from \(-\infty\) to \(+\infty\). Thus, \(z\) needs to be
transformed to a probability between 0 and 1. This is where the sigmoid
function comes into play.
@@ -68,8 +67,8 @@ uses the sigmoid (or logistic) function to transform \(z\) into a probability
\sigma(z) = \frac{1}{1 + e^{-z}}
\]
- The sigmoid function takes the real number \(z\) and transforms it to the
- range (0,1).
+ The sigmoid function takes the real number \(z\) and transforms it to the range
+ (0,1).
-For given input features \(x_1, x_2, ..., x_n\), we can calculate the
-linear combination \(z\) and then apply the sigmoid function to get the
-probability of the outcome.
-To compute the probability of the outcome being 1
-:fontawesome-solid-arrow-right: \(P(y=1|x)\), for example
-if an email is spam, we have to set a decision boundary.
+For given input features \(x_1, x_2, ..., x_n\), we can calculate the linear
+combination \(z\) and then apply the sigmoid function to get the probability of
+the outcome being 1 :fontawesome-solid-arrow-right: \(P(y=1|x)\), for example
+the probability that an email is spam. To turn this probability into a class
+prediction, we have to set a decision boundary.
???+ defi "Decision boundary"
If \(\sigma(z) \gt 0.5\), we predict \(y=1\), otherwise \(y=0\).
-For instance, if the probability of an email being spam is 0.7, we predict
-that the email is spam \((0.7 \gt 0.5)\). With a probability of 0.4, we
-predict that the email is *not* spam \((0.4 \le 0.5)\).
+For instance, if the probability of an email being spam is 0.7, we predict that
+the email is spam \((0.7 \gt 0.5)\). With a probability of 0.4, we predict that
+the email is *not* spam \((0.4 \le 0.5)\).
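The three steps above (linear combination, sigmoid, decision boundary) can be sketched in a few lines of plain Python. This is only an illustration: the weights, bias and feature values are made up, and `predict_label` is a hypothetical helper, not part of `scikit-learn`.

```python
import math


def sigmoid(z: float) -> float:
    """Map a real number z to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_label(features, weights, bias, threshold=0.5):
    """Return (probability, label) for a single observation."""
    # Linear combination: z = b_1 * x_1 + ... + b_n * x_n + a
    z = sum(b * x for b, x in zip(weights, features)) + bias
    probability = sigmoid(z)
    # Decision boundary: predict 1 only if the probability exceeds the threshold
    label = 1 if probability > threshold else 0
    return probability, label


# Hypothetical weights, bias and feature values, chosen only for illustration
weights = [0.8, -0.5]
bias = 0.1
features = [2.0, 1.0]

probability, label = predict_label(features, weights, bias)
print(f"P(y=1|x) = {probability:.4f}, predicted label: {label}")
```

With these made-up numbers, \(z = 1.2\), so the sketch prints a probability of about 0.7685 and the label 1.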
#### The optimization problem
-But how do we find the best parameter combination (weights and bias) for our
-logistic regression model?
-Unlike linear regression, which uses ordinary least squares, logistic
-regression typically uses Maximum Likelihood Estimation (MLE), i.e., the best
-parameters (weights and bias) that maximize the likelihood of the observed
-data.
+But how do we find the best parameter combination (weights and bias) for our
+logistic regression model? Unlike linear regression, which uses ordinary least
+squares, logistic regression typically uses Maximum Likelihood Estimation
+(MLE), i.e., it searches for the parameters (weights and bias) that maximize
+the likelihood of the observed data.
Lo and behold, even more math...
-For optimization purposes we use the negative log-likelihood as our loss
+For optimization purposes we use the negative log-likelihood as our loss
function:
???+ defi "Negative log-likelihood"
@@ -119,7 +116,7 @@ function:
\]
with:
-
+
- \(m\) as the number of training examples
- \(y_i\) being the actual class (0 or 1)
- \(\sigma(z_i)\) is the predicted probability using the sigmoid function
@@ -127,43 +124,40 @@ function:
???+ tip
- Intuitively speaking, the loss function penalizes the model for making
- wrong predictions. If the model predicts a probability of 0.9 for a
- spam email, and the email is actually spam (\(y=1\)), the loss is small.
- On the other hand, if the model predicts a probability of 0.1 for a
- spam email, and the email is spam (\(y=1\)), the loss will be high.
+ Intuitively speaking, the loss function penalizes the model for making wrong
+ predictions. If the model predicts a probability of 0.9 for a spam email, and
+ the email is actually spam (\(y=1\)), the loss is small. On the other hand, if
+ the model predicts a probability of 0.1 for a spam email, and the email is spam
+ (\(y=1\)), the loss will be high.
+
+ The weights are gradually adjusted to minimize the loss. Think of it like
+ turning knobs slowly until we get better predictions.
- The weights are gradually adjusted to minimize the loss.
- Think of it like turning knobs slowly until we get better predictions.
-
- Gradually adjusting these knobs to minimize the loss is referred to as
- gradient descent.
+ Gradually adjusting these knobs to minimize the loss is referred to as gradient
+ descent.
-Conveniently, `scikit-learn` provides a logistic regression implementation
-that takes care of the optimization for us. Finally, we look at a
-practical example to see logistic regression in action.
+Conveniently, `scikit-learn` provides a logistic regression implementation that
+takes care of the optimization for us. Finally, we look at a practical example
+to see logistic regression in action.
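Before moving on, the intuition behind the loss can be checked numerically. The sketch below assumes the usual averaged form of the negative log-likelihood (binary cross-entropy). The function name `negative_log_likelihood` and the inputs are illustrative only.

```python
import math


def negative_log_likelihood(y_true, y_prob):
    """Average negative log-likelihood for binary labels.

    y_true holds the actual classes (0 or 1), y_prob the predicted
    probabilities P(y=1|x), which must lie strictly between 0 and 1.
    """
    total = 0.0
    for y, p in zip(y_true, y_prob):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# The email is spam (y=1) in both cases; only the predicted probability differs
loss_good_prediction = negative_log_likelihood([1], [0.9])
loss_bad_prediction = negative_log_likelihood([1], [0.1])

print(f"Loss for p=0.9: {loss_good_prediction:.4f}")
print(f"Loss for p=0.1: {loss_bad_prediction:.4f}")
```

The confident correct prediction yields a loss of roughly 0.1054, while the confident wrong one yields roughly 2.3026, which is why minimizing the loss pushes the model toward better probabilities.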
## Example
Let's apply logistic regression to the breast cancer dataset, a classic binary
-classification problem where we need to predict whether a tumor is *malignant
+classification problem where we need to predict whether a tumor is *malignant
or benign* based on various features.
With class labels \(y\) being 0 (malignant) or 1 (benign), we can use logistic
-regression to predict the probability of a tumor being benign. The features
+regression to predict the probability of a tumor being benign. The features
were calculated from digitized images of a breast mass.
???+ info
- See the [UCI Machine Learning Repository](https://doi.org/10.24432/C5DW2B)
- for more information on the data set.[^2]
-
- [^2]:
- Wolberg, W., Mangasarian, O., Street, N., & Street, W. (1993).
- Breast Cancer Wisconsin (Diagnostic) [Dataset]. UCI Machine
- Learning Repository.
- [https://doi.org/10.24432/C5DW2B](https://doi.org/10.24432/C5DW2B).
+ See the [UCI Machine Learning Repository](https://doi.org/10.24432/C5DW2B) for
+ more information on the data set.[^2]
+ [^2]: Wolberg, W., Mangasarian, O., Street, N., & Street, W. (1993). Breast
+ Cancer Wisconsin (Diagnostic) [Dataset]. UCI Machine Learning Repository.
+ [https://doi.org/10.24432/C5DW2B](https://doi.org/10.24432/C5DW2B).
### Load the data
@@ -190,10 +184,9 @@ tumors.
???+ tip
- Just like in the previous chapter, the data is divided into `X`, containing
- the attributes and `y` holding the corresponding labels. Having attributes
- and labels separated, makes life a bit easier when training and testing the
- model.
+ Just like in the previous chapter, the data is divided into `X`, containing the
+ attributes, and `y`, holding the corresponding labels. Keeping attributes and
+ labels separate makes life a bit easier when training and testing the model.
???+ question "Number of features"
@@ -206,13 +199,14 @@ How many features (attributes) does the breast cancer dataset have?
- [ ] 32
`X.shape` reveals that we are dealing with 30 features.
+
### Split the data
Before training our model, we want to split our data into two parts. Just like
-in the previous chapter, we perform a 80/20 split, i.e., we use 80% to train
+in the previous chapter, we perform an 80/20 split, i.e., we use 80% to train
the model and evaluate it on the remaining 20%.
```python
@@ -225,9 +219,9 @@ X_train, X_test, y_train, y_test = train_test_split(
???+ tip
- If you need a refresh on the parameters used in `train_test_split()`
- revisit, the [Split the data](regression.md#split-the-data) section from
- the previous chapter.
+ If you need a refresher on the parameters used in `train_test_split()`,
+ revisit the [Split the data](regression.md#split-the-data) section from the
+ previous chapter.
### Train the model
@@ -240,20 +234,20 @@ model = LogisticRegression(random_state=42, max_iter=5_000) # (1)!
model.fit(X_train, y_train)
```
-1. The `random_state` parameter ensures reproducibility, while
- `max_iter` specifies the maximum number of iterations taken for the solver
- to converge (i.e., solving the optimization problem to find the best
+1. The `random_state` parameter is set for reproducibility (it only affects
+ solvers that shuffle the data, such as `"sag"`, `"saga"` or `"liblinear"`),
+ while `max_iter` caps the number of iterations the solver may take to
+ converge (i.e., solving the optimization problem to find the best
parameter combination).
`#!python model=LogisticRegression(...)` creates an instance of the logistic
-regression model. Only after calling the `fit()` method, the `model` is
-actually trained. Since we separated attributes and labels into `X_train` and
-`y_train` respectively, we can directly call the method without any
-further data handling.
+regression model. The `model` is only trained once the `fit()` method is
+called. Since we separated attributes and labels into `X_train` and
+`y_train` respectively, we can directly call the method without any further
+data handling.
#### Weights and bias
-With a trained model at hand, we can look at the weights \((b_1, b_2, ...,
+With a trained model at hand, we can look at the weights \((b_1, b_2, ...,
b_n)\) and bias \((a)\).
```python
@@ -264,43 +258,43 @@ print(f"Model weights: {model.coef_}")
Model weights: [[ 0.98208299 0.22519686 -0.36688444 0.0262268 ... ]]
```
-The `coef_` attribute contains the weight for each feature.
+The `coef_` attribute contains the weight for each feature.
[As discussed](#deja-vu-linear-regression), the weights are real numbers.
???+ warning "You might not have the exact same results"
- Your model weights might differ slightly from the ones shown above.
- This is completely normal and happens because:
+ Your model weights might differ slightly from the ones shown above. This is
+ completely normal and happens because:
- **Numerical precision**: The default optimization solver
- (`#!python "lbfgs"`) behind `LogisticRegression` encounters tiny
- hardware-specific variations. The underlying libraries handle
- floating-point arithmetic differently across hardware platforms. During the
- iterative optimization, these tiny rounding differences accumulate,
- causing the solver to converge to slightly different solutions.
+ **Numerical precision**: The default optimization solver (`#!python "lbfgs"`)
+ behind `LogisticRegression` encounters tiny hardware-specific variations. The
+ underlying libraries handle floating-point arithmetic differently across
+ hardware platforms. During the iterative optimization, these tiny rounding
+ differences accumulate, causing the solver to converge to slightly different
+ solutions.
- :fontawesome-solid-lightbulb: These small differences don't affect your
- model's predictions or accuracy.
+ :fontawesome-solid-lightbulb: These small differences don't affect your model's
+ predictions or accuracy.
Now, it's your turn to look at the bias.
???+ question "Model bias"
1. Open the `scikit-learn` docs on the
- [`LogisticRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
- class.
- 2. Find out how to access the bias term of the model.
- 3. Simply print the bias term of the model.
+ [`LogisticRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
+ class.
+ 1. Find out how to access the bias term of the model.
+ 1. Simply print the bias term of the model.
- :fontawesome-solid-lightbulb: Remember, the bias is often referred to as
+ :fontawesome-solid-lightbulb: Remember, the bias is often referred to as
intercept.
### Predictions
-Since, the main purpose of a machine learning model is to make predictions,
-we will do just that.
+Since the main purpose of a machine learning model is to make predictions, we
+will do just that.
-Predicting, is as simple as using the `predict()` method. We will use the
+Predicting is as simple as using the `predict()` method. We will use the
patient measurements of the test set - `X_test`.
```python
@@ -314,25 +308,24 @@ print(y_pred[:5])
[1 0 0 1 1]
```
-Congratulations, you just build a machine learning model to predict breast
-cancer. But how good is the model? To conclude the chapter, we will briefly
+Congratulations, you just built a machine learning model to predict breast
+cancer. But how good is the model? To conclude the chapter, we will briefly
evaluate the model's performance.
### Evaluate the model
-Surely, we could just manually compare the predictions (`y_pred`) with the
-actual labels (`y_test`) and evaluate how often the model was correct. Or
+Surely, we could just manually compare the predictions (`y_pred`) with the
+actual labels (`y_test`) and evaluate how often the model was correct. Or
instead, we can leverage another method called `score()`.
```python
score = model.score(X_test, y_test)
```
-First, the `score()` method takes `X_test` and makes the corresponding
-predictions and programmatically compares the predictions with the actual
-labels `y_test`. `score()` returns the accuracy
-:fontawesome-solid-arrow-right: the proportion of correctly
-classified instances.
+First, the `score()` method takes `X_test` and makes the corresponding
+predictions and programmatically compares the predictions with the actual
+labels `y_test`. `score()` returns the accuracy :fontawesome-solid-arrow-right:
+the proportion of correctly classified instances.
```python
print(f"Model accuracy: {round(score, 4)}")
@@ -342,33 +335,33 @@ print(f"Model accuracy: {round(score, 4)}")
Model accuracy: 0.9561
```
-In our case, the model correctly classified 95.61% of the test set. In
-other words, in 95.61% of instances, the model was able to correctly predict
-if a tumor is malignant or benign.
+In our case, the model correctly classified 95.61% of the test set. In other
+words, in 95.61% of instances, the model was able to correctly predict if a
+tumor is malignant or benign.
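What `score()` reports can also be reproduced by hand, which makes the definition of accuracy concrete. The sketch below uses two short, made-up label lists instead of the real `y_test` and `y_pred`. The helper `accuracy` is hypothetical and only mirrors the idea, not the `scikit-learn` implementation.

```python
def accuracy(y_true, y_pred):
    """Proportion of predictions that match the actual labels."""
    correct = sum(1 for actual, predicted in zip(y_true, y_pred) if actual == predicted)
    return correct / len(y_true)


# Made-up labels for illustration: 6 of the 8 predictions are correct
y_test_example = [1, 0, 0, 1, 1, 0, 1, 1]
y_pred_example = [1, 0, 0, 1, 1, 1, 1, 0]

manual_accuracy = accuracy(y_test_example, y_pred_example)
print(f"Manual accuracy: {manual_accuracy}")
```

Here 6 out of 8 labels agree, so the sketch reports an accuracy of 0.75. Applied to the real `y_test` and `y_pred`, the same calculation should match the value returned by `score()`.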
???+ tip
As the test set (both attributes and labels) was never used to train the
- model, the accuracy is a good indicator of how well the model generalizes
- to unseen data.
+ model, the accuracy is a good indicator of how well the model generalizes to
+ unseen data.
## Recap
We covered logistic regression, a popular algorithm for binary classification.
-Upon discussing the theory, we discovered similarities to linear regression
-in regard to the linear combination of features. With the help of the
-sigmoid function, we transformed the linear combination into probabilities
-between 0 and 1.
+Upon discussing the theory, we discovered similarities to linear regression in
+regard to the linear combination of features. With the help of the sigmoid
+function, we transformed the linear combination into probabilities between 0
+and 1.
-Subsequently, we trained a logistic regression model on the breast cancer
-data to predict whether a tumor is malignant or benign. To evaluate the
-model we split the data and finally calculated the accuracy.
+Subsequently, we trained a logistic regression model on the breast cancer data
+to predict whether a tumor is malignant or benign. To evaluate the model we
+split the data and finally calculated the accuracy.
???+ info
- In subsequent chapters we will explore more sophisticated ways to split
- data and evaluate models.
+ In subsequent chapters we will explore more sophisticated ways to split data
+ and evaluate models.
Next up, we will dive into algorithms, like decision trees and random forest,
that can handle both regression and classification problems.
diff --git a/docs/data-science/algorithms/supervised/regression.md b/docs/data-science/algorithms/supervised/regression.md
index d005d629..87b35cbf 100644
--- a/docs/data-science/algorithms/supervised/regression.md
+++ b/docs/data-science/algorithms/supervised/regression.md
@@ -2,14 +2,14 @@
## Linear Regression
-In machine learning, we often want to predict continuous numerical values, like
-house prices, temperatures or sales figures. Linear regression also knows as
-Ordinary Least Squares (OLS) provides a foundational approach to this problem
-by modeling the relationship between input variables and a target variable
+In machine learning, we often want to predict continuous numerical values, like
+house prices, temperatures or sales figures. Linear regression, also known as
+Ordinary Least Squares (OLS), provides a foundational approach to this problem
+by modeling the relationship between input variables and a target variable
using a straight line.
-This chapter introduces linear regression through a hands-on example.
-You'll learn to:
+This chapter introduces linear regression through a hands-on example. You'll
+learn to:
- Build and train a linear regression model
- Interpret model parameters (intercept and coefficients)
@@ -17,30 +17,21 @@ You'll learn to:
- Evaluate model performance using the coefficient of determination (\(R^2\))
- Get familiar with the `scikit-learn` workflow to train and evaluate models
----
+______________________________________________________________________
???+ info
This chapter adapts and expands upon:
- ^^scikit-learn: *Ordinary Least Squares and Ridge Regression*[^1]^^
-
- ^^scikit-learn: *Linear Models*[^2]^^
-
- ^^scikit-learn: *Metrics and scoring: quantifying the quality of predictions*[^3]^^
-
- [^1]:
- [https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols_ridge.html](https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols_ridge.html)
- [^2]:
- [https://scikit-learn.org/stable/modules/linear_model.html](https://scikit-learn.org/stable/modules/linear_model.html)
- [^3]:
- [https://scikit-learn.org/stable/modules/model_evaluation.html#r2-score-the-coefficient-of-determination](https://scikit-learn.org/stable/modules/model_evaluation.html#r2-score-the-coefficient-of-determination)
+ - ^^scikit-learn: *[Ordinary Least Squares and Ridge Regression](https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols_ridge.html)*^^
+ - ^^scikit-learn: *[Linear Models](https://scikit-learn.org/stable/modules/linear_model.html)*^^
+ - ^^scikit-learn: *[Metrics and scoring: quantifying the quality of predictions](https://scikit-learn.org/stable/modules/model_evaluation.html#r2-score-the-coefficient-of-determination)*^^
## Theory
-Linear regression, also known as Ordinary Least Squares (OLS), models the
-relationship between a continuous target variable \(y\) and one or more input
-variables \(X\). The goal is to find the best linear function that predicts
+Linear regression, also known as Ordinary Least Squares (OLS), models the
+relationship between a continuous target variable \(y\) and one or more input
+variables \(X\). The goal is to find the best linear function that predicts
\(\hat{y}\) from \(X\).
???+ defi "Linear combination"
@@ -50,15 +41,15 @@ variables \(X\). The goal is to find the best linear function that predicts
\]
where:
-
+
- \(w_0\) is the **intercept** (bias term)
- \(w_1, w_2, ..., w_n\) are the **coefficients** (weights)
- \(x_1, x_2, ..., x_n\) are the input features
-The term "Ordinary Least Squares" refers to the optimization objective,
-finding the weights \(w_0, w_1, ..., w_n\) that minimize the sum of squared
-differences called residuals between the actual values \(y\) and predicted
-values \(\hat{y}\).
+The term "Ordinary Least Squares" refers to the optimization objective: finding
+the weights \(w_0, w_1, ..., w_n\) that minimize the sum of squared residuals,
+i.e., the squared differences between the actual values \(y\) and predicted
+values \(\hat{y}\).
???+ defi "Cost function"
@@ -68,28 +59,27 @@ values \(\hat{y}\).
where \(n\) is the number of observations.
-This minimization ensures that our model makes the smallest possible errors
-on average when predicting the training data. Let's look at an example.
+This minimization ensures that our model makes the smallest possible errors on
+average when predicting the training data. Let's look at an example.
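As a warm-up, the cost function can be evaluated by hand for a single input variable. The sketch below compares two candidate lines on a tiny, made-up data set. The function name `sum_of_squared_residuals` and all numbers are illustrative assumptions.

```python
def sum_of_squared_residuals(x_values, y_values, intercept, coefficient):
    """Sum of squared differences between actual and predicted values."""
    total = 0.0
    for x, y in zip(x_values, y_values):
        y_hat = intercept + coefficient * x  # prediction of the candidate line
        total += (y - y_hat) ** 2
    return total


# Tiny made-up data set, chosen only for illustration
x_values = [1.0, 2.0, 3.0, 4.0]
y_values = [1.0, 2.0, 2.0, 4.0]

# Two candidate lines: one following the trend, one flat
cost_trend_line = sum_of_squared_residuals(x_values, y_values, intercept=0.0, coefficient=0.9)
cost_flat_line = sum_of_squared_residuals(x_values, y_values, intercept=2.0, coefficient=0.0)

print(f"Cost of trend line: {cost_trend_line:.2f}")
print(f"Cost of flat line:  {cost_flat_line:.2f}")
```

The line that follows the trend has a cost of about 0.70, the flat line a cost of 5.00. OLS searches over all possible intercepts and coefficients for the line with the lowest such cost.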
## Example
-`scikit-learn` provides a couple of data sets for download. To fit a linear
+`scikit-learn` provides a couple of data sets for download. To fit a linear
regression on a real-world example, we choose the California housing data set.
-More information about the California Housing data set can be found
+More information about the California Housing data set can be found
[here](https://scikit-learn.org/stable/datasets/real_world.html#california-housing-dataset).
???+ info
Data reference:
- ^^Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,
- Statistics and Probability Letters, 33:291-297, 1997^^
+ ^^Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions, Statistics
+ and Probability Letters, 33:291-297, 1997^^
-Our objective is to model the target variable \(y\) using input variables
-\(X\). In this case, \(y\) corresponds to the median house value, expressed in
-hundreds of thousands of dollars ($100,000).
-Below figure shows all houses in California colored by their median value
-\(y\).
+Our objective is to model the target variable \(y\) using input variables
+\(X\). In this case, \(y\) corresponds to the median house value, expressed in
+hundreds of thousands of dollars ($100,000). The figure below shows all houses
+in California colored by their median value \(y\).
-- __Scatter Plot__
-
- ---
-
- Looking at the scatter plot, you might intuitively imagine drawing a straight
- line through the points that best captures the trend. This intuition is
- exactly what OLS does mathematically, it finds the optimal line that minimizes
- the distance between the line and all data points. :point_down:
-
--
-
-
+- __Scatter Plot__
+
+ ______________________________________________________________________
+
+ Looking at the scatter plot, you might intuitively imagine drawing a straight
+ line through the points that best captures the trend. This intuition is
+ exactly what OLS formalizes mathematically: it finds the line that minimizes
+ the squared vertical distances between the line and all data points. :point_down:
+
+-
+
+
--
-
-
+-
+
+
-- __Best-Fit Line__
+- __Best-Fit Line__
+
+ ______________________________________________________________________
- ---
+ The OLS model finds the line that minimizes the sum of squared residuals, the
+ vertical distances between each point and the line. Recall from the theory
+ section that this is exactly what the cost function measures:
- The OLS model finds the line that minimizes the sum of squared residuals,
- the vertical distances between each point and the line. Recall from the
- theory section that this is exactly what the cost function measures:
-
\[
\text{min} \quad \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
\]
@@ -237,8 +228,8 @@ plt.show()
### Train the model
-Our next step is to train an OLS model to automatically find this "best-fit"
-line. Remember, since we have one input variable, the linear combination
+Our next step is to train an OLS model to automatically find this "best-fit"
+line. Remember, since we have one input variable, the linear combination
simplifies to:
\[
@@ -260,8 +251,8 @@ from sklearn.linear_model import LinearRegression
model = LinearRegression()
```
-At this point, the model is not trained, however that can be easily done
-using the `fit()` method. Remember, to use the training set
+At this point, the model is not yet trained. Training is easily done using
+the `fit()` method. Remember to use the training set:
```python
model.fit(X=X_train[["MedInc"]], y=y_train)
@@ -269,7 +260,7 @@ model.fit(X=X_train[["MedInc"]], y=y_train)
#### Intercept and coefficient
-After training, we can inspect the model's learned parameters. The intercept
+After training, we can inspect the model's learned parameters, i.e., the
intercept and coefficient that define the best-fit line:
```python
@@ -290,15 +281,15 @@ These values tell us that our linear model is:
**Interpretation:**
-- **Intercept (0.4446)**: The baseline house value (when *MedInc* is zero)
- ~ $44,460
+- **Intercept (0.4446)**: The baseline house value (when *MedInc* is zero)
+ is around $44,460
- **Coefficient (0.4193)**: For each unit increase in *MedInc*, the house value
increases by ~ $41,930
### Predictions
-Now that the model is trained, we can predict house prices for new observations.
-Let's predict the price \(\hat{y}\) for a house in an area where
+Now that the model is trained, we can predict house prices for new
+observations. Let's predict the price \(\hat{y}\) for a house in an area where
*MedInc* is `#!python 3.5`:
```python
@@ -319,7 +310,8 @@ The model predicts a house value of approximately **$191,230**.
#### Manual validation
-We can verify this prediction using our linear equation. Substituting \(x_1 = 3.5\):
+We can verify this prediction using our linear equation. Substituting
+\(x_1 = 3.5\):
\[
\begin{align}
@@ -332,21 +324,21 @@ This matches our model's prediction!
???+ question "Practice: Make your own prediction"
- Calculate the predicted house price for an area where *MedInc* is
+ Calculate the predicted house price for an area where *MedInc* is
`#!python 5.0`.
-
+
1. Use `#!python model.predict()` to get the prediction.
- 2. Validate it by hand using the linear equation.
- 3. Do the results match?
+ 1. Validate it by hand using the linear equation.
+ 1. Do the results match?
### Evaluate the model
-Now we can make predictions, but we don't know how accurate they actually are.
-We need to quantify the model's performance to determine if it generalizes
-well to new, unseen data.
+Now we can make predictions, but we don't know how accurate they actually are.
+We need to quantify the model's performance to determine if it generalizes well
+to new, unseen data.
-Remember we set aside our test set earlier? This is where we use it. By
-evaluating on data the model hasn't seen during training, we get an honest
+Remember we set aside our test set earlier? This is where we use it. By
+evaluating on data the model hasn't seen during training, we get an honest
assessment of its predictive power.
To measure the model's performance, we'll use the coefficient of determination.
@@ -357,7 +349,7 @@ To measure the model's performance, we'll use the coefficient of determination.
This section focuses on the definition implemented by `scikit-learn`.
-The coefficient of determination, known as the \(R^2\) score, measures the
+The coefficient of determination, known as the \(R^2\) score, measures the
proportion of variance in the target variable that is explained by the model.
???+ defi "\(R^2\) Score"
@@ -391,48 +383,47 @@ r2 = r2_score(y_true=y_test, y_pred=y_pred)
print(f"R² Score: {round(r2, 4)}")
```
-``` title=">>> Output"
+```title=">>> Output"
R² Score: 0.4589
```
???+ tip "Understanding \(R^2\)"
- An \(R^2\) score of 0.4589 means the model explains 45.89% of the variance
- in house prices using only median income. While this is informative, it's
- not great. It suggests that other factors (location, house size, etc.)
+ An \(R^2\) score of 0.4589 means the model explains 45.89% of the variance in
+ house prices using only median income. While this is informative, it's not
+ great. It suggests that other factors (location, house size, etc.)
significantly influence house prices.
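
To make the score less of a black box, here is a minimal sketch that recomputes \(R^2\) by hand as one minus the ratio of the residual sum of squares to the total sum of squares, and compares the result with `r2_score`. The small arrays are made-up illustration values, not the housing data.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up values for illustration only (not taken from the housing data)
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 4.0])

ss_res = np.sum((y_true - y_pred) ** 2)  # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(f"Manual R²:  {round(r2_manual, 4)}")
print(f"sklearn R²: {round(r2_score(y_true=y_true, y_pred=y_pred), 4)}")
```

Both lines should print the same number, since `r2_score` implements this ratio.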
???+ question "Find a better model"
- Can you improve the \(R^2\) score? Fit new models and experiment with the
+ Can you improve the \(R^2\) score? Fit new models and experiment with the
following:
**Model variations:**
-
- - Use different individual input variables (e.g., *HouseAge*, *AveRooms*,
+
+ - Use different individual input variables (e.g., *HouseAge*, *AveRooms*,
*AveBedrms*)
- Use a combination of multiple input variables
- Compare single-variable vs. multi-variable models
-
+
**Data preparation:**
-
+
- Adjust the train-test split ratio
- Remember to use `#!python random_state` for reproducibility
-
+
**Analysis:**
-
+
- Calculate and compare \(R^2\) scores for each model
- Inspect the intercept and coefficients for multi-variable models
- Make predictions with your best-performing model
- Manually verify one prediction using the linear equation
-
- Which combination gives you the highest \(R^2\) score? What does this
- tell you about which features are most important for predicting house
- prices?
+
+ Which combination gives you the highest \(R^2\) score? What does this tell you
+ about which features are most important for predicting house prices?
## Detour: Model workflow
-The workflow you practiced here forms the foundation for all supervised
+The workflow you practiced here forms the foundation for all supervised
learning algorithms in `scikit-learn`:
```python
@@ -452,17 +443,17 @@ y_pred = model.predict(X_test)
score = model.score(X_test, y_test)
```
-This consistent pattern applies to all upcoming chapters, whether you're
+This consistent pattern applies to all upcoming chapters, whether you're
building regression or classification models.
## Recap
-In this chapter, you learned the fundamentals of linear regression through a
+In this chapter, you learned the fundamentals of linear regression through a
practical example. The key takeaways:
-- **Linear regression** models the relationship between input variables and a
+- **Linear regression** models the relationship between input variables and a
target variable using a linear combination. Find the best-fit line by
minimizing the sum of squared residuals.
-- **\(R^2\) score** quantifies how well the model explains variance in the
+- **\(R^2\) score** quantifies how well the model explains variance in the
target variable
- **`scikit-learn` workflow** allows you to easily train and evaluate model
diff --git a/docs/data-science/algorithms/supervised/tree-based/cart.md b/docs/data-science/algorithms/supervised/tree-based/cart.md
index fb53e97d..a44a7101 100644
--- a/docs/data-science/algorithms/supervised/tree-based/cart.md
+++ b/docs/data-science/algorithms/supervised/tree-based/cart.md
@@ -1,15 +1,15 @@
# Decision Tree
-So far we have covered linear regression and logistic regression which are
-limited to linear relationships. In contrast, decision trees are non-linear
-models able to capture complex relationships in the data. They are easy to
+So far we have covered linear regression and logistic regression, which are
+limited to linear relationships. In contrast, decision trees are non-linear
+models able to capture complex relationships in the data. They are easy to
interpret and visualize, making them a popular choice for many applications.
Moreover, decision trees can be used for both regression ^^*and*^^
classification!
In this chapter, we will explore the theory behind decision trees followed by
-practical examples. As always we will use `scikit-learn` for hands-on
+practical examples. As always we will use `scikit-learn` for hands-on
experience.
## Basic intuition
@@ -35,13 +35,12 @@ graph TD
Depending on the answers, you can decide whether to go skiing or not.
A decision tree resembles a flowchart where each internal node represents a
-decision based on a feature (e.g., Is there any snow?), each branch represents
-the outcome of that decision, and each leaf node represents a final
-prediction (either a class label for classification or a continuous value
-for regression).
+decision based on a feature (e.g., Is there any snow?), each branch represents
+the outcome of that decision, and each leaf node represents a final prediction
+(either a class label for classification or a continuous value for regression).
-To get a better understanding of the terms node, branch and leaf, consider
-the illustration of a (rotated) tree.
+To get a better understanding of the terms node, branch and leaf, consider the
+illustration of a (rotated) tree.

@@ -50,9 +49,9 @@ the illustration of a (rotated) tree.
-In the skiing example, the nodes are the questions you ask yourself. With
-branches being a simple binary split (the answers to the question).
-The leaf nodes are the final predictions, in our case whether to go skiing.
+In the skiing example, the nodes are the questions you ask yourself, and the
+branches are simple binary splits (the answers to the questions). The leaf
+nodes are the final predictions, in our case whether to go skiing.
Given the skiing decision tree, what kind of supervised learning task is this?
@@ -77,129 +76,125 @@ which is a classic binary classification task.
???+ info
- This theoretical section on decision trees follows: ^^Christopher M.
- Bishop. 2006. *Pattern Recognition and Machine Learning*[^1]^^
-
- We focus on a particular algorithm called CART
- (=**C**lassification **A**nd **R**egression **T**rees).
- The theoretical foundations of CART were developed by:
- ^^Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone. 1984.
+ This theoretical section on decision trees follows: ^^Christopher M. Bishop.
+ 2006\. *Pattern Recognition and Machine Learning*[^1]^^
+
+ We focus on a particular algorithm called CART (=**C**lassification **A**nd
+ **R**egression **T**rees). The theoretical foundations of CART were developed
+ by: ^^Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone. 1984.
*Classification and Regression Trees*[^2]^^
-
- [^1]:
- Christopher M. Bishop. Pattern Recognition and Machine Learning.
- Springer, 2006. [Link](https://www.microsoft.com/en-us/research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf)
- [^2]:
- Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone.
- Classification and Regression Trees. Chapman and Hall/CRC, 1984.
- [https://doi.org/10.1201/9781315139470](https://doi.org/10.1201/9781315139470)
----
+ [^1]: Christopher M. Bishop. Pattern Recognition and Machine Learning.
+ Springer, 2006.
+ [Link](https://www.microsoft.com/en-us/research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf)
+ [^2]: Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone.
+ Classification and Regression Trees. Chapman and Hall/CRC, 1984.
+ [https://doi.org/10.1201/9781315139470](https://doi.org/10.1201/9781315139470)
+
+______________________________________________________________________
When building a decision tree a couple of questions arise:
-- :fontawesome-solid-question:{ .lg .middle } __Question__
+- :fontawesome-solid-question:{ .lg .middle } __Question__
- ---
+ ______________________________________________________________________
1. How do we pick the right feature for a split?
- 2. What's the decision criteria at each node?
- 3. How large do we grow the tree?
-
+ 1. What's the decision criteria at each node?
+ 1. How large do we grow the tree?
-- :fontawesome-solid-lightbulb:{ .lg .middle } __Intuition__
+- :fontawesome-solid-lightbulb:{ .lg .middle } __Intuition__
- ---
+ ______________________________________________________________________
- 1. Which questions do we ask? Why did we ask "Can I
- get to a skiing resort?" and "Is there any snow?"?
- 2. It does not have to be a simple yes/no question. It can be a
- threshold for continuous values as well. E.g., "Is there more than
- 10cm of fresh snow?" But how do we choose the threshold?
- 3. How many questions do we ask? Why only 2 and not more?
+ 1. Which questions do we ask? Why did we ask "Can I get to a skiing resort?"
+ and "Is there any snow?"?
+ 1. It does not have to be a simple yes/no question. It can be a threshold for
+ continuous values as well. E.g., "Is there more than 10cm of fresh
+ snow?" But how do we choose the threshold?
+ 1. How many questions do we ask? Why only 2 and not more?
-With these questions in mind, let's dive into the theory of decision trees
-in order to tackle them.
+With these questions in mind, let's dive into the theory of decision trees in
+order to tackle them.
----
+______________________________________________________________________
### Greedy optimization
As a decision tree is a supervised learning algorithm, the goal is to predict
the target variable \(y\) with a set of features \(x_1, x_2, ..., x_n\).
-With the data at hand, the CART algorithm finds the optimal tree
-structure that minimizes the prediction error. In turn, the
-optimal tree structure depends on the chosen splits.
+With the data at hand, the CART algorithm searches for a tree structure that
+minimizes the prediction error. In turn, the tree structure depends on the
+chosen splits.
???+ info
-
+
A split in CART is a binary decision rule that divides the dataset into two
subsets based on a specific feature and threshold.
- Imagine if we extend our skiing example with the split "Is there more than
- 10cm of fresh snow?". The split divides the data into two subsets: one
- where observations have more than 10cm of fresh snow and another where
- observations don't. With *amount of fresh snow* being the feature and *10cm*
- the threshold.
+ Imagine we extend our skiing example with the split "Is there more than 10cm
+ of fresh snow?". The split divides the data into two subsets: one where
+ observations have more than 10cm of fresh snow and another where observations
+ don't. Here, *amount of fresh snow* is the feature and *10cm* the threshold.
-However, given large data sets, there are simply too many splitting
-possibilities to consider at once. Hence, the tree is grown in a greedy fashion.
+However, given large data sets, there are simply too many splitting
+possibilities to consider at once. Hence, the tree is grown in a greedy
+fashion.
-The greedy optimization starts with a single root node splitting the data
-into two partitions and adds additional nodes one at a time. At each step, the
+The greedy optimization starts with a single root node splitting the data into
+two partitions and adds additional nodes one at a time. At each step, the
algorithm chooses a split using exhaustive search. The best split is determined
-by a criterion. Remember, that decision trees can deal with regression and
+by a criterion. Remember that decision trees can deal with regression and
classification problems. Hence, the criterion differs for the two tasks.
----
+______________________________________________________________________
#### Regression
-For regression trees, the best split (feature threshold combination) at each
-node is determined by minimizing the *residual sum-of-squares error (RSS)*,
+For regression trees, the best split (feature-threshold combination) at each
+node is determined by minimizing the *residual sum-of-squares error (RSS)*,
defined as:
???+ defi "Residual sum-of-squares (RSS)"
- \[
- RSS = \sum_{i \in t_L} (y_i - \bar{y}_L)^2 + \sum_{i \in t_R} (y_i -
- \bar{y}_R)^2
+ \[
+ RSS = \sum_{i \in t_L} (y_i - \bar{y}_L)^2 + \sum_{i \in t_R} (y_i -
+ \bar{y}_R)^2
\]
where \(t_L\) and \(t_R\) are the left and right child nodes after the split,
and \(\bar{y}_L\) and \(\bar{y}_R\) are the mean target values in the
respective nodes.
-The algorithm searches through all possible splits to find the one that
+The algorithm searches through all possible splits to find the one that
minimizes this RSS criterion.
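
The exhaustive search can be sketched for a single feature as follows. This is a simplified illustration, not the `scikit-learn` implementation: the helper `best_split_rss`, the toy data, and the choice of midpoints between neighboring values as candidate thresholds are all assumptions made for this example. A full CART would repeat the search over every feature.

```python
import numpy as np


def best_split_rss(x, y):
    """Exhaustively search one feature for the threshold with the lowest RSS."""
    best_threshold, best_rss = None, np.inf
    values = np.unique(x)  # sorted unique feature values
    # Candidate thresholds (assumption): midpoints between neighboring values
    for threshold in (values[:-1] + values[1:]) / 2:
        left, right = y[x <= threshold], y[x > threshold]
        # RSS of the split: squared deviations from the mean in each child node
        rss = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if rss < best_rss:
            best_threshold, best_rss = threshold, rss
    return best_threshold, best_rss


# Made-up toy data: the target jumps once x exceeds 3
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.0, 1.2, 0.8, 5.0, 5.2, 4.8])

threshold, rss = best_split_rss(x, y)
print(f"Best threshold: {threshold}, RSS: {round(rss, 2)}")
```

The search settles on the threshold that separates the low target values from the high ones, because that split leaves the least variation within each child node.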
???+ info
- Since each split separates the input data into two partitions, the
- prediction is the mean of the target variable \(y\) in the respective
- partition.
-
- Hence, intuitively speaking, we do not optimize the entire tree at once
- but rather optimize each split locally.
+ Since each split separates the input data into two partitions, the prediction
+ is the mean of the target variable \(y\) in the respective partition.
+
+ Hence, intuitively speaking, we do not optimize the entire tree at once but
+ rather optimize each split locally.
#### Classification
-For classification tasks, the best split at each node is determined by minimizing
-the *Gini impurity*.
+For classification tasks, the best split at each node is determined by
+minimizing the *Gini impurity*.
???+ defi "Gini impurity"
For a node \(t\) with \(K\) classes, the Gini impurity is defined as:
\[
- Gini(t) = \sum_{k=1}^K p_{k}(1-p_{k}) = 1 - \sum_{k=1}^K p_{k}^2
+ Gini(t) = \sum_{k=1}^K p_{k}(1-p_{k}) = 1 - \sum_{k=1}^K p_{k}^2
\]
-
+
where \(p_k\) is the proportion of class \(k\) observations.
The Gini impurity (sometimes referred to as Gini index) encourages leaf nodes
@@ -207,64 +202,63 @@ where the majority of observations belong to a single class.
???+ info
- The prediction at each leaf node is the majority class among the training
+ The prediction at each leaf node is the majority class among the training
observations in that node.
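
As a minimal sketch, the Gini impurity of a node can be computed directly from its class labels. The helper names `gini` and `weighted_gini` and the labels are made up for this illustration. Weighting each child node by its share of observations is the common way to score a split, but treat it as an assumption here rather than a description of any particular implementation.

```python
import numpy as np


def gini(labels):
    """Gini impurity of a node, given the class labels of its observations."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()  # class proportions p_k
    return 1 - np.sum(p**2)


def weighted_gini(left, right):
    """Impurity of a split: child impurities weighted by their share of samples."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Made-up labels for illustration
pure = np.array([1, 1, 1, 1])  # all observations belong to one class
mixed = np.array([0, 0, 1, 1])  # evenly mixed two-class node

print(f"Pure node:  {gini(pure)}")
print(f"Mixed node: {gini(mixed)}")
print(f"Split:      {weighted_gini(np.array([0, 0, 0, 1]), np.array([1, 1, 1, 1]))}")
```

A pure node has an impurity of 0, while an evenly mixed two-class node reaches the two-class maximum of 0.5.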
----
+______________________________________________________________________
#### TLDR
-No matter the task (regression or classification), with a greedy optimization
-strategy, the CART algorithm searches for the best split using an exhaustive
+No matter the task (regression or classification), with a greedy optimization
+strategy, the CART algorithm searches for the best split using an exhaustive
search at each node to ultimately minimize the prediction error. Thus answering
-the first two questions, *a* (How do we pick the right feature for a split?)
+the first two questions, *a* (How do we pick the right feature for a split?)
and *b* (What's the decision criteria at each node?).
-A CART can be seen as a piecewise-constant model, as it partitions the feature
-space into regions and assigns a constant prediction (either the mean of a
+A CART can be seen as a piecewise-constant model, as it partitions the feature
+space into regions and assigns a constant prediction (either the mean of a
continuous value or a label) to each region.
### Tree size
-Lastly, we answer question, *c* (How large do we grow the tree?).
-Put differently, when should we stop adding nodes?
+Lastly, we answer question *c* (How large do we grow the tree?). Put
+differently, when should we stop adding nodes?
-First, the tree is grown as large as possible until a stopping criterion is
-met. This criterion can be the maximum tree depth or a minimum number of
-observations per leaf. Second, the tree is pruned back. Pruning is the process
-of removing nodes that do not improve the model's performance. It balances the
+First, the tree is grown as large as possible until a stopping criterion is
+met. This criterion can be the maximum tree depth or a minimum number of
+observations per leaf. Second, the tree is pruned back. Pruning is the process
+of removing nodes that do not improve the model's performance. It balances the
RSS error or Gini impurity against model complexity.
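
For a first impression of pruning in code, `scikit-learn` offers minimal cost-complexity pruning through the `ccp_alpha` parameter. The sketch below uses synthetic data and picks a pruning strength from the middle of the candidate list. Both choices are arbitrary and only serve as an illustration.

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data for illustration only
X, y = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=42)

# Fully grown tree (no pruning, ccp_alpha defaults to 0)
full_tree = DecisionTreeRegressor(random_state=42).fit(X, y)

# Candidate pruning strengths derived from the training data
path = full_tree.cost_complexity_pruning_path(X, y)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]  # arbitrary mid-range choice

# Larger ccp_alpha values remove more nodes
pruned_tree = DecisionTreeRegressor(random_state=42, ccp_alpha=alpha).fit(X, y)

print(f"Leaves (full tree):   {full_tree.get_n_leaves()}")
print(f"Leaves (pruned tree): {pruned_tree.get_n_leaves()}")
```

In practice, the pruning strength would be chosen by comparing performance on held-out data rather than by position in the list.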
???+ info
- If you want to dive deeper into tree pruning, we recommend reading page 665
- of Bishop's book *Pattern Recognition and Machine Learning*[^1]
+ If you want to dive deeper into tree pruning, we recommend reading page 665 of
+ Bishop's book *Pattern Recognition and Machine Learning*[^1]
----
+______________________________________________________________________
## Advantages and Limitations
-Decision trees offer several significant advantages, but they also have their
+Decision trees offer several significant advantages, but they also have their
limitations:
-- :fontawesome-regular-thumbs-up:{ .lg .middle } __Advantages__
+- :fontawesome-regular-thumbs-up:{ .lg .middle } __Advantages__
- ---
+ ______________________________________________________________________
- Easy to interpret and visualize
- Can capture non-linear relationships
+- :fontawesome-regular-thumbs-down:{ .lg .middle } __Limitations__
-- :fontawesome-regular-thumbs-down:{ .lg .middle } __Limitations__
-
- ---
+ ______________________________________________________________________
- - Prone to overfitting, i.e., building a model that perfectly fits the
- training data but fails to generalize on new (unseen) data.
- - Sensitive to data, i.e., small changes in the data can lead to
- significantly different trees.
+ - Prone to overfitting, i.e., building a model that perfectly fits the
+ training data but fails to generalize on new (unseen) data.
+ - Sensitive to data, i.e., small changes in the data can lead to
+ significantly different trees.
@@ -276,14 +270,14 @@ As mentioned earlier, we will use `scikit-learn` for hands-on experience.
[^3]:
`scikit-learn` documentation: [Decision Trees](https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart)
-Functionalities around decision trees are all part of the
+Functionalities around decision trees are all part of the
[`tree` module](https://scikit-learn.org/stable/api/sklearn.tree.html) in
`scikit-learn`.
### Regression
-First, we start with a regression task. We will use the California housing
-data to predict house prices using a decision tree regressor.
+First, we start with a regression task. We will use the California housing data
+to predict house prices using a decision tree regressor.
#### Load data
@@ -303,7 +297,7 @@ X_train, X_test, y_train, y_test = train_test_split(
)
```
-As always, a seed is set for reproducibility (`#!python random_state=42`). It
+As always, a seed is set for reproducibility (`#!python random_state=42`). It
can be any integer; you can simply pick any number.
#### Fit and evaluate the model
@@ -330,17 +324,17 @@ print(f"Model performance (R²): {round(score, 2)}")
Model performance (R²): 0.61
```
-The `score()` method returns the coefficient of determination \(R^2\).
-You should be already familiar with \(R^2\), as it was first introduced
-in the [Regression chapter](../regression.md#coefficient-of-determination) to
-evaluate the fit of a linear regression.
+The `score()` method returns the coefficient of determination \(R^2\). You
+should be already familiar with \(R^2\), as it was first introduced in the
+[Regression chapter](../regression.md#coefficient-of-determination) to evaluate
+the fit of a linear regression.
-The decision tree model achieved an \(R^2\) of 0.61 on the test set, which
+The decision tree model achieved an \(R^2\) of 0.61 on the test set, which
leaves room for improvement.
???+ info
- On a side note: Although we fitted a decision tree on `#!python 16512`
+ On a side note: Although we fitted a decision tree on `#!python 16512`
observations, the process of actually training the model is quite fast!
#### Plot the tree
@@ -352,8 +346,8 @@ We can easily visualize the tree using the `plot_tree` function.
???+ tip
- This is the first time that we discourage you from running the code
- snippet below. Soon you will know why.
+ This is the first time that we discourage you from running the code snippet
+ below. Soon you will know why.
```python
import matplotlib.pyplot as plt
@@ -369,14 +363,14 @@ plt.show() # use matplotlib to show the plot
-Though we can't read any of the information present, the plot hints at a huge
+Though we can't read any of the information present, the plot hints at a huge
tree. Due to its complexity, the model does not add much value to the
understanding of the data (it's simply not interpretable).
-Actually visualizing this particular tree takes some time, hence we
-discouraged you from executing the code.
+Actually visualizing this particular tree takes some time, hence we discouraged
+you from executing the code.
-But why do we get such a huge tree? By default, the CART implementation in
+But why do we get such a huge tree? By default, the CART implementation in
`scikit-learn` grows the tree as large as possible and does *not* prune it.
##### ... to fix
@@ -395,9 +389,9 @@ model = DecisionTreeRegressor(
model.fit(X_train, y_train)
```
-The `max_depth` parameter limits the depth of the tree, while `min_samples_leaf`
-sets the minimum number of samples (observations) required to be in a leaf
-node. Both prevent the tree from growing too large.
+The `max_depth` parameter limits the depth of the tree, while
+`min_samples_leaf` sets the minimum number of samples (observations) required
+to be in a leaf node. Both prevent the tree from growing too large.
???+ info
@@ -412,28 +406,28 @@ import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
plot_tree(
- model,
- filled=True, # (1)!
+ model,
+ filled=True, # (1)!
feature_names=X.columns, # (2)!
- proportion=True # (3)!
+ proportion=True, # (3)!
)
plt.show()
```
-1. `#!python filled=True` colors nodes according to prediction values.
- A stronger color indicating a higher value.
-2. The parameter `feature_names` is used to label the features in the tree.
-3. `proportion=True` displays the proportion of samples in each node.
+1. `#!python filled=True` colors nodes according to prediction values. A
+ stronger color indicates a higher value.
+1. The parameter `feature_names` is used to label the features in the tree.
+1. `proportion=True` displays the proportion of samples in each node.
???+ info
-
- Generally, it is always good practice to consult the documentation, if
- you are unsure about the usage of a function/class.
- Regarding `plot_tree()`, you might find some useful information in the
+ Generally, it is always good practice to consult the documentation if you are
+ unsure about the usage of a function/class.
+
+ Regarding `plot_tree()`, you might find some useful information in the
[docs](https://scikit-learn.org/stable/modules/generated/sklearn.tree.plot_tree.html)
- that can help you customize the plot to your liking.
- So don't shy away from reading the documentation!
+ that can help you customize the plot to your liking. So don't shy away from
+ reading the documentation!

@@ -446,25 +440,23 @@ plt.show()
???+ tip
The nodes are quite easy to read:
-
- Starting with the root node, the feature `MedInc` performs
- the first split. If the median income is less than 5.086, we follow the
- left branch else the right branch. The resulting `squared_error` of the
- split is shown as well. At the root node, the `squared_error` (sum of the
- squared differences between the actual values and the predicted value)
- is 1.337. The lower the `squared_error`, the better the split. A "perfect
- split" would result in a `squared_error` of 0.
-
- The root node splits the data into two subsets, the left branch results
- in a subest containing 79.3% of the training data and the right branch
- 20.7%. Compared to the root node, both additional splits lead to a
- decrease of the `squared_error` and thus increase the predictive power.
- After two more splits, we reach the leaf nodes. Each leaf node contains
- a value, the final prediction.
-
-Now we have a pruned tree, which reduced the risk of overfitting. However, at
-the cost of model performance. The \(R^2\) decreased from 0.61 to 0.42 which
-might indicate that such a simple tree might not capture the complexity of the
+
+ Starting with the root node, the feature `MedInc` performs the first split. If
+ the median income is less than or equal to 5.086, we follow the left branch,
+ otherwise the right branch. Each node also shows its `squared_error`, the mean
+ of the squared differences between the actual values and the node's predicted
+ value. At the root node, it is 1.337. The lower the `squared_error` of the
+ resulting nodes, the better the split. A "perfect split" would result in a
+ `squared_error` of 0.
+
+ The root node splits the data into two subsets: the left branch results in a
+ subset containing 79.3% of the training data and the right branch 20.7%.
+ Compared to the root node, both additional splits lead to a decrease of the
+ `squared_error` and thus increase the predictive power. After two more splits,
+ we reach the leaf nodes. Each leaf node contains a value, the final prediction.
+
+Now we have a pruned tree, which reduces the risk of overfitting, but at the
+cost of model performance. The \(R^2\) decreased from 0.61 to 0.42, which
+indicates that such a simple tree might not capture the complexity of the
data well.
@@ -476,26 +468,26 @@ data well.
In practice, you have to find the right parameters to balance model complexity
-and performance. Unfortunately, there is no one-size-fits-all solution. You
+and performance. Unfortunately, there is no one-size-fits-all solution. You
have to tune the parameters based on the data and the task at hand.
???+ question "Parameter tuning"
- Try some different combinations of `max_depth` and `min_samples_leaf`.
- Use the same train test split, we defined earlier.
-
+ Try some different combinations of `max_depth` and `min_samples_leaf`. Use the
+ same train-test split we defined earlier.
+
1. Manually change the values.
- 2. Fit the model.
- 3. Evaluate the model.
- 4. Plot the model.
- 5. Repeat! :repeat:
+ 1. Fit the model.
+ 1. Evaluate the model.
+ 1. Plot the model.
+ 1. Repeat! :repeat:
Can you get an \(R^2\) higher than `#!python 0.7`?
### Classification
-Next, we switch to a classification task. We will re-use the breast cancer
-data set introduced in the previous Classification chapter.
+Next, we switch to a classification task. We will re-use the breast cancer data
+set introduced in the previous Classification chapter.
#### Load data
@@ -510,7 +502,7 @@ X_train, X_test, y_train, y_test = train_test_split(
#### Fit and evaluate the model
-For classification trees, `scikit-learn` provides the class
+For classification trees, `scikit-learn` provides the class
`DecisionTreeClassifier`.
```python hl_lines="1"
@@ -518,21 +510,23 @@ from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(
# again, set max_depth and min_samples_leaf to prevent growing a huge tree
- random_state=784, max_depth=7, min_samples_leaf=5
+ random_state=784,
+ max_depth=7,
+ min_samples_leaf=5,
)
```
???+ question "Fit and evaluate the model"
Now it is your turn to fit and evaluate the model. Although you have never
- used an instance of `DecisionClassifier` before, you can use the same
- methods as with other models in `scikit-learn`. Simply refer to the
- previous regression example.
-
+ used an instance of `DecisionTreeClassifier` before, you can use the same
+ methods as with other models in `scikit-learn`. Simply refer to the previous
+ regression example.
+
1. Fit the model on `X_train` and `y_train`.
- 2. Evaluate the model on `X_test` and `y_test`.
- 3. Print the model's performance.
- 4. Plot the tree.
+ 1. Evaluate the model on `X_test` and `y_test`.
+ 1. Print the model's performance.
+ 1. Plot the tree.
Lastly, answer the following quiz question to evaluate your result.
@@ -550,14 +544,14 @@ from the logistic regression.
## Recap
-We comprehensively explored decision trees, focusing on the CART algorithm.
-The theory section illuminated its core mechanisms, while practical
-examples demonstrated building and evaluating decision trees for regression and
+We comprehensively explored decision trees, focusing on the CART algorithm. The
+theory section illuminated its core mechanisms, while practical examples
+demonstrated building and evaluating decision trees for regression and
classification tasks. Key takeaways include:
- Algorithm insights into tree construction
-- Practical implementation skills
+- Practical implementation skills
- Understanding of decision trees' interpretability and overfitting risks
-Next, we'll extend our knowledge to Random Forests, an ensemble method
+Next, we'll extend our knowledge to Random Forests, an ensemble method
combining multiple decision trees to enhance predictive performance.
diff --git a/docs/data-science/algorithms/supervised/tree-based/forest.md b/docs/data-science/algorithms/supervised/tree-based/forest.md
index 4e270cfa..2afe8c3f 100644
--- a/docs/data-science/algorithms/supervised/tree-based/forest.md
+++ b/docs/data-science/algorithms/supervised/tree-based/forest.md
@@ -12,65 +12,65 @@ CART (Classification and Regression Trees) algorithm, we can dive right in.
???+ info
- Random forests were introduced by Leo Breiman in 2001. The following
- section closely follows the original paper.
+ Random forests were introduced by Leo Breiman in 2001. The following section
+ closely follows the original paper.
^^Breiman, L. Random Forests. *Machine Learning 45*, 5–32 (2001).^^
[https://doi.org/10.1023/A:1010933404324](https://doi.org/10.1023/A:1010933404324)
A random forest combines multiple decision trees to create an ensemble model.
-The idea is to grow multiple trees and average their predictions. Thus,
+The idea is to grow multiple trees and average their predictions, which
results in a more robust model that improves generalization and reduces
overfitting.
The randomness in a random forest stems from two techniques:
1. Bootstrap sampling
-2. Random feature selection
+1. Random feature selection
### Bootstrap sampling
-The first technique is known as **bootstrap sampling**. Given a
-training set of size $N$, we draw $N$ samples ==with replacement==. This means
-that some samples may be repeated, while others may not be included at all.
-This results in a new training set of the same size as the original, but with
-some samples missing and others duplicated.
+The first technique is known as **bootstrap sampling**. Given a training set of
+size $N$, we draw $N$ samples ==with replacement==. This means that some
+samples may be repeated, while others may not be included at all. This results
+in a new training set of the same size as the original, but with some samples
+missing and others duplicated.
-Each tree is fit on a different bootstrap sample. Intuitively speaking, this
+Each tree is fit on a different bootstrap sample. Intuitively speaking, this
means that each tree sees a slightly different "version" of the training data.
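
A bootstrap sample is easy to sketch with `numpy`. The example below draws row indices instead of actual observations. The tiny size of 10 and the seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

N = 10
indices = np.arange(N)  # stand-in for the row indices of a training set

# Draw N indices with replacement: one bootstrap sample
bootstrap = rng.choice(indices, size=N, replace=True)
left_out = np.setdiff1d(indices, bootstrap)  # rows this sample does not contain

print(f"Bootstrap sample: {np.sort(bootstrap)}")
print(f"Left out:         {left_out}")
```

Repeating the draw with a different seed gives a different sample, which is why each tree ends up with its own view of the training data.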
### Random feature selection
-The second technique is **random feature selection**.
-Remember, that a CART is grown by selecting the best split at each node.
-This is done by considering all features. Contrary when growing trees for a
-random forest, we only consider a random subset of features at each split.
+The second technique is **random feature selection**. Remember that a CART is
+grown by selecting the best split at each node. This is done by considering all
+features. In contrast, when growing trees for a random forest, we only consider
+a random subset of features at each split.
----
+______________________________________________________________________
### Putting it all together
Each tree in a random forest is fit on a bootstrap sample and uses a random
-subset of features at each split.
-In case of regression, the predictions of all trees are simply averaged. In
-case of classification, the majority vote is taken. The majority vote in a
-random forest classification means that the class predicted most frequently by
-the individual trees is selected as the final prediction.
-
-No matter the task, classification or regression: it was observed that
-introducing randomness in the tree-growing process improves the model
+subset of features at each split. In case of regression, the predictions of all
+trees are simply averaged. In case of classification, the majority vote is
+taken. The majority vote in a random forest classification means that the class
+predicted most frequently by the individual trees is selected as the final
+prediction.
+
+No matter the task, classification or regression: it was observed that
+introducing randomness in the tree-growing process improves the model
performance.
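
The two aggregation rules can be sketched with plain NumPy. The per-tree predictions below are made-up numbers for illustration only; they do not come from a fitted model.

```python
import numpy as np

# Hypothetical predictions of 5 trees for 3 samples (rows: trees, columns: samples)
reg_preds = np.array([
    [2.0, 3.5, 1.0],
    [2.4, 3.0, 1.2],
    [1.8, 3.6, 0.9],
    [2.2, 3.4, 1.1],
    [2.1, 3.5, 0.8],
])
# Regression: average the tree predictions per sample
forest_reg = reg_preds.mean(axis=0)

clf_preds = np.array([
    [0, 1, 1],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 0],
    [0, 1, 1],
])
# Classification: most frequent class per sample (majority vote)
forest_clf = np.array([np.bincount(col).argmax() for col in clf_preds.T])

print(forest_reg)
print(forest_clf)
```

Note that `scikit-learn`'s `RandomForestClassifier` deviates slightly from this description: according to its documentation, it averages the class probabilities predicted by the trees instead of counting hard votes.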
???+ info
- Contrary to the classic CART, random forests do not constrain the tree
- growth. I.e., trees are fully grown and not pruned.
+ Contrary to the classic CART, random forests do not constrain the tree growth.
+ I.e., trees are fully grown and not pruned.
## Examples
-With a basic understanding of random forests we take a look at some
-examples. As always, we'll use our favorite machine learning package
-`scikit-learn` (at least that of the author :wink:).
+With a basic understanding of random forests we take a look at some examples.
+As always, we'll use our favorite machine learning package `scikit-learn` (at
+least that of the author :wink:).
In order to focus on the random forest implementation and its parameters, we'll
reuse the California housing data (for regression) and the breast cancer data
@@ -82,8 +82,8 @@ Let's start with building a random forest to predict California housing prices.
#### Load data
-As usual, we load the data and split it into a training and test set in
-order to evaluate the model later on.
+As usual, we load the data and split it into a training and test set in order
+to evaluate the model later on.
```python
from sklearn.datasets import fetch_california_housing
@@ -98,8 +98,8 @@ X_train, X_test, y_train, y_test = train_test_split(
#### Fit the model
-Just like with decision trees, `scikit-learn` provides two separate classes
-for regression and classification, namely `RandomForestRegressor` and
+Just like with decision trees, `scikit-learn` provides two separate classes for
+regression and classification, namely `RandomForestRegressor` and
`RandomForestClassifier`. Both are part of the `ensemble` module.
```python
@@ -109,8 +109,8 @@ model = RandomForestRegressor(random_state=784) # (1)!
model.fit(X_train, y_train)
```
-1. As a random forest is well random :sweat_smile:, we set the
- `random_state` to ensure the reproducibility of our results.
+1. As a random forest is, well, random :sweat_smile:, we set the `random_state`
+ to ensure the reproducibility of our results.
Depending on your setup, the fitting process might take a couple of seconds.
@@ -127,18 +127,18 @@ Model performance (R²): 0.81
???+ info
- Remember, that the `score()` method of a decision tree regressor
- (`DecisionTreeRegressor`) returned the coefficient of determination
- \(R^2\). The same applies to random forests regressors.
+ Remember that the `score()` method of a decision tree regressor
+ (`DecisionTreeRegressor`) returned the coefficient of determination \(R^2\).
+ The same applies to random forest regressors.
Compared to a single tree with an \(R^2\) of 0.61, the random forest performs
-considerably better with an \(R^2\) of 0.81. You can re-visit the according
+considerably better with an \(R^2\) of 0.81. You can revisit the corresponding
section [here](cart.md#fit-and-evaluate-the-model).
???+ question "How many trees are in the forest?"
-
- Consult the `scikit-learn` docs to find out how many trees are in the
- forest by default. Use the following question for self-assessment.
+
+ Consult the `scikit-learn` docs to find out how many trees are in the forest by
+ default. Use the following question for self-assessment.
How many trees form a forest by default?
@@ -152,16 +152,13 @@ The parameter `n_estimators` defaults to 100 trees.
???+ info
- If you want to get closer to the original definition of a random forest
- regressor by Breiman, you have to set the `max_features` parameter.
- Specifically, with \(m\) features, the number of features considered at
- each split should be \(\frac{m}{3}\) for regression.
+ If you want to get closer to the original definition of a random forest
+ regressor by Breiman, you have to set the `max_features` parameter.
+ Specifically, with \(m\) features, the number of features considered at each
+ split should be \(\frac{m}{3}\) for regression.
```python hl_lines="2"
- RandomForestRegressor(
- max_features=len(X_train.columns) // 3,
- random_state=784
- )
+ RandomForestRegressor(max_features=len(X_train.columns) // 3, random_state=784)
```
By default, `scikit-learn` considers \(m\) features for each split.
@@ -169,9 +166,9 @@ The parameter `n_estimators` defaults to 100 trees.
???+ tip
If you're unsure how to set parameters of a model (such as `max_features`),
- stick to the defaults. `scikit-learn` provides sensible defaults
- that work well. In later chapters, we will explore methods to
- automatically tune these hyperparameters.
+ stick to the defaults. `scikit-learn` provides sensible defaults that work
+ well. In later chapters, we will explore methods to automatically tune these
+ hyperparameters.
### Classification
@@ -180,14 +177,14 @@ Next, we switch to a classification task.
???+ question
Load the breast cancer data, fit and evaluate a random forest.
-
+
1. Load the data and split it into a training and test set.
- 2. Load the appropriate random forest class.
- 3. Fit the model.
- 4. Evaluate the model on the test set.
+ 1. Load the appropriate random forest class.
+ 1. Fit the model.
+ 1. Evaluate the model on the test set.
- Hint: This and the previous chapter should provide all necessary
- information, to solve the tasks.
+ Hint: This and the previous chapter should provide all necessary information
+ to solve the tasks.
#### Inspecting the forest
@@ -211,24 +208,20 @@ print(model.estimators_) # (1)!
`estimators_` is a list of individual tree instances. If you're dealing with a
`RandomForestRegressor`, `estimators_` is a list of `DecisionTreeRegressor`.
-In most cases, you won't need to inspect the individual trees. Nevertheless,
-we can utilize this information to solidify our understanding of random
-forests.
+In most cases, you won't need to inspect the individual trees. Nevertheless, we
+can utilize this information to solidify our understanding of random forests.
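
As a small check of our understanding, the following sketch averages the predictions of the individual trees by hand and compares the result with the forest's own prediction. It uses a synthetic data set from `make_regression`; the names `X_demo`, `y_demo` and `forest` are assumptions made for this illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data, only for illustration
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

forest = RandomForestRegressor(n_estimators=10, random_state=0)
forest.fit(X_demo, y_demo)

# Predict the first 5 samples with every single tree ...
tree_preds = np.array([tree.predict(X_demo[:5]) for tree in forest.estimators_])
# ... and average over the trees
manual_average = tree_preds.mean(axis=0)

# Compare the manual average with the forest's prediction
print(np.allclose(manual_average, forest.predict(X_demo[:5])))
```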
----
+______________________________________________________________________
### Stronger together
-We fit a random forest classifier on a synthetic data set to
-==literally== illustrate the different trees. First, we generate the data.
+We fit a random forest classifier on a synthetic data set to ==literally==
+illustrate the different trees. First, we generate the data.
```python
from sklearn.datasets import make_classification
-X, y = make_classification(
- random_state=42,
- n_clusters_per_class=1
-)
+X, y = make_classification(random_state=42, n_clusters_per_class=1)
```
Next, we initialize and fit a random forest classifier.
@@ -240,13 +233,13 @@ classifier = RandomForestClassifier(
classifier.fit(X, y)
```
-Note, that we set the number of trees to `#!python 4`. We keep the number
-small as we visualize them later on. The `max_depth` parameter limits the
-depth of each tree to `#!python 3`. This is done to perform pruning and thus
-keep the trees simple and easier to plot.
+Note that we set the number of trees to `#!python 4`. We keep the number small
+as we visualize them later on. The `max_depth` parameter limits the depth of
+each tree to `#!python 3`. Limiting the depth constrains tree growth (a form of
+pre-pruning) and thus keeps the trees simple and easier to plot.
Finally, we visualize all trees. We access the trees via the `estimators_`
-attribute and plot them using the familiar `plot_tree()` function. Everything
+attribute and plot them using the familiar `plot_tree()` function. Everything
else is just plot customization.
```python hl_lines="5 7"
@@ -276,26 +269,26 @@ plt.show()
-Although there is a lot of information cramped inside one figure, at first
-glance it is obvious that all four trees are different. Each of them differs
-in splits (feature and threshold), number of nodes and predictions.
+Although there is a lot of information crammed into one figure, at first
+glance it is obvious that all four trees are different. Each of them differs in
+splits (feature and threshold), number of nodes and predictions.
Each one of these trees on their own might not generalize well, hence they are
-often referred to as weak learners. However, when combined, they form a
+often referred to as weak learners. However, when combined, they form a
"strong" model. That's the essence of an ensemble method!
### Feature importance
-One of the most powerful attribute of random forests is their ability to
-assess feature importance: measuring how much each input variable contributes
-to predicting the target variable.
+One of the most powerful attributes of random forests is their ability to
+assess feature importance: measuring how much each input variable contributes
+to predicting the target variable.
-Remember that trees are fitted on a [bootstrap](forest.md#bootstrap-sampling)
-training set. Since some samples are left out during this process, we can use
-these to measure the importance of each feature. These unused observations are
-called "out-of-bag" (OOB) samples. For each feature, the OOB samples are
-randomly permuted (shuffled) and the increase in prediction error is measured.
-Features that lead to larger increases in error when permuted are considered
+Remember that trees are fitted on a [bootstrap](forest.md#bootstrap-sampling)
+training set. Since some samples are left out during this process, we can use
+these to measure the importance of each feature. These unused observations are
+called "out-of-bag" (OOB) samples. For each feature, the OOB samples are
+randomly permuted (shuffled) and the increase in prediction error is measured.
+Features that lead to larger increases in error when permuted are considered
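more important.
+
+Note that the `feature_importances_` attribute used below is computed
+differently in `scikit-learn`: it is impurity-based, i.e., it reflects how much
+the splits on a feature reduce the impurity, averaged over all trees.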
Let's examine feature importance using the breast cancer dataset:
@@ -318,27 +311,26 @@ print(rf.feature_importances_)
To keep the example concise, we did not perform a train test split.
-Feature importance values are a `#!python list` of `#!python float`s.
-Each value corresponds to a feature in the order they were passed to the
-model. The values are normalized and sum to `#!python 1.0`.
-A higher value indicates that the feature contributes more to making correct
-predictions.
+Feature importance values are returned as a NumPy array of `#!python float`s.
+Each value corresponds to a feature in the order they were passed to the model.
+The values are normalized and sum to `#!python 1.0`. A higher value indicates
+that the feature contributes more to making correct predictions.
Feature importance can help with:
1. Feature selection: Identifying which features are most relevant for
- predictions
-2. Model interpretation: Understanding which features drive the model's
- decisions
-3. Data collection: Guiding future data collection efforts by highlighting
- important measurements
+ predictions
+1. Model interpretation: Understanding which features drive the model's
+ decisions
+1. Data collection: Guiding future data collection efforts by highlighting
+ important measurements
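
If you want to try the permutation idea yourself, `scikit-learn` offers `permutation_importance` in its `inspection` module. The sketch below is one possible setup and rests on a few assumptions: it shuffles features on a held-out test set (not on the OOB samples), and the names `rf_demo`, `X_tr`, `X_te` as well as the number of trees and repeats are arbitrary choices for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X_bc, y_bc = load_breast_cancer(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_bc, y_bc, random_state=42)

rf_demo = RandomForestClassifier(n_estimators=50, random_state=42)
rf_demo.fit(X_tr, y_tr)

# Shuffle each feature 5 times and record the drop in accuracy on the test set
result = permutation_importance(
    rf_demo, X_te, y_te, n_repeats=5, random_state=42
)

# Indices of the 5 features with the largest mean importance
ranking = result.importances_mean.argsort()[::-1][:5]
for idx in ranking:
    print(f"{X_bc.columns[idx]}: {result.importances_mean[idx]:.3f}")
```

Unlike the values in `feature_importances_`, permutation importances are not normalized and can even be slightly negative for uninformative features.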
???+ question "Visualize the feature importance"
- Generate a bar plot to visualize the feature importance.
- Use any package of your choice. For convenience, you can use the
- following code snippet to get started.
-
+ Generate a bar plot to visualize the feature importance. Use any package of
+ your choice. For convenience, you can use the following code snippet to get
+ started.
+
```python
import pandas as pd
@@ -367,6 +359,6 @@ sensitivity to data changes. While slightly less interpretable than single
trees, random forests provide better generalization, more robust predictions,
and useful insights through feature importance measures.
-With `scikit-learn`, you are now able to build a random forest for regression
+With `scikit-learn`, you are now able to build a random forest for regression
and classification tasks. You have also learned how to inspect individual trees
and assess feature importance.
diff --git a/docs/data-science/algorithms/unsupervised/clustering.md b/docs/data-science/algorithms/unsupervised/clustering.md
index 1e006a7d..789e300b 100644
--- a/docs/data-science/algorithms/unsupervised/clustering.md
+++ b/docs/data-science/algorithms/unsupervised/clustering.md
@@ -1,16 +1,16 @@
# Clustering
-In this section, we will start to explore unsupervised learning, where we work
-with data that isn't accompanied by labels. One of the primary techniques
-within this realm is clustering, which aims to uncover patterns or structures
-in the data by grouping similar data points together. A popular method for
-achieving this is k-means clustering, which aims to identify clusters of
+In this section, we will start to explore unsupervised learning, where we work
+with data that isn't accompanied by labels. One of the primary techniques
+within this realm is clustering, which aims to uncover patterns or structures
+in the data by grouping similar data points together. A popular method for
+achieving this is k-means clustering, which aims to identify clusters of
similar observations.
## K-means
-K-means was briefly introduced in the [Introduction](../index.md#example_1) to
-Supervised vs. Unsupervised Learning and used to segment customers based on
+K-means was briefly introduced in the [Introduction](../index.md#example_1) to
+Supervised vs. Unsupervised Learning and used to segment customers based on
their annual spending and average basket size.
@@ -22,37 +22,36 @@ their annual spending and average basket size.
The algorithm groups similar data points together based on their attributes
-without being told what these groups should be.
+without being told what these groups should be.
To get a better understanding of k-means, we will explore the theory behind it
-and employ the algorithm to cluster data from Spotify and a semiconductor
+and employ the algorithm to cluster data from Spotify and a semiconductor
manufacturer.
### Theory
???+ info
- The theoretical part is adapted from:
- ^^Christopher M. Bishop. 2006. *Pattern Recognition and Machine
- Learning*[^1]^^
+ The theoretical part is adapted from: ^^Christopher M. Bishop. 2006. *Pattern
+ Recognition and Machine Learning*[^1]^^
- [^1]:
- Christopher M. Bishop. Pattern Recognition and Machine Learning.
- Springer, 2006. [Link](https://www.microsoft.com/en-us/research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf)
+ [^1]: Christopher M. Bishop. Pattern Recognition and Machine Learning.
+ Springer, 2006.
+ [Link](https://www.microsoft.com/en-us/research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf)
Assume a set of data points \(x_1, x_2, ..., x_N\). K-means partitions the data
-into \(K\) number of clusters. Each cluster is represented by \(\mu_k\),
-which can be seen as the center of a cluster \(k\).
+into \(K\) number of clusters. Each cluster is represented by \(\mu_k\), which
+can be seen as the center of a cluster \(k\).
-Intuitively speaking, the goal is to assign each data point \(x_n\) to the
-cluster with the closest center \(\mu_k\).
+Intuitively speaking, the goal is to assign each data point \(x_n\) to the
+cluster with the closest center \(\mu_k\).
#### The objective
Since the optimal assignment of data points to specific clusters is not known,
-the objective is to minimize the sum of squared distances between data
-points and their assigned cluster centers.
-This is known as the **distortion measure**:
+the objective is to minimize the sum of squared distances between data points
+and their assigned cluster centers. This is known as the **distortion
+measure**:
???+ defi "Distortion measure"
@@ -61,33 +60,33 @@ This is known as the **distortion measure**:
\]
where:
-
+
- \(N\) is the number of data points,
- \(K\) being the number of clusters,
- - \(r_{nk}\) is a binary indicator of whether data point \(x_n\) is
- assigned to cluster \(k\),
+ - \(r_{nk}\) is a binary indicator of whether data point \(x_n\) is assigned to
+ cluster \(k\),
- \(\mu_k\) representing the cluster center.
-In short, we want to find the optimal \(r_{nk}\) and \(\mu_k\) that minimize
+In short, we want to find the optimal \(r_{nk}\) and \(\mu_k\) that minimize
the distortion measure \(J\).
-\(J\) is minimized in an iterative process. First, we initialize \(\mu_k\)
-with some random values. Then we alternate between two steps:
+\(J\) is minimized in an iterative process. First, we initialize \(\mu_k\) with
+some random values. Then we alternate between two steps:
-1. **Assignment step**: Keep \(\mu_k\) fixed. Minimize \(J\) with respect
- to \(r_{nk}\). This is done by assigning each data point to the closest
+1. **Assignment step**: Keep \(\mu_k\) fixed. Minimize \(J\) with respect to
+ \(r_{nk}\). This is done by assigning each data point to the closest
cluster center.
-2. **Update step**: Keep \(r_{nk}\) fixed. Minimize \(J\) with respect to
- \(\mu_k\). This is done by updating the cluster centers to the mean of
- the data points assigned to the cluster.
+1. **Update step**: Keep \(r_{nk}\) fixed. Minimize \(J\) with respect to
+ \(\mu_k\). This is done by updating the cluster centers to the mean of the
+ data points assigned to the cluster.
Step 1 can be seen as re-assigning the data points to clusters, while step 2
re-computes the cluster centers.
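
The two alternating steps can be sketched in a few lines of NumPy. This is a simplified illustration, not the implementation used by `scikit-learn`; the function name `kmeans_sketch`, the fixed number of iterations and the initialization with randomly chosen data points are assumptions made for this example.

```python
import numpy as np


def kmeans_sketch(X, K, n_iter=10, seed=0):
    """Simplified k-means: alternate assignment and update steps."""
    rng = np.random.default_rng(seed)
    # Initialize the centers with K randomly chosen data points
    mu = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    history = []
    for _ in range(n_iter):
        # Assignment step: squared distance of every point to every center,
        # then pick the closest center for each point
        sq_dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dist.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points
        for k in range(K):
            members = X[labels == k]
            if len(members) > 0:  # keep the old center if a cluster is empty
                mu[k] = members.mean(axis=0)
        # Distortion J for the current assignment and centers
        history.append(((X - mu[labels]) ** 2).sum())
    return labels, mu, history


# Toy data: two groups of points in 2D
rng = np.random.default_rng(1)
X_toy = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

labels, mu, history = kmeans_sketch(X_toy, K=2)
print(history)
```

The list `history` holds the distortion measure \(J\) after each iteration. Since neither step can increase \(J\), the values never go up from one iteration to the next.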
???+ info
- Since \(\mu_k\) is the mean of the data points assigned to cluster \(k\),
- we speak of the k-means algorithm.
+ Since \(\mu_k\) is the mean of the data points assigned to cluster \(k\), we
+ speak of the k-means algorithm.
The optimization of \(J\) is guaranteed to converge, but it might not find the
global minimum. The final solution depends on the initial cluster centers.
@@ -95,11 +94,11 @@ global minimum. The final solution depends on the initial cluster centers.
???+ question "Get a better understanding"
To improve your understanding of the k-means algorithm, either watch the
- following video or visit the interactive visualization.
- Both variants illustrate the iterative process of k-means.
+ following video or visit the interactive visualization. Both variants
+ illustrate the iterative process of k-means.
=== "Option 1: :fontawesome-brands-youtube: Video"
-
+
VIDEO

@@ -148,40 +148,39 @@ clustering semiconductor data.
### Recommendation system
-If you're using a music streaming service, you're familiar with listening to
-playlist. At the end of a playlist, the service recommends you similar songs
+If you're using a music streaming service, you're familiar with listening to
+playlists. At the end of a playlist, the service recommends similar songs
based on the previous songs.
-We will build such a recommendation system (a rudimentary one) with
-k-means. The goal is to cluster songs based on their audio features and
-recommend similar songs to the user.
+We will build such a recommendation system (a rudimentary one) with k-means.
+The goal is to cluster songs based on their audio features and recommend
+similar songs to the user.
-To build our own recommendation system, we will use a modified
-Spotify dataset.
+To build our own recommendation system, we will use a modified Spotify dataset.
???+ info
- The original data can be found on
+ The original data can be found on
[Kaggle](https://www.kaggle.com/datasets/asaniczka/top-spotify-songs-in-73-countries-daily-updated?resource=download).
- The modified data we are using, contains songs from 2024 up until now
- (time of writing: January 31, 2025).
-
----
+ The modified data we are using contains songs from 2024 up until now (time of
+ writing: January 31, 2025).
+
+______________________________________________________________________
???+ question "Download and read data"
1. Download the data set.
- 2. Read it with `pandas` and for convenience assign it to a variable called
- `data`. Then you will be able to use the following code snippets more
- easily.
- 3. Print the first rows of `data`.
+ 1. Read it with `pandas` and for convenience assign it to a variable called
+ `data`. Then you will be able to use the following code snippets more
+ easily.
+ 1. Print the first rows of `data`.
[Download Spotify tracks :fontawesome-solid-download:](../../../assets/data-science/algorithms/clustering/spotify.csv){ .md-button }
----
+______________________________________________________________________
With the data set loaded, we pick the following audio features for clustering:
@@ -205,9 +204,9 @@ X = data[features]
???+ question "Have a look at the data"
1. Look at the first couple of rows of the `DataFrame` `X`.
- 2. Check for potential missing values.
+ 1. Check for potential missing values.
- Hint: If you need a refresh on missing values, visit the
+ Hint: If you need a refresher on missing values, visit the
[Data preprocessing](../../data/preprocessing.md#missing-values) chapter.
You might have noticed that all features are numerical. In fact, k-means
@@ -215,12 +214,12 @@ You might have noticed that all features are numerical. In fact, k-means
???+ danger
- K-means clustering relies on Euclidean distances, which ==only make
- sense for numerical data==.
+ K-means clustering relies on Euclidean distances, which ==only make sense for
+ numerical data==.
- :warning: Never use k-means for categorical data, even if you encode the
- categories as numbers or labels. Distances between categorical values
- are not meaningful!
+ :warning: Never use k-means for categorical data, even if you encode the
+ categories as numbers or labels. Distances between categorical values are not
+ meaningful!
For clustering categorical data, use specialized algorithms like k-modes or
other appropriate methods.
@@ -243,15 +242,16 @@ min 0.093900 0.001740 ... 0.000010 46.999000
max 0.988000 0.998000 ... 0.989000 236.089000
```
-These basic statistics reveal that the features have different scales.
-For example, compare `tempo` and `danceability`. Tempo ranges from
-`#!python 46` to `#!python 236`, while danceability ranges from
-`#!python 0.0939` to `#!python 0.988`.
+These basic statistics reveal that the features have different scales. For
+example, compare `tempo` and `danceability`. Tempo ranges from `#!python 46` to
+`#!python 236`, while danceability ranges from `#!python 0.0939` to
+`#!python 0.988`.
-Thus, we apply a [Z-Score normalization](../../data/preprocessing.md#z-score-normalization)
-to all features (to have a mean of `0` and a standard deviation of `1`).
-This prevents k-means to disproportionately weigh features like `tempo` and
-ensures each feature contributes equally to the distance calculations.
+Thus, we apply a
+[Z-Score normalization](../../data/preprocessing.md#z-score-normalization) to
+all features (to have a mean of `0` and a standard deviation of `1`). This
+prevents k-means from disproportionately weighting features like `tempo` and
+ensures each feature contributes equally to the distance calculations.
```python
from sklearn.preprocessing import StandardScaler
@@ -276,7 +276,7 @@ print(cluster_indices)
array([4, 0, 3, ..., 1, 1, 2], dtype=int32)
```
-The `n_clusters` parameter specifies the number of clusters. We set it to
+The `n_clusters` parameter specifies the number of clusters. We set it to
`#!python 5` for now. The `random_state` parameter ensures reproducibility.
Using the `fit_predict()` method, we obtain the cluster indices for each data
point. In this case, these indices range from `#!python 0` to `#!python 4`.
@@ -288,17 +288,18 @@ This is where the elbow method comes into play. :flexed_biceps:
#### Elbow method
-With the attribute `inertia_`, we can access the distortion measure \(J\).
-From the k-means docs:
+With the attribute `inertia_`, we can access the distortion measure \(J\). From
+the k-means docs:
> `inertia_`:
->
+>
> Sum of squared distances of samples to their closest cluster center,...
->
-> -- [KMeans docs](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)
+>
+>
+> [KMeans docs](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)
In a loop we fit the k-means algorithm for different numbers of clusters \(K\)
-and store the corresponding distortion measure (`inertia_`). Then we plot the
+and store the corresponding distortion measure (`inertia_`). Then we plot the
results.
We define a function to apply the elbow method:
@@ -321,19 +322,21 @@ def elbow_method(X, max_clusters=15):
return distortions
```
-By default, the function `elbow_method()` tries values for \(K\) from
-`#!python 1` to `#!python 15` and stores the corresponding distortion measure
-in a `DataFrame`.
+By default, the function `elbow_method()` tries values for \(K\) from
+`#!python 1` to `#!python 15` and stores the corresponding distortion measure
+in a `DataFrame`.
----
+______________________________________________________________________
???+ question "Apply the elbow method"
1. Apply the `elbow_method()` on our scaled data `X`.
- 2. Create a line plot with the number of clusters (K) on the x-axis and
- the distortion measure on the y-axis.
- Hint: Use the [`plot()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html)
+ 1. Create a line plot with the number of clusters (K) on the x-axis and the
+ distortion measure on the y-axis.
+
+ Hint: Use the
+ [`plot()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html)
method of the resulting `DataFrame`.
Expand the below section to see a plot as possible solution.
@@ -350,14 +353,14 @@ Expand the below section to see a plot as possible solution.
#### Choice paralysis
Like in our example, it is not always obvious how many clusters to pick,
-because the "elbow" can sometimes be subtle or ambiguous. Ideally,
-you choose the point where the distortion/inertia sharply decreases and then
-levels off, forming an elbow-like bend in the plot.
+because the "elbow" can sometimes be subtle or ambiguous. Ideally, you choose
+the point where the distortion/inertia sharply decreases and then levels off,
+forming an elbow-like bend in the plot.
-In this example, possible candidates for the number of clusters \(K\) are
+In this example, possible candidates for the number of clusters \(K\) are
`#!python 5`, `#!python 6` or `#!python 7`. As we have to make a choice, we
-choose `#!python 6` clusters.
-Now, we have to simply fit the k-means algorithm with `#!python n_clusters=6`.
+choose `#!python 6` clusters. Now, we simply have to fit the k-means algorithm
+with `#!python n_clusters=6`.
```python
kmeans = KMeans(n_clusters=6, random_state=42)
@@ -374,13 +377,13 @@ cluster_indices = kmeans.fit_predict(X)
-The goal of this exercise is to recommend a song based on a previous
-track. The idea is to pick a song as recommendation that is in the same
-cluster as the previous one. To do so, we can use the `cluster_indices` to
-recommend similar songs.
+The goal of this exercise is to recommend a song based on a previous track. The
+idea is to pick a song as recommendation that is in the same cluster as the
+previous one. To do so, we can use the `cluster_indices` to recommend similar
+songs.
-Since the `cluster_indices` are in the same order as our initial `data`, we
-can simply assign them as a new column.
+Since the `cluster_indices` are in the same order as our initial `data`, we can
+simply assign them as a new column.
```python
data["cluster"] = cluster_indices
@@ -396,12 +399,12 @@ print(data.head())
4 6dOtVTDdiauQNBQEDOtlAB BIRDS OF A FEATHER Billie Eilish ... 0.438 104.978 4
```
-Now, that we assigned a cluster to all `#!python 11320` tracks, we can easily
-recommend a song based on a given `spotify_id` (the unique identifier of a
-song on the platform).
+Now that we have assigned a cluster to all `#!python 11320` tracks, we can easily
+recommend a song based on a given `spotify_id` (the unique identifier of a song
+on the platform).
-Use the below functions to see your recommender system in action. Don't
-worry about the details of these functions.
+Use the below functions to see your recommender system in action. Don't worry
+about the details of these functions.
```python
def print_track_info(track):
@@ -463,57 +466,52 @@ Cluster index: 4
recommendation. Try it out!
1. Pick another `spotify_id` and recommend a song.
- 2. Repeat the process a couple of times.
-
+ 1. Repeat the process a couple of times.
#### Are the recommendations good?
As you've tried the recommender system a couple of times, you might have
-wondered if the recommendations are actually good?!
-:thinking_face:
+wondered if the recommendations are actually good?! :thinking_face:
-Simply put, you have to be the judge if we were actually able to cluster
+Simply put, you have to be the judge if we were actually able to cluster
similar songs together and build a good recommendation system.
-In this application, it's quite intuitive: If you as a user like the
-recommendations and keep listening to the recommended songs, the system is
+In this application, it's quite intuitive: If you as a user like the
+recommendations and keep listening to the recommended songs, the system is
successful.
-
???+ info
-
- When talking about supervised tasks, we were able to measure the
- performance of our models. However, in unsupervised learning, like
- clustering, we do not have labels to compare our results to. Thus,
- evaluating the performance of unsupervised learning methods is challenging.
-
- In practice, you have to rely on domain knowledge to interpret the
- results and assess the quality of the model.
----
+ When talking about supervised tasks, we were able to measure the performance of
+ our models. However, in unsupervised learning, like clustering, we do not have
+ labels to compare our results to. Thus, evaluating the performance of
+ unsupervised learning methods is challenging.
+
+ In practice, you have to rely on domain knowledge to interpret the results and
+ assess the quality of the model.
+
+______________________________________________________________________
### Semiconductor data
-K-means is not only useful for recommendation systems, but also for
-anomaly detection. The idea is to form clusters which in turn can be used to
-detect the outliers/anomalies.
+K-means is not only useful for recommendation systems, but also for anomaly
+detection. The idea is to form clusters which in turn can be used to detect the
+outliers/anomalies.
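
One simple way to turn clusters into an anomaly score is the distance of each sample to its closest cluster center. The following sketch uses synthetic data from `make_blobs`, not the semiconductor data. The 98% quantile threshold is an arbitrary assumption; in practice the cut-off has to be chosen with domain knowledge.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three groups, only for illustration
X_syn, _ = make_blobs(n_samples=300, centers=3, random_state=42)

km = KMeans(n_clusters=3, n_init=10, random_state=42)
km.fit(X_syn)

# transform() returns the distance of each sample to every cluster center;
# the minimum is the distance to the assigned (closest) center
dist_to_center = km.transform(X_syn).min(axis=1)

# Flag the samples that are farthest away from their center
threshold = np.quantile(dist_to_center, 0.98)
anomalies = np.where(dist_to_center > threshold)[0]

print(f"{len(anomalies)} potential anomalies flagged")
```

Samples far away from every cluster center do not fit well into any group and are therefore candidates for a closer look.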
???+ info
The data is adapted from the UCI Machine Learning Repository.[^2]
-
- [^2]:
- McCann, M. & Johnston, A. (2008). SECOM [Dataset].
- UCI Machine Learning Repository.
- [https://doi.org/10.24432/C54305](https://doi.org/10.24432/C54305)
+
+ [^2]: McCann, M. & Johnston, A. (2008). SECOM [Dataset]. UCI Machine Learning
+ Repository. [https://doi.org/10.24432/C54305](https://doi.org/10.24432/C54305)
In this example, you will apply k-means to semiconductor data.
???+ question "Download and read data"
1. Download the below data set.
- 2. Read it with `pandas`.
- 3. Have a look at the data.
+ 1. Read it with `pandas`.
+ 1. Have a look at the data.
[Download semiconductor data :fontawesome-solid-download:](../../../assets/data-science/algorithms/clustering/semiconductor.csv){ .md-button }
@@ -522,47 +520,48 @@ In this example, you will apply k-means to semiconductor data.
Each row in the data set
> represents a single production entity with associated measured features [...]
->
-> --
UCI Machine Learning Repository
+>
+>
UCI Machine Learning Repository
???+ question "Apply k-means"
Solve the following tasks to apply k-means to the semiconductor data:
1. Are there any missing values in the data?
- 2. Deal with potential missing values; choose any suitable strategy. We
- recommend to utilize the [`SimpleImputer`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) with your chosen strategy. The application
- of the `SimpleImputer` should be straightforward as it implements the
- methods you already know, e.g., `fit_transform()`.
- 3. Do you need to scale the features? If so, apply a `StandardScaler`.
- 4. Use the elbow method to determine the number of clusters.
- 5. Fit the k-means algorithm with the optimal number of clusters.
-
- Hint: You can reuse the functions and code snippets from the Spotify
- example.
+ 1. Deal with potential missing values; choose any suitable strategy. We
+ recommend using the
+ [`SimpleImputer`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html)
+ with your chosen strategy. The application of the `SimpleImputer` should
+ be straightforward as it implements the methods you already know, e.g.,
+ `fit_transform()`.
+ 1. Do you need to scale the features? If so, apply a `StandardScaler`.
+ 1. Use the elbow method to determine the number of clusters.
+ 1. Fit the k-means algorithm with the optimal number of clusters.
+
+ Hint: You can reuse the functions and code snippets from the Spotify example.
??? info
- If you have solved the above tasks, you might wonder how to interpret
- your clustering results. Moreover, how can you detect potential anomalies?
+ If you have solved the above tasks, you might wonder how to interpret your
+ clustering results. Moreover, how can you detect potential anomalies?
- Again, it all depends on domain knowledge. If you're a expert in the
- semiconductor industry you might be able to tell if the clusters
- make sense and if there are any anomalies in the data. Otherwise,
- interpretation can be quite challenging.
+ Again, it all depends on domain knowledge. If you're an expert in the
+ semiconductor industry you might be able to tell if the clusters make sense and
+ if there are any anomalies in the data. Otherwise, interpretation can be quite
+ challenging.
## Recap
-In this chapter, we introduced k-means clustering. We covered the theory
+In this chapter, we introduced k-means clustering. We covered the theory
followed by two practical examples: building a recommendation system for
Spotify tracks and clustering semiconductor data.
We employed the elbow method to determine the optimal number of clusters and
discussed the challenges of evaluating clustering results.
-In the upcoming chapter, we introduce another unsupervised method,
-namely Principal Component Analysis (PCA) to reduce the dimensionality of data.
-PCA can be useful in various ways:
+In the upcoming chapter, we introduce another unsupervised method, namely
+Principal Component Analysis (PCA), to reduce the dimensionality of data. PCA
+can be useful in various ways:
- reducing the computational complexity of algorithms
- visualizing high-dimensional data in a 2D or 3D space
diff --git a/docs/data-science/algorithms/unsupervised/dim-reduction.md b/docs/data-science/algorithms/unsupervised/dim-reduction.md
index 8a6e4b07..fc581eb9 100644
--- a/docs/data-science/algorithms/unsupervised/dim-reduction.md
+++ b/docs/data-science/algorithms/unsupervised/dim-reduction.md
@@ -2,73 +2,73 @@
## Principal Component Analysis (PCA)
-In data science and machine learning, we often encounter data sets with
-hundreds or even thousands of features. We speak of high-dimensional data
-sets. While these features may contain valuable information, working with
-such high-dimensional data can be computationally expensive, prone to
-overfitting, and difficult to visualize. This is where another
-unsupervised method, dimensionality reduction comes in — a technique used to
-simplify data sets, while retaining much of the critical information.
-
-One of the most widely used methods for dimensionality reduction is
-Principal Component Analysis (PCA). PCA transforms a high-dimensional (=
-lots of features) data set into a smaller set of features (components). In
-practice, PCA can reduce hundreds of features down to just 2 or 3
-features, making PCA an ideal tool for visualization, preprocessing, and
-feature extraction.
-
-In this section, we will explain the inner workings of PCA and apply it to
-the semiconductor data set.
+In data science and machine learning, we often encounter data sets with
+hundreds or even thousands of features. We speak of high-dimensional data sets.
+While these features may contain valuable information, working with such
+high-dimensional data can be computationally expensive, prone to overfitting,
+and difficult to visualize. This is where another unsupervised method,
+dimensionality reduction, comes in: a technique used to simplify data sets
+while retaining much of the critical information.
+
+One of the most widely used methods for dimensionality reduction is Principal
+Component Analysis (PCA). PCA transforms a high-dimensional (= lots of
+features) data set into a smaller set of features (components). In practice,
+PCA can reduce hundreds of features down to just 2 or 3 features, making PCA an
+ideal tool for visualization, preprocessing, and feature extraction.
+
+In this section, we will explain the inner workings of PCA and apply it to the
+semiconductor data set.
### What is PCA?
-PCA is a **linear transformation technique** that identifies the directions
-(also called **principal components**) in which the data varies the most.
-These principal components capture as much variance as possible. PCA has a
-variety of applications, such as:
+PCA is a **linear transformation technique** that identifies the directions
+(also called **principal components**) in which the data varies the most. These
+principal components capture as much variance as possible. PCA has a variety of
+applications, such as:
- **Data visualization**: Plot a dimensionality reduced data set in 2D.
- **Preprocessing**: Removing noise or redundant features while retaining the
- essential patterns in data.
+ essential patterns in data.
- **Feature engineering**: Summarizing high-dimensional data into a smaller set
- of meaningful features.
+ of meaningful features.
### How does it work?
PCA follows these essential steps:
1. **Compute the covariance matrix**: PCA captures relationships between
- features by calculating the covariance between them.
+ features by calculating the covariance between them.
???+ info
-
- Think of the covariance matrix as the "spread" of the data. PCA looks
- at the interaction :fontawesome-solid-arrow-right: the correlation of
- features with each other. Visit the
+
+ Think of the covariance matrix as the "spread" of the data. PCA looks at the
+ interaction :fontawesome-solid-arrow-right: the correlation of features with
+ each other. Visit the
[correlation chapter](../../../statistics/bivariate/Correlation.md#covariance)
in the statistics course to learn more about covariance.
-2. **Eigen decomposition**: Identify the eigenvalues and eigenvectors of the
- covariance matrix. The eigenvectors represent the directions of the
- principal components, while the eigenvalues represent the amount of variance
- captured by each component.
+1. **Eigen decomposition**: Identify the eigenvalues and eigenvectors of the
+ covariance matrix. The eigenvectors represent the directions of the
+ principal components, while the eigenvalues represent the amount of
+ variance captured by each component.
???+ info
-
- If you want to know more about eigenvalues and eigenvectors, check out
- this [site](https://www.mathsisfun.com/algebra/eigenvalue.html).
-3. **Rank components**: Components are ranked by their eigenvalues. The first
- principal component captures the most variance, the second captures the
- next-most, and so on.
-4. **Transform the data**: Project the original data onto the top principal
- components to reduce its dimensionality.
+ If you want to know more about eigenvalues and eigenvectors, check out this
+ [site](https://www.mathsisfun.com/algebra/eigenvalue.html).
+
+1. **Rank components**: Components are ranked by their eigenvalues. The first
+ principal component captures the most variance, the second captures the
+ next-most, and so on.
+
+1. **Transform the data**: Project the original data onto the top principal
+ components to reduce its dimensionality.
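
The four steps above can be sketched with plain NumPy. This is only an
illustration under stated assumptions: the data is synthetic (not the
semiconductor set), the variable names are made up for this example, and
scikit-learn's `PCA` uses an SVD-based implementation rather than this literal
recipe.

```python
import numpy as np

# Synthetic data for illustration only: 200 observations, 5 features
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
# make two features strongly correlated so one direction dominates
X[:, 1] = 2 * X[:, 0] + 0.1 * rng.normal(size=200)

# center the features (the covariance is computed around the mean)
X_centered = X - X.mean(axis=0)

# 1. covariance matrix; features are columns, hence rowvar=False
cov = np.cov(X_centered, rowvar=False)

# 2. eigen decomposition; eigh is intended for symmetric matrices
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 3. rank components by eigenvalue, largest first
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# 4. project the centered data onto the top k components
k = 2
W = eigenvectors[:, :k]
Z = X_centered @ W

print(Z.shape)
print(eigenvalues / eigenvalues.sum())  # share of variance per component
```

The variance of the first column of `Z` equals the largest eigenvalue, which is
why eigenvalues can be read as "variance captured by each component".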
### The mathematical objective
-Let’s assume we have a data set \(X\) with \(p\) features (dimensions). We
-aim to transform \(X\) into a new matrix \(Z\) with \(k\) features such
-that \(k < p\), while retaining as much variance as possible.
+Let’s assume we have a data set \(X\) with \(p\) features (dimensions). We aim
+to transform \(X\) into a new matrix \(Z\) with \(k\) features such that
+\(k < p\), while retaining as much variance as possible.
The transformation (described previously under point 4) is defined as:
@@ -79,24 +79,23 @@ The transformation (described previously under point 4) is defined as:
\]
Where:
-
+
- \(Z\) is the transformed data set in the lower-dimensional space,
- \(W\) is a matrix whose columns are the top \(k\) eigenvectors of the
covariance matrix of \(X\).
???+ tip
- Dimensionality reduction helps in combating the *curse of dimensionality*,
- a phenomenon where the performance of algorithms deteriorates with an
- increase in the number of features. Algorithms like clustering
- often struggle to find meaningful patterns when working with a
- high-dimensional data set.
+ Dimensionality reduction helps in combating the *curse of dimensionality*, a
+ phenomenon where the performance of algorithms deteriorates with an increase in
+ the number of features. Algorithms like clustering often struggle to find
+ meaningful patterns when working with a high-dimensional data set.
## Example
-It’s time to apply PCA to real-world data. We'll revisit the semiconductor
-data set that we used in the previous clustering chapter. The first goal
-is to use PCA to reduce the data set's dimensions and visualize them.
+It’s time to apply PCA to real-world data. We'll revisit the semiconductor data
+set that we used in the previous clustering chapter. The first goal is to use
+PCA to reduce the data set's dimensions and visualize them.
### Prepare the data
@@ -132,8 +131,8 @@ scaled_data = scaler.fit_transform(data)
### Apply PCA
-We now apply PCA to reduce the dimensions. First, we fit the PCA model on
-the `scaled_data`:
+We now apply PCA to reduce the dimensions. First, we fit the PCA model on the
+`scaled_data`:
```python
from sklearn.decomposition import PCA
@@ -142,11 +141,11 @@ pca = PCA(n_components=2, random_state=42) # (1)!
components = pca.fit_transform(scaled_data)
```
-1. Although the above definition of PCA is deterministic, the actual
- implementation can be stochastic (depending on the solver used). Since
- `svd_solver` is set to `#!python "auto"` by default, the results can
- vary slightly. Long story short, setting `random_state` ensures
- reproducibility in all cases.
+1. Although the above definition of PCA is deterministic, the actual
+ implementation can be stochastic (depending on the solver used). Since
+ `svd_solver` is set to `#!python "auto"` by default, the results can vary
+ slightly. Long story short, setting `random_state` ensures reproducibility
+ in all cases.
`n_components=2` specifies that we want to reduce the data set to 2 dimensions.
@@ -167,7 +166,7 @@ plt.show()
```
1. The `alpha` parameter controls the transparency of the points. A value of
- `#!python 0.5` makes the points semi-transparent.
+ `#!python 0.5` makes the points semi-transparent.

@@ -178,21 +177,20 @@ plt.show()
-To quickly recap so far:
-We were able to reduce the semiconductor data set from `#!python 590`
-features to just `#!python 2`.
+To quickly recap so far: We were able to reduce the semiconductor data set from
+`#!python 590` features to just `#!python 2`.
#### Plot interpretation
-The scatter plot shows the data set in a 2D space with each observation as
-a point. Additionally, we can observe clusters. Since, principal
-components are ranked by the amount of variance they capture, the first
-component (PC1) is "more important" than the second component (PC2).
+The scatter plot shows the data set in a 2D space with each observation as a
+point. Additionally, we can observe clusters. Since principal components are
+ranked by the amount of variance they capture, the first component (PC1) is
+"more important" than the second component (PC2).
Therefore, differences along the x-axis (PC1) are more significant than
-differences along the y-axis (PC2). As we are interested in potential
-anomalies in semiconductor products, we can detect some observations that might
-be well worth some further investigation:
+differences along the y-axis (PC2). As we are interested in potential anomalies
+in semiconductor products, we can detect some observations that might be well
+worth some further investigation:

@@ -201,31 +199,31 @@ be well worth some further investigation:
-A majority of the data points are clustered in the upper left corner.
-Contrary, these single observations with a high difference on the x-axis
-(PC1) might be anomalies (annotated by these arrows). Although, samples
-within the encircled area have their differences on the y-axis (PC2),
-they are still worth investigating.
+A majority of the data points are clustered in the upper left corner. In
+contrast, the single observations with a large difference along the x-axis
+(PC1) might be anomalies (annotated by the arrows). Although samples within the
+encircled area differ mainly along the y-axis (PC2), they are still worth
+investigating.
???+ question "Re-apply PCA on unscaled data"
What would happen if you apply PCA to the unscaled data?
-
+
1. Create a new PCA instance with `n_components=2`.
- 2. Fit the PCA model on the `data` (unscaled) and transform it.
- 3. Visualize the new components in a 2D scatter plot.
- 4. Compare the results with the previous PCA visualization.
+ 1. Fit the PCA model on the `data` (unscaled) and transform it.
+ 1. Visualize the new components in a 2D scatter plot.
+ 1. Compare the results with the previous PCA visualization.
???+ tip
PCA is sensitive to the scale of the data. Thus, the scaled data nicely
- separates the clusters, while the unscaled data does not. So be sure to
- pick the right preprocessing steps for your data.
+ separates the clusters, while the unscaled data does not. So be sure to pick
+ the right preprocessing steps for your data.
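
To see the scale sensitivity in isolation, here is a minimal sketch on
synthetic data (the `toy` array and its two features are assumptions made for
this illustration, not part of the semiconductor example). Two independent
features are generated on very different scales and PCA is fitted with and
without standardization.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# two independent synthetic features on very different scales
toy = np.column_stack(
    [
        rng.normal(scale=1.0, size=500),
        rng.normal(scale=1000.0, size=500),
    ]
)

# PCA on the raw values: the large-scale feature dominates the variance
unscaled_ratio = PCA(n_components=2).fit(toy).explained_variance_ratio_

# PCA after standardization: both features contribute comparably
toy_scaled = StandardScaler().fit_transform(toy)
scaled_ratio = PCA(n_components=2).fit(toy_scaled).explained_variance_ratio_

print(unscaled_ratio)
print(scaled_ratio)
```

Without scaling, the first component mostly mirrors the feature with the
largest numeric range, regardless of whether that feature is informative.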
### Explained variance
When evaluating a PCA model, it is crucial to understand how much variance is
-captured by each principal component. Simply access the
+captured by each principal component. Simply access the
`explained_variance_ratio_` attribute:
```python
@@ -244,38 +242,37 @@ capture roughly `10%` of the variance.
???+ tip
- Put simply, our two principal components capture `10%` of the variance
- of the original `#!python 590` features which is not that great.
+ Put simply, our two principal components capture `10%` of the variance of the
+ original `#!python 590` features which is not that great.
:slightly_frowning_face:
Unfortunately, when dealing with real world data, results may not be as
-promising as expected. In this case, we might need to consider more
-components to capture a higher percentage of the variance.
+promising as expected. In this case, we might need to consider more components
+to capture a higher percentage of the variance.
???+ info "Choosing the number of components"
-
+
It is essential to choose the right number of components. For example, you
- could use the components as features for another machine learning model,
- hence you want to retain as much information as possible.
-
- However, the choice of how many components to keep is subjective.
- A common approach is to retain enough components to explain 90-95% of
- the variance.
+ could use the components as features for another machine learning model, hence
+ you want to retain as much information as possible.
+
+ However, the choice of how many components to keep is subjective. A common
+ approach is to retain enough components to explain 90-95% of the variance.
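
One possible way to apply such a threshold is to accumulate the explained
variance ratios. The sketch below uses synthetic data (`toy`, `latent` and
`mixing` are hypothetical names for this illustration) and a 90% threshold; it
is not the result for the semiconductor data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# synthetic data: 20 features driven by 3 hidden factors plus a little noise
latent = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 20))
toy = latent @ mixing + 0.1 * rng.normal(size=(300, 20))

# fit PCA without limiting the number of components
pca_full = PCA().fit(toy)

# running total of the variance explained by the first 1, 2, 3, ... components
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# first position where the running total reaches the threshold (1-based count)
n_needed = int(np.argmax(cumulative >= 0.90) + 1)

print(cumulative.round(3))
print(n_needed)
```

Because the toy data is built from only 3 hidden factors, a handful of
components is enough here; real data sets often need many more.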
-???+ question "Number of components to exceed 95% variance"
+???+ question "Number of components to exceed 95% variance"
Using the *scaled* semiconductor dataset:
-
+
1. Create a PCA model to analyze the variance in the data
- 2. Determine the minimum number of principal components needed to explain
- at least 95% of the total variance
-
+ 1. Determine the minimum number of principal components needed to explain at
+ least 95% of the total variance
+
Solution approaches:
- You can use the `explained_variance_ratio_` attribute, OR
- - There is an alternative approach that requires only 3 lines of code
- maximum (hint: google and check the PCA documentation)
-
+ - There is an alternative approach that requires only 3 lines of code maximum
+ (hint: google and check the PCA documentation)
+
Use the following quiz question to evaluate your answer.
@@ -303,17 +300,17 @@ solution.
def elbow_method(X, max_clusters=15):
inertia = []
K = range(1, max_clusters + 1)
-
+
for k in K:
model = KMeans(n_clusters=k, random_state=42)
model.fit(X)
inertia.append(model.inertia_)
-
+
# for convenience store in a DataFrame
distortions = pd.DataFrame(
{"k (number of cluster)": K, "inertia (J)": inertia}
)
-
+
return distortions
```
@@ -371,12 +368,12 @@ components.plot(
plt.show()
```
-To summarize, we applied the same preprocessing steps, reduced the data to
-2 dimensions using PCA. Afterward, we called the elbow method on the 2
-components to determine the optimal number of clusters. Then we applied
-k-means with `#!python n_clusters=5`. Finally, we plot the 2 components and
-color the observations according to their corresponding clusters. Have a look
-at the resulting plots.
+To summarize, we applied the same preprocessing steps and reduced the data to 2
+dimensions using PCA. Afterward, we called the elbow method on the 2 components
+to determine the optimal number of clusters. Then we applied k-means with
+`#!python n_clusters=5`. Finally, we plotted the 2 components and colored the
+observations according to their corresponding clusters. Have a look at the
+resulting plots.
=== "Clustered components"
@@ -387,12 +384,12 @@ at the resulting plots.
- The plot shows the semiconductor data set clustered into 5 groups.
- Each color represents a different cluster. The clusters are well
- separated in the 2D space.
+ The plot shows the semiconductor data set clustered into 5 groups. Each color
+ represents a different cluster. The clusters are well separated in the 2D
+ space.
=== "Elbow method"
-
+

@@ -400,21 +397,19 @@ at the resulting plots.
- The plot shows the distortion (inertia) for different numbers of
- clusters. This time around, we can distinctly see an elbow at `k=5`
- clusters. :flexed_biceps:
+ The plot shows the distortion (inertia) for different numbers of clusters. This
+ time around, we can distinctly see an elbow at `k=5` clusters. :flexed_biceps:
----
+______________________________________________________________________
## Recap
-In this chapter, we concluded the Supervised vs. Unsupervised Learning
-portion of this course and introduced **Principal Component Analysis
-(PCA)**, a linear technique for dimensionality reduction.
+In this chapter, we concluded the Supervised vs. Unsupervised Learning portion
+of this course and introduced **Principal Component Analysis (PCA)**, a linear
+technique for dimensionality reduction.
-We discussed the inner workings of PCA and applied it to the semiconductor
-data set, where we could identify potential anomalies in the data. We also
+We discussed the inner workings of PCA and applied it to the semiconductor data
+set, where we could identify potential anomalies in the data. We also
visualized the data set in a 2D space, making it easier to interpret and
-analyze.
-Lastly, a combination of PCA and k-means revealed distinct clusters in the
-semiconductor data set.
+analyze. Lastly, a combination of PCA and k-means revealed distinct clusters in
+the semiconductor data set.
diff --git a/docs/data-science/basics/intro.md b/docs/data-science/basics/intro.md
index 4d0e6178..4404f062 100644
--- a/docs/data-science/basics/intro.md
+++ b/docs/data-science/basics/intro.md
@@ -6,21 +6,19 @@ The terms data science and machine learning are often used interchangeably.
Let's explore them to get a better understanding of this course's content.
=== ":bar_chart: Data Science"
-
- **Data Science** is an interdisciplinary field that combines statistics,
- programming and domain knowledge to extract insights from data. As a data
- scientist, you could work in vastly different domains, from healthcare and
- finance to manufacturing and entertainment. The core skills remain the
- same, but the questions you answer and the data you work with vary greatly.
+ **Data Science** is an interdisciplinary field that combines statistics,
+ programming and domain knowledge to extract insights from data. As a data
+ scientist, you could work in vastly different domains, from healthcare and
+ finance to manufacturing and entertainment. The core skills remain the same,
+ but the questions you answer and the data you work with vary greatly.
=== ":robot: Machine Learning"
- **Machine Learning (ML)** is a subset of Data Science that focuses on
- building algorithms that learn patterns from data to make predictions or
- decisions.
+ **Machine Learning (ML)** is a subset of Data Science that focuses on building
+ algorithms that learn patterns from data to make predictions or decisions.
----
+______________________________________________________________________
The primary focus of this course is the data science workflow, from
@@ -29,13 +27,13 @@ Let's explore them to get a better understanding of this course's content.
----
+______________________________________________________________________
## What to Expect
Before diving into examples and workflows, let's set realistic expectations.
-Data science is fundamentally about **understanding and insight**, not
+Data science is fundamentally about **understanding and insight**, not
perfection. You won't find models that are 100% accurate and that's okay - it's
not the goal. Instead, data science helps us:
@@ -48,27 +46,27 @@ not the goal. Instead, data science helps us:
Chances are you've already used services built by data scientists today:
-- :material-currency-usd: **Dynamic Pricing**: Airlines and concert platforms
+- :material-currency-usd: **Dynamic Pricing**: Airlines and concert platforms
adjust prices based on demand, time and user behavior
-- :material-movie: **Recommendation Systems**: Netflix suggests movies based
- on your viewing history; Instagram curates your feed
-- :material-email: **Spam Detection**: Your email provider filters unwanted
+- :material-movie: **Recommendation Systems**: Netflix suggests movies based on
+ your viewing history; Instagram curates your feed
+- :material-email: **Spam Detection**: Your email provider filters unwanted
messages automatically
In this course, we'll build models for tasks like:
-- :material-home: **Price Prediction**: Estimating house prices based on
+- :material-home: **Price Prediction**: Estimating house prices based on
features like size and location
- :material-hospital: **Medical Diagnosis**: Classifying tumors as malignant or
benign
-- :material-alert: **Anomaly Detection**: Identifying faulty products in
+- :material-alert: **Anomaly Detection**: Identifying faulty products in
manufacturing data
## Building blocks
-A typical data science project includes several stages, from collecting raw
-data to deploying models in production. This course focuses on the
-**core workflow**:
+A typical data science project includes several stages, from collecting raw
+data to deploying models in production. This course focuses on the **core
+workflow**:
@@ -84,23 +82,22 @@ data to deploying models in production. This course focuses on the
| Stage | What You'll Learn |
-|------------------------|------------------------------------------------|
+| ---------------------- | ---------------------------------------------- |
| **Data Preparation** | Inspect, clean and structure datasets |
| **Data Preprocessing** | Transform features (encoding, scaling, etc., ) |
| **Modeling** | Train different machine learning algorithms |
| **Evaluation** | Measure performance and interpret results |
-
???+ tip "Iterative Process"
- Data science is rarely linear. You’ll repeatedly cycle through collecting
- data, preparing it, training models and evaluating results. Each evaluation
- highlights new issues (e.g., missing data or unrealistic assumptions) that
- send you back to earlier stages to improve your approach.
+ Data science is rarely linear. You’ll repeatedly cycle through collecting data,
+ preparing it, training models and evaluating results. Each evaluation
+ highlights new issues (e.g., missing data or unrealistic assumptions) that send
+ you back to earlier stages to improve your approach.
----
+______________________________________________________________________
-Throughout the course, we'll use hands-on Python examples. By the end, you'll
+Throughout the course, we'll use hands-on Python examples. By the end, you'll
apply these skills to a complete project from start to finish.
Let's start by setting up your computer for the data science journey.
diff --git a/docs/data-science/basics/setup.md b/docs/data-science/basics/setup.md
index cf5e0493..db481c14 100644
--- a/docs/data-science/basics/setup.md
+++ b/docs/data-science/basics/setup.md
@@ -1,20 +1,20 @@
# Setup
-To get started, we setup the programming environment. Follow these couple
-of steps to get ready, no prerequisites needed.
+To get started, we set up the programming environment. Follow these few steps
+to get ready; no prerequisites are needed.
## Visual Studio Code
-First, install a code editor. We urge you to instal Visual Studio Code
-(VS Code) a free and open-source editor developed by Microsoft
+First, install a code editor. We urge you to install Visual Studio Code (VS
+Code), a free and open-source editor developed by Microsoft
:fontawesome-brands-windows:.
-If you don't have Visual Studio Code already installed, download it from their
+If you don't have Visual Studio Code already installed, download it from their
website: .
### Profile
-To quickstart your VS Code setup, download our profile that includes essential
+To quickstart your VS Code setup, download our profile that includes essential
plugins and convenient settings tailored for data science work.