diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml new file mode 100644 index 00000000..b41fc756 --- /dev/null +++ b/.pre-commit-config.yaml @@ -0,0 +1,14 @@ +repos: + - repo: https://github.com/JakobKlotz/md-snakeoil + rev: v0.1.7 + hooks: + - id: snakeoil + + - repo: https://github.com/hukkin/mdformat + rev: 1.0.0 # Use the ref you want to point at + hooks: + - id: mdformat + additional_dependencies: + - mdformat-mkdocs + args: [--wrap, "79"] + \ No newline at end of file diff --git a/docs/data-science/algorithms/index.md b/docs/data-science/algorithms/index.md index d7d2eb5a..b1bf07a0 100644 --- a/docs/data-science/algorithms/index.md +++ b/docs/data-science/algorithms/index.md @@ -1,37 +1,36 @@ # Introduction -With extensive data preparation knowledge, we can tackle the next -big part of the course: algorithms. An algorithm is a +With extensive data preparation knowledge, we can tackle the next big part of +the course: algorithms. An algorithm is a > a set of mathematical instructions or rules that, especially if given to a > computer, will help to calculate an answer to a problem. -> +> > [Cambridge Dictionary](https://dictionary.cambridge.org/de/worterbuch/englisch/algorithm) -In data science/machine learning, algorithms are used to solve problems, -such as modelling data to make predictions for unseen data, or clustering data -to find patterns. +In data science/machine learning, algorithms are used to solve problems, such +as modelling data to make predictions for unseen data, or clustering data to +find patterns. -The consecutive chapters will introduce you to common algorithms, like -linear and logistic regression, decision trees and k-means clustering. We -will explore the theory as well as practical examples. First, we establish two -main concepts in machine learning: supervised and unsupervised learning. 
+The consecutive chapters will introduce you to common algorithms, like linear +and logistic regression, decision trees and k-means clustering. We will explore +the theory as well as practical examples. First, we establish two main concepts +in machine learning: supervised and unsupervised learning. ## Supervised Learning -Supervised learning is a type of machine learning where algorithms learn from -^^labeled^^ training data to make predictions on new, unseen data. The term -"supervised" comes from the idea that the algorithm is guided by a -"supervisor" (the labeled data) that provides the correct answers during -training. +Supervised learning is a type of machine learning where algorithms learn from +^^labeled^^ training data to make predictions on new, unseen data. The term +"supervised" comes from the idea that the algorithm is guided by a "supervisor" +(the labeled data) that provides the correct answers during training. In supervised learning, each training example consists of: -- Input features (\(X\)): The characteristics or attributes we use to make +- Input features (\(X\)): The characteristics or attributes we use to make predictions - Target variable (\(y\)): The correct output we want to predict -The algorithm learns the relationship between inputs (\(X\)) and outputs +The algorithm learns the relationship between inputs (\(X\)) and outputs (\(y\)), creating a model that can then (hopefully!) generalize to new data. ### Example @@ -60,49 +59,49 @@ new_apartment = [[150, 5]] predicted_price = model.predict(new_apartment) ``` -1. Underscores can be used as visual separators in numeric literals - to improve readability. They have no effect on the value of the number. For - example, `#!python 500_000` is the same as `#!python 500000`. +1. Underscores can be used as visual separators in numeric literals to improve + readability. They have no effect on the value of the number. For example, + `#!python 500_000` is the same as `#!python 500000`. 
For each new observation, we can use the trained model to predict the price. -The apartment with 150m² and 5 rooms has a predicted price of `#!python -775000`. +The apartment with 150m² and 5 rooms has a predicted price of +`#!python 775000`. ???+ info - Whether this estimate is actually close to reality depends on the - quality of the model and its underlying data. Later, we will - discuss how to measure a model's quality. + Whether this estimate is actually close to reality depends on the quality of + the model and its underlying data. Later, we will discuss how to measure a + model's quality. ---- +______________________________________________________________________ ### Classification vs. Regression Supervised learning encapsulates ^^both^^ classification and regression tasks. -``` mermaid +```mermaid graph LR A[Supervised Learning] --> B[Classification]; A --> C[Regression]; ``` ---- +______________________________________________________________________ #### Classification Classification problems involve predicting discrete categories or labels. The output is always one of a fixed set of classes. For instance, in binary -classification, the model decides between two possibilities. +classification, the model decides between two possibilities. -For example, the Portuguese retail bank data can be used to predict -whether a customer would subscribe to a term deposit. The target variable is -binary: yes or no. +For example, the Portuguese retail bank data can be used to predict whether a +customer would subscribe to a term deposit. The target variable is binary: yes +or no. -On the other hand, multiclass classification handles three or more categories -(like classifying animals in photos :fontawesome-solid-arrow-right: dog, -cat, dolphin, tiger, elephant, etc.). +On the other hand, multiclass classification handles three or more categories +(like classifying animals in photos :fontawesome-solid-arrow-right: dog, cat, +dolphin, tiger, elephant, etc.). 
---- +______________________________________________________________________ #### Regression @@ -112,18 +111,19 @@ numerical value along a continuous spectrum. These models work by finding patterns in the data to estimate a mathematical function that best describes the relationship between input features and the target variable. -For instance the example, predicting the price of an apartment based on -its size and the number of rooms is a regression task. +For instance, predicting the price of an apartment based on its size and the +number of rooms, as in the example above, is a regression task. ---- +______________________________________________________________________ #### Examples
-- __Classification__ +- __Classification__ + + ______________________________________________________________________ - --- Predicting a ^^categorical^^ target variable: - Spam or not spam @@ -133,11 +133,12 @@ its size and the number of rooms is a regression task. - Image classification (cat, dog, dolphin, etc.) - ... -- __Regression__ +- __Regression__ + + ______________________________________________________________________ - --- Predicting a ^^continuous^^ target variable: - + - Apartment prices (like in the example above) - Temperature - Sales revenue @@ -146,18 +147,18 @@ its size and the number of rooms is a regression task.
???+ info - - No matter if you're dealing with a classification or regression task, the - key to successful supervised learning lies in having high-quality labeled - data and selecting appropriate features (variables) that have predictive - power for the target variable. + + No matter if you're dealing with a classification or regression task, the key + to successful supervised learning lies in having high-quality labeled data and + selecting appropriate features (variables) that have predictive power for the + target variable. ## Unsupervised Learning -Contrary, unsupervised learning deals with ^^unlabeled^^ data to discover -hidden patterns and structures. Unlike supervised learning, there is no -"supervisor" providing correct answers. The algorithm tries to find -meaningful patterns on its own. +In contrast, unsupervised learning deals with ^^unlabeled^^ data to discover +hidden patterns and structures. Unlike supervised learning, there is no +"supervisor" providing correct answers. The algorithm tries to find meaningful +patterns on its own. In unsupervised learning, we solely have: @@ -174,13 +175,7 @@ Let's say we want to segment customers based on their shopping behavior: from sklearn.cluster import KMeans # customer data [annual_spending, avg_basket_size] -X = [ - [1200, 50], - [5000, 150], - [800, 30], - [4500, 140], - [1000, 45] -] +X = [[1200, 50], [5000, 150], [800, 30], [4500, 140], [1000, 45]] # use k-means to find customer segments model = KMeans(n_clusters=2, random_state=42) # (1)! @@ -189,22 +184,22 @@ segments = model.fit_predict(X) print(segments) ``` -1. Setting the `random_state` parameter ensures that you always get the same - results when executing the code repeatedly. Reproducibility is discussed +1. Setting the `random_state` parameter ensures that you always get the same + results when executing the code repeatedly. Reproducibility is discussed more in-depth in upcoming chapters. 
```title=">>> Output" [1 0 1 0 1] ``` -The variable `segments` contains the cluster assignments for each customer. -The cluster assignment is simply an `#!python int` indicating which group the -customer belongs to. In this example, we have two clusters with the first -customer (`#!python [1200, 50]`) belonging to cluster 1 and the second -customer (`#!python [5000, 150]`) to cluster 0 and so on. +The variable `segments` contains the cluster assignments for each customer. The +cluster assignment is simply an `#!python int` indicating which group the +customer belongs to. In this example, we have two clusters with the first +customer (`#!python [1200, 50]`) belonging to cluster 1 and the second customer +(`#!python [5000, 150]`) to cluster 0 and so on. -The following plot visualizes the input data as scatter plot -colored by the cluster assignments: +The following plot visualizes the input data as scatter plot colored by the +cluster assignments:
Lo and behold, even more math...
-For optimization purposes we use the negative log-likelihood as our loss +For optimization purposes we use the negative log-likelihood as our loss function: ???+ defi "Negative log-likelihood" @@ -119,7 +116,7 @@ function: \] with: - + - \(m\) as the number of training examples - \(y_i\) being the actual class (0 or 1) - \(\sigma(z_i)\) is the predicted probability using the sigmoid function @@ -127,43 +124,40 @@ function: ???+ tip - Intuitively speaking, the loss function penalizes the model for making - wrong predictions. If the model predicts a probability of 0.9 for a - spam email, and the email is actually spam (\(y=1\)), the loss is small. - On the other hand, if the model predicts a probability of 0.1 for a - spam email, and the email is spam (\(y=1\)), the loss will be high. + Intuitively speaking, the loss function penalizes the model for making wrong + predictions. If the model predicts a probability of 0.9 for a spam email, and + the email is actually spam (\(y=1\)), the loss is small. On the other hand, if + the model predicts a probability of 0.1 for a spam email, and the email is spam + (\(y=1\)), the loss will be high. + + The weights are gradually adjusted to minimize the loss. Think of it like + turning knobs slowly until we get better predictions. - The weights are gradually adjusted to minimize the loss. - Think of it like turning knobs slowly until we get better predictions. - - Gradually adjusting these knobs to minimize the loss is referred to as - gradient descent. + Gradually adjusting these knobs to minimize the loss is referred to as gradient + descent. -Conveniently, `scikit-learn` provides a logistic regression implementation -that takes care of the optimization for us. Finally, we look at a -practical example to see logistic regression in action. +Conveniently, `scikit-learn` provides a logistic regression implementation that +takes care of the optimization for us. 
Finally, we look at a practical example +to see logistic regression in action. ## Example Let's apply logistic regression to the breast cancer dataset, a classic binary -classification problem where we need to predict whether a tumor is *malignant +classification problem where we need to predict whether a tumor is *malignant or benign* based on various features. With class labels \(y\) being 0 (malignant) or 1 (benign), we can use logistic -regression to predict the probability of a tumor being benign. The features +regression to predict the probability of a tumor being benign. The features were calculated from digitized images of a breast mass. ???+ info - See the [UCI Machine Learning Repository](https://doi.org/10.24432/C5DW2B) - for more information on the data set.[^2] - - [^2]: - Wolberg, W., Mangasarian, O., Street, N., & Street, W. (1993). - Breast Cancer Wisconsin (Diagnostic) [Dataset]. UCI Machine - Learning Repository. - [https://doi.org/10.24432/C5DW2B](https://doi.org/10.24432/C5DW2B). + See the [UCI Machine Learning Repository](https://doi.org/10.24432/C5DW2B) for + more information on the data set.[^2] + [^2]: Wolberg, W., Mangasarian, O., Street, N., & Street, W. (1993). Breast + Cancer Wisconsin (Diagnostic) [Dataset]. UCI Machine Learning Repository. + [https://doi.org/10.24432/C5DW2B](https://doi.org/10.24432/C5DW2B). ### Load the data @@ -190,10 +184,9 @@ tumors. ???+ tip - Just like in the previous chapter, the data is divided into `X`, containing - the attributes and `y` holding the corresponding labels. Having attributes - and labels separated, makes life a bit easier when training and testing the - model. + Just like in the previous chapter, the data is divided into `X`, containing the + attributes and `y` holding the corresponding labels. Having attributes and + labels separated, makes life a bit easier when training and testing the model. 
???+ question "Number of features" @@ -206,13 +199,14 @@ How many features (attributes) does the breast cancer dataset have? - [ ] 32 `X.shape` reveals that we are dealing with 30 features. +

### Split the data Before training our model, we want to split our data into two parts. Just like -in the previous chapter, we perform a 80/20 split, i.e., we use 80% to train +in the previous chapter, we perform an 80/20 split, i.e., we use 80% to train the model and evaluate it on the remaining 20%. ```python @@ -225,9 +219,9 @@ X_train, X_test, y_train, y_test = train_test_split( ???+ tip - If you need a refresh on the parameters used in `train_test_split()` - revisit, the [Split the data](regression.md#split-the-data) section from - the previous chapter. + If you need a refresher on the parameters used in `train_test_split()`, + revisit the [Split the data](regression.md#split-the-data) section from the + previous chapter. ### Train the model @@ -240,20 +234,20 @@ model = LogisticRegression(random_state=42, max_iter=5_000) # (1)! model.fit(X_train, y_train) ``` -1. The `random_state` parameter ensures reproducibility, while - `max_iter` specifies the maximum number of iterations taken for the solver - to converge (i.e., solving the optimization problem to find the best +1. The `random_state` parameter ensures reproducibility, while `max_iter` + specifies the maximum number of iterations taken for the solver to + converge (i.e., solving the optimization problem to find the best parameter combination). `#!python model=LogisticRegression(...)` creates an instance of the logistic -regression model. Only after calling the `fit()` method, the `model` is -actually trained. Since we separated attributes and labels into `X_train` and -`y_train` respectively, we can directly call the method without any -further data handling. +regression model. Only after calling the `fit()` method is the `model` +actually trained. Since we separated attributes and labels into `X_train` and +`y_train` respectively, we can directly call the method without any further +data handling. 
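The `max_iter` annotation above describes a solver that iterates until it converges. As a rough illustration of what such an iterative fit does, the following pure-Python sketch runs plain gradient descent on the negative log-likelihood for a single feature. It is only a sketch under stated assumptions: the helper names (`sigmoid`, `negative_log_likelihood`, `fit_logistic`), the toy data and the `learning_rate` value are made up for illustration, the loss is assumed to be averaged over the examples, and `scikit-learn`'s default `lbfgs` solver is a different, more sophisticated optimizer.

```python
import math


def sigmoid(z: float) -> float:
    """Map a real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def negative_log_likelihood(weight, bias, xs, ys):
    """Average negative log-likelihood of a single-feature model (assumed form)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(weight * x + bias)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(xs)


def fit_logistic(xs, ys, learning_rate=0.1, max_iter=5_000):
    """Hypothetical helper: plain gradient descent, NOT scikit-learn's solver."""
    weight, bias = 0.0, 0.0
    for _ in range(max_iter):
        grad_weight = grad_bias = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(weight * x + bias) - y  # predicted minus actual
            grad_weight += error * x
            grad_bias += error
        # Turn the "knobs" a little in the direction that lowers the loss.
        weight -= learning_rate * grad_weight / len(xs)
        bias -= learning_rate * grad_bias / len(xs)
    return weight, bias


# Made-up toy data: one feature, binary labels, deliberately not separable.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

loss_before = negative_log_likelihood(0.0, 0.0, xs, ys)
weight, bias = fit_logistic(xs, ys)
loss_after = negative_log_likelihood(weight, bias, xs, ys)

print(f"Loss before training: {loss_before:.4f}")
print(f"Loss after training:  {loss_after:.4f}")
print(f"P(y=1 | x=4.0) = {sigmoid(weight * 4.0 + bias):.3f}")
```

The loop stops after `max_iter` steps regardless of progress, which is why a value that is too small can leave a model short of convergence.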
#### Weights and bias -With a trained model at hand, we can look at the weights \((b_1, b_2, ..., +With a trained model at hand, we can look at the weights \((b_1, b_2, ..., b_n)\) and bias \((a)\). ```python @@ -264,43 +258,43 @@ print(f"Model weights: {model.coef_}") Model weights: [[ 0.98208299 0.22519686 -0.36688444 0.0262268 ... ]] ``` -The `coef_` attribute contains the weight for each feature. +The `coef_` attribute contains the weight for each feature. [As discussed](#deja-vu-linear-regression), the weights are real numbers. ???+ warning "You might not have the exact same results" - Your model weights might differ slightly from the ones shown above. - This is completely normal and happens because: + Your model weights might differ slightly from the ones shown above. This is + completely normal and happens because: - **Numerical precision**: The default optimization solver - (`#!python "lbfgs"`) behind `LogisticRegression` encounters tiny - hardware-specific variations. The underlying libraries handle - floating-point arithmetic differently across hardware platforms. During the - iterative optimization, these tiny rounding differences accumulate, - causing the solver to converge to slightly different solutions. + **Numerical precision**: The default optimization solver (`#!python "lbfgs"`) + behind `LogisticRegression` encounters tiny hardware-specific variations. The + underlying libraries handle floating-point arithmetic differently across + hardware platforms. During the iterative optimization, these tiny rounding + differences accumulate, causing the solver to converge to slightly different + solutions. - :fontawesome-solid-lightbulb: These small differences don't affect your - model's predictions or accuracy. + :fontawesome-solid-lightbulb: These small differences don't affect your model's + predictions or accuracy. Now, it's your turn to look at the bias. ???+ question "Model bias" 1. 
Open the `scikit-learn` docs on the - [`LogisticRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) - class. - 2. Find out how to access the bias term of the model. - 3. Simply print the bias term of the model. + [`LogisticRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) + class. + 1. Find out how to access the bias term of the model. + 1. Simply print the bias term of the model. - :fontawesome-solid-lightbulb: Remember, the bias is often referred to as + :fontawesome-solid-lightbulb: Remember, the bias is often referred to as intercept. ### Predictions -Since, the main purpose of a machine learning model is to make predictions, -we will do just that. +Since the main purpose of a machine learning model is to make predictions, we +will do just that. -Predicting, is as simple as using the `predict()` method. We will use the +Predicting is as simple as using the `predict()` method. We will use the patient measurements of the test set - `X_test`. ```python @@ -314,25 +308,24 @@ print(y_pred[:5]) [1 0 0 1 1] ``` -Congratulations, you just build a machine learning model to predict breast -cancer. But how good is the model? To conclude the chapter, we will briefly +Congratulations, you just built a machine learning model to predict breast +cancer. But how good is the model? To conclude the chapter, we will briefly evaluate the model's performance. ### Evaluate the model -Surely, we could just manually compare the predictions (`y_pred`) with the -actual labels (`y_test`) and evaluate how often the model was correct. Or +Surely, we could just manually compare the predictions (`y_pred`) with the +actual labels (`y_test`) and evaluate how often the model was correct. Or instead, we can leverage another method called `score()`. 
```python score = model.score(X_test, y_test) ``` -First, the `score()` method takes `X_test` and makes the corresponding -predictions and programmatically compares the predictions with the actual -labels `y_test`. `score()` returns the accuracy -:fontawesome-solid-arrow-right: the proportion of correctly -classified instances. +First, the `score()` method takes `X_test` and makes the corresponding +predictions and programmatically compares the predictions with the actual +labels `y_test`. `score()` returns the accuracy :fontawesome-solid-arrow-right: +the proportion of correctly classified instances. ```python print(f"Model accuracy: {round(score, 4)}") @@ -342,33 +335,33 @@ print(f"Model accuracy: {round(score, 4)}") Model accuracy: 0.9561 ``` -In our case, the model correctly classified 95.61% of the test set. In -other words, in 95.61% of instances, the model was able to correctly predict -if a tumor is malignant or benign. +In our case, the model correctly classified 95.61% of the test set. In other +words, in 95.61% of instances, the model was able to correctly predict if a +tumor is malignant or benign. ???+ tip As the test set (both attributes and labels) were never used to train the - model, the accuracy is a good indicator of how well the model generalizes - to unseen data. + model, the accuracy is a good indicator of how well the model generalizes to + unseen data. ## Recap We covered logistic regression, a popular algorithm for binary classification. -Upon discussing the theory, we discovered similarities to linear regression -in regard to the linear combination of features. With the help of the -sigmoid function, we transformed the linear combination into probabilities -between 0 and 1. +Upon discussing the theory, we discovered similarities to linear regression in +regard to the linear combination of features. With the help of the sigmoid +function, we transformed the linear combination into probabilities between 0 +and 1. 
-Subsequently, we trained a logistic regression model on the breast cancer -data to predict whether a tumor is malignant or benign. To evaluate the -model we split the data and finally calculated the accuracy. +Subsequently, we trained a logistic regression model on the breast cancer data +to predict whether a tumor is malignant or benign. To evaluate the model we +split the data and finally calculated the accuracy. ???+ info - In subsequent chapters we will explore more sophisticated ways to split - data and evaluate models. + In subsequent chapters we will explore more sophisticated ways to split data + and evaluate models. Next up, we will dive into algorithms, like decision trees and random forest, that can handle both regression and classification problems. diff --git a/docs/data-science/algorithms/supervised/regression.md b/docs/data-science/algorithms/supervised/regression.md index d005d629..87b35cbf 100644 --- a/docs/data-science/algorithms/supervised/regression.md +++ b/docs/data-science/algorithms/supervised/regression.md @@ -2,14 +2,14 @@ ## Linear Regression -In machine learning, we often want to predict continuous numerical values, like -house prices, temperatures or sales figures. Linear regression also knows as -Ordinary Least Squares (OLS) provides a foundational approach to this problem -by modeling the relationship between input variables and a target variable +In machine learning, we often want to predict continuous numerical values, like +house prices, temperatures or sales figures. Linear regression, also known as +Ordinary Least Squares (OLS), provides a foundational approach to this problem +by modeling the relationship between input variables and a target variable using a straight line. -This chapter introduces linear regression through a hands-on example. -You'll learn to: +This chapter introduces linear regression through a hands-on example. 
You'll +learn to: - Build and train a linear regression model - Interpret model parameters (intercept and coefficients) @@ -17,30 +17,21 @@ You'll learn to: - Evaluate model performance using the coefficient of determination (\(R^2\)) - Get familiar with the `scikit-learn` workflow to train and evaluate models ---- +______________________________________________________________________ ???+ info This chapter adapts and expands upon: - ^^scikit-learn: *Ordinary Least Squares and Ridge Regression*[^1]^^ - - ^^scikit-learn: *Linear Models*[^2]^^ - - ^^scikit-learn: *Metrics and scoring: quantifying the quality of predictions*[^3]^^ - - [^1]: - [https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols_ridge.html](https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols_ridge.html) - [^2]: - [https://scikit-learn.org/stable/modules/linear_model.html](https://scikit-learn.org/stable/modules/linear_model.html) - [^3]: - [https://scikit-learn.org/stable/modules/model_evaluation.html#r2-score-the-coefficient-of-determination](https://scikit-learn.org/stable/modules/model_evaluation.html#r2-score-the-coefficient-of-determination) + - ^^scikit-learn: *[Ordinary Least Squares and Ridge Regression](https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols_ridge.html)*^^ + - ^^scikit-learn: *[Linear Models](https://scikit-learn.org/stable/modules/linear_model.html)*^^ + - ^^scikit-learn: *[Metrics and scoring: quantifying the quality of predictions](https://scikit-learn.org/stable/modules/model_evaluation.html#r2-score-the-coefficient-of-determination)*^^ ## Theory -Linear regression, also known as Ordinary Least Squares (OLS), models the -relationship between a continuous target variable \(y\) and one or more input -variables \(X\). 
The goal is to find the best linear function that predicts +Linear regression, also known as Ordinary Least Squares (OLS), models the +relationship between a continuous target variable \(y\) and one or more input +variables \(X\). The goal is to find the best linear function that predicts \(\hat{y}\) from \(X\). ???+ defi "Linear combination" @@ -50,15 +41,15 @@ variables \(X\). The goal is to find the best linear function that predicts \] where: - + - \(w_0\) is the **intercept** (bias term) - \(w_1, w_2, ..., w_n\) are the **coefficients** (weights) - \(x_1, x_2, ..., x_n\) are the input features -The term "Ordinary Least Squares" refers to the optimization objective, -finding the weights \(w_0, w_1, ..., w_n\) that minimize the sum of squared -differences called residuals between the actual values \(y\) and predicted -values \(\hat{y}\). +The term "Ordinary Least Squares" refers to the optimization objective, finding +the weights \(w_0, w_1, ..., w_n\) that minimize the sum of squared differences +called residuals between the actual values \(y\) and predicted values +\(\hat{y}\). ???+ defi "Cost function" @@ -68,28 +59,27 @@ values \(\hat{y}\). where \(n\) is the number of observations. -This minimization ensures that our model makes the smallest possible errors -on average when predicting the training data. Let's look at an example. +This minimization ensures that our model makes the smallest possible errors on +average when predicting the training data. Let's look at an example. ## Example -`scikit-learn` provides a couple of data sets for download. To fit a linear +`scikit-learn` provides a couple of data sets for download. To fit a linear regression on a real-world example, we choose the California housing data set. -More information about the California Housing data set can be found +More information about the California Housing data set can be found [here](https://scikit-learn.org/stable/datasets/real_world.html#california-housing-dataset). 
???+ info Data reference: - ^^Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions, - Statistics and Probability Letters, 33:291-297, 1997^^ + ^^Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions, Statistics + and Probability Letters, 33:291-297, 1997^^ -Our objective is to model the target variable \(y\) using input variables -\(X\). In this case, \(y\) corresponds to the median house value, expressed in -hundreds of thousands of dollars ($100,000). -Below figure shows all houses in California colored by their median value -\(y\). +Our objective is to model the target variable \(y\) using input variables +\(X\). In this case, \(y\) corresponds to the median house value, expressed in +hundreds of thousands of dollars ($100,000). The figure below shows all houses +in California colored by their median value \(y\).
-- __Scatter Plot__ - - --- - - Looking at the scatter plot, you might intuitively imagine drawing a straight - line through the points that best captures the trend. This intuition is - exactly what OLS does mathematically, it finds the optimal line that minimizes - the distance between the line and all data points. :point_down: - --
- - +- __Scatter Plot__ + + ______________________________________________________________________ + + Looking at the scatter plot, you might intuitively imagine drawing a straight + line through the points that best captures the trend. This intuition is + exactly what OLS does mathematically, it finds the optimal line that + minimizes the distance between the line and all data points. :point_down: + +-
+ +
--
- - +-
+ +
-- __Best-Fit Line__ +- __Best-Fit Line__ + + ______________________________________________________________________ - --- + The OLS model finds the line that minimizes the sum of squared residuals, the + vertical distances between each point and the line. Recall from the theory + section that this is exactly what the cost function measures: - The OLS model finds the line that minimizes the sum of squared residuals, - the vertical distances between each point and the line. Recall from the - theory section that this is exactly what the cost function measures: - \[ \text{min} \quad \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \] @@ -237,8 +228,8 @@ plt.show() ### Train the model -Our next step is to train an OLS model to automatically find this "best-fit" -line. Remember, since we have one input variable, the linear combination +Our next step is to train an OLS model to automatically find this "best-fit" +line. Remember, since we have one input variable, the linear combination simplifies to: \[ @@ -260,8 +251,8 @@ from sklearn.linear_model import LinearRegression model = LinearRegression() ``` -At this point, the model is not trained, however that can be easily done -using the `fit()` method. Remember, to use the training set +At this point, the model is not trained, however that can be easily done using +the `fit()` method. Remember, to use the training set ```python model.fit(X=X_train[["MedInc"]], y=y_train) @@ -269,7 +260,7 @@ model.fit(X=X_train[["MedInc"]], y=y_train) #### Intercept and coefficient -After training, we can inspect the model's learned parameters. The intercept +After training, we can inspect the model's learned parameters. 
The intercept and coefficient that define the best-fit line: ```python @@ -290,15 +281,15 @@ These values tell us that our linear model is: **Interpretation:** -- **Intercept (0.4446)**: The baseline house value (when *MedInc* is zero) - ~ $44,460 +- **Intercept (0.4446)**: The baseline house value (when *MedInc* is zero) + is around $44,460 - **Coefficient (0.4193)**: For each unit increase in *MedInc*, the house value increases by ~ $41,930 ### Predictions -Now that the model is trained, we can predict house prices for new observations. -Let's predict the price \(\hat{y}\) for a house in an area where +Now that the model is trained, we can predict house prices for new +observations. Let's predict the price \(\hat{y}\) for a house in an area where *MedInc* is `#!python 3.5`: ```python @@ -319,7 +310,8 @@ The model predicts a house value of approximately **$191,230**. #### Manual validation -We can verify this prediction using our linear equation. Substituting \(x_1 = 3.5\): +We can verify this prediction using our linear equation. Substituting +\(x_1 = 3.5\): \[ \begin{align} @@ -332,21 +324,21 @@ This matches our model's prediction! ???+ question "Practice: Make your own prediction" - Calculate the predicted house price for an area where *MedInc* is + Calculate the predicted house price for an area where *MedInc* is `#!python 5.0`. - + 1. Use `#!python model.predict()` to get the prediction. - 2. Validate it by hand using the linear equation. - 3. Do the results match? + 1. Validate it by hand using the linear equation. + 1. Do the results match? ### Evaluate the model -Now we can make predictions, but we don't know how accurate they actually are. -We need to quantify the model's performance to determine if it generalizes -well to new, unseen data. +Now we can make predictions, but we don't know how accurate they actually are. +We need to quantify the model's performance to determine if it generalizes well +to new, unseen data. 
-Remember we set aside our test set earlier? This is where we use it. By -evaluating on data the model hasn't seen during training, we get an honest +Remember we set aside our test set earlier? This is where we use it. By +evaluating on data the model hasn't seen during training, we get an honest assessment of its predictive power. To measure the model's performance, we'll use the coefficient of determination. @@ -357,7 +349,7 @@ To measure the model's performance, we'll use the coefficient of determination. This section focuses on the definition implemented by `scikit-learn`. -The coefficient of determination, known as the \(R^2\) score, measures the +The coefficient of determination, known as the \(R^2\) score, measures the proportion of variance in the target variable that is explained by the model. ???+ defi "\(R^2\) Score" @@ -391,48 +383,47 @@ r2 = r2_score(y_true=y_test, y_pred=y_pred) print(f"R² Score: {round(r2, 4)}") ``` -``` title=">>> Output" +```title=">>> Output" R² Score: 0.4589 ``` ???+ tip "Understanding \(R^2\)" - An \(R^2\) score of 0.4589 means the model explains 45.89% of the variance - in house prices using only median income. While this is informative, it's - not great. It suggests that other factors (location, house size, etc.) + An \(R^2\) score of 0.4589 means the model explains 45.89% of the variance in + house prices using only median income. While this is informative, it's not + great. It suggests that other factors (location, house size, etc.) significantly influence house prices. ???+ question "Find a better model" - Can you improve the \(R^2\) score? Fit new models and experiment with the + Can you improve the \(R^2\) score? 
Fit new models and experiment with the following: **Model variations:** - - - Use different individual input variables (e.g., *HouseAge*, *AveRooms*, + + - Use different individual input variables (e.g., *HouseAge*, *AveRooms*, *AveBedrms*) - Use a combination of multiple input variables - Compare single-variable vs. multi-variable models - + **Data preparation:** - + - Adjust the train-test split ratio - Remember to use `#!python random_state` for reproducibility - + **Analysis:** - + - Calculate and compare \(R^2\) scores for each model - Inspect the intercept and coefficients for multi-variable models - Make predictions with your best-performing model - Manually verify one prediction using the linear equation - - Which combination gives you the highest \(R^2\) score? What does this - tell you about which features are most important for predicting house - prices? + + Which combination gives you the highest \(R^2\) score? What does this tell you + about which features are most important for predicting house prices? ## Detour: Model workflow -The workflow you practiced here forms the foundation for all supervised +The workflow you practiced here forms the foundation for all supervised learning algorithms in `scikit-learn`: ```python @@ -452,17 +443,17 @@ y_pred = model.predict(X_test) score = model.score(X_test, y_test) ``` -This consistent pattern applies to all upcoming chapters, whether you're +This consistent pattern applies to all upcoming chapters, whether you're building regression or classification models. ## Recap -In this chapter, you learned the fundamentals of linear regression through a +In this chapter, you learned the fundamentals of linear regression through a practical example. The key takeaways: -- **Linear regression** models the relationship between input variables and a +- **Linear regression** models the relationship between input variables and a target variable using a linear combination.
Find the best-fit line by minimizing the sum of squared residuals. -- **\(R^2\) score** quantifies how well the model explains variance in the +- **\(R^2\) score** quantifies how well the model explains variance in the target variable - **`scikit-learn` workflow** allows to easily train and evaluate model diff --git a/docs/data-science/algorithms/supervised/tree-based/cart.md b/docs/data-science/algorithms/supervised/tree-based/cart.md index fb53e97d..a44a7101 100644 --- a/docs/data-science/algorithms/supervised/tree-based/cart.md +++ b/docs/data-science/algorithms/supervised/tree-based/cart.md @@ -1,15 +1,15 @@ # Decision Tree -So far we have covered linear regression and logistic regression which are -limited to linear relationships. In contrast, decision trees are non-linear -models able to capture complex relationships in the data. They are easy to +So far we have covered linear regression and logistic regression which are +limited to linear relationships. In contrast, decision trees are non-linear +models able to capture complex relationships in the data. They are easy to interpret and visualize, making them a popular choice for many applications. Moreover, decision trees can be used for both regression ^^*and*^^ classification! In this chapter, we will explore the theory behind decision trees followed by -practical examples. As always we will use `scikit-learn` for hands-on +practical examples. As always we will use `scikit-learn` for hands-on experience. ## Basic intuition @@ -35,13 +35,12 @@ graph TD Depending on the answers, you can decide whether to go skiing or not. A decision tree resembles a flowchart where each internal node represents a -decision based on a feature (e.g., Is there any snow?), each branch represents -the outcome of that decision, and each leaf node represents a final -prediction (either a class label for classification or a continuous value -for regression). 
+decision based on a feature (e.g., Is there any snow?), each branch represents +the outcome of that decision, and each leaf node represents a final prediction +(either a class label for classification or a continuous value for regression). -To get a better understanding of the terms node, branch and leaf, consider -the illustration of a (rotated) tree. +To get a better understanding of the terms node, branch and leaf, consider the +illustration of a (rotated) tree.
![Decision tree illustration](../../../../assets/data-science/algorithms/tree-based/tree.png) @@ -50,9 +49,9 @@ the illustration of a (rotated) tree.
-In the skiing example, the nodes are the questions you ask yourself. With -branches being a simple binary split (the answers to the question). -The leaf nodes are the final predictions, in our case whether to go skiing. +In the skiing example, the nodes are the questions you ask yourself. With +branches being a simple binary split (the answers to the question). The leaf +nodes are the final predictions, in our case whether to go skiing. Given the skiing decision tree, what kind of supervised learning task is this? @@ -77,129 +76,125 @@ which is a classic binary classification task. ???+ info - This theoretical section on decision trees follows: ^^Christopher M. - Bishop. 2006. *Pattern Recognition and Machine Learning*[^1]^^ - - We focus on a particular algorithm called CART - (=**C**lassification **A**nd **R**egression **T**rees). - The theoretical foundations of CART were developed by: - ^^Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone. 1984. + This theoretical section on decision trees follows: ^^Christopher M. Bishop. + 2006\. *Pattern Recognition and Machine Learning*[^1]^^ + + We focus on a particular algorithm called CART (=**C**lassification **A**nd + **R**egression **T**rees). The theoretical foundations of CART were developed + by: ^^Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone. 1984. *Classification and Regression Trees*[^2]^^ - - [^1]: - Christopher M. Bishop. Pattern Recognition and Machine Learning. - Springer, 2006. [Link](https://www.microsoft.com/en-us/research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf) - [^2]: - Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone. - Classification and Regression Trees. Chapman and Hall/CRC, 1984. - [https://doi.org/10.1201/9781315139470](https://doi.org/10.1201/9781315139470) ---- + [^1]: Christopher M. Bishop. Pattern Recognition and Machine Learning. + Springer, 2006. 
+ [Link](https://www.microsoft.com/en-us/research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf) + [^2]: Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone. + Classification and Regression Trees. Chapman and Hall/CRC, 1984. + [https://doi.org/10.1201/9781315139470](https://doi.org/10.1201/9781315139470) + +______________________________________________________________________ When building a decision tree a couple of questions arise:
-- :fontawesome-solid-question:{ .lg .middle } __Question__ +- :fontawesome-solid-question:{ .lg .middle } __Question__ - --- + ______________________________________________________________________ 1. How do we pick the right feature for a split? - 2. What's the decision criteria at each node? - 3. How large do we grow the tree? - + 1. What's the decision criteria at each node? + 1. How large do we grow the tree? -- :fontawesome-solid-lightbulb:{ .lg .middle } __Intuition__ +- :fontawesome-solid-lightbulb:{ .lg .middle } __Intuition__ - --- + ______________________________________________________________________ - 1. Which questions do we ask? Why did we ask "Can I - get to a skiing resort?" and "Is there any snow?"? - 2. It does not have to be a simple yes/no question. It can be a - threshold for continuous values as well. E.g., "Is there more than - 10cm of fresh snow?" But how do we choose the threshold? - 3. How many questions do we ask? Why only 2 and not more? + 1. Which questions do we ask? Why did we ask "Can I get to a skiing resort?" + and "Is there any snow?"? + 1. It does not have to be a simple yes/no question. It can be a threshold for + continuous values as well. E.g., "Is there more than 10cm of fresh + snow?" But how do we choose the threshold? + 1. How many questions do we ask? Why only 2 and not more?
-With these questions in mind, let's dive into the theory of decision trees -in order to tackle them. +With these questions in mind, let's dive into the theory of decision trees in +order to tackle them. ---- +______________________________________________________________________ ### Greedy optimization As a decision tree is a supervised learning algorithm, the goal is to predict the target variable \(y\) with a set of features \(x_1, x_2, ..., x_n\). -With the data at hand, the CART algorithm finds the optimal tree -structure that minimizes the prediction error. In turn, the -optimal tree structure depends on the chosen splits. +With the data at hand, the CART algorithm finds the optimal tree structure that +minimizes the prediction error. In turn, the optimal tree structure depends on +the chosen splits. ???+ info - + A split in CART is a binary decision rule that divides the dataset into two subsets based on a specific feature and threshold. - Imagine if we extend our skiing example with the split "Is there more than - 10cm of fresh snow?". The split divides the data into two subsets: one - where observations have more than 10cm of fresh snow and another where - observations don't. With *amount of fresh snow* being the feature and *10cm* - the threshold. + Imagine if we extend our skiing example with the split "Is there more than 10cm + of fresh snow?". The split divides the data into two subsets: one where + observations have more than 10cm of fresh snow and another where observations + don't. With *amount of fresh snow* being the feature and *10cm* the threshold. -However, given large data sets, there are simply too many splitting -possibilities to consider at once. Hence, the tree is grown in a greedy fashion. +However, given large data sets, there are simply too many splitting +possibilities to consider at once. Hence, the tree is grown in a greedy +fashion. 
-The greedy optimization starts with a single root node splitting the data -into two partitions and adds additional nodes one at a time. At each step, the +The greedy optimization starts with a single root node splitting the data into +two partitions and adds additional nodes one at a time. At each step, the algorithm chooses a split using exhaustive search. The best split is determined -by a criterion. Remember, that decision trees can deal with regression and +by a criterion. Remember, that decision trees can deal with regression and classification problems. Hence, the criterion differs for the two tasks. ---- +______________________________________________________________________ #### Regression -For regression trees, the best split (feature threshold combination) at each -node is determined by minimizing the *residual sum-of-squares error (RSS)*, +For regression trees, the best split (feature threshold combination) at each +node is determined by minimizing the *residual sum-of-squares error (RSS)*, defined as: ???+ defi "Residual sum-of-squares (RSS)" - \[ - RSS = \sum_{i \in t_L} (y_i - \bar{y}_L)^2 + \sum_{i \in t_R} (y_i - - \bar{y}_R)^2 + \[ + RSS = \sum_{i \in t_L} (y_i - \bar{y}_L)^2 + \sum_{i \in t_R} (y_i - + \bar{y}_R)^2 \] where \(t_L\) and \(t_R\) are the left and right child nodes after the split, and \(\bar{y}_L\) and \(\bar{y}_R\) are the mean target values in the respective nodes. -The algorithm searches through all possible splits to find the one that +The algorithm searches through all possible splits to find the one that minimizes this RSS criterion. ???+ info - Since each split separates the input data into two partitions, the - prediction is the mean of the target variable \(y\) in the respective - partition. - - Hence, intuitively speaking, we do not optimize the entire tree at once - but rather optimize each split locally. 
+ Since each split separates the input data into two partitions, the prediction + is the mean of the target variable \(y\) in the respective partition. + + Hence, intuitively speaking, we do not optimize the entire tree at once but + rather optimize each split locally. #### Classification -For classification tasks, the best split at each node is determined by minimizing -the *Gini impurity*. +For classification tasks, the best split at each node is determined by +minimizing the *Gini impurity*. ???+ defi "Gini impurity" For a node \(t\) with \(K\) classes, the Gini impurity is defined as: \[ - Gini(t) = \sum_{k=1}^K p_{k}(1-p_{k}) = 1 - \sum_{k=1}^K p_{k}^2 + Gini(t) = \sum_{k=1}^K p_{k}(1-p_{k}) = 1 - \sum_{k=1}^K p_{k}^2 \] - + where \(p_k\) is the proportion of class \(k\) observations. The Gini impurity (sometimes referred to as Gini index) encourages leaf nodes @@ -207,64 +202,63 @@ where the majority of observations belong to a single class. ???+ info - The prediction at each leaf node is the majority class among the training + The prediction at each leaf node is the majority class among the training observations in that node. ---- +______________________________________________________________________ #### TLDR -No matter the task (regression or classification), with a greedy optimization -strategy, the CART algorithm searches for the best split using an exhaustive +No matter the task (regression or classification), with a greedy optimization +strategy, the CART algorithm searches for the best split using an exhaustive search at each node to ultimately minimize the prediction error. Thus answering -the first two questions, *a* (How do we pick the right feature for a split?) +the first two questions, *a* (How do we pick the right feature for a split?) and *b* (What's the decision criteria at each node?). 
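To make the two split criteria concrete, here is a minimal sketch in plain Python. The helper names `rss` and `gini` and the toy values are illustrative assumptions; they are not part of `scikit-learn` or of the course material.

```python
from collections import Counter


def rss(left, right):
    """Residual sum of squares of a split: each child node predicts its own mean."""
    total = 0.0
    for node in (left, right):
        mean = sum(node) / len(node)
        total += sum((y - mean) ** 2 for y in node)
    return total


def gini(labels):
    """Gini impurity of a single node: 1 - sum_k p_k^2."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


# Toy regression split (hypothetical target values)
print(rss([1.0, 2.0, 3.0], [10.0, 12.0]))  # 2.0 + 2.0 = 4.0

# Toy classification nodes (hypothetical labels)
print(gini(["ski", "ski", "ski", "ski"]))  # pure node: 0.0
print(gini(["ski", "ski", "stay", "stay"]))  # evenly mixed node: 0.5
```

A pure node has a Gini impurity of 0, and an evenly mixed two-class node reaches the maximum of 0.5. During the exhaustive search, such a criterion would be evaluated for every candidate feature and threshold, and the split with the lowest value is kept.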
-A CART can be seen as a piecewise-constant model, as it partitions the feature -space into regions and assigns a constant prediction (either the mean of a +A CART can be seen as a piecewise-constant model, as it partitions the feature +space into regions and assigns a constant prediction (either the mean of a continuous value or a label) to each region. ### Tree size -Lastly, we answer question, *c* (How large do we grow the tree?). -Put differently, when should we stop adding nodes? +Lastly, we answer question, *c* (How large do we grow the tree?). Put +differently, when should we stop adding nodes? -First, the tree is grown as large as possible until a stopping criterion is -met. This criterion can be the maximum tree depth or a minimum number of -observations per leaf. Second, the tree is pruned back. Pruning is the process -of removing nodes that do not improve the model's performance. It balances the +First, the tree is grown as large as possible until a stopping criterion is +met. This criterion can be the maximum tree depth or a minimum number of +observations per leaf. Second, the tree is pruned back. Pruning is the process +of removing nodes that do not improve the model's performance. It balances the RSS error or Gini impurity against model complexity. ???+ info - If you want to dive deeper into tree pruning, we recommend reading page 665 - of Bishop's book *Pattern Recognition and Machine Learning*[^1] + If you want to dive deeper into tree pruning, we recommend reading page 665 of + Bishop's book *Pattern Recognition and Machine Learning*[^1] ---- +______________________________________________________________________ ## Advantages and Limitations -Decision trees offer several significant advantages, but they also have their +Decision trees offer several significant advantages, but they also have their limitations:
-- :fontawesome-regular-thumbs-up:{ .lg .middle } __Advantages__ +- :fontawesome-regular-thumbs-up:{ .lg .middle } __Advantages__ - --- + ______________________________________________________________________ - Easy to interpret and visualize - Can capture non-linear relationships +- :fontawesome-regular-thumbs-down:{ .lg .middle } __Limitations__ -- :fontawesome-regular-thumbs-down:{ .lg .middle } __Limitations__ - - --- + ______________________________________________________________________ - - Prone to overfitting, i.e., building a model that perfectly fits the - training data but fails to generalize on new (unseen) data. - - Sensitive to data, i.e., small changes in the data can lead to - significantly different trees. + - Prone to overfitting, i.e., building a model that perfectly fits the + training data but fails to generalize on new (unseen) data. + - Sensitive to data, i.e., small changes in the data can lead to + significantly different trees.
@@ -276,14 +270,14 @@ As mentioned earlier, we will use `scikit-learn` for hands-on experience. [^3]: `scikit-learn` documentation: [Decision Trees](https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart) -Functionalities around decision trees are all part of the +Functionalities around decision trees are all part of the [`tree` module](https://scikit-learn.org/stable/api/sklearn.tree.html) in `scikit-learn`. ### Regression -First, we start with a regression task. We will use the California housing -data to predict house prices using a decision tree regressor. +First, we start with a regression task. We will use the California housing data +to predict house prices using a decision tree regressor. #### Load data @@ -303,7 +297,7 @@ X_train, X_test, y_train, y_test = train_test_split( ) ``` -As always, a seed is set for reproducibility (`#!python random_state=42`). It +As always, a seed is set for reproducibility (`#!python random_state=42`). It can be any integer, you can simply pick any number. #### Fit and evaluate the model @@ -330,17 +324,17 @@ print(f"Model performance (R²): {round(score, 2)}") Model performance (R²): 0.61 ``` -The `score()` method returns the coefficient of determination \(R^2\). -You should be already familiar with \(R^2\), as it was first introduced -in the [Regression chapter](../regression.md#coefficient-of-determination) to -evaluate the fit of a linear regression. +The `score()` method returns the coefficient of determination \(R^2\). You +should be already familiar with \(R^2\), as it was first introduced in the +[Regression chapter](../regression.md#coefficient-of-determination) to evaluate +the fit of a linear regression. -The decision tree model achieved an \(R^2\) of 0.61 on the test set, which +The decision tree model achieved an \(R^2\) of 0.61 on the test set, which leaves room for improvement. 
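As a quick check of what `score()` returns, the sketch below compares it with `r2_score`. It uses synthetic data from `make_regression` instead of the California housing data (an assumption, so that the snippet runs without a download), which means the printed value will not be 0.61.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): 500 observations, 4 features
X, y = make_regression(n_samples=500, n_features=4, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = DecisionTreeRegressor(random_state=42)
model.fit(X_train, y_train)

# score() takes the test features and the true targets ...
score = model.score(X_test, y_test)
# ... and should agree with r2_score computed from explicit predictions
r2 = r2_score(y_true=y_test, y_pred=model.predict(X_test))

print(round(score, 4), round(r2, 4))
```

Note that `score()` expects the features first and the true targets second; the predictions are computed internally.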
???+ info - On a side note: Although we fitted a decision tree on `#!python 16512` + On a side note: Although we fitted a decision tree on `#!python 16512` observations, the process of actually training the model is quite fast! #### Plot the tree @@ -352,8 +346,8 @@ We can easily visualize the tree using the `plot_tree` function. ???+ tip - This is the first time that we discourage you from running the code - snippet below. Soon you will know why. + This is the first time that we discourage you from running the code snippet + below. Soon you will know why. ```python import matplotlib.pyplot as plt @@ -369,14 +363,14 @@ plt.show() # use matplotlib to show the plot
-Though we can't read any of the information present, the plot hints at a huge +Though we can't read any of the information present, the plot hints at a huge tree. Due to its complexity, the model does not add much value to the understanding of the data (it's simply not interpretable). -Actually visualizing this particular tree takes some time, hence we -discouraged you from executing the code. +Actually visualizing this particular tree takes some time, hence we discouraged +you from executing the code. -But why do we get such a huge tree? By default, the CART implementation in +But why do we get such a huge tree? By default, the CART implementation in `scikit-learn` grows the tree as large as possible and does *not* prune it. ##### ... to fix @@ -395,9 +389,9 @@ model = DecisionTreeRegressor( model.fit(X_train, y_train) ``` -The `max_depth` parameter limits the depth of the tree, while `min_samples_leaf` -sets the minimum number of samples (observations) required to be in a leaf -node. Both prevent the tree from growing too large. +The `max_depth` parameter limits the depth of the tree, while +`min_samples_leaf` sets the minimum number of samples (observations) required +to be in a leaf node. Both prevent the tree from growing too large. ???+ info @@ -412,28 +406,28 @@ import matplotlib.pyplot as plt from sklearn.tree import plot_tree plot_tree( - model, - filled=True, # (1)! + model, + filled=True, # (1)! feature_names=X.columns, # (2)! - proportion=True # (3)! + proportion=True, # (3)! ) plt.show() ``` -1. `#!python filled=True` colors nodes according to prediction values. - A stronger color indicating a higher value. -2. The parameter `feature_names` is used to label the features in the tree. -3. `proportion=True` displays the proportion of samples in each node. +1. `#!python filled=True` colors nodes according to prediction values. A + stronger color indicating a higher value. +1. The parameter `feature_names` is used to label the features in the tree. +1. 
`proportion=True` displays the proportion of samples in each node. ???+ info - - Generally, it is always good practice to consult the documentation, if - you are unsure about the usage of a function/class. - Regarding `plot_tree()`, you might find some useful information in the + Generally, it is always good practice to consult the documentation, if you are + unsure about the usage of a function/class. + + Regarding `plot_tree()`, you might find some useful information in the [docs](https://scikit-learn.org/stable/modules/generated/sklearn.tree.plot_tree.html) - that can help you customize the plot to your liking. - So don't shy away from reading the documentation! + that can help you customize the plot to your liking. So don't shy away from + reading the documentation!
![A small tree](../../../../assets/data-science/algorithms/tree-based/small-tree.png) @@ -446,25 +440,23 @@ plt.show() ???+ tip The nodes are quite easy to read: - - Starting with the root node, the feature `MedInc` performs - the first split. If the median income is less than 5.086, we follow the - left branch else the right branch. The resulting `squared_error` of the - split is shown as well. At the root node, the `squared_error` (mean of the - squared differences between the actual values and the predicted value) - is 1.337. The lower the `squared_error`, the better the split. A "perfect - split" would result in a `squared_error` of 0. - - The root node splits the data into two subsets, the left branch results - in a subset containing 79.3% of the training data and the right branch - 20.7%. Compared to the root node, both additional splits lead to a - decrease of the `squared_error` and thus increase the predictive power. - After two more splits, we reach the leaf nodes. Each leaf node contains - a value, the final prediction. - -Now we have a pruned tree, which reduced the risk of overfitting. However, at -the cost of model performance. The \(R^2\) decreased from 0.61 to 0.42 which -might indicate that such a simple tree might not capture the complexity of the + + Starting with the root node, the feature `MedInc` performs the first split. If + the median income is less than 5.086, we follow the left branch else the right + branch. The resulting `squared_error` of the split is shown as well. At the + root node, the `squared_error` (mean of the squared differences between the + actual values and the predicted value) is 1.337. The lower the `squared_error`, + the better the split. A "perfect split" would result in a `squared_error` of 0. + + The root node splits the data into two subsets, the left branch results in a + subset containing 79.3% of the training data and the right branch 20.7%.
+ Compared to the root node, both additional splits lead to a decrease of the + `squared_error` and thus increase the predictive power. After two more splits, + we reach the leaf nodes. Each leaf node contains a value, the final prediction. + +Now we have a pruned tree, which reduced the risk of overfitting. However, at +the cost of model performance. The \(R^2\) decreased from 0.61 to 0.42 which +might indicate that such a simple tree might not capture the complexity of the data well.
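To get a feeling for how strongly the two parameters constrain a tree, one can compare depth and leaf count with `get_depth()` and `get_n_leaves()`. The sketch below is only an illustration: it uses synthetic data from `make_regression` (an assumption, so it runs without a download), and the values `max_depth=3` and `min_samples_leaf=15` are hypothetical, so the numbers will differ from those of the California housing tree.

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): 1000 observations, 8 features
X, y = make_regression(n_samples=1000, n_features=8, noise=5.0, random_state=42)

# Default settings: the tree is grown as large as possible
full_tree = DecisionTreeRegressor(random_state=42).fit(X, y)

# Constrained growth (hypothetical parameter values)
small_tree = DecisionTreeRegressor(
    random_state=42, max_depth=3, min_samples_leaf=15
).fit(X, y)

print("full tree: ", full_tree.get_depth(), "levels,", full_tree.get_n_leaves(), "leaves")
print("small tree:", small_tree.get_depth(), "levels,", small_tree.get_n_leaves(), "leaves")
```

A binary tree of depth 3 can have at most 8 leaves, which is why the constrained tree stays readable when plotted.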
@@ -476,26 +468,26 @@ data well.
In practice, you have to find the right parameters to balance model complexity -and performance. Unfortunately, there is no one-size-fits-all solution. You +and performance. Unfortunately, there is no one-size-fits-all solution. You have to tune the parameters based on the data and the task at hand. ???+ question "Parameter tuning" - Try some different combinations of `max_depth` and `min_samples_leaf`. - Use the same train test split, we defined earlier. - + Try some different combinations of `max_depth` and `min_samples_leaf`. Use the + same train test split, we defined earlier. + 1. Manually change the values. - 2. Fit the model. - 3. Evaluate the model. - 4. Plot the model. - 5. Repeat! :repeat: + 1. Fit the model. + 1. Evaluate the model. + 1. Plot the model. + 1. Repeat! :repeat: Can you get an \(R^2\) higher than `#!python 0.7`? ### Classification -Next, we switch to a classification task. We will re-use the breast cancer -data set introduced in the previous Classification chapter. +Next, we switch to a classification task. We will re-use the breast cancer data +set introduced in the previous Classification chapter. #### Load data @@ -510,7 +502,7 @@ X_train, X_test, y_train, y_test = train_test_split( #### Fit and evaluate the model -For classification trees, `scikit-learn` provides the class +For classification trees, `scikit-learn` provides the class `DecisionTreeClassifier`. ```python hl_lines="1" @@ -518,21 +510,23 @@ from sklearn.tree import DecisionTreeClassifier model = DecisionTreeClassifier( # again, set max_depth and min_samples_leaf to prevent growing a huge tree - random_state=784, max_depth=7, min_samples_leaf=5 + random_state=784, + max_depth=7, + min_samples_leaf=5, ) ``` ???+ question "Fit and evaluate the model" Now it is your time to fit and evaluate the model. Although, you have never - used an instance of `DecisionTreeClassifier` before, you can use the same - methods as with other models in `scikit-learn`. Simply refer to the - previous regression example. - + used an instance of `DecisionTreeClassifier` before, you can use the same methods + as with other models in `scikit-learn`.
Simply refer to the - previous regression example. - + used an instance of `DecisionClassifier` before, you can use the same methods + as with other models in `scikit-learn`. Simply refer to the previous regression + example. + 1. Fit the model on `X_train` and `y_train`. - 2. Evaluate the model on `X_test` and `y_test`. - 3. Print the model's performance. - 4. Plot the tree. + 1. Evaluate the model on `X_test` and `y_test`. + 1. Print the model's performance. + 1. Plot the tree. Lastly answer following quiz question to evaluate your result. @@ -550,14 +544,14 @@ from the logistic regression. ## Recap -We comprehensively explored decision trees, focusing on the CART algorithm. -The theory section illuminated its core mechanisms, while practical -examples demonstrated building and evaluating decision trees for regression and +We comprehensively explored decision trees, focusing on the CART algorithm. The +theory section illuminated its core mechanisms, while practical examples +demonstrated building and evaluating decision trees for regression and classification tasks. Key takeaways include: - Algorithm insights into tree construction -- Practical implementation skills +- Practical implementation skills - Understanding of decision trees' interpretability and overfitting risks -Next, we'll extend our knowledge to Random Forests, an ensemble method +Next, we'll extend our knowledge to Random Forests, an ensemble method combining multiple decision trees to enhance predictive performance. diff --git a/docs/data-science/algorithms/supervised/tree-based/forest.md b/docs/data-science/algorithms/supervised/tree-based/forest.md index 4e270cfa..2afe8c3f 100644 --- a/docs/data-science/algorithms/supervised/tree-based/forest.md +++ b/docs/data-science/algorithms/supervised/tree-based/forest.md @@ -12,65 +12,65 @@ CART (Classification and Regression Trees) algorithm, we can dive right in. ???+ info - Random forests were introduced by Leo Breiman in 2001. 
The following - section closely follows the original paper. + Random forests were introduced by Leo Breiman in 2001. The following section + closely follows the original paper. ^^Breiman, L. Random Forests. *Machine Learning 45*, 5–32 (2001).^^ [https://doi.org/10.1023/A:1010933404324](https://doi.org/10.1023/A:1010933404324) A random forest combines multiple decision trees to create an ensemble model. -The idea is to grow multiple trees and average their predictions. Thus, +The idea is to grow multiple trees and average their predictions. Thus, resulting in a more robust model that improves generalization and reduces overfitting. The randomness in a random forest stems from two techniques: 1. Bootstrap sampling -2. Random feature selection +1. Random feature selection ### Bootstrap sampling -The first technique is known as **bootstrap sampling**. Given a -training set of size $N$, we draw $N$ samples ==with replacement==. This means -that some samples may be repeated, while others may not be included at all. -This results in a new training set of the same size as the original, but with -some samples missing and others duplicated. +The first technique is known as **bootstrap sampling**. Given a training set of +size $N$, we draw $N$ samples ==with replacement==. This means that some +samples may be repeated, while others may not be included at all. This results +in a new training set of the same size as the original, but with some samples +missing and others duplicated. -Each tree is fit on a different bootstrap sample. Intuitively speaking, this +Each tree is fit on a different bootstrap sample. Intuitively speaking, this means that each tree sees a slightly different "version" of the training data. ### Random feature selection -The second technique is **random feature selection**. -Remember, that a CART is grown by selecting the best split at each node. -This is done by considering all features. 
Contrary when growing trees for a -random forest, we only consider a random subset of features at each split. +The second technique is **random feature selection**. Remember, that a CART is +grown by selecting the best split at each node. This is done by considering all +features. Contrary when growing trees for a random forest, we only consider a +random subset of features at each split. ---- +______________________________________________________________________ ### Putting it all together Each tree in a random forest is fit on a bootstrap sample and uses a random -subset of features at each split. -In case of regression, the predictions of all trees are simply averaged. In -case of classification, the majority vote is taken. The majority vote in a -random forest classification means that the class predicted most frequently by -the individual trees is selected as the final prediction. - -No matter the task, classification or regression: it was observed that -introducing randomness in the tree-growing process improves the model +subset of features at each split. In case of regression, the predictions of all +trees are simply averaged. In case of classification, the majority vote is +taken. The majority vote in a random forest classification means that the class +predicted most frequently by the individual trees is selected as the final +prediction. + +No matter the task, classification or regression: it was observed that +introducing randomness in the tree-growing process improves the model performance. ???+ info - Contrary to the classic CART, random forests do not constrain the tree - growth. I.e., trees are fully grown and not pruned. + Contrary to the classic CART, random forests do not constrain the tree growth. + I.e., trees are fully grown and not pruned. ## Examples -With a basic understanding of random forests we take a look at some -examples. As always, we'll use our favorite machine learning package -`scikit-learn` (at least that of the author :wink:). 
+With a basic understanding of random forests we take a look at some examples. +As always, we'll use our favorite machine learning package `scikit-learn` (at +least that of the author :wink:). In order to focus on the random forest implementation and its parameters, we'll reuse the California housing data (for regression) and the breast cancer data @@ -82,8 +82,8 @@ Let's start with building a random forest to predict California housing prices. #### Load data -As usual, we load the data and split it into a training and test set in -order to evaluate the model later on. +As usual, we load the data and split it into a training and test set in order +to evaluate the model later on. ```python from sklearn.datasets import fetch_california_housing @@ -98,8 +98,8 @@ X_train, X_test, y_train, y_test = train_test_split( #### Fit the model -Just like with decision trees, `scikit-learn` provides two separate classes -for regression and classification, namely `RandomForestRegressor` and +Just like with decision trees, `scikit-learn` provides two separate classes for +regression and classification, namely `RandomForestRegressor` and `RandomForestClassifier`. Both are part of the `ensemble` module. ```python @@ -109,8 +109,8 @@ model = RandomForestRegressor(random_state=784) # (1)! model.fit(X_train, y_train) ``` -1. As a random forest is well random :sweat_smile:, we set the - `random_state` to ensure the reproducibility of our results. +1. As a random forest is well random :sweat_smile:, we set the `random_state` + to ensure the reproducibility of our results. Depending on your setup, the fitting process might take a couple of seconds. @@ -127,18 +127,18 @@ Model performance (R²): 0.81 ???+ info - Remember, that the `score()` method of a decision tree regressor - (`DecisionTreeRegressor`) returned the coefficient of determination - \(R^2\). The same applies to random forests regressors. 
+ Remember, that the `score()` method of a decision tree regressor + (`DecisionTreeRegressor`) returned the coefficient of determination \(R^2\). + The same applies to random forests regressors. Compared to a single tree with an \(R^2\) of 0.61, the random forest performs -considerably better with an \(R^2\) of 0.81. You can re-visit the according +considerably better with an \(R^2\) of 0.81. You can re-visit the according section [here](cart.md#fit-and-evaluate-the-model). ???+ question "How many trees are in the forest?" - - Consult the `scikit-learn` docs to find out how many trees are in the - forest by default. Use the following question for self-assessment. + + Consult the `scikit-learn` docs to find out how many trees are in the forest by + default. Use the following question for self-assessment. How many trees form a forest by default? @@ -152,16 +152,13 @@ The parameter `n_estimators` defaults to 100 trees. ???+ info - If you want to get closer to the original definition of a random forest - regressor by Breiman, you have to set the `max_features` parameter. - Specifically, with \(m\) features, the number of features considered at - each split should be \(\frac{m}{3}\) for regression. + If you want to get closer to the original definition of a random forest + regressor by Breiman, you have to set the `max_features` parameter. + Specifically, with \(m\) features, the number of features considered at each + split should be \(\frac{m}{3}\) for regression. ```python hl_lines="2" - RandomForestRegressor( - max_features=len(X_train.columns) // 3, - random_state=784 - ) + RandomForestRegressor(max_features=len(X_train.columns) // 3, random_state=784) ``` By default, `scikit-learn` considers \(m\) features for each split. @@ -169,9 +166,9 @@ The parameter `n_estimators` defaults to 100 trees. ???+ tip If you're unsure how to set parameters of a model (such as `max_features`), - stick to the defaults. `scikit-learn` provides sensible defaults - that work well. 
In later chapters, we will explore methods to - automatically tune these hyperparameters. + stick to the defaults. `scikit-learn` provides sensible defaults that work + well. In later chapters, we will explore methods to automatically tune these + hyperparameters. ### Classification @@ -180,14 +177,14 @@ Next, we switch to a classification task. ???+ question Load the breast cancer data, fit and evaluate a random forest. - + 1. Load the data and split it into a training and test set. - 2. Load the appropriate random forest class. - 3. Fit the model. - 4. Evaluate the model on the test set. + 1. Load the appropriate random forest class. + 1. Fit the model. + 1. Evaluate the model on the test set. - Hint: This and the previous chapter should provide all necessary - information, to solve the tasks. + Hint: This and the previous chapter should provide all necessary information, + to solve the tasks. #### Inspecting the forest @@ -211,24 +208,20 @@ print(model.estimators_) # (1)! `estimators_` is a list of individual tree instances. If you're dealing with a `RandomForestRegressor`, `estimators_` is a list of `DecisionTreeRegressor`. -In most cases, you won't need to inspect the individual trees. Nevertheless, -we can utilize this information to solidify our understanding of random -forests. +In most cases, you won't need to inspect the individual trees. Nevertheless, we +can utilize this information to solidify our understanding of random forests. ---- +______________________________________________________________________ ### Stronger together -We fit a random forest classifier on a synthetic data set to -==literally== illustrate the different trees. First, we generate the data. +We fit a random forest classifier on a synthetic data set to ==literally== +illustrate the different trees. First, we generate the data. 
```python from sklearn.datasets import make_classification -X, y = make_classification( - random_state=42, - n_clusters_per_class=1 -) +X, y = make_classification(random_state=42, n_clusters_per_class=1) ``` Next, we initialize and fit a random forest classifier. @@ -240,13 +233,13 @@ classifier = RandomForestClassifier( classifier.fit(X, y) ``` -Note, that we set the number of trees to `#!python 4`. We keep the number -small as we visualize them later on. The `max_depth` parameter limits the -depth of each tree to `#!python 3`. This is done to perform pruning and thus -keep the trees simple and easier to plot. +Note, that we set the number of trees to `#!python 4`. We keep the number small +as we visualize them later on. The `max_depth` parameter limits the depth of +each tree to `#!python 3`. This is done to perform pruning and thus keep the +trees simple and easier to plot. Finally, we visualize all trees. We access the trees via the `estimators_` -attribute and plot them using the familiar `plot_tree()` function. Everything +attribute and plot them using the familiar `plot_tree()` function. Everything else is just plot customization. ```python hl_lines="5 7" @@ -276,26 +269,26 @@ plt.show()
-Although there is a lot of information cramped inside one figure, at first -glance it is obvious that all four trees are different. Each of them differs -in splits (feature and threshold), number of nodes and predictions. +Although there is a lot of information cramped inside one figure, at first +glance it is obvious that all four trees are different. Each of them differs in +splits (feature and threshold), number of nodes and predictions. Each one of these trees on their own might not generalize well, hence they are -often referred to as weak learners. However, when combined, they form a +often referred to as weak learners. However, when combined, they form a "strong" model. That's the essence of an ensemble method! ### Feature importance -One of the most powerful attribute of random forests is their ability to -assess feature importance: measuring how much each input variable contributes -to predicting the target variable. +One of the most powerful attribute of random forests is their ability to assess +feature importance: measuring how much each input variable contributes to +predicting the target variable. -Remember that trees are fitted on a [bootstrap](forest.md#bootstrap-sampling) -training set. Since some samples are left out during this process, we can use -these to measure the importance of each feature. These unused observations are -called "out-of-bag" (OOB) samples. For each feature, the OOB samples are -randomly permuted (shuffled) and the increase in prediction error is measured. -Features that lead to larger increases in error when permuted are considered +Remember that trees are fitted on a [bootstrap](forest.md#bootstrap-sampling) +training set. Since some samples are left out during this process, we can use +these to measure the importance of each feature. These unused observations are +called "out-of-bag" (OOB) samples. For each feature, the OOB samples are +randomly permuted (shuffled) and the increase in prediction error is measured. 
+Features that lead to larger increases in error when permuted are considered more important. Let's examine feature importance using the breast cancer dataset: @@ -318,27 +311,26 @@ print(rf.feature_importances_) To keep the example concise, we did not perform a train test split. -Feature importance values are a `#!python list` of `#!python float`s. -Each value corresponds to a feature in the order they were passed to the -model. The values are normalized and sum to `#!python 1.0`. -A higher value indicates that the feature contributes more to making correct -predictions. +Feature importance values are a `#!python list` of `#!python float`s. Each +value corresponds to a feature in the order they were passed to the model. The +values are normalized and sum to `#!python 1.0`. A higher value indicates that +the feature contributes more to making correct predictions. Feature importance can help with: 1. Feature selection: Identifying which features are most relevant for - predictions -2. Model interpretation: Understanding which features drive the model's - decisions -3. Data collection: Guiding future data collection efforts by highlighting - important measurements + predictions +1. Model interpretation: Understanding which features drive the model's + decisions +1. Data collection: Guiding future data collection efforts by highlighting + important measurements ???+ question "Visualize the feature importance" - Generate a bar plot to visualize the feature importance. - Use any package of your choice. For convenience, you can use the - following code snippet to get started. - + Generate a bar plot to visualize the feature importance. Use any package of + your choice. For convenience, you can use the following code snippet to get + started. + ```python import pandas as pd @@ -367,6 +359,6 @@ sensitivity to data changes. 
While slightly less interpretable than single trees, random forests provide better generalization, more robust predictions, and useful insights through feature importance measures. -With `scikit-learn`, you are now able to build a random forest for regression +With `scikit-learn`, you are now able to build a random forest for regression and classification tasks. You have also learned how to inspect individual trees and assess feature importance. diff --git a/docs/data-science/algorithms/unsupervised/clustering.md b/docs/data-science/algorithms/unsupervised/clustering.md index 1e006a7d..789e300b 100644 --- a/docs/data-science/algorithms/unsupervised/clustering.md +++ b/docs/data-science/algorithms/unsupervised/clustering.md @@ -1,16 +1,16 @@ # Clustering -In this section, we will start to explore unsupervised learning, where we work -with data that isn't accompanied by labels. One of the primary techniques -within this realm is clustering, which aims to uncover patterns or structures -in the data by grouping similar data points together. A popular method for -achieving this is k-means clustering, which aims to identify clusters of +In this section, we will start to explore unsupervised learning, where we work +with data that isn't accompanied by labels. One of the primary techniques +within this realm is clustering, which aims to uncover patterns or structures +in the data by grouping similar data points together. A popular method for +achieving this is k-means clustering, which aims to identify clusters of similar observations. ## K-means -K-means was briefly introduced in the [Introduction](../index.md#example_1) to -Supervised vs. Unsupervised Learning and used to segment customers based on +K-means was briefly introduced in the [Introduction](../index.md#example_1) to +Supervised vs. Unsupervised Learning and used to segment customers based on their annual spending and average basket size.
@@ -22,37 +22,36 @@ their annual spending and average basket size.
The algorithm groups similar data points together based on their attributes -without being told what these groups should be. +without being told what these groups should be. To get a better understanding of k-means, we will explore the theory behind it -and employ the algorithm to cluster data from Spotify and a semiconductor +and employ the algorithm to cluster data from Spotify and a semiconductor manufacturer. ### Theory ???+ info - The theoretical part is adapted from: - ^^Christopher M. Bishop. 2006. *Pattern Recognition and Machine - Learning*[^1]^^ + The theoretical part is adapted from: ^^Christopher M. Bishop. 2006. *Pattern + Recognition and Machine Learning*[^1]^^ - [^1]: - Christopher M. Bishop. Pattern Recognition and Machine Learning. - Springer, 2006. [Link](https://www.microsoft.com/en-us/research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf) + [^1]: Christopher M. Bishop. Pattern Recognition and Machine Learning. + Springer, 2006. + [Link](https://www.microsoft.com/en-us/research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf) Assume a set of features \(x_1, x_2, ..., x_n\). K-means partitions the data -into \(K\) number of clusters. Each cluster is represented by \(\mu_k\), -which can be seen as the center of a cluster \(k\). +into \(K\) number of clusters. Each cluster is represented by \(\mu_k\), which +can be seen as the center of a cluster \(k\). -Intuitively speaking, the goal is to assign each data point \(x_n\) to the -cluster with the closest center \(\mu_k\). +Intuitively speaking, the goal is to assign each data point \(x_n\) to the +cluster with the closest center \(\mu_k\). #### The objective Since, the optimal assignment of data points to specific clusters is not known, -the objective is to minimize the sum of squared distances between data -points and their assigned cluster centers. 
-This is known as the **distortion measure**: +the objective is to minimize the sum of squared distances between data points +and their assigned cluster centers. This is known as the **distortion +measure**: ???+ defi "Distortion measure" @@ -61,33 +60,33 @@ This is known as the **distortion measure**: \] where: - + - \(N\) is the number of data points, - \(K\) being the number of clusters, - - \(r_{nk}\) is a binary indicator of whether data point \(x_n\) is - assigned to cluster \(k\), + - \(r_{nk}\) is a binary indicator of whether data point \(x_n\) is assigned to + cluster \(k\), - \(\mu_k\) representing the cluster center. -In short, we want to find the optimal \(r_{nk}\) and \(\mu_k\) that minimize +In short, we want to find the optimal \(r_{nk}\) and \(\mu_k\) that minimize the distortion measure \(J\). -\(J\) is minimized in an iterative process. First, we initialize \(\mu_k\) -with some random values. Then we alternate between two steps: +\(J\) is minimized in an iterative process. First, we initialize \(\mu_k\) with +some random values. Then we alternate between two steps: -1. **Assignment step**: Keep \(\mu_k\) fixed. Minimize \(J\) with respect - to \(r_{nk}\). This is done by assigning each data point to the closest +1. **Assignment step**: Keep \(\mu_k\) fixed. Minimize \(J\) with respect to + \(r_{nk}\). This is done by assigning each data point to the closest cluster center. -2. **Update step**: Keep \(r_{nk}\) fixed. Minimize \(J\) with respect to - \(\mu_k\). This is done by updating the cluster centers to the mean of - the data points assigned to the cluster. +1. **Update step**: Keep \(r_{nk}\) fixed. Minimize \(J\) with respect to + \(\mu_k\). This is done by updating the cluster centers to the mean of the + data points assigned to the cluster. Step 1 can be seen as re-assigning the data points to clusters, while step 2 re-computes the cluster centers. 
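The two alternating steps can be sketched in a few lines of NumPy. This is only a minimal illustration, not the `scikit-learn` implementation: the names `assign_to_closest` and `kmeans_sketch`, the synthetic two-blob data, and the hand-picked starting centers are assumptions made for this example. The starting centers are passed in explicitly (instead of being drawn at random) so that the run is reproducible.

```python
import numpy as np


def assign_to_closest(X, centers):
    """Assignment step: index of the closest center for every data point."""
    # Squared Euclidean distance between each point and each center.
    squared_distances = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return squared_distances.argmin(axis=1)


def kmeans_sketch(X, init_centers, max_iter=100):
    """Toy k-means that alternates the assignment and update steps."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(init_centers, dtype=float).copy()

    for _ in range(max_iter):
        # Assignment step: keep the centers fixed, choose the assignments.
        labels = assign_to_closest(X, centers)

        # Update step: keep assignments fixed, move centers to cluster means.
        new_centers = centers.copy()
        for k in range(len(centers)):
            members = X[labels == k]
            if len(members) > 0:  # an empty cluster keeps its old center
                new_centers[k] = members.mean(axis=0)

        if np.allclose(new_centers, centers):
            break  # centers no longer move
        centers = new_centers

    labels = assign_to_closest(X, centers)
    # Distortion measure J: squared distances to the assigned centers.
    distortion = ((X - centers[labels]) ** 2).sum()
    return centers, labels, distortion


# Hypothetical toy data: two well-separated blobs in 2D.
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(50, 2)),
    rng.normal(loc=10.0, scale=0.3, size=(50, 2)),
])

# One starting center per blob, chosen by hand for this example.
centers, labels, distortion = kmeans_sketch(X, init_centers=X[[0, -1]])
print(centers.round(1))
print(f"Distortion J: {distortion:.2f}")
```

Passing different `init_centers` is an easy way to see how the starting point influences the final clusters and the value of \(J\).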
???+ info - Since \(\mu_k\) is the mean of the data points assigned to cluster \(k\), - we speak of the k-means algorithm. + Since \(\mu_k\) is the mean of the data points assigned to cluster \(k\), we + speak of the k-means algorithm. The optimization of \(J\) is guaranteed to converge, but it might not find the global minimum. The final solution depends on the initial cluster centers. @@ -95,11 +94,11 @@ global minimum. The final solution depends on the initial cluster centers. ???+ question "Get a better understanding" To improve your understanding of the k-means algorithm, either watch the - following video or visit the interactive visualization. - Both variants illustrate the iterative process of k-means. + following video or visit the interactive visualization. Both variants + illustrate the iterative process of k-means. === "Option 1: :fontawesome-brands-youtube: Video" - +
-The goal of this exercise is to recommend a song based on a previous -track. The idea is to pick a song as recommendation that is in the same -cluster as the previous one. To do so, we can use the `cluster_indices` to -recommend similar songs. +The goal of this exercise is to recommend a song based on a previous track. The +idea is to pick a song as recommendation that is in the same cluster as the +previous one. To do so, we can use the `cluster_indices` to recommend similar +songs. -Since the `cluster_indices` are in the same order as our initial `data`, we -can simply assign them as a new column. +Since the `cluster_indices` are in the same order as our initial `data`, we can +simply assign them as a new column. ```python data["cluster"] = cluster_indices @@ -396,12 +399,12 @@ print(data.head()) 4 6dOtVTDdiauQNBQEDOtlAB BIRDS OF A FEATHER Billie Eilish ... 0.438 104.978 4 ``` -Now, that we assigned a cluster to all `#!python 11320` tracks, we can easily -recommend a song based on a given `spotify_id` (the unique identifier of a -song on the platform). +Now, that we assigned a cluster to all `#!python 11320` tracks, we can easily +recommend a song based on a given `spotify_id` (the unique identifier of a song +on the platform). -Use the below functions to see your recommender system in action. Don't -worry about the details of these functions. +Use the below functions to see your recommender system in action. Don't worry +about the details of these functions. ```python def print_track_info(track): @@ -463,57 +466,52 @@ Cluster index: 4 recommendation. Try it out! 1. Pick another `spotify_id` and recommend a song. - 2. Repeat the process a couple of times. - + 1. Repeat the process a couple of times. #### Are the recommendations good? As you've tried the recommender system a couple of times, you might have -wondered if the recommendations are actually good?! -:thinking_face: +wondered if the recommendations are actually good?! 
:thinking_face: -Simply put, you have to be the judge if we were actually able to cluster +Simply put, you have to be the judge if we were actually able to cluster similar songs together and build a good recommendation system. -In this application, it's quite intuitive: If you as a user like the -recommendations and keep listening to the recommended songs, the system is +In this application, it's quite intuitive: If you as a user like the +recommendations and keep listening to the recommended songs, the system is successful. - ???+ info - - When talking about supervised tasks, we were able to measure the - performance of our models. However, in unsupervised learning, like - clustering, we do not have labels to compare our results to. Thus, - evaluating the performance of unsupervised learning methods is challenging. - - In practice, you have to rely on domain knowledge to interpret the - results and assess the quality of the model. ---- + When talking about supervised tasks, we were able to measure the performance of + our models. However, in unsupervised learning, like clustering, we do not have + labels to compare our results to. Thus, evaluating the performance of + unsupervised learning methods is challenging. + + In practice, you have to rely on domain knowledge to interpret the results and + assess the quality of the model. + +______________________________________________________________________ ### Semiconductor data -K-means is not only useful for recommendation systems, but also for -anomaly detection. The idea is to form clusters which in turn can be used to -detect the outliers/anomalies. +K-means is not only useful for recommendation systems, but also for anomaly +detection. The idea is to form clusters which in turn can be used to detect the +outliers/anomalies. ???+ info The data is adapted from the UCI Machine Learning Repository.[^2] - - [^2]: - McCann, M. & Johnston, A. (2008). SECOM [Dataset]. - UCI Machine Learning Repository. 
- [https://doi.org/10.24432/C54305](https://doi.org/10.24432/C54305) + + [^2]: McCann, M. & Johnston, A. (2008). SECOM [Dataset]. UCI Machine Learning + Repository. [https://doi.org/10.24432/C54305](https://doi.org/10.24432/C54305) In this example, you will apply k-means to semiconductor data. ???+ question "Download and read data" 1. Download the below data set. - 2. Read it with `pandas`. - 3. Have a look at the data. + 1. Read it with `pandas`. + 1. Have a look at the data.
[Download semiconductor data :fontawesome-solid-download:](../../../assets/data-science/algorithms/clustering/semiconductor.csv){ .md-button } @@ -522,47 +520,48 @@ In this example, you will apply k-means to semiconductor data. Each row in the data set > represents a single production entity with associated measured features [...] -> -> -- UCI Machine Learning Repository +> +> UCI Machine Learning Repository ???+ question "Apply k-means" Solve the following tasks to apply k-means to the semiconductor data: 1. Are there any missing values in the data? - 2. Deal with potential missing values; choose any suitable strategy. We - recommend to utilize the [`SimpleImputer`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) with your chosen strategy. The application - of the `SimpleImputer` should be straightforward as it implements the - methods you already know, e.g., `fit_transform()`. - 3. Do you need to scale the features? If so, apply a `StandardScaler`. - 4. Use the elbow method to determine the number of clusters. - 5. Fit the k-means algorithm with the optimal number of clusters. - - Hint: You can reuse the functions and code snippets from the Spotify - example. + 1. Deal with potential missing values; choose any suitable strategy. We + recommend to utilize the + [`SimpleImputer`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) + with your chosen strategy. The application of the `SimpleImputer` should + be straightforward as it implements the methods you already know, e.g., + `fit_transform()`. + 1. Do you need to scale the features? If so, apply a `StandardScaler`. + 1. Use the elbow method to determine the number of clusters. + 1. Fit the k-means algorithm with the optimal number of clusters. + + Hint: You can reuse the functions and code snippets from the Spotify example. ??? info - If you have solved the above tasks, you might wonder how to interpret - your clustering results. 
Moreover, how can you detect potential anomalies? + If you have solved the above tasks, you might wonder how to interpret your + clustering results. Moreover, how can you detect potential anomalies? - Again, it all depends on domain knowledge. If you're a expert in the - semiconductor industry you might be able to tell if the clusters - make sense and if there are any anomalies in the data. Otherwise, - interpretation can be quite challenging. + Again, it all depends on domain knowledge. If you're a expert in the + semiconductor industry you might be able to tell if the clusters make sense and + if there are any anomalies in the data. Otherwise, interpretation can be quite + challenging. ## Recap -In this chapter, we introduced k-means clustering. We covered the theory +In this chapter, we introduced k-means clustering. We covered the theory followed by two practical examples: building a recommendation system for Spotify tracks and clustering semiconductor data. We employed the elbow method to determine the optimal number of clusters and discussed the challenges of evaluating clustering results. -In the upcoming chapter, we introduce another unsupervised method, -namely Principal Component Analysis (PCA) to reduce the dimensionality of data. -PCA can be useful in various ways: +In the upcoming chapter, we introduce another unsupervised method, namely +Principal Component Analysis (PCA) to reduce the dimensionality of data. 
PCA +can be useful in various ways: - reducing the computational complexity of algorithms - visualizing high-dimensional data in a 2D or 3D space diff --git a/docs/data-science/algorithms/unsupervised/dim-reduction.md b/docs/data-science/algorithms/unsupervised/dim-reduction.md index 8a6e4b07..fc581eb9 100644 --- a/docs/data-science/algorithms/unsupervised/dim-reduction.md +++ b/docs/data-science/algorithms/unsupervised/dim-reduction.md @@ -2,73 +2,73 @@ ## Principal Component Analysis (PCA) -In data science and machine learning, we often encounter data sets with -hundreds or even thousands of features. We speak of high-dimensional data -sets. While these features may contain valuable information, working with -such high-dimensional data can be computationally expensive, prone to -overfitting, and difficult to visualize. This is where another -unsupervised method, dimensionality reduction comes in — a technique used to -simplify data sets, while retaining much of the critical information. - -One of the most widely used methods for dimensionality reduction is -Principal Component Analysis (PCA). PCA transforms a high-dimensional (= -lots of features) data set into a smaller set of features (components). In -practice, PCA can reduce hundreds of features down to just 2 or 3 -features, making PCA an ideal tool for visualization, preprocessing, and -feature extraction. - -In this section, we will explain the inner workings of PCA and apply it to -the semiconductor data set. +In data science and machine learning, we often encounter data sets with +hundreds or even thousands of features. We speak of high-dimensional data sets. +While these features may contain valuable information, working with such +high-dimensional data can be computationally expensive, prone to overfitting, +and difficult to visualize. 
This is where another unsupervised method, +dimensionality reduction comes in — a technique used to simplify data sets, +while retaining much of the critical information. + +One of the most widely used methods for dimensionality reduction is Principal +Component Analysis (PCA). PCA transforms a high-dimensional (= lots of +features) data set into a smaller set of features (components). In practice, +PCA can reduce hundreds of features down to just 2 or 3 features, making PCA an +ideal tool for visualization, preprocessing, and feature extraction. + +In this section, we will explain the inner workings of PCA and apply it to the +semiconductor data set. ### What is PCA? -PCA is a **linear transformation technique** that identifies the directions -(also called **principal components**) in which the data varies the most. -These principal components capture as much variance as possible. PCA has a -variety of applications, such as: +PCA is a **linear transformation technique** that identifies the directions +(also called **principal components**) in which the data varies the most. These +principal components capture as much variance as possible. PCA has a variety of +applications, such as: - **Data visualization**: Plot a dimensionality reduced data set in 2D. - **Preprocessing**: Removing noise or redundant features while retaining the - essential patterns in data. + essential patterns in data. - **Feature engineering**: Summarizing high-dimensional data into a smaller set - of meaningful features. + of meaningful features. ### How does it work? PCA follows these essential steps: 1. **Compute the covariance matrix**: PCA captures relationships between - features by calculating the covariance between them. + features by calculating the covariance between them. ???+ info - - Think of the covariance matrix as the "spread" of the data. PCA looks - at the interaction :fontawesome-solid-arrow-right: the correlation of - features with each other. 
Visit the + + Think of the covariance matrix as the "spread" of the data. PCA looks at the + interaction :fontawesome-solid-arrow-right: the correlation of features with + each other. Visit the [correlation chapter](../../../statistics/bivariate/Correlation.md#covariance) in the statistics course to learn more about covariance. -2. **Eigen decomposition**: Identify the eigenvalues and eigenvectors of the - covariance matrix. The eigenvectors represent the directions of the - principal components, while the eigenvalues represent the amount of variance - captured by each component. +1. **Eigen decomposition**: Identify the eigenvalues and eigenvectors of the + covariance matrix. The eigenvectors represent the directions of the + principal components, while the eigenvalues represent the amount of + variance captured by each component. ???+ info - - If you want to know more about eigenvalues and eigenvectors, check out - this [site](https://www.mathsisfun.com/algebra/eigenvalue.html). -3. **Rank components**: Components are ranked by their eigenvalues. The first - principal component captures the most variance, the second captures the - next-most, and so on. -4. **Transform the data**: Project the original data onto the top principal - components to reduce its dimensionality. + If you want to know more about eigenvalues and eigenvectors, check out this + [site](https://www.mathsisfun.com/algebra/eigenvalue.html). + +1. **Rank components**: Components are ranked by their eigenvalues. The first + principal component captures the most variance, the second captures the + next-most, and so on. + +1. **Transform the data**: Project the original data onto the top principal + components to reduce its dimensionality. ### The mathematical objective -Let’s assume we have a data set \(X\) with \(p\) features (dimensions). We -aim to transform \(X\) into a new matrix \(Z\) with \(k\) features such -that \(k < p\), while retaining as much variance as possible. 
+Let’s assume we have a data set \(X\) with \(p\) features (dimensions). We aim +to transform \(X\) into a new matrix \(Z\) with \(k\) features such that +\(k < p\), while retaining as much variance as possible. The transformation (described previously under point 4) is defined as: @@ -79,24 +79,23 @@ The transformation (described previously under point 4) is defined as: \] Where: - + - \(Z\) is the transformed data set in the lower-dimensional space, - \(W\) is a matrix whose columns are the top \(k\) eigenvectors of the covariance matrix of \(X\). ???+ tip - Dimensionality reduction helps in combating the *curse of dimensionality*, - a phenomenon where the performance of algorithms deteriorates with an - increase in the number of features. Algorithms like clustering - often struggle to find meaningful patterns when working with a - high-dimensional data set. + Dimensionality reduction helps in combating the *curse of dimensionality*, a + phenomenon where the performance of algorithms deteriorates with an increase in + the number of features. Algorithms like clustering often struggle to find + meaningful patterns when working with a high-dimensional data set. ## Example -It’s time to apply PCA to real-world data. We'll revisit the semiconductor -data set that we used in the previous clustering chapter. The first goal -is to use PCA to reduce the data set's dimensions and visualize them. +It’s time to apply PCA to real-world data. We'll revisit the semiconductor data +set that we used in the previous clustering chapter. The first goal is to use +PCA to reduce the data set's dimensions and visualize them. ### Prepare the data @@ -132,8 +131,8 @@ scaled_data = scaler.fit_transform(data) ### Apply PCA -We now apply PCA to reduce the dimensions. First, we fit the PCA model on -the `scaled_data`: +We now apply PCA to reduce the dimensions. 
First, we fit the PCA model on the +`scaled_data`: ```python from sklearn.decomposition import PCA @@ -142,11 +141,11 @@ pca = PCA(n_components=2, random_state=42) # (1)! components = pca.fit_transform(scaled_data) ``` -1. Although the above definition of PCA is deterministic, the actual - implementation can be stochastic (depending on the solver used). Since - `svd_solver` is set to `#!python "auto"` by default, the results can - vary slightly. Long story short, setting `random_state` ensures - reproducibility in all cases. +1. Although the above definition of PCA is deterministic, the actual + implementation can be stochastic (depending on the solver used). Since + `svd_solver` is set to `#!python "auto"` by default, the results can vary + slightly. Long story short, setting `random_state` ensures reproducibility + in all cases. `n_components=2` specifies that we want to reduce the data set to 2 dimensions. @@ -167,7 +166,7 @@ plt.show() ``` 1. The `alpha` parameter controls the transparency of the points. A value of - `#!python 0.5` makes the points semi-transparent. + `#!python 0.5` makes the points semi-transparent.
 ![PCA visualized](../../../assets/data-science/algorithms/dim-reduction/pca.svg)
@@ -178,21 +177,20 @@ plt.show()
-To quickly recap so far:
-We were able to reduce the semiconductor data set from `#!python 590`
-features to just `#!python 2`.
+To quickly recap so far: We were able to reduce the semiconductor data set from
+`#!python 590` features to just `#!python 2`.

 #### Plot interpretation

-The scatter plot shows the data set in a 2D space with each observation as
-a point. Additionally, we can observe clusters. Since, principal
-components are ranked by the amount of variance they capture, the first
-component (PC1) is "more important" than the second component (PC2).
+The scatter plot shows the data set in a 2D space with each observation as a
+point. Additionally, we can observe clusters. Since, principal components are
+ranked by the amount of variance they capture, the first component (PC1) is
+"more important" than the second component (PC2).

 Therefore, differences along the x-axis (PC1) are more significant than
-differences along the y-axis (PC2). As we are interested in potential
-anomalies in semiconductor products, we can detect some observations that might
-be well worth some further investigation:
+differences along the y-axis (PC2). As we are interested in potential anomalies
+in semiconductor products, we can detect some observations that might be well
+worth some further investigation:
 ![Potential anomalies](../../../assets/data-science/algorithms/dim-reduction/potential-anomalies.png)
@@ -201,31 +199,31 @@ be well worth some further investigation:
-A majority of the data points are clustered in the upper left corner.
-Contrary, these single observations with a high difference on the x-axis
-(PC1) might be anomalies (annotated by these arrows). Although, samples
-within the encircled area have their differences on the y-axis (PC2),
-they are still worth investigating.
+A majority of the data points are clustered in the upper left corner. Contrary,
+these single observations with a high difference on the x-axis (PC1) might be
+anomalies (annotated by these arrows). Although, samples within the encircled
+area have their differences on the y-axis (PC2), they are still worth
+investigating.

 ???+ question "Re-apply PCA on unscaled data"

     What would happen if you apply PCA to the unscaled data?
-
+
     1. Create a new PCA instance with `n_components=2`.
-    2. Fit the PCA model on the `data` (unscaled) and transform it.
-    3. Visualize the new components in a 2D scatter plot.
-    4. Compare the results with the previous PCA visualization.
+    1. Fit the PCA model on the `data` (unscaled) and transform it.
+    1. Visualize the new components in a 2D scatter plot.
+    1. Compare the results with the previous PCA visualization.

 ???+ tip

     PCA is sensitive to the scale of the data. Thus, the scaled data nicely
-    separates the clusters, while the unscaled data does not. So be sure to
-    pick the right preprocessing steps for your data.
+    separates the clusters, while the unscaled data does not. So be sure to pick
+    the right preprocessing steps for your data.

 ### Explained variance

 When evaluating a PCA model, it is crucial to understand how much variance is
-captured by each principal component. Simply access the
+captured by each principal component. Simply access the
 `explained_variance_ratio_` attribute:

 ```python
@@ -244,38 +242,37 @@ capture roughly `10%` of the variance.

 ???+ tip

-    Put simply, our two principal components capture `10%` of the variance
-    of the original `#!python 590` features which is not that great.
+ Put simply, our two principal components capture `10%` of the variance of the + original `#!python 590` features which is not that great. :slightly_frowning_face: Unfortunately, when dealing with real world data, results may not be as -promising as expected. In this case, we might need to consider more -components to capture a higher percentage of the variance. +promising as expected. In this case, we might need to consider more components +to capture a higher percentage of the variance. ???+ info "Choosing the number of components" - + It is essential to choose the right number of components. For example, you - could use the components as features for another machine learning model, - hence you want to retain as much information as possible. - - However, the choice of how many components to keep is subjective. - A common approach is to retain enough components to explain 90-95% of - the variance. + could use the components as features for another machine learning model, hence + you want to retain as much information as possible. + + However, the choice of how many components to keep is subjective. A common + approach is to retain enough components to explain 90-95% of the variance. -???+ question "Number of components to exceed 95% variance" +???+ question "Number of components to exceed 95% variance" Using the *scaled* semiconductor dataset: - + 1. Create a PCA model to analyze the variance in the data - 2. Determine the minimum number of principal components needed to explain - at least 95% of the total variance - + 1. 
Determine the minimum number of principal components needed to explain at + least 95% of the total variance + Solution approaches: - You can use the `explained_variance_ratio_` attribute, OR - - There is an alternative approach that requires only 3 lines of code - maximum (hint: google and check the PCA documentation) - + - There is an alternative approach that requires only 3 lines of code maximum + (hint: google and check the PCA documentation) + Use the following quiz question to evaluate your answer. @@ -303,17 +300,17 @@ solution. def elbow_method(X, max_clusters=15): inertia = [] K = range(1, max_clusters + 1) - + for k in K: model = KMeans(n_clusters=k, random_state=42) model.fit(X) inertia.append(model.inertia_) - + # for convenience store in a DataFrame distortions = pd.DataFrame( {"k (number of cluster)": K, "inertia (J)": inertia} ) - + return distortions ``` @@ -371,12 +368,12 @@ components.plot( plt.show() ``` -To summarize, we applied the same preprocessing steps, reduced the data to -2 dimensions using PCA. Afterward, we called the elbow method on the 2 -components to determine the optimal number of clusters. Then we applied -k-means with `#!python n_clusters=5`. Finally, we plot the 2 components and -color the observations according to their corresponding clusters. Have a look -at the resulting plots. +To summarize, we applied the same preprocessing steps, reduced the data to 2 +dimensions using PCA. Afterward, we called the elbow method on the 2 components +to determine the optimal number of clusters. Then we applied k-means with +`#!python n_clusters=5`. Finally, we plot the 2 components and color the +observations according to their corresponding clusters. Have a look at the +resulting plots. === "Clustered components" @@ -387,12 +384,12 @@ at the resulting plots.
-    The plot shows the semiconductor data set clustered into 5 groups.
-    Each color represents a different cluster. The clusters are well
-    separated in the 2D space.
+    The plot shows the semiconductor data set clustered into 5 groups. Each color
+    represents a different cluster. The clusters are well separated in the 2D
+    space.

 === "Elbow method"
-
+
![Elbow method on 2 principal components](../../../assets/data-science/algorithms/dim-reduction/elbow-pca-kmeans.svg)
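
The four PCA steps described earlier in this chapter (covariance matrix, eigen decomposition, ranking, projection) can be sketched by hand with NumPy. This is only an illustrative sketch, not how `sklearn.decomposition.PCA` is implemented internally. The function name `pca_by_hand` and the toy data are hypothetical, and the sketch assumes the transformation has the usual form \(Z = XW\) applied to mean-centered data.

```python
import numpy as np


def pca_by_hand(X, k):
    """Project X (n_samples x p) onto its top-k principal components."""
    # Center each feature so the covariance describes spread around the mean
    X_centered = X - X.mean(axis=0)

    # Step 1: covariance matrix of the features (p x p)
    cov = np.cov(X_centered, rowvar=False)

    # Step 2: eigen decomposition (eigh is intended for symmetric matrices)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)

    # Step 3: rank components by eigenvalue, largest first
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]

    # Step 4: W holds the top-k eigenvectors as columns; project with Z = X W
    W = eigenvectors[:, :k]
    Z = X_centered @ W

    # Share of the total variance captured by each kept component
    explained_variance_ratio = eigenvalues[:k] / eigenvalues.sum()
    return Z, W, explained_variance_ratio


# Hypothetical toy data: 200 samples, 5 features, two of them strongly correlated
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
X[:, 1] = 2 * X[:, 0] + rng.normal(scale=0.1, size=200)

Z, W, ratio = pca_by_hand(X, k=2)
print(Z.shape)  # (200, 2)
print(ratio)    # variance share of PC1 and PC2, PC1 first
```

The individual components may come out with flipped signs compared to scikit-learn, since an eigenvector is only defined up to its sign. The explained variance ratios are unaffected by that.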
@@ -400,21 +397,19 @@ at the resulting plots.
-    The plot shows the distortion (inertia) for different numbers of
-    clusters. This time around, we can distinctly see an elbow at `k=5`
-    clusters. :flexed_biceps:
+    The plot shows the distortion (inertia) for different numbers of clusters. This
+    time around, we can distinctly see an elbow at `k=5` clusters. :flexed_biceps:

----
+______________________________________________________________________

 ## Recap

-In this chapter, we concluded the Supervised vs. Unsupervised Learning
-portion of this course and introduced **Principal Component Analysis
-(PCA)**, a linear technique for dimensionality reduction.
+In this chapter, we concluded the Supervised vs. Unsupervised Learning portion
+of this course and introduced **Principal Component Analysis (PCA)**, a linear
+technique for dimensionality reduction.

-We discussed the inner workings of PCA and applied it to the semiconductor
-data set, where we could identify potential anomalies in the data. We also
+We discussed the inner workings of PCA and applied it to the semiconductor data
+set, where we could identify potential anomalies in the data. We also
 visualized the data set in a 2D space, making it easier to interpret and
-analyze.
-Lastly, a combination of PCA and k-means revealed distinct clusters in the
-semiconductor data set.
+analyze. Lastly, a combination of PCA and k-means revealed distinct clusters in
+the semiconductor data set.
diff --git a/docs/data-science/basics/intro.md b/docs/data-science/basics/intro.md
index 4d0e6178..4404f062 100644
--- a/docs/data-science/basics/intro.md
+++ b/docs/data-science/basics/intro.md
@@ -6,21 +6,19 @@ The terms data science and machine learning are often used interchangeably.
 Let's explore them to get a better understanding of this course's content.

 === ":bar_chart: Data Science"
-
-    **Data Science** is an interdisciplinary field that combines statistics,
-    programming and domain knowledge to extract insights from data.
As a data - scientist, you could work in vastly different domains, from healthcare and - finance to manufacturing and entertainment. The core skills remain the - same, but the questions you answer and the data you work with vary greatly. + **Data Science** is an interdisciplinary field that combines statistics, + programming and domain knowledge to extract insights from data. As a data + scientist, you could work in vastly different domains, from healthcare and + finance to manufacturing and entertainment. The core skills remain the same, + but the questions you answer and the data you work with vary greatly. === ":robot: Machine Learning" - **Machine Learning (ML)** is a subset of Data Science that focuses on - building algorithms that learn patterns from data to make predictions or - decisions. + **Machine Learning (ML)** is a subset of Data Science that focuses on building + algorithms that learn patterns from data to make predictions or decisions. ---- +______________________________________________________________________
 The primary focus of this course is the data science workflow, from
@@ -29,13 +27,13 @@ Let's explore them to get a better understanding of this course's content.
---- +______________________________________________________________________ ## What to Expect Before diving into examples and workflows, let's set realistic expectations. -Data science is fundamentally about **understanding and insight**, not +Data science is fundamentally about **understanding and insight**, not perfection. You won't find models that are 100% accurate and that's okay - it's not the goal. Instead, data science helps us: @@ -48,27 +46,27 @@ not the goal. Instead, data science helps us: Chances are you've already used services built by data scientists today: -- :material-currency-usd: **Dynamic Pricing**: Airlines and concert platforms +- :material-currency-usd: **Dynamic Pricing**: Airlines and concert platforms adjust prices based on demand, time and user behavior -- :material-movie: **Recommendation Systems**: Netflix suggests movies based - on your viewing history; Instagram curates your feed -- :material-email: **Spam Detection**: Your email provider filters unwanted +- :material-movie: **Recommendation Systems**: Netflix suggests movies based on + your viewing history; Instagram curates your feed +- :material-email: **Spam Detection**: Your email provider filters unwanted messages automatically In this course, we'll build models for tasks like: -- :material-home: **Price Prediction**: Estimating house prices based on +- :material-home: **Price Prediction**: Estimating house prices based on features like size and location - :material-hospital: **Medical Diagnosis**: Classifying tumors as malignant or benign -- :material-alert: **Anomaly Detection**: Identifying faulty products in +- :material-alert: **Anomaly Detection**: Identifying faulty products in manufacturing data ## Building blocks -A typical data science project includes several stages, from collecting raw -data to deploying models in production. 
This course focuses on the -**core workflow**: +A typical data science project includes several stages, from collecting raw +data to deploying models in production. This course focuses on the **core +workflow**:
@@ -84,23 +82,22 @@ data to deploying models in production. This course focuses on the
 | Stage                  | What You'll Learn                              |
-|------------------------|------------------------------------------------|
+| ---------------------- | ---------------------------------------------- |
 | **Data Preparation**   | Inspect, clean and structure datasets          |
 | **Data Preprocessing** | Transform features (encoding, scaling, etc., ) |
 | **Modeling**           | Train different machine learning algorithms    |
 | **Evaluation**         | Measure performance and interpret results      |

-
 ???+ tip "Iterative Process"

-    Data science is rarely linear. You’ll repeatedly cycle through collecting
-    data, preparing it, training models and evaluating results. Each evaluation
-    highlights new issues (e.g., missing data or unrealistic assumptions) that
-    send you back to earlier stages to improve your approach.
+    Data science is rarely linear. You’ll repeatedly cycle through collecting data,
+    preparing it, training models and evaluating results. Each evaluation
+    highlights new issues (e.g., missing data or unrealistic assumptions) that send
+    you back to earlier stages to improve your approach.

----
+______________________________________________________________________

-Throughout the course, we'll use hands-on Python examples. By the end, you'll
+Throughout the course, we'll use hands-on Python examples. By the end, you'll
 apply these skills to a complete project from start to finish.

 Let's start by setting up your computer for the data science journey.
diff --git a/docs/data-science/basics/setup.md b/docs/data-science/basics/setup.md
index cf5e0493..db481c14 100644
--- a/docs/data-science/basics/setup.md
+++ b/docs/data-science/basics/setup.md
@@ -1,20 +1,20 @@
 # Setup

-To get started, we setup the programming environment. Follow these couple
-of steps to get ready, no prerequisites needed.
+To get started, we setup the programming environment. Follow these couple of
+steps to get ready, no prerequisites needed.

 ## Visual Studio Code

-First, install a code editor.
We urge you to instal Visual Studio Code -(VS Code) a free and open-source editor developed by Microsoft +First, install a code editor. We urge you to instal Visual Studio Code (VS +Code) a free and open-source editor developed by Microsoft :fontawesome-brands-windows:. -If you don't have Visual Studio Code already installed, download it from their +If you don't have Visual Studio Code already installed, download it from their website: . ### Profile -To quickstart your VS Code setup, download our profile that includes essential +To quickstart your VS Code setup, download our profile that includes essential plugins and convenient settings tailored for data science work.
@@ -29,20 +29,20 @@ The profile comes with the following essential extensions: - **Python Debugger** - Debug your Python code - **Jupyter** - Work with Jupyter Notebooks directly in VS Code -Additionally, stylistic plugins are included for a more pleasant coding -experience and auto-save is enabled by default so you never lose your work. +Additionally, stylistic plugins are included for a more pleasant coding +experience and auto-save is enabled by default so you never lose your work. :rocket: ## `uv` From the Python course you should already be familiar with the package manager -`pip`. That background will help you quickly understand `uv`, a modern tool -that not only replaces `pip` for package management but also handles Python +`pip`. That background will help you quickly understand `uv`, a modern tool +that not only replaces `pip` for package management but also handles Python installations. -**Why the switch?** While `pip` remains widely used and important to -understand, this course aims to prepare you for modern real-world projects. -`uv` has become a popular, state-of-the-art tool in modern Python development +**Why the switch?** While `pip` remains widely used and important to +understand, this course aims to prepare you for modern real-world projects. +`uv` has become a popular, state-of-the-art tool in modern Python development and learning it now will give you a competitive advantage. ???+ tip "No prior Python install necessary" @@ -53,28 +53,30 @@ and learning it now will give you a competitive advantage. === ":fontawesome-brands-windows: Windows" - Open Windows Powershell. Visit the `uv` documentation under under - "Standalone installer" [link](https://docs.astral.sh/uv/getting-started/installation/#__tabbed_1_2). + Open Windows Powershell. Visit the `uv` documentation under under "Standalone + installer" + [link](https://docs.astral.sh/uv/getting-started/installation/#__tabbed_1_2). Make sure the Windows tab is selected. 
- + Return to PowerShell and paste the installer command shown in the docs. ![uv standalone installation](../../assets/data-science/basics/setup/uv-win-install.png) === ":fontawesome-brands-apple: MacOS / :fontawesome-brands-linux: Linux" - On macOS or Linux, open Terminal. Visit the `uv` documentation under - "Standalone installer", [link](https://docs.astral.sh/uv/getting-started/installation/). - Make sure the macOS or Linux tab is selected. - + On macOS or Linux, open Terminal. Visit the `uv` documentation under + "Standalone installer", + [link](https://docs.astral.sh/uv/getting-started/installation/). Make sure the + macOS or Linux tab is selected. + Return to your terminal and paste the installer command. Press ++enter++ to execute the command ---- +______________________________________________________________________ -Regardless of your operating system, upon completion you should see -something like: +Regardless of your operating system, upon completion you should see something +like: ``` Downloading uv @@ -84,14 +86,14 @@ Downloading uv everything's installed! ``` -You can now close the Terminal (:fontawesome-brands-apple: macOS / -:fontawesome-brands-linux: Linux) or PowerShell (:fontawesome-brands-windows: +You can now close the Terminal (:fontawesome-brands-apple: macOS / +:fontawesome-brands-linux: Linux) or PowerShell (:fontawesome-brands-windows: Windows). ???+ info - The following steps are OS-agnostic; they are the same for Windows, macOS - and Linux. + The following steps are OS-agnostic; they are the same for Windows, macOS and + Linux. ### 1. Create a project @@ -99,19 +101,18 @@ Now, we will cover a typical workflow to set up and initialize a new project. ???+ info - A project is a folder that contains all scripts, configuration and data - files that belong together. Everything for the project lives in that - folder. + A project is a folder that contains all scripts, configuration and data files + that belong together. 
Everything for the project lives in that folder. -Create a new folder named `data-science` in an easy-to-find location you’ll -use throughout this course. +Create a new folder named `data-science` in an easy-to-find location you’ll use +throughout this course. -Open VS Code. Go to File → Open Folder…, select the `data-science` folder. -VS Code will open a new window. +Open VS Code. Go to File → Open Folder…, select the `data-science` folder. VS +Code will open a new window. ???+ tip - For more on navigating VS Code, see the Python course chapter: + For more on navigating VS Code, see the Python course chapter: [link](../../python-extensive/ide.md) ### 2. Initialize the project @@ -123,11 +124,11 @@ uv init --vcs none # (1)! ``` 1. With the `--vcs` flag a **v**ersion **c**ontrol **s**ystem can be specified. - By default `--vcs git` is set, which initializes a git repository. Since + By default `--vcs git` is set, which initializes a git repository. Since git is not within the scope of this project, we set `--vcs` to none. -This initializes the project. `uv` creates a few files in your folder. -Your workspace should look like this: +This initializes the project. `uv` creates a few files in your folder. Your +workspace should look like this:
` for example with `pandas`: uv add pandas ``` -After a successful installation, take some time to open the `pyproject.toml` +After a successful installation, take some time to open the `pyproject.toml` file. Under dependencies you should find the `pandas` package. ```toml title="pyproject.toml" hl_lines="7-9" linenums="1" @@ -259,17 +261,17 @@ dependencies = [ ] ``` -The content of `uv.lock` was changed as well, the file contains more info on -the installed packages such as `pandas` and its dependencies as well -(i.e., `numpy`, `python-dateutil`, `six` and `tzdata`). +The content of `uv.lock` was changed as well, the file contains more info on +the installed packages such as `pandas` and its dependencies as well (i.e., +`numpy`, `python-dateutil`, `six` and `tzdata`). ???+ tip "Share a project" If you share your project, be sure to include the files `.python-version`, - `pyproject.toml` and `uv.lock`. These allow for a recreation of your - virtual environment. + `pyproject.toml` and `uv.lock`. These allow for a recreation of your virtual + environment. ---- +______________________________________________________________________ Let's remove the package with the `remove` command: @@ -277,18 +279,20 @@ Let's remove the package with the `remove` command: uv remove pandas ``` -Again, you can check both `pyproject.toml` and `uv.lock` which are +Again, you can check both `pyproject.toml` and `uv.lock` which are automatically updated accordingly. ???+ question "Get a script running" 1. Create a new script called `plot.py` - 2. Paste following example (taken from [matplotlib docs](https://matplotlib.org/stable/gallery/lines_bars_and_markers/curve_error_band.html)) within your script: + + 1. 
Paste following example (taken from + [matplotlib docs](https://matplotlib.org/stable/gallery/lines_bars_and_markers/curve_error_band.html)) + within your script: ```python title="plot.py" linenums="1" import matplotlib.pyplot as plt import numpy as np - from matplotlib.patches import PathPatch from matplotlib.path import Path @@ -302,22 +306,23 @@ automatically updated accordingly. ax.set(aspect=1) plt.show() ``` - 3. Determine necessary packages to get this script running and install - them with `uv`. - 4. Lastly, the script with `uv`. + 1. Determine necessary packages to get this script running and install them + with `uv`. + + 1. Lastly, the script with `uv`. ## Python Scripts or Jupyter Notebooks? -For this course, you can work with Python scripts (`.py` files) and/or -Jupyter Notebooks (`.ipynb` files). Both are supported in VS Code and each has -its strengths. +For this course, you can work with Python scripts (`.py` files) and/or Jupyter +Notebooks (`.ipynb` files). Both are supported in VS Code and each has its +strengths.
-- :fontawesome-brands-python:{ .lg .middle } __Python Scripts__ +- :fontawesome-brands-python:{ .lg .middle } __Python Scripts__ - --- + ______________________________________________________________________ :fontawesome-regular-thumbs-up: Advantages @@ -326,7 +331,7 @@ its strengths. - Runs faster without cell-by-cell overhead - Cleaner debugging with standard tools - --- + ______________________________________________________________________ :fontawesome-regular-thumbs-down: Disadvantages @@ -334,9 +339,9 @@ its strengths. - Need to rerun entire script for changes - Harder to visualize intermediate results -- :simple-jupyter:{ .lg .middle } __Jupyter Notebooks__ +- :simple-jupyter:{ .lg .middle } __Jupyter Notebooks__ - --- + ______________________________________________________________________ :fontawesome-regular-thumbs-up: Advantages @@ -345,7 +350,7 @@ its strengths. - Combines documentation and code - Easier to share findings with non-programmers - --- + ______________________________________________________________________ :fontawesome-regular-thumbs-down: Disadvantages @@ -358,38 +363,38 @@ its strengths. ???+ tip "Our recommendation" - Many data scientists use both: notebooks for exploration, scripts for - production. Simply experiment with both. For quick prototyping lean towards - a :simple-jupyter: Jupyter Notebook. For more refined code switch to - :fontawesome-brands-python: Python scripts. + Many data scientists use both: notebooks for exploration, scripts for + production. Simply experiment with both. For quick prototyping lean towards a + :simple-jupyter: Jupyter Notebook. For more refined code switch to + :fontawesome-brands-python: Python scripts. ---- +______________________________________________________________________ ## Wrap-Up -You've successfully set up your development environment! Throughout this -course, you'll create multiple projects using the workflow covered in -sections 1-4. 
Don't worry about memorizing every step—just refer back to this -page when needed. +You've successfully set up your development environment! Throughout this +course, you'll create multiple projects using the workflow covered in sections +1-4. Don't worry about memorizing every step—just refer back to this page when +needed. For quick reference, here's a cheat sheet: ???+ note "Cheat Sheet - Project Setup" 1. Create a new folder for your project - 2. Open the folder in VS Code - 3. In the terminal, run: - ```bash - uv init --vcs none - uv sync - ``` - 4. Install packages as needed: - ```bash - uv add - ``` - 5. Run your code: - ```bash - uv run .py - ``` - + 1. Open the folder in VS Code + 1. In the terminal, run: + ```bash + uv init --vcs none + uv sync + ``` + 1. Install packages as needed: + ```bash + uv add + ``` + 1. Run your code: + ```bash + uv run .py + ``` + **Need help?** Run `uv --help` for more commands and options. diff --git a/docs/data-science/data/basics.md b/docs/data-science/data/basics.md index 7ef23f93..9274e0b8 100644 --- a/docs/data-science/data/basics.md +++ b/docs/data-science/data/basics.md @@ -1,48 +1,47 @@ # Data Basics -This chapter kicks off the foundational building blocks of a data science -pipeline. We start by taking a closer look at data itself. Understanding -different attribute types is crucial for choosing appropriate visualizations, +This chapter kicks off the foundational building blocks of a data science +pipeline. We start by taking a closer look at data itself. Understanding +different attribute types is crucial for choosing appropriate visualizations, preprocessing techniques and machine learning algorithms. ???+ question "Create a new project" - 1. For this chapter create a new project. Revisit the + 1. For this chapter create a new project. Revisit the [wrap-up](../basics/setup.md#wrap-up) section from the setup guide. - 2. Install the packages `seaborn` and `pandas` + 1. 
Install the packages `seaborn` and `pandas` ## Tabular Data -Throughout this course, we will primarily work with **tabular data**, simply -think of spreadsheets. Tabular data is organized in a rectangular -format with: +Throughout this course, we will primarily work with **tabular data**, simply +think of spreadsheets. Tabular data is organized in a rectangular format with: - **Rows**: Individual observations or samples (e.g., one student) - **Columns**: Attributes or features describing each observation (e.g., name, - age, average grade) + age, average grade) | Name | Age | Average Grade | -|---------|-----|---------------| +| ------- | --- | ------------- | | Claudia | 19 | 1.45 | | Stefan | 22 | 3.4 | | Max | 20 | 2.12 | -Each row represents one student, while each column contains a specific +Each row represents one student, while each column contains a specific attribute about that student. -Understanding the structure of tabular data is essential because most machine -learning algorithms expect data in this format. Now let's explore what types -of information each column can contain. +Understanding the structure of tabular data is essential because most machine +learning algorithms expect data in this format. Now let's explore what types of +information each column can contain. ## Attribute Types -Not all data is created equal. The type of data in each column determines -what operations we can perform and which visualizations make sense. We -distinguish between two main categories: numerical and categorical data. +Not all data is created equal. The type of data in each column determines what +operations we can perform and which visualizations make sense. We distinguish +between two main categories: numerical and categorical data. ### Numerical (Quantitative) -Numerical data represents measurable quantities, i.e., values you can perform +Numerical data represents measurable quantities, i.e., values you can perform mathematical operations on. 
```python @@ -61,20 +60,20 @@ Maximum temperature: 25.1°C Numerical data comes in two types: -**Continuous**: Can take any value within a range, including decimals. -Examples include temperature (22.5°C), body mass (3750.5g) or height (1.75m). +**Continuous**: Can take any value within a range, including decimals. Examples +include temperature (22.5°C), body mass (3750.5g) or height (1.75m). -**Discrete**: Can only take specific, countable values, typically integers. +**Discrete**: Can only take specific, countable values, typically integers. Examples include number of students (5) or age (22). ???+ tip - A simple rule of thumb: If you can meaningfully have fractional values, - it's continuous. If counting whole units makes more sense, it's discrete. + A simple rule of thumb: If you can meaningfully have fractional values, it's + continuous. If counting whole units makes more sense, it's discrete. ### Categorical (Qualitative) -Categorical data represents qualities or characteristics that place +Categorical data represents qualities or characteristics that place observations into groups or categories. ```python @@ -84,7 +83,7 @@ print(f"Unique colors: {colors.nunique()}") print(f"Most common: {colors.mode().squeeze()}") # (1)! ``` -1. The `mode()` method returns a `pd.Series` with a single value, hence we +1. The `mode()` method returns a `pd.Series` with a single value, hence we `squeeze()` the value. ```title=">>> Output" @@ -96,24 +95,24 @@ Categorical data can be further divided into two types: #### Nominal -Nominal data has no inherent order, the categories are just different names -or labels. Examples include colors or country names. +Nominal data has no inherent order, the categories are just different names or +labels. Examples include colors or country names. #### Ordinal -Ordinal data has a meaningful order or ranking between categories, but the -distance between categories isn't necessarily equal. 
Examples include t-shirt -sizes (XS, S, M, L, XL) or education levels (High School, Bachelor's, Master's, +Ordinal data has a meaningful order or ranking between categories, but the +distance between categories isn't necessarily equal. Examples include t-shirt +sizes (XS, S, M, L, XL) or education levels (High School, Bachelor's, Master's, PhD). ---- +______________________________________________________________________ -Now that we understand different data types, let's see them in action with -real data. +Now that we understand different data types, let's see them in action with real +data. ## Penguins -We'll use the Palmer Penguins dataset, which contains measurements of three +We'll use the Palmer Penguins dataset, which contains measurements of three penguin species observed on islands in the Palmer Archipelago, Antarctica.
@@ -128,13 +127,13 @@ penguin species observed on islands in the Palmer Archipelago, Antarctica. ???+ info - The Palmer Penguins dataset was collected and made available by - Dr. Kristen Gorman and the Palmer Station, Antarctica LTER.[^1] It's - become a popular dataset for education. + The Palmer Penguins dataset was collected and made available by Dr. Kristen + Gorman and the Palmer Station, Antarctica LTER.[^1] It's become a popular + dataset for education. - [^1]: - Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. - R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/. doi: 10.5281/zenodo.3960218. + [^1]: Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago + (Antarctica) penguin data. R package version 0.1.0. + https://allisonhorst.github.io/palmerpenguins/. doi: 10.5281/zenodo.3960218.
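A minimal sketch of the nominal/ordinal distinction from the attribute-type section, in plain Python. The sample values and the names `colors`, `sizes` and `SIZE_ORDER` are made up for illustration and are not taken from the penguins data.

```python
from collections import Counter

# Hypothetical sample values, not taken from the penguins data.
colors = ["red", "blue", "red", "green"]  # nominal: labels without an order
sizes = ["M", "XS", "XL", "S", "M"]       # ordinal: labels with a ranking

# Nominal data supports counting and finding the most common label (mode).
most_common_color = Counter(colors).most_common(1)[0][0]

# Ordinal data additionally supports sorting and min/max,
# once the order of the categories is stated explicitly.
SIZE_ORDER = ["XS", "S", "M", "L", "XL"]
rank = {label: position for position, label in enumerate(SIZE_ORDER)}

sorted_sizes = sorted(sizes, key=rank.get)
largest_size = max(sizes, key=rank.get)

print(most_common_color)
print(sorted_sizes)
print(largest_size)
```

Sorting the labels alphabetically would put `L` before `M` and `XL` before `XS`, so the ranking has to be stated explicitly. Even then, the ranks only express order, not equal distances between categories.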

@@ -126,45 +126,45 @@ saw that coming 🤯).
-Although, it seems like we don't have to bother with missing values, they -are simply a bit more hidden. +Although it seems like we don't have to bother with missing values, they are +simply a bit more hidden. ### Missing values in disguise -`pandas` considers types like `#!python None` or `#!python np.nan` as -missing. However in practice, missing values are encoded in various ways. -For instance, strings like `#!python "NA"` or integers like `#!python -999` -are used. Consequently, we can't detect these ways of encoding with -simply calling `#!python isna()`. +`pandas` considers types like `#!python None` or `#!python np.nan` as missing. +However, in practice, missing values are encoded in various ways. For instance, +strings like `#!python "NA"` or integers like `#!python -999` are used. +Consequently, we can't detect these ways of encoding by simply calling +`#!python isna()`. -Since we have to manually detect these encoded missing values, it is -essential to have a good understanding of the data. Let's get more -familiarized with the data. +Since we have to manually detect these encoded missing values, it is essential +to have a good understanding of the data. Let's get more familiar with the +data. -Visit the UCI Machine Learning Repository -[here](https://archive.ics.uci.edu/dataset/222/bank+marketing) which also -hosts the data set and some additional information. Interestingly, the section +Visit the UCI Machine Learning Repository +[here](https://archive.ics.uci.edu/dataset/222/bank+marketing) which also hosts +the data set and some additional information. Interestingly, the section *Dataset Information* states: > **Has Missing Values?** > > No -Although that might be technical correct (the data contains no empty values), +Although that might be technically correct (the data contains no empty values), we have to dig deeper.
???+ question "Detect the encoding of missing values" - - Open the [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/222/bank+marketing). - Look at the *Variables Table*. How are the missing values encoded in the - data set? + + Open the + [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/222/bank+marketing). + Look at the *Variables Table*. How are the missing values encoded in the data + set? Use the following quiz question to validate your answer. - Remember, the bigger picture :fontawesome-solid-arrow-right: - by getting more familiar with the data, we can train a better fitting - model to predict the target variable `y` (subscribed to term deposit or - not). + Remember the bigger picture :fontawesome-solid-arrow-right: by getting more + familiar with the data, we can train a better fitting model to predict the + target variable `y` (subscribed to term deposit or not). How are missing values encoded in this specific data set? @@ -180,16 +180,16 @@ education). ### Missing values uncovered -Now that we uncovered the encoding of missing values, we replace them with +Now that we have uncovered the encoding of missing values, we replace them with `#!python None` to properly detect them and handle them more easily. ???+ question "Replace encoding with `#!python None`" - - Since, you've detected the particular encoding of missing values, replace - them with `#!python None` across the whole data frame. - - Use the `DataFrame.replace()` method and read the - [docs](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html), + + Since you've detected the particular encoding of missing values, replace them + with `#!python None` across the whole data frame. + + Use the `DataFrame.replace()` method and read the + [docs](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html), especially the *Examples* section for usage guidance.
After solving the question, we (again) sum up the missing values per column. @@ -201,7 +201,7 @@ print(data.isna().sum()) A truncated version of the output: | Column | Missing Values | -|-----------|----------------| +| --------- | -------------- | | id | 0 | | age | 0 | | default | 760 | @@ -212,8 +212,8 @@ A truncated version of the output: | education | 161 | | ... | ... | -At first glance, a lot of columns contain missing values. Let's calculate -the ratio to get a better feeling. +At first glance, a lot of columns contain missing values. Let's calculate the +ratio to get a better feeling. ```python hl_lines="4" count_missing = data.isna().sum() @@ -224,7 +224,7 @@ print(missing_ratio.round(2)) ``` | Column | Missing Values (%) | -|-----------|--------------------| +| --------- | ------------------ | | id | 0.00 | | age | 0.00 | | default | 19.35 | @@ -235,28 +235,28 @@ print(missing_ratio.round(2)) | education | 4.10 | | ... | ... | -Compared to the initial observation where we found `#!python 0` -missing values across the whole data set, it's a stark contrast. +Compared to the initial observation where we found `#!python 0` missing values +across the whole data set, it's a stark contrast. -Looking at the attribute *default*, nearly a fifth of the observations are -missing (19.35 %). Other attributes contain less missing values, yet we still -need to handle them. Therefore, we explore different strategies to deal -with missing values. +Looking at the attribute *default*, nearly a fifth of the observations are +missing (19.35 %). Other attributes contain fewer missing values, yet we still +need to handle them. Therefore, we explore different strategies to deal with +missing values. ???+ info - Though it might not seem much, being able to detect these missing values - will prove invaluable in the future. + Though it might not seem like much, being able to detect these missing values + will prove invaluable in the future.
- By identifying and properly handling these gaps, we might be able to - train a better fitting model as unaddressed missing values can lead to - biased predictions. Most importantly, most algorithms can't handle - missing values at all. + By identifying and properly handling these gaps, we might be able to train a + better fitting model as unaddressed missing values can lead to biased + predictions. Most importantly, most algorithms can't handle missing values at + all. ### Sources -We have extensively covered how to detect missing values but have not -talked about their possible origins. +We have extensively covered how to detect missing values but have not talked +about their possible origins. The reasons for missing values can be manifold: @@ -273,8 +273,8 @@ The reasons for missing values can be manifold: ### Drop columns/rows -One simple way to handle missing values is to drop (i.e. remove) the -respective columns which contain any missing values. +One simple way to handle missing values is to drop (i.e. remove) the respective +columns which contain any missing values. ```python data_dropped = data.dropna(axis=1) @@ -282,33 +282,35 @@ data_dropped = data.dropna(axis=1) `#!python axis=1` specified the columns to be dropped. -To comprehend the impact of this operation, we calculate the number of -columns that were removed. +To comprehend the impact of this operation, we calculate the number of columns +that were removed. ```python print(data.shape[1] - data_dropped.shape[1]) ``` + This operation removed `#!python 6` out of `#!python 21` columns/attributes. ???+ question "Remove rows with missing values" - Contrary, we can leave all columns and instead drop the rows containing - missing values. + Conversely, we can keep all columns and instead drop the rows containing + missing values. - 1. Use the [`DataFrame.dropna()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html) - method to remove rows with missing values. - 2. 
Calculate the number of rows that were removed. + 1. Use the + [`DataFrame.dropna()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html) + method to remove rows with missing values. + 1. Calculate the number of rows that were removed. #### Set a threshold -Instead of dropping all rows/columns with gaps, we can set a threshold to only +Instead of dropping all rows/columns with gaps, we can set a threshold to only drop columns/rows with a certain amount of missing values. To specify a threshold, make use of the `thresh` parameter, which takes an -`#!python int` value of ^^non-missing^^ values that a column/row must have, -to ^^not^^ be dropped. +`#!python int` value of ^^non-missing^^ values that a column/row must have, to +^^not^^ be dropped. -As an example, we would like to remove all columns holding more than 10 % +As an example, we would like to remove all columns holding more than 10 % missing values. ```python hl_lines="7 10" @@ -326,8 +328,8 @@ diff = data.shape[1] - data_dropped_threshold.shape[1] print(f"Number of columns dropped: {diff}") ``` -1. The `#!python math.ceil()` function is used to round up the threshold - value to the next integer. +1. The `#!python math.ceil()` function is used to round up the threshold value + to the next integer. ```title=">>> Output" 3535.2000000000003 @@ -335,34 +337,35 @@ print(f"Number of columns dropped: {diff}") Number of columns dropped: 1 ``` -A single column was dropped and therefore exceeded the 10 % threshold of +A single column was dropped because it exceeded the 10 % threshold of missing values. ---- +______________________________________________________________________ -Depending on the data at hand, dropping rows or columns might be a valid -option, if you're dealing with a small number of missing values. +Depending on the data at hand, dropping rows or columns might be a valid +option, if you're dealing with a small number of missing values. 
However, in other cases these operations might lead to a significant loss of information. -Since, we are dealing with a substantial amount of missing values, we are +Since we are dealing with a substantial number of missing values, we are looking for more sophisticated ways to handle them. ### Imputation techniques -What about filling in the missing values? The process of replacing missing +What about filling in the missing values? The process of replacing missing values is called imputation. ![](../../assets/data-science/data/imputation.gif) +
Data imputation
-There are various imputation techniques available, each with its own -advantages and disadvantages. +There are various imputation techniques available, each with its own advantages +and disadvantages. ##### Fill manually -Of course, there is always the option to fill the values manually which -could be time-consuming and infeasible for large data sets. +Of course, there is always the option to fill the values manually which could +be time-consuming and infeasible for large data sets. ##### Global constant @@ -373,17 +376,17 @@ constant, i.e., filling gaps across ^^all^^ columns with the same value. data_filled = data.fillna("no") ``` -This method is straightforward and easy to implement. However, there are -some drawbacks: +This method is straightforward and easy to implement. However, there are some +drawbacks: - how to choose the global constant? -- introduces further challenges with mixed attributes (i.e., - nominal/ordinal and numerical attributes) +- introduces further challenges with mixed attributes (i.e., nominal/ordinal + and numerical attributes) ##### Central tendency -Another common approach is to replace missing values with the mean, median, -or mode of the respective column. +Another common approach is to replace missing values with the mean, median, or +mode of the respective column. Fill a nominal attribute with the mode: @@ -413,31 +416,31 @@ np.float64(40.1433299389002) ???+ info - Since the bank data does not contain any numerical attribute with - missing values, the above code snippet assumed gaps in *age*. As there - are none, the operation did not change the data. + Since the bank data does not contain any numerical attribute with missing + values, the above code snippet assumed gaps in *age*. As there are none, the + operation did not change the data. #### Machine Learning Lastly, we can use machine learning algorithms to predict the missing values. -The idea is to estimate the missing values based on the other attributes. 
+The idea is to estimate the missing values based on the other attributes. Linear regression, k-nearest neighbors, or decision trees are common choices. ???+ info - As we have not covered machine learning yet, we won't get into the details. - But feel free to return to this section. Especially, + As we have not covered machine learning yet, we won't get into the details. But + feel free to return to this section. Especially, [this](https://scikit-learn.org/stable/auto_examples/impute/plot_missing_values.html) - scikit-learn comparison of imputation techniques (including k-nearest + scikit-learn comparison of imputation techniques (including k-nearest neighbors) is a good starting point for further exploration. ## Transformation -Step by step, we are getting closer to actually training a machine learning +Step by step, we are getting closer to actually training a machine learning model. Beforehand, we introduce data transformations that are commonly applied to improve the fit of the model. -For starters, install the `scikit-learn` package within your activated +For starters, install the `scikit-learn` package within your activated environment. ```bash @@ -456,20 +459,20 @@ From now on, we will heavily use `scikit-learn`'s functionalities. ### Discretize numerical attributes -When dealing with noisy data, it is often beneficial to discretize -numerical (continuous) attributes. +When dealing with noisy data, it is often beneficial to discretize numerical +(continuous) attributes. ???+ info "Noise in data" - Noise is a random error or variance in a measured variable. It is - meaningless information that can distort the data. - - Noise can be identified using basic statistical methods and - visualization techniques like boxplots or scatter plots. + Noise is a random error or variance in a measured variable. It is meaningless + information that can distort the data. 
+ + Noise can be identified using basic statistical methods and visualization + techniques like boxplots or scatter plots. -The process of discretizing is called binning. I.e., the continuous data -is separated into intervals (bins). -Bins can generally lead to a smoothing effect which in turn reduce the noise. +The process of discretizing is called binning. I.e., the continuous data is +separated into intervals (bins). Bins can generally lead to a smoothing effect +which in turn reduces the noise. As an example, we pick the attribute *age* and visualize it with a boxplot. @@ -481,21 +484,21 @@ As an example, we pick the attribute *age* and visualize it with a boxplot. ??? tip "Create a static boxplot" To create a static version of the boxplot, perfect for a quick overview: - + ```python import matplotlib.pyplot as plt - + data["age"].plot(kind="box") # (1)! plt.show() ``` - - 1. The `#!python plot()` method uses `matplotlib` as backend. - + + 1. The `#!python plot()` method uses `matplotlib` as backend. +
Age boxplot
-Since, *age* contains outliers, we discretize the attribute *age* into five +Since *age* contains outliers, we discretize the attribute *age* into five bins with the same width. ```python @@ -506,12 +509,12 @@ bins.fit(data[["age"]]) age_binned = bins.transform(data[["age"]]) # (1)! ``` -1. The additional square brackets in `#!python data[["age"]]` are used to - select the column *age* as a `DataFrame` (instead of a `Series`). - This is necessary for the `#!python transform()` method as a - two-dimensional input is required. +1. The additional square brackets in `#!python data[["age"]]` are used to + select the column *age* as a `DataFrame` (instead of a `Series`). This is + necessary for the `#!python transform()` method as a two-dimensional input + is required. -The above snippet returns 5 bins with a width of 14 years. Inspect the bin +The above snippet returns 5 bins with a width of 14 years. Inspect the bin edges with: ```python @@ -522,34 +525,33 @@ print(bins.bin_edges_) [array([18., 32., 46., 60., 74., 88.])] ``` -Though the actual binning is just two three lines of code, we have a couple of +Though the actual binning is just two to three lines of code, we have a couple of things to dissect. ???+ tip "Working with `scikit-learn`" - Although the package is named `scikit-learn`, it is imported as - `#!python import sklearn`. Package names on - [PyPI (Python Package Index)](../../python/packages.md/#pypi) - can be different from the import name. + Although the package is named `scikit-learn`, it is imported as + `#!python import sklearn`. Package names on + [PyPI (Python Package Index)](../../python/packages.md/#pypi) can be different + from the import name. - --- + ______________________________________________________________________ - `scikit-learn` frequently uses classes (e.g., `KBinsDiscretizer`) - to represent different models and preprocessing techniques. Two important - methods that many of these classes implement are `fit` and `transform`. 
+ `scikit-learn` frequently uses classes (e.g., `KBinsDiscretizer`) to represent + different models and preprocessing techniques. Two important methods that many + of these classes implement are `fit` and `transform`. - - `#!python fit(X)`: This method is used to learn the parameters from the - data (referred to as `X`). - - - `#!python transform(X)`: This method is used to apply the learned - parameters to the data :fontawesome-solid-arrow-right: `X`. + - `#!python fit(X)`: This method is used to learn the parameters from the data + (referred to as `X`). - Put simply, think about the `#!python fit(X)` method as scikit-learn takes - a look at the data and learns from it. The `#!python transform(X)` - method then transfers this knowledge and applies it to the data. + - `#!python transform(X)`: This method is used to apply the learned parameters + to the data :fontawesome-solid-arrow-right: `X`. - The `#!python fit_transform()` method combines both of these steps in one. + Put simply, think about the `#!python fit(X)` method as scikit-learn takes a + look at the data and learns from it. The `#!python transform(X)` method then + transfers this knowledge and applies it to the data. + The `#!python fit_transform()` method combines both of these steps in one. Alternatively, use `#!python strategy="quantile"` to bin the data based on quantiles and thus create bins with the same number of observations. @@ -565,12 +567,13 @@ print(bins.bin_edges_) [array([18., 31., 36., 41., 50., 88.])] ``` -No matter the strategy `#!python "uniform"` or `#!python "quantile"`, a -matrix is returned with the +No matter the strategy `#!python "uniform"` or `#!python "quantile"`, a matrix +is returned with the > bin identifier encoded as an integer value. 
-> -> [`KBinsDiscretizer` docs](https://scikit-learn.> org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html) +> +> [`KBinsDiscretizer` +> docs](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html) ### Normalization @@ -589,10 +592,10 @@ Min-Max normalization scales the data to a fixed range, usually [0, 1]. X' = \frac{X - X_{min}}{X_{max} - X_{min}} \] - where \(X\) is the original value, \(X_{min}\) is the minimum value of the + where \(X\) is the original value, \(X_{min}\) is the minimum value of the feature, and \(X_{max}\) is the maximum value of the feature. -This technique is useful when you want to ensure that all features have the +This technique is useful when you want to ensure that all features have the same scale without distorting differences in the ranges of values. To illustrate the normalization, we use the attribute *euribor3m* (3 month @@ -601,7 +604,7 @@ Euribor rate). > Euribor is short for Euro Interbank Offered Rate. The Euribor rates are based > on the average interest rates at which a large panel of European banks borrow > funds from one another. -> +> > [euribor-rates.eu](https://www.euribor-rates.eu/en/) ```python @@ -634,17 +637,17 @@ Min: 0.0, Max: 1.0 ???+ question "Normalization of new data" Assume new data is added: - + ```python new_data = pd.DataFrame({"euribor3m": [0.5, 5.0, 2.5]}) ``` - We would like to transform these three new interest rates using the Min - Max normalization. - Remember that the `MinMaxScaler` was already fitted on the original - data with \(X_{min}=0.635\) and \(X_{max}=4.97\). - Answer the following quiz question. Look at the formula again and try - to answer the question without executing code. + We would like to transform these three new interest rates using the Min-Max + normalization. Remember that the `MinMaxScaler` was already fitted on the + original data with \(X_{min}=0.635\) and \(X_{max}=4.97\). + + Answer the following quiz question. 
Look at the formula again and try to answer + the question without executing code. What happens if you call `#!python transform(new_data)`? @@ -652,19 +655,20 @@ What happens if you call `#!python transform(new_data)`? - [ ] The new data is normalized. - [x] The normalization works, but the range [0, 1] is not preserved. -Since the newly added Euribor rates of 0.5 and 5.0, are lower or -higher than the previous minimum and maximum respectively, the normalization -will not preserve the range [0, 1], i.e. resulting in the normalized values: +Since the newly added Euribor rates of 0.5 and 5.0 are lower and higher than +the previous minimum and maximum respectively, the normalization will not +preserve the range [0, 1], resulting in the normalized values: ```python [[-0.03114187], [1.00692042], [0.43021915]] ``` + #### Z-Score Normalization Z-Score normalization, also known as standardization, scales the data based on -the mean and standard deviation of an attribute. +the mean and standard deviation of an attribute. ???+ defi "Definition: Z-Score Normalization" @@ -672,32 +676,32 @@ the mean and standard deviation of an attribute. X' = \frac{X - \mu}{\sigma} \] - where \(\mu\) is the mean of the feature and \(\sigma\) is the standard + where \(\mu\) is the mean of the feature and \(\sigma\) is the standard deviation. -This technique centers the data around zero with a standard deviation of one, +This technique centers the data around zero with a standard deviation of one, which is useful for algorithms assuming normally distributed data. ???+ question "Apply Z-Score normalization" - Use the [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) - from `scikit-learn` to apply Z-Score normalization to the attribute - *campaign* (number of times a customer was contacted). 
+ Use the + [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) + from `scikit-learn` to apply Z-Score normalization to the attribute *campaign* + (number of times a customer was contacted). 1. Fit the `StandardScaler` on the data. - 2. Transform the data. - 3. Calculate and print the mean and standard deviation of the transformed - data. + 1. Transform the data. + 1. Calculate and print the mean and standard deviation of the transformed data. ### One-Hot Encoding -So far we have focused on numerical attributes. But what about -categorical variables? Since, many machine learning algorithms can't handle -categorical attributes directly, they need to be encoded. One common technique -is to one-hot encode these attributes. +So far, we have focused on numerical attributes. But what about categorical +variables? Since many machine learning algorithms can't handle categorical +attributes directly, they need to be encoded. One common technique is to +one-hot encode these attributes. -Imagine the toy example below to illustrate the concept of one-hot encoding -on the feature *job*. +Imagine the toy example below to illustrate the concept of one-hot encoding on +the feature *job*.