diff --git a/datacleaning.qmd b/datacleaning.qmd
index 1b9a890..eb9da0f 100644
--- a/datacleaning.qmd
+++ b/datacleaning.qmd
@@ -6,7 +6,7 @@ manydogs_data <- read.csv("manydogs_etal_2024_data.csv")
## Data Set-up: Tidying and Feature Selection
-Before you can begin working with your data you must make sure that each row is a single observation, and each column is a single variable/predictor. This type of data set-up or "wrangling" is known as "tidy data". There are many readily available tutorials and textbooks that help you understand tidy data and how to clean and wrangle data to make it tidy. I recommend the [Tidyverse](https://r4ds.had.co.nz/tidy-data.html) chapter in the R data science textbook to start. Thankfully, the dataset from ManyDogs is already tidy so for this tutorial we can skip this step.
+Before you can begin working with your data you must make sure that each row is a single observation, and each column is a single variable/predictor. This type of data set-up or "wrangling" is known as "tidy data". There are many readily available tutorials and textbooks that help you understand tidy data and how to clean and wrangle data to make it tidy. I recommend the tidy data chapter in the R for Data Science textbook to start. Thankfully, the dataset from ManyDogs is already tidy so for this tutorial we can skip this step.
After your data is tidy, the next step before is to complete feature selection. [*Feature selection*](https://www.geeksforgeeks.org/feature-selection-techniques-in-machine-learning/) is a fancy term for removing variables you aren't going to analyze and creating new ones by computing any necessary variables. Many of the changes you complete in this step are decided on through [domain knowledge](https://corporatefinanceinstitute.com/resources/data-science/domain-knowledge-data-science/): applying what you know about the field of research to make judgement calls on what is and isn't important to the model/data. The rest of our decisions depend on our specific research hypotheses. Anything in the data set that does not specifically pertain to our research hypotheses need to be eliminated to increase statistical power and to compute the model faster to save computing resources. Commonly deleted variables at this stage might include meta-data such as time of day when a survey was completed, or individual scale items when we have calculated the total scores. When domain knowledge doesn't suffice because you are working in a relatively new field, or past studies have conflicting information, you will want to let machine learning algorithms help choose what to keep and eliminate. See the [regularization](regularization.qmd#regularization) section for more information.
diff --git a/datadescription.qmd b/datadescription.qmd
index 7fd9b68..6d8090c 100644
--- a/datadescription.qmd
+++ b/datadescription.qmd
@@ -6,4 +6,4 @@ The data being used in this tutorial is from the [ManyDogs Project](https://many
knitr::include_graphics("md1_setup.jpg")
```
-This dataset is a great example to use when investigating machine learning predictive classification models, as it has many possible predictors to investigate with a discrete binary dependent variable (i.e. whether the dog chose correctly). Furthermore, all the data from this project is available to anyone to share or adapt with attribution, which makes it ideal to use as a learning tool. To work through this tutorial with me, you will first need to download the data from the GitHub repository associated with the project and/or create a local clone of the project repo on your computer. I recommend using the [GitHub desktop application](https://docs.github.com/en/desktop/overview/getting-started-with-github-desktop) to easily get and give information from/to a GitHub repository using a point and click method instead of the command line.
+This dataset is a great example to use when investigating machine learning predictive classification models, as it has many possible predictors to investigate with a discrete binary dependent variable (i.e., whether the dog chose correctly). Furthermore, all the data from this project is available to anyone to share or adapt with attribution, which makes it ideal to use as a learning tool. To work through this tutorial with me, you will first need to [download the data](https://github.com/ManyDogsProject/md1_data/blob/main/manydogs_etal_2024_data.csv) from the GitHub repository associated with the project and/or create a local clone of the project repo on your computer. I recommend using the [GitHub desktop application](https://docs.github.com/en/desktop/overview/getting-started-with-github-desktop), which lets you pull from and push to a GitHub repository with a point-and-click interface instead of the command line.
diff --git a/glossary.qmd b/glossary.qmd
index 5710bf0..a301240 100644
--- a/glossary.qmd
+++ b/glossary.qmd
@@ -7,7 +7,7 @@
- Feature Selection - [Feature Selection](https://domino.ai/data-science-dictionary/feature-selection) is a process that eliminates unnecessary features (AKA predictors or variables) from the data to help a model perform as well as possible with given data.
- Generalization error - [Generalization error](https://medium.com/@yixinsun_56102/understanding-generalization-error-in-machine-learning-e6c03b203036) is a measures of how well an algorithm performs on the testing data. Reminder: testing data is held out at the train and test split step at the beginning of a machine learning project.
- Hyperparameters - [Hyperparameters](https://towardsdatascience.com/parameters-and-hyperparameters-aa609601a9ac) are any variable in a model that changes the accuracy and precision of a model that are not learned from the data. This is in contrast to variables in the model that are learned from the data (i.e. parameters). For example, in a random forest model the number of decision trees that you want the computer to run is a hyperparameter but the best predictor to sample at the first node of each tree is a parameter as that is what the model learns using the data. Thinking more about the best start value for different types of hyperparameters is a whole book/tutorial in itself! In fact someone wrote a [book](https://library.oapen.org/viewer/web/viewer.html?file=/bitstream/handle/20.500.12657/60840/978-981-19-5170-1.pdf?sequence=1&isAllowed=y) on how to tune hyperparameters in R that is freely available. Please see this book and seek out additional resources on your own when tuning hyperparameters for your own analyses.
-- Lost/Cost Function - A [loss/cost function](https://www.enjoyalgorithms.com/blog/loss-and-cost-functions-in-machine-learning) quantifies the difference between the predicted values and the actual values, measuring the model's performance. The goal of machine learning is to minimize the loss function to improve the model's accuracy and generalizability.There are many different types of functions that measure model performance so loss/cost function is an umbrella term. A loss function refers to the performance of a single data point while a cost function refers to the average performance across a dataset.
+- Loss Function - A [loss function](https://www.enjoyalgorithms.com/blog/loss-and-cost-functions-in-machine-learning) (also called a cost function) quantifies the difference between the predicted values and the actual values, measuring the model's performance. The goal of machine learning is to minimize the loss function to improve the model's accuracy and generalizability. There are many different types of functions that measure model performance, so loss/cost function is an umbrella term. Strictly speaking, a loss function refers to the performance on a single data point, while a cost function refers to the average performance across a dataset; this tutorial uses the two terms interchangeably.
## Package Versions
@@ -37,5 +37,5 @@ Below I have listed the R and package versions I am using. If you are reading th
- Version 4.3.3
- randomForest
- Version 4.7.1.1
-
+
# References
diff --git a/otherassumptions.qmd b/otherassumptions.qmd
index beb28bf..1c7bb3e 100644
--- a/otherassumptions.qmd
+++ b/otherassumptions.qmd
@@ -25,7 +25,7 @@ assumptiontable <- data.frame(
High_Dimensionality = c("No", "Yes", "No", "No", "No"),
Feature_Scaling = c("Yes","Yes","No","No","Yes"))
-kable(assumptiontable)
+kable(assumptiontable, col.names = gsub("_", " ", names(assumptiontable), fixed = TRUE))
```
The above table lists the model assumptions that need to be understood and investigated further. Make sure you investigate these assumptions per *research question*, as different questions will use different data. If an assumption is violated, you can either transform your data to meet the assumption or eliminate that test from your analyses. Below I will show you how to run checks for independence, normality, strong outliers, linearity, [high dimensionality](#np), and how to [feature scale](#featurescaling) your data.
@@ -65,31 +65,34 @@ aggression_qqplot <- qqnorm(manydogs_missing_handled$aggression_score)
qqline(manydogs_missing_handled$aggression_score)
aggression_density <- plot(density(manydogs_missing_handled$aggression_score), main = "Density Plot of Aggression")
+
+```
+
+```{r eval=FALSE}
#Plots for Attachment
-#attachment_qqplot <- qqnorm(manydogs_missing_handled$attachment_score)
-#qqline(manydogs_missing_handled$attachment_score)
-#attachment_density <- plot(density(manydogs_missing_handled$attachment_score), main = "Density Plot of Attachment")
+attachment_qqplot <- qqnorm(manydogs_missing_handled$attachment_score)
+qqline(manydogs_missing_handled$attachment_score)
+attachment_density <- plot(density(manydogs_missing_handled$attachment_score), main = "Density Plot of Attachment")
#Plots for Excitability
-#excitability_qqplot <- qqnorm(manydogs_missing_handled$excitability_score)
-#qqline(manydogs_missing_handled$excitability_score)
-#excitability_density <- plot(density(manydogs_missing_handled$excitability_score), main = "Density Plot of Excitability")
+excitability_qqplot <- qqnorm(manydogs_missing_handled$excitability_score)
+qqline(manydogs_missing_handled$excitability_score)
+excitability_density <- plot(density(manydogs_missing_handled$excitability_score), main = "Density Plot of Excitability")
#Plots for Fear
-#fear_qqplot <- qqnorm(manydogs_missing_handled$fear_score)
-#qqline(manydogs_missing_handled$fear_score)
-#fear_density <- plot(density(manydogs_missing_handled$fear_score), main = "Density Plot of Fear")
+fear_qqplot <- qqnorm(manydogs_missing_handled$fear_score)
+qqline(manydogs_missing_handled$fear_score)
+fear_density <- plot(density(manydogs_missing_handled$fear_score), main = "Density Plot of Fear")
#Plots for Miscellaneous
-#miscellaneous_qqplot <- qqnorm(manydogs_missing_handled$miscellaneous_score)
-#qqline(manydogs_missing_handled$miscellaneous_score)
-#miscellaneous_density <- plot(density(manydogs_missing_handled$miscellaneous_score), main = "Density Plot of Miscellaneous")
+miscellaneous_qqplot <- qqnorm(manydogs_missing_handled$miscellaneous_score)
+qqline(manydogs_missing_handled$miscellaneous_score)
+miscellaneous_density <- plot(density(manydogs_missing_handled$miscellaneous_score), main = "Density Plot of Miscellaneous")
#Plots for Separation
-#separation_qqplot <- qqnorm(manydogs_missing_handled$separation_score)
-#qqline(manydogs_missing_handled$separation_score)
-#separation_density <- plot(density(manydogs_missing_handled$separation_score), main = "Density Plot of Separation")
-
+separation_qqplot <- qqnorm(manydogs_missing_handled$separation_score)
+qqline(manydogs_missing_handled$separation_score)
+separation_density <- plot(density(manydogs_missing_handled$separation_score), main = "Density Plot of Separation")
```
Unfortunately, none of the plots for our research question 3 fit the normality assumption. We know this because the density plots do not show a smooth bell curve - there are multiple peaks, instead of just 1 in the center - and graphs aren't symmetrical. For the Q-Q plots, the data has large deviations from the normality line with much more data than 25% not touching the line. There are also deviations on the tail ends, with both tails going off in different directions. Therefore, we do not have normality and cannot run a Naive Bayes model with this data.
@@ -122,7 +125,7 @@ One last note on outliers: The Z-score method is also used to detect outliers. H
### Linearity
-Linearity is our next assumption. [Linearity](https://www.bookdown.org/rwnahhas/RMPH/mlr-linearity.html) assumes that *each* predictor in the model, when holding all the other predictors in the model constant, will change in a linear way with the outcome variable. (In other words, don't put a line through something that is not a line). Unlike in linear regression, in logistic regression, we are looking at the log odds of the **probability** of the outcome (i.e., being in a particular category). To diagnose if this is true or not, we need to make a graph of this relationship and see if the plot satisfies the assumption. These plots are called a component plus resistance or CR plot, which can be made with the `crPlots` function in the `car` package.
+Linearity is our next assumption. [Linearity](https://www.bookdown.org/rwnahhas/RMPH/mlr-linearity.html) assumes that *each* predictor in the model, when holding all the other predictors in the model constant, will change in a linear way with the outcome variable. (In other words, don't put a line through something that is not a line.) Unlike in linear regression, in logistic regression we are looking at the log odds of the **probability** of the outcome (i.e., being in a particular category). To diagnose whether this is true, we need to graph this relationship and see if the plot satisfies the assumption. This type of plot is called a component-plus-residual (CR) plot, which can be made with the `crPlots()` function in the `car` package.
We will make a plot for each continuous predictor per research question. We will break code into three sections, one for each research question. You do not need to check the categorical predictors as they are always linear. To read more about why this is see [here](https://www.bookdown.org/rwnahhas/RMPH/mlr-linearity.html).
@@ -162,33 +165,37 @@ age_crplot <- crPlots(model_RQ_3, terms = ~age,
pch=20, col="gray",
smooth = list(smoother=car::gamLine))
-#training_crplot <- crPlots(model_RQ_3, terms = ~training_score,
- #pch=20, col="gray",
- #smooth = list(smoother=car::gamLine))
+```
+
+```{r eval=FALSE}
+
+training_crplot <- crPlots(model_RQ_3, terms = ~training_score,
+ pch=20, col="gray",
+ smooth = list(smoother=car::gamLine))
-#aggression_crplot <- crPlots(model_RQ_3, terms = ~aggression_score,
- #pch=20, col="gray",
- #smooth = list(smoother=car::gamLine))
+aggression_crplot <- crPlots(model_RQ_3, terms = ~aggression_score,
+ pch=20, col="gray",
+ smooth = list(smoother=car::gamLine))
-#fear_crplot <- crPlots(model_RQ_3, terms = ~fear_score,
- #pch=20, col="gray",
- #smooth = list(smoother=car::gamLine))
+fear_crplot <- crPlots(model_RQ_3, terms = ~fear_score,
+ pch=20, col="gray",
+ smooth = list(smoother=car::gamLine))
-#separation_crplot <- crPlots(model_RQ_3, terms = ~separation_score,
- #pch=20, col="gray",
- #smooth = list(smoother=car::gamLine))
+separation_crplot <- crPlots(model_RQ_3, terms = ~separation_score,
+ pch=20, col="gray",
+ smooth = list(smoother=car::gamLine))
-#excitability_crplot <- crPlots(model_RQ_3, terms = ~excitability_score,
- #pch=20, col="gray",
- #smooth = list(smoother=car::gamLine))
+excitability_crplot <- crPlots(model_RQ_3, terms = ~excitability_score,
+ pch=20, col="gray",
+ smooth = list(smoother=car::gamLine))
-#attachment_crplot <- crPlots(model_RQ_3, terms = ~attachment_score,
- #pch=20, col="gray",
- #smooth = list(smoother=car::gamLine))
+attachment_crplot <- crPlots(model_RQ_3, terms = ~attachment_score,
+ pch=20, col="gray",
+ smooth = list(smoother=car::gamLine))
-#miscellaneous_crplot <- crPlots(model_RQ_3, terms = ~miscellaneous_score,
- #pch=20, col="gray",
- #smooth = list(smoother=car::gamLine))
+miscellaneous_crplot <- crPlots(model_RQ_3, terms = ~miscellaneous_score,
+ pch=20, col="gray",
+ smooth = list(smoother=car::gamLine))
```
@@ -202,7 +209,7 @@ Now that you know why we transform predictors by scaling them, let's scale our c
To see explanations of other types of transformations see this great [guide](https://rpubs.com/zubairishaq9/how-to-normalize-data-r-my-data) on R pubs.
-We will use the function `scale` to apply the z-score function to our continuous predictor columns. This function comes with base R, so no need to install another package.
+We will use the function `scale()` to apply the z-score function to our continuous predictor columns. This function comes with base R, so no need to install another package.
```{r}
manydogs_transformed <- manydogs_missing_handled %>%
diff --git a/regularization.qmd b/regularization.qmd
index 97b49f5..c368b9f 100644
--- a/regularization.qmd
+++ b/regularization.qmd
@@ -4,11 +4,11 @@ Now, if you were to consult other machine learning texts, you will notice one wo
Regularization helps us to solve two problems that plague statistics: collinearity and overfitting. Often after you collect your data, you will find that some of your predictors are [colinear](https://www.stratascratch.com/blog/a-beginner-s-guide-to-collinearity-what-it-is-and-how-it-affects-our-regression-model/), meaning that two or more of your predictor variables are highly correlated with each other. [Overfitting](https://www.geeksforgeeks.org/underfitting-and-overfitting-in-machine-learning/) means that your model too closely maps onto your training data and isn't flexible enough to do well or make accurate predictions for new unseen data. All data represents two sources of variation: the true relationships, and the error in your sample (i.e., what is true of your sample but is not true of the population). The smaller your sample size, the more likely it is that you have captured noise rather than the true relationship between variables. The graphic below helps illustrate how overfitting creates a model reliant on error rather than the true relationship.
-
+
Source: MathWorks (https://www.mathworks.com/discovery/overfitting)
-You can solve both of these issues at once with something called regularization. Regularization simply adds a penalty term, lambda, to the slope of the fitted line. This additional value added to the slope adds in a little bit of error to the model. This user-added error is a necessary evil to combat some of the noise that occur in models when the data used to construct the model is not totally representative of the underlying truth of the relationship between variables. The choice to regularize is totally up to you, and is based on how much risk tolerance or uncertainty you want to include in your model outcomes. If you have millions of data points from a highly representative sample, adding in that extra error will probably get you further from the truth of the relationship. If you have, like we do in this example, less than 1000 subjects and are trying to generalize to the species level you probably should do regularization as the sample is small and not totally representative of your total population.
+You can solve both of these issues at once with something called regularization. Regularization simply adds a penalty term, lambda, to the slope of the fitted line. This additional value added to the slope adds in a little bit of error to the model. This user-added error is a necessary evil to combat some of the noise that occurs in models when the data used to construct the model is not totally representative of the underlying truth of the relationship between variables. The choice to regularize is totally up to you, and is based on how much risk tolerance or uncertainty you want to include in your model outcomes. If you have millions of data points from a highly representative sample, adding in that extra error will probably get you further from the truth of the relationship. If you have, like we do in this example, fewer than 1,000 subjects and are trying to generalize to the species level, you probably should do regularization, as the sample is small and not totally representative of your total population.
There are three types of regularization that are commonly used:
@@ -16,11 +16,11 @@ There are three types of regularization that are commonly used:
2. Ridge Regularization - best to use if you know all of the predictors in your model are useful
3. Elastic Net Regularization - best to use if you don't know how useful/useless your predictors are because it has the best of both worlds ([Chill it out take it slow than you'll rock out the show](https://www.youtube.com/watch?v=uVjRe8QXFHY)) - it does the math from lasso + math from ridge to get a new number that is called elastic net.
-In each case, applying regularization creates a value that adds together the loss/cost function and a lambda value, then multiples this number by a transformation of the slope of the fitted line. How the slope is transformed is where the three types of regularization differ. If you got a little lost in that last sentence, don't despair; we will describe each of the three parts that contribute to regularization in detail. I'll clarify this concept using an example you are likely familiar with, linear regression.
+In each case, applying regularization creates a value that adds a penalty to the [loss function](glossary.qmd#losscostfunction) (also called cost function). The penalty is a lambda value multiplied by a transformation of the slope of the fitted line. How the slope is transformed is where the three types of regularization differ. If you got a little lost in that last sentence, don't despair; we will describe each of the three parts that contribute to regularization in detail. I'll clarify this concept using an example you are likely familiar with, linear regression.
-[Loss/cost functions](glossary.qmd#losscostfunction) provide a single value that quantifies the difference between the predicted values and the actual values, measuring the model's performance. In a linear regression model, our loss/cost function is the sum of squared residuals (calculated by adding together all the squared values of the differences between each data point and the estimate of where the line says the data point should be), which is a value that represents how well the line predicted the data. If you have a small loss/cost value than you have data that is very good at accurately predicting outcome values. Therefore, you want to minimize the loss/cost value as much as possible. A small loss/cost function is the goal.
+Loss functions provide a single value that quantifies the difference between the predicted values and the actual values, measuring the model's performance. In a linear regression model, our loss function is the sum of squared residuals (calculated by adding together all the squared values of the differences between each data point and the estimate of where the line says the data point should be), which is a value that represents how well the line predicted the data. If you have a small loss value, then you have a model that is very good at accurately predicting outcome values. Therefore, you want to minimize the loss value as much as possible. A small loss value is the goal.
-The second term, Lambda, is a penalty term that adds a small amount of error to the slope of your line. This error term helps combat overfitting by making the slope slightly more or less steep. By adjusting the slope slightly you make it less dependent on the current training data and more likely to accurately predict the testing data (AKA you prevent overfitting). Lambda can be any positive number. The larger the lambda, the larger the penalty or error you are adding to the model. The value of lambda for a particular model is determined by iteratively going thru various values of lambda, testing each one with the data, until you find the lambda value that minimizes the loss/cost function. In this way, you are balancing the trade-off between fitting data well and overfitting.
+The second term, lambda, is a penalty term that adds a small amount of error to the slope of your line. This error term helps combat overfitting by making the slope slightly more or less steep. By adjusting the slope slightly, you make it less dependent on the current training data and more likely to accurately predict the testing data (that is, you prevent overfitting). Lambda can be any positive number. The larger the lambda, the larger the penalty or error you are adding to the model. The value of lambda for a particular model is determined by iteratively going through various values of lambda, testing each one with the data, until you find the lambda value that minimizes the loss function. In this way, you are balancing the trade-off between fitting data well and overfitting.
The difference between the three types of regularization occur in the third term, which represents how they transform the slope of each predictor. In Lasso regularization, you multiply by the absolute value of the slope (i.e., \|slope\|). In Ridge regularization you multiply by the squared slope (i.e., slope^2^). In elastic net, you multiply by adding together both penalties (i.e., \|slope\| + slope^2^).
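
The three penalties can be written compactly as follows. This is only a sketch for a single predictor; the symbols $\beta$ (the slope), $y_i$ (the observed outcomes), and $\hat{y}_i$ (the predicted outcomes) are notation introduced here for illustration.

$$
\begin{aligned}
\text{Loss (sum of squared residuals)} &= \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \\
\text{Lasso} &= \text{Loss} + \lambda \, |\beta| \\
\text{Ridge} &= \text{Loss} + \lambda \, \beta^2 \\
\text{Elastic net} &= \text{Loss} + \lambda \, (|\beta| + \beta^2)
\end{aligned}
$$

With more than one predictor, the penalty is summed over all of the slopes. Note that software implementations often weight the lasso and ridge parts of the elastic net penalty separately rather than adding them in equal measure, so the last line should be read as the simplest version of the idea.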
diff --git a/runningmodels.qmd b/runningmodels.qmd
index 6eedef2..870a64c 100644
--- a/runningmodels.qmd
+++ b/runningmodels.qmd
@@ -14,8 +14,8 @@ Now after all that explanation we finally get to the good part, running the actu
In 2016 a package came out called `mlr` (machine learning in R) that created a suite of tools to create machine learning models quickly and efficiently in R. This package is the only R package we will need for this section of the tutorial. The `mlr` package breaks creating machine learning models into three smaller steps:
-1. Define the [task]{#task}: This step consists of passing the data into an object and labeling what the outcome variable is.
-2. Define the [learner]{#learner}/model: A learner defines the structure of the algorithm. In this step you define/list the model type you want to use (i.e., logistic regression, KNN, random forest, etc.) and then the specific [hyperparameters](glossary.qmd#hyperparameters) you want the model to use. (i.e., what k to use in a KNN model) Learner is not a general term used throughout machine learning but instead is specific to the `mlr` package. You can also think of this step as defining your model.
+1. Define the [[task](#task)]{#task}: This step consists of passing the data into an object and labeling what the outcome variable is.
+2. Define the [[learner](#learner)]{#learner}/model: A learner defines the structure of the algorithm. In this step you define/list the model type you want to use (e.g., logistic regression, KNN, random forest) and then the specific [hyperparameters](glossary.qmd#hyperparameters) you want the model to use (e.g., what k to use in a KNN model). Learner is not a general term used throughout machine learning but instead is specific to the `mlr` package. You can also think of this step as defining your model.
3. [Train]{#train} the model with the appropriate cross validation approach: Put it all together along with the specified cross validation approach. At this step you create an object that takes the task, learner, and any other additional features and uses that information to output the trained model you can use to make predictions.
### Creating the Task
@@ -127,7 +127,7 @@ KNN_model_RQ_1_loocv <- suppressMessages({
saveRDS(KNN_model_RQ_1, file = "KNN_model_RQ_1.rds")
```
-Now that we made the model and then cross validated it using Leave one out cross validation, we can look at the loss/cost function to see how our model is performing. In this case our loss/cost functions are accuracy (acc) and mean misclassification error (mmce). The `aggr()` function gives you the two performance metrics I asked the model to pull out in the `measures` argument. A performance metric tells you how well the model predicted the correct category. As noted above, the **mmce** is mean misclassifications error or the proportion of incorrectly classified cases, and **acc** is the accuracy or the proportion of correctly classified cases. You will notice that the two together add up to 100%, so you really only need one or the other.
+Now that we have made the model and cross validated it using leave-one-out cross validation, we can look at the loss function to see how our model is performing. In this case our loss functions are accuracy (acc) and mean misclassification error (mmce). The `aggr` element of the cross validation result gives you the two performance metrics I asked the model to pull out in the `measures` argument. A performance metric tells you how well the model predicted the correct category. As noted above, the **mmce** is the mean misclassification error, or the proportion of incorrectly classified cases, and **acc** is the accuracy, or the proportion of correctly classified cases. You will notice that the two together add up to 100%, so you really only need one or the other.
```{r}
#Performance
@@ -165,7 +165,7 @@ decision_tree_model_RQ_1_loocv <- suppressMessages({
saveRDS(decision_tree_model_RQ_1, file = "decision_tree_model_RQ_1.rds")
```
-Now let's check how well the decision tree model is working with the data by looking at our loss/cost function.
+Now let's check how well the decision tree model is working with the data by looking at our loss function.
```{r}
#Performance of the Decision Tree
@@ -250,7 +250,7 @@ SVM_model_RQ_1_loocv <- suppressMessages({
measures = list(mmce, acc))
})
-#Output the Lost/Cost functions for this model
+#Output the Loss functions for this model
SVM_model_RQ_1_loocv$aggr
```
diff --git a/step3.qmd b/step3.qmd
index f5d1cf4..d8ba72c 100644
--- a/step3.qmd
+++ b/step3.qmd
@@ -7,16 +7,16 @@ training_data <- read.csv("training_data.csv")
# Step 3: Repeated Cross Validation {#crossval}
-Repeated cross validation is a little like data Russian nesting dolls. It is also simpler than most people expect it to be. Repeated cross validation simply refers to the concept of breaking your data into smaller sections to get an average of how well a model does on different subsets of the data. That way you have an average value for whatever you are interested in - whether than be your loss/cost function or predictive accuracy. An average allows a more accurate estimate for outcomes than relying on one partition of the data alone. If you only cut the data once, there is a likelihood that the training partition you randomly selected could have a disproportionately high number of outliers, or didn't have a good distribution of observation values. Cross validation is a way of making the randomization component of training a dataset less subject to chance.
+Repeated cross validation is a little like data Russian nesting dolls. It is also simpler than most people expect it to be. Repeated cross validation simply refers to the concept of breaking your data into smaller sections to get an average of how well a model does on different subsets of the data. That way you have an average value for whatever you are interested in, whether that be your loss function or predictive accuracy. An average allows a more accurate estimate for outcomes than relying on one partition of the data alone. If you only cut the data once, there is a chance that the training partition you randomly selected has a disproportionately high number of outliers, or doesn't have a good distribution of observation values. Cross validation is a way of making the randomization component of training a model less subject to chance.
```{r, echo=FALSE}
knitr::include_graphics("nesting_dolls.png")
```
-The three main types of cross validation are k-fold, leave one out, and nested cross validation. You use them in the following instances:
+The three main types of cross validation are leave one out, k-fold, and nested cross validation. You use them in the following instances:
- [Leave one out cross validation (LOOCV)](#loocv) - best to use when your dataset is very small (n \~ 500).
-- [K-fold](#kfold) cross validation - usually best to use when your dataset is large (n \> 1000). The K stands for any number you would like, usually 10-fold cross validation. K stands for how many times you would like the data to be partitioned.
+- [K-fold](#kfold) cross validation - usually best to use when your dataset is large (n \> 1000). K is the number of sections (folds) you would like the data partitioned into; 10 is a common choice.
- [Nested cross validation](#nested) - best to use when you need to tune [hyperparameters](glossary.qmd#hyperparameters) like k in KNN models or lambda in regularization techniques.
The cutoff of 1000 observations isn't a universal standard. Instead, what constitutes large or small means different things to different people depending on the field, the type of model you are running, and whether you need to tune hyperparameters. Yes, this is a vague answer for what is small and what is large. It's also the most accurate answer. But, to help you out: In psychology, most experiments not involving neuroimaging data from large consortiums or genome-wide association studies are usually considered "small" in the data science world. Large language models can be trained on billions of observations.
@@ -27,7 +27,7 @@ Now, let's break down these three types of cross validation to explain them furt
[K-fold cross validation](#kfold) cuts the data into k sections, with k being any whole number from 2 up to the number of observations. If your k is the number of observations, you have reinvented leave one out cross validation! (Fun, right?) Once the data is broken into k sections, it fits k models in turn, each time holding out a different fold as the validation/testing set, until every fold has been used for validation once.
-[Nested cross validation](#nested) runs two different cross validations with one within the other (like our nesting dolls!). The outer loop of a nested cross validation runs exactly like a k fold cross validation. The inner fold also runs exactly like a k fold cross validation, but instead of running on the entire data set but leaving 1 fold out, it runs on each fold of the outer loop's training set. Each loop run in the inner loop uses a different set of hyperparameters, like choosing what value of lambda to use in regularization or what k to use in a KNN model. When the inner loop concludes it can tell you what the optimal hyperparameter is based on which value gives the lowest loss/cost function. The outerloop then runs using the chosen [hyperparameter](glossary.qmd#hyperparameters) to return the best performing model given your data.
+[Nested cross validation](#nested) runs two different cross validations with one within the other (like our nesting dolls!). The outer loop of a nested cross validation runs exactly like a k-fold cross validation. The inner loop also runs like a k-fold cross validation, but instead of running on the entire data set, it runs only on the training portion of each outer fold. Each candidate set of hyperparameters, like what value of lambda to use in regularization or what k to use in a KNN model, is evaluated in the inner loop. When the inner loop concludes it can tell you what the optimal hyperparameter is based on which value gives the lowest loss function. The outer loop then fits a model using the chosen [hyperparameter](glossary.qmd#hyperparameters) and evaluates it on the held-out outer fold, so you get the best performing model given your data along with a fair estimate of how well it performs.
There are a few other types of cross validation, but they are all variations on cutting things into parts and running small data sections through the model. The more computing power and time you have access to, the fancier you can make your cross-validation procedure. I won't go into detail on some of the more complicated cross validation practices as the fancier ones do best on very large datasets in the millions where all the cutting/partitioning can still be meaningful. If your data, like in this tutorial's example, has less than 1000 observations, it is less likely that increasing the complexity of your cross validation would be meaningful, helpful, or worth the time and computing power.
@@ -39,13 +39,13 @@ There are multiple ways in R to create code to run a cross validation. We will u
loocv <- makeResampleDesc(method = "LOO") #define parameters for cross validation
-#10fold_cross_validation <- makeResampleDesc(method = "RepCV", folds = 10, reps = 10, stratify = TRUE) #If you want a different number of folds you can change the number to anything you like. If your number of folds is the same number as your observations than you have remade LOOCV!
+ten_fold_cross_validation <- makeResampleDesc(method = "RepCV", folds = 10, reps = 10, stratify = TRUE) #If you want a different number of folds you can change the number to anything you like. If your number of folds is the same as your number of observations, then you have remade LOOCV!
```
Above is the only code you need for now. When you run your model, you will set the `resampling` argument to `loocv`. We will add `loocv` to our models when we actually run the model, but to show you what it will look like, here is some dummy code:
-```{r}
+```{r eval=FALSE}
-#model <- resample(learner = knn, task = data, resampling = loocv)
+model <- resample(learner = knn, task = data, resampling = loocv)
```
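As a rough illustration of what a k-fold split does under the hood, here is a minimal base R sketch. It does not use `mlr`; the toy sizes (`n`, `k`) and the names `fold_id`, `training_rows`, and `validation_rows` are made up for this example and are not part of the tutorial's code.

```r
# Hypothetical toy example: assign 20 observations to k = 5 folds.
set.seed(123)  # fixed seed so the fold assignment is reproducible
n <- 20        # assumed number of observations
k <- 5         # assumed number of folds

# Give every row a fold label (1..k), then shuffle the labels at random.
fold_id <- sample(rep(seq_len(k), length.out = n))

# Each fold takes one turn as the validation set; the rest is training data.
for (fold in seq_len(k)) {
  validation_rows <- which(fold_id == fold)
  training_rows   <- which(fold_id != fold)
  cat("Fold", fold, ": train on", length(training_rows),
      "rows, validate on", length(validation_rows), "rows\n")
}
```

Setting `k` equal to `n` in this sketch leaves one row in each validation set, which is the leave one out case described above.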
diff --git a/step4.qmd b/step4.qmd
index 8324a06..4a08d87 100644
--- a/step4.qmd
+++ b/step4.qmd
@@ -22,14 +22,14 @@ Throughout this tutorial so far, I have introduced you to 5 main model types: Lo
- Outcome variable type: Takes only categorical outcome variables
- Predictor variable type: Takes both categorical and continuous predictor variables
-3. Naive Bayes - A [Naiva Bayes](https://www.youtube.com/watch?v=O2L2Uv9pdDA&t=228s) algorithm multiples the probabilities of each feature being found in a class/category assuming independence. For example, if a dog is less than 2 years old in 6 out of 10 observations than the probability of the feature "the dog is less than 2" is 60%. The probabilities of each feature are multiplied with all other probabilities, and the category with the higher value per observation tells the computer which bin to classify the observation in.
+3. Naive Bayes - A [Naive Bayes](https://www.youtube.com/watch?v=O2L2Uv9pdDA&t=228s) algorithm multiplies the probabilities of each feature being found in a class/category, assuming independence. For example, if a dog is less than 2 years old in 6 out of 10 observations, then the probability of the feature "the dog is less than 2" is 60%. The probabilities of each feature are multiplied with all other probabilities, and the category with the highest value per observation tells the computer which bin to classify the observation in.
- Pros: Doesn't need to be tuned for hyperparameters; computationally inexpensive; can handle minimal amount of missing data if need be (good rule of thumb is less than 5% per column); works best on classifying based on words
- Cons: Assumes predictor variables are normally distributed; assumes predictors are independent and suffers a lot when they aren't
- Outcome variable type: Takes only categorical outcome variables
- Predictor variable type: Takes both categorical and continuous predictor variables
-4. Decision Tress/Random Forest - [A decision tree](https://www.youtube.com/watch?v=_L39rN6gz7Y) is a flow chart made by the algorithm that follows a line of logic to its conclusion. You start at the **root** and you go down a branch based on a feature of the data (i.e. high or low score on training) until you reach a **leaf** (bin) that has your final guess for which category a new datapoint should belong to. **Node's** are any point between the start of the tree and a leaf where the algorithm makes a decision to move from branch to branch to leaf. Each node (decision point) partitions predictors based on values that best categorize outcome variables.
+4. Decision Trees/Random Forest - A [decision tree](https://www.youtube.com/watch?v=_L39rN6gz7Y) is a flow chart made by the algorithm that follows a line of logic to its conclusion. You start at the **root** and you go down a branch based on a feature of the data (e.g., high or low score on training) until you reach a **leaf** (bin) that has your final guess for which category a new datapoint should belong to. **Nodes** are any points between the start of the tree and a leaf where the algorithm makes a decision to move from branch to branch to leaf. Each node (decision point) partitions predictors based on values that best categorize outcome variables.
- Pros: Flexible; easily interpretable; no assumptions; more robust to missing data than most other algorithms (good rule of thumb is no more than 10% per column though as predictive accuracy gets worse the more missingness you have); good with outliers. Running multiple decision trees on different subsections of the data and averaging prediction outcomes is called a random forest. This is easy to remember as many trees make a forest.
- Cons: Individual trees WILL most likely overfit data, so they are rarely used. Instead, I have grouped decision trees with how they are most often used: in a random forest.
diff --git a/whatisclassML.qmd b/whatisclassML.qmd
index 9c5e65f..dd05b2c 100644
--- a/whatisclassML.qmd
+++ b/whatisclassML.qmd
@@ -25,7 +25,7 @@ To conceptualize what you should do on a classification machine learning journey
So simple, right? Okay, so maybe you aren't ready to go run models on your own yet ;) Let's discuss what each step does and why before we jump into how to do it:
1. **Explore data and check assumptions**: Before you can run any models you need to explore your dataset to understand what types of variables you have, clean up the data, and deal with any missing values. Once these steps are complete you must check if the properties of your dataset meet the necessary criteria for each potential algorithm. Each algorithm has its own set of assumptions that must be met for it to work as intended. If the assumptions of an algorithm are violated in the working dataset, then models created by that specific algorithm are often useless[\*\*](#footnote3). If your model assumes condition x, but your data violates condition x, then the model outcomes will be less predictive and, in some cases, essentially worthless. When assumptions are not met, the model's outputs can be biased, inefficient, or inaccurate, leading to poor performance or incorrect conclusions. Once you know which algorithms' assumptions are met or violated you can choose the appropriate algorithms to use in step 4.
-2. **Randomly partition the data**: Breaking data into training and test sets allows you to reserve the testing data to evaluate how the model you create performs on new, unseen data. The test set is the unseen data, and the training data is what you use to estimate model parameters. TO partition the data into training and test sets, it makes sense to randomly partition them to avoid any weird order effect that may go along with data collection. But a problem with randomly partitioning data is that every time you run the analysis, you'll use a different random partition, which can lead to a slightly different result. This means that your analysis is not fully reproducible. So anyone running your code (including you, when you comeback to rerun your analyses!) will not get exactly the same results. To ensure reproducible results, you should set a seed (i.e., a fixed starting point for the randomization process). Anyone who uses that seed will use the same randomization process.
+2. **Randomly partition the data**: Breaking data into training and test sets allows you to reserve the testing data to evaluate how the model you create performs on new, unseen data. The test set is the unseen data, and the training data is what you use to estimate model parameters. To partition the data into training and test sets, it makes sense to randomly partition them to avoid any weird order effect that may go along with data collection. But a problem with randomly partitioning data is that every time you run the analysis, you'll use a different random partition, which can lead to a slightly different result. This means that your analysis is not fully reproducible. So anyone running your code (including you, when you come back to rerun your analyses!) will not get exactly the same results. To ensure reproducible results, you should set a seed (i.e., a fixed starting point for the randomization process). Anyone who uses that seed will use the same randomization process.
3. **Run repeated cross validation**: Above you partitioned the data once, which is the most basic form of cross validation. However, partitioning the data only once can result in weird, uneven distributions of data between the training and testing sets. Instead, we run a bunch of these partitions (called repeated cross validation) and average over them to reduce bias and variance. If you forgo this step, you will likely get a more biased estimation of which model is going to accurately predict new data because of random noise in the training data, rather than the true relationship between your predictors and outcome variables as they exist in the world outside your sample.
4. **Choose an algorithm and run with hyperparameters**: Using the characteristics of your data (number of predictors, type of outcome variables, etc.) and [model assumptions](#assumptions) (collinearity, outliers, normality, etc.) you can narrow down your analysis to a number of models that you *could* use to make predictions with a specific dataset. However, you cannot know what single model will run best until you actually run the models and compare the outcomes. This step also includes adding in any necessary [hyperparameters](#hyperparameters) the computer needs to run a specific model.
5. **Assess model performance**: After you have run all the models that you are interested in, you then need to determine how well each model predicted the test set or unseen data. [Generalization error](https://medium.com/@yixinsun_56102/understanding-generalization-error-in-machine-learning-e6c03b203036) is the term for this. It is a measure of how well a model's predictions match the true values in data the model has not seen during training. In things like a logistic regression this can be as simple as calculating the percent of categories predicted correctly, or as complex as creating an [ROC](https://www.youtube.com/watch?v=4jRBRDbJemM) graph. Basically, in this step we are using standardized metrics to evaluate how well each possible model predicted the testing data.