2 changes: 1 addition & 1 deletion datacleaning.qmd
Expand Up @@ -6,7 +6,7 @@ manydogs_data <- read.csv("manydogs_etal_2024_data.csv")

## Data Set-up: Tidying and Feature Selection

Before you can begin working with your data you must make sure that each row is a single observation, and each column is a single variable/predictor. This type of data set-up or "wrangling" is known as "tidy data". There are many readily available tutorials and textbooks that help you understand tidy data and how to clean and wrangle data to make it tidy. I recommend the [Tidyverse](https://r4ds.had.co.nz/tidy-data.html) chapter in the R data science textbook to start. Thankfully, the dataset from ManyDogs is already tidy so for this tutorial we can skip this step.
Before you can begin working with your data you must make sure that each row is a single observation, and each column is a single variable/predictor. This type of data set-up or "wrangling" is known as "tidy data". There are many readily available tutorials and textbooks that help you understand tidy data and how to clean and wrangle data to make it tidy. I recommend the tidy data chapter in the R for Data Science textbook to start. Thankfully, the dataset from ManyDogs is already tidy so for this tutorial we can skip this step.

After your data is tidy, the next step is to complete feature selection. [*Feature selection*](https://www.geeksforgeeks.org/feature-selection-techniques-in-machine-learning/) is a fancy term for removing variables you aren't going to analyze and creating new ones by computing any necessary variables. Many of the changes you complete in this step are decided on through [domain knowledge](https://corporatefinanceinstitute.com/resources/data-science/domain-knowledge-data-science/): applying what you know about the field of research to make judgment calls on what is and isn't important to the model/data. The rest of our decisions depend on our specific research hypotheses. Anything in the data set that does not specifically pertain to our research hypotheses needs to be eliminated to increase statistical power and to reduce the time and computing resources needed to fit the model. Commonly deleted variables at this stage might include metadata, such as the time of day when a survey was completed, or individual scale items once we have calculated the total scores. When domain knowledge doesn't suffice because you are working in a relatively new field, or past studies have conflicting information, you will want to let machine learning algorithms help choose what to keep and eliminate. See the [regularization](regularization.qmd#regularization) section for more information.
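
As a minimal sketch of what feature selection can look like in code, the example below uses a small made-up survey data frame. The column names (`completed_time`, `item_1` to `item_3`, `total_score`) are hypothetical and are not ManyDogs variables. It uses base R only, so it runs without the tidyverse.

```r
# Toy data: hypothetical column names, not the ManyDogs variables
survey <- data.frame(
  id = 1:4,
  completed_time = c("09:15", "13:40", "18:05", "21:30"),  # metadata we won't analyze
  item_1 = c(1, 3, 2, 5),
  item_2 = c(2, 3, 1, 4),
  item_3 = c(1, 4, 2, 5)
)

# Create a new variable: the total score across the individual scale items
item_cols <- c("item_1", "item_2", "item_3")
survey$total_score <- rowSums(survey[, item_cols])

# Remove the metadata and the individual items now that we have the total
survey_selected <- survey[, !(names(survey) %in% c("completed_time", item_cols))]

survey_selected
```

Only the identifier and the computed total remain, which is the shape of data you would carry forward into modeling.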

2 changes: 1 addition & 1 deletion datadescription.qmd
Expand Up @@ -6,4 +6,4 @@ The data being used in this tutorial is from the [ManyDogs Project](https://many
knitr::include_graphics("md1_setup.jpg")
```

This dataset is a great example to use when investigating machine learning predictive classification models, as it has many possible predictors to investigate with a discrete binary dependent variable (i.e. whether the dog chose correctly). Furthermore, all the data from this project is available to anyone to share or adapt with attribution, which makes it ideal to use as a learning tool. To work through this tutorial with me, you will first need to download the data from the GitHub repository associated with the project and/or create a local clone of the project repo on your computer. I recommend using the [GitHub desktop application](https://docs.github.com/en/desktop/overview/getting-started-with-github-desktop) to easily get and give information from/to a GitHub repository using a point and click method instead of the command line.
This dataset is a great example to use when investigating machine learning predictive classification models, as it has many possible predictors to investigate with a discrete binary dependent variable (i.e. whether the dog chose correctly). Furthermore, all the data from this project is available to anyone to share or adapt with attribution, which makes it ideal to use as a learning tool. To work through this tutorial with me, you will first need to [download the data](https://github.com/ManyDogsProject/md1_data/blob/main/manydogs_etal_2024_data.csv) from the GitHub repository associated with the project and/or create a local clone of the project repo on your computer. I recommend using the [GitHub desktop application](https://docs.github.com/en/desktop/overview/getting-started-with-github-desktop) to easily get and give information from/to a GitHub repository using a point and click method instead of the command line.
4 changes: 2 additions & 2 deletions glossary.qmd
Expand Up @@ -7,7 +7,7 @@
- Feature Selection <a id="feature"></a> - [Feature Selection](https://domino.ai/data-science-dictionary/feature-selection) is a process that eliminates unnecessary features (AKA predictors or variables) from the data to help a model perform as well as possible with given data.
- Generalization error <a id="generror"></a> - [Generalization error](https://medium.com/@yixinsun_56102/understanding-generalization-error-in-machine-learning-e6c03b203036) is a measure of how well an algorithm performs on the testing data. Reminder: testing data is held out at the train and test split step at the beginning of a machine learning project.
- Hyperparameters <a id="hyperparameters"></a> - [Hyperparameters](https://towardsdatascience.com/parameters-and-hyperparameters-aa609601a9ac) are any variables in a model that affect its accuracy and precision but are not learned from the data. This is in contrast to variables in the model that are learned from the data (i.e. parameters). For example, in a random forest model the number of decision trees that you want the computer to run is a hyperparameter, but the best predictor to sample at the first node of each tree is a parameter, as that is what the model learns using the data. Choosing the best starting value for different types of hyperparameters is a whole book/tutorial in itself! In fact, someone wrote a [book](https://library.oapen.org/viewer/web/viewer.html?file=/bitstream/handle/20.500.12657/60840/978-981-19-5170-1.pdf?sequence=1&isAllowed=y) on how to tune hyperparameters in R that is freely available. Please see this book and seek out additional resources on your own when tuning hyperparameters for your own analyses.
- Loss/Cost Function <a id="losscostfunction"></a> - A [loss/cost function](https://www.enjoyalgorithms.com/blog/loss-and-cost-functions-in-machine-learning) quantifies the difference between the predicted values and the actual values, measuring the model's performance. The goal of machine learning is to minimize the loss function to improve the model's accuracy and generalizability. There are many different types of functions that measure model performance, so loss/cost function is an umbrella term. A loss function refers to the performance of a single data point while a cost function refers to the average performance across a dataset.
- Loss Function <a id="losscostfunction"></a> - A [loss function](https://www.enjoyalgorithms.com/blog/loss-and-cost-functions-in-machine-learning) (also called a cost function) quantifies the difference between the predicted values and the actual values, measuring the model's performance. The goal of machine learning is to minimize the loss function to improve the model's accuracy and generalizability. There are many different types of functions that measure model performance, so loss/cost function is an umbrella term. A loss function refers to the performance of a single data point while a cost function refers to the average performance across a dataset.
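
To make the distinction between a loss and a cost concrete, here is a minimal sketch using log loss (binary cross-entropy), one common choice for binary classification. The outcomes and predicted probabilities are made up for illustration, and log loss is an assumption here: the glossary does not commit to a specific function.

```r
# Toy example: actual outcomes (0/1) and made-up predicted probabilities
actual    <- c(1, 0, 1, 1, 0)
predicted <- c(0.9, 0.2, 0.6, 0.4, 0.1)

# Loss: one value per data point (here, log loss / binary cross-entropy)
log_loss <- -(actual * log(predicted) + (1 - actual) * log(1 - predicted))

# Cost: the average loss across the whole dataset
cost <- mean(log_loss)

round(log_loss, 3)
cost
```

The fourth observation has the largest loss because the model gave a probability of only 0.4 to an outcome that actually occurred.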

## Package Versions

Expand Down Expand Up @@ -37,5 +37,5 @@ Below I have listed the R and package versions I am using. If you are reading th
- Version 4.3.3
- randomForest
- Version 4.7.1.1

# References
87 changes: 47 additions & 40 deletions otherassumptions.qmd
Expand Up @@ -25,7 +25,7 @@ assumptiontable <- data.frame(
High_Dimensionality = c("No", "Yes", "No", "No", "No"),
Feature_Scaling = c("Yes","Yes","No","No","Yes"))

kable(assumptiontable)
kable(assumptiontable, col.names = sub("_", " ", names(assumptiontable)))
```

The above table lists the model assumptions that need to be understood and investigated further. Make sure you investigate these assumptions per *research question*, as different questions will use different data. If an assumption is violated, you can either transform your data to meet the assumption or eliminate that test from your analyses. Below I will show you how to run checks for independence, normality, strong outliers, linearity, [high dimensionality](#np), and how to [feature scale](#featurescaling) your data.
Expand Down Expand Up @@ -65,31 +65,34 @@ aggression_qqplot <- qqnorm(manydogs_missing_handled$aggression_score)
qqline(manydogs_missing_handled$aggression_score)
aggression_density <- plot(density(manydogs_missing_handled$aggression_score), main = "Density Plot of Aggression")


```

```{r eval=FALSE}
#Plots for Attachment
#attachment_qqplot <- qqnorm(manydogs_missing_handled$attachment_score)
#qqline(manydogs_missing_handled$attachment_score)
#attachment_density <- plot(density(manydogs_missing_handled$attachment_score), main = "Density Plot of Attachment")
attachment_qqplot <- qqnorm(manydogs_missing_handled$attachment_score)
qqline(manydogs_missing_handled$attachment_score)
attachment_density <- plot(density(manydogs_missing_handled$attachment_score), main = "Density Plot of Attachment")

#Plots for Excitability
#excitability_qqplot <- qqnorm(manydogs_missing_handled$excitability_score)
#qqline(manydogs_missing_handled$excitability_score)
#excitability_density <- plot(density(manydogs_missing_handled$excitability_score), main = "Density Plot of Excitability")
excitability_qqplot <- qqnorm(manydogs_missing_handled$excitability_score)
qqline(manydogs_missing_handled$excitability_score)
excitability_density <- plot(density(manydogs_missing_handled$excitability_score), main = "Density Plot of Excitability")

#Plots for Fear
#fear_qqplot <- qqnorm(manydogs_missing_handled$fear_score)
#qqline(manydogs_missing_handled$fear_score)
#fear_density <- plot(density(manydogs_missing_handled$fear_score), main = "Density Plot of Fear")
fear_qqplot <- qqnorm(manydogs_missing_handled$fear_score)
qqline(manydogs_missing_handled$fear_score)
fear_density <- plot(density(manydogs_missing_handled$fear_score), main = "Density Plot of Fear")

#Plots for Miscellaneous
#miscellaneous_qqplot <- qqnorm(manydogs_missing_handled$miscellaneous_score)
#qqline(manydogs_missing_handled$miscellaneous_score)
#miscellaneous_density <- plot(density(manydogs_missing_handled$miscellaneous_score), main = "Density Plot of Miscellaneous")
miscellaneous_qqplot <- qqnorm(manydogs_missing_handled$miscellaneous_score)
qqline(manydogs_missing_handled$miscellaneous_score)
miscellaneous_density <- plot(density(manydogs_missing_handled$miscellaneous_score), main = "Density Plot of Miscellaneous")

#Plots for Separation
#separation_qqplot <- qqnorm(manydogs_missing_handled$separation_score)
#qqline(manydogs_missing_handled$separation_score)
#separation_density <- plot(density(manydogs_missing_handled$separation_score), main = "Density Plot of Separation")

separation_qqplot <- qqnorm(manydogs_missing_handled$separation_score)
qqline(manydogs_missing_handled$separation_score)
separation_density <- plot(density(manydogs_missing_handled$separation_score), main = "Density Plot of Separation")
```

Unfortunately, none of the plots for our research question 3 fit the normality assumption. We know this because the density plots do not show a smooth bell curve (there are multiple peaks instead of just one in the center) and the graphs aren't symmetrical. In the Q-Q plots, the data deviate substantially from the normality line, with far more than 25% of the points not touching the line. There are also deviations at the tail ends, with the two tails going off in different directions. Therefore, the normality assumption is not met and we cannot run a Naive Bayes model with this data.
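
If you want to practice reading these plots, the sketch below simulates one roughly normal score and one with two peaks, then draws both plot types in a loop rather than repeating the code for each variable. The data and column names are simulated and hypothetical, not ManyDogs data. The Shapiro-Wilk test at the end is an optional numerical check that I am adding here for illustration; it is not part of the workflow above.

```r
set.seed(123)  # for reproducibility of the simulated data

# Simulated scores (not ManyDogs data): one roughly normal, one with two peaks
toy_scores <- data.frame(
  normal_score  = rnorm(200, mean = 2, sd = 0.5),
  bimodal_score = c(rnorm(100, mean = 1, sd = 0.3), rnorm(100, mean = 3, sd = 0.3))
)

# Draw a Q-Q plot and a density plot for every column in one loop
for (score in names(toy_scores)) {
  qqnorm(toy_scores[[score]], main = paste("Q-Q Plot of", score))
  qqline(toy_scores[[score]])
  plot(density(toy_scores[[score]]), main = paste("Density Plot of", score))
}

# Optional numerical check: a small p-value suggests a departure from normality
shapiro_p <- sapply(toy_scores, function(x) shapiro.test(x)$p.value)
shapiro_p
```

The two-peaked score should look like the problem cases described above: a density plot with more than one peak and Q-Q points that bend away from the line.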
Expand Down Expand Up @@ -122,7 +125,7 @@ One last note on outliers: The Z-score method is also used to detect outliers. H

### Linearity

Linearity is our next assumption. [Linearity](https://www.bookdown.org/rwnahhas/RMPH/mlr-linearity.html) assumes that *each* predictor in the model, when holding all the other predictors in the model constant, will change in a linear way with the outcome variable. (In other words, don't put a line through something that is not a line). Unlike in linear regression, in logistic regression, we are looking at the log odds of the **probability** of the outcome (i.e., being in a particular category). To diagnose if this is true or not, we need to make a graph of this relationship and see if the plot satisfies the assumption. These plots are called component plus residual (CR) plots, which can be made with the `crPlots` function in the `car` package.
Linearity is our next assumption. [Linearity](https://www.bookdown.org/rwnahhas/RMPH/mlr-linearity.html) assumes that *each* predictor in the model, when holding all the other predictors in the model constant, will change in a linear way with the outcome variable. (In other words, don't put a line through something that is not a line). Unlike in linear regression, in logistic regression, we are looking at the log odds of the **probability** of the outcome (i.e., being in a particular category). To diagnose if this is true or not, we need to make a graph of this relationship and see if the plot satisfies the assumption. These plots are called component plus residual (CR) plots, which can be made with the `crPlots()` function in the `car` package.

We will make a plot for each continuous predictor per research question. We will break the code into three sections, one for each research question. You do not need to check the categorical predictors as they are always linear. To read more about why, see [here](https://www.bookdown.org/rwnahhas/RMPH/mlr-linearity.html).
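
To see what a CR plot is made of, here is a rough base R sketch that builds one by hand from partial residuals. It uses simulated data in which the log odds really are linear in `age`, so the smoother should stay close to the straight line. The variable names and the simulated model are assumptions for illustration only. This approximates the idea behind `crPlots()` and is not a replacement for it.

```r
set.seed(42)

# Simulated data (not ManyDogs): binary outcome with a linear log-odds effect of age
n        <- 300
age      <- runif(n, min = 1, max = 12)
training <- rnorm(n)
log_odds <- -1 + 0.25 * age + 0.5 * training
correct  <- rbinom(n, size = 1, prob = plogis(log_odds))

toy_model <- glm(correct ~ age + training, family = binomial)

# Partial residuals: one column per predictor (component + residual)
partial_res <- residuals(toy_model, type = "partial")

# Hand-made component plus residual plot for age
plot(age, partial_res[, "age"], pch = 20, col = "gray",
     xlab = "age", ylab = "Component + Residual")
abline(lm(partial_res[, "age"] ~ age), lty = 2)         # fitted straight line
lines(lowess(age, partial_res[, "age"]), col = "blue")  # smoother to compare against it
```

When the smoother bends clearly away from the dashed line, that predictor likely violates the linearity assumption.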

Expand Down Expand Up @@ -162,33 +165,37 @@ age_crplot <- crPlots(model_RQ_3, terms = ~age,
pch=20, col="gray",
smooth = list(smoother=car::gamLine))

#training_crplot <- crPlots(model_RQ_3, terms = ~training_score,
#pch=20, col="gray",
#smooth = list(smoother=car::gamLine))
```

```{r eval=FALSE}

training_crplot <- crPlots(model_RQ_3, terms = ~training_score,
pch=20, col="gray",
smooth = list(smoother=car::gamLine))

#aggression_crplot <- crPlots(model_RQ_3, terms = ~aggression_score,
#pch=20, col="gray",
#smooth = list(smoother=car::gamLine))
aggression_crplot <- crPlots(model_RQ_3, terms = ~aggression_score,
pch=20, col="gray",
smooth = list(smoother=car::gamLine))

#fear_crplot <- crPlots(model_RQ_3, terms = ~fear_score,
#pch=20, col="gray",
#smooth = list(smoother=car::gamLine))
fear_crplot <- crPlots(model_RQ_3, terms = ~fear_score,
pch=20, col="gray",
smooth = list(smoother=car::gamLine))

#separation_crplot <- crPlots(model_RQ_3, terms = ~separation_score,
#pch=20, col="gray",
#smooth = list(smoother=car::gamLine))
separation_crplot <- crPlots(model_RQ_3, terms = ~separation_score,
pch=20, col="gray",
smooth = list(smoother=car::gamLine))

#excitability_crplot <- crPlots(model_RQ_3, terms = ~excitability_score,
#pch=20, col="gray",
#smooth = list(smoother=car::gamLine))
excitability_crplot <- crPlots(model_RQ_3, terms = ~excitability_score,
pch=20, col="gray",
smooth = list(smoother=car::gamLine))

#attachment_crplot <- crPlots(model_RQ_3, terms = ~attachment_score,
#pch=20, col="gray",
#smooth = list(smoother=car::gamLine))
attachment_crplot <- crPlots(model_RQ_3, terms = ~attachment_score,
pch=20, col="gray",
smooth = list(smoother=car::gamLine))

#miscellaneous_crplot <- crPlots(model_RQ_3, terms = ~miscellaneous_score,
#pch=20, col="gray",
#smooth = list(smoother=car::gamLine))
miscellaneous_crplot <- crPlots(model_RQ_3, terms = ~miscellaneous_score,
pch=20, col="gray",
smooth = list(smoother=car::gamLine))

```

Expand All @@ -202,7 +209,7 @@ Now that you know why we transform predictors by scaling them, let's scale our c

To see explanations of other types of transformations, see this great [guide](https://rpubs.com/zubairishaq9/how-to-normalize-data-r-my-data) on RPubs.

We will use the function `scale` to apply the z-score function to our continuous predictor columns. This function comes with base R, so no need to install another package.
We will use the function `scale()` to apply the z-score function to our continuous predictor columns. This function comes with base R, so no need to install another package.
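
Before applying this to the ManyDogs columns, here is a quick sketch on made-up values showing that `scale()` gives the same result as computing the z-score by hand, that is, subtracting the mean and dividing by the standard deviation. The data frame and its column names are hypothetical.

```r
# Toy predictors on very different scales (made-up values, not ManyDogs data)
toy <- data.frame(
  age_months     = c(12, 24, 36, 60, 96),
  training_score = c(1.5, 2.0, 3.5, 4.0, 2.5)
)

# scale() centers each column on its mean and divides by its standard deviation
toy_scaled <- as.data.frame(scale(toy))

# The same z-score computed by hand for one column
age_by_hand <- (toy$age_months - mean(toy$age_months)) / sd(toy$age_months)

all.equal(toy_scaled$age_months, age_by_hand)
round(colMeans(toy_scaled), 10)  # each column is now centered on 0
sapply(toy_scaled, sd)           # and has a standard deviation of 1
```

Note that `scale()` returns a matrix, which is why the sketch wraps it in `as.data.frame()`.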

```{r}
manydogs_transformed <- manydogs_missing_handled %>%