Code: CHEP 898
Term: 2026 Winter
Delivery: In person
Location: HLTH 2334
Start Date: January 6th, 2026
Time: Tuesday 1:00-3:50 pm
This course bridges the gap between data science techniques, biostatistics, and health research, equipping students with advanced tools to analyze complex health data. It covers both the theoretical and practical aspects of machine learning. Through hands-on experience with the R programming language, supervised and unsupervised machine learning techniques, and data visualization, students will learn to process and interpret large datasets to uncover insights into disease patterns, public health trends, and causal relationships. This will prepare learners to tackle health science challenges with modern data-driven approaches.
The official syllabus for this course is available for download here
- CHEP/PUBH 805, or a graduate-level Statistics Course.
- To request permission to take this course, please submit an override request with one of the instructors as approver: https://jira.usask.ca/servicedesk/customer/portal/7/create/291
- On your ticket request, please explain how you meet the course prerequisites and how you hope the course can further your learning and research needs.
- Please include the CHEP graduate program administrator, Stephanie Kehrig, on the approval request ticket.
- Interested Faculty and Researchers are invited to contact the instructors with content questions, or Stephanie.Kehrig@usask.ca with questions about the registration process.
I acknowledge our shared connection to the land and recognize that Indigenous and Métis peoples on Treaty 6 Territory and all Indigenous peoples have been and continue to be stewards for social justice, equity, and land-based education. In the spirit of reconciliation may we all strive to learn and support the work of Indigenous communities as allies.
This course will follow the general USask Guidelines about AI for Educators and Students (https://leadership.usask.ca/initiatives/ai/index.php). The University has developed high level guidance based on the European Network for Academic Integrity (ENAI) recommendations. The principles are descriptions of USask intentions for, and beliefs about, the use of AI. They include 4 categories:
- Ethical and Responsible Use
- Literacy
- Tool Use
- Change and Innovation
In general, my opinion is that you should explore these tools, what they can do, and how you can integrate them into your work. These tools are great for editing, formatting, generating ideas, and writing very basic code. USask faculty and students have access to Microsoft Copilot (https://teaching.usask.ca/learning-technology/tools/microsoft-copilot.php). It's critical that when you use these tools you are very aware of bias and that you intervene to correct the text. Here are my general rules for AI in this course.
- You can use AI tools for any or all parts of the work.
- If you do, you must cite your work (as above).
  - 2.1. Acknowledge AI tools: “All persons, sources, and tools that influence the ideas or generate the content should be properly acknowledged” (p. 3). Acknowledgement may be done in different ways, according to context and discipline, and should include the input to the tool.
  - 2.2. Do not list AI tools as authors: Authors must take responsibility and be accountable for content and an AI tool cannot do so.
  - 2.3. Recognize limits and biases of AI tools: Inaccuracies, errors, and bias are reproduced in AI tools in part because of the human produced materials used for training.
- If you do, you must include a 200-word reflective essay about the experience as part of your self-evaluation.
- Be very careful with references. Many of these tools just make up references.
- I will not use tools like GPTZero to detect whether you have used AI tools or not. We are making an agreement to be honest with each other here. This is a small class. We have that luxury.
Dr. Daniel Fuller daniel.fuller@usask.ca
Dr. Erfan Hoque erfan.hoque@usask.ca
- Understand the basics of data wrangling and data management in epidemiology.
- Gain proficiency in using Git and GitHub for version control.
- Learn to leverage high-performance computing resources for epidemiologic data analysis.
- Explore various machine learning techniques and their applications in epidemiology.
- Compare and contrast traditional epidemiological analysis methods with machine learning approaches.
There is not one textbook for this course. We will use various components of different open access resources.
R for Data Science (2e). 2024. Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. https://r4ds.hadley.nz/
An Introduction to Statistical Learning with Applications in R (2e). 2024. Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani. https://www.statlearning.com/
The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2e). 2009. Trevor Hastie, Robert Tibshirani, and Jerome Friedman. https://hastie.su.domains/ElemStatLearn/
Learn Tidymodels. https://www.tidymodels.org/learn/
Use of a statistical software program (R) is required for this course. You will also be asked to install other software including PostgreSQL (SQL) and Git.
In this course we will use the CanPath Student Dataset that provides students the unique opportunity to gain hands-on experience working with CanPath data. The CanPath Student Dataset is a synthetic dataset that was manipulated to mimic CanPath’s nationally harmonized data but does not include or reveal actual data of any CanPath participants.
The CanPath Student Dataset is available to instructors at a Canadian university or college for use in an academic course, at no cost. CanPath will provide the Student Dataset and a supporting data dictionary.
- Large sample size (Over 40,000 participants)
- Real-world population-level Canadian data
- Variety of areas of information allowing for a wide range of research topics
- No cost to faculty
- Potential for students to apply for real CanPath data to publish their findings
| Week | Date | Topic | Data Work |
|---|---|---|---|
| 1 | January 6 | Intro to Data Science (Intro to Machine/Statistical Learning) | Data Wrangling and Visualization |
| 2 | January 13 | Data Visualization and Version Control/GitHub | HappyGitwithR |
| 3 | January 20 | Regression – Linear Regression and Optimization | Linear Regression |
| 4 | January 27 | Classification – Logistic Regression and KNN | Logistic Regression and KNN |
| 5 | February 3 | Unsupervised Learning – PCA | Principal Component Analysis |
| 6 | February 10 | Unsupervised Learning – Clustering | Clustering Methods |
| 7 | February 17 | Reading Week | |
| 8 | February 24 | Validation – Cross Validation + Bootstrapping | Cross Validation + Bootstrapping – Applications with Linear Regression |
| 9 | March 3 | Ensemble Methods – Random Forest | Random Forest |
| 10 | March 10 | Causal Inference – Causal Forest | Causal Forest + Matching |
| 11 | March 17 | Ensemble Methods – Artificial Neural Networks | Artificial Neural Networks |
| 12 | March 24 | Ensemble Methods – Transformers/Self-Supervised Learning | Artificial Neural Networks Part 2 |
| 13 | March 31 | Scientific Computing | Scientific Computing + Full ML Implementation |
- Subject to change depending on speed
Attendance, participation, and reading ahead are critical to this course. There will be a lot of time allocated for discussion and working on assignments, but reading ahead is a critical aspect of the learning process.
You can find the detailed descriptions for all assignments below or in the assignments folder here
| # | Assignment | Grade % | Due Date |
|---|---|---|---|
| 1 | Data Wrangling and Visualization | 10% | January 19, 2026 |
| 2 | Github | 5% | January 26, 2026 |
| 3 | Unsupervised Learning | 15% | February 9, 2026 |
| 4 | Independent Analysis – Part 1 | 10% | February 23, 2026 |
| 5 | Supervised Learning | 15% | March 9, 2026 |
| 6 | Causal Forest - CANCELLED! | 10% | March 23, 2026 |
| 7 | Artificial Neural Network | 15% | April 6, 2026 |
| 8 | Scientific Computing | 5% | April 20, 2026 |
| 9 | Independent Analysis – Part 2 | 15% | April 20, 2026 |
| | Total | 100% | |
Value: 10% of final grade.
Due Date: See Course Schedule.
Type: This assignment will have students work with fundamental skills of data science and submit via an RMarkdown file.
Description: In this assignment you will complete a data wrangling assignment that will involve data cleaning, descriptive statistics, understanding missing data, and joining datasets together.
Value: 5% of final grade.
Due Date: See Course Schedule.
Type: This assignment will have students work with version control systems and submit their assignment to their own GitHub repository.
Description: In this assignment you will create a GitHub account, install Git on your local computer, create a GitHub repository, and commit and push your work to that GitHub repository.
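The commit-and-push steps described above could look roughly like the following. This is a minimal sketch, not the required workflow: the repository name `chep898-assignment2`, the user name and email, and the `YOUR-USERNAME` part of the URL are all placeholders chosen for illustration, and the sketch works in a temporary directory so it does not touch any existing project.

```shell
#!/bin/sh
set -eu

# Work in a throwaway directory so the sketch does not touch existing projects.
workdir="$(mktemp -d)"
cd "$workdir"

# Create a new local repository (placeholder name).
git init -q chep898-assignment2
cd chep898-assignment2

# Identify yourself for this repository only (placeholder values).
git config user.name "Your Name"
git config user.email "you@example.com"

# Add a file, stage it, and record a commit.
echo "# CHEP 898 Assignment 2" > README.md
git add README.md
git commit -q -m "Add README"

# Name the branch main and point the repository at GitHub.
# YOUR-USERNAME is a placeholder; replace it with your own account name.
git branch -M main
git remote add origin "https://github.com/YOUR-USERNAME/chep898-assignment2.git"

# Uncomment once the empty repository exists on GitHub:
# git push -u origin main

# Show the result: the commit history and the configured remote.
git log --oneline
git remote -v
```

The push is left commented out because it only succeeds after the matching empty repository has been created on GitHub and you have authenticated.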
Value: 15% of final grade.
Due Date: See Course Schedule.
Type: This assignment will have students understand the basic approaches to unsupervised learning.
Description: In this assignment you will apply and compare different methods for unsupervised learning on a large health administrative dataset.
Value: 10% of final grade.
Due Date: See Course Schedule.
Type: This assignment will have students conduct the first part of an independent data science workflow.
Description: This is part 1 of the independent analysis. You will need to find a dataset, develop an analysis plan to include the major components of the course (i.e., GitHub, Scientific Computing), and conduct descriptive statistics and data wrangling on your chosen dataset.
Value: 15% of final grade.
Due Date: See Course Schedule.
Type: This assignment will have students understand the supervised learning approaches.
Description: In this assignment you will apply and compare different methods for supervised learning on a large health administrative dataset, and will use bootstrap and cross-validation techniques.
Value: 10% of final grade.
Due Date: See Course Schedule.
Type: This is a code-based assignment where you conduct a Random Forest analysis.
Description: In this analysis you will complete a Random Forest analysis using the CanPath Student Dataset. You will need to run the analysis, conduct detailed hyperparameter tuning, and conduct model comparisons.
Value: 15% of final grade.
Due Date: See Course Schedule.
Type: This is a code-based assignment where you conduct an artificial neural network analysis.
Description: In this analysis you will complete a machine learning based artificial neural network analysis using the CanPath Student Dataset.
Value: 5% of final grade.
Due Date: See Course Schedule.
Type: This is a code-based assignment where you will learn to use an HPC.
Description: In this assignment you will use the USask Plato High Performance Computing cluster to run a large-scale machine learning analysis on a large (~1 GB) dataset.
Value: 15% of final grade.
Due Date: See Course Schedule.
Type: This assignment will have students conduct the second and final part of an independent data science workflow.
Description: This is part 2 (final part) of the independent analysis. You will need to conduct a complete analysis including data wrangling, missing data handling, and apply at least 2 different machine learning methods to your data.
Value: 0% of final grade (Formative Evaluation).
Due Date: See Course Schedule.
Type: Written report (200 words).
Description: Complete the student self-evaluation form. This is required for each assignment where you use AI.
All assignments should be submitted to the appropriate place in Canvas or GitHub. All assignments are due at 5 pm (CST) on the due date. Please don't stay up until midnight to get the work done. Remember there are no late penalties, so just take an extra day if you need it and get some sleep.
There is no penalty for late assignments. However, because many assignments have two parts, it is critical to get the first assignment of each section in around the due date. Assignments that are not submitted by the end of the course will receive a grade of zero.