walkabilly/ml_methods_health_science

CHEP 898: Machine Learning Methods in Health Science

Course Syllabus

Code: CHEP 898

Term: 2026 Winter

Delivery: In person

Location: HLTH 2334

Start Date: January 6th, 2026

Time: Tuesday 1:00-3:50 pm

Course Description

This course bridges the gap between data science techniques, biostatistics, and health research, equipping students with advanced tools to analyze complex health data. It covers both the theoretical and practical aspects of machine learning. Through hands-on experience with the R programming language, supervised and unsupervised machine learning techniques, and data visualization, students will learn to process and interpret large datasets to uncover insights into disease patterns, public health trends, and causal relationships. This will prepare learners to tackle health science challenges with modern data-driven approaches.

Official Syllabus

The official syllabus for this course is available for download here

Prerequisites

  • CHEP/PUBH 805, or a graduate-level Statistics Course.
  • To request permission to take this course, please submit an override request with one of the instructors as approver: https://jira.usask.ca/servicedesk/customer/portal/7/create/291
    • On your ticket request, please explain how you meet the course prerequisites and how you hope the course can further your learning and research needs.
    • Please include the CHEP graduate program administrator, Stephanie Kehrig, on the approval request ticket.
  • Interested faculty and researchers are invited to contact the instructors with content questions, or Stephanie.Kehrig@usask.ca with questions about the registration process.

Land Acknowledgement

I acknowledge our shared connection to the land and recognize that Indigenous and Métis peoples on Treaty 6 Territory and all Indigenous peoples have been and continue to be stewards for social justice, equity, and land-based education. In the spirit of reconciliation may we all strive to learn and support the work of Indigenous communities as allies.

Artificial Intelligence

This course will follow the general USask Guidelines about AI for Educators and Students (https://leadership.usask.ca/initiatives/ai/index.php). The University has developed high level guidance based on the European Network for Academic Integrity (ENAI) recommendations. The principles are descriptions of USask intentions for, and beliefs about, the use of AI. They include 4 categories:

  • Ethical and Responsible Use
  • Literacy
  • Tool Use
  • Change and Innovation

AI Rules for this course

In general, my opinion is that you should explore these tools, what they can do, and how you can integrate them into your work. These tools are great for editing, formatting, generating ideas, and writing very basic code. USask faculty and students have access to Microsoft Copilot (https://teaching.usask.ca/learning-technology/tools/microsoft-copilot.php). It's critical that when you use these tools you are very aware of bias and that you intervene to correct the text. Here are my general rules for AI in this course.

  1. You can use AI tools for any or all parts of the work.

  2. If you do, you must cite that use, as follows.

    2.1. Acknowledge AI tools: “All persons, sources, and tools that influence the ideas or generate the content should be properly acknowledged” (p. 3). Acknowledgement may be done in different ways, according to context and discipline, and should include the input to the tool.

    2.2. Do not list AI tools as authors: Authors must take responsibility and be accountable for content and an AI tool cannot do so.

    2.3. Recognize limits and biases of AI tools: Inaccuracies, errors, and bias are reproduced in AI tools in part because of the human produced materials used for training.

  3. If you do, you must include a 200-word reflective essay about the experience as part of your self-evaluation.

  4. Be very careful with references. Many of these tools fabricate references.

  5. I will not use tools like GPTZero to detect whether you have used AI tools. We are making an agreement to be honest with each other here. This is a small class, so we have that luxury.

Contact Information

Dr. Daniel Fuller daniel.fuller@usask.ca

Dr. Erfan Hoque erfan.hoque@usask.ca

Learning Outcomes

  1. Understand the basics of data wrangling and data management in epidemiology.
  2. Gain proficiency in using Git and GitHub for version control.
  3. Learn to leverage high-performance computing resources for epidemiologic data analysis.
  4. Explore various machine learning techniques and their applications in epidemiology.
  5. Compare and contrast traditional epidemiological analysis methods with machine learning approaches.

Readings/Textbooks

There is no single textbook for this course. We will use components of several open access resources.

R for Data Science (2e). 2024. Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. https://r4ds.hadley.nz/

An Introduction to Statistical Learning with Applications in R (2e). 2024. Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani. https://www.statlearning.com/

The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2e). 2009, Trevor Hastie, Robert Tibshirani and Jerome Friedman. https://hastie.su.domains/ElemStatLearn/

Learn Tidymodels. https://www.tidymodels.org/learn/

Other Required Materials

Use of a statistical software program (R) is required for this course. You will also be asked to install other software, including PostgreSQL and Git.

Dataset

In this course we will use the CanPath Student Dataset that provides students the unique opportunity to gain hands-on experience working with CanPath data. The CanPath Student Dataset is a synthetic dataset that was manipulated to mimic CanPath’s nationally harmonized data but does not include or reveal actual data of any CanPath participants.

The CanPath Student Dataset is available to instructors at a Canadian university or college for use in an academic course, at no cost. CanPath will provide the Student Dataset and a supporting data dictionary.

  • Large sample size (Over 40,000 participants)
  • Real-world population-level Canadian data
  • Variety of areas of information allowing for a wide range of research topics
  • No cost to faculty
  • Potential for students to apply for real CanPath data to publish their findings

General Class Schedule

| Week | Date | Topic | Data Work |
|------|------|-------|-----------|
| 1 | January 6 | Intro to Data Science (Intro to Machine/Statistical Learning) | Data Wrangling and Visualization |
| 2 | January 13 | Data Visualization and Version Control/GitHub | HappyGitwithR |
| 3 | January 20 | Regression – Linear Regression and Optimization | Linear Regression |
| 4 | January 27 | Classification – Logistic Regression and KNN | Logistic Regression and KNN |
| 5 | February 3 | Unsupervised Learning – PCA | Principal Component Analysis |
| 6 | February 10 | Unsupervised Learning – Clustering | Clustering Methods |
| 7 | February 17 | Reading Week | |
| 8 | February 24 | Validation – Cross Validation + Bootstrapping | Cross Validation + Bootstrapping – Applications with Linear Regression |
| 9 | March 3 | Ensemble Methods – Random Forest | Random Forest |
| 10 | March 10 | Causal Inference – Causal Forest | Causal Forest + Matching |
| 11 | March 17 | Ensemble Methods – Artificial Neural Networks | Artificial Neural Networks |
| 12 | March 24 | Ensemble Methods – Transformers/Self-Supervised Learning | Artificial Neural Networks Part 2 |
| 13 | March 31 | Scientific Computing | Scientific Computing + Full ML Implementation |
  • Subject to change depending on speed

Attendance and Participation

Attendance, participation, and reading ahead are critical to this course. A lot of class time is allocated to discussion and working on assignments, but reading ahead is a critical aspect of the learning process.

Assignment Grading Scheme

You can find the detailed descriptions for all assignments below or in the assignments folder here

| # | Assignment | Grade % | Due Date |
|---|------------|---------|----------|
| 1 | Data Wrangling and Visualization | 10% | January 19, 2026 |
| 2 | Github | 5% | January 26, 2026 |
| 3 | Unsupervised Learning | 15% | February 9, 2026 |
| 4 | Independent Analysis – Part 1 | 10% | February 23, 2026 |
| 5 | Supervised Learning | 15% | March 9, 2026 |
| 6 | Causal Forest - CANCELLED! | 10% | March 23, 2026 |
| 7 | Artificial Neural Network | 15% | April 6, 2026 |
| 8 | Scientific Computing | 5% | April 20, 2026 |
| 9 | Independent Analysis – Part 2 | 15% | April 20, 2026 |
| | Total | 100% | |

Assignment Descriptions

1. Data Wrangling and Visualization.

Value: 10% of final grade.
Due Date: See Course Schedule.
Type: This assignment will have students work with fundamental skills of data science and submit via an RMarkdown file.
Description: In this assignment you will complete a data wrangling assignment that will involve data cleaning, descriptive statistics, understanding missing data, and joining datasets together.

2. Github

Value: 5% of final grade.
Due Date: See Course Schedule.
Type: This assignment will have students work with version control systems and submit their assignment to their own GitHub repository.
Description: In this assignment you will create a GitHub account, install Git on your local computer, create a GitHub repository, and commit and push your work to that repository.
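
On the command line, the commit-and-push part of this workflow might look roughly like the sketch below. It is an illustration under stated assumptions, not the required procedure for the assignment: the repository name, user details, file, and commit message are placeholders, and a local bare repository stands in for GitHub so the commands run without an account or network access.

```shell
#!/bin/sh
set -eu

# Work in a throwaway directory so nothing else on your machine is touched.
workdir="$(mktemp -d)"

# Stand-in for GitHub so the sketch runs offline: a local bare repository.
# With a real GitHub repository, the remote would instead be a URL of the form
# https://github.com/<your-username>/<your-repository>.git (placeholder).
remote="$workdir/remote.git"
git init --quiet --bare "$remote"

# Create a local repository (hypothetical name: chep898-assignments).
repo="$workdir/chep898-assignments"
git init --quiet "$repo"
cd "$repo"

# Identify yourself for this repository only (placeholder values).
git config user.name "Student Name"
git config user.email "student@example.com"

# Add a file, stage it, and commit it.
echo "# CHEP 898 assignments" > README.md
git add README.md
git commit --quiet -m "Add README"

# Name the branch main, connect the remote, and push.
git branch -M main
git remote add origin "$remote"
git push --quiet -u origin main

# Confirm that the commit reached the remote.
git --git-dir="$remote" log --oneline main
```

After the first push with `-u`, later changes follow the same add, commit, push cycle, and a plain `git push` is enough.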

3. Unsupervised Learning

Value: 15% of final grade.
Due Date: See Course Schedule.
Type: This assignment will have students understand the basic approaches to unsupervised learning.
Description: In this assignment you will apply and compare different methods for unsupervised learning on a large health administrative dataset.

4. Independent Analysis 1

Value: 10% of final grade.
Due Date: See Course Schedule.
Type: This assignment will have students conduct the first part of an independent data science workflow.
Description: This is part 1 of the independent analysis. You will need to find a dataset, develop an analysis plan that includes the major components of the course (i.e., GitHub, Scientific Computing), and conduct descriptive statistics and data wrangling on your chosen dataset.

5. Supervised Learning

Value: 15% of final grade.
Due Date: See Course Schedule.
Type: This assignment will have students understand the supervised learning approaches.
Description: In this assignment you will apply and compare different methods for supervised learning on a large health administrative dataset, and will use bootstrap and cross-validation techniques.

6. Causal Forest - CANCELLED!

Value: 10% of final grade.
Due Date: See Course Schedule.
Type: This is a code-based assignment where you conduct a Random Forest analysis.
Description: In this analysis you will complete a Random Forest analysis using the CanPath Student Dataset. You will need to run the analysis, conduct detailed hyperparameter tuning, and conduct model comparisons.

7. Artificial Neural Networks

Value: 15% of final grade.
Due Date: See Course Schedule.
Type: This is a code-based assignment where you conduct an artificial neural network analysis.
Description: In this analysis you will complete a machine learning based artificial neural network using the CanPath Student Dataset.

8. Scientific Computing/Big Data

Value: 5% of final grade.
Due Date: See Course Schedule.
Type: This is a code-based assignment where you will learn to use an HPC.
Description: In this assignment you will use the USask Plato high-performance computing (HPC) cluster to run a large-scale machine learning analysis on a large (~1 GB) dataset.

9. Independent Analysis 2

Value: 15% of final grade.
Due Date: See Course Schedule.
Type: This assignment will have students conduct the second and final part of an independent data science workflow.
Description: This is part 2 (the final part) of the independent analysis. You will need to conduct a complete analysis that includes data wrangling, missing data handling, and the application of at least 2 different machine learning methods to your data.

Self-Evaluation

Value: 0% of final grade (Formative Evaluation).
Due Date: See Course Schedule.
Type: Written report (200 words).
Description: Complete the student self-evaluation form. This is required for each assignment where you use AI.

Submitting Assignments

All assignments should be submitted to the appropriate place in Canvas or GitHub. All assignments are due at 5 pm (CST) on the due date. Please don't stay up until midnight to get the work done. Remember there are no late penalties, so just take an extra day if you need it and get some sleep.

Late and Missing Assignments

There is no penalty for late assignments. However, because many assignments have two parts, it is critical to submit the first assignment of each section on or around the due date. Assignments that are not submitted by the end of the course will receive a grade of zero.
