model-training

In this assignment, we moved a small Kaggle model into a professional development environment. The project uses the following tools:

DVC for data version control and machine learning reproducibility.
Git for version control.
Poetry for dependency management.

🛠️ Installation

Prerequisites:

Poetry
Python 3.11

Then, navigate to the main directory (model-training) and run:

# If the lock file is out of date
poetry lock --no-update

# Install dependencies
poetry install

# Activate the virtual environment
poetry shell

📂 Data Retrieval and Pipeline Execution

Navigate to the remla-group10 folder and run:

dvc fetch
dvc pull  # May not work, use the workaround below if needed
dvc repro

If dvc pull does not work, manually fetch the data files and run dvc repro:

dvc fetch data/raw/train.txt
dvc fetch data/raw/test.txt
dvc fetch data/raw/val.txt
dvc repro
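
The fallback described above can be sketched as a small Python wrapper. This is only an illustration, assuming the dvc CLI is on PATH; the names `run`, `retrieve_and_reproduce`, and the injectable `runner` parameter are hypothetical and not part of the repository.

```python
import subprocess
from typing import Callable, Sequence

# Raw data files named in the workaround above.
RAW_FILES = ("data/raw/train.txt", "data/raw/test.txt", "data/raw/val.txt")


def run(cmd: Sequence[str]) -> bool:
    """Run a command and report whether it exited with status 0."""
    return subprocess.run(list(cmd), check=False).returncode == 0


def retrieve_and_reproduce(runner: Callable[[Sequence[str]], bool] = run) -> bool:
    """Try `dvc pull`; if it fails, fetch each raw file, then run `dvc repro`."""
    if not runner(["dvc", "pull"]):
        for path in RAW_FILES:
            if not runner(["dvc", "fetch", path]):
                return False  # give up if a single file cannot be fetched
    return runner(["dvc", "repro"])


if __name__ == "__main__":
    raise SystemExit(0 if retrieve_and_reproduce() else 1)
```

Passing the command runner as a parameter keeps the control flow testable without a configured DVC remote.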

To show the metrics, run:

dvc metrics show
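
DVC reads metrics from files (JSON, YAML, or TOML) that are declared as metrics in dvc.yaml. As a rough sketch of how a pipeline stage might produce such a file, the helper below writes a flat JSON dictionary. The function name `write_metrics`, the file location, and the metric names are assumptions for illustration; check dvc.yaml for the file this project actually uses.

```python
import json
from pathlib import Path


def write_metrics(path: Path, metrics: dict[str, float]) -> None:
    """Write metrics as JSON so that `dvc metrics show` can display them.

    The path must match a metrics entry in dvc.yaml (hypothetical here).
    """
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(metrics, indent=2))


# Example call with placeholder names, not real results:
# write_metrics(Path("reports/metrics.json"), {"accuracy": accuracy_value})
```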

Public Sharing of the Model

The trained model can optionally be shared publicly. This is disabled by default. To enable it, set the third argument of the train step in dvc.yaml to true. Sharing requires access to the S3 bucket, which can be requested from the authors.

The currently saved model, however, is openly available and can be downloaded from the S3 bucket; see train.py for how this is done.
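
To illustrate the opt-in flag, here is a minimal sketch of how a training script could read a third positional argument as a boolean. It assumes the arguments are passed positionally via sys.argv; the function name `parse_share_flag` and the argument layout are hypothetical and may differ from the actual train.py.

```python
import sys


def parse_share_flag(argv: list[str]) -> bool:
    """Return True only when the third positional argument is 'true'.

    Assumption: argv[0] is the script name, so the third argument is argv[3].
    """
    if len(argv) <= 3:
        return False  # sharing stays disabled when the argument is missing
    return argv[3].strip().lower() == "true"


if __name__ == "__main__":
    share_publicly = parse_share_flag(sys.argv)
    print(f"Public sharing enabled: {share_publicly}")
```

Treating anything other than an explicit "true" as disabled keeps the default safe: a typo in dvc.yaml cannot accidentally publish a model.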

📊 Code Quality Metrics

Ensure code quality with pylint, mypy, bandit, and pre-commit.

Install Mypy Stubs

Install mypy stubs for your dependencies:

mypy --install-types

Using Pre-commit

The goal is to run Black, Pylint, Mypy, and Bandit with a single command, using the configurations specified in pyproject.toml. To do this, first install the pre-commit Git hooks:

pre-commit install

Then run the checks with:

pre-commit run --all-files

Pre-commit also runs automatically before every commit, which is what we want for this project.

🧪 Running Tests

In the activated virtual environment, run:

pytest

For specific tests:

Quick tests (run automatically in CI, no DVC pull required):

pytest -m fast

Tests requiring all data (requires dvc pull or dvc repro):

pytest -s -m manual

Tests requiring training (expected to take 30 minutes):

pytest -s -m training

📝 Notes on the project structure

The raw data used for training and testing lives in the data/raw folder. It is not stored in the Git repository but in a DVC remote, from which it is retrieved with dvc fetch. The pipeline then processes the raw data and stores the result in the data/processed folder, which is what training and testing actually use.
The Python source code lives in the src folder. Every stage of the pipeline has its own file to keep the code clean and maintainable. The code is tested with pytest, and the tests are stored in the tests folder. They are split into fast, manual, and training tests. The fast tests run automatically in the CI pipeline and do not require dvc pull. The manual tests require all data, so run dvc pull (or dvc repro) before running them. The training tests additionally train the model on that data, so they also need the data to be pulled first.
