Skip to content

remla24-team10/model-training

Repository files navigation

model-training

Badge

In this assignment we have transferred a small kaggle model to a professional development environment. We will be using the following tools:

  • DVC for data version control and machine learning reproducibility.
  • Git for version control.
  • Poetry for dependency management.

🛠️ Installation

Prerequisites:

  • Poetry
  • Python 3.11

Then, navigate to the main directory (model-training) and run:

# If the lock file is out of date
poetry lock --no-update

# Install dependencies
poetry install

# Activate the virtual environment
poetry shell

📂 Data Retrieval and Pipeline Execution

Navigate to the remla-group10 folder and run:

dvc fetch
dvc pull  # May not work, use the workaround below if needed
dvc repro

If dvc pull does not work, manually fetch the data files and run dvc repro:

dvc fetch data/raw/train.txt
dvc fetch data/raw/test.txt
dvc fetch data/raw/val.txt
dvc repro

To show metrics run:

dvc metrics show

Public Sharing of the Model

The model trained is shared publicly if desired. On default this is not enabled. To enable this, set the third argument for the dvc train step in the dvc.yaml file to true. To have this working, you do need access to the s3 bucket and this can be requested from the authors.

The model currently saved however is openly available and can be downloaded from the s3 bucket. See how in train.py.

📊 Code Quality Metrics

Ensure code quality with pylint, mypy, bandit, and pre-commit.

Install Mypy Stubs

Install mypy stubs for your dependencies:

mypy --install-types

Using Pre-commit

The goal is to run Black, Pylint, Mypy, and Bandit with the configurations specified in pyproject.toml with one command. To do this, first install pre-commit:

pre-commit install

Then run the checks with:

pre-commit run --all-files

Pre-commit also runs automatically before every commit, which is what we want for this project.

🧪 Running Tests

In the activated virtual environment, run:

pytest

For specific tests:

  • Quick tests (run automatically in CI, no DVC pull required):
pytest -m fast
  • Tests requiring all data (requires dvc pull or dvc repro):
pytest -s -m manual
  • Tests requiring training (expected to take 30 minutes):
pytest -s -m training

📝 Notes on the project structure

  • The project is structured in a way that the data is stored in the data/raw folder. This is the data that is used for training and testing. The data is not stored in the repository, but is stored in a dvc remote. The data is fetched from the remote using the dvc fetch command. The data is then processed and stored in the data/processed folder. The processed data is used for training and testing.
  • The Python source code is split up in the src folder. Every stage of the pipeline is stored in a separate file to keep the code clean and maintainable. The code is tested using the pytest framework. The tests are stored in the tests folder. The tests are split up in fast tests and manual tests. The fast tests are ran automatically in the CI pipeline and do not require dvc pull. The manual tests require all data and can be ran using the dvc pull command. The training tests require training on top of the data and can be ran using the dvc pull command.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Packages

 
 
 

Contributors