---
title: "Project 2: Automatic coding for NACE classification"
lang: en-US
author:
- name: Théo FERRY
email: theo.ferry@insee.fr
affiliations:
- name: "[Insee](https://www.insee.fr/fr/accueil)"
- name: Julien PRAMIL
email: julien.pramil@insee.fr
affiliations:
- name: "[Insee](https://www.insee.fr/fr/accueil)"
- name: Meilame TAYEBJEE
email: meilame.tayebjee@insee.fr
affiliations:
- name: "[Insee](https://www.insee.fr/fr/accueil)"
format:
html:
number-sections: true
editor: visual
editor_options:
chunk_output_type: console
---
```{=html}
<table>
<thead>
<tr>
<th>Technical level</th>
<th>Tasks</th>
</tr>
</thead>
<tbody>
<tr>
<td><span class="level-button level-beginner">Beginner</span></td>
<td>Run the notebook start-to-finish, load the dataset, and connect to Qdrant.</td>
</tr>
<tr>
<td><span class="level-button level-intermediate">Intermediate</span></td>
<td>Modify the data query, train a simple classifier, and log results with MLflow.</td>
</tr>
<tr>
<td><span class="level-button level-expert">Expert</span></td>
<td>Extend the pipeline with a new embedding/model, deploy a serving endpoint, or integrate additional data sources.</td>
</tr>
</tbody>
</table>
```
# What you will learn {.unnumbered}
By following this tutorial, you will learn how to:
- Set up a Python environment and install dependencies for the project.
- Download and inspect a labeled dataset for NACE classification.
- Connect to a Qdrant vector database from Python.
- Explore the solution scripts in the `solutions/` folder.
- Understand how model training and logging are organized for reproducible experiments.
# Introduction {.unnumbered}
This project demonstrates an end-to-end pipeline for automatically classifying text into NACE codes. It covers:
- generating and loading labeled data,
- preprocessing and model training,
- logging experiments with MLflow,
- and preparing deployment-ready artifacts.
# Structure of the project
This project has five main steps (listed in the banner at the top of the page):
- data generation;
- data preprocessing;
- model fitting and evaluation;
- model logging with MLflow;
- deployment.
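As a mental model, the five steps can be sketched as plain Python functions. This is a hypothetical skeleton; the function names and signatures are illustrative only, not the project's actual API:

```python
# Hypothetical skeleton of the five pipeline steps; names are illustrative only.
def generate_data():
    # step 1: produce labeled examples (text + NACE code)
    return [("Boulangerie artisanale", "10.71C")]

def preprocess(examples):
    # step 2: normalize the raw text
    return [(text.lower().strip(), label) for text, label in examples]

def fit_and_evaluate(examples):
    # step 3: train a classifier and measure its quality (stubbed here)
    return {"model": "stub", "n_train": len(examples)}

def log_run(run_info):
    # step 4: in the real project this is done with MLflow
    return dict(run_info, logged=True)

def deploy(run_info):
    # step 5: expose the logged model behind a serving endpoint
    return f"endpoint serving {run_info['model']}"

run = log_run(fit_and_evaluate(preprocess(generate_data())))
print(deploy(run))
```

Each later section of the tutorial fills in one of these stubs with the real implementation.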
# Initialization of the project
## Fork this repository into your own GitHub account
On the GitHub page of the project [here](https://github.com/AIML4OS/funathon-project2), click **Fork** to create a copy in your own GitHub account.
Then clone your fork locally using the command below (replace `<YOUR_GITHUB_NAME>` with your username):
```bash
git clone https://github.com/<YOUR_GITHUB_NAME>/funathon-project2.git
cd funathon-project2
```
## Set up the environment (quick start)
Run the following commands in a terminal inside the cloned repo:
```bash
# install/update dependencies
uv sync
# activate the project virtual environment
source .venv/bin/activate
```
> ✅ If you prefer to keep your shell clean, you can also prefix commands with `uv run` (e.g. `uv run python script.py`).
### Select the Python interpreter (VS Code)
If you are using VS Code, make sure the workspace uses the virtual environment you just created:
1. Open the Command Palette (`Ctrl+Shift+P` or `Cmd+Shift+P`).
2. Run `>Python: Select Interpreter`.
3. Choose **Enter interpreter path** and point to:
`/home/onyxia/work/funathon-project2/.venv/bin/python`
## Quick start: run this notebook
1. Open `index.qmd` in VS Code or Quarto.
2. Run the first code cell (the data download cell) and wait for it to finish.
3. Confirm you see a small table output (from `annotations.head()`).
## Troubleshooting (common issues)
- If `uv sync` fails, re-run it and check the error message; a missing system package or network issue is most common.
- If the Qdrant connection fails, ensure your `.env` file contains the correct values for `QDRANT_URL`, `QDRANT_API_KEY`, and `QDRANT_API_PORT`.
- If a terminal command fails, verify you are in the repository root (`pwd` should end with `funathon-project2`).
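To spot the Qdrant configuration issue quickly, you can check the variables directly from Python. This is a minimal sketch; the variable names match those used later in this notebook:

```python
import os

REQUIRED_VARS = ["QDRANT_URL", "QDRANT_API_KEY", "QDRANT_API_PORT"]

def missing_vars(env=os.environ):
    """Return the required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]

# example with a deliberately incomplete environment
example_env = {"QDRANT_URL": "https://qdrant.example.org"}
print(missing_vars(example_env))  # ['QDRANT_API_KEY', 'QDRANT_API_PORT']
```

If `missing_vars()` returns a non-empty list in your session, add the missing entries to your `.env` file and reload it before retrying the connection.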
# Load the labeled dataset
In the following cell we retrieve a small labeled dataset (text + NACE labels) from a remote parquet file using DuckDB.
```{python}
#| label: load-annotations
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")

# You can use the internal S3 endpoint if you have access (uncomment the line below):
# query_definition = "SELECT * FROM read_parquet('s3://projet-formation/diffusion/funathon/2026/project2/generation_None_temp08.parquet')"

# Using the public HTTPS endpoint is more likely to work in typical environments:
query_definition = "SELECT * FROM read_parquet('https://minio.lab.sspcloud.fr/projet-formation/diffusion/funathon/2026/project2/generation_None_temp08.parquet')"

annotations = con.sql(query_definition).to_df()
annotations.head()
```
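Once the table is loaded, it is worth checking how many examples each NACE code has. The snippet below sketches this on a toy frame, since the actual column names in `annotations` may differ (check `annotations.columns` and adapt):

```python
import pandas as pd

# toy frame mirroring the structure of `annotations`; column names are assumptions
sample = pd.DataFrame({
    "text": ["Boulangerie artisanale", "Conseil en informatique", "Vente de pain"],
    "nace": ["10.71C", "62.02A", "10.71C"],
})

# number of examples per NACE code
counts = sample["nace"].value_counts()
print(counts)
```

A strongly imbalanced distribution here is a hint that you will need stratified splits or class weights during model fitting.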
It is also useful to verify the Qdrant connection before running vector indexing or retrieval steps.
```{python}
#| label: qdrant-client
import os
from dotenv import load_dotenv
from qdrant_client import QdrantClient
load_dotenv()
client_qdrant = QdrantClient(
    url=os.environ["QDRANT_URL"],
    api_key=os.environ["QDRANT_API_KEY"],
    port=int(os.environ["QDRANT_API_PORT"]),  # the port must be an integer
)

# a successful call confirms the connection works
collections = client_qdrant.get_collections()
print(collections)
```
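Under the hood, a vector database like Qdrant answers nearest-neighbour queries over embeddings. The numpy sketch below illustrates the cosine-similarity search it performs; the 2-d vectors here are made up purely for illustration:

```python
import numpy as np

def cosine_top_k(query, vectors, k=2):
    # normalize, then dot products give cosine similarities
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = v @ q
    top = np.argsort(scores)[::-1][:k]
    return top, scores[top]

# three toy 2-d "embeddings" and a query close to the first one
vectors = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
query = np.array([1.0, 0.1])
idx, scores = cosine_top_k(query, vectors)
print(idx)  # indices of the most similar vectors, best first
```

Qdrant does the same ranking at scale, using approximate nearest-neighbour indexes instead of the exhaustive scan shown here.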