This repository was archived by the owner on Sep 9, 2025. It is now read-only.

Commit 94a2a8c

specified the details of model training path. fixed command option names
Signed-off-by: Daniele Martinoli <[email protected]>
1 parent 064e3cd commit 94a2a8c

1 file changed

+95
-43
lines changed

docs/cli/ilab-rag-retrieval.md

Lines changed: 95 additions & 43 deletions
Original file line number | Diff line number | Diff line change
@@ -14,6 +14,7 @@ This document proposes enhancements to the `ilab` CLI to support workflows utili
1414
(RAG) artifacts within `InstructLab`. The proposed changes introduce new commands and options for the embedding ingestion
1515
and RAG-based chat pipelines:
1616
* A new `ilab data` sub-command to process customer documentation.
17+
* Either from knowledge taxonomy or from actual user documents.
1718
* A new `ilab data` sub-command to generate and ingest embeddings from pre-processed documents into a configured vector store.
1819
* An option to enhance the chat pipeline by using the stored embeddings to augment the context of conversations, improving relevance and accuracy.
1920

@@ -48,15 +49,15 @@ cases.
4849
To maintain compatibility and simplicity, no new configurations will be introduced for new commands. Instead,
4950
the settings will be defined using the following hierarchy (options higher in the list overriding those below):
5051
* CLI flags (e.g., `--FLAG`).
51-
* Environment variables following a consistent naming convention, such as `ILAB_<UPPERCASE_ARGUMENT_NAME>`.
52+
* Environment variables following a consistent naming convention, such as `ILAB_<UPPERCASE_FLAG_NAME>`.
5253
* Default values, for all the applicable use cases.
5354

5455
For example, the `document-store-uri` argument can be implemented using the `click` module like this:
5556
```py
5657
@click.option(
57-
"--vectordb-uri",
58+
"--document-store-uri",
5859
default='rag-output.db',
59-
envvar="ILAB_VECTORDB_URI",
60+
envvar="ILAB_DOCUMENT_STORE_URI",
6061
)
6162
```
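
The precedence described above (CLI flag first, then environment variable, then default) can be sketched with the standard library alone. This is only an illustration: the helper names `env_var_name` and `resolve_option` are hypothetical and are not part of `ilab`, and the naming rule follows the `ILAB_<UPPERCASE_FLAG_NAME>` convention, which some options in the tables may not follow exactly. With `click`, the `envvar` parameter shown above covers the same behavior.

```python
import os
from typing import Mapping, Optional


def env_var_name(flag: str) -> str:
    """Map a CLI flag such as '--document-store-uri' to 'ILAB_DOCUMENT_STORE_URI'."""
    # Hypothetical helper: applies the ILAB_<UPPERCASE_FLAG_NAME> convention.
    return "ILAB_" + flag.lstrip("-").replace("-", "_").upper()


def resolve_option(
    flag: str,
    cli_value: Optional[str],
    default: str,
    environ: Mapping[str, str] = os.environ,
) -> str:
    """Return the CLI value if given, else the environment variable, else the default."""
    if cli_value is not None:
        return cli_value
    return environ.get(env_var_name(flag), default)
```

Passing `environ` explicitly keeps the lookup easy to test without touching the real process environment.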
6263

@@ -72,55 +73,80 @@ If the configured embedding model has not been cached, the command execution wil
7273
consistently to all new and updated commands.
7374

7475
### 2.2 Document Processing Pipeline
75-
The proposal is to add a `process` sub-command to the `data` command group:
76+
The proposal is to add a `process` sub-command to the `data` command group.
77+
78+
For the Taxonomy path (no Model Training):
7679
```
77-
ilab data process --input /path/to/docs/folder --output /path/to/processed/folder
80+
ilab data process /path/to/processed/folder
7881
```
7982

83+
For the Plug-and-Play RAG path:
84+
```
85+
ilab data process --input /path/to/docs/folder /path/to/processed/folder
86+
```
87+
8088
#### Command Purpose
81-
Applies the transformation for the customer documents in `/path/to/docs/folder`. Processed artifacts are stored under `/path/to/processed/folder`.
89+
Applies the docling transformation to the customer documents.
90+
* Original documents are located in the `/path/to/docs/folder` input folder or in the taxonomy knowledge branch.
91+
* In the latter case, the input documents are the knowledge documents retrieved from the installed taxonomy repository
92+
according to the [SDG diff strategy][sdg-diff-strategy], e.g. `the new or changed YAMLs using git diff, including untracked files`.
93+
* Processed artifacts are stored under `/path/to/processed/folder`.
8294

8395
**Notes**:
84-
* In alignment with the current SDG implementation, the folder will not be navigated recursively. Only files located at the root level of the specified
85-
folder will be considered. The same principle applies to all other options outlined below.
86-
* To ensure consistency and avoid issues with document versioning or outdated artifacts, the destination folder will be cleared before execution.
87-
This ensures it contains only the artifacts generated from the most recent run.
96+
* In alignment with the current SDG implementation, the `--input` folder will not be navigated recursively. Only files located at the root
97+
level of the specified folder will be considered. The same principle applies to all other options outlined below.
98+
* To ensure consistency and avoid issues with document versioning or outdated artifacts, the destination folder will be cleared
99+
before execution. This ensures it contains only the artifacts generated from the most recent run.
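
The two notes above (non-recursive input scan, destination cleared before each run) could look roughly like the following. This is a minimal sketch under stated assumptions: `prepare_output_dir` and `list_input_documents` are hypothetical names, not functions from `ilab` or `instructlab-sdg`, and the docling conversion step itself is left out.

```python
import shutil
from pathlib import Path
from typing import List


def prepare_output_dir(output_dir: Path) -> None:
    """Remove any previous artifacts so the folder only holds the latest run."""
    if output_dir.exists():
        shutil.rmtree(output_dir)
    output_dir.mkdir(parents=True)


def list_input_documents(input_dir: Path) -> List[Path]:
    """Return only the files at the root of input_dir (no recursion), sorted by name."""
    return sorted(p for p in input_dir.iterdir() if p.is_file())
```

Files in nested folders are ignored by `list_input_documents`, matching the root-level-only rule.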
88100

89-
The trasformation is based on the `instructlab-sdg` modules (the initial step of the `ilab data generate` pipeline)
90-
91-
### Why We Need It
92-
This command streamlines the `ilab data generate` pipeline and eliminates the requirement to define a `qna` document,
93-
which typically includes:
94-
* A minimum of 5×3 question-and-answer pairs.
95-
* Reference documents stored in Git.
96-
97-
The goal is not to generate training data for InstructLab-trained models but to utilize the documents for RAG
98-
workflows with pre-tuned models.
101+
The transformation is based on the latest version of the docling `DocumentConverter` (v2).
102+
The alternative of adopting the `instructlab-sdg` modules (e.g. the initial step of the `ilab data generate` pipeline) has been
103+
discarded because it generates documents according to the so-called legacy docling schema.
99104

100105
#### Usage
101-
The generated artifacts can later be used to generete and ingest the embeddings into a vector database.
106+
The generated artifacts can later be used to generate and ingest the embeddings into a vector database.
102107

103108
### 2.3 Document Processing Pipeline Options
109+
```bash
110+
% ilab data process --help
111+
Usage: ilab data process [OPTIONS] OUTPUT_DIR
104112

113+
The document processing pipeline
114+
115+
Options:
116+
--input DIRECTORY The folder with user documents to process.
117+
--help Show this message and exit.
118+
```
105119

106120
| Option Description | Default Value | CLI Flag | Environment Variable |
107121
|--------------------|---------------|----------|----------------------|
122+
| Folder containing the user documents. If omitted, the taxonomy is searched for updated knowledge documents. | | `--input` | `ILAB_PROCESS_INPUT` |
108123
| Base directories where models are stored. | `$HOME/.cache/instructlab/models` | `--model-dir` | `ILAB_MODEL_DIR` |
109124
| Name of the embedding model. | **TBD** | `--embedding-model` | `ILAB_EMBEDDING_MODEL_NAME` |
110125
111126
### 2.4 Embedding Ingestion Pipeline
112-
The proposal is to add an `ingest` sub-command to the `data` command group:
127+
The proposal is to add an `ingest` sub-command to the `data` command group.
128+
129+
For the Model Training path:
113130
```
114-
ilab data ingest /path/to/docs/folder
131+
ilab data ingest
132+
```
133+
134+
For the Taxonomy or Plug-and-Play RAG paths:
135+
```
136+
ilab data ingest /path/to/processed/folder
115137
```
116138
117139
#### Working Assumption
118140
The documents at the specified path have already been processed using the `data process` command or an equivalent method
119141
(see [Getting Started with Knowledge Contributions][ilab-knowledge]).
120142
121143
#### Command Purpose
122-
Generate the embeddings from the pre-processed documents at */path/to/docs/folder* folder and store them in the
123-
configured vector database.
144+
Generate the embeddings from the pre-processed documents.
145+
* For the Model Training path, the documents are located in the folder specified by the `generate.output_dir` configuration key
146+
(e.g. `_HOME_/.local/share/instructlab/datasets`).
147+
* In particular, only the latest folder whose name starts with `documents-` will be explored.
148+
* It must include a `docling-artifacts` subfolder with the actual JSON files.
149+
* If the */path/to/processed/folder* parameter is provided, it is used to look up the processed documents to ingest.
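
The lookup for the Model Training path could be sketched as follows. The function name `find_docling_artifacts` is hypothetical, and the proposal does not say how "latest" is determined: this sketch assumes it means the most recently modified folder, which may differ from the actual implementation.

```python
from pathlib import Path
from typing import List


def find_docling_artifacts(output_dir: Path) -> List[Path]:
    """Return the JSON files in the newest 'documents-*' folder's 'docling-artifacts' subfolder."""
    candidates = [p for p in output_dir.glob("documents-*") if p.is_dir()]
    if not candidates:
        raise FileNotFoundError(f"no 'documents-*' folder under {output_dir}")
    # Assumption: "latest" means most recently modified; the proposal does not define it.
    latest = max(candidates, key=lambda p: p.stat().st_mtime)
    artifacts = latest / "docling-artifacts"
    if not artifacts.is_dir():
        raise FileNotFoundError(f"missing 'docling-artifacts' in {latest}")
    return sorted(artifacts.glob("*.json"))
```

Raising `FileNotFoundError` early gives the user a clear message when the expected folder layout is missing, rather than silently ingesting nothing.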
124150
125151
**Notes**:
126152
* To ensure consistency and avoid issues with document versioning or outdated embeddings, the ingested collection will be cleared before execution.
@@ -138,17 +164,35 @@ The generated embeddings can later be retrieved from a vector database and conve
138164
context for RAG-based chat pipelines.
139165
140166
### 2.5 Embedding Ingestion Pipeline Options
167+
```bash
168+
% ilab data ingest --help
169+
Usage: ilab data ingest [OPTIONS] INPUT_DIR
170+
171+
The embedding ingestion pipeline
172+
173+
Options:
174+
--document-store-type TEXT The document store type, one of:
175+
`milvuslite`, `milvus`.
176+
--document-store-uri TEXT The document store URI
177+
--document-store-collection-name TEXT
178+
The document store collection name
179+
--model-dir TEXT Base directories where models are stored.
180+
[default: (The default system model location
181+
store, located in the data directory.)]
182+
--embedding-model TEXT The embedding model name
183+
--help Show this message and exit.
184+
```
141185
142186
| Option Description | Default Value | CLI Flag | Environment Variable |
143187
|--------------------|---------------|----------|----------------------|
144-
| Vector DB implementation, one of: `milvuslite`, **TBD** | `milvuslite` | `--vectordb-type` | `ILAB_VECTORDB_TYPE` |
145-
| Vector DB service URI. | `./rag-output.db` | `--vectordb-uri` | `ILAB_VECTORDB_URI` |
146-
| Vector DB collection name. | `IlabEmbeddings` | `--vectordb-collection-name` | `ILAB_VECTORDB_COLLECTION_NAME` |
147-
| Base directories where models are stored. | `$HOME/.cache/instructlab/models` | `--model-dir` | `ILAB_MODEL_DIR` |
148-
| Name of the embedding model. | **TBD** | `--embedding-model` | `ILAB_EMBEDDING_MODEL_NAME` |
188+
| Document store implementation, one of: `milvuslite`, **TBD** | `milvuslite` | `--document-store-type` | `ILAB_DOCUMENT_STORE_TYPE` |
189+
| Document store service URI. | `./embeddings.db` | `--document-store-uri` | `ILAB_DOCUMENT_STORE_URI` |
190+
| Document store collection name. | `IlabEmbeddings` | `--document-store-collection-name` | `ILAB_DOCUMENT_STORE_COLLECTION_NAME` |
191+
| Base directories where models are stored. | `$HOME/.cache/instructlab/models` | `--retriever-embedder-model-dir` | `ILAB_EMBEDDER_MODEL_DIR` |
192+
| Name of the embedding model. | **TBD** | `--retriever-embedder-model-name` | `ILAB_EMBEDDER_MODEL_NAME` |
149193
150194
### 2.6 RAG Chat Pipeline Command
151-
The proposal is to add a `--rag` flag to the `model chat` command, like:
195+
The proposal is to add a `chat.rag.enable` configuration (or the equivalent `--rag` flag) to the `model chat` command, like:
152196
```
153197
ilab model chat --rag
154198
```
@@ -212,21 +256,26 @@ but we'll use flags and environment variables for the options that come from the
212256
213257
| Configuration FQN | Description | Default Value | CLI Flag | Environment Variable |
214258
|-------------------|-------------|---------------|----------|----------------------|
215-
| chat.rag.enabled | Enable or disable the RAG pipeline. | `false` | `--rag` (boolean)| `ILAB_CHAT_RAG_ENABLED` |
216-
| chat.rag.retriever.top_k | The maximum number of documents to retrieve. | `10` | `--retriever-top-k` | `ILAB_CHAT_RAG_RETRIEVER_TOP_K` |
217-
| | Vector DB implementation, one of: `milvuslite`, **TBD** | `milvuslite` | `--vectordb-type` | `ILAB_VECTORDB_TYPE` |
218-
| | Vector DB service URI. | `./rag-output.db` | `--vectordb-uri` | `ILAB_VECTORDB_URI` |
219-
| | Vector DB collection name. | `IlabEmbeddings` | `--vectordb-collection-name` | `ILAB_VECTORDB_COLLECTION_NAME` |
220-
| | Base directories where models are stored. | `$HOME/.cache/instructlab/models` | `--model-dir` | `ILAB_MODEL_DIR` |
221-
| | Name of the embedding model. | **TBD** | `--model` | `ILAB_EMBEDDING_MODEL_NAME` |
259+
| chat.rag.enabled | Enable or disable the RAG pipeline. | `false` | `--rag` (boolean)| `ILAB_RAG` |
260+
| chat.rag.retriever.top_k | The maximum number of documents to retrieve. | `10` | `--retriever-top-k` | `ILAB_RETRIEVER_TOP_K` |
261+
| | Document store implementation, one of: `milvuslite`, **TBD** | `milvuslite` | `--document-store-type` | `ILAB_DOCUMENT_STORE_TYPE` |
262+
| | Document store service URI. | `./embeddings.db` | `--document-store-uri` | `ILAB_DOCUMENT_STORE_URI` |
263+
| | Document store collection name. | `IlabEmbeddings` | `--document-store-collection-name` | `ILAB_DOCUMENT_STORE_COLLECTION_NAME` |
264+
| | Base directories where models are stored. | `$HOME/.cache/instructlab/models` | `--retriever-embedder-model-dir` | `ILAB_EMBEDDER_MODEL_DIR` |
265+
| | Name of the embedding model. | **TBD** | `--retriever-embedder-model-name` | `ILAB_EMBEDDER_MODEL_NAME` |
222266
223267
Equivalent YAML document for the newly proposed options:
224268
```yaml
225269
chat:
226-
rag:
227-
enabled: false
270+
enable: false
228271
retriever:
229-
top_k: 10
272+
top_k: 20
273+
embedder:
274+
model_name: sentence-transformers/all-minilm-l6-v2
275+
document_store:
276+
type: milvuslite
277+
uri: embeddings.db
278+
collection_name: Ilab
230279
```
231280
232281
### 2.9 References
@@ -236,7 +285,8 @@ chat:
236285
237286
238287
### 2.10 Workflow Visualization
239-
<!-- https://excalidraw.com/#json=PN2h_LM-Wd2WZYBJfZMDs,WQCq5NDbRXUH2qr8maFFNg -->
288+
(Link to [shared Excalidraw][shared-excalidraw])
289+
240290
Embedding ingestion pipeline:
241291
![ingestion-mvp](../images/ingestion-mvp.png)
242292
RAG-based Chat pipeline:
@@ -300,8 +350,10 @@ ilab model chat --rag --retriever-type api --retriever-uri http://localhost:8123
300350
```
301351
302352
[ilab-knowledge]: https://github.com/instructlab/taxonomy?tab=readme-ov-file#getting-started-with-knowledge-contributions
353+
[sdg-diff-strategy]: https://github.com/instructlab/sdg/blob/main/src/instructlab/sdg/utils/taxonomy.py
303354
[chat_template]: https://github.com/instructlab/instructlab/blob/0a773f05f8f57285930df101575241c649f591ce/src/instructlab/configuration.py#L244
304355
[augment_chat_template]: https://github.com/instructlab/instructlab/blob/48e3f7f1574ae50036d6e342b8d78d8eb9546bd5/src/instructlab/model/backends/llama_cpp.py#L281
305356
[ranking]: https://docs.haystack.deepset.ai/v1.21/reference/ranker-api
306357
[expansion]: https://haystack.deepset.ai/blog/query-expansion
307-
[chunkers]: https://github.com/DS4SD/docling/blob/main/docs/concepts/chunking.md
358+
[chunkers]: https://github.com/DS4SD/docling/blob/main/docs/concepts/chunking.md
359+
[shared-excalidraw]: https://excalidraw.com/#json=D_sPMvwB0XbCVoBL1hyAi,R_rUo6ljInJPrcWnbOO5pQ
