This repository was archived by the owner on Sep 9, 2025. It is now read-only.
docs/cli/ilab-rag-retrieval.md
+95 -43 (95 additions & 43 deletions)
@@ -14,6 +14,7 @@ This document proposes enhancements to the `ilab` CLI to support workflows utili
14
14
(RAG) artifacts within `InstructLab`. The proposed changes introduce new commands and options for the embedding ingestion
15
15
and RAG-based chat pipelines:
16
16
* A new `ilab data` sub-command to process customer documentation.
17
+
* Either from knowledge taxonomy or from actual user documents.
17
18
* A new `ilab data` sub-command to generate and ingest embeddings from pre-processed documents into a configured vector store.
18
19
* An option to enhance the chat pipeline by using the stored embeddings to augment the context of conversations, improving relevance and accuracy.
19
20
@@ -48,15 +49,15 @@ cases.
48
49
To maintain compatibility and simplicity, no new configurations will be introduced for new commands. Instead,
49
50
the settings will be defined using the following hierarchy (options higher in the list overriding those below):
50
51
* CLI flags (e.g., `--FLAG`).
51
-
* Environment variables following a consistent naming convention, such as `ILAB_<UPPERCASE_ARGUMENT_NAME>`.
52
+
* Environment variables following a consistent naming convention, such as `ILAB_<UPPERCASE_FLAG_NAME>`.
52
53
* Default values, for all the applicable use cases.
53
54
54
55
For example, the `document-store-uri` argument can be implemented using the `click` module like this:
55
56
```py
56
57
@click.option(
57
-
"--vectordb-uri",
58
+
"--document-store-uri",
58
59
default='rag-output.db',
59
-
envvar="ILAB_VECTORDB_URI",
60
+
envvar="ILAB_DOCUMENT_STORE_URI",
60
61
)
61
62
```
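The precedence described above (CLI flag, then environment variable, then default) can be sketched without `click`, using only the standard library. This is an illustration of the rule, not the actual `ilab` implementation: `env_var_name` and `resolve_setting` are hypothetical helper names, and the flag-to-variable mapping assumes the `ILAB_<UPPERCASE_FLAG_NAME>` convention is applied mechanically.

```python
import os
from typing import Mapping, Optional


def env_var_name(flag: str) -> str:
    """Map a CLI flag such as '--document-store-uri' to 'ILAB_DOCUMENT_STORE_URI'."""
    return "ILAB_" + flag.lstrip("-").replace("-", "_").upper()


def resolve_setting(
    flag: str,
    cli_value: Optional[str],
    default: str,
    environ: Mapping[str, str] = os.environ,
) -> str:
    """Return the effective value: CLI flag first, then env var, then default."""
    if cli_value is not None:
        # An explicit CLI flag overrides everything else.
        return cli_value
    # Fall back to the environment variable, and finally to the default.
    return environ.get(env_var_name(flag), default)


if __name__ == "__main__":
    # No CLI value given: the environment variable wins if set, else the default.
    print(resolve_setting("--document-store-uri", None, "rag-output.db"))
```

With `click`, the same ordering comes from combining `default=` and `envvar=` on a single option, as in the snippet above.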
62
63
@@ -72,55 +73,80 @@ If the configured embedding model has not been cached, the command execution wil
72
73
consistently to all new and updated commands.
73
74
74
75
### 2.2 Document Processing Pipeline
75
-
The proposal is to add a `process` sub-command to the `data` command group:
76
+
The proposal is to add a `process` sub-command to the `data` command group.
77
+
78
+
For the Taxonomy path (no Model Training):
76
79
```
77
-
ilab data process --input /path/to/docs/folder --output /path/to/processed/folder
80
+
ilab data process /path/to/processed/folder
78
81
```
79
82
83
+
For the Plug-and-Play RAG path:
84
+
```
85
+
ilab data process --input /path/to/docs/folder /path/to/processed/folder
86
+
```
87
+
80
88
#### Command Purpose
81
-
Applies the transformation for the customer documents in `/path/to/docs/folder`. Processed artifacts are stored under `/path/to/processed/folder`.
89
+
Applies the docling transformation to the customer documents.
90
+
* Original documents are located in the `/path/to/docs/folder` input folder or in the taxonomy knowledge branch.
91
+
* In the latter case, the input documents are the knowledge documents retrieved from the installed taxonomy repository
92
+
according to the [SDG diff strategy][sdg-diff-strategy], e.g. `the new or changed YAMLs using git diff, including untracked files`.
93
+
* Processed artifacts are stored under `/path/to/processed/folder`.
82
94
83
95
**Notes**:
84
-
* In alignment with the current SDG implementation, the folder will not be navigated recursively. Only files located at the root level of the specified
85
-
folder will be considered. The same principle applies to all other options outlined below.
86
-
* To ensure consistency and avoid issues with document versioning or outdated artifacts, the destination folder will be cleared before execution.
87
-
This ensures it contains only the artifacts generated from the most recent run.
96
+
* In alignment with the current SDG implementation, the `--input` folder will not be navigated recursively. Only files located at the root
97
+
level of the specified folder will be considered. The same principle applies to all other options outlined below.
98
+
* To ensure consistency and avoid issues with document versioning or outdated artifacts, the destination folder will be cleared
99
+
before execution. This ensures it contains only the artifacts generated from the most recent run.
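The two notes above can be sketched as follows. This is a minimal illustration under stated assumptions, not the `ilab` code: `root_level_files` and `reset_output_dir` are hypothetical helper names, and "cleared" is assumed to mean deleting and recreating the destination folder.

```python
import shutil
from pathlib import Path


def root_level_files(folder: Path) -> list[Path]:
    """Return only the files directly inside `folder`; sub-folders are not visited."""
    return sorted(p for p in folder.iterdir() if p.is_file())


def reset_output_dir(folder: Path) -> None:
    """Remove `folder` and everything in it, then recreate it empty."""
    if folder.exists():
        shutil.rmtree(folder)
    folder.mkdir(parents=True)
```

Clearing the destination up front means a document removed from the input folder cannot leave a stale artifact behind from an earlier run.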
88
100
89
-
The trasformation is based on the `instructlab-sdg` modules (the initial step of the `ilab data generate` pipeline)
90
-
91
-
### Why We Need It
92
-
This command streamlines the `ilab data generate` pipeline and eliminates the requirement to define a `qna` document,
93
-
which typically includes:
94
-
* A minimum of 5×3 question-and-answer pairs.
95
-
* Reference documents stored in Git.
96
-
97
-
The goal is not to generate training data for InstructLab-trained models but to utilize the documents for RAG
98
-
workflows with pre-tuned models.
101
+
The transformation is based on the latest version of the docling `DocumentConverter` (v2).
102
+
The alternative of adopting the `instructlab-sdg` modules (e.g. the initial step of the `ilab data generate` pipeline) has been
103
+
discarded because it generates documents according to the so-called legacy docling schema.
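A rough sketch of what a docling v2 based transformation might look like is shown below. It assumes docling's `DocumentConverter().convert(...)` and the resulting document's `export_to_dict()`. The names `make_docling_converter` and `process_documents`, the injectable `convert` parameter, and the one-JSON-file-per-document output layout are illustrative assumptions, not part of the proposal.

```python
import json
from pathlib import Path
from typing import Callable, Optional


def make_docling_converter() -> Callable[[Path], dict]:
    """Build a converter function backed by docling v2 (imported lazily)."""
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()  # created once, reused for every file

    def convert(source: Path) -> dict:
        return converter.convert(source).document.export_to_dict()

    return convert


def process_documents(
    input_dir: Path,
    output_dir: Path,
    convert: Optional[Callable[[Path], dict]] = None,
) -> list[Path]:
    """Convert each root-level file in `input_dir` and write one JSON file per document."""
    if convert is None:
        convert = make_docling_converter()
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for source in sorted(p for p in input_dir.iterdir() if p.is_file()):
        target = output_dir / f"{source.stem}.json"  # assumed naming scheme
        target.write_text(json.dumps(convert(source)), encoding="utf-8")
        written.append(target)
    return written
```

Passing the converter in as a parameter keeps the folder-handling logic testable without docling installed.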
99
104
100
105
#### Usage
101
-
The generated artifacts can later be used to generete and ingest the embeddings into a vector database.
106
+
The generated artifacts can later be used to generate and ingest the embeddings into a vector database.
102
107
103
108
### 2.3 Document Processing Pipeline Options
109
+
```bash
110
+
% ilab data process --help
111
+
Usage: ilab data process [OPTIONS] OUTPUT_DIR
104
112
113
+
The document processing pipeline
114
+
115
+
Options:
116
+
--input DIRECTORY The folder with user documents to process.
117
+
--help Show this message and exit.
118
+
```
105
119
106
120
| Option Description | Default Value | CLI Flag | Environment Variable |
|--------------------|---------------|----------|----------------------|
| Location folder of user documents. In case it's missing, the taxonomy is navigated to look for updated knowledge documents. | | `--input` | `ILAB_PROCESS_INPUT` |
| Base directories where models are stored. | `$HOME/.cache/instructlab/models` | `--model-dir` | `ILAB_MODEL_DIR` |
| Name of the embedding model. | **TBD** | `--embedding-model` | `ILAB_EMBEDDING_MODEL_NAME` |
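The command shape and the `--input` fallback can be sketched with the standard library's `argparse`. The real command would presumably be built with `click`, which is avoided here only to keep the example dependency-free. `build_parser` and `select_input` are hypothetical names, and returning `None` to mean "use the taxonomy" is an assumption made for illustration.

```python
import argparse
import os
from pathlib import Path
from typing import Mapping, Optional


def build_parser() -> argparse.ArgumentParser:
    """Mirror `ilab data process [OPTIONS] OUTPUT_DIR` with an optional --input."""
    parser = argparse.ArgumentParser(
        prog="ilab data process",
        description="The document processing pipeline",
    )
    parser.add_argument("output_dir", metavar="OUTPUT_DIR", type=Path)
    parser.add_argument(
        "--input",
        metavar="DIRECTORY",
        type=Path,
        default=None,
        help="The folder with user documents to process.",
    )
    return parser


def select_input(
    args: argparse.Namespace,
    environ: Mapping[str, str] = os.environ,
) -> Optional[Path]:
    """Return the user documents folder, or None when the taxonomy should be used."""
    if args.input is not None:
        return args.input  # the CLI flag takes precedence
    env_value = environ.get("ILAB_PROCESS_INPUT")
    if env_value:
        return Path(env_value)  # then the environment variable
    return None  # no folder given: fall back to the taxonomy knowledge documents
```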
110
125
111
126
### 2.4 Embedding Ingestion Pipeline
112
-
The proposal is to add an `ingest` sub-command to the `data` command group:
127
+
The proposal is to add an `ingest` sub-command to the `data` command group.
128
+
129
+
For the Model Training path:
113
130
```
114
-
ilab data ingest /path/to/docs/folder
131
+
ilab data ingest
132
+
```
133
+
134
+
For the Taxonomy or Plug-and-Play RAG paths:
135
+
```
136
+
ilab data ingest /path/to/processed/folder
115
137
```
116
138
117
139
#### Working Assumption
118
140
The documents at the specified path have already been processed using the `data process` command or an equivalent method
119
141
(see [Getting Started with Knowledge Contributions][ilab-knowledge]).
120
142
121
143
#### Command Purpose
122
-
Generate the embeddings from the pre-processed documents at */path/to/docs/folder* folder and store them in the
123
-
configured vector database.
144
+
Generate the embeddings from the pre-processed documents.
145
+
* For the Model Training path, the documents are located in the folder specified by the `generate.output_dir` configuration key