diff --git a/content/patterns/rag-llm-cpu/_index.md b/content/patterns/rag-llm-cpu/_index.md
new file mode 100644
index 000000000..7559814c6
--- /dev/null
+++ b/content/patterns/rag-llm-cpu/_index.md
@@ -0,0 +1,72 @@
+---
+title: RAG LLM chatbot on CPU
+date: 2025-10-24
+tier: sandbox
+summary: This pattern deploys a CPU-based LLM, your choice of several RAG DB providers, and a simple chatbot UI that exposes the configuration and results of the RAG queries.
+rh_products:
+  - Red Hat OpenShift Container Platform
+  - Red Hat OpenShift GitOps
+  - Red Hat OpenShift AI
+partners:
+  - Microsoft
+industries:
+  - General
+aliases: /rag-llm-cpu/
+links:
+  github: https://github.com/validatedpatterns-sandbox/rag-llm-cpu
+  install: getting-started
+  bugs: https://github.com/validatedpatterns-sandbox/rag-llm-cpu/issues
+  feedback: https://docs.google.com/forms/d/e/1FAIpQLScI76b6tD1WyPu2-d_9CCVDr3Fu5jYERthqLKJDUGwqBg7Vcg/viewform
+---
+
+# **CPU-based RAG LLM chatbot**
+
+## **Introduction**
+
+This Validated Pattern deploys a Retrieval-Augmented Generation (RAG) chatbot on Red Hat OpenShift by using Red Hat OpenShift AI. The pattern runs entirely on CPU nodes without requiring GPU hardware, making it a cost-effective and accessible solution for environments where GPU resources are limited or unavailable.
+It provides a secure, flexible, and production-ready starting point for building and deploying on-premises generative AI applications.
+
+## **Target audience**
+
+This pattern is designed for:
+
+- **Developers and data scientists** looking to build and experiment with RAG-based LLM applications.
+- **MLOps and DevOps engineers** responsible for deploying and managing AI/ML workloads on OpenShift.
+- **Architects** evaluating cost-effective methods for delivering generative AI capabilities on-premises.
+
+## **Why use this pattern?**
+
+- **Cost-effective:** Runs entirely on CPU, removing the need for expensive and often scarce GPU resources.
+- **Flexible:** Supports multiple vector database backends (Elasticsearch, PGVector, Microsoft SQL Server) to integrate with your existing data infrastructure.
+- **Transparent:** The Gradio front end is designed to expose the internals of the RAG query and LLM prompts, giving you clear insight into the generation process.
+- **Extensible:** Built on open source standards (KServe, OpenAI-compatible API) to serve as a robust foundation for more complex applications.
+
+## **Architecture overview**
+
+At a high level, the components work together as follows:
+
+1. A user enters a query into the **Gradio UI**.
+2. The backend application, using **LangChain**, first queries a configured **Vector database** to retrieve relevant documents (the "R" in RAG).
+3. These documents are combined with the user's original query into a prompt.
+4. The prompt is sent to the **KServe-deployed LLM** (running via llama.cpp on a CPU node).
+5. The LLM generates a response, which is streamed back to the Gradio UI for the user.
+6. **Vault** securely provides the necessary credentials for the vector database and HuggingFace token at runtime.
+
+![Overview](/images/rag-llm-cpu/rag-augmented-query.png)
+
+_Figure 1. Overview of the RAG query from the user's perspective._
+
+## **Prerequisites**
+
+Before you begin, ensure you have access to the following:
+
+- A Red Hat OpenShift cluster (version 4.x); a size of at least 2 `m5.4xlarge` nodes is recommended.
+- A HuggingFace API token.
+- Command-line tools: Podman.
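+
+You can quickly confirm the tooling and cluster access before you begin. A minimal check, assuming the `oc` client is installed locally alongside Podman:
+
+```sh
+podman --version
+oc whoami --show-console   # should print the console URL of the target cluster
+oc get nodes               # confirm suitably sized worker nodes are available
+```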
+
+## **What this pattern provides**
+
+- A [KServe](https://github.com/kserve/kserve)-based LLM deployed to [RHOAI](https://www.redhat.com/en/products/ai/openshift-ai) that runs entirely on a CPU node with a [llama.cpp](https://github.com/ggml-org/llama.cpp) runtime.
+- A choice of one or more vector DB providers to serve as a RAG backend, with configurable web-based or git repo-based sources. Vector embedding and document retrieval are implemented with [LangChain](https://docs.langchain.com/oss/python/langchain/overview).
+- [Vault](https://developer.hashicorp.com/vault)-based secret management for the HuggingFace API token and credentials for supported databases ([Elasticsearch](https://www.elastic.co/docs/solutions/search/vector), [PGVector](https://github.com/pgvector/pgvector), [Microsoft SQL Server](https://learn.microsoft.com/en-us/sql/sql-server/ai/vectors?view=sql-server-ver17)).
+- A [Gradio](https://www.gradio.app/)-based front end for connecting to multiple [OpenAI API-compatible](https://github.com/openai/openai-openapi) LLMs that exposes the internals of the RAG query and LLM prompts so that users have better insight into what is running.
diff --git a/content/patterns/rag-llm-cpu/configure.md b/content/patterns/rag-llm-cpu/configure.md
new file mode 100644
index 000000000..eff6fb411
--- /dev/null
+++ b/content/patterns/rag-llm-cpu/configure.md
@@ -0,0 +1,301 @@
+---
+title: Configuring the pattern
+weight: 20
+aliases: /rag-llm-cpu/configure/
+---
+
+# **Configuring the pattern**
+
+This guide covers common customizations, such as changing the default LLM, adding new models, and configuring RAG data sources.
+We assume you have already completed the [Getting Started](/rag-llm-cpu/getting-started/) guide.
+
+## **How configuration works**
+
+This pattern is managed by ArgoCD (GitOps). All application configurations are defined in `values-prod.yaml`.
+To customize a component, you will typically:
+
+1. **Enable an override:** In `values-prod.yaml`, find the application you want to change (e.g., `llm-inference-service`) and add an `extraValueFiles:` entry pointing to a new override file (e.g., `$patternref/overrides/llm-inference-service.yaml`).
+2. **Create the override file:** Create the new `.yaml` file inside the `overrides/` directory.
+3. **Add your settings:** Add _only_ the specific values you want to change into this new file.
+4. **Commit and sync:** Commit your changes and let ArgoCD sync the application.
+
+## **Task: Change the default LLM**
+
+By default, the pattern deploys the `mistral-7b-instruct-v0.2.Q5_0.gguf` model. You might want to change this to a different model (e.g., a different quantization) or adjust its resource usage.
+You can do this by creating an override file for the _existing_ `llm-inference-service` application.
+
+1. **Enable the override:**
+   In `values-prod.yaml`, update the `llm-inference-service` application to use an override file:
+
+   ```yaml
+   clusterGroup:
+     # ...
+     applications:
+       # ...
+       llm-inference-service:
+         name: llm-inference-service
+         namespace: rag-llm-cpu
+         chart: llm-inference-service
+         chartVersion: 0.3.*
+         extraValueFiles: # <-- ADD THIS BLOCK
+           - $patternref/overrides/llm-inference-service.yaml
+   ```
+
+2. **Create the override file:**
+   Create a new file `overrides/llm-inference-service.yaml`.
+   Here is an example that switches to a different model file (Q8_0) and increases the CPU/memory requests:
+
+   ```yaml
+   inferenceService:
+     resources: # <-- Increased allocated resources
+       requests:
+         cpu: "8"
+         memory: 12Gi
+       limits:
+         cpu: "12"
+         memory: 24Gi
+
+   servingRuntime:
+     args:
+       - --model
+       - /models/mistral-7b-instruct-v0.2.Q8_0.gguf # <-- Changed model file
+
+   model:
+     repository: TheBloke/Mistral-7B-Instruct-v0.2-GGUF
+     files:
+       - mistral-7b-instruct-v0.2.Q8_0.gguf # <-- Changed file to download
+   ```
+
+## **Task: Add a second LLM**
+
+You can also deploy an entirely separate, second LLM and add it to the demo user interface (UI). This example deploys a different runtime, HuggingFace TGI, instead of `llama.cpp`.
+
+This is a two-step process:
+
+1. Deploy the new LLM.
+2. Tell the front end UI about it.
+
+### **Step 1: Deploy the new LLM service**
+
+1. **Define the new application:**
+   In `values-prod.yaml`, add a new application to the `applications` list. We'll call it `another-llm-inference-service`.
+
+   ```yaml
+   clusterGroup:
+     # ...
+     applications:
+       # ...
+       another-llm-inference-service: # <-- ADD THIS NEW APPLICATION
+         name: another-llm-inference-service
+         namespace: rag-llm-cpu
+         chart: llm-inference-service
+         chartVersion: 0.3.*
+         extraValueFiles:
+           - $patternref/overrides/another-llm-inference-service.yaml
+   ```
+
+2. **Create the override file:**
+   Create the new file `overrides/another-llm-inference-service.yaml`. This file defines the new model and disables creation of resources, such as secrets, that the first LLM service already created.
+
+   ```yaml
+   dsc:
+     initialize: false
+   externalSecret:
+     create: false
+
+   # Define the new InferenceService
+   inferenceService:
+     name: hf-inference-service # <-- New service name
+     minReplicas: 1
+     maxReplicas: 1
+     resources:
+       requests:
+         cpu: "8"
+         memory: 32Gi
+       limits:
+         cpu: "12"
+         memory: 32Gi
+
+   # Define the new runtime (HuggingFace TGI)
+   servingRuntime:
+     name: hf-runtime
+     port: 8080
+     image: docker.io/kserve/huggingfaceserver:latest
+     modelFormat: huggingface
+     args:
+       - --model_dir
+       - /models
+       - --model_name
+       - /models/Mistral-7B-Instruct-v0.3
+       - --http_port
+       - "8080"
+
+   # Define the new model to download
+   model:
+     repository: mistralai/Mistral-7B-Instruct-v0.3
+     files:
+       - generation_config.json
+       - config.json
+       - model.safetensors.index.json
+       - model-00001-of-00003.safetensors
+       - model-00002-of-00003.safetensors
+       - model-00003-of-00003.safetensors
+       - tokenizer.model
+       - tokenizer.json
+       - tokenizer_config.json
+   ```
+
+   > **Warning:** There is currently a bug in the model-downloading container that requires you to explicitly list _all_ files you want to download from the HuggingFace repository. Make sure you list every file needed for the model to run.
+
+### **Step 2: Add the new LLM to the demo UI**
+
+Now, tell the front end that this new LLM exists.
+
+1. **Edit the front end overrides:**
+   Open `overrides/rag-llm-frontend-values.yaml` (this file should already exist from the initial setup).
+2. **Update LLM_URLS:**
+   Add the URL of your new service to the `LLM_URLS` environment variable. The URL follows the format `http://<inference-service-name>-predictor/v1` (or `http://<inference-service-name>-predictor/openai/v1` for the HF runtime).
+
+   In `overrides/rag-llm-frontend-values.yaml`:
+
+   ```yaml
+   env:
+     # ...
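+     # Each entry is the in-cluster base URL of an OpenAI API-compatible endpoint,
+     # for example http://<inference-service-name>-predictor/v1 for the llama.cpp
+     # runtime, or http://<inference-service-name>-predictor/openai/v1 for the HF runtime.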
+     - name: LLM_URLS
+       value: '["http://cpu-inference-service-predictor/v1","http://hf-inference-service-predictor/openai/v1"]'
+   ```
+
+## **Task: Customize RAG data sources**
+
+By default, the pattern loads data from the Validated Patterns documentation. You can change this to point to your own public git repositories or web pages.
+
+1. **Edit the Vector DB overrides:**
+   Open `overrides/vector-db-values.yaml` (this file should already exist).
+2. **Update sources:**
+   Modify the `repoSources` and `webSources` keys. You can add any publicly available Git repository (using globs to filter files) or public web URLs. The job will also process PDFs from `webSources`.
+
+   In `overrides/vector-db-values.yaml`:
+
+   ```yaml
+   providers:
+     qdrant:
+       enabled: true
+     mssql:
+       enabled: true
+
+   vectorEmbedJob:
+     repoSources:
+       - repo: https://github.com/your-org/your-docs.git # <-- Your repo
+         globs:
+           - "**/*.md"
+     webSources:
+       - https://your-company.com/product-manual.pdf # <-- Your PDF
+     chunking:
+       size: 4096
+   ```
+
+## **Task: Add a new RAG database provider**
+
+By default, the pattern enables _qdrant_ and _mssql_. You can also enable _redis_, _pgvector_ (Postgres), or _elastic_ (Elasticsearch).
+This is a three-step process: (1) Add secrets, (2) Enable the DB, and (3) Tell the front end UI.
+
+### **Step 1: Update your secrets file**
+
+If your new DB requires credentials (like _pgvector_ or _elastic_), add them to your main secrets file:
+
+```sh
+vim ~/values-secret-rag-llm-cpu.yaml
+```
+
+Add the necessary credentials. For example:
+
+```yaml
+secrets:
+  # ...
+  - name: pgvector
+    fields:
+      - name: user
+        value: user # <-- Update the user
+      - name: password
+        value: password # <-- Update the password
+      - name: db
+        value: db # <-- Update the db
+```
+
+**Note:** Refer to the file [`values-secret.yaml.template`](https://github.com/validatedpatterns-sandbox/rag-llm-cpu/blob/main/values-secret.yaml.template) for a reference of which values are expected.
+
+### **Step 2: Enable the provider in the Vector DB chart**
+
+Edit `overrides/vector-db-values.yaml` and set `enabled: true` for the provider(s) you want to add.
+
+In `overrides/vector-db-values.yaml`:
+
+```yaml
+providers:
+  qdrant:
+    enabled: true
+  mssql:
+    enabled: true
+  pgvector: # <-- ADD THIS
+    enabled: true
+  elastic: # <-- OR THIS
+    enabled: true
+```
+
+### **Step 3: Add the provider to the demo UI**
+
+Finally, edit `overrides/rag-llm-frontend-values.yaml` to configure the UI. You must:
+
+1. Add the new provider's secrets to the `dbProvidersSecret.vault` list.
+2. Add the new provider's connection details to the `dbProvidersSecret.providers` list.
+
+Below is a complete example showing configuration for all the supported RAG DB providers, including the non-default ones:
+
+In `overrides/rag-llm-frontend-values.yaml`:
+
+```yaml
+dbProvidersSecret:
+  vault:
+    - key: mssql
+      field: sapassword
+    - key: pgvector # <-- Add this block
+      field: user
+    - key: pgvector
+      field: password
+    - key: pgvector
+      field: db
+    - key: elastic # <-- Add this block
+      field: user
+    - key: elastic
+      field: password
+  providers:
+    - type: qdrant # <-- Example for Qdrant
+      collection: docs
+      url: http://qdrant-service:6333
+      embedding_model: sentence-transformers/all-mpnet-base-v2
+    - type: mssql # <-- Example for MSSQL
+      table: docs
+      connection_string: >-
+        Driver={ODBC Driver 18 for SQL Server};
+        Server=mssql-service,1433;
+        Database=embeddings;
+        UID=sa;
+        PWD={{ .mssql_sapassword }};
+        TrustServerCertificate=yes;
+        Encrypt=no;
+      embedding_model: sentence-transformers/all-mpnet-base-v2
+    - type: redis # <-- Example for Redis
+      index: docs
+      url: redis://redis-service:6379
+      embedding_model: sentence-transformers/all-mpnet-base-v2
+    - type: elastic # <-- Example for Elastic
+      index: docs
+      url: http://elastic-service:9200
+      user: "{{ .elastic_user }}"
+      password: "{{ .elastic_password }}"
+      embedding_model: sentence-transformers/all-mpnet-base-v2
+    - type: pgvector # <-- Example for PGVector
+      collection: docs
+      url: >-
+        postgresql+psycopg://{{ .pgvector_user }}:{{ .pgvector_password }}@pgvector-service:5432/{{ .pgvector_db }}
+      embedding_model: sentence-transformers/all-mpnet-base-v2
+```
diff --git a/content/patterns/rag-llm-cpu/getting-started.md b/content/patterns/rag-llm-cpu/getting-started.md
new file mode 100644
index 000000000..8e0a2a718
--- /dev/null
+++ b/content/patterns/rag-llm-cpu/getting-started.md
@@ -0,0 +1,87 @@
+---
+title: Getting Started
+weight: 10
+aliases: /rag-llm-cpu/getting-started/
+---
+
+## Prerequisites
+
+- Podman is installed on your system.
+- You are logged in to an OpenShift 4 cluster with administrative permissions.
+
+## Deployment instructions
+
+1. Fork the [rag-llm-cpu](https://github.com/validatedpatterns-sandbox/rag-llm-cpu) git repository.
+
+2. Clone the forked repository by running the following command:
+
+   ```sh
+   $ git clone git@github.com:your-username/rag-llm-cpu.git
+   ```
+
+3. Change to the root directory of your cloned repository by running the following command:
+
+   ```sh
+   $ cd rag-llm-cpu
+   ```
+
+4. Create a local copy of the secret values file by running the following command:
+
+   ```sh
+   $ cp values-secret.yaml.template ~/values-secret-rag-llm-cpu.yaml
+   ```
+
+5. Create an API token on [HuggingFace](https://huggingface.co/).
+
+6. Update the secret values file:
+
+   ```sh
+   $ vim ~/values-secret-rag-llm-cpu.yaml
+   ```
+
+   At a minimum, you must update the value of the `token` field in the `huggingface` section with the API token from the previous step. Because this pattern deploys Microsoft SQL Server as one of the RAG DB providers by default, you should also update the `sapassword` field in the `mssql` section. If you plan to use other DB providers, update their secrets now as well.
+
+7. If you plan to install the pattern as-is, you can do so now without making any further changes. First, confirm that you are logged in to the cluster where you want to install the pattern:
+
+   ```sh
+   $ ./pattern.sh oc whoami --show-console
+   ```
+
+   The command should print the console URL of the target cluster.
+   If it does not, log in to your OpenShift cluster before running the following install command:
+
+   ```sh
+   $ ./pattern.sh make install
+   ```
+
+   The install command deploys all components of the pattern. If you want to check the status of all the components after the install completes, you can run:
+
+   ```sh
+   $ ./pattern.sh make argo-healthcheck
+   ```
+
+8. If you want to make changes to the pattern before installing it (using different RAG DB providers, changing the model deployed to the LLM, updating the sources for the RAG DBs, and so on), follow the instructions on the [Configuring the pattern](/rag-llm-cpu/configure/) page.
+
+## Verification
+
+1. Check that all applications are successfully installed:
+
+   ```sh
+   $ ./pattern.sh make argo-healthcheck
+   ```
+
+   It might take several minutes after the installation completes for all the applications to become synced and healthy, because downloading the LLM models and populating the RAG DBs takes time.
+
+   ![Healthcheck](/images/rag-llm-cpu/healthcheck.png)
+
+2. Open the RAG LLM Demo UI by clicking its link in the 9-dots application launcher menu.
+
+   ![9Dots](/images/rag-llm-cpu/9dots.png)
+
+3. Verify that the LLMs and RAG DB providers you configured are available in the UI configuration, and that submitting a query in the chatbot returns a response from the RAG DB and LLM you selected.
+   > **Note**: It might take a minute or so for the CPU-based LLM to start streaming a response, especially the first time you make a query after installing the pattern, because everything is loaded into memory.
+
+   ![App](/images/rag-llm-cpu/app.png)
+
+## Next steps
+
+Once the pattern is up and running, you might want to customize it (for example, change the LLM, add new RAG sources, or switch vector databases). For details on how to adapt the pattern to your use case, see [Configuring the pattern](/rag-llm-cpu/configure/).
diff --git a/static/images/rag-llm-cpu/9dots.png b/static/images/rag-llm-cpu/9dots.png
new file mode 100644
index 000000000..fe318b5a6
Binary files /dev/null and b/static/images/rag-llm-cpu/9dots.png differ
diff --git a/static/images/rag-llm-cpu/app.png b/static/images/rag-llm-cpu/app.png
new file mode 100644
index 000000000..36705b8d7
Binary files /dev/null and b/static/images/rag-llm-cpu/app.png differ
diff --git a/static/images/rag-llm-cpu/healthcheck.png b/static/images/rag-llm-cpu/healthcheck.png
new file mode 100644
index 000000000..06d30ad45
Binary files /dev/null and b/static/images/rag-llm-cpu/healthcheck.png differ
diff --git a/static/images/rag-llm-cpu/rag-augmented-query.png b/static/images/rag-llm-cpu/rag-augmented-query.png
new file mode 100644
index 000000000..d166c53ea
Binary files /dev/null and b/static/images/rag-llm-cpu/rag-augmented-query.png differ