DRAC: Dynamic Retrieval Across Content

A production-ready Retrieval-Augmented Generation system that processes and queries multiple data formats including images, text documents, and PDFs with mixed content.

Core Capabilities

  • Multi-format document processing (TXT, PDF, PNG, JPG, JPEG, DOCX, XLSX)
  • Dense vector search using OpenAI embeddings
  • Sparse retrieval using BM25
  • Hybrid search combining dense and sparse methods
  • Reranking for improved results
  • OCR for image and PDF text extraction
  • Intelligent text chunking
  • Metadata management
  • Caching for performance
  • LLM traceability
  • Input/output guardrails
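
The README does not say how DRAC merges the dense and BM25 result lists. One widely used option is reciprocal rank fusion (RRF), sketched below purely as an illustration. The function name, the constant k=60, and the document IDs are assumptions, not code from this repository.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a conventional default, assumed here.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a dense (embedding) search and a BM25 search.
dense_hits = ["doc_a", "doc_b", "doc_c"]
sparse_hits = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

RRF only needs rank positions, so it avoids having to normalize cosine similarities against BM25 scores, which live on different scales.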

Features

  • Batch document upload
  • Async processing
  • Query expansion
  • Cross-modal retrieval
  • RESTful API with FastAPI
  • Interactive web interface
  • Comprehensive logging
  • Unit tests

Prerequisites

  • Python 3.11.9
  • Tesseract OCR
  • OpenAI API key
  • Windows 11 / Linux / macOS

Installation

1. Clone Repository

git clone https://github.com/stonedseeker/Drac.git
cd Drac

2. Create Conda Environment

conda create -n Drac python=3.11.9 -y
conda activate Drac

3. Install Dependencies

pip install -r requirements.txt

4. Install Tesseract OCR

Windows:

Download and install from https://github.com/UB-Mannheim/tesseract/wiki
Default path: C:\Program Files\Tesseract-OCR\tesseract.exe

Linux:

sudo apt-get install tesseract-ocr

macOS:

brew install tesseract

5. Configure Environment

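Edit .env (the TESSERACT_CMD value below is the Windows default; on Linux or macOS, set it to the path of your tesseract binary):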

OPENAI_API_KEY=your_openai_api_key_here
TESSERACT_CMD=C:\Program Files\Tesseract-OCR\tesseract.exe
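
How the application loads .env is not described here. As a quick sanity check before starting the server, a standard-library-only sketch like the following can confirm that both settings are present. The helper names are hypothetical, and the parser handles only plain KEY=VALUE lines.

```python
from pathlib import Path


def parse_env_text(text):
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Values are kept verbatim, so Windows backslashes survive untouched.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_settings(values, required=("OPENAI_API_KEY", "TESSERACT_CMD")):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")
    if env_path.exists():
        settings = parse_env_text(env_path.read_text(encoding="utf-8"))
        print("Missing settings:", missing_settings(settings) or "none")
    else:
        print("No .env file found in the current directory")
```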

Usage

Start the API Server

cd Drac
python -m uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

The server will be available at: http://localhost:8000

API Documentation

Interactive API docs: http://localhost:8000/docs

API Endpoints

Upload Document

curl -X POST "http://localhost:8000/api/upload" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@document.pdf"

Batch Upload

curl -X POST "http://localhost:8000/api/upload/batch" \
  -F "files=@doc1.pdf" \
  -F "files=@doc2.txt" \
  -F "files=@image.png"
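
The same batch upload can be driven from Python. The sketch below mirrors the curl command by repeating the `files` field once per document. It assumes the endpoint returns a JSON body, which the README does not confirm, and the helper names are hypothetical.

```python
import mimetypes
from pathlib import Path


def build_batch_files(paths, field="files"):
    """Build the multipart parts for a batch upload.

    Every part reuses the same field name, mirroring the repeated
    -F "files=@..." flags in the curl command.
    """
    parts = []
    for path in map(Path, paths):
        content_type = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
        parts.append((field, (path.name, path.read_bytes(), content_type)))
    return parts


def upload_batch(paths, base_url="http://localhost:8000"):
    """POST the files to /api/upload/batch and return the decoded JSON body."""
    import requests  # third-party; imported lazily so the helper above stays stdlib-only

    response = requests.post(
        f"{base_url}/api/upload/batch",
        files=build_batch_files(paths),
        timeout=120,
    )
    response.raise_for_status()  # raise on 4xx/5xx responses
    return response.json()
```

With the server running, `upload_batch(["doc1.pdf", "doc2.txt", "image.png"])` would send the same three files as the curl example.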

Query Documents

curl -X POST "http://localhost:8000/api/query" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What is machine learning?",
    "top_k": 10,
    "enable_reranking": true
  }'

Health Check

curl http://localhost:8000/health
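
In scripts and CI jobs it can help to wait for the health endpoint before sending uploads or queries. The following is a minimal sketch using only the standard library. It assumes the endpoint answers with HTTP 200 once the server is ready, and the function names are hypothetical.

```python
import time
import urllib.error
import urllib.request


def is_healthy(url="http://localhost:8000/health", timeout=2.0):
    """Return True if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def wait_until_healthy(probe=is_healthy, attempts=10, delay=1.0):
    """Call probe up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

Passing the probe as a parameter keeps the retry loop testable without a running server.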

Sample Queries

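import requests

# Ask for the five best matches, restricted to PDF and XLSX files.
response = requests.post('http://localhost:8000/api/query', json={
    'query': 'Find documents about sales data',
    'top_k': 5,
    'enable_reranking': True,
    'file_types': ['pdf', 'xlsx']
}, timeout=60)
response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error body

results = response.json()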

Testing

Run all tests:

pytest tests/ -v

Run specific test file:

pytest tests/test_ingestion.py -v

Run with coverage:

pytest tests/ --cov=app --cov-report=html

Future Enhancements

  • Support for audio/video files
  • Multi-language support
  • Document summarization
  • Conversation memory
  • Advanced analytics
  • User authentication
  • Cloud deployment
  • GPU acceleration

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests
  5. Submit a pull request
