LLMTwin

Build a personal “digital twin” by crawling your public footprint (GitHub, Medium, and generic articles) and structuring it into a queryable knowledge base. The project packages the crawling and ingestion as a ZenML pipeline, persists normalized documents in MongoDB via Pydantic, and exposes a simple CLI to run end‑to‑end ETL.

This repository is designed as a production‑lean ETL for LLM applications: ingest your data once, then use it for RAG, analytics, or fine‑tuning datasets.

What It Does

  • Crawl public sources: GitHub repositories, Medium articles, and arbitrary article links. A LinkedIn crawler is scaffolded but currently deprecated due to site changes.
  • Normalize into documents: Unified Pydantic models for Article, Post, and Repository, associated to a User.
  • Store in MongoDB: Clean insertion with UUIDs, bulk inserts, and simple find APIs.
  • Orchestrate with ZenML: A compact ETL pipeline with steps for user resolution and link crawling, including step metadata for traceability.
  • Run via CLI: One command to execute the pipeline from a YAML config.

Why It’s Interesting

  • Production‑minded data plumbing: Clean separation of concerns (application/domain/infrastructure), Pydantic models, and a minimal ODM for MongoDB.
  • Composable crawling: Dispatcher pattern routes each link to the right crawler; Selenium and LangChain are used pragmatically where they add value.
  • Pipeline discipline: ZenML pipeline and step metadata enable reproducible runs and observability.
  • LLM‑ready outputs: Persisted, normalized data can directly power RAG over a candidate’s work, portfolio summaries, or dataset creation for alignment.

Design Rationale & Benefits

  • ZenML for orchestration: Prefer pipelines over ad‑hoc scripts for clear DAGs, caching, and reproducibility. This makes runs traceable, parameterized via YAML, and easy to evolve into scheduled jobs or to swap stacks (local → cloud) without code changes.
  • MongoDB + Pydantic (minimal ODM): Content varies wildly (articles, posts, repositories). A schemaless document store fits naturally while Pydantic enforces types at the boundary. The custom ODM keeps dependencies light yet provides UUID handling, bulk inserts, and conversion helpers, making models portable and testable.
  • Dispatcher + crawler abstraction: Each site has unique structure/rate limits. The dispatcher decouples URL routing from extraction, enabling drop‑in crawlers for new domains and safe deprecation (e.g., LinkedIn) without touching pipeline code.
  • Selenium vs async HTML loaders: Dynamic sites (Medium, LinkedIn historically) need a browser; static pages benefit from fast async fetch + HTML→text via LangChain transformers. This hybrid keeps performance and reliability balanced.
  • Headless Chrome with autoinstaller: Minimizes setup friction across machines and CI, while still allowing deterministic scrapes with consistent rendering.
  • Text‑only GitHub tree: Storing normalized, whitespace‑trimmed text per file reduces storage, avoids binaries, and produces content that is immediately tokenization‑friendly for RAG and summarization.
  • Step metadata (observability): Per‑domain success counts and query metadata are attached to pipeline outputs, aiding quick diagnostics, alerts, or CI gating (“fail if <90% domains succeed”).
  • Secrets via ZenML and .env fallback: Secure, environment‑agnostic configuration. Local development uses .env; production can pull from ZenML Secret Store without code changes.
  • Deprecating brittle scrapes: LinkedIn methods explicitly raise DeprecationWarning. This makes compliance and maintenance intent obvious and reduces “silent failure” risk while keeping the scaffold for future API‑based replacements.
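The per-domain success counts mentioned above can be sketched in a few lines. This is a minimal stdlib-only illustration of the idea, not the project's actual step code; the function name and the `(link, succeeded)` result shape are assumptions for the example:

```python
from collections import defaultdict
from urllib.parse import urlparse

def successes_per_domain(results: list[tuple[str, bool]]) -> dict[str, str]:
    """Aggregate crawl outcomes into per-domain 'ok/total' counts."""
    ok: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for link, succeeded in results:
        domain = urlparse(link).netloc
        total[domain] += 1
        if succeeded:
            ok[domain] += 1
    return {d: f"{ok[d]}/{total[d]}" for d in total}

metadata = successes_per_domain([
    ("https://github.com/janedoe/my-repo", True),
    ("https://medium.com/@janedoe/my-post", False),
    ("https://medium.com/@janedoe/other", True),
])
print(metadata)  # {'github.com': '1/1', 'medium.com': '1/2'}
```

Attaching a mapping like this to the step output is what enables CI gates such as "fail if <90% of domains succeed".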

Use Cases

  • Personal RAG knowledge base: Index your repos and writings to power Q&A, interview prep, and on‑the‑fly code/content explanations.
  • Portfolio summarization: Generate highlights across GitHub and articles (topics, tech stack, impact) for resumes and personal sites.
  • Job matching and cover letters: Feed job descriptions + your corpus into a RAG pipeline to tailor cover letters or talk tracks with concrete evidence links.
  • Content analytics: Track themes over time, detect expertise areas, and aggregate performance/coverage across platforms.
  • Dataset creation for alignment: Export structured samples to instruction or preference datasets (hooks via domain/types) to fine‑tune small models.

Architecture

  • Pipeline (pipelines/digital_data_etl.py):
    • get_or_create_user: Resolves a User by name, creates if missing.
    • crawl_links: Iterates links, selects a crawler, extracts and saves documents; logs per‑domain success counts as step metadata.
  • Crawlers (llm_engineering/application/crawlers/…):
    • GithubCrawler: Clones a repo to a temp dir and captures a text‑only tree of files (ignores large/binary formats) into RepositoryDocument.
    • MediumCrawler: Headless Selenium + BeautifulSoup to extract title/subtitle/content into ArticleDocument.
    • CustomArticleCrawler: Uses LangChain Async HTML loader + HTML→text transformer for generic articles.
    • LinkedInCrawler: Scaffolded; currently raises DeprecationWarning for login/extract due to site changes.
    • CrawlerDispatcher: Regex‑based domain matching to select the appropriate crawler; defaults to CustomArticleCrawler.
  • Domain (llm_engineering/domain):
    • NoSQLBaseDocument: Minimal ODM to/from MongoDB (UUID handling, bulk insert, find/get_or_create) built on Pydantic v2.
    • ArticleDocument, PostDocument, RepositoryDocument, UserDocument: Typed, serializable models.
  • Infrastructure (llm_engineering/infrastructure/db/mongo.py):
    • Singleton connector to MongoDB; parameters loaded from .env or ZenML Secrets.
  • Settings (llm_engineering/settings.py):
    • Loads from ZenML secret store if available, else falls back to .env.
  • CLI (tools/run.py):
    • Click‑based entrypoint to run the ETL pipeline with caching and config options.
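The dispatcher's regex-based routing described above can be sketched as follows. This is an illustrative stand-alone version, assuming the register/fallback behavior the architecture list describes; the placeholder crawler classes and the exact regex are not the project's real implementation:

```python
import re

# Hypothetical stand-ins for the real crawler classes.
class GithubCrawler: ...
class MediumCrawler: ...
class CustomArticleCrawler: ...

class CrawlerDispatcher:
    def __init__(self) -> None:
        self._registry: dict[str, type] = {}

    def register(self, domain: str, crawler: type) -> "CrawlerDispatcher":
        # Build a pattern that matches the domain in the URL's host portion.
        pattern = r"https?://(www\.)?" + re.escape(domain) + r"/"
        self._registry[pattern] = crawler
        return self

    def get_crawler(self, url: str):
        for pattern, crawler in self._registry.items():
            if re.match(pattern, url):
                return crawler()
        # Unknown domains fall back to the generic article crawler.
        return CustomArticleCrawler()

dispatcher = (
    CrawlerDispatcher()
    .register("github.com", GithubCrawler)
    .register("medium.com", MediumCrawler)
)
print(type(dispatcher.get_crawler("https://github.com/janedoe/my-repo")).__name__)
# GithubCrawler
```

The payoff of this shape is that new crawlers are a `register` call away and deprecating one never touches pipeline code.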

Tech Stack

  • Python 3.10+
  • ZenML (pipelines and secrets)
  • MongoDB + PyMongo
  • Pydantic v2 + pydantic‑settings
  • Selenium + chromedriver‑autoinstaller + BeautifulSoup4
  • LangChain Community (async HTML loader, html2text transformer)
  • Click, Loguru, TQDM

Quickstart

  1. Clone and create a virtual environment
  • python -m venv .venv && source .venv/bin/activate
  • pip install -r requirements.txt
  2. Provide configuration
  • Copy .env.example to .env and fill in values, or create a ZenML secret named settings with the same keys.
  • Required variables:
    • DATABASE_HOST: Mongo connection string (e.g., mongodb://user:pass@127.0.0.1:27017).
    • DATABASE_NAME: Target database name (default twin).
  • Optional variables:
    • LINKEDIN_USERNAME, LINKEDIN_PASSWORD: Not required (LinkedIn crawler is deprecated).
  • Ensure MongoDB is reachable and credentials can write to DATABASE_NAME.
  3. Initialize ZenML (local stack is fine)
  • zenml init
  4. Prepare a pipeline config
  • See configs/digital_data_etl_example.yaml and copy/adjust as needed.
  5. Run the ETL pipeline
  • python tools/run.py --run-etl --etl-config-filename digital_data_etl_example.yaml
  • Add --no-cache to disable ZenML caching if you want a fresh run.
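For reference, a minimal configuration using the variables listed above might look like this (values are placeholders; the secret name `settings` and keys follow the Quickstart, and the commented command uses ZenML's standard `zenml secret create` CLI):

```shell
# .env (local development)
DATABASE_HOST=mongodb://user:pass@127.0.0.1:27017
DATABASE_NAME=twin

# Or register the same keys as a ZenML secret for production:
# zenml secret create settings \
#   --DATABASE_HOST="mongodb://user:pass@127.0.0.1:27017" \
#   --DATABASE_NAME="twin"
```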

Notes:

  • The Selenium crawlers run headless Chrome. Ensure Chrome is installed on your system; chromedriver-autoinstaller pins a matching driver automatically.
  • LinkedIn scraping is intentionally disabled and raises DeprecationWarning.

How To Extend

  • Pick a model: Reuse ArticleDocument, PostDocument, or RepositoryDocument. If the new source doesn’t fit, add a new subclass in llm_engineering/domain/documents.py and a DataCategory in llm_engineering/domain/types.py with the collection name.
  • Implement a crawler: Create llm_engineering/application/crawlers/<site>.py that subclasses BaseCrawler (static pages) or BaseSeleniumCrawler (dynamic pages). Implement extract(self, link, **kwargs) and persist via the chosen model.
  • Register with dispatcher: Add a registration in CrawlerDispatcher so URLs route to your crawler. You can chain this beside the existing .register_medium().register_github() calls.
  • Update config: Add your URLs to configs/<your_config>.yaml under parameters.links. No pipeline code changes needed.
  • Validate: Run the ETL and check ZenML step metadata for per‑domain success and MongoDB for inserted documents.

Example crawler (static HTML with LangChain):

# llm_engineering/application/crawlers/my_site.py
from urllib.parse import urlparse
from langchain_community.document_loaders import AsyncHtmlLoader
from langchain_community.document_transformers.html2text import Html2TextTransformer
from loguru import logger
from .base import BaseCrawler
from llm_engineering.domain.documents import ArticleDocument

class MySiteCrawler(BaseCrawler):
    model = ArticleDocument

    def extract(self, link: str, **kwargs) -> None:
        # Skip links that were already ingested.
        if self.model.find(link=link):
            logger.info(f"Article already exists: {link}")
            return

        # Fetch the page and convert its HTML to plain text.
        loader = AsyncHtmlLoader([link])
        docs = loader.load()
        textifier = Html2TextTransformer()
        doc = textifier.transform_documents(docs)[0]

        content = {
            "Title": doc.metadata.get("title"),
            "Subtitle": doc.metadata.get("description"),
            "Content": doc.page_content,
            "language": doc.metadata.get("language"),
        }

        # Persist a normalized ArticleDocument tied to the resolved user.
        platform = urlparse(link).netloc
        user = kwargs["user"]
        self.model(
            content=content,
            link=link,
            platform=platform,
            author_id=user.id,
            author_full_name=user.full_name,
        ).save()

Register it with the dispatcher:

# llm_engineering/application/crawlers/dispatcher.py
from .my_site import MySiteCrawler

class CrawlerDispatcher:
    ...
    def register_my_site(self) -> "CrawlerDispatcher":
        self.register("https://my-site.com", MySiteCrawler)
        return self

And include it in the chain where the dispatcher is built (e.g., steps/etl/crawl_links.py):

dispatcher = (
    CrawlerDispatcher.build()
    .register_linkedin()
    .register_medium()
    .register_github()
    .register_my_site()
)

For dynamic pages, start from BaseSeleniumCrawler, use self.driver.get(link), self.scroll_page(), and parse with BeautifulSoup.

Configuration File (example)

configs/digital_data_etl_example.yaml

  • parameters.user_full_name: Full name used to create/lookup the user document.
  • parameters.links: List of links to crawl. Supported: GitHub repos, Medium posts, generic article URLs.

Minimal example:

parameters:
  user_full_name: "Jane Doe"
  links:
    - "https://github.com/janedoe/my-repo"
    - "https://medium.com/@janedoe/my-post"
    - "https://example.com/interesting-article"

Data Model

  • UserDocument: {first_name, last_name} with computed full_name.
  • ArticleDocument: {content, link, platform, author_id, author_full_name} where content includes title/subtitle/text.
  • RepositoryDocument: {name, content, link, platform, author_*} where content is a text‑only mapping of path -> file_text.
  • PostDocument: Similar to ArticleDocument with optional image.
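A stdlib-only sketch of the shapes described above (the real models are Pydantic v2 subclasses of NoSQLBaseDocument; the field names mirror the list, everything else here is illustrative):

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class UserDocument:
    first_name: str
    last_name: str
    id: uuid.UUID = field(default_factory=uuid.uuid4)

    @property
    def full_name(self) -> str:
        # Computed from first/last name, as in the real model.
        return f"{self.first_name} {self.last_name}"

@dataclass
class ArticleDocument:
    content: dict  # title / subtitle / text extracted by a crawler
    link: str
    platform: str
    author_id: uuid.UUID
    author_full_name: str
    id: uuid.UUID = field(default_factory=uuid.uuid4)

user = UserDocument(first_name="Jane", last_name="Doe")
print(user.full_name)  # Jane Doe
```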

Operational Considerations

  • Ethics & Terms: Only crawl content you own or have rights to; respect site robots/ToS.
  • Rate limiting: Current implementation is conservative but not rate‑aware; consider backoff and robots parsing for scale.
  • Storage growth: Repository content can be large; the GitHub crawler omits binaries and common heavy formats.
  • Indexing: Add MongoDB indexes per your query patterns if you scale this.
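The storage-growth point above rests on the text-only tree idea: store only readable, trimmed text per file and drop binaries. A minimal sketch of that filtering, assuming an illustrative ignore list (the real crawler's list may differ):

```python
import os

# Extensions treated as binary/heavy -- an assumption for illustration.
IGNORED_EXTENSIONS = {".png", ".jpg", ".gif", ".pdf", ".zip", ".bin", ".ico"}

def text_only_tree(root: str) -> dict[str, str]:
    """Map relative file path -> whitespace-trimmed text, skipping binaries."""
    tree: dict[str, str] = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if os.path.splitext(name)[1].lower() in IGNORED_EXTENSIONS:
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8") as f:
                    text = f.read().strip()
            except (UnicodeDecodeError, OSError):
                continue  # unreadable content is dropped, not stored
            tree[os.path.relpath(path, root)] = text
    return tree
```

The resulting path-to-text mapping is what makes the stored repositories immediately tokenization-friendly.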

Roadmap Ideas

  • Add robust LinkedIn support via official APIs or compliant scraping strategies.
  • Build RAG and summarization pipelines on top of the stored corpus.
  • Exporters for instruction and preference datasets (hooks already exist in domain/types).
  • Scheduler/backfill and incremental crawl policies.

What I Focused On (for reviewers)

  • Clear boundaries: application vs domain vs infrastructure.
  • Minimal ODM: pragmatic MongoDB + Pydantic integration without heavy dependencies.
  • Observable pipelines: ZenML metadata per step output for quick diagnostics.
  • Testability/readability: compact, typed code with explicit responsibilities and logging.

Troubleshooting

  • Chrome not found: install Google Chrome or provide a compatible Chromium build; re‑run.
  • Mongo auth errors: verify DATABASE_HOST credentials and that the DB exists.
  • ZenML errors: run zenml init in the repo and ensure the local stack is active.

About

LLMTwin turns your public work into an LLM-ready knowledge base. It crawls GitHub, Medium, and the web, normalizes content with Pydantic, stores it in MongoDB, and orchestrates reproducible pipelines with ZenML. The result is a portfolio-quality system that ingests repos and articles into structured, queryable documents for RAG.
