Build a personal “digital twin” by crawling your public footprint (GitHub, Medium, and generic articles) and structuring it into a queryable knowledge base. The project packages the crawling and ingestion as a ZenML pipeline, persists normalized documents in MongoDB via Pydantic, and exposes a simple CLI to run end‑to‑end ETL.
This repository is designed as a production‑lean ETL for LLM applications: ingest your data once, then use it for RAG, analytics, or fine‑tuning datasets.
- Crawl public sources: GitHub repositories, Medium articles, and arbitrary article links. A LinkedIn crawler is scaffolded but currently deprecated due to site changes.
- Normalize into documents: Unified Pydantic models for `Article`, `Post`, and `Repository`, associated with a `User`.
- Store in MongoDB: Clean insertion with UUIDs, bulk inserts, and simple find APIs.
- Orchestrate with ZenML: A compact ETL pipeline with steps for user resolution and link crawling, including step metadata for traceability.
- Run via CLI: One command to execute the pipeline from a YAML config.
- Production‑minded data plumbing: Clean separation of concerns (application/domain/infrastructure), Pydantic models, and a minimal ODM for MongoDB.
- Composable crawling: Dispatcher pattern routes each link to the right crawler; Selenium and LangChain are used pragmatically where they add value.
- Pipeline discipline: ZenML pipeline and step metadata enable reproducible runs and observability.
- LLM‑ready outputs: Persisted, normalized data can directly power RAG over a candidate’s work, portfolio summaries, or dataset creation for alignment.
- ZenML for orchestration: Prefer pipelines over ad‑hoc scripts for clear DAGs, caching, and reproducibility. This makes runs traceable, parameterized via YAML, and easy to evolve into scheduled jobs or to swap stacks (local → cloud) without code changes.
- MongoDB + Pydantic (minimal ODM): Content varies wildly (articles, posts, repositories). A schemaless document store fits naturally while Pydantic enforces types at the boundary. The custom ODM keeps dependencies light yet provides UUID handling, bulk inserts, and conversion helpers, making models portable and testable.
- Dispatcher + crawler abstraction: Each site has unique structure/rate limits. The dispatcher decouples URL routing from extraction, enabling drop‑in crawlers for new domains and safe deprecation (e.g., LinkedIn) without touching pipeline code.
- Selenium vs async HTML loaders: Dynamic sites (Medium, LinkedIn historically) need a browser; static pages benefit from fast async fetch + HTML→text via LangChain transformers. This hybrid keeps performance and reliability balanced.
- Headless Chrome with autoinstaller: Minimizes setup friction across machines and CI, while still allowing deterministic scrapes with consistent rendering.
- Text‑only GitHub tree: Storing normalized, whitespace‑trimmed text per file reduces storage, avoids binaries, and produces content that is immediately tokenization‑friendly for RAG and summarization.
- Step metadata (observability): Per‑domain success counts and query metadata are attached to pipeline outputs, aiding quick diagnostics, alerts, or CI gating (“fail if <90% domains succeed”).
- Secrets via ZenML and `.env` fallback: Secure, environment‑agnostic configuration. Local development uses `.env`; production can pull from the ZenML Secret Store without code changes.
- Deprecating brittle scrapes: LinkedIn methods explicitly raise `DeprecationWarning`. This makes compliance and maintenance intent obvious and reduces "silent failure" risk while keeping the scaffold for future API‑based replacements.
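The CI-gating idea behind the step metadata can be sketched in plain Python. The helper names below are hypothetical; in the actual pipeline, per-domain counts are attached as ZenML step metadata rather than computed by standalone functions like these:

```python
# Hypothetical sketch: compute per-domain success rates from crawl results
# and decide whether a run should pass a CI gate. The real project attaches
# these counts as ZenML step metadata; names here are illustrative only.
from collections import defaultdict
from urllib.parse import urlparse


def domain_success_rates(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Map each domain to its fraction of successfully crawled links."""
    totals: dict[str, int] = defaultdict(int)
    successes: dict[str, int] = defaultdict(int)
    for link, ok in results:
        domain = urlparse(link).netloc
        totals[domain] += 1
        if ok:
            successes[domain] += 1
    return {domain: successes[domain] / totals[domain] for domain in totals}


def passes_gate(rates: dict[str, float], threshold: float = 0.9) -> bool:
    """Fail the run if any domain's success rate drops below the threshold."""
    return all(rate >= threshold for rate in rates.values())
```

A gate like `passes_gate(rates, threshold=0.9)` implements the "fail if <90% of domains succeed" policy mentioned above.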
- Personal RAG knowledge base: Index your repos and writings to power Q&A, interview prep, and on‑the‑fly code/content explanations.
- Portfolio summarization: Generate highlights across GitHub and articles (topics, tech stack, impact) for resumes and personal sites.
- Job matching and cover letters: Feed job descriptions + your corpus into a RAG pipeline to tailor cover letters or talk tracks with concrete evidence links.
- Content analytics: Track themes over time, detect expertise areas, and aggregate performance/coverage across platforms.
- Dataset creation for alignment: Export structured samples to instruction or preference datasets (hooks via `domain/types`) to fine‑tune small models.
- Pipeline (`pipelines/digital_data_etl.py`):
  - `get_or_create_user`: Resolves a `User` by name, creating it if missing.
  - `crawl_links`: Iterates links, selects a crawler, extracts and saves documents; logs per‑domain success counts as step metadata.
- Crawlers (`llm_engineering/application/crawlers/…`):
  - `GithubCrawler`: Clones a repo to a temp dir and captures a text‑only tree of files (ignoring large/binary formats) into a `RepositoryDocument`.
  - `MediumCrawler`: Headless Selenium + BeautifulSoup to extract title/subtitle/content into an `ArticleDocument`.
  - `CustomArticleCrawler`: Uses the LangChain async HTML loader + HTML→text transformer for generic articles.
  - `LinkedInCrawler`: Scaffolded; `login`/`extract` currently raise `DeprecationWarning` due to site changes.
  - `CrawlerDispatcher`: Regex‑based domain matching to select the appropriate crawler; defaults to `CustomArticleCrawler`.
- Domain (`llm_engineering/domain`):
  - `NoSQLBaseDocument`: Minimal ODM to/from MongoDB (UUID handling, bulk insert, `find`/`get_or_create`) built on Pydantic v2.
  - `ArticleDocument`, `PostDocument`, `RepositoryDocument`, `UserDocument`: Typed, serializable models.
- Infrastructure (`llm_engineering/infrastructure/db/mongo.py`):
  - Singleton connector to MongoDB; parameters loaded from `.env` or ZenML Secrets.
- Settings (`llm_engineering/settings.py`):
  - Loads from the ZenML secret store if available, else falls back to `.env`.
- CLI (`tools/run.py`):
  - Click‑based entrypoint to run the ETL pipeline with caching and config options.
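The ODM's UUID round trip is the core of `NoSQLBaseDocument`: MongoDB stores the primary key as `_id`, while the model exposes it as a UUID `id` field. A dependency-free sketch of that conversion (the real implementation is built on Pydantic v2 and pymongo; a plain dataclass is used here only for illustration):

```python
# Illustrative sketch of the ODM's id handling: MongoDB stores the document
# key under "_id", while the model exposes it as a UUID "id" field. The
# real NoSQLBaseDocument uses Pydantic v2 + pymongo; this dataclass version
# keeps the sketch dependency-free.
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class SketchDocument:
    content: str
    id: uuid.UUID = field(default_factory=uuid.uuid4)

    def to_mongo(self) -> dict:
        """Rename 'id' to Mongo's '_id' and serialize the UUID as a string."""
        data = asdict(self)
        data["_id"] = str(data.pop("id"))
        return data

    @classmethod
    def from_mongo(cls, data: dict) -> "SketchDocument":
        """Inverse conversion: parse '_id' back into a UUID 'id' field."""
        data = dict(data)
        data["id"] = uuid.UUID(data.pop("_id"))
        return cls(**data)
```

With this shape, bulk inserts reduce to `collection.insert_many([d.to_mongo() for d in docs])`, and query results rehydrate via `from_mongo`.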
- Python 3.10+
- ZenML (pipelines and secrets)
- MongoDB + PyMongo
- Pydantic v2 + pydantic‑settings
- Selenium + chromedriver‑autoinstaller + BeautifulSoup4
- LangChain Community (async HTML loader, html2text transformer)
- Click, Loguru, TQDM
- Clone and create a virtual environment:

  ```bash
  python -m venv .venv && source .venv/bin/activate
  pip install -r requirements.txt
  ```
- Provide configuration
  - Copy `.env.example` to `.env` and fill in values, or create a ZenML secret named `settings` with the same keys.
  - Required variables:
    - `DATABASE_HOST`: Mongo connection string (e.g., `mongodb://user:pass@127.0.0.1:27017`).
    - `DATABASE_NAME`: Target database name (default `twin`).
  - Optional variables:
    - `LINKEDIN_USERNAME`, `LINKEDIN_PASSWORD`: Not required (the LinkedIn crawler is deprecated).
  - Ensure MongoDB is reachable and the credentials can write to `DATABASE_NAME`.
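A minimal `.env` might look like this (values are placeholders; the keys mirror the required variables above):

```
DATABASE_HOST=mongodb://user:pass@127.0.0.1:27017
DATABASE_NAME=twin
```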
- Initialize ZenML (a local stack is fine):

  ```bash
  zenml init
  ```
- Prepare a pipeline config
  - See `configs/digital_data_etl_example.yaml` and copy/adjust as needed.
- Run the ETL pipeline:

  ```bash
  python tools/run.py --run-etl --etl-config-filename digital_data_etl_example.yaml
  ```

  - Add `--no-cache` to disable ZenML caching if you want a fresh run.
Notes:
- The Selenium crawlers run headless Chrome. Ensure Chrome is installed on your system; `chromedriver-autoinstaller` pins a matching driver automatically.
- LinkedIn scraping is intentionally disabled and raises `DeprecationWarning`.
- Pick a model: Reuse `ArticleDocument`, `PostDocument`, or `RepositoryDocument`. If the new source doesn't fit, add a new subclass in `llm_engineering/domain/documents.py` and a `DataCategory` in `llm_engineering/domain/types.py` with the collection name.
- Implement a crawler: Create `llm_engineering/application/crawlers/<site>.py` that subclasses `BaseCrawler` (static pages) or `BaseSeleniumCrawler` (dynamic pages). Implement `extract(self, link, **kwargs)` and persist via the chosen model.
- Register with the dispatcher: Add a registration in `CrawlerDispatcher` so URLs route to your crawler. You can chain this beside the existing `.register_medium()` and `.register_github()` calls.
- Update config: Add your URLs to `configs/<your_config>.yaml` under `parameters.links`. No pipeline code changes needed.
- Validate: Run the ETL and check ZenML step metadata for per‑domain success and MongoDB for inserted documents.
Example crawler (static HTML with LangChain):

```python
# llm_engineering/application/crawlers/my_site.py
from urllib.parse import urlparse

from langchain_community.document_loaders import AsyncHtmlLoader
from langchain_community.document_transformers.html2text import Html2TextTransformer
from loguru import logger

from .base import BaseCrawler
from llm_engineering.domain.documents import ArticleDocument


class MySiteCrawler(BaseCrawler):
    model = ArticleDocument

    def extract(self, link: str, **kwargs) -> None:
        # Skip links that are already persisted.
        if self.model.find(link=link):
            logger.info(f"Article already exists: {link}")
            return

        # Fetch the page and convert the HTML to plain text.
        loader = AsyncHtmlLoader([link])
        docs = loader.load()
        textifier = Html2TextTransformer()
        doc = textifier.transform_documents(docs)[0]

        content = {
            "Title": doc.metadata.get("title"),
            "Subtitle": doc.metadata.get("description"),
            "Content": doc.page_content,
            "language": doc.metadata.get("language"),
        }

        # Persist the normalized article, attributed to the pipeline's user.
        platform = urlparse(link).netloc
        user = kwargs["user"]
        self.model(
            content=content,
            link=link,
            platform=platform,
            author_id=user.id,
            author_full_name=user.full_name,
        ).save()
```

Register it with the dispatcher:
```python
# llm_engineering/application/crawlers/dispatcher.py
from .my_site import MySiteCrawler


class CrawlerDispatcher:
    ...

    def register_my_site(self) -> "CrawlerDispatcher":
        self.register("https://my-site.com", MySiteCrawler)
        return self
```

And include it in the chain where the dispatcher is built (e.g., `steps/etl/crawl_links.py`):
```python
dispatcher = (
    CrawlerDispatcher.build()
    .register_linkedin()
    .register_medium()
    .register_github()
    .register_my_site()
)
```

For dynamic pages, start from `BaseSeleniumCrawler`, use `self.driver.get(link)` and `self.scroll_page()`, and parse with BeautifulSoup.
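Under the hood, the routing amounts to regex-based domain matching. A simplified, self-contained sketch (the real `CrawlerDispatcher` has the same shape, but the class and pattern details here are illustrative):

```python
# Simplified sketch of regex-based URL routing: each registered domain maps
# to a crawler, and unmatched links fall back to a generic article crawler.
# The real CrawlerDispatcher is similar in spirit; names here are illustrative.
import re
from urllib.parse import urlparse


class MiniDispatcher:
    def __init__(self) -> None:
        self._routes: dict[re.Pattern, str] = {}

    def register(self, domain_url: str, crawler_name: str) -> "MiniDispatcher":
        # Match the registered domain with or without a "www." prefix.
        domain = urlparse(domain_url).netloc
        pattern = re.compile(rf"^(www\.)?{re.escape(domain)}$")
        self._routes[pattern] = crawler_name
        return self

    def get(self, link: str, default: str = "CustomArticleCrawler") -> str:
        netloc = urlparse(link).netloc
        for pattern, crawler in self._routes.items():
            if pattern.match(netloc):
                return crawler
        return default  # fall back to the generic article crawler
```

This is why adding a new site never touches pipeline code: registration only extends the routing table.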
`configs/digital_data_etl_example.yaml`

- `parameters.user_full_name`: Full name used to create/look up the user document.
- `parameters.links`: List of links to crawl. Supported: GitHub repos, Medium posts, generic article URLs.
Minimal example:

```yaml
parameters:
  user_full_name: "Jane Doe"
  links:
    - "https://github.com/janedoe/my-repo"
    - "https://medium.com/@janedoe/my-post"
    - "https://example.com/interesting-article"
```

- `UserDocument`: `{first_name, last_name}` with a computed `full_name`.
- `ArticleDocument`: `{content, link, platform, author_id, author_full_name}` where `content` includes title/subtitle/text.
- `RepositoryDocument`: `{name, content, link, platform, author_*}` where `content` is a text‑only mapping of `path -> file_text`.
- `PostDocument`: Similar to `ArticleDocument` with an optional `image`.
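Because `RepositoryDocument.content` is already a text-only `path -> file_text` mapping, a downstream RAG step can flatten it into labeled chunks with a few lines. A hypothetical sketch (chunking is not part of this repo; the function name and fixed-size windowing are illustrative):

```python
# Hypothetical downstream helper: flatten a RepositoryDocument's text-only
# content mapping (path -> file_text) into labeled chunks that a RAG
# indexer could embed. The fixed-size windowing is illustrative only.
def repo_to_chunks(content: dict[str, str], max_chars: int = 1000) -> list[dict]:
    chunks = []
    for path, text in content.items():
        # Split each file's text into fixed-size windows, tagging every
        # chunk with its source path so RAG answers can cite the file.
        for start in range(0, len(text), max_chars):
            chunks.append({"path": path, "text": text[start:start + max_chars]})
    return chunks
```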
- Ethics & Terms: Only crawl content you own or have rights to; respect site robots/ToS.
- Rate limiting: Current implementation is conservative but not rate‑aware; consider backoff and robots parsing for scale.
- Storage growth: Repository content can be large; the GitHub crawler omits binaries and common heavy formats.
- Indexing: Add MongoDB indexes per your query patterns if you scale this.
- Add robust LinkedIn support via official APIs or compliant scraping strategies.
- Build RAG and summarization pipelines on top of the stored corpus.
- Exporters for instruction and preference datasets (hooks already exist in `domain/types`).
- Scheduler/backfill and incremental crawl policies.
- Clear boundaries: application vs domain vs infrastructure.
- Minimal ODM: pragmatic MongoDB + Pydantic integration without heavy dependencies.
- Observable pipelines: ZenML metadata per step output for quick diagnostics.
- Testability/readability: compact, typed code with explicit responsibilities and logging.
- Chrome not found: install Google Chrome or provide a compatible Chromium build; re‑run.
- Mongo auth errors: verify the `DATABASE_HOST` credentials and that the DB exists.
- ZenML errors: run `zenml init` in the repo and ensure the local stack is active.