Parsimony is the kernel of a connector ecosystem for financial and economic data. It gives you two things:
- A connector model. A connector is an async Python function plus metadata. You fetch by awaiting it (await conn(series_id="GDP")); the framework wraps the raw DataFrame your function returns into a typed Result/TabularResult with automatic provenance. Operational failures surface through a small, agent-facing error taxonomy (UnauthorizedError, RateLimitError, ProviderError, …) instead of raw HTTP exceptions.
- A hybrid-search catalog. When you need to discover what data exists (searching across thousands of series codes, titles, and descriptions), Parsimony ships a portable Catalog that combines BM25 keyword indexes and FAISS vector indexes, fused into a single ranked result and snapshot-able to disk or a Hugging Face dataset.
The kernel ships zero connectors in-tree. Each connector (e.g. parsimony-fred, parsimony-sdmx) is a separate parsimony-<name> distribution, discovered at runtime through the parsimony.providers entry-point group. import parsimony stays cheap: the heavy catalog stack (torch, FAISS, sentence-transformers) is an optional install and loads lazily on first use.
- Connectors are just async functions. The function's own parameters are the connector's call surface, so there is no separate params schema to wire up.
- Typed, provenance-tagged results. Return a raw pandas DataFrame; the framework builds the Result/TabularResult and a Provenance record (source, description, UTC fetch time, call params).
- Declarative output schemas. OutputConfig + Column + ColumnRole (DATA/KEY/TITLE/METADATA) shape results and drive catalog-entity extraction.
- Agent-facing error taxonomy. A single ConnectorError base with subclasses whose default messages embed retry directives, built for autonomous agent loops, not just humans.
- Credential injection by composition. bind(api_key=...) fixes a parameter, removes it from the call surface, and keeps it out of provenance.
- HTTP transport helpers. HttpClient, fetch_json, and map_http_error translate httpx errors into the typed taxonomy, with secret redaction in logs and transient retry built in.
- Plugin discovery + conformance. Plugins register under parsimony.providers; parsimony list enumerates them and --strict runs a conformance suite.
- Hybrid search. BM25 + FAISS vector indexes fused with Z-score / min-max / RRF rankers, with adaptive FAISS index selection by row count.
- Swappable embedders. Local PyTorch, faster ONNX (int8), or hosted (litellm), each behind its own optional extra.
- Lean default install. The mandatory footprint is pydantic, pandas, pyarrow, httpx, platformdirs. No torch, no FAISS unless you ask for them.
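
Of the fusion rankers named in the feature list, reciprocal rank fusion (RRF) is the easiest to show in isolation. The sketch below is a generic, standard-library illustration of the technique, not Parsimony's ranker: the function name `rrf_fuse`, the constant `k=60`, and the sample rankings are all assumptions made for the example.

```python
def rrf_fuse(rankings: dict[str, list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists with reciprocal rank fusion.

    Each list contributes 1 / (k + rank) for every item it contains,
    with rank starting at 1. Items missing from a list get nothing from it.
    """
    scores: dict[str, float] = {}
    for ranked in rankings.values():
        for rank, code in enumerate(ranked, start=1):
            scores[code] = scores.get(code, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical per-index rankings for one query.
fused = rrf_fuse(
    {
        "bm25": ["GDPC1", "UNRATE"],
        "vector": ["CPIAUCSL", "GDPC1"],
    }
)
print([code for code, _ in fused])
```

An item ranked by both indexes accumulates two contributions, which is why it can outrank an item that tops only one list.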
pip install parsimony-core # kernel: connectors, results, errors, transport
pip install 'parsimony-core[standard]' # + the hybrid-search Catalog
pip install parsimony-fred parsimony-sdmx # individual connectors (each its own distribution)
The default install pulls only the lean kernel deps. The Catalog and its embedders are opt-in:
| Extra | Adds | Unlocks |
|---|---|---|
| standard | faiss-cpu, rank-bm25, sentence-transformers, huggingface_hub | Catalog, BM25Index, VectorIndex, HybridIndex, the default local embedder, and the hf:// snapshot loader |
| standard-onnx | standard + optimum[onnxruntime], onnxruntime | OnnxEmbedder: 2–3× faster CPU inference via int8 quantization, ~4× smaller on disk |
| litellm | litellm | LiteLLMEmbeddingProvider: hosted embeddings (OpenAI, Gemini, Cohere, Voyage, Bedrock) |
| s3 | s3fs | Installs the dependency for s3:// catalog URLs. Note: the s3:// source handler is still a stub, so adding this extra does not yet enable the scheme. |
| all | standard + standard-onnx + litellm + s3 | Everything |
| dev | pytest, ruff, mypy, pip-audit (+ standard, litellm) | The full test/lint toolchain |
Requires Python 3.11+.
Connectors are separate distributions. With parsimony-fred installed and a free FRED API key:
import asyncio
import os

from parsimony_fred import fred_fetch, fred_search

async def main() -> None:
    api_key = os.environ["FRED_API_KEY"]
    # bind() fixes api_key and removes it from the call surface. The bound
    # value is NOT recorded in provenance.params.
    search = fred_search.bind(api_key=api_key)
    fetch = fred_fetch.bind(api_key=api_key)
    search_result = await search(search_text="US gross domestic product")
    print(search_result.df[["id", "title"]].head())
    result = await fetch(series_id="GDP")  # -> TabularResult
    print(result.df[["date", "value"]].tail())
    print(result.provenance.source)  # 'fred_fetch'

asyncio.run(main())
A connector is an async function decorated with @connector. Return raw data; the framework builds the typed envelope.
import asyncio

import pandas as pd
from parsimony import Column, ColumnRole, OutputConfig, connector

CUSTOM_OUTPUT = OutputConfig(
    columns=[
        Column(name="code", role=ColumnRole.KEY, namespace="my_source"),
        Column(name="label", role=ColumnRole.TITLE),
        Column(name="score", role=ColumnRole.DATA, dtype="numeric"),
    ]
)

@connector(output=CUSTOM_OUTPUT, tags=["custom"])
async def my_data_source(category: str) -> pd.DataFrame:
    """Return sample rows for a category (replace with a real HTTP call)."""
    return pd.DataFrame(
        {
            "code": ["A1", "A2", "A3"],
            "label": [f"{category} - Alpha", f"{category} - Beta", f"{category} - Gamma"],
            "score": [0.95, 0.87, 0.73],
        }
    )

# The framework wraps the raw DataFrame into a TabularResult with provenance.
result = asyncio.run(my_data_source(category="widgets"))
print(result.df)
print(result.provenance.source)  # 'my_data_source'
Connectors must be async and must have a description (docstring or description=). They must return raw data: returning a Result, TabularResult, or (data, properties) tuple raises TypeError. Provider facts belong in DataFrame columns, never in provenance.properties (which is framework-only).
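
To make the "return raw data, get an envelope back" contract concrete, here is a standard-library sketch of a decorator that enforces the rules stated above and attaches a provenance record. It is an illustration only, not Parsimony's implementation: `sketch_connector`, `SketchResult`, and `SketchProvenance` are hypothetical names, and raising ValueError for a missing docstring is an assumption.

```python
import asyncio
import functools
import inspect
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any


@dataclass(frozen=True)
class SketchProvenance:  # hypothetical stand-in for a provenance record
    source: str
    description: str
    fetched_at: datetime
    params: dict[str, Any]


@dataclass(frozen=True)
class SketchResult:  # hypothetical stand-in for the result envelope
    data: Any
    provenance: SketchProvenance


def sketch_connector(func):
    """Wrap an async function so its raw return value gains provenance."""
    if not inspect.iscoroutinefunction(func):
        raise TypeError("connector functions must be async")
    description = inspect.getdoc(func)
    if not description:
        raise ValueError("connector functions need a description")

    @functools.wraps(func)
    async def wrapper(**params: Any) -> SketchResult:
        raw = await func(**params)
        if isinstance(raw, SketchResult):
            raise TypeError("return raw data, not a result envelope")
        provenance = SketchProvenance(
            source=func.__name__,
            description=description,
            fetched_at=datetime.now(timezone.utc),  # UTC fetch time
            params=dict(params),
        )
        return SketchResult(data=raw, provenance=provenance)

    return wrapper


@sketch_connector
async def sample_rows(category: str) -> list[dict[str, str]]:
    """Return sample rows for a category."""
    return [{"code": "A1", "label": f"{category} - Alpha"}]


sketch_result = asyncio.run(sample_rows(category="widgets"))
print(sketch_result.provenance.source, sketch_result.provenance.params)
```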
Connectors is an immutable, composable collection. Combine bundles with + and scope a credential across only the connectors that accept it:
import asyncio
import os

from parsimony import Connectors
from parsimony_fred import CONNECTORS as FRED
from parsimony_sdmx import CONNECTORS as SDMX

async def main() -> None:
    api_key = os.environ["FRED_API_KEY"]
    # Combine bundles with the + operator. Connectors.bind scopes api_key only to
    # connectors that actually accept it (FRED), leaving SDMX untouched.
    bundle = (FRED + SDMX).bind(api_key=api_key)
    print(bundle.names())
    # await is only valid inside a coroutine, so the fetches live in main().
    gdp = await bundle["fred_fetch"](series_id="GDP")
    fx = await bundle["sdmx_fetch"](dataset_key="ECB-EXR", series_key="D.USD.EUR.SP00.A")

asyncio.run(main())
The transport layer maps httpx errors (401/402/429/5xx/timeout) into the typed parsimony.errors taxonomy and redacts secrets in logs:
import pandas as pd
from parsimony import Column, ColumnRole, OutputConfig, connector
from parsimony.transport.helpers import fetch_json, make_api_key_client

OUT = OutputConfig(
    columns=[
        Column(name="date", role=ColumnRole.KEY, namespace="acme"),
        Column(name="value", role=ColumnRole.DATA, dtype="numeric"),
    ]
)

@connector(output=OUT, secrets=("api_key",))
async def acme_fetch(series_id: str, api_key: str) -> pd.DataFrame:
    """Fetch an ACME time series by id."""
    http = make_api_key_client("https://api.acme.test", api_key=api_key)
    payload = await fetch_json(http, path=f"series/{series_id}", provider="acme", op_name="series")
    return pd.DataFrame(payload["observations"])
secrets=("api_key",) strips that parameter from provenance even when passed at call time, and validates the name is a real parameter at decoration.
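
The idea of fixing a credential and removing it from the call surface can be sketched with the standard library alone. This is a rough illustration under stated assumptions, not how Parsimony implements bind(): `sketch_bind` and `demo_fetch` are hypothetical names, and the check that a bound name is a real parameter simply mirrors the decoration-time validation described above.

```python
import asyncio
import inspect
from typing import Any, Awaitable, Callable


def sketch_bind(func: Callable[..., Awaitable[Any]], **fixed: Any) -> Callable[..., Awaitable[Any]]:
    """Return an async callable with `fixed` parameters pre-filled and hidden."""
    signature = inspect.signature(func)
    unknown = set(fixed) - set(signature.parameters)
    if unknown:
        raise TypeError(f"not parameters of {func.__name__}: {sorted(unknown)}")

    async def bound(**kwargs: Any) -> Any:
        clash = set(kwargs) & set(fixed)
        if clash:
            raise TypeError(f"already bound: {sorted(clash)}")
        return await func(**fixed, **kwargs)

    # Advertise only the parameters the caller may still pass.
    remaining = [p for name, p in signature.parameters.items() if name not in fixed]
    bound.__signature__ = signature.replace(parameters=remaining)
    return bound


async def demo_fetch(series_id: str, api_key: str) -> dict[str, Any]:
    """Hypothetical fetch that only reports whether a key was supplied."""
    return {"series_id": series_id, "authorized": bool(api_key)}


bound_fetch = sketch_bind(demo_fetch, api_key="not-a-real-key")
exposed_params = list(inspect.signature(bound_fetch).parameters)
bound_result = asyncio.run(bound_fetch(series_id="GDP"))
print(exposed_params, bound_result)
```

Because the bound value never appears in the call's keyword arguments, anything that records call parameters downstream has no opportunity to capture it.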
When you need to discover which series exist, build a Catalog over Entity rows. Each Entity is identified by (namespace, code) and carries a title plus free-form metadata. Field indexes (BM25Index, VectorIndex, HybridIndex) are keyed by a logical search surface; queries are either broad plain text (routed to the default field) or structured field: value clauses.
The Catalog stack requires pip install 'parsimony-core[standard]'. Importing Catalog/BM25Index/VectorIndex from parsimony always works (lazy PEP 562), but build() raises ImportError on first use without the extra.
import asyncio

from parsimony import BM25Index, Catalog, Entity, HybridIndex, VectorIndex
from parsimony.ranking import ZScoreFusion

async def main() -> None:
    entries = [
        Entity(namespace="fred", code="GDPC1", title="Real Gross Domestic Product",
               metadata={"description": "Inflation-adjusted US output and real growth."}),
        Entity(namespace="fred", code="UNRATE", title="Unemployment Rate",
               metadata={"description": "Monthly civilian unemployment rate."}),
    ]
    catalog = Catalog(
        "macro",
        indexes={
            "code": BM25Index(),
            "title": HybridIndex(
                components=[BM25Index(), VectorIndex()],  # VectorIndex() defaults to all-MiniLM-L6-v2
                fusion=ZScoreFusion(weights={"bm25": 0.5, "vector": 1.0}),
            ),
        },
        default_field="title",
    )
    catalog.set_entities(entries)
    await catalog.build()  # MUST build before search/save
    hits, diag = await catalog.search("inflation adjusted output", limit=5)  # broad
    print(diag.mode, [(h.code, round(h.score, 3)) for h in hits])
    hits2, _ = await catalog.search("code: UNRATE", limit=1)  # structured, exact match
    print(hits2[0].title)
    await catalog.save("file:///tmp/macro-catalog", builder="readme-example")
    reloaded = await Catalog.load("file:///tmp/macro-catalog")
    print(len(reloaded))

asyncio.run(main())
A few important details, grounded in the code:
- The catalog API is async. build, search, save, load, get, delete_many, and embedder methods are all coroutines. You must await catalog.build() after construction and after any set_entities/set_indexes/delete_many; search() and save() raise ValueError until rebuilt.
- search(query, limit, *, namespaces=None): limit is positional and required.
- Default index policy. Catalog(name, indexes=None) auto-creates BM25 indexes for code, title, and every metadata key at build() time. Pass an explicit indexes dict for full control.
- Exact value matches win. A case-insensitive exact value match short-circuits to a sentinel score that dominates fuzzy BM25/cosine scores, which is ideal for code lookups.
- Portable, integrity-checked snapshots. A saved catalog is a directory of Parquet (zstd) files plus meta.json; Catalog.load recomputes a content SHA-256 over every file and rejects a mismatch. Only file:// (or a bare path) and hf:// (Hugging Face dataset) schemes are wired in.
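
The integrity check on snapshots can be pictured as hashing the directory contents and comparing against a stored digest. The sketch below is a minimal illustration with hashlib, not Parsimony's on-disk format: the `content_sha256` key, the choice to hash relative paths along with bytes, and skipping meta.json itself are all assumptions.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def content_digest(snapshot_dir: Path, skip: str = "meta.json") -> str:
    """SHA-256 over every file's relative path and bytes, in sorted order."""
    digest = hashlib.sha256()
    files = sorted(p for p in snapshot_dir.rglob("*") if p.is_file() and p.name != skip)
    for path in files:
        digest.update(path.relative_to(snapshot_dir).as_posix().encode())
        digest.update(b"\0")  # separator so path and content cannot blur together
        digest.update(path.read_bytes())
    return digest.hexdigest()


def verify_snapshot(snapshot_dir: Path) -> bool:
    """Compare the recomputed digest with the one stored in meta.json."""
    meta = json.loads((snapshot_dir / "meta.json").read_text())
    return meta["content_sha256"] == content_digest(snapshot_dir)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "entities.parquet").write_bytes(b"placeholder bytes, not real Parquet")
    (root / "meta.json").write_text(json.dumps({"content_sha256": content_digest(root)}))
    intact = verify_snapshot(root)
    (root / "entities.parquet").write_bytes(b"tampered")
    tampered = verify_snapshot(root)

print(intact, tampered)
```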
OutputConfig.build_entities(df) projects a DataFrame into Entity rows using column roles — the single KEY column (which must declare a namespace) becomes the code, TITLE becomes the title, and METADATA columns become metadata. This is how an enumerator connector feeds the catalog.
from parsimony import BM25Index, HybridIndex, VectorIndex
from parsimony.embedder import LiteLLMEmbeddingProvider, OnnxEmbedder

# Faster CPU path (requires parsimony-core[standard-onnx])
onnx_title = HybridIndex(components=[BM25Index(), VectorIndex(embedder=OnnxEmbedder())])
# Hosted embeddings (requires parsimony-core[litellm]); you supply model + dim
hosted = LiteLLMEmbeddingProvider(model="text-embedding-3-small", dimension=1536)
hosted_title = HybridIndex(components=[BM25Index(), VectorIndex(embedder=hosted)])
Any object satisfying the EmbeddingProvider protocol works. An embedder's identity (model, dim, normalize) is persisted in a snapshot; VectorIndex.load rejects a mismatched embedder, so rebuild if you change models.
A Connector is a frozen dataclass wrapping an async function plus metadata. Await it to fetch.
- await conn(**kwargs) → Result (raw __call__; call_raw(**kwargs) returns the unwrapped function output).
- conn.bind(**kwargs) → a new connector with parameters fixed and removed from exposed_signature.
- conn.with_callback(cb) → adds a post-fetch observer (exceptions are logged and swallowed, never propagated).
- conn.describe() / conn.to_llm() → human- and LLM-readable cards.
Connectors([...]) is the immutable collection: + to combine (rejects duplicate names), bundle[name] / bundle.get(name) to index, names(), filter(pred), search(query, *, tags=None, **properties), bind(**kwargs), with_callback(cb), describe(), to_llm(). There is no merge classmethod — use +.
| Decorator | Purpose | Output contract |
|---|---|---|
| @connector | General-purpose fetch | output optional |
| @enumerator(output=...) | Entity/series discovery | exactly one KEY (with namespace), at least one TITLE, no DATA; function must annotate a pd.DataFrame return; returned columns strictly validated |
| @loader(output=...) | Observation-data fetch | exactly one KEY (with namespace), at least one DATA, no TITLE/METADATA |
Result(data, provenance) is the opaque envelope; TabularResult adds .df and an optional output_schema, and round-trips through Arrow/Parquet (to_arrow/from_arrow/to_parquet/from_parquet) embedding provenance and the column schema in table metadata. Provenance is framework-only — connectors never construct it.
All operational failures derive from ConnectorError (carries .provider). Default messages embed agent-loop directives (e.g. "DO NOT retry"). Programmer errors stay as TypeError/ValueError/ValidationError.
| Error | Maps to | Notable fields |
|---|---|---|
| UnauthorizedError | 401 / 403 | env_var hint |
| PaymentRequiredError | 402 / plan restriction | none |
| RateLimitError | 429 | retry_after (rejects values > 86400), quota_exhausted |
| ProviderError | 5xx / 4xx / timeout | status_code (408 for timeouts) |
| EmptyDataError | 200, no rows | query_params |
| ParseError | 200, unparseable | none |
| InvalidParameterError | invalid call-time args | none |
| CatalogNotFoundError | missing catalog bundle | none |
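
The status-code rows of the table can be read as a small dispatch function. The sketch below only restates that mapping with plain strings; it is not the library's map_http_error, whose signature is not shown here. `classify_http_status` and `validate_retry_after` are hypothetical helpers, and treating the remaining 4xx codes as ProviderError follows the table's "5xx / 4xx" row.

```python
MAX_RETRY_AFTER_SECONDS = 86400  # one day, the ceiling quoted in the table


def classify_http_status(status_code: int) -> str:
    """Name the taxonomy error that the table associates with an HTTP status."""
    if status_code in (401, 403):
        return "UnauthorizedError"
    if status_code == 402:
        return "PaymentRequiredError"
    if status_code == 429:
        return "RateLimitError"
    if 400 <= status_code < 600:
        return "ProviderError"  # remaining 4xx and all 5xx
    raise ValueError(f"{status_code} is not an error status")


def validate_retry_after(seconds: float) -> float:
    """Reject retry hints longer than a day, as the RateLimitError row describes."""
    if seconds > MAX_RETRY_AFTER_SECONDS:
        raise ValueError(f"retry_after {seconds} exceeds {MAX_RETRY_AFTER_SECONDS}")
    return seconds


for code in (401, 402, 429, 503, 408):
    print(code, classify_http_status(code))
```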
A connector plugin is a parsimony-<name> distribution that:
- Exports a module-level CONNECTORS (a Connectors instance).
- Registers under the parsimony.providers entry-point group.
# my_plugin/__init__.py
from parsimony import Connectors
from .connectors import acme_fetch, enumerate_demo

CONNECTORS = Connectors([acme_fetch, enumerate_demo])

# pyproject.toml of the parsimony-acme distribution
[project.entry-points.'parsimony.providers']
acme = "my_plugin"
Consumers discover plugins via parsimony.discover:
from parsimony import discover

providers = list(discover.iter_providers())  # metadata only, no imports
bundle = discover.load("acme")  # strict: raises LookupError if not installed
everything = discover.load_all()  # forgiving: logs and skips broken plugins
iter_providers() raises RuntimeError if two installed distributions register the same provider name.
parsimony.testing gives plugin authors a conformance suite. Subclass ProviderTestSuite in a pytest file to inherit conformance plus an installation check, or call assert_plugin_valid(module) procedurally:
# tests/test_conformance.py in the plugin repo
from parsimony.testing import ProviderTestSuite
import my_plugin

class TestMyPlugin(ProviderTestSuite):
    module = my_plugin
    entry_point_name = "acme"  # also verifies registration under parsimony.providers
The same checks run from the shell via parsimony list --strict.
The package installs a parsimony console script (parsimony = parsimony.cli:main) with two verbs:
# Enumerate installed plugins (name, version, connector count)
parsimony list
parsimony list --json
parsimony list --strict # import each plugin, run conformance; non-zero exit on failure
# Inspect or clear the global cache (subdirs: catalogs, models, connectors, staging)
parsimony cache path
parsimony cache info [--json]
parsimony cache clear [--subdir NAME] [--yes]
The cache root resolves through PARSIMONY_CACHE_DIR, defaulting to platformdirs.user_cache_dir("parsimony"). Hugging Face catalog snapshots land under the catalogs subdir; ONNX models under models; connector-owned scratch under connectors; and per-provider catalog build staging (staging_dir(provider)) under staging.
| Env var | Purpose |
|---|---|
| PARSIMONY_CACHE_DIR | Override the cache root (must point at a user-private directory; world/group-writable dirs are refused) |
| PARSIMONY_FAISS_IVF_THRESHOLD | Row count at/above which FAISS index construction switches to IndexIVFFlat (default 500000) |
Parsimony is the foundation that the rest of the parsimony / Ockham ecosystem builds on. The dependency direction is one-way:
parsimony-core → parsimony-<name> connectors → parsimony-agents → applications
- This package (parsimony-core) depends on nothing else in the ecosystem.
- Each connector distribution depends on parsimony-core and registers through the parsimony.providers entry point.
- Higher-level packages (e.g. parsimony-agents) build on top, consuming connectors and the catalog through the public API here.
make install # uv pip install -e ".[dev]"
make test # pytest tests/ -x --tb=short -q
make test-cov # pytest with coverage (--cov-fail-under=80)
make lint # ruff check + ruff format --check (parsimony/ tests/ examples/)
make format # ruff format + ruff check --fix
make typecheck # mypy parsimony/
make check # lint + typecheck + test
Tests run with asyncio_mode=auto and an 80% coverage floor. Two pytest markers gate heavier tests:
- integration: hits live APIs (may be slow, requires env vars).
- slow: heavy local tests; opt-in.
To exercise the full FAISS + BM25 + sentence-transformers paths during development, install with the standard extra (the dev extra already pulls it in).
Apache-2.0. See LICENSE.
