Releases: dgunning/edgartools

EdgarTools 5.35.1

04 Jun 11:02

10-K section detection and agent TOC parsing receive two targeted fixes that close gaps introduced in 5.34.0.

Fixed

  • Spurious Part IV Item 1/1A keys no longer appear in 10-K section maps — the section detector emitted duplicate entries for Items 1 and 1A under the Part IV heading of certain 10-Ks; the keys are now dropped so lookups return the correct Part I sections. (#836)
  • Agent TOC parsers no longer drop Item 1 on title-only rows — when a TOC row contained only a title with no page number or hyperlink, the parser silently skipped Item 1; the row is now accepted and keyed correctly. (#837)

EdgarTools 5.35.0

02 Jun 22:11

BDC non-accrual extraction no longer depends on a filer phrasing its footnotes exactly the way our whitelist expected, and a parsing gap is now surfaced as a warning rather than read as a confirmed zero.

Added

  • edgar.__version__ — the installed version is now exposed at the package root (import edgar; edgar.__version__), following the standard pkg.__version__ convention so downstream consumers can detect which version they have without reading edgar.__about__ or running pip show. (#794)
  • NonAccrualResult.warnings — flags a portfolio that produced no non-accrual signal from any extraction layer, and recognized flags that resolved no investments, so an LLM consumer never mistakes a parsing gap for a confirmed zero. Surfaced in to_context, mirroring the Section.warnings pattern.
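
A downstream consumer might gate on the installed version before relying on fields such as NonAccrualResult.warnings. The sketch below is illustrative only: installed_version and version_tuple are hypothetical helper names, and a stand-in object replaces `import edgar` so the example runs without the package installed.

```python
import re
from importlib import metadata
from types import SimpleNamespace


def installed_version(module, dist_name="edgartools"):
    """Prefer module.__version__; fall back to installed package metadata."""
    version = getattr(module, "__version__", None)
    if version is not None:
        return version
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def version_tuple(version):
    """Turn '5.35.1' into (5, 35, 1); stop at the first non-numeric piece."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


# Stand-in for `import edgar` (assumption: the real module exposes __version__
# the same way from 5.35.0 onward, as stated in the release note above).
fake_edgar = SimpleNamespace(__version__="5.35.1")
has_warnings_field = version_tuple(installed_version(fake_edgar)) >= (5, 35, 0)
```

Comparing integer tuples rather than raw strings avoids the lexical trap where "5.9.0" sorts after "5.35.0".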

Fixed

  • BDC non-accrual footnote detection is now robust to wording drift — the exact-phrase affirmative-pattern whitelist silently dropped any footnote a filer didn't phrase as an enumerated sentence. MAIN changed "Non-accrual and non-income producing…" to "…or…" and its 10-Q returned an empty list; PSEC's verb-less "Investment on non-accrual status as of the reporting date" matched nothing. The binary regex gate is replaced with a layered classifier (mention → negation → explicit pattern → structure-corroborated short label) that accepts short footnotes linked to specific investment facts regardless of exact phrasing, while long rollforward/policy footnotes stay excluded by length. Real-world impact: PSEC 0 → 5 non-accrual investments, GBDC now extracts footnote-level detail, MAIN/ARCC/FSK unchanged. (#835)
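
The layered classifier described above (mention, negation, explicit pattern, structure-corroborated short label) can be sketched roughly as follows. This is not the library's implementation: the regexes, the 80-character cutoff, and the function name are assumptions chosen for illustration.

```python
import re

# Layer 1: the footnote must mention non-accrual at all.
MENTION = re.compile(r"non[-\s]?accrual", re.IGNORECASE)
# Layer 2: reject negated statements such as "not on non-accrual status".
NEGATION = re.compile(
    r"\b(?:not|no longer|none|no)\b[^.]{0,40}?non[-\s]?accrual", re.IGNORECASE
)
# Layer 3: an explicit affirmative phrase is accepted directly.
EXPLICIT = re.compile(r"\bon\s+non[-\s]?accrual\b", re.IGNORECASE)
# Layer 4: short labels count only when linked to investment facts.
MAX_LABEL_CHARS = 80  # hypothetical threshold


def is_non_accrual_footnote(text, linked_investments):
    """Return True when a footnote should be treated as a non-accrual flag."""
    if not MENTION.search(text):
        return False
    if NEGATION.search(text):
        return False
    if EXPLICIT.search(text):
        return True
    # Wording drifted, but a short footnote attached to specific investments
    # is still accepted; long policy/rollforward text falls through.
    return len(text) <= MAX_LABEL_CHARS and linked_investments > 0
```

Under these assumptions, a verb-less label like "Investment on non-accrual status as of the reporting date" passes at layer 3, while "Non-accrual or non-income producing" passes at layer 4 only because it is short and linked to investment facts.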

Install: pip install -U edgartools

Full Changelog: v5.34.0...v5.35.0

EdgarTools 5.34.0

02 Jun 10:36

SEC section extraction is now form-aware by design: form structure is declarative data rather than 10-K-shaped heuristics, link-less-TOC bank filings (Goldman Sachs, Citigroup) extract their items correctly, and wrong-content sections are flagged instead of trusted.

Added

  • Section.markdown() now works on TOC-detected sections — slices the section HTML and renders structure-preserving markdown (tables, lists) instead of falling back to flat text. Completes the Section.markdown() work from 5.32.0.
  • Per-form section schema — each form's extraction rules live in a declarative schema (form_schema.py) instead of branches in the TOC analyzer; supporting a new form is now a table entry.
  • Body-header item recovery — recovers canonical items from link-less-TOC 10-Ks (Goldman Sachs: 13 garbage sections → 21 correct items). Fires only when the linked-TOC parse is incomplete, so well-formed filings are untouched.
  • Section.warnings — flags sections whose content size is anomalous (truncated or over-captured) instead of returning them at high confidence.

Fixed

  • TenQ['Item 1'] returned Legal Proceedings instead of Financial Statements — pre-header 10-Q items were keyed without their Part prefix, so lookups fell through to Part II.
  • Fund get_company() silently returned None — SEC now types fund CIKs as numeric (225323.0), which broke key matching; CIKs are normalized through int so all forms key identically.
  • TenK.items now returns canonical SEC order (1, 1A, … 16) on all paths, not detection order.
  • Bare 10-K item keys get their canonical part prefix inferred from the item number; "Item 8" in sections still works.
  • Filer-specific item suffixes (e.g. Caterpillar "Item 1D") are accepted instead of dropped as non-canonical.
  • Descriptive free-text and bare Part labels no longer leak as sections in the generic TOC path.
  • 'part' no longer false-matches inside words like "counterparties" when inferring Part context.
  • TOC analyzer logs internal failures instead of silently degrading to the generic scraper.
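
Three of the fixes above come down to small normalization rules. The helpers below are a hedged sketch of those rules, not the library's code: normalize_cik, find_part, and item_sort_key are hypothetical names, and the regexes are assumptions.

```python
import re


def normalize_cik(raw):
    """Normalize 225323, 225323.0, '225323.0' or '0000225323' to int 225323."""
    return int(float(raw))


# \b keeps 'part' from matching inside words such as "counterparties".
PART_RE = re.compile(r"\bpart\s+(iv|i{1,3})\b", re.IGNORECASE)


def find_part(text):
    """Return the Part numeral ('I'..'IV') mentioned in text, or None."""
    match = PART_RE.search(text)
    return match.group(1).upper() if match else None


ITEM_RE = re.compile(r"Item\s+(\d+)([A-Z]?)")


def item_sort_key(item):
    """Sort 'Item 1' < 'Item 1A' < 'Item 7' < 'Item 10' (numeric, then suffix)."""
    match = ITEM_RE.fullmatch(item)
    if match is None:
        return (float("inf"), item)  # unknown shapes sort last
    return (int(match.group(1)), match.group(2))


ordered = sorted(["Item 7", "Item 1A", "Item 10", "Item 1"], key=item_sort_key)
```

Sorting on a (number, suffix) tuple is what keeps "Item 10" from landing before "Item 2", which a plain string sort would do.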

Changed

  • Refreshed bundled reference data: ct.pq (CUSIP→ticker, 13F rendering) was refreshed from SEC Fails-to-Deliver data and merged to preserve coverage (68,512 CUSIPs); company_tickers.parquet (ticker↔CIK resolution) was refreshed as a clean mirror of SEC's current data (10,365 entries).

Installation

pip install edgartools==5.34.0

v5.32.0

28 May 14:31

Added

  • xbrl.calculation_linkbase() DataFrame — exposes the per-filing calculation linkbase as one row per parent→child arc, with signed weight, role URI, taxonomy attribution (us-gaap vs filer extension), and SEC menucat classification. Enables external pipelines (e.g., bank revenue disaggregation, REIT rental income rollups) to build per-filer concept hierarchies without re-parsing _cal.xml. Layer 1 of the GH #766 implementation plan; the parser was already producing this data on CalculationTree/CalculationNode, this is a DataFrame projection over existing output. (#766)

  • Statement.extension_arcs() — surfaces filer-authored concepts that participate in a statement's calculation linkbase but are absent from its presentation tree, i.e. concepts that silently drop from render() output today. Opt-in via Statement.extension_arcs(include_values=False); default mode returns one ExtensionArc per concept (structural), include_values=True emits one per (concept, context) with the instance value attached. The existing render() path is untouched. Layer 2 of GH #766. Ground-truth verified on JPM FY2023 10-K cash flow (jpm:NetChangeInAdvancesToandInvestmentsInSubsidiaries, jpm:NetBorrowingsFromSubsidiaries — both calc-present, presentation-absent). (#766)

  • Section.markdown() accessor — closes the gap between Section.text() (item-aware but flattens tables and bullet lists) and Filing.markdown() (preserves structure but whole-document only). Per-item chunkers / RAG pipelines can now get structure-preserving markdown scoped to a single section. Pattern/heading-detected sections render the cached node tree via MarkdownRenderer; TOC-detected sections currently fall back to Section.text() to avoid corrupting adjacent-section markup (full TOC support tracked as a follow-up). Real-filing regression on AAPL 8-K Item 9.01 exhibit table locks in the pipe-table contract. (#833, contributor @HonzaCuhel)
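
To make the calculation-linkbase items above concrete, the following sketch models arcs as (parent, child, signed weight) records and shows the two ideas involved: rolling children up into a parent, and finding concepts that are present in the calculation arcs but absent from the presentation tree. The Arc class, the function names, the ex: concept names, and the values are all illustrative assumptions; the real DataFrame columns and ExtensionArc fields may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Arc:
    parent: str
    child: str
    weight: float  # +1.0 adds to the parent, -1.0 subtracts


def calc_only_concepts(arcs, presentation_concepts):
    """Concepts used in calculation arcs but missing from the presentation tree."""
    found = []
    for arc in arcs:
        for concept in (arc.parent, arc.child):
            if concept not in presentation_concepts and concept not in found:
                found.append(concept)
    return found


def rollup(arcs, values, parent):
    """Sum weight * value over the parent's children that have a value."""
    return sum(
        arc.weight * values[arc.child]
        for arc in arcs
        if arc.parent == parent and arc.child in values
    )


arcs = [
    Arc("ex:NetCashFromInvesting", "ex:PurchasesOfSecurities", -1.0),
    Arc("ex:NetCashFromInvesting", "ex:ProceedsFromSales", 1.0),
    Arc("ex:NetCashFromInvesting", "ex:SubsidiaryAdvances", -1.0),
]
presentation = {
    "ex:NetCashFromInvesting",
    "ex:PurchasesOfSecurities",
    "ex:ProceedsFromSales",
}
values = {
    "ex:PurchasesOfSecurities": 30.0,
    "ex:ProceedsFromSales": 120.0,
    "ex:SubsidiaryAdvances": 20.0,
}

hidden = calc_only_concepts(arcs, presentation)
total = rollup(arcs, values, "ex:NetCashFromInvesting")
```

In this toy data the third child contributes to the parent total but would never appear in a rendering driven by the presentation tree alone, which is the gap Statement.extension_arcs() is described as closing.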

Fixed

  • StreamingParser dropped 20%+ of text from <span>-wrapped paragraphs on large filings: for SEC filings crossing the 10 MB streaming threshold (so most ~30–110 MB 10-Ks/20-Fs), filing.text() silently returned output 20%+ shorter than the non-streaming path. Two compounding bugs in the iterparse loop caused this: elem.clear() ran on every event (both start and end), and it ran on every element regardless of whether an enclosing structural element (<p>, <h1> to <h6>, <section>) had finished reading its children. Since SEC filings wrap virtually every word in <span style="…">, the inner <span>'s end event cleared .text/.tail before the enclosing <p> could read them, so paragraphs came out empty, with no warning. Clearing now runs only on end events and is gated on a new _content_depth counter (mirroring the existing _table_depth gate). A separate gate prevents <p>/<h*>/<section> inside <td> from being emitted twice. (#830, contributor @kevinchiu)

  • HTTP_MGR had no default timeout — stalled requests could block workers indefinitely — the internal httpx client was constructed without a timeout, so a stalled upstream or slow TLS handshake could pin a worker on an uninterruptible socket read syscall. Downstream users observed processes running 50+ minutes past their job budget on a single request. get_http_mgr() now sets Timeout(30.0, connect=10.0) by default; EDGAR_HTTP_TIMEOUT (seconds) configures it statically and the existing configure_http(timeout=...) runtime API still works. Callers that need unbounded waits can opt out explicitly. (#831, contributor @kevinchiu)

  • 13F-HR holdings merged Put/Call positions into the underlying equity row: ThirteenF.holdings grouped by CUSIP alone, so Put/Call rows aggregated into the same security's equity row and the PutCall column was lost on the merged result. Categories also used uppercase PUT/CALL while SEC XML emits title-case Put/Call, so the categorical conversion silently dropped those values too. The group key now includes PutCall when the column exists, and category labels match SEC XML. Regression verified on SG Capital Management 13F-HR/A (3 distinct Put positions preserved in the aggregated view). (#824)

  • import edgar emitted DeprecationWarning on every startup — the legacy HTML modules (edgar.files.html_documents, edgar.files.html, edgar.files.htmltools) emitted warnings at module top, and edgartools' own startup cascade imports them, so the warnings fired on every fresh import. Downstream test suites running under -W error (a recommended pytest setup) had to install warning filters just to let import edgar succeed. The deprecation signal moved from module top to per-class __init__, so internal callers don't trip the warning while user-instantiated legacy classes still do. (#832, contributor @kevinchiu)

  • Filing.search() / Filing.grep() returned nothing on pre-2002 plain-text filings: Filing.search() raised AssertionError and Filing.grep() returned 0 matches on plain-text filings (e.g. PCG's 1999 10-K). Both relied on attachment iteration, which finds nothing because SGML decomposition emits empty shells for text-only filings. sections() now falls back to chunking filing.text() on <PAGE> markers or blank lines when html() is None, and grep() falls back to filing.text() when no attachment yields usable text. (#819)

  • TOC analyzer fabricated phantom Items on 10-Q filings: TOCAnalyzer had three 10-K-shaped heuristics that fired regardless of form. It accepted any bare number 1–15 as an item identifier in preceding-<td> siblings (so a page-number cell like <td>8</td> became "Item 8"); it mapped any "financial statements" link to "Item 8" (correct for 10-K, wrong for 10-Q, where Financial Statements is Part I, Item 1); and it sorted using a 10-K-shaped section-order table. All three heuristics are now form-guarded. (#827, contributor @HonzaCuhel)

  • SearchResults panel labels conflated BM25 rank with section index: SearchResults.__rich__ used the enumeration rank of the sorted display as the panel title, so the same numeric label meant different things in the BM25 and regex paths (BM25 sorts by score, regex preserves original order). "0" in BM25 output was the top-scoring section while "0" in regex output was the first section that matched, and the two were rarely the same. Panels now display DocSection.loc (the section's index in filing.sections()) consistently across search methods, so callers can index back into the corpus regardless of search mode. (#765)
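
The StreamingParser fix in this list (clear only on end events, and only when no enclosing structural element is still open) can be illustrated with a small depth-gated loop. This is a sketch under stated assumptions: it uses the standard library's xml.etree.ElementTree.iterparse as a stand-in for the real parser, and the names STRUCTURAL and paragraphs_gated are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

STRUCTURAL = {"p", "h1", "h2", "h3", "h4", "h5", "h6", "section"}


def paragraphs_gated(data):
    """Collect text of outermost structural elements, freeing memory safely."""
    out = []
    content_depth = 0  # number of structural elements currently open
    for event, elem in ET.iterparse(io.BytesIO(data), events=("start", "end")):
        if event == "start":
            if elem.tag in STRUCTURAL:
                content_depth += 1
            continue  # never clear on start events
        if elem.tag in STRUCTURAL:
            content_depth -= 1
            if content_depth == 0:
                # All children (e.g. <span>) are still intact here.
                out.append("".join(elem.itertext()).strip())
        if content_depth == 0:
            # Safe to free: no enclosing <p>/<h*>/<section> still needs it.
            elem.clear()
    return out


SAMPLE = (
    b"<doc><p><span>Hello</span> <span>world</span></p>"
    b"<p><span>Bye</span></p></doc>"
)
```

Without the depth gate, clearing each inner <span> at its own end event would wipe its text before the enclosing <p> is read, which is the failure mode the release note describes.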

Documentation

  • calculation_linkbase() and Statement.extension_arcs() documented alongside Phase 1 and Phase 2 of the GH #766 implementation, including the difference from presentation linkbase and worked examples on real filings. (#766, Phase 3)

Contributors: @HonzaCuhel, @kevinchiu, @0ywfe

Full changelog: v5.31.5...v5.32.0

v5.31.5

22 May 01:26

Fixed

  • xbrl.facts.to_dataframe() mislabeled Q2/Q3 as Q3/Q4 for 52/53-week fiscal-year filers (JNJ, PFE, AAPL, COST) — the XBRL instance parser's _quarter_for_date classified the fiscal quarter from the raw calendar month of the period end. 52/53-week issuers pin quarter ends to a weekday near the calendar quarter boundary, so the period_end can drift into the first days of the following month — JNJ Q2 2023 ended 2023-07-02, Q3 2023 ended 2023-10-01 — bucketing those facts into the next quarter. The EntityFacts layer already handled this via calculate_fiscal_year_for_label, but the XBRL parser has an independent fiscal classification path feeding xbrl.facts.to_dataframe() and query().by_fiscal_period(...), silently misclassifying quarterly data for any RAG / analytics pipeline reading raw facts. End dates in the first 7 days of a month are now treated as belonging to the previous month for quarter classification; the 7-day window covers max drift for Sunday-nearest (≤3 days), Saturday-nearest (≤1 day), and last-Sat/Sun (no drift) patterns with safety margin. (#816, reporter @kmatosli)
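
The 7-day rule above is easy to check by hand. The sketch below shifts period ends that fall in the first 7 days of a month back to the previous month before classifying the quarter. The function names and the fiscal_year_end_month parameter are assumptions for illustration; the real _quarter_for_date may be structured differently.

```python
from datetime import date


def effective_month(period_end, drift_days=7):
    """Treat the first `drift_days` days of a month as the previous month."""
    if period_end.day <= drift_days:
        return 12 if period_end.month == 1 else period_end.month - 1
    return period_end.month


def fiscal_quarter(period_end, fiscal_year_end_month=12, drift_days=7):
    """Quarter number (1-4) relative to the month the fiscal year ends in."""
    month = effective_month(period_end, drift_days)
    months_into_year = (month - fiscal_year_end_month - 1) % 12
    return months_into_year // 3 + 1


# Period ends quoted in the release note for JNJ.
jnj_q2 = fiscal_quarter(date(2023, 7, 2))
jnj_q3 = fiscal_quarter(date(2023, 10, 1))
```

With a December fiscal year end, 2023-07-02 is classified as Q2 and 2023-10-01 as Q3, whereas classifying on the raw calendar month would have put them in Q3 and Q4.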

Full Changelog: v5.31.4...v5.31.5

v5.31.4

21 May 20:02

Fixed

  • Empty income statement on 16-week-quarter filers (CAVA, RRGB) — quarterly period selection bucketed durations as 80-100 days or 150-285 days, leaving CAVA's 111-day Q1 in a dead zone. The selector now anchors on filing.period_of_report. (#822, reporter @mkdeak)

  • TenK.business silently returned Part II MD&A content on GS's 2025 10-K — the cross-Part lookup happily returned a mislabeled part_ii_item_1 key. Item lookup is now constrained to the SEC-canonical Part per item. (#821, reporter @FlorinAndrei)

  • Viewer ConceptRow.numeric_value returned wrong values on ADI 2019 and ADSK 2019 10-Ks: primary_period is now form-aware (annual forms prefer the longest "X Months Ended" duration), and class="th" spacer cells are dropped from body rows so column positions align. (#818, reporter @mpreiss9)

  • Filing.search() raised a bare AssertionError on pre-2001 SGML/text filings — replaced with a descriptive ValueError pointing users at filing.text(). (#819, reporter @shenker)
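
The dead zone in the 16-week-quarter fix can be reproduced with the duration ranges quoted above. In this sketch the bucket labels, function names, and sample dates are illustrative assumptions (the dates are not CAVA's actual period), and the anchoring logic is a guess at the general idea rather than the library's selector.

```python
from datetime import date


def legacy_bucket(days):
    """Duration bucketing as described: 80-100 or 150-285 days, else nothing."""
    if 80 <= days <= 100:
        return "quarter"
    if 150 <= days <= 285:
        return "ytd"
    return None  # dead zone: e.g. a 111-day first quarter


def select_quarter_period(periods, period_of_report):
    """Anchor on the report date: keep periods ending on it, prefer the shortest."""
    ending = [(start, end) for start, end in periods if end == period_of_report]
    if not ending:
        return None
    return min(ending, key=lambda p: (p[1] - p[0]).days)


# Hypothetical 16-week first quarter plus the prior-year comparative.
periods = [
    (date(2024, 1, 1), date(2024, 4, 21)),
    (date(2023, 1, 2), date(2023, 4, 23)),
]
chosen = select_quarter_period(periods, date(2024, 4, 21))
chosen_days = (chosen[1] - chosen[0]).days
```

Anchoring on the period end sidesteps the need to guess how long a quarter is, which is what breaks for filers with 16-week first quarters.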

Added

  • 10-K section patterns for Item 1B (Unresolved Staff Comments) and Item 1C (Cybersecurity) — closes the gap left when the same SEC rulemaking added 8-K Item 1.05 and 20-F Item 16K patterns. (#813, contributor @HonzaCuhel)

v5.31.3

17 May 22:08

Fixed

  • viewer.financial_statements returned wrong income statement for filings with multi-row period headers (e.g. ADI 2019 10-K mislabeled annual columns as quarterly). The R*.htm header parser was rewritten to walk <thead> row by row and filter footnote markers. Affected most 10-K/10-Q filings silently. (#812, reporter @mpreiss9)

  • Financials.get_net_income() returned wrong value (often wrong sign) for filers reporting a net loss with a separate noncontrolling-interest line — for Micron Q2 2013 returned +$2M (the NCI row) instead of -$286M. Also fixes IFRS 20-F filers whose row label isn't "Net income" (e.g. Barclays "Profit after tax"). Concept lookup is now exact and IFRS-aware. (#814, reporter @wei-jianlin)
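
The net-income fix above contrasts label matching with exact concept lookup. The sketch below shows why the former can pick up a noncontrolling-interest row: the row layout, labels, and function names are illustrative assumptions, while us-gaap:NetIncomeLoss and ifrs-full:ProfitLoss are standard taxonomy concepts. The 2 and -286 values echo the Micron figures quoted in the note (in $M).

```python
NET_INCOME_CONCEPTS = ("us-gaap:NetIncomeLoss", "ifrs-full:ProfitLoss")


def net_income_by_label(rows):
    """Fuzzy approach: first row whose label contains 'net income'."""
    for row in rows:
        if "net income" in row["label"].lower():
            return row["value"]
    return None


def net_income_by_concept(rows):
    """Exact approach: look up known net-income concepts, US GAAP then IFRS."""
    for concept in NET_INCOME_CONCEPTS:
        for row in rows:
            if row["concept"] == concept:
                return row["value"]
    return None


loss_rows = [
    {
        "concept": "us-gaap:NetIncomeLossAttributableToNoncontrollingInterest",
        "label": "Net income attributable to noncontrolling interests",
        "value": 2,
    },
    {
        "concept": "us-gaap:NetIncomeLoss",
        "label": "Net loss attributable to shareholders",
        "value": -286,
    },
]
ifrs_rows = [
    {"concept": "ifrs-full:ProfitLoss", "label": "Profit after tax", "value": 100},
]
```

In a loss period the headline row is labeled "Net loss", so a label search for "net income" can only match the noncontrolling-interest line; keying on the concept avoids depending on wording.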

v5.31.2

15 May 15:17

Fixed

  • FundReport.options_data() crashed with TypeError: bad operand type for abs(): 'NoneType' on N-PORT filings whose nested forwards had null USD amounts: edgar/funds/reports.py:1011-1012 cast fwd.amount_sold / fwd.amount_purchased through abs() when the corresponding currency_* field equalled 'USD', but valid N-PORT XBRL can pair a stated USD currency with a null amount, so every option-on-forward in such a filing tripped the crash before any data was returned. The documented public API was effectively unusable for any fund whose options referenced such a forward (reproducer: GOF NPORT-P). Both assignments now guard on amount_* is not None; the exchange-rate calculation just below was already safe via Python's short-circuiting. Defensive grep across the file confirmed lines 1011-1012 were the only unguarded abs() calls. (#811, reporter @HristoRaykov)

  • viewer.concept_rows[i].numeric_value silently returned a prior-year value when the primary reporting period had no fact for the row: ConceptRow.numeric_value (and the sibling Concept.value accessor on the concept graph) returned parse_numeric(next(iter(self.values.values()))), i.e. the first entry of the values dict, which was populated only for periods that had a non-empty cell. When the primary (leftmost) reporting period had no value, the singular accessor silently returned whichever period happened to be first in the dict, masking a missing period as a prior-year value. Most visible on the ABT 2019 10-K income statement: concept_rows[16] (us-gaap_IncomeLossFromDiscontinuedOperationsNetOfTax) returned 34.0 (the 2018 value) because ABT had no 2019 discontinued-ops fact. The fix tracks primary_period on ConceptRow (populated by the R*.htm parser from period_headers[0]) and resolves numeric_value against it explicitly, returning None when the primary period has no value. Concept.value in concept_graph.py got the same fix: same antipattern, same underlying row data, user-visible via the concept graph's Rich/text rendering. (#810, reporter @mpreiss9)

  • FundFeeNotice crashed with AttributeError: 'list' object has no attribute 'get' on per-class 24F-2NT filings: xmltodict-style parsing returns repeated annualFilingInfo blocks as a list, but every typed accessor (fund_name, series, aggregate_sales, etc.) called .get() on the result. ~2% of recent 24F-2NT filings, including all five BNY Mellon family filings, file one block per share class, so the first call into the data object raised before any data was returned. The data model now iterates every annualFilingInfo block: typed financial properties (aggregate_sales, net_sales, redemptions_current_year, registration_fee, total_due, …) sum across blocks; metadata properties (fund_name, fiscal_year_end, investment_company_act_file_number) read from block[0] (identical across blocks); series deduplicates by seriesId. A new FundClassFee dataclass + is_per_class flag + class_fees list expose the per-share-class breakdown. The _parse_float helper now also handles accounting-parens notation, reading (NNN) as -NNN, which appears in redemptionCreditsAvailableForUseInFutureYears. Backwards-compatible: every existing property keeps the same return shape; the fund total invariant aggregate_sales == sum(cf.aggregate_sales for cf in class_fees) is verified against BNY Mellon Research Growth Fund. (edgartools-8ohs)

  • viewer.concept_report.currency_scaling returned wrong scales for filers using non-Apple header formats: ConceptReport.currency_scaling was derived from a narrow text match on the R*.htm <th class='tl'> header ($ in millions / $in millions). Filers using In Millions, (in millions), USD ($) in Millions, or Dollars in Millions silently fell through to the default of 1, producing scaling that disagreed across statements within a single filing (ALGN balance sheet vs income statement) and wrong values for whole multi-year ranges (ABNB showing 1 for 2023/2024 when the actual scale is millions). ViewerReport.currency_scaling now derives the scale from the XBRL decimals attribute on monetary facts mapped to the report's role in the presentation linkbase, which is filer-mandated and uniform (-6 → millions, -3 → thousands, 0 → units). The text-match value is retained as a fallback when XBRL is unavailable. The resolved scale is mirrored back onto ConceptReport.currency_scaling so existing code reading it via the concept-report path also benefits. Same precedent as GH #799 (level enrichment from XBRL). (#807, reporter @mpreiss9)
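
The decimals-to-scale mapping in the currency_scaling fix (-6 → millions, -3 → thousands, 0 → units) can be sketched as below. The function name, the majority-vote rule, and the handling of non-integer values such as "INF" are assumptions made for illustration, not the library's logic.

```python
from collections import Counter


def scale_from_decimals(decimals_values, fallback=1):
    """Derive a display scale from XBRL `decimals` values on monetary facts.

    Only non-positive integers are considered, so per-share style values
    (e.g. decimals=2) and non-numeric markers such as 'INF' are ignored.
    """
    usable = [
        d for d in decimals_values
        if isinstance(d, int) and not isinstance(d, bool) and d <= 0
    ]
    if not usable:
        return fallback  # e.g. fall back to the header text match
    most_common, _count = Counter(usable).most_common(1)[0]
    return 10 ** (-most_common)


millions = scale_from_decimals([-6, -6, -6, -3, 2])
thousands = scale_from_decimals([-3, -3])
units = scale_from_decimals([0])
unknown = scale_from_decimals(["INF"])
```

Taking the most common value is one way to get a single scale per report when a few facts are tagged at a different precision.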

v5.31.1

12 May 10:55

Fixed

  • Schedule 13D/13G silently dropped CUSIPs with the new <issuerCusips> wrapper — SEC began wrapping <issuerCusipNumber> inside an <issuerCusips> container element on some Schedule 13D/13G filings (e.g. CIK 1906837 13D, CIK 1425851 13G). The parser's BS4 recursive=False lookup at the top-level only matched the flat layout, so subject_company.cusip came back as '' whenever the wrapper was present. Parsing now falls back to a recursive lookup when the flat probe misses, handling both wire formats. (#802, PR #803 by @HristoRaykov)

  • Schedule 13D/13G event-date attribute name mismatch: Schedule13D exposed the triggering-event date as date_of_event while Schedule13G exposed it as event_date, breaking duck-typing across a mixed list of 13D/13G filings and forcing callers to use getattr / hasattr. Both classes now accept either name; the underlying attribute is unchanged, so existing code keeps working. (#804, PR #805 by @0ywfe)

  • Spurious DocumentTooLargeError from StreamingParser on legitimate documents — The streaming HTML parser accumulated len(etree.tostring(elem)) on every lxml iterparse end event. Because tostring serializes the full subtree and end fires for every closing tag, nested elements were counted multiple times — large nested HTML could trip max_document_size even though the source document was under the limit. The per-event accumulator is also redundant: HTMLParser._parse already validates len(html.encode("utf-8")) against max_document_size before invoking streaming mode. The accumulator and its state are removed; size is now checked once at the top of StreamingParser.parse() and the same encoded bytes are reused for iterparse. (#806 by @kevinchiu)
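
The over-counting behind the spurious DocumentTooLargeError is straightforward to reproduce. The sketch below uses the standard library's ElementTree in place of lxml (an assumption made so it runs without extra dependencies) and sums the serialized size of every element at its end event, the way the removed accumulator is described as doing.

```python
import io
import xml.etree.ElementTree as ET


def accumulated_size(data):
    """Sum len(tostring(elem)) over every end event, as the old accumulator did."""
    total = 0
    for _event, elem in ET.iterparse(io.BytesIO(data), events=("end",)):
        # tostring() serializes the whole subtree, so each nested element is
        # counted again inside every one of its ancestors.
        total += len(ET.tostring(elem))
    return total


DOC = b"<a><b><c>xxxxxxxx</c></b></a>"
actual = len(DOC)
counted = accumulated_size(DOC)
```

Even with three levels of nesting the accumulated figure exceeds the real document size, and the gap grows with depth, which is why checking the encoded length once up front is the more reliable test.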

Full Changelog: v5.31.0...v5.31.1

v5.31.0

08 May 14:35

Added

  • include_quarterly parameter on stitched XBRLS statements: XBRLS.from_filings() previously emitted a single column per filing, preferring YTD/annual over the discrete-quarter period when both existed in the source XBRL (Issue #475 design). This created a parity gap with single-filing XBRL, which surfaces both. The new opt-in include_quarterly parameter (default False) on XBRLS.get_statement(), StitchedStatement, and statements.income_statement() / cashflow_statement(), when set to True, causes each 10-Q to contribute both a 90-day discrete column and the YTD column, and each 10-K to contribute both an annual column and its embedded Q4 column. It is distinct from discrete_quarters (v5.30.3), which derives quarterly cash-flow values by subtraction; include_quarterly surfaces facts already in the filing. Default behavior is preserved. It has no effect on the Balance Sheet (instant periods only). (#780, reporter @AhmedShaker12)

Fixed

  • viewer.financial_statements silently dropped income statements miscategorized in FilingSummary.xml — AbbVie's 2021 10-K placed Consolidated Statements of Earnings under MenuCategory='Uncategorized' instead of 'Statements' — a filer mistake that EdgarTools faithfully reflected, so the income statement disappeared from viewer.financial_statements while comparable 2019/2020/2022-2025 filings worked fine. The viewer now returns the union of FilingSummary MenuCategory='Statements' and MetaLinks groupType='statement', deduplicated by HTML filename, in filing-position order. MetaLinks reflects XBRL taxonomy classification and is more reliable than filer-provided menu metadata. (#797, reporter @mpreiss9)

  • viewer.concept_rows[*].level always returned 0: modern SEC R*.htm files don't encode hierarchy in the rendered HTML. This was empirically verified across 10 diverse 2025 10-Ks (AAPL, ABT, JPM, WMT, XOM, VZ, MSFT, GS, PFE, BRK.B): zero plN class tokens on primary statements, almost no padding-left styles, no row nesting. The canonical source is the XBRL presentation linkbase, which the existing parser already loads as xbrl.presentation_trees[role].all_nodes[concept_id].depth. The viewer now lazy-loads the parsed XBRL on first concept_rows access and populates ConceptRow.level from the presentation tree, normalized so the smallest depth observed in a report becomes 0. For the issue's canary case (ABT balance sheet) the level distribution went from {0: 45} to {0: 15, 1: 26, 2: 4}. (#799, reporter @mpreiss9, investigation by @tjhub1983)

  • XBRLS.from_filings(list, filter_amendments=True) crashed with AttributeError: the signature accepts Union[Filings, List[Filing]] and defaults filter_amendments=True, but the implementation called filings.filter() unconditionally, raising AttributeError: 'list' object has no attribute 'filter' whenever a plain list was passed. The implementation now branches on whether the input has a .filter method; for plain lists it falls back to a form-suffix check that drops forms ending in /A. (edgartools-6k96)
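
The branch described in the from_filings fix (use .filter when the input has one, otherwise drop forms ending in /A) might look roughly like this. FilingsLike and its filter(amendments=...) signature are hypothetical stand-ins defined here so the sketch runs on its own; they are not the real Filings API.

```python
from types import SimpleNamespace


class FilingsLike:
    """Hypothetical container with its own .filter(), standing in for Filings."""

    def __init__(self, items):
        self.items = list(items)

    def filter(self, amendments=True):
        if amendments:
            return self
        return FilingsLike(f for f in self.items if not f.form.endswith("/A"))

    def __iter__(self):
        return iter(self.items)


def without_amendments(filings):
    """Drop amended filings from either a container or a plain list."""
    if hasattr(filings, "filter"):
        # Container path: delegate to its own filter (assumed signature).
        return list(filings.filter(amendments=False))
    # Plain-list path: fall back to a form-suffix check.
    return [f for f in filings if not f.form.endswith("/A")]


sample = [
    SimpleNamespace(form="10-K"),
    SimpleNamespace(form="10-K/A"),
    SimpleNamespace(form="10-Q"),
]
from_list = [f.form for f in without_amendments(sample)]
from_container = [f.form for f in without_amendments(FilingsLike(sample))]
```

Checking for the method rather than the type keeps the function usable with any list-like input that carries a form attribute.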

Full Changelog: v5.30.3...v5.31.0