Releases: dgunning/edgartools
EdgarTools 5.35.1
10-K section detection and agent TOC parsing receive two targeted fixes that close gaps introduced in 5.34.0.
Fixed
- Spurious Part IV Item 1/1A keys no longer appear in 10-K section maps — the section detector emitted duplicate entries for Items 1 and 1A under the Part IV heading of certain 10-Ks; the keys are now dropped so lookups return the correct Part I sections. (#836)
- Agent TOC parsers no longer drop Item 1 on title-only rows — when a TOC row contained only a title with no page number or hyperlink, the parser silently skipped Item 1; the row is now accepted and keyed correctly. (#837)
EdgarTools 5.35.0
BDC non-accrual extraction no longer depends on a filer phrasing its footnotes exactly the way our whitelist expected, and a parsing gap is now surfaced as a warning rather than read as a confirmed zero.
Added
- `edgar.__version__` — the installed version is now exposed at the package root (`import edgar; edgar.__version__`), following the standard `pkg.__version__` convention so downstream consumers can detect which version they have without reading `edgar.__about__` or running `pip show`. (#794)
- `NonAccrualResult.warnings` — flags a portfolio that produced no non-accrual signal from any extraction layer, and recognized flags that resolved no investments, so an LLM consumer never mistakes a parsing gap for a confirmed zero. Surfaced in `to_context`, mirroring the `Section.warnings` pattern.
Fixed
- BDC non-accrual footnote detection is now robust to wording drift — the exact-phrase affirmative-pattern whitelist silently dropped any footnote a filer didn't phrase as an enumerated sentence. MAIN changed "Non-accrual and non-income producing…" to "…or…" and its 10-Q returned an empty list; PSEC's verb-less "Investment on non-accrual status as of the reporting date" matched nothing. The binary regex gate is replaced with a layered classifier (mention → negation → explicit pattern → structure-corroborated short label) that accepts short footnotes linked to specific investment facts regardless of exact phrasing, while long rollforward/policy footnotes stay excluded by length. Real-world impact: PSEC 0 → 5 non-accrual investments, GBDC now extracts footnote-level detail, MAIN/ARCC/FSK unchanged. (#835)
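The layered order described in the non-accrual fix (mention, then negation, then explicit pattern, then structure-corroborated short label) can be sketched in a few lines. This is a minimal illustration and not the library's implementation: the regexes, the `MAX_LABEL_CHARS` threshold, and the `is_non_accrual_footnote` function are all hypothetical.

```python
import re

# Hypothetical patterns and threshold, chosen only to illustrate the layering.
MENTION = re.compile(r"\bnon-?accrual\b", re.IGNORECASE)
NEGATION = re.compile(r"\b(?:no|not|none)\b[^.]{0,60}?\bnon-?accrual\b", re.IGNORECASE)
EXPLICIT = re.compile(r"\b(?:is|was|are|were|placed)\s+on\s+non-?accrual\b", re.IGNORECASE)
MAX_LABEL_CHARS = 120


def is_non_accrual_footnote(text: str, linked_to_investment: bool) -> bool:
    """Classify a footnote by walking the layers in order."""
    if not MENTION.search(text):   # layer 1: the term must appear at all
        return False
    if NEGATION.search(text):      # layer 2: "not on non-accrual" is a denial
        return False
    if EXPLICIT.search(text):      # layer 3: an enumerated, affirmative sentence
        return True
    # layer 4: a short label counts only when tied to a specific investment fact;
    # long rollforward or policy text fails the length check.
    return linked_to_investment and len(text) <= MAX_LABEL_CHARS


verbless = "Investment on non-accrual status as of the reporting date"
denial = "This investment is not on non-accrual status."
rollforward = (
    "The following table presents a rollforward of the fair value of investments, "
    "including those on non-accrual status, for the period, together with changes "
    "in unrealized appreciation and depreciation."
)
print(is_non_accrual_footnote(verbless, linked_to_investment=True))
print(is_non_accrual_footnote(denial, linked_to_investment=True))
print(is_non_accrual_footnote(rollforward, linked_to_investment=True))
```

The verb-less label is accepted only because it is short and linked to an investment; the same text with no linked fact would be rejected in this sketch.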
Install: pip install -U edgartools
Full Changelog: v5.34.0...v5.35.0
EdgarTools 5.34.0
SEC section extraction is now form-aware by design: form structure is declarative data rather than 10-K-shaped heuristics, link-less-TOC bank filings (Goldman Sachs, Citigroup) extract their items correctly, and wrong-content sections are flagged instead of trusted.
Added
- `Section.markdown()` now works on TOC-detected sections — slices the section HTML and renders structure-preserving markdown (tables, lists) instead of falling back to flat text. Completes the `Section.markdown()` work from 5.32.0.
- Per-form section schema — each form's extraction rules live in a declarative schema (`form_schema.py`) instead of branches in the TOC analyzer; supporting a new form is now a table entry.
- Body-header item recovery — recovers canonical items from link-less-TOC 10-Ks (Goldman Sachs: 13 garbage sections → 21 correct items). Fires only when the linked-TOC parse is incomplete, so well-formed filings are untouched.
- `Section.warnings` — flags sections whose content size is anomalous (truncated or over-captured) instead of returning them at high confidence.
Fixed
- `TenQ['Item 1']` returned Legal Proceedings instead of Financial Statements — pre-header 10-Q items were keyed without their Part prefix, so lookups fell through to Part II.
- Fund `get_company()` silently returned `None` — SEC now types fund CIKs as numeric (`225323.0`), which broke key matching; CIKs are normalized through `int` so all forms key identically.
- `TenK.items` now returns canonical SEC order (1, 1A, … 16) on all paths, not detection order.
- Bare 10-K item keys get their canonical part prefix inferred from the item number; `"Item 8" in sections` still works.
- Filer-specific item suffixes (e.g. Caterpillar "Item 1D") are accepted instead of dropped as non-canonical.
- Descriptive free-text and bare Part labels no longer leak as sections in the generic TOC path.
- `'part'` no longer false-matches inside words like "counterparties" when inferring Part context.
- TOC analyzer logs internal failures instead of silently degrading to the generic scraper.
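Two of the fixes above come down to small normalization steps, sketched here with the standard library only. The helper names `normalize_cik` and `mentions_part` are hypothetical and are not edgartools functions; the sketch only shows why a float-typed CIK breaks key matching and why a substring test matches "counterparties".

```python
import re


def normalize_cik(raw) -> int:
    """Collapse 225323, 225323.0, "225323.0" and "0000225323" to the same int key."""
    return int(float(raw))


# Word boundaries keep "part" from matching inside longer words.
PART_WORD = re.compile(r"\bpart\b", re.IGNORECASE)


def mentions_part(text: str) -> bool:
    return PART_WORD.search(text) is not None


lookup = {normalize_cik("0000225323"): "example fund"}
print(normalize_cik(225323.0) in lookup)                  # float-typed CIK still matches
print("part" in "Credit risk from counterparties")        # naive substring test
print(mentions_part("Credit risk from counterparties"))   # word-boundary test
print(mentions_part("PART II"))
```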
Changed
- Refreshed bundled reference data — `ct.pq` (CUSIP→ticker, 13F rendering) refreshed from SEC Fails-to-Deliver and merged to preserve coverage (68,512 CUSIPs); `company_tickers.parquet` (ticker↔CIK resolution) refreshed as a clean mirror of SEC's current data (10,365 entries).
Installation
pip install edgartools==5.34.0
v5.32.0
Added
- `xbrl.calculation_linkbase()` DataFrame — exposes the per-filing calculation linkbase as one row per parent→child arc, with signed weight, role URI, taxonomy attribution (us-gaap vs filer extension), and SEC menucat classification. Enables external pipelines (e.g., bank revenue disaggregation, REIT rental income rollups) to build per-filer concept hierarchies without re-parsing `_cal.xml`. Layer 1 of the GH #766 implementation plan. The parser was already producing this data on `CalculationTree`/`CalculationNode`; this is a DataFrame projection over existing output. (#766)
- `Statement.extension_arcs()` — surfaces filer-authored concepts that participate in a statement's calculation linkbase but are absent from its presentation tree, i.e. concepts that silently drop from `render()` output today. Opt-in via `Statement.extension_arcs(include_values=False)`; default mode returns one `ExtensionArc` per concept (structural); `include_values=True` emits one per (concept, context) with the instance value attached. The existing `render()` path is untouched. Layer 2 of GH #766. Ground-truth verified on JPM FY2023 10-K cash flow (`jpm:NetChangeInAdvancesToandInvestmentsInSubsidiaries`, `jpm:NetBorrowingsFromSubsidiaries` — both calc-present, presentation-absent). (#766)
- `Section.markdown()` accessor — closes the gap between `Section.text()` (item-aware but flattens tables and bullet lists) and `Filing.markdown()` (preserves structure but whole-document only). Per-item chunkers / RAG pipelines can now get structure-preserving markdown scoped to a single section. Pattern/heading-detected sections render the cached node tree via `MarkdownRenderer`; TOC-detected sections currently fall back to `Section.text()` to avoid corrupting adjacent-section markup (full TOC support tracked as a follow-up). Real-filing regression on AAPL 8-K Item 9.01 exhibit table locks in the pipe-table contract. (#833, contributor @HonzaCuhel)
Fixed
- `StreamingParser` dropped 20%+ of text from `<span>`-wrapped paragraphs on large filings — for SEC filings crossing the 10 MB streaming threshold (so most ~30–110 MB 10-Ks/20-Fs), `filing.text()` silently returned output 20%+ shorter than the non-streaming path. Two compounding bugs in the `iterparse` loop: `elem.clear()` ran on every event (both start and end), and ran on every element regardless of whether an enclosing structural element (`<p>`, `<h1>`–`<h6>`, `<section>`) had finished reading its children. Since SEC filings wrap virtually every word in `<span style="…">`, the inner `<span>`'s end event cleared `.text`/`.tail` before the enclosing `<p>` could read them — paragraphs came out empty, with no warning. Clearing now runs only on `end` events and is gated on a new `_content_depth` counter (mirroring the existing `_table_depth` gate). A separate gate prevents `<p>`/`<h*>`/`<section>` inside `<td>` from being emitted twice. (#830, contributor @kevinchiu)
- `HTTP_MGR` had no default timeout — stalled requests could block workers indefinitely — the internal `httpx` client was constructed without a timeout, so a stalled upstream or slow TLS handshake could pin a worker on an uninterruptible socket read syscall. Downstream users observed processes running 50+ minutes past their job budget on a single request. `get_http_mgr()` now sets `Timeout(30.0, connect=10.0)` by default; `EDGAR_HTTP_TIMEOUT` (seconds) configures it statically and the existing `configure_http(timeout=...)` runtime API still works. Callers that need unbounded waits can opt out explicitly. (#831, contributor @kevinchiu)
- 13F-HR `holdings` merged Put/Call positions into the underlying equity row — `ThirteenF.holdings` grouped by CUSIP alone, so Put/Call rows aggregated into the same security's equity row and the `PutCall` column was lost on the merged result. Categories also used uppercase `PUT`/`CALL` while SEC XML emits title-case `Put`/`Call`, so the categorical conversion silently dropped those values too. Group key now includes `PutCall` when the column exists; category labels match SEC XML. Regression verified on SG Capital Management 13F-HR/A (3 distinct Put positions preserved in the aggregated view). (#824)
- `import edgar` emitted `DeprecationWarning` on every startup — the legacy HTML modules (`edgar.files.html_documents`, `edgar.files.html`, `edgar.files.htmltools`) emitted warnings at module top, and edgartools' own startup cascade imports them, so the warnings fired on every fresh import. Downstream test suites running under `-W error` (a recommended pytest setup) had to install warning filters just to let `import edgar` succeed. The deprecation signal moved from module top to per-class `__init__`, so internal callers don't trip the warning while user-instantiated legacy classes still do. (#832, contributor @kevinchiu)
- `Filing.search()` / `Filing.grep()` returned nothing on pre-2002 plain-text filings — `Filing.search()` raised `AssertionError` and `Filing.grep()` returned 0 matches on plain-text filings (e.g. PCG's 1999 10-K). Both relied on attachment iteration that finds nothing because SGML decomposition emits empty shells for text-only filings. `sections()` now falls back to chunking `filing.text()` on `<PAGE>` markers or blank lines when `html()` is None, and `grep()` falls back to `filing.text()` when no attachment yields usable text. (#819)
- TOC analyzer fabricated phantom Items on 10-Q filings — `TOCAnalyzer` had three 10-K-shaped heuristics that fired regardless of form: it accepted any bare number 1–15 as an item identifier in preceding-`<td>` siblings (so a page-number cell like `<td>8</td>` became "Item 8"); it mapped any "financial statements" link to "Item 8" (correct for 10-K, wrong for 10-Q where Financial Statements is Part I, Item 1); and it sorted using a 10-K-shaped section-order table. All three heuristics are now form-guarded. (#827, contributor @HonzaCuhel)
- `SearchResults` panel labels conflated BM25 rank with section index — `SearchResults.__rich__` used the enumeration rank of the sorted display as the panel title, so the same numeric label meant different things in the BM25 and regex paths (BM25 sorts by score, regex preserves original order). "0" in BM25 output was the top-scoring section while "0" in regex output was the first section that matched, and the two were rarely the same. Panels now display `DocSection.loc` — the section's index in `filing.sections()` — consistently across search methods, so callers can index back into the corpus regardless of search mode. (#765)
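The `StreamingParser` fix is easier to see in miniature. The sketch below uses the standard library's `xml.etree.ElementTree.iterparse` as a stand-in for the real parser, and the `extract_paragraphs` function, the `STRUCTURAL_TAGS` set, and the local `content_depth` counter are hypothetical names that only mirror the idea of the `_content_depth` gate. With `gate_clearing=False` it clears on every event, as the old loop did; with `gate_clearing=True` it clears only on `end` events once no structural element is still open.

```python
import io
import xml.etree.ElementTree as ET

STRUCTURAL_TAGS = {"p", "h1", "h2", "h3", "h4", "h5", "h6", "section"}
DOC = b"<doc><p><span>Hello </span><span>world</span></p><p><span>Second</span></p></doc>"


def extract_paragraphs(xml_bytes: bytes, gate_clearing: bool) -> list:
    paragraphs = []
    content_depth = 0  # number of structural elements currently open
    events = ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end"))
    for event, elem in events:
        structural = elem.tag in STRUCTURAL_TAGS
        if event == "start":
            if structural:
                content_depth += 1
            if not gate_clearing:
                elem.clear()  # old behaviour: clear on start events too
            continue
        if structural:
            content_depth -= 1
            # Read the paragraph text from its children before anything is freed.
            paragraphs.append("".join(elem.itertext()))
        if not gate_clearing or content_depth == 0:
            elem.clear()  # gated: only free memory outside open structural elements
    return paragraphs


print(extract_paragraphs(DOC, gate_clearing=False))
print(extract_paragraphs(DOC, gate_clearing=True))
```

In the ungated run each `<span>` is cleared before its enclosing `<p>` is read, so the paragraphs come back empty; the gated run keeps the span text until the `<p>` closes.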
Documentation
- `calculation_linkbase()` and `Statement.extension_arcs()` documented alongside Phase 1 and Phase 2 of the GH #766 implementation, including the difference from presentation linkbase and worked examples on real filings. (#766, Phase 3)
Contributors: @HonzaCuhel, @kevinchiu, @0ywfe
Full changelog: v5.31.5...v5.32.0
v5.31.5
Fixed
- `xbrl.facts.to_dataframe()` mislabeled Q2/Q3 as Q3/Q4 for 52/53-week fiscal-year filers (JNJ, PFE, AAPL, COST) — the XBRL instance parser's `_quarter_for_date` classified the fiscal quarter from the raw calendar month of the period end. 52/53-week issuers pin quarter ends to a weekday near the calendar quarter boundary, so the `period_end` can drift into the first days of the following month — JNJ Q2 2023 ended 2023-07-02, Q3 2023 ended 2023-10-01 — bucketing those facts into the next quarter. The EntityFacts layer already handled this via `calculate_fiscal_year_for_label`, but the XBRL parser has an independent fiscal classification path feeding `xbrl.facts.to_dataframe()` and `query().by_fiscal_period(...)`, silently misclassifying quarterly data for any RAG / analytics pipeline reading raw facts. End dates in the first 7 days of a month are now treated as belonging to the previous month for quarter classification; the 7-day window covers max drift for Sunday-nearest (≤3 days), Saturday-nearest (≤1 day), and last-Sat/Sun (no drift) patterns with safety margin. (#816, reporter @kmatosli)
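The 7-day rule can be illustrated with a small date helper. This is a sketch under stated assumptions and not the parser's `_quarter_for_date`: the function name `quarter_for_period_end` and its `fiscal_year_end_month` and `drift_days` parameters are hypothetical, and it assumes fiscal quarters are three calendar months each.

```python
from datetime import date, timedelta


def quarter_for_period_end(period_end: date,
                           fiscal_year_end_month: int = 12,
                           drift_days: int = 7) -> int:
    """Return the fiscal quarter (1-4) for a period end date."""
    # An end date in the first `drift_days` days of a month is treated as
    # belonging to the previous month (52/53-week calendar drift).
    if period_end.day <= drift_days:
        period_end = period_end.replace(day=1) - timedelta(days=1)
    months_into_year = (period_end.month - fiscal_year_end_month - 1) % 12
    return months_into_year // 3 + 1


# JNJ period ends quoted in the note above.
q2_end = date(2023, 7, 2)
q3_end = date(2023, 10, 1)
print(quarter_for_period_end(q2_end, drift_days=0))  # raw calendar month
print(quarter_for_period_end(q2_end))                # with the 7-day window
print(quarter_for_period_end(q3_end))
```

Setting `drift_days=0` reproduces the raw-calendar-month behaviour, which buckets the 2023-07-02 period end one quarter too late.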
Full Changelog: v5.31.4...v5.31.5
v5.31.4
Fixed
- Empty income statement on 16-week-quarter filers (CAVA, RRGB) — quarterly period selection bucketed durations as 80-100 days or 150-285 days, leaving CAVA's 111-day Q1 in a dead zone. The selector now anchors on `filing.period_of_report`. (#822, reporter @mkdeak)
- `TenK.business` silently returned Part II MD&A content on GS's 2025 10-K — the cross-Part lookup happily returned a mislabeled `part_ii_item_1` key. Item lookup is now constrained to the SEC-canonical Part per item. (#821, reporter @FlorinAndrei)
- Viewer `ConceptRow.numeric_value` returned wrong values on ADI 2019 and ADSK 2019 10-Ks — `primary_period` is now form-aware (annual forms prefer the longest "X Months Ended" duration), and `class="th"` spacer cells are dropped from body rows so column positions align. (#818, reporter @mpreiss9)
- `Filing.search()` raised a bare `AssertionError` on pre-2001 SGML/text filings — replaced with a descriptive `ValueError` pointing users at `filing.text()`. (#819, reporter @shenker)
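The dead zone in the 16-week-quarter fix is simple arithmetic: a 111-day duration sits in neither bucket. The sketch below contrasts fixed buckets with selection anchored on the report date. The bucket labels, the function names, and the example dates are all illustrative assumptions (the dates are not CAVA's actual periods), and picking the shortest duration that ends on the report date is only one plausible reading of "anchors on `filing.period_of_report`".

```python
from datetime import date


def legacy_bucket(days: int):
    """Fixed-width buckets as described in the note; the labels are assumed."""
    if 80 <= days <= 100:
        return "quarter"
    if 150 <= days <= 285:
        return "year_to_date"
    return None  # dead zone


def duration_days(start: date, end: date) -> int:
    return (end - start).days + 1  # inclusive of both endpoints


def select_current_period(durations, period_of_report: date):
    """Pick the shortest duration that ends on the report date, if any."""
    ending_on_report = [d for d in durations if d[1] == period_of_report]
    if not ending_on_report:
        return None
    return min(ending_on_report, key=lambda d: duration_days(*d))


current_q1 = (date(2024, 1, 1), date(2024, 4, 20))   # 111 days, illustrative
prior_q1 = (date(2023, 1, 2), date(2023, 4, 16))     # comparative column
print(duration_days(*current_q1), legacy_bucket(duration_days(*current_q1)))
print(select_current_period([prior_q1, current_q1], date(2024, 4, 20)))
```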
Added
- 10-K section patterns for Item 1B (Unresolved Staff Comments) and Item 1C (Cybersecurity) — closes the gap left when the same SEC rulemaking added 8-K Item 1.05 and 20-F Item 16K patterns. (#813, contributor @HonzaCuhel)
v5.31.3
Fixed
- `viewer.financial_statements` returned wrong income statement for filings with multi-row period headers (e.g. ADI 2019 10-K mislabeled annual columns as quarterly). The R*.htm header parser was rewritten to walk `<thead>` row by row and filter footnote markers. Affected most 10-K/10-Q filings silently. (#812, reporter @mpreiss9)
- `Financials.get_net_income()` returned wrong value (often wrong sign) for filers reporting a net loss with a separate noncontrolling-interest line — for Micron Q2 2013 it returned +$2M (the NCI row) instead of -$286M. Also fixes IFRS 20-F filers whose row label isn't "Net income" (e.g. Barclays "Profit after tax"). Concept lookup is now exact and IFRS-aware. (#814, reporter @wei-jianlin)
v5.31.2
Fixed
- `FundReport.options_data()` crashed with `TypeError: bad operand type for abs(): 'NoneType'` on N-PORT filings whose nested forwards had null USD amounts — `edgar/funds/reports.py:1011-1012` cast `fwd.amount_sold`/`fwd.amount_purchased` through `abs()` when the corresponding `currency_*` field equalled `'USD'`, but valid N-PORT XBRL can pair a stated USD currency with a null amount — every option-on-forward in such a filing tripped the crash before any data was returned. The documented public API was effectively unusable for any fund whose options referenced such a forward (reproducer: GOF NPORT-P). Both assignments now guard on `amount_* is not None`; the exchange-rate calculation just below was already safe via Python's short-circuiting. Defensive grep across the file confirmed lines 1011-1012 were the only unguarded `abs()` calls. (#811, reporter @HristoRaykov)
- `viewer.concept_rows[i].numeric_value` silently returned a prior-year value when the primary reporting period had no fact for the row — `ConceptRow.numeric_value` (and the sibling `Concept.value` accessor on the concept graph) returned `parse_numeric(next(iter(self.values.values())))` — the first entry of the values dict, which was populated only for periods that had a non-empty cell. When the primary (leftmost) reporting period had no value, the singular accessor silently returned whichever period happened to be first in the dict, masking missing-period as a prior-year value. Most visible on the ABT 2019 10-K income statement: `concept_rows[16]` (`us-gaap_IncomeLossFromDiscontinuedOperationsNetOfTax`) returned `34.0` (the 2018 value) because ABT had no 2019 discontinued-ops fact. The fix tracks `primary_period` on `ConceptRow` (populated by the R*.htm parser from `period_headers[0]`) and resolves `numeric_value` against it explicitly, returning `None` when the primary period has no value. `Concept.value` in `concept_graph.py` got the same fix — same antipattern, same underlying row data, user-visible via the concept graph's Rich/text rendering. (#810, reporter @mpreiss9)
- `FundFeeNotice` crashed with `AttributeError: 'list' object has no attribute 'get'` on per-class 24F-2NT filings — `xmltodict`-style parsing returns repeated `annualFilingInfo` blocks as a list, but every typed accessor (`fund_name`, `series`, `aggregate_sales`, etc.) called `.get()` on the result. ~2% of recent 24F-2NT filings — including all five BNY Mellon family filings — file one block per share class, so the first call into the data object raised before any data was returned. The data model now iterates every `annualFilingInfo` block: typed financial properties (`aggregate_sales`, `net_sales`, `redemptions_current_year`, `registration_fee`, `total_due`, …) sum across blocks; metadata properties (`fund_name`, `fiscal_year_end`, `investment_company_act_file_number`) read from `block[0]` (identical across blocks); `series` deduplicates by `seriesId`. A new `FundClassFee` dataclass + `is_per_class` flag + `class_fees` list expose the per-share-class breakdown. The `_parse_float` helper now also handles accounting-parens notation `(NNN)` → `-NNN`, which appears in `redemptionCreditsAvailableForUseInFutureYears`. Backwards-compatible: every existing property keeps the same return shape; the fund total invariant `aggregate_sales == sum(cf.aggregate_sales for cf in class_fees)` is verified against BNY Mellon Research Growth Fund. (edgartools-8ohs)
- `viewer.concept_report.currency_scaling` returned wrong scales for filers using non-Apple header formats — `ConceptReport.currency_scaling` was derived from a narrow text match on the R*.htm `<th class='tl'>` header (`$ in millions` / `$in millions`). Filers using `In Millions`, `(in millions)`, `USD ($) in Millions`, or `Dollars in Millions` silently fell through to the default of `1`, producing scaling that disagreed across statements within a single filing (ALGN balance sheet vs income statement) and wrong values for whole multi-year ranges (ABNB showing `1` for 2023/2024 when the actual scale is millions). `ViewerReport.currency_scaling` now derives the scale from the XBRL `decimals` attribute on monetary facts mapped to the report's role in the presentation linkbase — filer-mandated and uniform (`-6` → millions, `-3` → thousands, `0` → units). The text-match value is retained as a fallback when XBRL is unavailable. The resolved scale is mirrored back onto `ConceptReport.currency_scaling` so existing code reading it via the concept-report path also benefits. Same precedent as GH #799 (level enrichment from XBRL). (#807, reporter @mpreiss9)
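The `numeric_value` bug above is a general pitfall of taking the first entry of a sparsely populated dict. The sketch below reproduces it with plain dictionaries; `first_available_value` and `primary_period_value` are hypothetical helpers and not the `ConceptRow` API, and the period labels are simplified to bare years.

```python
from typing import Dict, Optional


def first_available_value(values: Dict[str, str]) -> Optional[float]:
    """Old pattern: whichever period happens to be first in the dict."""
    return float(next(iter(values.values()))) if values else None


def primary_period_value(values: Dict[str, str], primary_period: str) -> Optional[float]:
    """Resolve against the primary period explicitly; None when it has no fact."""
    raw = values.get(primary_period)
    return float(raw) if raw is not None else None


# Only periods with a non-empty cell are present; 2019 (the primary period) is missing.
row_values = {"2018": "34"}
print(first_available_value(row_values))          # the prior-year value leaks through
print(primary_period_value(row_values, "2019"))   # missing stays missing
```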
v5.31.1
Fixed
- Schedule 13D/13G silently dropped CUSIPs with the new `<issuerCusips>` wrapper — SEC began wrapping `<issuerCusipNumber>` inside an `<issuerCusips>` container element on some Schedule 13D/13G filings (e.g. CIK 1906837 13D, CIK 1425851 13G). The parser's BS4 `recursive=False` lookup at the top level only matched the flat layout, so `subject_company.cusip` came back as `''` whenever the wrapper was present. Parsing now falls back to a recursive lookup when the flat probe misses, handling both wire formats. (#802, PR #803 by @HristoRaykov)
- Schedule 13D/13G event-date attribute name mismatch — `Schedule13D` exposed the triggering-event date as `date_of_event` while `Schedule13G` exposed it as `event_date`, breaking duck-typing across a mixed list of 13D/13G filings and forcing callers to use `getattr`/`hasattr`. Both classes now accept either name; the underlying attribute is unchanged, so existing code keeps working. (#804, PR #805 by @0ywfe)
- Spurious `DocumentTooLargeError` from `StreamingParser` on legitimate documents — The streaming HTML parser accumulated `len(etree.tostring(elem))` on every lxml `iterparse` `end` event. Because `tostring` serializes the full subtree and `end` fires for every closing tag, nested elements were counted multiple times — large nested HTML could trip `max_document_size` even though the source document was under the limit. The per-event accumulator is also redundant: `HTMLParser._parse` already validates `len(html.encode("utf-8"))` against `max_document_size` before invoking streaming mode. The accumulator and its state are removed; size is now checked once at the top of `StreamingParser.parse()` and the same encoded bytes are reused for `iterparse`. (#806 by @kevinchiu)
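The over-counting behind the spurious `DocumentTooLargeError` is easy to reproduce. The sketch below uses the standard library's `xml.etree.ElementTree` in place of lxml, and `accumulated_size` is a hypothetical helper that mimics the removed per-event accumulator: every ancestor re-serializes its descendants, so the running total exceeds the real document size.

```python
import io
import xml.etree.ElementTree as ET

HTML = b"<html><body><div><p>alpha</p><p>beta</p></div></body></html>"


def accumulated_size(doc: bytes) -> int:
    """Sum the serialized size of every element as its end event fires."""
    total = 0
    for _event, elem in ET.iterparse(io.BytesIO(doc), events=("end",)):
        # tostring() serializes the whole subtree, so nested content is
        # counted once per enclosing element.
        total += len(ET.tostring(elem))
    return total


print(len(HTML), accumulated_size(HTML))
```

Checking `len(doc)` once up front, as the note describes, avoids the problem entirely.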
Full Changelog: v5.31.0...v5.31.1
v5.31.0
Added
- `include_quarterly` parameter on stitched XBRLS statements — `XBRLS.from_filings()` previously emitted a single column per filing, preferring YTD/annual over the discrete-quarter period when both existed in the source XBRL (Issue #475 design). This created a parity gap with single-filing `XBRL`, which surfaces both. The new opt-in `include_quarterly` parameter (default `False`) on `XBRLS.get_statement()`, `StitchedStatement`, and `statements.income_statement()` / `cashflow_statement()`, when set to `True`, causes each 10-Q to contribute both a 90-day discrete column and the YTD column, and each 10-K to contribute both an annual column and its embedded Q4 column. Distinct from `discrete_quarters` (v5.30.3), which derives quarterly cash-flow values by subtraction; this surfaces facts already in the filing. Default behavior is preserved. Has no effect on Balance Sheet (instant periods only). (#780, reporter @AhmedShaker12)
Fixed
- `viewer.financial_statements` silently dropped income statements miscategorized in `FilingSummary.xml` — AbbVie's 2021 10-K placed `Consolidated Statements of Earnings` under `MenuCategory='Uncategorized'` instead of `'Statements'` — a filer mistake that EdgarTools faithfully reflected, so the income statement disappeared from `viewer.financial_statements` while comparable 2019/2020/2022-2025 filings worked fine. The viewer now returns the union of FilingSummary `MenuCategory='Statements'` and MetaLinks `groupType='statement'`, deduplicated by HTML filename, in filing-position order. MetaLinks reflects XBRL taxonomy classification and is more reliable than filer-provided menu metadata. (#797, reporter @mpreiss9)
- `viewer.concept_rows[*].level` always returned 0 — Modern SEC R*.htm files don't encode hierarchy in the rendered HTML — empirically verified across 10 diverse 2025 10-Ks (AAPL, ABT, JPM, WMT, XOM, VZ, MSFT, GS, PFE, BRK.B): zero `plN` class tokens on primary statements, almost no `padding-left` styles, no row nesting. The canonical source is the XBRL presentation linkbase, which the existing parser already loads as `xbrl.presentation_trees[role].all_nodes[concept_id].depth`. The viewer now lazy-loads the parsed XBRL on first `concept_rows` access and populates `ConceptRow.level` from the presentation tree, normalized so the smallest depth observed in a report becomes 0. For the issue's canary case (ABT balance sheet) the level distribution went from `{0: 45}` to `{0: 15, 1: 26, 2: 4}`. (#799, reporter @mpreiss9, investigation by @tjhub1983)
- `XBRLS.from_filings(list, filter_amendments=True)` crashed with `AttributeError` — The signature accepts `Union[Filings, List[Filing]]` and defaults `filter_amendments=True`, but the implementation called `filings.filter()` unconditionally — raising `AttributeError: 'list' object has no attribute 'filter'` whenever a plain list was passed. The implementation now branches on whether the input has a `.filter` method; for plain lists it falls back to a form-suffix check that drops forms ending in `/A`. (edgartools-6k96)
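The branching described in the `from_filings` fix follows a common duck-typing pattern, sketched here with stand-in classes. `StubFiling`, `StubFilings`, the `filter(amendments=...)` signature, and `without_amendments` are all hypothetical and exist only so the example runs on its own; they are not the edgartools types.

```python
from dataclasses import dataclass


@dataclass
class StubFiling:
    form: str


class StubFilings:
    """Stand-in for a collection type that offers its own filter method."""

    def __init__(self, filings):
        self._filings = list(filings)

    def filter(self, amendments: bool = True) -> "StubFilings":
        kept = [f for f in self._filings if amendments or not f.form.endswith("/A")]
        return StubFilings(kept)

    def __iter__(self):
        return iter(self._filings)


def without_amendments(filings) -> list:
    # Use the collection's own filter when it has one ...
    if hasattr(filings, "filter"):
        return list(filings.filter(amendments=False))
    # ... otherwise fall back to a form-suffix check on a plain list.
    return [f for f in filings if not f.form.endswith("/A")]


plain = [StubFiling("10-K"), StubFiling("10-K/A"), StubFiling("10-Q")]
print([f.form for f in without_amendments(plain)])
print([f.form for f in without_amendments(StubFilings(plain))])
```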
Full Changelog: v5.30.3...v5.31.0