feat: save documentation info to SQLite database#347
hargoniX merged 112 commits into leanprover:main
Conversation
This PR adds a SQLite database that contains all of the documentation info.
!bench

Benchmark results for 7f88ca7 against 837f89a are in! @david-christiansen No significant changes detected.
Significance detection isn't running yet because there isn't enough data. The result for this version isn't great, so more work is needed:
Experiment to see if the slowdown is due to database file contention.
!bench

Benchmark results for 9272ece against 837f89a are in! @david-christiansen No significant changes detected.

!bench

Benchmark results for 418fa51 against 837f89a are in! @david-christiansen No significant changes detected.

!bench

Benchmark results for 76667ea against 837f89a are in! @david-christiansen No significant changes detected.

!bench

Benchmark results for e429f8e against 837f89a are in! @david-christiansen No significant changes detected.

!bench

Benchmark results for e1d2a8b against 837f89a are in! @david-christiansen No significant changes detected.
As of e429f8e, it's:
It seems that the Mathlib cache was only getting partial values.

With the original PR code, it's comparable:
Extensions are not presently handled, but the fallback data are saved.
This is the first step towards rendering HTML from the DB instead of directly. The serializable version of CodeWithInfos used here can be saved in the DB. The generated HTML is the same, modulo commit hashes and external URLs.
!bench

Benchmark results for ec4cf3e against 837f89a are in! @david-christiansen No significant changes detected.
This is preliminary to generating HTML from the database. The output is still unchanged, modulo commit hashes and source URLs.
!bench

Benchmark results for 9489cfd against 837f89a are in! @david-christiansen No significant changes detected.
The scripts indicate that the output is the same, modulo minor differences in automatic linking.
!bench

Benchmark results for c08bf78 against 13ecfbb are in! @david-christiansen No significant changes detected.
It's only used from a single thread today, but I'd hate to find out the hard way that it isn't.
!bench

Benchmark results for beb37c1 against 13ecfbb are in! @david-christiansen No significant changes detected.

!bench

Benchmark results for 06dfc9d against 13ecfbb are in! @david-christiansen No significant changes detected.
```lean
def escape (s : String) : String := Id.run do
  let mut out := ""
  let mut i := s.startPos
  let mut j := s.startPos
  while h : j ≠ s.endPos do
    let c := j.get h
    if let some esc := subst c then
      out := out ++ s.extract i j ++ esc
      j := j.next h
      i := j
    else
      j := j.next h
  if i = s.startPos then s -- no escaping needed, return original
  else out ++ s.extract i j
where
  subst : Char → Option String
  | '&' => some "&amp;"
  | '<' => some "&lt;"
  | '>' => some "&gt;"
  | '"' => some "&quot;"
  | _ => none
```
The old implementation actually showed up in profiles; this isn't just an exercise in code-fancying.
```lean
/-- Direct and transitive dependencies.

Loosely inspired by bazel's [depset](https://bazel.build/rules/lib/builtins/depset). -/
abbrev DepSet (α) [Hashable α] [BEq α] := Array α × OrdHashSet α
```
Did you check that #286 still works after deleting this?
Assuming that the build.yml added in that PR tests it effectively, then it still works. Is there another way to double-check? The output when running that command locally is very similar to what was on the PR, but there's not quite enough context there to know what it did before the PR or how to check it more thoroughly.
The docs facet is still returning the list of generated files (without duplicates), so I believe the relevant behavior is preserved.
```diff
@@ -23,8 +23,11 @@ require «UnicodeBasic» from git
 require Cli from git
```
@tydeu I would like to get your opinion on the lakefile. I don't trust myself enough with lake trickery to say it's correct for sure.
```shell
TAR_ARGS=(doc)
if [ -f .lake/build/api-docs.db ]; then
  # Compact the DB into a single portable file (removes WAL/journal dependency)
  sqlite3 .lake/build/api-docs.db "VACUUM"
```
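For reference, the same compaction can be driven from Python's stdlib `sqlite3` module; the table here is invented, and only the VACUUM step mirrors the script above:

```python
import os
import sqlite3
import tempfile

# Create a throwaway database in WAL mode, as a build might leave it.
path = os.path.join(tempfile.mkdtemp(), "api-docs.db")
conn = sqlite3.connect(path)
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("CREATE TABLE docs (name TEXT PRIMARY KEY, html TEXT)")
conn.execute("INSERT INTO docs VALUES ('Nat.add', '<p>addition</p>')")
conn.commit()

# VACUUM rewrites the database into a single compact file and
# checkpoints the WAL, so only api-docs.db needs to be shipped.
conn.execute("VACUUM")
conn.close()

print(os.path.exists(path))          # True
print(os.path.exists(path + "-wal")) # False: side file gone after close
```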
How big is the database for Mathlib? Is this a metric we should track, if only to avoid blowing up the amount of cache people have to use if they decide to cache the DB?
It's 717MB. With zstd compression, it's 96MB.
I'll rig up the benchmark to track both, just in case.
OK, the latest benchmark run includes the sizes of the DB for both doc-gen itself and Mathlib. It's not posted here by the bot, but clicking through shows it.
```lean
private def isAutoGeneratedSuffix (s : String) : Bool :=
  s == "rec" || s == "recOn" || s == "casesOn" || s == "noConfusion" ||
  s == "noConfusionType" || s == "below" || s == "brecOn"
```
Would be great if we could consolidate this at some point :/
I agree.
!bench

Benchmark results for 91b4138 against 13ecfbb are in! @david-christiansen No significant changes detected.
Co-authored-by: Mac Malone <tydeu@hatpress.net>
!bench

Benchmark results for bb0b496 against 13ecfbb are in! @david-christiansen No significant changes detected.
Fixes a bug in leanprover#347 that caused multi-library builds to skip generating HTML for some libraries. The issue was that the Lake setup used declaration-data.bmp as its build target. Multiple-library builds would populate the database, but then only the one that won the race would generate its declaration-data.bmp. The others would incorrectly see that it existed and generate no HTML. Now, the same "marker file" approach is used as the database content to indicate that HTML for a given module, library, or package is up to date. This gives them the right traces, and allows declaration-data.bmp to be updated as needed. A regression test is also included.
* fix: don't skip docs in multi-library situations
* fix: DB contention without timeout
* chore: bump leansqlite and update calls. Rather than sending pragmas in strings, it's nicer to use a higher-level API.
* Revert "chore: bump leansqlite and update calls". This reverts commit b95beb5. It will be sent in a separate PR.
This PR adds a SQLite database between doc-gen4's per-module analysis and HTML generation. Lake runs one `single` command per module (and `genCore` for the core Lean modules `Init`, `Std`, `Lake`, and `Lean`), each of which writes to a shared SQLite database. If the database already exists, `single` incrementally updates it by deleting the module's old rows and reinserting. Then the `fromDb` command reads everything back and generates HTML in parallel. The old pipeline, which generated HTML directly during module analysis, is removed.
The database makes documentation data available to other tools. Verso can query it for docstrings
instead of consulting the Lean environment, and future tools can use it in ways we haven't
anticipated yet. It also means that HTML generation has access to the full set of declarations
across all modules, which improves the heuristic insertion of links in code.
The Database
Within a module, each item (declaration, module doc, constructor, structure field) is assigned a sequential position starting from 0. The composite key `(module_name, position)` is the primary key for most tables. Constructors and structure fields are interleaved between their parent declarations, so positions are not contiguous across top-level members. HTML generation reconstructs module members by querying `name_info` and `module_docs_Markdown`, ordered by position.

The database schema is versioned by two hashes: a DDL hash that detects changes to table definitions, and a type hash that detects changes to Lean types serialized as blobs (like `RenderedCode` and `RenderedCode.Tag`). If either hash doesn't match, the database is rejected with an error message asking the user to rebuild. The type hash is computed at compile time from a string representation of the relevant inductive types, so adding a constructor to `RenderedCode.Tag` will invalidate old databases automatically.
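To make the layout concrete, here is a small Python sketch using the stdlib `sqlite3` module. The `name_info` table name and the `(module_name, position)` key come from this description; the columns, the `meta` table, and the hashing scheme are invented for illustration:

```python
import hashlib
import sqlite3

DDL = """
CREATE TABLE name_info (
  module_name TEXT NOT NULL,
  position    INTEGER NOT NULL,
  name        TEXT NOT NULL,
  PRIMARY KEY (module_name, position)
);
CREATE TABLE meta (key TEXT PRIMARY KEY, value TEXT);
"""

# A DDL hash in the spirit described above: any change to the table
# definitions produces a different hash, invalidating old databases.
ddl_hash = hashlib.sha256(DDL.encode()).hexdigest()

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
conn.execute("INSERT INTO meta VALUES ('ddl_hash', ?)", (ddl_hash,))

# Positions are sequential per module; constructors are interleaved
# between their parent declarations.
rows = [("Init.Data.Nat", 0, "Nat"),
        ("Init.Data.Nat", 1, "Nat.zero"),   # constructor, interleaved
        ("Init.Data.Nat", 2, "Nat.succ"),
        ("Init.Data.Nat", 3, "Nat.add")]
conn.executemany("INSERT INTO name_info VALUES (?, ?, ?)", rows)

# Reject a database whose stored hash doesn't match this build's DDL.
stored, = conn.execute(
    "SELECT value FROM meta WHERE key = 'ddl_hash'").fetchone()
if stored != ddl_hash:
    raise RuntimeError("database schema mismatch; please rebuild")

# HTML generation reads a module's members back ordered by position.
members = [n for (n,) in conn.execute(
    "SELECT name FROM name_info WHERE module_name = ? ORDER BY position",
    ("Init.Data.Nat",))]
print(members)  # ['Nat', 'Nat.zero', 'Nat.succ', 'Nat.add']
```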
RenderedCode
Lean's pretty printer produces `CodeWithInfos` (a `TaggedText SubexprInfo`), which carries expression types, universe levels, elaboration state, and other metadata that is too large to serialize. `RenderedCode` is a `TaggedText RenderedCode.Tag` that keeps only what is needed for HTML rendering: which tokens are declaration references (for linking), which are sorts (for linking to the foundational types page), and which are keywords or strings (for syntax highlighting). The conversion from `CodeWithInfos` to `RenderedCode` is lossy and not reversible. `RenderedCode` is serialized to the database as a binary blob.
Verso Docstrings
Verso docstrings contain a tree of `Doc.Block` / `Doc.Inline` nodes with extension points (`ElabInline` / `ElabBlock`) that hold opaque `Dynamic` values identified by `Name`. Different Lean packages can register their own extension types, so there is no way to know all possible types at compile time. The serialization uses a registry of handlers (`DocstringValues`) keyed by name. If a handler exists for a given extension type, the payload is serialized with it. If not, only the name is stored, and on deserialization the unknown extension is replaced with a sentinel value. In Verso docstrings, the content underneath one of these extension types represents an alternative, simpler rendering (e.g. plain text instead of highlighted code). This means the database remains readable even if extension types are added or removed between versions.

`builtinDocstringValues` includes handlers for the extension types that ship with Lean. A future PR will add a plugin system for registering additional handlers so authors of docstring extensions can control their serialization and their rendering to HTML.
HTML Generation and Link Resolution
HTML is generated in parallel, with around 20 tasks that each have a database connection. This
number was chosen through experimentation.
When converting `RenderedCode` to HTML, the code needs to turn declaration names into links. For names that appear directly in the global name index, this is straightforward. For names that don't appear directly (private names, auto-generated auxiliary names like match discriminants and proof terms), the code tries several heuristics: resolving private names to user-facing names, stripping trailing auxiliary components to find a linkable parent, and falling back to a link to the module page. The details and examples are documented in `renderedCodeToHtmlAux` in `DocGen4.Output.Base`.
Future Possibilities
Here are some useful directions that are made possible by this PR, but not implemented in it.
HTML Simplification
Today, things like the instances list are generated at run time via JavaScript because HTML files don't have a global view of the project. With this new model, the HTML can be straightforwardly generated. To ease comparison with the existing code, that feature is not implemented in this PR.
Plugin API
Having documentation in a database solves an immediate need in Verso brought about by the module system: it's no longer possible to extract docstrings from environments now that they're in the server olean, but a SQLite file provides easy retrieval of them.

Additionally, there are more and more use cases for extending doc-gen4:

These plugins need to be used at multiple points in doc-gen4:

A plugin is a structure that has a field for each of these interpolation points. Instead of having a fixed `Main.lean`, a custom Lake target discovers the registered plugins in all packages in the workspace and generates a `Main` that includes calls to them.

The database model makes this much easier to implement. Plugins can add their own tables to the DB and write to them in the analysis phase, and they can be invoked with a DB handle at various points during HTML generation.
Incrementality and Caching
It would be possible to build each package's documentation DB separately. For instance, the core docs DB and Mathlib DBs could be part of the Mathlib cache. Then, at HTML generation time, the HTML generation process could `ATTACH` each package's documentation database and generate HTML for all of them at once, rather than re-analyzing all of Mathlib.
Validation
To check that the database-generated HTML matches the old pipeline's output, there is a Python comparison script (`scripts/check_diff_soup.py`) that was used extensively during development and will not be present in the squashed commit history, so it is described here in some detail.

A simple bit-for-bit comparison (even after normalizing with `tidy`) is not sufficient because the database-generated HTML is intentionally different in some ways. It has more links than the old output because it can resolve names across all modules rather than just within the transitive imports of the module being processed. It also adds `id` attributes to inherited structure fields, wraps extends-clause parents in `<span>` elements for linking, deduplicates import lists, and drops empty equations sections. These are all improvements, but they mean that a naïve diff would report thousands of false positives.
The script parses both HTML trees and walks them in parallel, matching elements by position within
their parent. It can distinguish between a changed attribute and a removed element, which makes it
precise enough to enforce specific rules about which differences are acceptable.
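As a sketch of position-wise matching (Python stdlib; far simpler than the real script), two trees are walked in parallel and children are paired by index within their parent, which lets the diff name what changed rather than just that something did:

```python
from html.parser import HTMLParser

class TreeBuilder(HTMLParser):
    """Build a simple nested-dict tree from HTML."""
    def __init__(self):
        super().__init__()
        self.root = {"tag": None, "attrs": {}, "children": []}
        self.stack = [self.root]
    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "attrs": dict(attrs), "children": []}
        self.stack[-1]["children"].append(node)
        self.stack.append(node)
    def handle_endtag(self, tag):
        self.stack.pop()

def parse(html):
    b = TreeBuilder()
    b.feed(html)
    return b.root

def diff(old, new, path="", out=None):
    out = [] if out is None else out
    if old["tag"] != new["tag"]:
        out.append((path, "element changed"))
    elif old["attrs"] != new["attrs"]:
        out.append((path, "attributes changed"))  # distinguishable case
    # Match children by position within their parent.
    for i, (o, n) in enumerate(zip(old["children"], new["children"])):
        diff(o, n, "%s/%s[%d]" % (path, n["tag"], i), out)
    return out

old = parse('<div><a href="A.html">x</a></div>')
new = parse('<div><a href="B.html">x</a></div>')
print(diff(old, new))  # [('/div[0]/a[0]', 'attributes changed')]
```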
The rules are declarative. Each rule is a function that receives a `DiffContext` (containing the old and new elements, their ancestor chains, and pre-collected sets of valid link targets from both directories) and returns a reason string if the difference is acceptable, or `None` if it is not.

The rules cover the following cases:

* Links may be wrapped in `<span class="fn">`, or have their `href` changed.
* A `<span class="fn">` may be replaced by `<a>` if the new link target is valid, since the database version can resolve names that the old pipeline could not.
* `<a>` elements may be added if their targets are valid.
* `href` attributes pointing to private names (`_private.` in the URL) may change if the new target is valid, since the new version resolves to the public declaration.
* `href` attributes may change as long as the anchor fragment is the same and the new target is valid, covering cases where a name resolves to a different module.
* `file:///` hrefs to `.lean` source files may differ in their temp directory prefix.
* `<li>` elements in import lists may be deduplicated.
* Declarations may appear in a different order, because the old pipeline's ordering was nondeterministic for declarations that share a position. This rule uses the SQLite database itself (via the `--db` flag) to look up source positions.
* Elements may gain `id` attributes, since the new version adds them as link targets.
* The new version may add `<span id="...">` elements to create link targets for parent projection names.
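A minimal Python sketch of this rule shape. The name `DiffContext` comes from the script; the fields, rule names, and URL conventions here are simplified and partly invented:

```python
from dataclasses import dataclass, field

@dataclass
class DiffContext:
    old_attrs: dict
    new_attrs: dict
    valid_targets: set = field(default_factory=set)

def allow_private_href(ctx):
    """Private-name hrefs may change if the new target is valid."""
    old = ctx.old_attrs.get("href", "")
    new = ctx.new_attrs.get("href", "")
    if "_private." in old and new in ctx.valid_targets:
        return "private name resolved to public declaration"
    return None

def allow_same_fragment(ctx):
    """Hrefs may change if the anchor fragment is unchanged and valid."""
    old = ctx.old_attrs.get("href", "")
    new = ctx.new_attrs.get("href", "")
    if ("#" in old and "#" in new
            and old.split("#")[1] == new.split("#")[1]
            and new in ctx.valid_targets):
        return "same anchor, different module"
    return None

RULES = [allow_private_href, allow_same_fragment]

def accept(ctx):
    """First rule to return a reason accepts the difference."""
    for rule in RULES:
        reason = rule(ctx)
        if reason is not None:
            return reason
    return None  # rejected: reported as a real difference

ctx = DiffContext({"href": "M1.html#Nat.add"},
                  {"href": "M2.html#Nat.add"},
                  valid_targets={"M2.html#Nat.add"})
print(accept(ctx))  # same anchor, different module
```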
For `<code>` elements (declaration signatures and types), the tool uses a separate comparison algorithm that walks through children of both elements simultaneously while matching text content, tracking wrapper elements (`<a>` and `<span>`) as it goes. This is more precise than the general tree-walking comparison because it can verify that the underlying text content is identical even when the wrapping structure differs.
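The core idea can be sketched like this in Python (stdlib only; the real algorithm also matches positions token by token rather than comparing whole strings):

```python
from html.parser import HTMLParser

class TextAndWrappers(HTMLParser):
    """Collect text content and the wrapper elements seen along the way."""
    def __init__(self):
        super().__init__()
        self.text = []
        self.wrappers = []
    def handle_starttag(self, tag, attrs):
        if tag in ("a", "span"):
            self.wrappers.append(tag)  # track wrappers as we go
    def handle_data(self, data):
        self.text.append(data)

def flatten(html):
    p = TextAndWrappers()
    p.feed(html)
    return "".join(p.text), p.wrappers

old_text, old_wrap = flatten(
    '<code>Nat.add : <span class="fn">Nat</span></code>')
new_text, new_wrap = flatten(
    '<code>Nat.add : <a href="Nat.html">Nat</a></code>')

print(old_text == new_text)  # True: same text, different wrapping
print(old_wrap, new_wrap)    # ['span'] ['a']
```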
In addition to the per-file HTML comparison, the script also checks several other things. It compares static assets (CSS, JavaScript, fonts, images) byte for byte. It compares the search index (`declaration-data.bmp`, which is JSON) with domain-specific rules: module entries are compared by URL and import set, instance lists are compared as sets, and declaration entries are compared field by field, with `docLink` differences accepted when both the old and new targets are valid anchors. Other JSON files are compared by structural equality.
The script also runs a declaration census: it queries the database for every declaration marked as rendered (`render = 1` in `name_info`) and checks that a corresponding anchor exists in the generated HTML. This catches cases where a declaration is in the database but missing from the output, which would not be detected by the per-file comparison since there is no old file to compare against. Finally, it does a bidirectional target coverage check, comparing the set of anchored link targets between the old and new HTML to flag any targets that were dropped.
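A sketch of the census check. The `name_info` table and `render` column come from the description above; the other columns and the anchor extraction are invented:

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE name_info (
    module_name TEXT, position INTEGER, name TEXT, render INTEGER,
    PRIMARY KEY (module_name, position))""")
conn.executemany("INSERT INTO name_info VALUES (?, ?, ?, ?)", [
    ("M", 0, "Nat.add", 1),
    ("M", 1, "Nat.add.match_1", 0),  # not rendered, exempt from census
    ("M", 2, "Nat.mul", 1),
])

# Stand-in for the generated HTML; Nat.mul's anchor is missing.
html = '<div id="Nat.add">...</div>'

anchors = set(re.findall(r'id="([^"]+)"', html))
missing = [name for (name,) in conn.execute(
    "SELECT name FROM name_info WHERE render = 1")
    if name not in anchors]
print(missing)  # ['Nat.mul']
```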
The output is organized per file: for each file with differences, the script prints the rejected
differences (with unified diffs of the parent element) and, in verbose mode, the accepted ones
with their rule names. At the end it prints a summary with counts for HTML files, data files,
static assets, the declaration census, and target coverage, followed by a breakdown of accepted
differences by rule with a sample for each to facilitate inspection.
Results
The results of the script show one accepted difference in data files. `declaration-data.bmp` swaps the link destinations for two instances: `CategoryTheory.Abelian.instIsStableUnderBaseChangeEpimorphisms` and `CategoryTheory.Abelian.instIsStableUnderCobaseChangeMonomorphisms`. This is because the relevant instances exist in both modules in Mathlib (link 1, link 2). The generated destination in the HTML depends just on the order that doc-gen4 happens to visit the modules in. I'm not sure why there isn't some kind of conflict when importing them, but I don't think this difference is a bug.

It also shows many differences in the files. Most of them are source link targets that just point at different temp files. There are examples of each category of allowed difference to make it easier to understand.
Review Process
Once this PR is acceptable, we should archive the comparison script somewhere and delete it prior to the squash merge. It won't be useful for comparing two docsets produced by the database version as-is, but it could be adapted for that purpose at some point.