feat: import Markdown (.md) files as study documents #7
Render an uploaded Markdown file to a clean, text-bearing PDF with pdfkit + marked up front, then send it through the exact same extract -> quality-gate -> store pipeline as any PDF. The viewer, concept tagging, knowledge graph, and all four study tools work on Markdown with no further changes, and the text-coverage gate applies unchanged.
- lib/md-to-pdf.ts: server-side Markdown -> PDF renderer (headings, ordered and bullet lists, blockquotes, fenced code, GFM tables, inline emphasis, inline code and links), styled to match scripts/generate-sample-pdfs.ts
- app/api/upload/route.ts: detect .md/.markdown uploads and convert before the %PDF- check; empty/oversized markdown returns a friendly 422
- components/UploadCard.tsx: accept Markdown in the file picker, drop zone, validation, and copy
- package.json: promote marked from a dev to a runtime dependency, since the converter runs in the packaged app
Latin scripts only for now: the standard PDF fonts cover Latin, while CJK and other scripts would need an embedded font (a deliberate follow-up).
Code Review
This pull request introduces server-side Markdown-to-PDF conversion, allowing users to upload Markdown files which are rendered into PDFs before ingestion. The frontend has been updated to accept Markdown files, and a new utility lib/md-to-pdf.ts handles the conversion using marked and pdfkit. Feedback on the implementation focuses on several PDF rendering issues, including layout corruption when blockquotes span page breaks, incorrect indentation for wrapped list items, loss of formatting for block-level elements inside lists, and potential off-page rendering of table rows that exceed the header column count.
| function renderBlockquote(doc: PDFKitDoc, token: Tokens.Blockquote): void { | ||
| const top = doc.y; | ||
| doc.x = MARGIN + 16; | ||
| for (const child of token.tokens) renderBlock(doc, child); | ||
| const bottom = doc.y; | ||
| doc | ||
| .save() | ||
| .lineWidth(2) | ||
| .strokeColor(ACCENT) | ||
| .moveTo(MARGIN + 4, top) | ||
| .lineTo(MARGIN + 4, bottom) | ||
| .stroke() | ||
| .restore(); | ||
| doc.x = MARGIN; | ||
| doc.moveDown(0.3); | ||
| } |
If a blockquote spans across a page break, top is captured on the first page, but bottom is captured on the subsequent page. When doc.stroke() is called, PDFKit draws the vertical line on the current (last) page using the top coordinate from the first page.
This results in a corrupted vertical line drawn backwards (from a high Y coordinate to a low Y coordinate) on the last page, crossing through unrelated text, while leaving the first page with no line.
Suggested Solutions:
- Prevent page breaks: Estimate the height of the blockquote and use `breakIfTight` to push the entire blockquote to the next page if it doesn't fit.
- Indent instead of drawing lines: Temporarily adjust `doc.page.margins.left` to indent the blockquote content, and draw shorter vertical line segments per block/page instead of a single continuous line across pages.
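The second suggestion (one short rule per page rather than one continuous line) can be sketched as a pure helper that splits the rule into per-page segments. This is only an illustration under assumptions: `PagePoint`, `RuleSegment`, and `accentRuleSegments` are hypothetical names that do not appear in the PR, and the renderer would have to record the page index alongside `doc.y` when it captures `top` and `bottom`.

```typescript
// Hypothetical shapes for illustration; none of these names come from the PR.
interface PagePoint {
  page: number; // zero-based page index when the position was captured
  y: number;    // doc.y when the position was captured
}

interface RuleSegment {
  page: number;
  fromY: number;
  toY: number;
}

/**
 * Split one logical accent rule into per-page segments.
 * contentTop / contentBottom are the y-limits of the printable area.
 */
function accentRuleSegments(
  start: PagePoint,
  end: PagePoint,
  contentTop: number,
  contentBottom: number,
): RuleSegment[] {
  if (end.page < start.page) throw new RangeError("end precedes start");
  const segments: RuleSegment[] = [];
  for (let page = start.page; page <= end.page; page++) {
    // First page starts at the captured top; later pages start at the margin.
    const fromY = page === start.page ? start.y : contentTop;
    // Last page ends at the captured bottom; earlier pages run to the margin.
    const toY = page === end.page ? end.y : contentBottom;
    if (toY > fromY) segments.push({ page, fromY, toY });
  }
  return segments;
}
```

Each segment would then be stroked on its own page. One way to revisit earlier pages is pdfkit's `bufferPages` option together with `doc.switchToPage(n)`; whether that fits this renderer is an open design question.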
| emitSegments(doc, inline, { | ||
| size: BODY_SIZE, | ||
| color: INK_700, | ||
| indent, | ||
| width: CONTENT_WIDTH - indent, | ||
| paragraphGap: 3, | ||
| }); |
When text inside a list item wraps to a new line (or across a page break), PDFKit wraps it to the document's default left margin (MARGIN), completely ignoring the list's indent. This causes wrapped lines to align under the bullet/number rather than maintaining a clean hanging indent.
To achieve a proper hanging indent and ensure text wraps correctly even across page breaks, you can temporarily adjust doc.page.margins.left during the rendering of the list item's text, and then restore it afterward.
| emitSegments(doc, inline, { | |
| size: BODY_SIZE, | |
| color: INK_700, | |
| indent, | |
| width: CONTENT_WIDTH - indent, | |
| paragraphGap: 3, | |
| }); | |
| const originalLeft = doc.page.margins.left; | |
| doc.page.margins.left = MARGIN + indent; | |
| emitSegments(doc, inline, { | |
| size: BODY_SIZE, | |
| color: INK_700, | |
| indent, | |
| width: CONTENT_WIDTH - indent, | |
| paragraphGap: 3, | |
| }); | |
| doc.page.margins.left = originalLeft; |
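The set-then-restore pattern in the suggestion can be wrapped so the margin is restored even if rendering throws. The sketch below is an assumption-laden illustration: `withLeftMargin` and `HasLeftMargin` are hypothetical names, and the structural type models only the one field the suggestion touches rather than pdfkit's real document type.

```typescript
// Minimal structural type: only the field this helper touches (assumed shape).
interface HasLeftMargin {
  page: { margins: { left: number } };
}

/**
 * Run draw() with a temporary left margin, restoring the original afterwards.
 * The margins object is captured up front so the restore targets the same
 * object that was mutated, even if doc.page points elsewhere after draw().
 */
function withLeftMargin<T>(doc: HasLeftMargin, left: number, draw: () => T): T {
  const margins = doc.page.margins;
  const original = margins.left;
  margins.left = left;
  try {
    return draw();
  } finally {
    margins.left = original;
  }
}
```

A call site might look like `withLeftMargin(doc, MARGIN + indent, () => emitSegments(...))`. Whether pages created during `draw()` pick up the adjusted margin depends on how pdfkit builds each new page's margins, so the page-break case should be checked against a long list item before relying on it.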
| const inline: Segment[] = []; | ||
| const nested: Tokens.List[] = []; | ||
| for (const child of item.tokens) { | ||
| if (child.type === "list") nested.push(child as Tokens.List); | ||
| else if (child.type === "text") flattenInline((child as Tokens.Text).tokens ?? [child as Token], BASE_STYLE, inline); | ||
| else flattenInline([child], BASE_STYLE, inline); | ||
| } |
Currently, renderList assumes that list items only contain nested lists or inline text. If a list item contains other block-level elements (such as code blocks, blockquotes, tables, or multiple paragraphs), they will fall into the else branch and be flattened into inline text via flattenInline.
This causes them to lose all block-level formatting (e.g., code block backgrounds, table structures, or paragraph breaks) and render as a single continuous line of text.
Consider documenting this limitation, or recursively calling renderBlock with an adjusted indentation/margin to support block-level children within list items.
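One way to support block-level children is to plan the list item first, separating inline runs from blocks while preserving their order, and then dispatch each part. The sketch below uses a simplified token shape (`SketchToken`) instead of marked's real `Token` union, and `planListItem` is a hypothetical helper, not code from the PR.

```typescript
// Simplified stand-in for marked's token union (assumption for illustration).
interface SketchToken {
  type: string;
  text?: string;
  tokens?: SketchToken[];
}

type ItemPart =
  | { kind: "inline"; tokens: SketchToken[] } // would go to flattenInline
  | { kind: "block"; token: SketchToken };    // would go to an indented renderBlock

// Token types whose children are inline content.
const INLINE_CONTAINERS = new Set(["text", "paragraph"]);

/** Split a list item's children into ordered inline runs and block tokens. */
function planListItem(children: SketchToken[]): ItemPart[] {
  const parts: ItemPart[] = [];
  for (const child of children) {
    if (child.type === "space") continue; // blank-line separators carry no content
    if (INLINE_CONTAINERS.has(child.type)) {
      // Each text/paragraph child stays its own run, so paragraph breaks survive.
      parts.push({ kind: "inline", tokens: child.tokens ?? [child] });
    } else {
      // Code, blockquote, table, nested list: keep as a block.
      parts.push({ kind: "block", token: child });
    }
  }
  return parts;
}
```

The renderer would then loop over the parts, sending inline runs through the existing inline path and blocks through `renderBlock` with the item's indent applied.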
| const heights = cells.map((c) => | ||
| doc.heightOfString(cellText(c), { width: colWidth - 12 }), | ||
| ); | ||
| const rowHeight = Math.max(...heights, 14) + 8; | ||
| breakIfTight(doc, rowHeight); | ||
| const top = doc.y; | ||
| cells.forEach((c, i) => { | ||
| doc | ||
| .font(font) | ||
| .fontSize(header ? 9 : 10) | ||
| .fillColor(color) | ||
| .text(cellText(c), MARGIN + i * colWidth, top + 4, { width: colWidth - 12 }); | ||
| }); |
If a table row has more cells than the header (token.header.length), colWidth is calculated based on the header length, but the loop will try to render all cells. This will cause cells with index i >= cols to be drawn at MARGIN + i * colWidth, which is beyond MARGIN + CONTENT_WIDTH (off the right edge of the page).
To prevent drawing off-page and potential layout corruption, you should slice the row cells to match the header column count.
| const heights = cells.map((c) => | |
| doc.heightOfString(cellText(c), { width: colWidth - 12 }), | |
| ); | |
| const rowHeight = Math.max(...heights, 14) + 8; | |
| breakIfTight(doc, rowHeight); | |
| const top = doc.y; | |
| cells.forEach((c, i) => { | |
| doc | |
| .font(font) | |
| .fontSize(header ? 9 : 10) | |
| .fillColor(color) | |
| .text(cellText(c), MARGIN + i * colWidth, top + 4, { width: colWidth - 12 }); | |
| }); | |
| const activeCells = cells.slice(0, cols); | |
| const heights = activeCells.map((c) => | |
| doc.heightOfString(cellText(c), { width: colWidth - 12 }), | |
| ); | |
| const rowHeight = Math.max(...heights, 14) + 8; | |
| breakIfTight(doc, rowHeight); | |
| const top = doc.y; | |
| activeCells.forEach((c, i) => { | |
| doc | |
| .font(font) | |
| .fontSize(header ? 9 : 10) | |
| .fillColor(color) | |
| .text(cellText(c), MARGIN + i * colWidth, top + 4, { width: colWidth - 12 }); | |
| }); |
When the markdown contains CJK / kana / hangul / fullwidth characters, probe for a system CJK font (PingFang / Hiragino on macOS, Microsoft YaHei / SimSun on Windows, Noto Sans CJK / WenQuanYi on Linux) and render the whole document with it. Those families carry Latin glyphs too, so mixed English/Chinese reads correctly; pdfkit subsets the font, so only the glyphs actually used are embedded. When no CJK font is present we fall back to the Latin fonts. Verified round-trip: a Chinese study note renders, embeds, and is re-extracted by pdf.js as correct Chinese text, clearing the text-coverage gate.
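The detect-then-probe flow described above might look roughly like the following. The PR text does not show the detection code, so the character ranges, `needsCjkFont`, and `firstAvailable` are all assumptions. The file-existence check is injected as a predicate so that no real font paths are hard-coded here.

```typescript
// Assumed ranges: hiragana + katakana, CJK Extension A, CJK Unified Ideographs,
// hangul syllables, and halfwidth/fullwidth forms.
const CJK_PATTERN = /[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af\uff00-\uffef]/;

/** True when the text contains at least one character from the ranges above. */
function needsCjkFont(markdown: string): boolean {
  return CJK_PATTERN.test(markdown);
}

/**
 * Return the first candidate accepted by `exists`, or undefined so the caller
 * can fall back to the Latin fonts.
 */
function firstAvailable(
  candidates: readonly string[],
  exists: (path: string) => boolean,
): string | undefined {
  for (const candidate of candidates) {
    if (exists(candidate)) return candidate;
  }
  return undefined;
}
```

In a Node environment the predicate could be `fs.existsSync`, with the candidate list built per platform from the font families named above.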
Add scripts/test-md-import.ts (run via `npm run test:md`), matching the existing tsx + check() harness. Tests the public interface only (markdownToPdf and the bytes it returns, round-tripped through the same extractPdf the upload pipeline uses), so they survive a rewrite of the layout internals:
- renders a valid PDF
- round-trip fidelity: heading, bold, inline code, list, blockquote, code block, table cell, and link text all survive into the extracted text, and a substantive doc clears the text-coverage gate
- empty / whitespace-only input is rejected with MarkdownEmptyError
- oversized input is rejected with MarkdownTooLargeError
- CJK never throws and yields a valid PDF; the Chinese round-trip is asserted when a CJK system font is present and skipped otherwise (bare CI box)
Library, viewer, and the KG job titler stripped only `.pdf` from a filename when no curated title matched, so an imported `notes.md` displayed as `notes.md`. Extend the suffix strip to the markdown extensions we accept, mirroring MARKDOWN_EXT in the upload route.
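A minimal sketch of the extended suffix strip, assuming the fallback title is derived from the filename with a single regex. The extension list matches the one quoted in the review of this PR. `DOC_EXT`, `displayTitle`, the `$` anchor, and the case-insensitive flag are assumptions, not code from the PR.

```typescript
// Extensions stripped from fallback titles (anchor and flag are assumed).
const DOC_EXT = /\.(pdf|md|markdown|mdown|mkd|mdwn)$/i;

/** Drop a trailing document extension; other filenames pass through unchanged. */
function displayTitle(filename: string): string {
  return filename.replace(DOC_EXT, "");
}
```

Only the final extension is removed, so a name such as `chapter.1.md` would keep its inner dot.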
| } | ||
| /** | ||
| * Render a Markdown string to PDF bytes. The returned buffer is a normal, | ||
| * text-bearing PDF that `extractPdf` reads exactly like any other upload. | ||
| */ | ||
| export async function markdownToPdf(markdown: string): Promise<Buffer> { | ||
| if (Buffer.byteLength(markdown, "utf-8") > MAX_MARKDOWN_BYTES) { | ||
| throw new MarkdownTooLargeError(); | ||
| } | ||
| if (!markdown.trim()) { | ||
| throw new MarkdownEmptyError(); | ||
| } | ||
|
|
||
| const tokens = marked.lexer(markdown, { gfm: true }); | ||
Loose list items render raw markdown syntax verbatim
When a list is "loose" (items separated by blank lines), marked wraps each item's content in a paragraph token rather than a text token. The paragraph type is not handled explicitly here, so it falls through to else flattenInline([child], …). Inside flattenInline, paragraph hits the default branch which uses token.text — the raw markdown source string (e.g. "Plants convert **light** energy"). The result is that asterisks, underscores, and backticks appear literally in the rendered PDF, rather than being rendered as bold/italic/code.
A loose list is triggered by any blank line between list items, which is very common in study notes. The fix is to add an explicit branch for paragraph children, similar to the existing text branch:
else if (child.type === "paragraph")
flattenInline((child as Tokens.Paragraph).tokens, BASE_STYLE, inline);
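To see why reading the paragraph's inline `tokens` matters, the sketch below flattens a hand-written token that mirrors the shape marked produces for the example sentence. `InlineToken`, `Run`, and `flattenRuns` are simplified stand-ins for the PR's `Token`, `Segment`, and `flattenInline`, written only for illustration.

```typescript
// Simplified token and output shapes (assumptions, not marked's real types).
interface InlineToken {
  type: string;
  text: string;
  tokens?: InlineToken[];
}

interface Run {
  text: string;
  bold: boolean;
}

/** Walk inline tokens depth-first, carrying the bold flag down from `strong`. */
function flattenRuns(tokens: InlineToken[], bold = false, out: Run[] = []): Run[] {
  for (const t of tokens) {
    if (t.type === "strong") flattenRuns(t.tokens ?? [], true, out);
    else if (t.tokens && t.tokens.length > 0) flattenRuns(t.tokens, bold, out);
    else out.push({ text: t.text, bold });
  }
  return out;
}

// Hand-written paragraph token for "Plants convert **light** energy".
const looseParagraph: InlineToken = {
  type: "paragraph",
  text: "Plants convert **light** energy", // raw source, asterisks included
  tokens: [
    { type: "text", text: "Plants convert " },
    { type: "strong", text: "light", tokens: [{ type: "text", text: "light" }] },
    { type: "text", text: " energy" },
  ],
};

// Flattening the inline children yields clean runs with styling attached.
const runs = flattenRuns(looseParagraph.tokens ?? []);
```

Using `looseParagraph.text` directly would carry the asterisks into the PDF, while the flattened runs hold the plain words with `light` marked bold.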
| function renderBlockquote(ctx: Ctx, token: Tokens.Blockquote): void { | ||
| const { doc } = ctx; | ||
| const top = doc.y; | ||
| doc.x = MARGIN + 16; | ||
| for (const child of token.tokens) renderBlock(ctx, child); | ||
| const bottom = doc.y; | ||
| doc | ||
| .save() | ||
| .lineWidth(2) | ||
| .strokeColor(ACCENT) | ||
| .moveTo(MARGIN + 4, top) | ||
| .lineTo(MARGIN + 4, bottom) | ||
| .stroke() | ||
| .restore(); | ||
| doc.x = MARGIN; | ||
| doc.moveDown(0.3); | ||
| } |
Multi-page blockquote accent line drawn with stale coordinates
top is captured before renderBlock processes the child tokens, and bottom is captured after. If the blockquote content wraps onto a new page, top will be a Y value on the previous page while bottom is a Y value on the current page. The moveTo(…, top).lineTo(…, bottom) call draws both points on the current page, producing a line that runs from near the bottom of the visible area upward to bottom — effectively backwards. Multi-paragraph blockquotes in longer documents will trigger this.
| <input | ||
| ref={inputRef} | ||
| type="file" |
accept attribute is missing three of the five accepted extensions
ACCEPTED_FILE (and the server-side MARKDOWN_EXT) accept .mdown, .mkd, and .mdwn, but the accept string only lists .md and .markdown. Files with those three extensions will be hidden in the OS file picker unless the user chooses "All Files". They do pass client-side validation when picked by other means, so this is purely a discoverability gap rather than a blocking error.
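One way to keep the picker and the validators from drifting is to derive both from a single extension list. This is a sketch under assumptions: `MARKDOWN_EXTENSIONS`, `ACCEPT_ATTR`, and `MARKDOWN_EXT_RE` are hypothetical names, and the component's real `accept` string may also include MIME types not shown here.

```typescript
// Single source of truth for accepted markdown extensions (hypothetical constant).
const MARKDOWN_EXTENSIONS = ["md", "markdown", "mdown", "mkd", "mdwn"] as const;

// Value for <input accept="...">: PDF plus every markdown extension.
const ACCEPT_ATTR = [".pdf", ...MARKDOWN_EXTENSIONS.map((ext) => `.${ext}`)].join(",");

// Validator built from the same list, so picker and validation cannot diverge.
const MARKDOWN_EXT_RE = new RegExp(`\\.(${MARKDOWN_EXTENSIONS.join("|")})$`, "i");
```

Adding a sixth extension would then be a one-line change to the list.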
2 issues found across 8 files
<file name="lib/md-to-pdf.ts">
<violation number="1" location="lib/md-to-pdf.ts:363">
P2: List-item child paragraphs are flattened from the block token instead of their inline token list. Loose list items can output raw markdown syntax and miss inline formatting/link annotations.</violation>
<violation number="2" location="lib/md-to-pdf.ts:386">
P2: Blockquote rule drawing assumes content stays on one page. When quoted content page-breaks, the accent line is drawn with stale coordinates and does not correctly cover the full quote.</violation>
</file>
| const innerWidth = CONTENT_WIDTH - 20; | ||
| const height = doc.heightOfString(code, { width: innerWidth, lineGap: 2 }); | ||
| breakIfTight(doc, height + 16); | ||
| const top = doc.y; |
P2: Blockquote rule drawing assumes content stays on one page. When quoted content page-breaks, the accent line is drawn with stale coordinates and does not correctly cover the full quote.
| for (const child of item.tokens) { | ||
| if (child.type === "list") nested.push(child as Tokens.List); | ||
| else if (child.type === "text") flattenInline((child as Tokens.Text).tokens ?? [child as Token], BASE_STYLE, inline); | ||
| else flattenInline([child], BASE_STYLE, inline); |
P2: List-item child paragraphs are flattened from the block token instead of their inline token list. Loose list items can output raw markdown syntax and miss inline formatting/link annotations.
What
Lets students upload a Markdown (`.md`) file and study from it exactly like a PDF. This is a step toward the "support for more document shapes" item in CONTRIBUTING.
How
Get It. is built around a PDF: the viewer renders pages and overlays concept tags at real PDF-space coordinates, and every agent reads the per-page text `lib/pdf-extract.ts` pulls out. So instead of teaching every layer about Markdown, an uploaded `.md` is rendered up front into a clean, text-bearing PDF and then flows through the exact same `extractPdf → assessPdfQuality → store` pipeline as any other upload. The viewer, concept tagging, knowledge graph, and all four study tools work on Markdown with zero further changes, and the text-coverage gate applies unchanged.
Rendering uses `pdfkit` (already a runtime dep, the same library `scripts/generate-sample-pdfs.ts` uses) + `marked` (already a dep; promoted dev → runtime). No new dependency is added.
Changes
- `lib/md-to-pdf.ts` (new): server-side Markdown → PDF renderer. Tokenizes with `marked.lexer` and lays out headings, ordered/bullet (and nested) lists, blockquotes, fenced code, GFM tables, and inline bold/italic/code/links. Styled to match the bundled sample PDFs. Guards empty and oversized (>1 MB) input.
- `app/api/upload/route.ts`: detect `.md`/`.markdown` uploads and convert before the `%PDF-` sanity check (pdfkit emits a valid header, so the rest of the route is untouched). Empty/oversized markdown returns a friendly `422`.
- `components/UploadCard.tsx`: accept Markdown in the file picker `accept`, drag-drop, client validation, and copy.
- `library-client.tsx`, `viewer-client.tsx`, `lib/jobs.ts`, `UploadCard.tsx`: strip the markdown extensions too, so an imported `notes.md` shows as `notes`, not `notes.md`.
- `package.json`: promote `marked` from dev to runtime dependency; add the `test:md` script.
- `scripts/test-md-import.ts` (new): behavior tests for the importer.
CJK / international scripts
Markdown with CJK, kana, hangul, or fullwidth characters is detected and rendered with a system CJK font: PingFang/Hiragino on macOS, Microsoft YaHei/SimSun on Windows, Noto Sans CJK/WenQuanYi on Linux. These families carry Latin glyphs, so mixed English/Chinese reads correctly, and pdfkit subsets the font so only used glyphs are embedded. If no CJK font is found, it falls back to the Latin fonts. No font is bundled with the repo.
Testing
- `npm run build` (next build + tsc typecheck): passes.
- `npm run test:errors`: passes.
- `npm run test:md` (new): 15 behavior checks against the public interface, all passing: valid-PDF output; round-trip fidelity of heading / bold / inline code / list / blockquote / code block / table cell / link text; a substantive doc clears the text-coverage gate; empty and oversized input throw the typed errors; CJK never throws and (where a CJK font is present) round-trips correct Chinese text. Tests go through `markdownToPdf` + `extractPdf` only, so they survive a rewrite of the layout internals.
- `eslint` on the new/changed lines: clean.
Notes
- `eslint` already reports some pre-existing problems on `developer` (e.g. `setState`-in-effect in `library-client.tsx` / `viewer-client.tsx`, and a couple of unused imports). Those are untouched here; I only added the markdown extensions to the existing title-strip lines in those files.
- Opened as a draft. Happy to discuss direction on Discord first if you'd prefer, and to adjust scope/styling to taste.
Greptile Summary
This PR adds Markdown (`.md`) file import support by converting uploaded markdown to a text-bearing PDF on the server using `pdfkit` and `marked` (no new runtime dependencies), then feeding the result through the existing `extractPdf → assessPdfQuality → store` pipeline unchanged.
- `lib/md-to-pdf.ts` (new): A 556-line block-level renderer covering headings, paragraphs, ordered/unordered lists, blockquotes, fenced code, GFM tables, and inline bold/italic/code/links, with CJK font detection. A loose-list rendering bug causes inline formatting in items separated by blank lines to appear as raw `**asterisks**` in the PDF.
- `app/api/upload/route.ts`: Detects markdown extensions before the `%PDF-` check and converts, returning a `422` for empty/oversized input; well-integrated into the existing gate.
- `components/UploadCard.tsx`, `lib/jobs.ts`, `library-client.tsx`, `viewer-client.tsx`: Consistent extension of client validation, file-picker `accept`, drag-drop copy, and title-strip regex to cover markdown extensions.
Confidence Score: 3/5
The upload pipeline and all existing study tools are untouched; the risk is confined to the new renderer in lib/md-to-pdf.ts.
The loose-list rendering bug means any markdown document whose list items are separated by blank lines will display raw asterisks and underscores in the converted PDF. Students studying from such a document would see corrupted output, and the test suite only exercises tight lists so the defect goes undetected.
lib/md-to-pdf.ts — specifically the list-item child dispatch loop in renderList and the blockquote page-boundary drawing in renderBlockquote.
Important Files Changed
`marked` from devDependencies to dependencies and adds the `test:md` script; correct and necessary change.
Sequence Diagram
%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
  participant U as User / Browser
  participant UC as UploadCard
  participant API as POST /api/upload
  participant MD as markdownToPdf
  participant EX as extractPdf
  participant ST as store
  U->>UC: Drop / pick .md file
  UC->>UC: ACCEPTED_FILE.test(filename)
  UC->>API: multipart/form-data (md bytes)
  API->>API: MARKDOWN_EXT.test(filename)?
  API->>MD: markdownToPdf(utf-8 string)
  MD->>MD: Guard empty / oversized
  MD->>MD: marked.lexer then renderBlock per token
  MD-->>API: Buffer (PDF bytes)
  API->>EX: extractPdf(pdfBytes)
  EX-->>API: pages + text
  API->>API: assessPdfQuality gate
  API->>ST: saveDoc(docId, meta, pdf)
  API-->>UC: docId, numPages
  UC-->>U: Navigate to viewer
Comments Outside Diff (1)
lib/jobs.ts, line 148
The regex `\.(pdf|md|markdown|mdown|mkd|mdwn)` (or the markdown-only variant) is inlined in `lib/jobs.ts`, `app/library/library-client.tsx`, `app/viewer/[docId]/viewer-client.tsx`, `components/UploadCard.tsx`, and `app/api/upload/route.ts`. A shared constant would ensure a future extension addition only needs one change.
Reviews (1): Last reviewed commit: "feat: strip .md/.markdown from document ..."