A FastAPI microservice for parsing PDF files into structured text and tables using PyMuPDF.
- Concurrent PDF parsing with a configurable number of workers via ProcessPoolExecutor.
- Customizable margins (footer_margin, header_margin), image-text filtering (no_image_text), and table merge tolerance.
- Thin wrapper service (PDFParseService) to construct and reuse a single parser instance.
- REST API endpoint for uploading PDFs and returning JSON-serialized elements.
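To illustrate the concurrent parsing feature, here is a minimal sketch of one way to spread a document's pages across a ProcessPoolExecutor. It is not the service's actual implementation: `split_pages`, `parse_page_range`, and `parse_concurrently` are hypothetical names, and the per-page work is a placeholder rather than a real PyMuPDF call.

```python
from concurrent.futures import ProcessPoolExecutor


def split_pages(num_pages: int, max_processors: int) -> list[tuple[int, int]]:
    """Split pages 0..num_pages-1 into contiguous half-open (start, stop) ranges."""
    if num_pages <= 0:
        return []
    # Never start more workers than there are pages.
    workers = max(1, min(max_processors, num_pages))
    base, extra = divmod(num_pages, workers)
    ranges = []
    start = 0
    for i in range(workers):
        # The first `extra` workers take one additional page each.
        stop = start + base + (1 if i < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def parse_page_range(page_range: tuple[int, int]) -> list[str]:
    """Placeholder worker (assumption): a real worker would open the PDF
    and extract text and tables for each page in the range."""
    start, stop = page_range
    return [f"page {n + 1}" for n in range(start, stop)]


def parse_concurrently(num_pages: int, max_processors: int = 2) -> list[str]:
    """Parse all page ranges in parallel and return results in page order."""
    ranges = split_pages(num_pages, max_processors)
    if not ranges:
        return []
    with ProcessPoolExecutor(max_workers=len(ranges)) as pool:
        # Executor.map() yields results in input order, so page order is kept.
        chunks = pool.map(parse_page_range, ranges)
        return [item for chunk in chunks for item in chunk]
```

When calling `parse_concurrently` from a script, guard the call with `if __name__ == "__main__":` so that worker processes can be spawned safely on every platform.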
- Clone the repository:
git clone https://github.com/your-org/pymupdf-service.git
cd pymupdf-service
- Install dependencies with Poetry:
poetry install
Service-level defaults are defined in parser_settings.yaml:
- max_processors: Number of parallel workers (default: 2)
- footer_margin: Bottom margin to ignore as footer (default: 10)
- header_margin: Top margin to ignore as header (default: 10)
- no_image_text: Exclude text over images if true (default: false)
- tolerance: Pixel tolerance for merging adjacent table bounding boxes (default: 20)
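The `tolerance` setting is easiest to picture with a small example. The sketch below shows one plausible way to merge table bounding boxes that lie within `tolerance` pixels of each other. It is an assumption for illustration, not the service's actual algorithm; boxes are `(x0, y0, x1, y1)` tuples and the function names are hypothetical.

```python
Box = tuple[float, float, float, float]  # (x0, y0, x1, y1)


def boxes_are_close(a: Box, b: Box, tolerance: float) -> bool:
    """True if the gap between a and b is at most `tolerance` on both axes.

    A negative gap means the boxes overlap on that axis.
    """
    gap_x = max(a[0], b[0]) - min(a[2], b[2])
    gap_y = max(a[1], b[1]) - min(a[3], b[3])
    return gap_x <= tolerance and gap_y <= tolerance


def merge_boxes(boxes: list[Box], tolerance: float = 20) -> list[Box]:
    """Merge boxes that are within `tolerance` of each other into their union."""
    merged: list[Box] = []
    for box in boxes:
        current = box
        changed = True
        # Keep absorbing neighbours until the growing box touches no others.
        while changed:
            changed = False
            for other in merged:
                if boxes_are_close(current, other, tolerance):
                    merged.remove(other)
                    current = (
                        min(current[0], other[0]),
                        min(current[1], other[1]),
                        max(current[2], other[2]),
                        max(current[3], other[3]),
                    )
                    changed = True
                    break
        merged.append(current)
    return merged
```

In this sketch, two table fragments separated by a 10-pixel vertical gap are merged at the default tolerance of 20 but stay separate at a tolerance of 5.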
Build the Docker image and start the FastAPI server:
docker build -t pymupdf-service .
docker run -p 8888:8888 pymupdf-service
Form-data parameters:
- file: PDF file to parse (UploadFile, required)
- footer_margin: Optional integer
- header_margin: Optional integer
- no_image_text: Optional boolean
- tolerance: Optional integer
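Because every field except `file` is optional, a natural reading is that omitted values fall back to the service-level defaults. The sketch below shows that fallback under this assumption; `ParserSettings` and `resolve_settings` are hypothetical names, not part of the service's documented API.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ParserSettings:
    # Defaults mirror the values documented for parser_settings.yaml.
    max_processors: int = 2
    footer_margin: int = 10
    header_margin: int = 10
    no_image_text: bool = False
    tolerance: int = 20


def resolve_settings(
    defaults: ParserSettings,
    footer_margin: Optional[int] = None,
    header_margin: Optional[int] = None,
    no_image_text: Optional[bool] = None,
    tolerance: Optional[int] = None,
) -> ParserSettings:
    """Return a copy of `defaults` with request-supplied values applied.

    Only None counts as "not supplied", so explicit values such as 0 or
    False still override the defaults.
    """
    overrides = {
        "footer_margin": footer_margin,
        "header_margin": header_margin,
        "no_image_text": no_image_text,
        "tolerance": tolerance,
    }
    supplied = {key: value for key, value in overrides.items() if value is not None}
    return replace(defaults, **supplied)
```

Comparing against `None` rather than testing truthiness matters here: a request that sets `footer_margin` to 0 should disable the footer margin, not silently fall back to 10.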
Response
{
"elements": [
{
"content": "Extracted text or table snippet...",
"content_type": "text|table",
"start_page": 1,
"end_page": 1
}
],
"num_pages": 5
}
- Fork the repo
- Create a feature branch
- Submit a pull request
AGPL-3.0 © Vector8