pymupdf-service

A FastAPI microservice for parsing PDF files into structured text and tables using PyMuPDF.

Features

Concurrent PDF parsing with a configurable number of workers via ProcessPoolExecutor.
Customizable margins (footer_margin, header_margin), image-text filtering (no_image_text), and table merge tolerance (tolerance).
A thin wrapper service (PDFParseService) that constructs and reuses a single parser instance.
A REST API endpoint for uploading PDFs and returning JSON-serialized elements.
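
As a rough illustration of the worker model listed above, the sketch below splits a document's pages into contiguous ranges and hands each range to a ProcessPoolExecutor. The helper names (`split_page_ranges`, `parse_range`, `parse_concurrently`) and the per-range split are assumptions made for this example, not the service's actual code. `parse_range` is a stand-in that returns labels instead of calling PyMuPDF.

```python
from concurrent.futures import ProcessPoolExecutor


def split_page_ranges(num_pages: int, workers: int) -> list[tuple[int, int]]:
    """Split pages [0, num_pages) into at most `workers` contiguous half-open ranges."""
    if num_pages <= 0:
        return []
    workers = max(1, min(workers, num_pages))
    base, extra = divmod(num_pages, workers)
    ranges, start = [], 0
    for i in range(workers):
        # The first `extra` ranges take one additional page each.
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def parse_range(page_range: tuple[int, int]) -> list[str]:
    """Hypothetical stand-in: a real worker would open the PDF and parse these pages."""
    start, end = page_range
    return [f"page {n + 1}" for n in range(start, end)]


def parse_concurrently(num_pages: int, max_processors: int = 2) -> list[str]:
    """Parse page ranges in separate processes and flatten the results in page order."""
    ranges = split_page_ranges(num_pages, max_processors)
    with ProcessPoolExecutor(max_workers=max_processors) as pool:
        # map() yields results in submission order, so page order is preserved.
        chunks = pool.map(parse_range, ranges)
        return [item for chunk in chunks for item in chunk]


if __name__ == "__main__":
    print(parse_concurrently(num_pages=5, max_processors=2))
```

The `__main__` guard matters here: on platforms that spawn worker processes, the module is re-imported in each worker, and unguarded top-level pool creation would recurse.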

Installation

Clone the repository:

```
git clone https://github.com/your-org/pymupdf-service.git
cd pymupdf-service
```

Install dependencies with Poetry:
```
poetry install
```

Configuration

Service-level defaults are defined in `parser_settings.yaml`:

max_processors: Number of parallel workers (default: 2)
footer_margin: Bottom margin to ignore as footer (default: 10)
header_margin: Top margin to ignore as header (default: 10)
no_image_text: Exclude text over images if true (default: false)
tolerance: Pixel tolerance for merging adjacent table bounding boxes (default: 20)
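
One way to picture how these defaults could combine with per-request values is sketched below. The `ParserSettings` dataclass and `with_overrides` helper are hypothetical names for this example. The idea that a request value takes precedence over the configured default is an assumption, and loading the YAML file itself is not shown.

```python
from dataclasses import dataclass, replace
from typing import Any


@dataclass(frozen=True)
class ParserSettings:
    # Default values copied from the configuration list above.
    max_processors: int = 2
    footer_margin: int = 10
    header_margin: int = 10
    no_image_text: bool = False
    tolerance: int = 20


def with_overrides(defaults: ParserSettings, **overrides: Any) -> ParserSettings:
    """Return a copy of `defaults` with every non-None override applied."""
    # Assumption: None means "not supplied", so the default is kept.
    supplied = {key: value for key, value in overrides.items() if value is not None}
    return replace(defaults, **supplied)


settings = with_overrides(ParserSettings(), footer_margin=25, tolerance=None)
print(settings)
```

Keeping the settings object frozen means a request-level override always produces a new object and never mutates the shared defaults.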

Running the Service with Docker

Build the image, then start the FastAPI server on port 8888:

```
docker build -t pymupdf-service .
docker run -p 8888:8888 pymupdf-service
```

API Usage

POST `/v1/pdf/parse`

Form-data parameters:

file: PDF file to parse (UploadFile, required)
footer_margin: Optional integer
header_margin: Optional integer
no_image_text: Optional boolean
tolerance: Optional integer
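
A client call might look like the following sketch. It assumes the container is reachable at `http://localhost:8888` and uses the third-party `requests` library, which is not part of this project. `build_form_fields` and `parse_pdf` are hypothetical helper names. Only the fields the caller sets are sent, so omitted fields are left for the service to handle.

```python
from pathlib import Path
from typing import Optional

# Assumption: the container from the Docker section is reachable on localhost.
PARSE_URL = "http://localhost:8888/v1/pdf/parse"


def build_form_fields(
    footer_margin: Optional[int] = None,
    header_margin: Optional[int] = None,
    no_image_text: Optional[bool] = None,
    tolerance: Optional[int] = None,
) -> dict[str, str]:
    """Collect only the optional fields the caller actually set, as form strings."""
    fields: dict[str, str] = {}
    if footer_margin is not None:
        fields["footer_margin"] = str(footer_margin)
    if header_margin is not None:
        fields["header_margin"] = str(header_margin)
    if no_image_text is not None:
        # Send lowercase literals so the boolean parses unambiguously.
        fields["no_image_text"] = "true" if no_image_text else "false"
    if tolerance is not None:
        fields["tolerance"] = str(tolerance)
    return fields


def parse_pdf(pdf_path: str, **options) -> dict:
    """Upload a PDF as multipart form-data and return the decoded JSON response."""
    import requests  # third-party; imported here so build_form_fields works without it

    path = Path(pdf_path)
    with path.open("rb") as fh:
        response = requests.post(
            PARSE_URL,
            files={"file": (path.name, fh, "application/pdf")},
            data=build_form_fields(**options),
            timeout=60,
        )
    response.raise_for_status()
    return response.json()
```

For example, `parse_pdf("report.pdf", footer_margin=15, no_image_text=True)` would upload a local file named `report.pdf` with two overrides, where the file name is only a placeholder.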

Response

```
{
  "elements": [
    {
      "content": "Extracted text or table snippet...",
      "content_type": "text|table",
      "start_page": 1,
      "end_page": 1
    }
  ],
  "num_pages": 5
}
```

Here "text|table" means that content_type is either "text" or "table".
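
To show how a caller might consume this response, the sketch below decodes a payload into small typed objects and picks out the table elements. The `Element` dataclass and both helper functions are hypothetical names. The sample payload is made up for illustration and only mirrors the field names of the documented response.

```python
import json
from dataclasses import dataclass


@dataclass
class Element:
    content: str
    content_type: str  # "text" or "table", per the sample response
    start_page: int
    end_page: int


def load_elements(payload: str) -> tuple[list[Element], int]:
    """Decode a response body into Element objects plus the page count."""
    data = json.loads(payload)
    elements = [
        Element(
            content=item["content"],
            content_type=item["content_type"],
            start_page=item["start_page"],
            end_page=item["end_page"],
        )
        for item in data["elements"]
    ]
    return elements, data["num_pages"]


def tables_only(elements: list[Element]) -> list[Element]:
    """Keep only the elements tagged as tables."""
    return [e for e in elements if e.content_type == "table"]


# Made-up payload in the documented shape, for illustration only.
SAMPLE = json.dumps({
    "elements": [
        {"content": "Intro paragraph", "content_type": "text", "start_page": 1, "end_page": 1},
        {"content": "Qty | Price", "content_type": "table", "start_page": 2, "end_page": 3},
    ],
    "num_pages": 5,
})

elements, num_pages = load_elements(SAMPLE)
for table in tables_only(elements):
    print(f"table on pages {table.start_page}-{table.end_page} of {num_pages}")
```

Reading the four fields explicitly, instead of unpacking each item wholesale, keeps the sketch working if the service returns extra keys not shown in the sample.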

Contributing

Fork the repo
Create a feature branch
Submit a pull request
