# extract-pdf-annotations.py
A Python script designed to extract text annotations from a PDF file and save them
into a CSV file. The output CSV contains five columns:
- **Page Number**: The page number where the annotation appears.
- **Annotation Text**: The content of the annotation.
- **Author**: The author or creator of the annotation.
- **Creation Date**: The date when the annotation was created.
- **Modification Date**: The date when the annotation was last modified.
## Table of Contents
- [Overview](#overview)
- [Installation](#installation)
- [Usage](#usage)
- [Features](#features)
- [Dependencies](#dependencies)
- [License](#license)
- [Contact](#contact)
## Overview
The `extract-pdf-annotations.py` script allows you to extract all annotations
(e.g., comments, highlights, notes) from a PDF file and export them into a CSV
file for further analysis. This is useful when you need to review or analyze
comments made on a document without having to open the PDF itself.
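
As a rough illustration of the extraction step described above, here is a minimal sketch using PyMuPDF. It is not the script's actual code: the function names `annotation_row` and `extract_rows` are hypothetical, and reporting 1-based page numbers is an assumption. PyMuPDF exposes each annotation's metadata through `annot.info`, where the author is stored under the `title` key.

```python
def annotation_row(page_index, info):
    """Build one CSV row from a 0-based page index and an annotation info dict."""
    return [
        page_index + 1,                # assumption: report 1-based page numbers
        info.get("content", ""),       # annotation text
        info.get("title", ""),         # PyMuPDF stores the author under "title"
        info.get("creationDate", ""),  # raw PDF date string
        info.get("modDate", ""),       # raw PDF date string
    ]


def extract_rows(pdf_path):
    """Collect one row per annotation across all pages of the PDF."""
    import fitz  # PyMuPDF; imported lazily so annotation_row works without it

    rows = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            for annot in page.annots():
                rows.append(annotation_row(page.number, annot.info))
    return rows
```

Calling `extract_rows("/path/to/your/pdf-file.pdf")` would return a list of five-element rows, one per annotation, in page order.
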
## Installation
### Prerequisites
To run this script, you will need the `PyMuPDF` library, which provides functionality for working with PDFs.
### Step 1: Install PyMuPDF
You can install PyMuPDF via pip:
```bash
pip install PyMuPDF
```
### Step 2: Download the Script
You can download the Python script directly or clone the repository:
```bash
git clone https://github.com/yourusername/extract-pdf-annotations.git
```
Ensure that you are running this script in an environment where PyMuPDF is installed (either in a virtual environment or globally).
## Usage
To use the script, follow these steps:
1. Run the Python script:
   ```bash
   python extract-pdf-annotations.py
   ```
2. The script will prompt you to enter the full path of the PDF file you want to extract annotations from:
   ```
   Enter the full path of the PDF file: /path/to/your/pdf-file.pdf
   ```
3. The script will process the PDF file, extract all annotations, and create a CSV file in the same directory as the input file. Example:
   - Input file: `/path/to/your/pdf-file.pdf`
   - Output CSV file: `/path/to/your/pdf-file.csv`
4. The output CSV file will contain the following columns:
   - Page Number
   - Annotation Text
   - Author
   - Creation Date
   - Modification Date
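
The output naming and column layout described in the steps above could be sketched as follows. This is an illustrative sketch rather than the script's own implementation: the helper names `csv_path_for` and `write_rows` are hypothetical, and UTF-8 encoding is an assumption.

```python
import csv
from pathlib import Path

# Column headers as listed in this README.
COLUMNS = ["Page Number", "Annotation Text", "Author", "Creation Date", "Modification Date"]


def csv_path_for(pdf_path):
    """Return the PDF path with its extension swapped for .csv."""
    return Path(pdf_path).with_suffix(".csv")


def write_rows(pdf_path, rows):
    """Write a header line plus the given rows next to the input PDF."""
    out_path = csv_path_for(pdf_path)
    # newline="" lets the csv module control line endings itself.
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(COLUMNS)
        writer.writerows(rows)
    return out_path
```

With this sketch, `csv_path_for("/path/to/your/pdf-file.pdf")` yields `/path/to/your/pdf-file.csv`, matching the example paths above.
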
### Example
```
Enter the full path of the PDF file: /home/user/document.pdf
Annotations have been written to: /home/user/document.csv
```
## Features
- **Extracts PDF annotations**: Captures text annotations (like comments and highlights) from all pages in the PDF.
- **Date conversion**: Converts the PDF-specific date format into a more human-readable format (with timezone handling).
- **Customizable CSV output**: The extracted annotations are saved into a CSV file with a clear structure.
- **Automatic file naming**: The output CSV file is automatically named after the input PDF, with `.csv` as the extension.
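
To make the date conversion feature concrete, here is a minimal sketch of parsing a PDF date string such as `D:20240315143000+02'00'`. It is not taken from the script: the names `parse_pdf_date` and `format_pdf_date` are hypothetical, the output format is an assumption, and treating a date without a timezone marker as UTC is also an assumption.

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYYMMDDHHmmSS followed by an optional Z, or +HH'mm' / -HH'mm'.
PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([Zz+-])(?:(\d{2})'?(\d{2})?'?)?)?$"
)


def parse_pdf_date(value):
    """Convert a PDF date string to a timezone-aware datetime, or None."""
    match = PDF_DATE.match(value or "")
    if not match:
        return None
    year, month, day, hour, minute, second, sign, tz_h, tz_m = match.groups()
    try:
        tz = timezone.utc  # assumption: no offset (or Z) is treated as UTC
        if sign in ("+", "-"):
            offset = timedelta(hours=int(tz_h or 0), minutes=int(tz_m or 0))
            tz = timezone(offset if sign == "+" else -offset)
        return datetime(int(year), int(month or 1), int(day or 1),
                        int(hour or 0), int(minute or 0), int(second or 0),
                        tzinfo=tz)
    except ValueError:
        return None  # out-of-range fields such as month 13


def format_pdf_date(value):
    """Return 'YYYY-MM-DD HH:MM:SS +HHMM', or the input unchanged if unparseable."""
    parsed = parse_pdf_date(value)
    return parsed.strftime("%Y-%m-%d %H:%M:%S %z") if parsed else value
```

For example, `format_pdf_date("D:20240315143000+02'00'")` returns `2024-03-15 14:30:00 +0200`, while a string that does not match the PDF date pattern is passed through unchanged.
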
## Dependencies
This script requires the following Python library:
- PyMuPDF (fitz): For parsing the PDF and extracting annotations.

To install the required dependencies, run:
```bash
pip install PyMuPDF
```
## License
This script is licensed under the MIT License. Feel free to modify and redistribute it under the terms of the license.
## Contact
For questions, feedback, or suggestions, please reach out to the author:
- Author: Yahya Hamidaddin
- Email: [email protected]