Skip to content

aperetti/pdf-library

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

11 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

pdf-library

Extract text from PDF files tracked in a text file or git repository and export to a CSV. Text will be extracted specified by regular expression.

###Simple as simple. The following command will by default extract all email addresses from the PDFs listed in the "pdffiles.txt" file.

java -jar pdf-library.jar "pdffiles.txt"

Given a text file "pdffiles.txt" with the following data:

\full\path\to1.pdf
\full\path\to2.pdf
\full\path\to3.pdf

The program will find all occurences of emails (using the regular expression "\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b") and output the data to a CSV file with following format. output.csv

\full\path\to1.pdf, email, [email protected]
\full\path\to2.pdf, email, [email protected]
\full\path\to2.pdf, email, [email protected]
\full\path\to3.pdf, email, [email protected]

###Options

####-o Specify an outfile. Default is the current directory of the input+output.csv.

####-m Specify your own mapping file. The mapping file is a comma separated list containing regular expression you would like to extract from the PDF.

For example, provided the file mapping.txt with the following information:

"myname", "Adam Peretti"
"phonenumber", "^(?:(?:\+?1\s*(?:[.-]\s*)?)?(?:\(\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\s*\)|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\s*(?:[.-]\s*)?)?([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\s*(?:[.-]\s*)?([0-9]{4})(?:\s*(?:#|x\.?|ext\.?|extension)\s*(\d+))?$"

Running

java -jar pdf-library.jar "\path\to\input\text\file.txt" -m "\path\to\mapping.txt"

Will extract all unique occurrences of "Adam Peretti" and phones numbers from each of the PDF files. The output will look something like this:

\full\path\to1.pdf, myname, Adam Peretti
\full\path\to2.pdf, myname, Adam Peretti
\full\path\to2.pdf, phonenumber, 555-555-5555
\full\path\to3.pdf, phonenumber, 555-5555
\full\path\to3.pdf, phonenumber, 555-5465

About

Extract text from PDF files using regular expression.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages