bulk pdf text extraction tool?

The Something Awful Forums > Discussion > Serious Hardware/Software Crap > bulk pdf text extraction tool?

double nine: Aug 8, 2013

I have a bunch of PDFs from way back when, mostly news/magazine/... articles that I copied into a Word document and printed to PDF. Problem is I never made an overview, so now I don't know what most of these are. What I'd like to do is make an Excel sheet that serves as an overview.
Basic structure of the PDFs is:

code:

title
by suchandsuch, ditto

	short summary, usually 2/3 sentences

text of article

Is there a script/tool/... to export these 3 elements - author, title, summary - from each PDF to a single xls file, i.e. 1 row per file with 3 columns for that data? Effectively these 3 elements would be the first 3 paragraphs of each PDF.

# ? Mar 10, 2017 21:10

Suroi: Jun 13, 2013

Depending on how consistent the PDFs you generated way back when are, you could use a simple Python library like PDFMiner, export your title, summary, and text into a CSV, and import that into Excel. But if your PDF structure isn't uniform, you might need to do some hand editing on the final product. If you aren't familiar with Python, or scripting in general, I'm unsure how you would proceed.
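
A rough sketch of that approach, assuming the layout from the first post (title line, "by ..." line, then a summary paragraph that ends at a blank line) and assuming pdfminer.six, the maintained fork of PDFMiner, is installed. The function names `parse_fields` and `build_overview` and the extra "file" column are made up for illustration, and real extracted text may not split this cleanly.

```python
import csv
import re
from pathlib import Path


def parse_fields(text):
    """Return (title, author, summary) from the extracted text of one PDF.

    Assumption: first non-blank line is the title, the next non-blank line
    is the "by ..." line, and the summary runs until the next blank line.
    """
    lines = iter(ln.strip() for ln in text.strip().splitlines())
    title = next((ln for ln in lines if ln), "")
    author = next((ln for ln in lines if ln), "")
    # Drop the leading "by " so only the names remain.
    author = re.sub(r"^by\s+", "", author, flags=re.IGNORECASE)
    summary = []
    for ln in lines:
        if ln:
            summary.append(ln)
        elif summary:
            break  # blank line after the summary: article body starts here
    return title, author, " ".join(summary)


def build_overview(pdf_dir, out_csv):
    """Write one CSV row per PDF in pdf_dir (hypothetical helper)."""
    # Imported here so parse_fields works without pdfminer installed.
    from pdfminer.high_level import extract_text  # pip install pdfminer.six

    with open(out_csv, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["file", "title", "author", "summary"])
        for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
            writer.writerow([pdf.name, *parse_fields(extract_text(str(pdf)))])
```

Calling something like `build_overview("articles", "overview.csv")` would give a CSV that Excel can open directly. Files that don't follow the layout will produce odd rows, so a quick skim of the result is still a good idea.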

# ? Mar 11, 2017 06:43

SuicidalSmurf: Feb 12, 2002

I can't offer too much advice, but I had good luck with VBA in Excel directly interfacing with PDFs for form filling. Surely the opposite can be done, and some VBA string manipulation could spit your data directly into the spreadsheet however you see fit. I don't have much coding background, but I was able to hack together other people's code from Googling to do what I needed.

# ? Mar 11, 2017 21:12

slightpirate: Dec 26, 2006; i am the dance commander

I've used the PDFMiner tool before and all I could ever get it to do was dump the pdf structure, not the content of the fields of a fillable form, which is what I was hoping to get.

# ? Mar 14, 2017 17:11

Bohemian Cowabunga: Mar 24, 2008

Echoing the use of PDFMiner. It worked perfectly when I had to parse 20k+ documents and organize them by certain keywords.

# ? Mar 16, 2017 10:24
