  • Locked thread
double nine
Aug 8, 2013



I have a bunch of PDFs from way back when, mostly news and magazine articles that I copied into a Word document and printed to PDF. Problem is I never made an overview, so now I don't know what most of them are. What I'd like to do is build an overview in Excel.
Basic structure of the pdfs is:
code:
title
by suchandsuch, ditto

	short summary, usually 2/3 sentences

text of article
Is there a script or tool to export these 3 elements (author, title, summary) from each PDF to a single xls file, i.e. 1 row per file with 3 columns holding that data? Effectively these 3 elements would be the first 3 paragraphs of each PDF.


Suroi
Jun 13, 2013


Depending on how consistent the formatting is across the PDFs you generated way back when, you could use a simple Python library like PDFMiner to pull the text out, then export your title, summary, and text into a CSV and import that into Excel. If your PDF structure isn't uniform, you might need to do some hand editing on the final product. If you aren't familiar with Python, or scripting in general, I'm not sure how you would proceed.
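For what it's worth, here is a rough sketch of that approach, not a tested solution for your files. It assumes the pdfminer.six package (`pip install pdfminer.six`) and that each PDF really does start with a title line, a "by ..." line, and then a short summary followed by a blank line. The names `parse_article` and `build_overview` are made up for this example, and the extra "file" column is my addition so you can tell the rows apart.

```python
import csv
import re
from pathlib import Path


def parse_article(text):
    """Return (title, author, summary) from the extracted text of one PDF.

    Assumes the layout described above: title line, byline, then a summary
    paragraph that ends at the next blank line.
    """
    lines = iter(line.strip() for line in text.splitlines())

    def next_nonblank():
        # Consume lines until the first non-empty one; "" if none are left.
        return next((line for line in lines if line), "")

    title = next_nonblank()
    byline = next_nonblank()
    # Drop a leading "by " so only the author details remain.
    author = re.sub(r"^by\s+", "", byline, flags=re.IGNORECASE)

    summary_lines = []
    for line in lines:
        if line:
            summary_lines.append(line)
        elif summary_lines:
            break  # blank line after the summary: the article body starts here
    return title, author, " ".join(summary_lines)


def build_overview(pdf_dir, out_csv):
    """Write one CSV row per PDF in pdf_dir: file name, title, author, summary."""
    # Imported here so parse_article can be tried without pdfminer installed.
    from pdfminer.high_level import extract_text

    with open(out_csv, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["file", "title", "author", "summary"])
        for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
            # The three elements sit at the top, so the first page is enough.
            text = extract_text(str(pdf), maxpages=1)
            writer.writerow([pdf.name, *parse_article(text)])
```

You would call something like `build_overview("articles", "overview.csv")` and open the CSV in Excel. Any PDF that doesn't follow the layout will produce a wrong row, so expect to fix a few by hand.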

SuicidalSmurf
Feb 12, 2002




I can't offer too much advice, but I had good luck with VBA in Excel directly interfacing with PDFs for form filling. Surely the opposite can be done, and some VBA string manipulation could spit your data directly into the spreadsheet however you see fit. I don't have much coding background, but I was able to hack together other people's code from googling to do what I needed.

slightpirate
Dec 26, 2006
i am the dance commander

I've used the PDFMiner tool before and all I could ever get it to do was dump the PDF structure, not the contents of the fields in a fillable form, which is what I was hoping to get.

Bohemian Cowabunga
Mar 24, 2008



Echoing the use of PDFMiner. It worked perfectly when I had to parse through 20k+ documents and organize them by certain keywords.
