Bouchehog
Dec 19, 2002

The Campaign for Badger Rights
Hey guys, I'm running an i5-4670K 3.4GHz, 8GB DDR3, GTX 970 4GB build and I'm doing a lot of OCR, usually A4 size scans of between 500 and 2,000 pages a few times each week. Acrobat takes ages and I'm told that it only uses one core. Does anyone have any recommendations? I don't mind paying (particularly if it's a really good pdf program as well) - say £150 max?

Also, would adding another 8GB of RAM help?

Cheers in advance!

Bouchehog
Dec 19, 2002

The Campaign for Badger Rights
Whilst I'm at it, is it worth upgrading my CPU? The compatibility list is here.

Edit: to save anyone looking, I could upgrade to an i7-4790K or a Xeon E3-1286 v3.

Bouchehog fucked around with this message at 17:33 on Mar 9, 2018

derk
Sep 24, 2004

Bouchehog posted:

Whilst I'm at it, is it worth upgrading my CPU? The compatibility list is here.

Edit: to save anyone looking, I could upgrade to an i7-4790K or a Xeon E3-1286 v3.

Check out Nitro Pro

https://www.gonitro.com/

wargames
Mar 16, 2008

official yospos cat censor

Bouchehog posted:

Whilst I'm at it, is it worth upgrading my CPU? The compatibility list is here.

Edit: to save anyone looking, I could upgrade to an i7-4790K or a Xeon E3-1286 v3.

the thermal compound on the 4670 had a bad run for a while, so they released the 4x90 series with better TIM. but really, you should be fine on the cpu side if you can use all your cores making pdfs.

BangersInMyKnickers
Nov 3, 2004

I have a thing for courageous dongles

Give FoxIt a spin, I was pleased with its performance and pricing. They should have a trial so you can benchmark it for your needs.

incoherent
Apr 24, 2004

01010100011010000111001
00110100101101100011011
000110010101110010
Probably going to need a real solution if you're doing this for a job: a hot folder to process the jobs while you're away. Comes free with Xerox scanners.

https://www.dokmee.com

Bouchehog
Dec 19, 2002

The Campaign for Badger Rights
I'm not doing it for a job per se. I'm self-employed and my job involves a poo poo ton of paper. It's only two or three times a week and I can always go make myself a coffee or something whilst it goes through, but it would make life easier not to have to wait so long, and sometimes I'm under time pressure so the extra ten minutes would be helpful.

thebigcow
Jan 3, 2001

Bully!
If you can script things, have the time to set something up, and hate yourself: https://github.com/tesseract-ocr/tesseract
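For anyone who does want to go down that road, here is a minimal sketch of what the scripting might look like. It assumes the scans already exist as one TIFF per page, that the `tesseract` binary is on the PATH, and that the folder names are whatever you choose (the ones in the comments are hypothetical). It builds the usual `tesseract <image> <outputbase> -l <lang> pdf` command for each page and runs several at once, since each tesseract process is independent of the others.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_command(image: Path, out_dir: Path, lang: str = "eng") -> list[str]:
    """Command line that turns one page image into a searchable one-page PDF."""
    # tesseract adds the ".pdf" extension to the output base itself.
    out_base = out_dir / image.stem
    return ["tesseract", str(image), str(out_base), "-l", lang, "pdf"]


def ocr_pages(image_dir: Path, out_dir: Path, workers: int = 4) -> list[Path]:
    """OCR every .tif in image_dir, running `workers` pages at a time."""
    out_dir.mkdir(parents=True, exist_ok=True)
    images = sorted(image_dir.glob("*.tif"))
    commands = [build_command(img, out_dir) for img in images]
    # Threads are enough here: each one only waits on an external process.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(lambda cmd: subprocess.run(cmd, check=True), commands))
    return [out_dir / f"{img.stem}.pdf" for img in images]


# Usage (hypothetical folder names):
#   pdfs = ocr_pages(Path("scans"), Path("ocr_out"), workers=4)
```

This leaves you with one PDF per page, so you would still need a separate step to merge them back into a single document.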

Bouchehog
Dec 19, 2002

The Campaign for Badger Rights

thebigcow posted:

If you can script things, have the time to set something up, and hate yourself: https://github.com/tesseract-ocr/tesseract

I have none of these things, but thank you anyway.



I'll give FoxIt a go and see what it's like.

Bouchehog
Dec 19, 2002

The Campaign for Badger Rights
OK, so benchmarking with a typical file (289 pages of text with some images and colour pages) I get the following:

  • Original file: 14mb
  • Acrobat XI v11.0.23.22 12m01s 87.9mb
  • FoxIt v9 (not-quick) 13m06s 16.3mb
  • FoxIt v9 (quick mode) 13m03s 16.3mb
  • Nitro Pro (64-bit) 17m47s 28.2mb

So Acrobat is actually the quickest, albeit Foxit's file size makes it worthwhile given that a lot of the pdfs I am working with are 2,000+ pages.

Oddly, using Task Manager's Resource Monitor and Core Temp the CPU never really went above 50% on any core for any of the programs (Acrobat sat around 45% and Foxit at 34%). My RAM is only at 50% and the SSD was barely used at all. I'm not sure what is actually bottlenecking this.

Any ideas? Any other suggestions for OCR software?

Methylethylaldehyde
Oct 23, 2004

BAKA BAKA

Bouchehog posted:

really went above 50% on any core for any of the programs (Acrobat sat around 45% and Foxit at 34%). My RAM is only at 50% and the SSD was barely used at all. I'm not sure what is actually bottlenecking this.

Any ideas? Any other suggestions for OCR software?

Without getting into the horrible realm of scriptable enterprise document processing solutions [Starting at the low low price of $17,000], you're gonna just have to queue up a bunch of these in Foxit and let it chew on them.

The OCR is most likely entirely single threaded, or at best running on two threads, so the most CPU it can use is about 100%/(# of cores on your machine) per thread. What's probably happening is that one thread is handling the conversion of the compressed text image to something that the OCR engine likes to see, while the other is handling the actual OCR. One thread is always waiting for the other to complete, so it's only using part of a single core.

Outside of big boy document processing systems, you won't see enough batching or parallelism to get 100% CPU utilization. Just find one that lets you queue up a ton of documents to be OCR'd in a batch, then let it chew on them overnight.
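To put rough numbers on that ceiling, here's a quick sketch. It assumes the i5-4670K's 4 cores with no Hyper-Threading, and it treats the 45% and 34% figures quoted above as overall CPU usage, which is an assumption. The function names are made up for illustration.

```python
def overall_cpu_percent(busy_threads: int, logical_cores: int) -> float:
    """Upper bound on total CPU usage when only `busy_threads` threads can run."""
    usable = min(busy_threads, logical_cores)
    return 100.0 * usable / logical_cores


def threads_worth(overall_percent: float, logical_cores: int) -> float:
    """How many fully busy threads a given overall CPU percentage corresponds to."""
    return overall_percent / 100.0 * logical_cores


CORES = 4  # i5-4670K: 4 cores, no Hyper-Threading

for threads in (1, 2, 4):
    print(f"{threads} busy thread(s): at most {overall_cpu_percent(threads, CORES):.0f}% overall")

for name, pct in (("Acrobat", 45), ("Foxit", 34)):
    print(f"{name}: about {threads_worth(pct, CORES):.1f} threads' worth of work")
```

On those assumptions, both programs sit somewhere between one and two threads' worth of work, which fits the one-or-two-thread guess. The scheduler moving a thread between cores spreads that load across all four graphs without raising the total.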

Internet Explorer
Jun 1, 2005





Is just throwing all this stuff in Evernote and having it OCR it an option?

BangersInMyKnickers
Nov 3, 2004

I have a thing for courageous dongles

What's happening on the individual cores when you do the OCR pass in the software? Are you seeing just two cores loaded while the others are idle? Considering it should be trivial to batch pages to ensure that there is always something being processed, the CPU should be at 100% the whole time for any of those products.

e: Can you OCR two docs side-by-side and max it out? Boom, I've doubled your productivity.

Bouchehog
Dec 19, 2002

The Campaign for Badger Rights

Methylethylaldehyde posted:

The OCR is most likely entirely single threaded, or at best running on two threads, so the max you can get your CPU to use is gonna be 100/(# of cores on your machine), so what's probably happening is one thread is handling the conversion of the compressed text image to something that the OCR engine likes to see, while the other is handling the actual OCR, and one thread is waiting for the other to complete, so it's using part of a single core.
Interesting. It doesn't show as one core being taxed and the others idling but I guess that it wouldn't. It's still not overly clear to me why none of the cores is at 100%. In the absence of a RAM/SSD bottleneck I still can't see what the issue is.


Internet Explorer posted:

Is just throwing all this stuff in Evernote and having it OCR it an option?
Didn't even think of this. I'll give it a go.

BangersInMyKnickers posted:

What's happening on the individual cores when you do the OCR pass in the software? Are you seeing just two cores loaded while the others are idle? Considering it should be trivial to batch pages to ensure that there is always something being processed, the CPU should be at 100% the whole time for any of those products.

e: Can you OCR two docs side-by-side and max it out? Boom, I've doubled your productivity.
All four cores are shifting between c.22-45% with periodic spikes above this. I'd have to have another look to work out a proper pattern but from recollection two of the four jumped closer to 45% more frequently. It certainly wasn't the case that three cores were idling and one was maxed out.

I would be interested to see what happens if I OCR two or more documents at once. I don't generally OCR more than one file a day but I suppose that I could split it, OCR it and then reassemble if it's significantly quicker to do this. I'm not sure that I'd bother for a ten minute job but for larger files it might help.


Update
On an entirely related side-note I had purchased a second-hand i7-4790K but I think that the socket on my mobo was damaged when I removed my old CPU. I'm going to have a play this afternoon but I've decided just to go with a new build. If anyone kindly fancies checking that I haven't done something stupid in my selections the post is here.

Nam Taf
Jun 25, 2005

I am Fat Man, hear me roar!

You can probably get the filesize of the Adobe version down quite a lot depending on how it's storing the text (Text over an image, or converting it to actual text on a white page, for example). Play with OCR settings and test.

It'll look like it's processing on all cores but none at 100% because the operating system will constantly chop and change which core it uses. Try running multiple instances of the apps (if your licence allows) and doing two simultaneous queues that way, and see if each goes twice as slow or whether the whole job speeds up. You may not get a true 2x speedup but you might see, say, 1.75x happily.
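The split, OCR and reassemble idea can be sketched roughly like this. It is only an illustration: `fake_ocr` is a hypothetical stand-in for whatever actually does the OCR on a page range (none of the GUI programs in this thread is known to expose such a call), and the names are made up.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def split_pages(total_pages: int, parts: int) -> list[tuple[int, int]]:
    """Divide pages 1..total_pages into contiguous (first, last) ranges."""
    base, extra = divmod(total_pages, parts)
    ranges, start = [], 1
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue  # more parts than pages
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def run_in_parallel(ranges, ocr_range: Callable[[int, int], str]) -> list[str]:
    """Run ocr_range on every range at once; results come back in page order."""
    with ThreadPoolExecutor(max_workers=max(1, len(ranges))) as pool:
        return list(pool.map(lambda r: ocr_range(*r), ranges))


def fake_ocr(first: int, last: int) -> str:
    # Hypothetical placeholder: a real version would launch an OCR tool on
    # these pages and return the path of the resulting file.
    return f"pages_{first:04d}-{last:04d}.pdf"


# The 289-page benchmark file split four ways, one chunk per core.
print(run_in_parallel(split_pages(289, 4), fake_ocr))
```

Because the results come back in page order, the chunks can be merged straight back into one document afterwards.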

Bouchehog
Dec 19, 2002

The Campaign for Badger Rights
Makes sense. I've upgraded to a new i7-8700K build so once I've set everything up I will re-run the benchmarks and see what difference it has made...

Bouchehog
Dec 19, 2002

The Campaign for Badger Rights
FYI, the updated benchmark on my new system is 8 minutes and 40 seconds, 3 minutes and 21 seconds faster (c.28% faster).

I've also been playing around with the OCR settings, I have a choice of three options in my version of Acrobat (Standard XI v11.0.23.22) - searchable image, searchable image (Exact) and ClearScan.

  • Searchable image: 8m 40s (600dpi), 87.8mb
  • Searchable image (Exact): 6m 41s, 15.6mb
  • ClearScan: 14m 27s (600dpi), 59.2mb

I have been using searchable image; this de-skews the image, which is helpful most of the time. The ClearScan option is very similar - it gives slightly better quality text output and a smaller filesize at the expense of time. Searchable image (Exact) is the clear winner on time; it doesn't de-skew the image (a mixed blessing) but gives much smaller file sizes and is much quicker. Given this, I'm going to stick with Acrobat I think.
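For a sense of scale, the timings above can be turned into pages per minute and a rough estimate for a 2,000-page job. This is a back-of-envelope sketch that assumes the same 289-page benchmark file was used each time and that OCR time scales linearly with page count, neither of which is guaranteed.

```python
PAGES = 289  # the benchmark file from earlier in the thread


def seconds(minutes: int, secs: int) -> int:
    """Convert a minutes-and-seconds timing to plain seconds."""
    return minutes * 60 + secs


timings = {
    "Searchable image": seconds(8, 40),
    "Searchable image (Exact)": seconds(6, 41),
    "ClearScan": seconds(14, 27),
}

for mode, t in timings.items():
    pages_per_min = PAGES / (t / 60)
    est_2000 = 2000 / pages_per_min  # minutes, assuming linear scaling
    print(f"{mode}: {pages_per_min:.1f} pages/min, ~{est_2000:.0f} min for 2,000 pages")

# Old system (12m01s) versus new system (8m40s), both in Searchable image mode.
old, new = seconds(12, 1), seconds(8, 40)
print(f"saved {old - new}s, {100 * (old - new) / old:.0f}% less time")
```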

Bouchehog
Dec 19, 2002

The Campaign for Badger Rights
Hmmm. So whilst I can obviously run two (or more) instances of Acrobat, running OCR stops me being able to use the other instances until the OCR completes (I can't even read documents in them). Whilst I could break the document down and run OCR on one half in Acrobat and on the other half in Foxit, this is an expensive and rather inelegant way of dealing with things.

I guess that I just have to hang on until someone comes out with a multi-threaded version of their software. Astonishing that they haven't really...!

BangersInMyKnickers
Nov 3, 2004

I have a thing for courageous dongles

The Foxit dev teams were surprisingly receptive when I had occasion to give them a feature request. Shoot them an email, maybe it's something they can slap together for you.

Morholt
Mar 18, 2006

Contrary to popular belief, tic-tac-toe isn't purely a game of chance.
Piggybacking on this thread, I want to OCR a 250-page PDF and translate it from Turkish. What's a reasonable method for this?
