BlankSystemDaemon
Mar 13, 2009



fletcher, who lives up to his status as a poster in the digital packrats thread, has a script that can be used for archiving.
It generates HTML files with working pagination, and on top of that it also grabs the images and includes them locally, so you don't have to rely on image hosts staying up.
The only "problem" with it is that it sometimes generates zero-byte files, but you can work around that by simply deleting the zero-byte files and running the script again.
In case it's not obvious, it needs the Python modules requests, beautifulsoup4, and html5lib, and it's meant to be installed with pip.
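
I haven't gone through all of fletcher's code, so this isn't his implementation - just a rough sketch of the general approach with those modules (the function names, the pagenumber parameter, and the output filenames are all made up for illustration):
code:
# Sketch only, not fletcher's actual script.
# Dependencies: pip install requests beautifulsoup4 html5lib
from pathlib import Path

import requests
from bs4 import BeautifulSoup


def fetch_page(thread_url, page):
    """Grab one page of a thread and collect the image URLs on it."""
    resp = requests.get(thread_url, params={"pagenumber": page}, timeout=30)
    resp.raise_for_status()
    # html5lib is the lenient parser the script pulls in
    soup = BeautifulSoup(resp.text, "html5lib")
    image_urls = [img["src"] for img in soup.find_all("img") if img.get("src")]
    return soup, image_urls


def delete_empty_files(outdir):
    """The zero-byte workaround: remove empty output files, then re-run the script."""
    for f in Path(outdir).glob("*.html"):
        if f.stat().st_size == 0:
            f.unlink()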


BlankSystemDaemon
Mar 13, 2009



SubNat posted:

How does this stack up vs SneezeOfTheDecade's script? If one of them does a good job + preserves pages then it might be better for me to just use that, if I can wrangle up python for it.
If I understand Sneeze's code and execution right (it's what I started with), the script he's made seems to just grab the HTML files directly without archiving the images, so you either have to hope that imgur, or whatever image hosts are used, stay up, or live with the fact that the images aren't really reliable.
fletcher's script is made to grab every single image it can (everything that doesn't 404, time out, or otherwise fail, at any rate) and include it in the archived page.
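
To be concrete about what "include it in the archived page" means - this is just my guess at the approach, not fletcher's actual code; download each image next to the HTML and point the src attribute at the local copy:
code:
# Guess at the approach, not the actual code: save each image locally and
# rewrite the <img> tag so the archived page no longer depends on the host.
import os

import requests


def localize_image(img_tag, outdir):
    url = img_tag.get("src")
    if not url:
        return
    local_name = os.path.basename(url.split("?")[0]) or "image"
    local_dir = os.path.join(outdir, "images")
    os.makedirs(local_dir, exist_ok=True)
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()  # 404s raise here (timeouts raise at requests.get above)
    with open(os.path.join(local_dir, local_name), "wb") as fh:
        fh.write(resp.content)
    img_tag["src"] = os.path.join("images", local_name)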

Another difference is that with fletcher's script you just use the dropdown menu or page links to navigate, whereas with the other one you have to open each page in succession - I realize this is a minor issue, but it's still there.

Also, fletcher's script preserves the look of the forums by including the stylesheets.

fletcher's script does take up a lot of disk space; a ~450-page thread took almost 900MB, and I can't imagine how much the cat pictures thread would take up, at over 6000 pages.
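(That works out to roughly 2MB per page, so at the same rate a 6000+ page thread would land somewhere north of 12GB.)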

Please understand, I don't want SneezeOfTheDecade to feel like their effort has been wasted - I really appreciate it, as it got me started archiving the stuff I care most about, and it's written in such a way that it's easy to include in a command-line one-liner, which, honestly, is the biggest part.


BlankSystemDaemon
Mar 13, 2009



fletcher fixed the error handling so that images that fail to download no longer prevent the page itself from being saved; if you've installed it, run pip install --upgrade <url> to grab the latest update.
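
In other words (my paraphrase of the fix, not the actual code), a broken image is now caught and skipped while the page still gets written out, something along these lines:
code:
# Paraphrase of the behaviour, not fletcher's code: a failed image download is
# logged and skipped; the page itself is still saved afterwards.
def download_images(image_tags, outdir):
    for img in image_tags:
        try:
            localize_image(img, outdir)  # localize_image as in the earlier sketch
        except Exception as exc:
            # previously an error here could stop the whole page from being saved;
            # now the original remote URL is simply left in place
            print(f"skipping {img.get('src')}: {exc}")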

BlankSystemDaemon
Mar 13, 2009



SneezeOfTheDecade posted:

Absolutely no problem at all - I want people to use the best tool they have available, whether or not it's My Tool. :)

That said, my tool (gonna link it again) now collects the CSS and scripting in the header, converts page links to be relative (so you can click 'em), and optionally downloads images with the --images flag. :toot:
More archiving for the archiving gods! :black101:
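
A rough idea of what that might look like under the hood (not the actual tool - just a sketch; only the --images flag name comes from the post above, and the pagenumber parameter and page-N.html filenames are assumptions):
code:
# Sketch only: an optional --images flag plus rewriting page links to relative
# local filenames so they're clickable in the archive.
import argparse
from urllib.parse import parse_qs, urlparse


def build_parser():
    parser = argparse.ArgumentParser(description="archive a thread")
    parser.add_argument("thread_url")
    parser.add_argument("--images", action="store_true",
                        help="also download images instead of leaving remote URLs")
    return parser


def relativize_page_links(soup):
    """Turn pagination links into local filenames like page-3.html."""
    for a in soup.find_all("a", href=True):
        query = parse_qs(urlparse(a["href"]).query)
        if "pagenumber" in query:
            a["href"] = f"page-{query['pagenumber'][0]}.html"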

BlankSystemDaemon
Mar 13, 2009



As a card-carrying member of the digital packrats thread, I can confirm that these archiving tools ABSOLUTELY have a use.
