BlankSystemDaemon
Mar 13, 2009



fletcher, who lives up to his status as a poster in the digital packrats thread, has a script that can be used for archiving.
It generates HTML files with working pagination, and on top of that it also grabs the images and includes them locally, so you don't have to rely on image hosts staying up.
The only "problem" with it is that it sometimes generates zero-byte files, but you can work around that by simply deleting the zero-byte files and running the script again.
In case it's not obvious, it needs the Python modules requests, beautifulsoup4, and html5lib, and it's meant to be installed with pip.
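
I haven't gone through all of fletcher's code, so this isn't his implementation - just a rough sketch of the general approach with those modules (the function names, the pagenumber parameter, and the output filenames are all made up for illustration):
code:
# Sketch only, not fletcher's actual script.
# Dependencies: pip install requests beautifulsoup4 html5lib
from pathlib import Path

import requests
from bs4 import BeautifulSoup


def fetch_page(thread_url, page):
    """Grab one page of a thread and collect the image URLs on it."""
    resp = requests.get(thread_url, params={"pagenumber": page}, timeout=30)
    resp.raise_for_status()
    # html5lib is the lenient parser the script pulls in
    soup = BeautifulSoup(resp.text, "html5lib")
    image_urls = [img["src"] for img in soup.find_all("img") if img.get("src")]
    return soup, image_urls


def delete_empty_files(outdir):
    """The zero-byte workaround: remove empty output files, then re-run the script."""
    for f in Path(outdir).glob("*.html"):
        if f.stat().st_size == 0:
            f.unlink()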


BlankSystemDaemon
Mar 13, 2009



SubNat posted:

How does this stack up vs SneezeOfTheDecade's script? If one of them does a good job + preserves pages then it might be better for me to just use that, if I can wrangle up python for it.
If I understand Sneeze's code and execution right (it's what I started with), the script he's made seems to just grab the HTML files directly without archiving the images, so you either have to hope that imgur, or whatever image hosts are used, stay up, or live with the fact that the images aren't really reliable.
fletcher's script is made to grab every single image it can (everything that doesn't 404, time out, or otherwise fail, at any rate) and include it in the archived page.
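
To be concrete about what "include it in the archived page" means - this is just my guess at the approach, not fletcher's actual code; download each image next to the HTML and point the src attribute at the local copy:
code:
# Guess at the approach, not the actual code: save each image locally and
# rewrite the <img> tag so the archived page no longer depends on the host.
import os

import requests


def localize_image(img_tag, outdir):
    url = img_tag.get("src")
    if not url:
        return
    local_name = os.path.basename(url.split("?")[0]) or "image"
    local_dir = os.path.join(outdir, "images")
    os.makedirs(local_dir, exist_ok=True)
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()  # 404s raise here (timeouts raise at requests.get above)
    with open(os.path.join(local_dir, local_name), "wb") as fh:
        fh.write(resp.content)
    img_tag["src"] = os.path.join("images", local_name)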

Another difference is that with fletcher's script you just use the dropdown menu or page links to navigate, whereas with the other one you have to open each page in succession - I realize this is a minor issue, but it's still there.

Also, fletcher's script preserves the look of the forums by including the stylesheets.

fletcher's script does take up a lot of disk space; a ~450-page thread took almost 900MB, and I can't imagine how much the cat pictures thread would take up, at over 6000 pages.
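(That works out to roughly 2MB per page, so at the same rate a 6000+ page thread would land somewhere north of 12GB.)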

Please understand, I don't want SneezeOfTheDecade to feel like their effort has been wasted - I really appreciate it, as it got me started archiving the stuff I care most about, and it's written in such a way that it's easy to include in a command-line one-liner, which, honestly, is the biggest part.


BlankSystemDaemon
Mar 13, 2009



fletcher fixed the error handling so that images that fail to download no longer prevent the page itself from being saved; if you've installed it, run pip install --upgrade <url> to grab the latest update.
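
In other words (my paraphrase of the fix, not the actual code), a broken image is now caught and skipped while the page still gets written out, something along these lines:
code:
# Paraphrase of the behaviour, not fletcher's code: a failed image download is
# logged and skipped; the page itself is still saved afterwards.
def download_images(image_tags, outdir):
    for img in image_tags:
        try:
            localize_image(img, outdir)  # localize_image as in the earlier sketch
        except Exception as exc:
            # previously an error here could stop the whole page from being saved;
            # now the original remote URL is simply left in place
            print(f"skipping {img.get('src')}: {exc}")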

BlankSystemDaemon
Mar 13, 2009



SneezeOfTheDecade posted:

Absolutely no problem at all - I want people to use the best tool they have available, whether or not it's My Tool. :)

That said, my tool (gonna link it again) now collects the CSS and scripting in the header, converts page links to be relative (so you can click 'em), and optionally downloads images with the --images flag. :toot:
More archiving for the archiving gods! :black101:
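
A rough idea of what that might look like under the hood (not the actual tool - just a sketch; only the --images flag name comes from the post above, and the pagenumber parameter and page-N.html filenames are assumptions):
code:
# Sketch only: an optional --images flag plus rewriting page links to relative
# local filenames so they're clickable in the archive.
import argparse
from urllib.parse import parse_qs, urlparse


def build_parser():
    parser = argparse.ArgumentParser(description="archive a thread")
    parser.add_argument("thread_url")
    parser.add_argument("--images", action="store_true",
                        help="also download images instead of leaving remote URLs")
    return parser


def relativize_page_links(soup):
    """Turn pagination links into local filenames like page-3.html."""
    for a in soup.find_all("a", href=True):
        query = parse_qs(urlparse(a["href"]).query)
        if "pagenumber" in query:
            a["href"] = f"page-{query['pagenumber'][0]}.html"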

BlankSystemDaemon
Mar 13, 2009



As a card-carrying member of the digital packrats thread, I can confirm that these archiving tools ABSOLUTELY have a use.
