
[–]LarrySwinger2 3 insightful - 2 fun - (3 children)

Note that youtube-dl can scrape entire playlists and channels. Be sure to archive as much as possible while you can. We have significant storage on the Cassandra server, so we can make them available there.
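For example, something along these lines should pull down a whole playlist or channel (the URL is a placeholder; check `youtube-dl --help` for the exact options in your version):

```shell
# Grab every video in a playlist/channel, skip failures, and record
# finished IDs so an interrupted run can resume where it left off.
youtube-dl --ignore-errors \
  --download-archive archive.txt \
  --output "%(uploader)s/%(title)s-%(id)s.%(ext)s" \
  "https://www.youtube.com/playlist?list=PLxxxxxxxx"
```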

[–]JasonCarswell 2 insightful - 2 fun - (2 children)

Actually, we don't have "significant" storage, but we do have room for quite a bit: a handful of terabytes, with the ability to add more.

Separately, I've already archived ten 4 TB drives of YouTube stuff alone, on many topics, including lots of conspolitics. Those drives aren't online yet, though: I still need some boxes to put them in, a UPS, maybe PeerTube, etc. (I have other drives with other stuff (books, music, TV, movies, documentaries, etc.) that aren't really good for sharing due to copyright tyranny. I doubt my claim would stand that I'm not sharing them so much as using them for fair-use sampling.)

If you could set up some kind of archival system on Cassandra, or on the sister servers I hope to get, along with a tutorial for dummies, we could all queue stuff up (in moderation).

/u/zyxzevn and /u/Robin had a great idea here:
/s/CorbettCommenters/comments/6hck/solution_wiki_for_corbett_report_and_others/

It would be nice to have bots that could go through all of SaidIt, CorbettReport, etc., scrape pages and/or download all links into an archive, and at the same time build a wiki table list that can be sorted by topic/sub/hashtag, article/media date, shared date, article/media source, shared source, etc. That wiki table list could be mirrored on WikiSpooks, InfoGalactic, GiraffeIdeas.wiki, etc. The archived media could be shared via IPFS rather than clogging up the wikis unnecessarily. Of course the bot(s) would leave a comment behind saying the data has been backed up, with links to the wiki lists, the IPFS info, and of course the archives.
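For the wiki-table-list part, here's a minimal Python sketch of what a bot might emit. The column names are invented examples; the sortable-table markup is standard MediaWiki, which WikiSpooks and InfoGalactic both use:

```python
# Hypothetical sketch: render scraped metadata as a sortable MediaWiki
# table, one row per archived item. Field names are invented examples.
def wiki_table(items):
    """Return MediaWiki markup for a sortable table of archived links."""
    lines = ['{| class="wikitable sortable"',
             '! Topic !! Media date !! Source !! Archive link']
    for item in items:
        lines.append('|-')
        lines.append('| {topic} || {date} || {source} || {archive}'.format(**item))
    lines.append('|}')
    return '\n'.join(lines)
```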

[–]zyxzevn 2 insightful - 2 fun - (1 child)

Great idea.
I have an HTML reader, but it's in Lazarus.
There are likely many such libraries in Python.
The text can be extracted if it has something like class="Text" on it.
In what format do you want to store it?
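For instance, using only Python's standard-library html.parser (no external dependencies), pulling out the text under a given class might look like this. class="Text" is just the placeholder name from above, and this simple depth counter assumes reasonably well-formed HTML:

```python
from html.parser import HTMLParser

class ClassTextExtractor(HTMLParser):
    """Collect the text inside any tag carrying a given class attribute."""
    def __init__(self, wanted_class):
        super().__init__()
        self.wanted = wanted_class
        self.depth = 0      # > 0 while inside a matching tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        # Count nested tags too, so the matching end tag closes the region.
        # (Unclosed void tags like <br> would throw the count off.)
        if self.depth or self.wanted in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)

def extract_class_text(html, wanted_class="Text"):
    parser = ClassTextExtractor(wanted_class)
    parser.feed(html)
    return "".join(parser.chunks)
```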

[–]JasonCarswell 1 insightful - 2 fun - (0 children)

I don't know what this means, or which formats are preferable, or why. I just want the maximum data at the highest resolution, so that nothing is left behind when folks want it in the future.

Apparently there are open-source archival tools that can web-scrape/snapshot pages. That doesn't seem too difficult, but what do I know. The tricky part (to me) is having it automatically add entries to the meta table lists on the wikis, and post on SaidIt (or other forums, like the Corbett Report) that a page has been archived, after actually archiving it all and sharing it on IPFS.
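The snapshot-and-share half of that pipeline might look something like this. Just a rough sketch: the URL and directory names are placeholders, and it assumes wget and an IPFS daemon are set up:

```shell
# Rough sketch: snapshot a page with wget, add the snapshot to IPFS,
# and print the content hash a bot could paste into a wiki list.
wget --page-requisites --convert-links --adjust-extension \
     --directory-prefix=snapshot "https://example.com/article"
HASH=$(ipfs add -Q -r snapshot)
echo "archived at ipfs://$HASH"
```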