[–]pitterpatterwater[S] 3 insightful - 1 funny - (8 children)

My problem is that we're going to lose /r/internetcollection if reddit goes under, as well as lots of other neat communities.

[–]magnora7 4 insightful - 1 funny - (7 children)

Perhaps eventually someone could develop an automated process to port those subreddits over to saidit subs

[–]d3rr 4 insightful - 1 funny - (6 children)

i think it's doable. their api only goes back 1000 posts but it could be screen scraped.

[–]pitterpatterwater[S] 4 insightful - 1 funny - (5 children)

We at /r/internetcollection maintain a stickied list of links to previous posts, so that isn't a problem. The important bit is the text in the posts, which contains a short description, archive and source links, and categorisation-related info.

[–]d3rr 3 insightful - 1 funny - (4 children)

yeah wow, you guys are seriously organized.

[–]pitterpatterwater[S] 4 insightful - 1 funny - (3 children)

/u/snallygaster deserves the credit; I just became an approved submitter fairly recently. He's the one who maintains the list and posted most of the linked stuff.

Anyways, I'm thinking a Python script would be sufficient. Problem is that it's nearly 300 posts; I need a method which won't use up my bandwidth downloading it, aka I need to get familiar with website scraping and the Reddit API.

[–]d3rr 2 insightful - 1 funny - (2 children)

In Python world I recommend Beautiful Soup for scraping, and I'd put a delay in there or they will block your IP. Sounds like a fun project. I'd help but I'm overwhelmed with this site already.

[–]pitterpatterwater[S] 2 insightful - 1 funny - (1 child)

I'm a bit busy myself; I'll post it here once I'm done with it. Can probably be generalised to a reddit archival tool. Do you know what the delay should be?

[–]d3rr 1 insightful - 1 funny - (0 children)

if you are scraping at your leisure, I'd put it high, like a random 30 seconds to 2 mins between requests.

yeah man, throw her up on github, it could prove very useful to a lot of people.
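A minimal sketch of the archiver discussed in this thread, in Python: walk the stickied list of post links, pull each post's text, and sleep a random 30 seconds to 2 minutes between requests as suggested above. It assumes Reddit's public `.json` endpoints (appending `.json` to a post URL returns its data as JSON), which sidesteps HTML scraping for the post text itself; the function names and user-agent string are hypothetical, and error handling is omitted.

```python
import json
import random
import time
import urllib.request

# Hypothetical identifying user agent; Reddit tends to block generic ones.
USER_AGENT = "internetcollection-archiver/0.1"

def polite_delay(low=30.0, high=120.0):
    """Random wait between requests, per the 30 s to 2 min suggestion above."""
    return random.uniform(low, high)

def parse_post(data):
    """Pull title and self-text out of a post's .json payload.

    A comments-page .json is a two-element list: the post listing first,
    then the comment listing.
    """
    post = data[0]["data"]["children"][0]["data"]
    return {"title": post["title"], "selftext": post["selftext"]}

def fetch_post(url):
    """Fetch one post via Reddit's JSON endpoint (post URL + '.json')."""
    req = urllib.request.Request(
        url.rstrip("/") + ".json", headers={"User-Agent": USER_AGENT}
    )
    with urllib.request.urlopen(req) as resp:
        return parse_post(json.load(resp))

def archive(urls):
    """Walk the list of post links, sleeping between requests."""
    posts = []
    for url in urls:
        posts.append(fetch_post(url))
        time.sleep(polite_delay())
    return posts
```

For the full sub beyond the stickied list, the same delay logic would apply to paging through the listing endpoints (which is where the 1000-post API limit mentioned above bites).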