[–] Netweasel (Continuing the struggle) (18 children)

> how many subs are there in total, even a ballpark figure.

Well, Maniak grabbed and dove into the Pushshift Data Dump, looking for WayoftheBern, and found it in a "ranked by posts+comments" ranking at #714 out of about 13.5 million alleged subreddits, each with an individual name.

We're on line 714 out of 13575389

Best "ballpark" I can give you. Upper bound on the "number of subreddits": 13,575,389. You could theoretically crawl through the same database and see how many of those 13.5 million would count as "actual" subs, if you defined the term by "total comments+posts."

Simply see what line the smallest "actual" sub is on, and Bob's your uncle.

[–] Maniak 🥃😾 (17 children)

> Well, Maniak grabbed and dove into the Pushshift Data Dump, looking for WayoftheBern, and found it in a "ranked by posts+comments" ranking at #714 out of about 13.5 million alleged subreddits, each with an individual name.

Addendum to this, I used a "subreddit_counts.txt" file that was made specifically for this purpose, and those subreddits include the "user subs", with a fuckton of subs named "u_{username}", which explains the 13.5 million.

Going up to the first lines where the count is at least 1000 brings me to r/discountharmony at 1000, still with a lot of user subs above that. Whether or not they should be counted as proper subs is another question.

In any case, and that's without having information about which ones are actually active, the number of subs that have any kind of significance when it comes to usage and traffic is way, way, WAY below the 2.8 million number that was asserted in the thread above.

The 100k mark (for posts + comments) is crossed at line 12075 with r/imagesofflorida at 100001.

On the other end of the list, the #1 sub is r/askreddit with 746,740,850 posts+comments, that one is participating in the blackout.

#2 is r/politics, that one is as usual being an establishment bitch.

Then r/funny, r/pics, r/worldnews, r/memes, r/teenagers, r/nba, the only remaining ones above 100M. Of those, only r/worldnews and r/memes are not participating.

So basically, as far as content is concerned, the top 8 subs have more than 1.5 billion posts+comments. Of those, about 1.2 billion are blacked out.
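The "mark is crossed at line N" figures above can be reproduced with a simple count. This is a minimal sketch, assuming the counts have already been loaded as a list sorted from largest to smallest; the function name and the toy numbers are hypothetical, not taken from the Pushshift file.

```python
from typing import Iterable


def rank_at_threshold(counts_desc: Iterable[int], threshold: int) -> int:
    """Return how many entries have a count of at least `threshold`.

    When the list is sorted from largest to smallest, this is also the
    1-based line number of the last entry that still meets the threshold.
    """
    return sum(1 for count in counts_desc if count >= threshold)


# Hypothetical toy data, not real Pushshift numbers.
toy_counts = [900, 500, 120, 100, 99, 3, 1]
print(rank_at_threshold(toy_counts, 100))  # 4
```

Run against the real list with a threshold of 100,000, a function like this should land on the line number quoted above, provided the file really is sorted by count.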

[–] Netweasel (Continuing the struggle) (13 children)

> with a fuckton of subs named "u_{username}",

Don't suppose you could send a "string counter" program through to tell how many "u/" and "r/" strings there are in the database?

[–] Maniak 🥃😾 (12 children)

9,501,204 matches for lines starting with u_, which leaves 4,074,185 others.

If I remove the lines with a count of 1 (because those are clearly not 'actual' subs): 4,616,932 "u_*" and 3,037,203 others.

With a count of at least 10: 1,023,366 u_*, 1,277,125 others.

The number of 'real subs' drops fast.
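For anyone who wants to redo this split, here is a minimal sketch. It assumes each line of the counts file is a name followed by whitespace and a count; that layout is a guess, since the actual format of "subreddit_counts.txt" is not shown in the thread, and the sample lines are made up.

```python
from typing import Iterable, Tuple


def parse_line(line: str) -> Tuple[str, int]:
    """Split one 'name count' line (assumed format) into its parts."""
    name, count = line.strip().rsplit(None, 1)
    return name.lower(), int(count)


def split_counts(lines: Iterable[str], min_count: int = 1) -> Tuple[int, int]:
    """Count (user subs, other subs) among entries with at least `min_count`."""
    user_subs = 0
    others = 0
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        name, count = parse_line(line)
        if count < min_count:
            continue
        # Literal prefix test: "ukraine" starts with "u" but not with "u_".
        if name.startswith("u_"):
            user_subs += 1
        else:
            others += 1
    return user_subs, others


# Hypothetical sample lines, not real data.
sample = ["askreddit 50", "u_alice 12", "u_bob 1", "ukraine 9", "tiny 1"]
print(split_counts(sample))      # (2, 3)
print(split_counts(sample, 2))   # (1, 2)
print(split_counts(sample, 10))  # (1, 1)

# Sanity check on the figures quoted above: the two groups add up to the total.
assert 9_501_204 + 4_074_185 == 13_575_389
```

Passing `min_count=2` corresponds to dropping the lines with a count of 1, and `min_count=10` to the "at least 10" cut.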

[–] Netweasel (Continuing the struggle) (2 children)

With that "u*" vs "u_*" mixup, might you need to rerun those numbers?

[–] Maniak 🥃😾 (1 child)

Nah, the mixup was on the sql side. These numbers were direct from the text file, where I hadn't forgotten to escape the underscore :)
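The underscore matters because in SQL `LIKE` patterns `_` is a wildcard for any single character, so an unescaped `'u_%'` matches every name that starts with "u" and has at least one more character. A minimal sketch of the difference, using SQLite through Python's `sqlite3` module; the database engine actually used in the thread is not stated, and the table name, column names, and rows here are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table and rows, for illustration only.
conn.execute("CREATE TABLE subs (name TEXT, total INTEGER)")
conn.executemany(
    "INSERT INTO subs VALUES (?, ?)",
    [("u_alice", 12), ("ukraine", 900), ("unixporn", 400), ("askreddit", 5000)],
)


def names(query: str) -> list:
    """Run a query and return the first column, sorted."""
    return sorted(row[0] for row in conn.execute(query))


# Unescaped: "_" matches any single character, so "ukraine" is caught too.
unescaped = names("SELECT name FROM subs WHERE name LIKE 'u_%'")

# Escaped: "\_" means a literal underscore once an ESCAPE character is declared.
escaped = names(r"SELECT name FROM subs WHERE name LIKE 'u\_%' ESCAPE '\'")

print(unescaped)  # ['u_alice', 'ukraine', 'unixporn']
print(escaped)    # ['u_alice']
```

A plain-text prefix match or a regex anchored on `^u_` has no such wildcard, which is consistent with the text-file numbers being unaffected.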

[–] Netweasel (Continuing the struggle) (0 children)

Cool.

[–] Netweasel (Continuing the struggle) (1 child)

There's another thing in this, which is much more complicated...

As I understand it, the PushShift numbers are aggregate totals. If a subreddit blew up for a month and then died off two years ago, those huge numbers would still be sitting there.

If you could subtract the February numbers from the March numbers, you could get the March activity alone.
But, as I said, complicated.
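The subtraction itself is the easy part; getting two dated snapshots is what is hard. A minimal sketch, assuming two hypothetical snapshots of cumulative totals are available as name-to-count mappings (the names and numbers below are invented):

```python
from typing import Dict


def monthly_activity(previous: Dict[str, int], current: Dict[str, int]) -> Dict[str, int]:
    """Subtract the earlier cumulative totals from the later ones.

    A sub missing from the earlier snapshot is treated as starting at zero.
    """
    return {name: total - previous.get(name, 0) for name, total in current.items()}


# Hypothetical cumulative totals at the end of February and March.
feb = {"steady": 1000, "burned_out": 50000}
mar = {"steady": 1300, "burned_out": 50000, "newcomer": 40}

print(monthly_activity(feb, mar))
# {'steady': 300, 'burned_out': 0, 'newcomer': 40}
```

In this toy example the sub with the huge aggregate total shows zero activity for the month, which is exactly the case the aggregate ranking hides.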

[–] Maniak 🥃😾 (0 children)

Hence why the API is needed. Because I sure as shit am not going to download multiple multi-TB torrents and process them manually in order to do this :)

[–] Netweasel (Continuing the struggle) (6 children)

> 4,074,185 others.

Now we're getting toward reasonable numbers......

Tougher database manipulation question, can you delete every "/u" line and port what's left to a different file?

If you can, then follow up with checking https://subredditstats.com/ on the 100,000th "/r."

It's called "chasing the lower bound." If you then check the 50,000th one, then the 20,000th one, then the 10,000th one... you'll probably see a great jump between two of them. The "lower bound" would probably be between those two.

I figure it would have to be below [a higher number than] 5000.
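"Chasing the lower bound" can be sketched as: sample activity at a few ranks, then look for the largest relative drop between neighbouring samples. This is only an illustration under assumptions; the function name is made up, the activity measure (say, comments per day looked up by hand) is a choice, and the numbers below are hypothetical.

```python
from typing import List, Tuple


def biggest_drop(samples: List[Tuple[int, float]]) -> Tuple[int, int]:
    """Return the pair of adjacent sampled ranks with the largest activity drop.

    `samples` holds (rank, activity) pairs; the drop is measured as the ratio
    of the better-ranked sub's activity to the next sampled sub's activity.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    ordered = sorted(samples)  # sort by rank, best rank first
    best_pair = (ordered[0][0], ordered[1][0])
    best_ratio = 0.0
    for (rank_a, act_a), (rank_b, act_b) in zip(ordered, ordered[1:]):
        ratio = act_a / max(act_b, 1e-9)  # guard against division by zero
        if ratio > best_ratio:
            best_ratio = ratio
            best_pair = (rank_a, rank_b)
    return best_pair


# Hypothetical activity figures at the four sampled ranks.
hypothetical = [(10_000, 80.0), (20_000, 60.0), (50_000, 5.0), (100_000, 4.0)]
print(biggest_drop(hypothetical))  # (20000, 50000)
```

With these invented numbers the cut-off would sit somewhere between rank 20,000 and rank 50,000; sampling more ranks inside that interval would narrow it further.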

[–] Maniak 🥃😾 (3 children)

Name             Count    Rank
growcastle      133361  #10000
menshealth       39784  #20000
panamacitybeach   6835  #50000
boners            1583 #100000

(counting only those that don't start with u_)

growcastle

panamacitybeach

menshealth

r/boners not found, so I went to the #99999:

bourbontrade same count (1583)

[–] Netweasel (Continuing the struggle) (2 children)

Off to the stats page!

Name             Rank      Subscribers  Comments/day  Posts/day
boners           #100,000  "not found"
panamacitybeach  #50,000         3,324            24          1
menshealth       #20,000        11,170            12          5
growcastle       #10,000        34,966            60         12

For comparison...

r/WayoftheBern (rank <1000): Subscribers: 87,991 (that's odd, it does not show on that page; no matter). Comments per day: 93. Posts per day: 17.

Hmm. Perhaps the blackout is skewing numbers. Maybe this should be checked next week.


Update: r/WayoftheBern: comments in past 24 hours: zero. Posts in past 24 hours: zero.


For comparison, a random archive shows 25 posts in 12 hours, and from memory there are usually 25 comments in less than two hours.

[–] Maniak 🥃😾 (1 child)

Oh wait, I fucked up the query removing the u_, it removed all those starting with u :)

That'll teach me to go too fast.

So:

Name                   Count     Rank
wayofthebern         3492849      713
makeupflatlays        136923    10000
winnipeggonewild       41101    20000
chrisdeliauncensored    7059    50000
hl_women_only           1639    99999

wayofthebern makeupflatlays winnipeggonewild chrisdeliauncensored hl_women_only

[–] Netweasel (Continuing the struggle) (0 children)

Name                  Subscribers   Comments/day  Posts/day
hl_women_only         (not listed)  (not listed)          4
chrisdeliauncensored  2,139                    8          1
winnipeggonewild      9,893                  103          9
makeupflatlays        77,408        (not listed)          3

[–] Maniak 🥃😾 (1 child)

I ended up importing it into a quick SQL table because I was getting bored with doing regexes in Notepad++, so... that opens up the queries :)

(then again it's only name + count, so the information is very limited)

[–] Netweasel (Continuing the struggle) (0 children)

> it's only name + count, so the information is very limited)

That's what https://subredditstats.com/ is for. All you need is a name.

[–] Netweasel (Continuing the struggle) (2 children)

> The 100k mark (for posts + comments) is crossed at line 12075 with r/imagesofflorida at 100001.

In my head I was estimating about 15,000 "actual" subreddits, whatever that term would actually mean.

[–] Maniak 🥃😾 (1 child)

Here's a conspiracy theory: what if getting rid of PushShift was seen as a highly profitable move by the cunts-in-power because without a way to easily query the entirety of Reddit and be able to see just how many subs are actually active, with how many actually active users, it's way easier for the executives to make up bullshit numbers in order to get more money from investors?

[–] Netweasel (Continuing the struggle) (0 children)

As I have said for years, while FaceBook is "Weaponized Peer Pressure," Reddit is "Weaponized Autism."

They left data lying around for people to analyze. People will. And have, and are.

And when the numbers do not add up, it becomes obvious that they do not.