
[–]Site_rly_sux 1 insightful - 1 fun - (2 children)

It's not "tried to" - the IA bot is still right there in the ban list:

https://www.nytimes.com/robots.txt

So why does the NYT not want the IA crawler?

Let's see what The Intercept thinks.

They think it's so the NYT can make stealth edits without that one particular archive noticing. For evidence, they point to some stealth edits that fucking everyone noticed:

  1. They edited the tone of an article about Bernie

  2. They removed "death" as one way to get rid of a loan, because sometimes death doesn't discharge the loan

Really?

You fucking pathetic baby snowflakes: you think a major publication is making major changes just so one bot service won't notice edits to "six ways to shed your student debts"?

What a pathetic infantile way of looking at the world, exhibited by OP and the Intercept.

Look again at the robots.txt.

They also ban the ChatGPT bot (GPTBot), and the bot for some crawler called "Omgili".
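You can check for yourself what rules like that actually block, using Python's stdlib robots.txt parser. A minimal sketch - the rules below are a made-up sample in the style of that file, not a copy of the live https://www.nytimes.com/robots.txt:

```python
# Sketch: which crawlers would a robots.txt like this one block?
# The sample rules are illustrative, not the NYT's actual file.
from urllib.robotparser import RobotFileParser

sample = """\
User-agent: ia_archiver
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: omgili
Disallow: /

User-agent: *
Allow: /
"""

rp = RobotFileParser()
rp.parse(sample.splitlines())

# Named bots hit their own "Disallow: /" group; everyone else
# falls through to the permissive "*" group.
for agent in ("ia_archiver", "GPTBot", "omgili", "Googlebot"):
    allowed = rp.can_fetch(agent, "https://www.nytimes.com/some-article")
    print(agent, "allowed" if allowed else "blocked")
```

Agent matching is case-insensitive substring matching, so `GPTBot` and `gptbot` hit the same group.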

It's totally implausible, and totally pathetically paranoid, for OP to assume this is about a cover-up instead of normal web-crawler reasons.

Hey, maybe they just don't want to render web pages for non-human visitors. That's up to them. Assuming it's a conspiracy to hide the six ways to lose your student debt is pathetic, paranoid conspiracy bullshit.

OP, there's something seriously wrong with you if you read and believed the fake news linked here.

[–]SueBoyle 2 insightful - 1 fun - (1 child)

I have written web crawler software before. It was small-scale, for my personal use, but my crawler did not need to respect the instructions in the robots.txt file.

So basically, whoever's running this crawler could just crawl The New York Times' website whether the Times likes it or not.

There is no police force out there that's going to arrest you for ignoring the instructions in a robots.txt file.

In fact there's a twist: analyzing the robots.txt file can give you clues about where to find sensitive documents you actually want to crawl.
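That twist is trivial to demonstrate: robots.txt is purely advisory, and a crawler that ignores it can instead harvest the Disallow lines as a list of "interesting" paths. A sketch with a made-up example file (the paths are hypothetical):

```python
# Sketch: robots.txt compliance is voluntary. Instead of obeying it,
# harvest the Disallow paths -- the site just told us what it wants hidden.
# The file content below is a made-up example, not any real site's rules.
sample = """\
User-agent: *
Disallow: /admin/
Disallow: /internal-drafts/
Disallow: /staging/
"""

disallowed = [
    line.split(":", 1)[1].strip()      # keep everything after "Disallow:"
    for line in sample.splitlines()
    if line.lower().startswith("disallow:")
]
print(disallowed)  # the paths the site asked bots to skip
```

Nothing in HTTP enforces those rules; it's the crawler's own code that chooses whether to consult them before fetching a URL.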

[–]neolib 1 insightful - 1 fun - (0 children)

Yeah, archive.today/.is doesn't respect robots.txt, for example, unlike archive.org.