SueBoyle:

I have written web crawler software before, though only on a small scale for my personal use. My crawler was under no obligation to respect the instructions in the robots.txt file.

So basically, whoever is running this crawler could just crawl the New York Times website whether the New York Times likes it or not.

There is no police force out there that's going to arrest you for not obeying the instructions in a robots.txt file.
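To make that concrete, here is a rough sketch of how the check usually looks from the crawler's side, using Python's standard `urllib.robotparser`. The robots.txt content, the `example-crawler` user agent, the example.com URLs, and the `respect_robots` flag are all made up for illustration. The point is only that the check lives in the crawler's own code, so skipping it is a one-line decision.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from a string so no network access is needed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed_by_robots(url: str, user_agent: str = "example-crawler") -> bool:
    """Return what robots.txt *asks* of this user agent for the given URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


def should_crawl(url: str, respect_robots: bool) -> bool:
    """Decide whether to fetch a URL; honoring robots.txt is opt-in (assumed flag)."""
    if respect_robots:
        return allowed_by_robots(url)
    # Nothing technical stops a crawler from ignoring the file entirely.
    return True


if __name__ == "__main__":
    url = "https://example.com/private/report.pdf"
    print(should_crawl(url, respect_robots=True))
    print(should_crawl(url, respect_robots=False))
```

A polite crawler would leave the flag on; an impolite one simply never calls the parser.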

In fact, there's a twist on this: if you analyze the robots.txt file, its Disallow entries might give you clues about where to find the sensitive documents that you really want to crawl.
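As a minimal sketch of what "analyzing" the file could mean, the snippet below just pulls the `Disallow` values out of a robots.txt string. The sample content and the `disallowed_paths` helper are hypothetical, and the parsing is deliberately simplistic (it ignores which user agent each rule belongs to).

```python
# Hypothetical robots.txt content, used only for illustration.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /internal-reports/   # not meant for search engines
Allow: /public/
Disallow:
"""


def disallowed_paths(robots_txt: str) -> list[str]:
    """Collect every non-empty Disallow value, ignoring comments and user-agent groups."""
    paths = []
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments, then split "Field: value" on the first colon.
        line = raw_line.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


if __name__ == "__main__":
    for path in disallowed_paths(SAMPLE_ROBOTS_TXT):
        print(path)
```

The file is public by design, so anything a site lists there is visible to anyone who asks for it.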

neolib:

Yeah, archive.today (archive.is), for example, doesn't respect robots.txt, unlike archive.org.