Some people think they can just box robots out. “Oh, you need a login to get in; that’ll keep out the robots.” That’s what David Heinemeier Hansson, creator of Rails, said. He was wrong. Google software that ran on users’ computers ended up exploring even pages behind the log-in requirement, meaning the robots clicked on all the “Delete” links, meaning robots deleted all the content. (Hansson, for his part, responded by whining about the injustice of it all.) Don’t let this happen to you.

Luckily, that genius Tim Berners-Lee (see previous chapter) anticipated all this and took precautions. You see, when you visit a website, you don’t just ask the server for the URL, you also tell it what kind of request you’re making. Here’s a typical HTTP/1.0 request:

GET /about/ HTTP/1.0

The first part (“GET”) is called the method, the second (“/about/”) is the path, and the third (“HTTP/1.0”) is obviously the version. GET is the method we’re probably all familiar with—it’s the normal method used whenever you want to get (GET it?) a page. But there’s another method as well: POST.
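
To make that concrete, here’s a minimal sketch in Python of sending such a request, using the standard library’s http.client module. The host example.com is just a placeholder, and the library speaks HTTP/1.1 rather than HTTP/1.0, but the shape of the request is the same:

# A minimal sketch: send "GET /about/" to a server and print what comes back.
# example.com is a placeholder host, not anything from the book.
import http.client

conn = http.client.HTTPConnection("example.com")
conn.request("GET", "/about/")             # the method and the path
response = conn.getresponse()
print(response.status, response.reason)    # e.g. 200 OK
print(response.read()[:200])               # the first chunk of the page body
conn.close()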

If you think of URLs as little programs sitting inside a server somewhere, GET can be thought of as just running the program and getting a copy of its output, whereas POST is more like sending it a message. Indeed, POST requests, unlike GET requests, come with a payload. A message is attached at the bottom, for the URL to do with as it wishes.
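
Here’s the same kind of sketch for a POST, again with made-up names (example.com, /blog/add, and the form fields are purely illustrative); the point is just that the message rides along in the body of the request:

# A minimal sketch of a POST: the payload travels in the request body.
# The host, path, and field names are invented for illustration.
import http.client

payload = "title=Hello&body=My+first+post"
headers = {"Content-Type": "application/x-www-form-urlencoded"}

conn = http.client.HTTPConnection("example.com")
conn.request("POST", "/blog/add", payload, headers)   # method, path, body, headers
response = conn.getresponse()
print(response.status, response.reason)
conn.close()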

POST is intended for requests that actually do something that messes with the order of the universe (or, in the jargon, “changes state”), instead of just trying to figure out what’s what. So, for example, reading an old news story is a GET, since you’re just trying to figure stuff out, but adding to your blog is a POST, since you’re actually changing the state of your blog.

(Now, if you want to be a real jerk about it, you can say that all requests mess with the state of the universe. Every time you request an old news story, it uses up electricity, and moves the heads around on disk drives, and adds a line to the server’s log, and puts a note in your NSA file, and so on. Which is all true, but pretty obviously not the sort of thing we had in mind, so let’s not mention it again. (Please, NSA?))

The end result is pretty clear. It’s fine if Google goes and reads old news stories, but it’s not OK if it goes around posting to your blog. (Or worse, deleting things from it.) Which means that reading the news story has to be a GET and blog deleting has to be a POST.
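
To see what that rule looks like from the server’s side, here’s a minimal sketch using Python’s built-in http.server module. The in-memory “blog” and its behavior are invented for the example: everything that merely reads lives in do_GET, and everything that changes the blog lives only in do_POST, so a robot that just follows links can never add or delete anything.

# A minimal sketch of the rule, server-side: GET only reads, POST changes state.
# The in-memory "blog" and the paths are invented for illustration.
from http.server import BaseHTTPRequestHandler, HTTPServer

posts = ["An old news story"]   # stand-in for the blog's stored content

class BlogHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Reading: safe for any robot that follows links.
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write("\n".join(posts).encode())

    def do_POST(self):
        # Changing state: adding (or deleting) content happens only here, so a
        # crawler that only issues GETs can never trigger it.
        length = int(self.headers.get("Content-Length", 0))
        posts.append(self.rfile.read(length).decode())
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"Added.\n")

if __name__ == "__main__":
    HTTPServer(("", 8000), BlogHandler).serve_forever()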