Python & the New York Times API

For my data mining class I had to use Python to grab some articles through the NY Times API. That's easier done than said!

You can find all the instructions here, and it's all quite easy. You have to register on the website to get an API key for your application. You need to know which API you'll use before requesting the key. In our case we only used the Times Newswire API, so that's the only one I requested a key for.

Once you've got the key, you can start coding.
With Python we need just a few lines of code to get everything done.
You have to use the urllib2 and json libraries. I suggest requesting the response as JSON, since with the json library you can easily convert it into a Python dictionary.
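To give an idea of the JSON-to-dictionary step, here is a minimal sketch. The response body below is made up for illustration (its field names are an assumption, not the documented Newswire payload); only json.loads is the real thing.

```python
import json

# Hypothetical response body, invented for this example only;
# the real API response has its own fields.
raw = '{"status": "OK", "results": [{"title": "A"}, {"title": "B"}]}'

# json.loads turns JSON text into plain Python dicts and lists
data = json.loads(raw)

# From here on it's ordinary dictionary and list access
titles = [item["title"] for item in data["results"]]
```

In a real run, raw would be the text read from the HTTP response instead of a hard-coded string.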

The two main difficulties I found in writing the code for this application were setting up the URL for the request and handling the errors. Actually, since most of the errors were caused by a wrong URL, I can say that the URL was my nightmare for a while.

In fact you have to "build up" your own URL. I must say the NYT API docs are quite clear about this. Here you can find the ones that concern the API in our example.

The base structure is the following:


http://api.nytimes.com/svc/news/{version}/content/{source}/{section}[/time-period][.response-format]?api-key={your-API-key} 


Where you have to replace the placeholders in braces and brackets with what you need. For example, version should currently be v3. With source you indicate whether you want articles from The New York Times only (nyt), The International Herald Tribune only (iht), or both (all). The time period is optional. Beware that if you go beyond the time period you set, you'll get Error 404 (I spent a whole afternoon figuring this out). As suggested, the response format should be json, but xml is also available.
After the question mark you can insert your query. The important thing is that it contains your API key. It can also contain other fields for search queries or for the offset, in case you're looking for more than the last 20 articles.
Exactly: each call to this URL gets you at most 20 results. If you're looking for more (I needed 30,000 of them for my assignment) you have to play with the offset field.
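The URL structure above can be sketched as a small helper. This is only an illustration under my own assumptions: the function name, its parameters and the defaults are hypothetical, and it uses Python 3's urllib.parse rather than anything from urllib2.

```python
from urllib.parse import urlencode

BASE = "http://api.nytimes.com/svc/news"

def build_newswire_url(api_key, source="nyt", section="all", version="v3",
                       time_period=None, response_format="json", offset=None):
    """Assemble a request URL following the template in the post.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    path = "{}/{}/content/{}/{}".format(BASE, version, source, section)
    if time_period is not None:
        # the time period is an optional extra path segment
        path += "/{}".format(time_period)
    path += ".{}".format(response_format)

    params = {}
    if offset is not None:
        params["offset"] = offset
    params["api-key"] = api_key  # the key must always be in the query
    return path + "?" + urlencode(params)

url = build_newswire_url("YOUR_KEY", offset=20)
```

Building the query with urlencode instead of gluing strings together avoids the kind of malformed URL that cost me so much time.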

My API url in the end was:

"http://api.nytimes.com/svc/news/v3/content/nyt/all.json?offset=" + str(offset) + "&api-key={my API key}"

Where offset was a variable that changed at each iteration of a while loop.
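The offset loop can be sketched roughly like this. It is not my actual program: the generator name, the page size constant and the placeholder key are assumptions made for the example, based only on the 20-results-per-call limit mentioned above.

```python
PAGE_SIZE = 20  # each call returns at most 20 results

URL_TEMPLATE = ("http://api.nytimes.com/svc/news/v3/content/nyt/all.json"
                "?offset={offset}&api-key={key}")

def offsets(total, page_size=PAGE_SIZE):
    """Yield the offset for each page needed to cover `total` articles."""
    offset = 0
    while offset < total:
        yield offset
        offset += page_size

# "YOUR_KEY" is a placeholder, not a real key
urls = [URL_TEMPLATE.format(offset=o, key="YOUR_KEY") for o in offsets(60)]
```

With 20 results per call, 30,000 articles means 1,500 requests, so it is worth checking the request limits attached to your key before starting.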

As for the errors, you should handle HTTPError and URLError. Since I didn't want my program to stop when an error of this kind was encountered, I caught the exception and handled it with just a sleep(1) and an increment of the offset.
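A rough sketch of that error handling, assuming Python 3, where the pieces of urllib2 live in urllib.request and urllib.error. The function name and the opener and pause parameters are hypothetical; they are there only so the example can be tried without touching the network.

```python
import json
import time
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

def fetch_page(url, opener=urlopen, pause=1.0):
    """Return the parsed JSON for one page, or None if the request failed.

    Hypothetical helper: `opener` and `pause` are illustrative parameters.
    """
    try:
        response = opener(url)
        try:
            return json.loads(response.read().decode("utf-8"))
        finally:
            response.close()
    except (HTTPError, URLError):
        # Don't stop the whole download: wait a moment and give up on this page
        time.sleep(pause)
        return None
```

The calling loop can then move on to the next offset whether or not the result is None, which matches the "sleep and increment" approach described above. The price is that a failed page is simply skipped.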

Currently I'm still running my program to download my articles. When I'm sure it works fine I'll share the code.

Enjoy!




Comments

  1. Any chance of getting the code now..? :)

    1. Hi Pierre! Yep, I think it's about time! :) Here it is: http://leaguesunderthecode.blogspot.it/2016/10/python-download-articles-from-new-york-times.html
