Hi all!
A couple of years ago I posted on this blog about some issues I had downloading articles from the NY Times API with Python. At the end of that post I promised to share the code snippet some day, and well... that day is today!
I did this as homework for my data mining class, so here is the assignment:
The NY Times API is available at http://developer.nytimes.com. It provides access to various articles, both historic and new. For this assignment we are interested in the Times Newswire API, which provides access to articles, blogs, and so on, as they are being produced, and for each article we can obtain directly from the API its URL and its abstract, among other information.
Your tasks for this problem are the following:
1. Study the API, download 30,000 different articles from the Times Newswire API, and for each of them save the URL and the abstract. Note that because articles are created continually, you may end up downloading some articles multiple times; you should make sure that you do not store each article more than once. Assign a unique docID to each article. Note that this part is not completely trivial because, among other issues, the API often does not return the expected document, so you need to catch the exceptions thrown, add appropriate delays, and retry some steps.
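Before the full script, here is a minimal sketch of what a single Newswire request looks like. It uses the same endpoint, the same "results" list, and the same "url"/"abstract" fields that the full script below relies on; just swap in your own API key.

from urllib2 import urlopen
from json import loads

# One page of the Times Newswire feed (the same endpoint the full script uses).
url = ("http://api.nytimes.com/svc/news/v3/content/nyt/all.json"
       "?offset=0&api-key=[YOUR API KEY GOES HERE]")
data = loads(urlopen(url).read())

# Each entry in data["results"] describes one article; we only need two fields.
for item in data["results"]:
    if 'url' in item and 'abstract' in item:
        print item["url"] + " " + item["abstract"]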
Then I had to do some further processing on the articles, but that's beyond the scope of this post. Let me know if you need any help with the code. Cheers!
from urllib2 import urlopen
import urllib2
from json import loads
import time

posts = list()    # "URL abstract" strings, one per stored article
keys = dict()     # docID -> URL, used to detect duplicates
count = 0         # next docID to assign
offset = 0

while offset < 40000:
    if len(posts) >= 30000:
        break
    # Skip offsets in the 700-800 range.
    if 700 < offset < 800:
        offset = offset + 100
    try:
        url = "http://api.nytimes.com/svc/news/v3/content/nyt/all.json?offset=" + str(offset) + "&api-key=[YOUR API KEY GOES HERE]"
        data = loads(urlopen(url).read())
        print str(len(posts)) + " offset=" + str(offset)
        for item in data["results"]:
            if ('url' in item) and ('abstract' in item):
                url = item["url"]
                abst = item["abstract"]
                if url not in keys.values():    # store each article only once
                    keys[count] = url
                    count = count + 1
                    posts.append(url + " " + abst)
            if len(posts) >= 30000:
                break
    except urllib2.HTTPError as e:
        print e
        time.sleep(1)    # back off for a second, then retry with a new offset
        offset = offset + 21
        continue
    except urllib2.URLError as e:
        print e
        time.sleep(1)
        offset = offset + 21
        continue
    offset = offset + 19

outfile = open("out2.tsv", "w")
for s in posts:
    outfile.write(s.encode("utf-8") + "\n")

indexfile = open("ind2.tsv", "w")
for x in keys.keys():
    indexfile.write(str(x) + " " + str(keys[x]) + "\n")

print str(len(posts))
print str(len(keys))
outfile.close()
indexfile.close()
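One design note on the duplicate check above: "url not in keys.values()" scans every stored URL on each lookup, which works for 30,000 articles but grows linearly as the collection does. If you want something faster, a set of seen URLs does the same job with constant-time membership tests. Here is a small self-contained sketch of that alternative; the example articles are made up, just to show the idea.

posts = []
keys = {}        # docID -> URL, same as before
seen = set()     # URLs already stored
count = 0

articles = [
    {"url": "http://example.com/a", "abstract": "first"},
    {"url": "http://example.com/a", "abstract": "first again"},  # duplicate URL
    {"url": "http://example.com/b", "abstract": "second"},
]

for item in articles:
    url = item["url"]
    if url not in seen:          # O(1) instead of scanning keys.values()
        seen.add(url)
        keys[count] = url
        count = count + 1
        posts.append(url + " " + item["abstract"])

print len(posts), len(keys)      # 2 2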