How to get data from the NYT Developer Network using Python

Ok so here are the steps:

Step1: Go to the NYT Developer Network and request an API key. There are many kinds of APIs available; I’ll go with the Article Search API for now.

Step2: Go to the Article Search API page and click on Read Me at the top right. Scroll down to the “Requests” section and copy the URL that contains q=new+york+times.

Step3: Replace the ### at the end with the key you received by email and open the link. You’ll see something like this:

[Screenshot: the raw JSON response in the browser]

Step4: You will see all the data that the API exposes. Install the JSON Formatter extension for Chrome to see a readable version of it. Here’s how it should look:

[Screenshot: the same response with JSON Formatter applied]

Step5: Look at the formatted data for a second. If you collapse everything, you’ll see first of all that this is a single object: it starts with { and ends with }. Secondly, you’ll see that there are three main elements in this object: response, status and copyright. We’ll expand each of these one by one to understand what lies inside and where the actual data about articles is. Later on we’ll do the same thing using Python.
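Collapsed all the way down, the whole thing has roughly this shape (the values are trimmed here for illustration; check your own output for the exact contents):

{
    "status": "OK",
    "copyright": "...",
    "response": { ... }
}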

Step6: Inside response there are two objects: meta and docs. Inside the docs object there are 10 objects, and each of them is an article. Each article has several properties like web_url, snippet, lead_paragraph, headline and so on. Some of these elements are objects themselves, some are arrays, some are strings and some are numbers. So that’s it, we have seen and understood the data we got from the NYT. It’s structured this way so that we can use code to extract data from it easily and use it for our purposes. So let’s see how we can do that.
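Before we write any code, here is roughly what a single entry inside docs looks like (most fields are omitted and the values are placeholders; the field names shown are the ones we’ll actually use later):

{
    "web_url": "http://www.nytimes.com/...",
    "snippet": "...",
    "lead_paragraph": "...",
    "headline": {
        "main": "..."
    }
}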

Step7: Open the terminal and install pip using the command sudo easy_install pip. You can be in any directory while you do this. We need pip because we will have to install the libraries required to extract data from the JSON using Python.

Step8: Do a Google search for how to extract data from JSON using Python. You’ll find this Stack Overflow example. We will use the code given there to extract data from our JSON using Python.

Step9: Open Sublime Text and save the file as nyt.py (.py is the extension for Python files). In the terminal, change your directory to the folder where you saved the file.

Step10: Now paste the code we found on Stack Overflow. We will start with just this:

import json, requests

url = 'http://api.nytimes.com/svc/search/v2/articlesearch.json?q=new+york+times&page=2&sort=oldest&api-key=2310ba05bce344d98f720ae433ff8e5b'

resp = requests.get(url)
data = json.loads(resp.text)

This imports the json and requests libraries. In url we put the URL we used to access our NYT data earlier in Step4. Next we use the get function from the requests library to fetch that URL and store the response in a variable called resp. resp.text gives us everything as one long string, so we use the loads function of the json library to turn that string into a Python data structure, which we store in a variable called data.
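As a side note, the requests library can also decode the JSON for us in one step, so the last line could equivalently be written as:

data = resp.json()

We’ll keep json.loads in the rest of this walkthrough to stay close to the code above.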

Step11: We can’t import the requests library unless we install it, so we’ll install it using pip. In the terminal, write “sudo pip install requests”. (The json library ships with Python, so there is nothing to install for it.)

Step12: Now we are ready to parse the data using Python and use the terminal to see the output of our requests. To start with a simple command, we can write:

print data

in our Python file. Then we can go to the terminal and run:

python nyt.py

Make sure your terminal is in the same folder as the Python file. The command above should print the entire data we got from the API, just like we saw in the browser window without the JSON Formatter. Here’s how it would look:

[Screenshot: the unformatted output of print data in the terminal]

Again, a single string of data. You can clear the terminal with Cmd+K.

Step13: We can do something about the appearance of this data by using the pprint (“pretty print”) function from the pprint library. To do that, add this alongside the import statements at the top:

from pprint import pprint

Now we can use pprint(data) instead of print data. When we do python nyt.py in the terminal now, we get a better formatted version of the data, like this:

[Screenshot: pprint(data) output in the terminal]
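At this point nyt.py should look roughly like this (with your own API key at the end of the URL in place of ###):

import json, requests
from pprint import pprint

url = 'http://api.nytimes.com/svc/search/v2/articlesearch.json?q=new+york+times&page=2&sort=oldest&api-key=###'

resp = requests.get(url)
data = json.loads(resp.text)

pprint(data)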

Step14: Now we want to start extracting data from it piece by piece, just like using expand/collapse in the browser. For that we use a for loop in Python, like this:

for key in data:
    print key

This prints all the keys it finds in the object data. As we saw in the browser, there are three keys: status, response and copyright. When we do python nyt.py in the terminal, we should get the same thing.

[Screenshot: the three top-level keys printed in the terminal]

Step15: To get the data from each of these three keys we use this:

print data["response"]

or

pprint(data["response"])

This should give us the data inside the response object, which sits inside the data object. Like this:

[Screenshot: the contents of data["response"]]

Step16: We can keep drilling down like this again and again. So we can see all the keys inside response with:

for key in data["response"]:
    print key

and then

print data["response"]["docs"]

and so on.

Step17: If we look at docs, it is not an object but an array. We know this because it starts and ends with [] rather than {}. If we print the first element of this array with

print data["response"]["docs"][0]

or

pprint(data["response"]["docs"][0])

we see that it is data on an article.
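To double-check how many articles are on this page, and to list just the field names of that first article instead of printing everything, we can reuse the same tricks (these lines are only for exploring and aren’t needed later):

print len(data["response"]["docs"])

for key in data["response"]["docs"][0]:
    print key

The first line should print 10, since each page of results holds 10 articles.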

Step18: Now that we have reached the list of articles, we can see that each one has an object called headline, which has a key called main whose value is the actual text of the article’s headline. So how do we print the headlines of all the articles we have? Here’s how:

for key in data["response"]["docs"]:
    print key["headline"]["main"]

This will print all the headlines when we do python nyt.py in our terminal.

Step19: Now, how do we query this database? Remember the URL? Here’s how it looks:

http://api.nytimes.com/svc/search/v2/articlesearch.json?q=new+york+times&page=2&sort=oldest&api-key=###

The q=new+york+times part is the query.

So what we can do is this:

query = "Presidential"

url = "http://api.nytimes.com/svc/search/v2/articlesearch.json?q=" + query + "&page=2&sort=oldest&api-key=###"

This will search the database for the query. Suppose we want all the headlines returned for that query; we just do what we did before:

for key in data["response"]["docs"]:
    print key["headline"]["main"]

Step20: Lastly, what if we want to pass the query from the terminal instead of changing the code every time? For example, what if we want to do something like this in the terminal:

python nyt.py “query”

To do this we import the sys library:

import sys

Now if we write print sys.argv in our Python code and do python nyt.py in the terminal, we get ['nyt.py'] in return. What this means is that sys.argv (argv = argument vector) is a list whose elements are the words of the command given in the terminal. So if we want to do a query by running python nyt.py "query", we need to pass sys.argv[1] into the URL in our Python code.

So we do this:

import requests
import json
from pprint import pprint
import sys

query = sys.argv[1]

url = "http://api.nytimes.com/svc/search/v2/articlesearch.json?q=" + query + "&page=2&sort=oldest&api-key=2310ba05bce344d98f720ae433ff8e5b"
resp = requests.get(url)
data = json.loads(resp.text)

for key in data["response"]["docs"]:
    print key["headline"]["main"]
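Now we can run it from the terminal, for example:

python nyt.py "Presidential"

and it should print the headline of every article returned for that query. (If your query has several words, keep the quotes so it reaches the script as a single argument.)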

That’s it for now.
