Using Tweepy to Extract Content From Twitter
This article will show you how to quickly start fetching tweets from any public Twitter handle or hashtag, or get a list of followers and friends (following) for any public Twitter handle.
Hey Twitter, I’d like to get access to your API
In order to get access to the Twitter API, you will need a developer account. After you apply, it might take a little while to get a response from Twitter, but you should be able to easily generate your keys once your request is accepted.
You should never share your API keys with anybody, or upload them onto GitHub. Simply keep them in a .csv or .txt file on your local device. To follow the tutorial below, you will want to have your keys in the following order and format:
consumer_key, consumer_secret, access_token_key, access_token_secret
Now that we have our developer keys stored in a .csv file, we can open any IDE and import the following modules:
- Tweepy: a Python library that enables developers to interact with the official Twitter API and retrieve publicly available data from Twitter (tweets, retweets, likes, hashtags, etc.).
- Pandas: the tweets we harvest from the Twitter API will be stored into a dataframe.
Let’s get started
import pandas as pd
import tweepy
We first need to open our .csv file and pass its values into a dictionary. Please note that we will have to remove the quotation marks from our string values, as shown on line 4. The names chosen for the keys within the dictionary are entirely arbitrary, so feel free to rename them as you please.
def getKeys(file_name):
    with open(file_name, "r") as keys_csv:
        for key in keys_csv.readlines():
            keys = [k.strip().replace('"', "") for k in key.split(",")]
            # strip() also removes stray spaces and the trailing newline
            twitter_keys = {
                "consumer_key": keys[0],
                "consumer_secret": keys[1],
                "access_token_key": keys[2],
                "access_token_secret": keys[3]
            }
    return twitter_keys
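A quick way to check the file format without touching real credentials is to parse a throwaway file. The sketch below is only an illustration: the file contents are placeholders, and parse_key_line is a hypothetical helper (not part of Tweepy) that mirrors the quote-stripping step described above.

```python
import os
import tempfile


def parse_key_line(line):
    """Split one comma-separated line into the four named keys."""
    names = ["consumer_key", "consumer_secret",
             "access_token_key", "access_token_secret"]
    # strip() drops spaces and the trailing newline; replace() drops quotes
    values = [v.strip().replace('"', "") for v in line.split(",")]
    if len(values) != len(names):
        raise ValueError("expected 4 comma-separated values")
    return dict(zip(names, values))


# Placeholder values only; never commit real keys
sample = '"aaa", "bbb", "ccc", "ddd"\n'
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

with open(path) as keys_csv:
    parsed = parse_key_line(keys_csv.readline())
os.remove(path)

print(parsed["access_token_secret"])  # ddd
```

Raising an error when the line does not contain exactly four values makes a malformed key file fail early, rather than later during authentication.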
Our next step is to create a function that will use the values stored in the aforementioned dictionary. To do so, we pass our first set of two keys as arguments to OAuthHandler(), and save the result into a variable named auth. We then pass the remaining two keys as arguments to the set_access_token method on the variable that we just created.
def getAccess(twitter_keys):
    auth = tweepy.OAuthHandler(
        twitter_keys["consumer_key"],
        twitter_keys["consumer_secret"]
    )
    auth.set_access_token(
        twitter_keys["access_token_key"],
        twitter_keys["access_token_secret"]
    )
    api = tweepy.API(
        auth,
        wait_on_rate_limit=True
    )
    return api
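At a high level, this is more or less how Tweepy interacts with the Twitter API: we hand our consumer keys to OAuthHandler(), attach the access tokens, and receive an API object through which every later request is sent.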
Wait, I can’t fetch as many tweets as I want to?
You have probably noticed that the final step has an optional argument named wait_on_rate_limit, but what does that mean? Well, let’s see what the Tweepy documentation has to say about that.
wait_on_rate_limit: Whether or not to automatically wait for rate limits to replenish
In other words, there’s a catch. We are unfortunately limited in the amount of content that we can get from the Twitter API (this limit is imposed by Twitter, not by Tweepy). Twitter’s developer documentation lists the exact limits.
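To make the idea of waiting for a rate limit to replenish more concrete, here is a minimal sketch of the general pattern. It is not Tweepy’s actual implementation: call_with_wait, RateLimited and reset_at are hypothetical names, and the fake endpoint stands in for a real API call.

```python
import time


class RateLimited(Exception):
    """Raised by the fake endpoint; carries the epoch time of the reset."""

    def __init__(self, reset_at):
        super().__init__("rate limit reached")
        self.reset_at = reset_at


def call_with_wait(func, sleep=time.sleep, now=time.time):
    """Call func(); if it is rate limited, sleep until the reset and retry."""
    while True:
        try:
            return func()
        except RateLimited as err:
            # Never sleep a negative amount if the reset is already past
            sleep(max(0.0, err.reset_at - now()))


# Fake endpoint: refuses the first call, succeeds on the second
attempts = []


def fake_endpoint():
    attempts.append(1)
    if len(attempts) == 1:
        raise RateLimited(reset_at=time.time() + 0.1)
    return "ok"


result = call_with_wait(fake_endpoint)
print(result, len(attempts))  # ok 2
```

The caller never sees the first failure; the request simply takes longer, which is roughly the behaviour to expect when wait_on_rate_limit is enabled.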
Getting our first tweets
The following part is entirely optional, but creating something that resembles a struct will help keep the code clean and easy to debug.
def getStruct():
    data = {
        "created": [],
        "author": [],
        "favorites": [],
        "retweets": [],
        "tweet": [],
        "replying_to": [],
        "quoted": [],
        "place": [],
        "favorited": [],
        "retweeted": [],
        "geo": []
    }
    return data
Most of the time, some of the keys in the above dictionary will contain no values. This is particularly true for the "geo" or "quoted" keys.
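One practical consequence of this dictionary-of-lists layout is that every list must have the same length before it can be turned into a dataframe, so missing fields need a placeholder. The sketch below uses a hypothetical helper, append_row, and made-up records to show how absent values can be padded with None; it is an illustration rather than part of the tutorial’s code.

```python
def append_row(data, row):
    """Append one record to a dict of lists, padding missing fields with None."""
    for column in data:
        data[column].append(row.get(column))


data = {"author": [], "tweet": [], "geo": []}
append_row(data, {"author": "alice", "tweet": "hello"})            # no geo
append_row(data, {"author": "bob", "tweet": "hi", "geo": "Paris"})

# Every column now has the same number of entries
lengths = {len(values) for values in data.values()}
print(data["geo"], lengths)  # [None, 'Paris'] {2}
```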
Our next step is to wrap the three functions we just created into a fourth and final function, which will also contain a smaller nested function named getUser().
We will need to define the following four arguments when calling the function:
- choice: this parameter takes a single letter, either u or q. Entering u means that we are fetching tweets for a specific user handle, while entering q allows us to query Twitter for a particular set of strings, or a hashtag.
- user: the Twitter handle whose tweets we want when choice is u; pass None otherwise.
- query: the search terms or hashtag to look for when choice is q; pass None otherwise.
- volume: the maximum number of tweets to fetch.
Example 1: getTweets("u","ID_AA_Carmack",None,20) to get the latest 20 tweets from John Carmack.
Example 2: getTweets("q",None,"matplotlib", 15) to get the latest 15 tweets about Matplotlib.
The nested getUser() function looks up the screen name of a tweet’s author and returns the string “Unknown” if the lookup fails.
def getTweets(choice, user, query, volume):
    def getUser(id_user):
        # Look up the author's screen name; fall back to "Unknown" on failure
        try:
            return api.get_user(id=id_user).screen_name
        except Exception:
            return "Unknown"

    keys = getKeys("tweepy.csv")
    api = getAccess(keys)
    data = getStruct()
    if choice == "u":
        cursor = tweepy.Cursor(
            api.user_timeline,
            id=user,
            tweet_mode="extended"
        ).items(volume)
    elif choice == "q":
        # api.search is the Tweepy 3.x name (renamed search_tweets in 4.x)
        cursor = tweepy.Cursor(
            api.search,
            q=query,
            tweet_mode="extended"
        ).items(volume)
    else:
        # Stop here: without a cursor there is nothing to build a dataframe from
        raise ValueError('Wrong input: choice must be "u" or "q"')
    # Both branches return the same kind of object, so one loop is enough
    for c in cursor:
        data["created"].append(c.created_at)
        data["author"].append(getUser(c.user.id))
        data["favorites"].append(c.favorite_count)
        data["retweets"].append(c.retweet_count)
        data["tweet"].append(c.full_text)
        data["replying_to"].append(c.in_reply_to_screen_name)
        data["quoted"].append(c.is_quote_status)
        data["place"].append(c.place)
        data["favorited"].append(c.favorited)
        data["retweeted"].append(c.retweeted)
        data["geo"].append(c.geo)
    df = pd.DataFrame(data)
    df["time"] = pd.to_datetime(df["created"]).dt.time
    df["created"] = pd.to_datetime(df["created"]).dt.to_period("D")
    return df
The long block of code above is simpler than it looks. Tweepy’s Cursor() object handles all the pagination work and the parameters for us.
Once getAccess() has returned an authenticated API object, Tweepy’s Cursor() performs different actions depending on the parameters we entered. If we entered u for User, the cursor calls api.user_timeline and looks up the username passed through id=user. However, if we entered q for Query, it calls api.search and looks for whichever search terms we passed through q=query.
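The pagination that Cursor() hides can be pictured with a small stand-alone sketch. This is not Tweepy’s code: fetch_page is a hypothetical stand-in for an API call that returns one page of results plus the position of the next page, and iter_items only mimics the idea behind .items(volume).

```python
def fetch_page(start, page_size=5, total=12):
    """Fake API call: return one page of items and the next start (or None)."""
    end = min(start + page_size, total)
    items = [f"tweet-{i}" for i in range(start, end)]
    next_start = end if end < total else None
    return items, next_start


def iter_items(limit):
    """Yield up to `limit` items, requesting new pages only when needed."""
    start, yielded = 0, 0
    while start is not None and yielded < limit:
        items, start = fetch_page(start)
        for item in items:
            if yielded >= limit:
                return
            yield item
            yielded += 1


first_seven = list(iter_items(7))
print(first_seven[-1], len(first_seven))  # tweet-6 7
```

Because the generator stops as soon as the limit is reached, a small limit means fewer page requests, which is why a low volume value helps to stay under the rate limit.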
The rest is straightforward: we loop through the results fetched by Tweepy’s Cursor(), and append them as values to their corresponding keys within the dictionary that was created when calling the getStruct() function. The last two lines before the return statement add a time column to the Pandas dataframe and reduce the created column to a daily period.
Important: when playing around with the code above, I highly recommend setting the volume parameter to 3 or 4 tweets max. As explained earlier, we want to avoid reaching the limit of tweets we can pull.
Here’s what happens when we run the following code: getTweets("u","ID_AA_Carmack",None,20)
Who’s following who
Last but not least, and as described in the opening lines of this article, we can also return all the followers and friends from any public user, making some slight changes to our previous function:
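def getUserInfo(twitter_handle, volume):
    keys = getKeys("tweepy.csv")
    api = getAccess(keys)
    # api.followers / api.friends are the Tweepy 3.x names
    # (renamed get_followers / get_friends in 4.x)
    followers = [f.screen_name for f in tweepy.Cursor(api.followers, screen_name=twitter_handle).items(volume)]
    following = [f.screen_name for f in tweepy.Cursor(api.friends, screen_name=twitter_handle).items(volume)]
    # pd.Series lets the two columns differ in length; gaps are filled with NaN
    df = pd.DataFrame({"following": pd.Series(following), "followers": pd.Series(followers)})
    return df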
Again, Tweepy’s Cursor() handles the pagination for us, this time over the api.followers and api.friends methods. Below are the first rows from the dataframe that is returned when passing the following parameters to our newly created function:
getUserInfo("TDataScience", 30)
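The two lists returned for a handle will often differ in length, which is why getUserInfo() wraps each one in pd.Series before building the dataframe: shorter columns are padded rather than raising an error. The same row-by-row padding can be pictured with the standard library alone. The sketch below is only an analogy, using made-up handles, and it fills gaps with None where pandas would use NaN.

```python
from itertools import zip_longest

followers = ["ann", "ben", "cat"]   # made-up handles
following = ["dan"]

# Pair the lists row by row, filling the shorter one with None
rows = list(zip_longest(following, followers, fillvalue=None))
for row in rows:
    print(row)
```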