There was a time when you could easily get access to Facebook’s API, harvest data and do whatever you wanted with it. It was so easy, in fact, that it led to a global scandal of stolen information. Now, however, Facebook is extra stringent about who has access to its data.
One of the victims of this was the Instagram API. Now, if you are looking to export and analyse your Instagram data, you’ll either need to turn to authorised third-party social data analysis tools, or go the extra mile and try to convince Facebook of your innocence by sending a screencast of exactly what your app is supposed to do.
While researching the confusing requirements for getting access to the Facebook API, I stumbled upon a rather simple alternative. Instagram’s web app is built with React (another Facebook innovation), which exposes the data displayed on the page through a JavaScript variable. This eliminates the need to manually write a parser to pick the right information out of the page, because all the information is already loaded into the `_sharedData` variable.
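You can verify this for yourself before writing any scraper code. The short sketch below (the profile URL and user agent string are just examples) fetches a public profile page and checks that the `window._sharedData` assignment is present in the returned HTML; it should print `True` as long as Instagram still embeds the data this way:

```python
import requests

# Fetch a public profile page and look for the embedded _sharedData blob.
# The profile URL below is only an example; any public profile should work.
html = requests.get(
    'https://www.instagram.com/instagram/',
    headers={'User-Agent': 'Mozilla/5.0'}
).text

print('window._sharedData' in html)  # True if the data is embedded in the page
```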
In this post, I’ll show how to write a simple Python script that can periodically fetch data for your profile (followers) and your latest posts, and write it to a MySQL database for analysis.
## Install Necessary Packages
For this task we’ll need just three Python packages: Requests to make HTTP requests from Python, Beautiful Soup to parse the Instagram web app and fetch the `_sharedData` variable, and PyMySQL to connect to the database and write the latest data.
Install the necessary packages with pip commands:
```bash
pip install requests beautifulsoup4 pymysql
```
## Write An Instagram Scraper Object
In a Python file, start by importing the modules we’ll be using. In addition to the packages we’ve just installed, we’ll also need standard library modules like `datetime` and `json`.
```python
from random import choice
from datetime import datetime
import json
import requests
from bs4 import BeautifulSoup
import pymysql
import pymysql.cursors
```
Next, we’ll define a `USER_AGENTS` variable which holds a few options for mimicking a browser request to the Instagram web app. The script will randomly choose one of the values you define here. This minimises the risk of getting blocked for making requests that look like spam.
```python
USER_AGENTS = ['Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36']
```
Next up, we’ll write an `InstagramScraper` class, which accepts the URL string for our destination page (i.e. the link to the Instagram profile page). The `InstagramScraper` object will expose two methods, `page_metrics` and `post_metrics`.
Once we instantiate an object, we’ll be able to call either of these methods to get the page or post data we want.
```python
class InstagramScraper:
    def __init__(self, url, user_agents=None):
        self.url = url
        self.user_agents = user_agents

    def __random_agent(self):
        if self.user_agents and isinstance(self.user_agents, list):
            return choice(self.user_agents)
        return choice(USER_AGENTS)

    def __request_url(self):
        try:
            response = requests.get(
                self.url,
                headers={'User-Agent': self.__random_agent()})
            response.raise_for_status()
        except requests.HTTPError:
            raise requests.HTTPError('Received non-200 status code.')
        except requests.RequestException:
            raise requests.RequestException
        else:
            return response.text

    @staticmethod
    def extract_json(html):
        soup = BeautifulSoup(html, 'html.parser')
        body = soup.find('body')
        script_tag = body.find('script')
        raw_string = script_tag.text.strip().replace('window._sharedData =', '').replace(';', '')
        return json.loads(raw_string)

    def page_metrics(self):
        results = {}
        try:
            response = self.__request_url()
            json_data = self.extract_json(response)
            metrics = json_data['entry_data']['ProfilePage'][0]['graphql']['user']
        except Exception as e:
            raise e
        else:
            for key, value in metrics.items():
                if key != 'edge_owner_to_timeline_media':
                    if value and isinstance(value, dict):
                        value = value['count']
                    results[key] = value
        return results

    def post_metrics(self):
        results = []
        try:
            response = self.__request_url()
            json_data = self.extract_json(response)
            metrics = json_data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']['edges']
        except Exception as e:
            raise e
        else:
            for node in metrics:
                node = node.get('node')
                if node and isinstance(node, dict):
                    results.append(node)
        return results
```
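As a side note, the constructor also accepts an optional `user_agents` list, so if you would rather have the scraper rotate through your own pool of user agent strings instead of the module-level default, you can pass it in explicitly, for example:

```python
# Optional: supply your own user agent pool when creating the scraper.
instagram = InstagramScraper('https://www.instagram.com/said_tezel/?hl=en', user_agents=USER_AGENTS)
```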
Now that we have defined our class, we can create a scraper object and call either of the methods.
```python
# Define the URL for the profile page.
url = 'https://www.instagram.com/said_tezel/?hl=en'

# Initiate a scraper object and call one of the methods.
instagram = InstagramScraper(url)
post_metrics = instagram.post_metrics()
```

Now a list of the most recent 12 posts, together with their metrics, is assigned to the `post_metrics` variable. If you wish, you can issue a `print(post_metrics)` command here to review the data.

Next up, we will push the key post metrics, such as the update time (when we fetched the data), post ID, post time, likes, comments and direct image link, to a MySQL database. Before you go on, make sure that you have created a MySQL table with the necessary fields set up. If you need any help with setting up a database and table, see the [MySQL documentation](https://dev.mysql.com/doc/refman/8.0/en/creating-database.html).

## Updating MySQL Table With Data

Within the Python script, we'll start a connection to the MySQL database, iterate through all the post metrics we have collected in `post_metrics` and create new rows in the table.

```python
# First, create a connection to your MySQL server.
# Make sure you change the host, user, password and database name values.
connection = pymysql.connect(host=SERVER_URL,
                             user=DB_USER,
                             password=DB_PASSWORD,
                             db=DB_NAME,
                             charset='utf8mb4',
                             cursorclass=pymysql.cursors.DictCursor)
```
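If you haven't created the `data` table yet, here is a minimal sketch of a schema that matches the columns we insert below. The table name comes from the script itself; the column types are my own assumptions, so adjust them to your needs:

```python
# A minimal, assumed schema for the `data` table; adjust types and sizes as needed.
create_sql = """
CREATE TABLE IF NOT EXISTS `data` (
    `update_time` DATETIME,
    `post_id` VARCHAR(64),
    `post_time` DATETIME,
    `post_likes` INT,
    `post_comments` INT,
    `post_media` VARCHAR(2048),
    `post_is_video` TINYINT(1)
)
"""

with connection.cursor() as cursor:
    cursor.execute(create_sql)
connection.commit()
```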
Next up, we will iterate through the metrics for each post and write them to the database.
```python
# Set a datetime object to label the update time for the data in the database.
update_time = datetime.now().strftime('%Y-%m-%d %H:%M:%S')

# Iterate through the metrics and write them to the database.
for m in post_metrics:
    i_id = str(m['id'])
    i_post_time = datetime.fromtimestamp(m['taken_at_timestamp']).strftime('%Y-%m-%d %H:%M:%S')
    i_likes = int(m['edge_liked_by']['count'])
    i_comments = int(m['edge_media_to_comment']['count'])
    i_media = m['display_url']
    i_video = bool(m['is_video'])

    with connection.cursor() as cursor:
        sql = "INSERT INTO `data` (`update_time`, `post_id`, `post_time`, `post_likes`, `post_comments`, `post_media`, `post_is_video`) VALUES (%s, %s, %s, %s, %s, %s, %s)"
        cursor.execute(sql, (update_time, i_id, i_post_time, i_likes, i_comments, i_media, i_video))
        connection.commit()

connection.close()
```
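Since the goal also includes profile-level data such as followers, you can log those numbers with the same scraper object. The sketch below is assumption-heavy: it relies on the `edge_followed_by` and `edge_follow` keys that Instagram's profile payload exposed at the time, and it writes to a hypothetical `page_data` table you would need to create separately. If you use it, slot it in just before the `connection.close()` call:

```python
# Fetch profile-level metrics (follower/following counts) with the same scraper.
page_metrics = instagram.page_metrics()

# These key names are assumptions about Instagram's payload;
# run print(page_metrics) first to confirm what your profile actually returns.
followers = int(page_metrics.get('edge_followed_by', 0))
following = int(page_metrics.get('edge_follow', 0))

with connection.cursor() as cursor:
    # Assumes a hypothetical `page_data` table with these three columns.
    sql = "INSERT INTO `page_data` (`update_time`, `followers`, `following`) VALUES (%s, %s, %s)"
    cursor.execute(sql, (update_time, followers, following))
    connection.commit()
```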
That’s it! Now every time you run this script, it will automatically fetch your latest post metrics and collect them in a MySQL database. To automate the process, you can set up a cron job on your server to update the database at set intervals.
If you’d like to create a cron job that updates the data every hour, for example, simply open the crontab editor with `crontab -e` on your server and add the following line at the bottom. Make sure you correctly define the path to your Python script file.
```bash
0 * * * * python PATH_TO_YOUR_PYTHON_SCRIPT
```
You can find the full script, with everything included, on my GitHub repo.