There was a time when you could easily get access to Facebook’s API, harvest data and do whatever you wanted with it. It was so easy, in fact, that it led to a global scandal of stolen information. Now, however, Facebook is extra stringent about who has access to its data.
One of the victims of this was the Instagram API. Now, if you are looking to export and analyse your Instagram data, you’ll either need to turn to authorised third-party social data analysis tools, or go the extra mile and try to convince Facebook of your innocence by sending a screencast of exactly what your app is supposed to do.
While researching the confusing requirements for getting access to the Facebook API, I stumbled upon a rather simple alternative. Instagram’s web app is built with React (another Facebook innovation), which exposes the data displayed on the page through a JavaScript variable. This eliminates the need to manually write a parser to pick the right information out of the page, because all the information is already loaded into the `_sharedData` variable.
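You can verify this for yourself before writing any scraper code. The short sketch below (the profile URL and user agent string are just examples) fetches a public profile page and checks that the `window._sharedData` assignment is present in the returned HTML; it should print `True` as long as Instagram still embeds the data this way:

```python
import requests

# Fetch a public profile page and look for the embedded _sharedData blob.
# The profile URL below is only an example; any public profile should work.
html = requests.get(
    'https://www.instagram.com/instagram/',
    headers={'User-Agent': 'Mozilla/5.0'}
).text

print('window._sharedData' in html)  # True if the data is embedded in the page
```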
In this post, I’ll show how to write a simple Python script that can periodically fetch data for your profile (followers) and your latest posts, and write it to a MySQL database for analysis.
## Install Necessary Packages
For this task we’ll need just three Python packages: Requests to make HTTP requests from Python, Beautiful Soup to parse the Instagram web app and fetch the `_sharedData` variable, and PyMySQL to connect to the database and write the latest data.
Install the necessary packages with pip commands:
```bash
pip install requests beautifulsoup4 pymysql
```
## Write An Instagram Scraper Object
In a Python file, start by importing the modules we’ll be using. In addition to the packages we’ve just installed, we’ll also need standard library modules like `datetime` and `json`.
```python
from random import choice
from datetime import datetime
import json
import requests
from bs4 import BeautifulSoup
import pymysql
import pymysql.cursors
```
Next, we’ll define a `USER_AGENTS` variable which holds a few options for mimicking a browser request to the Instagram web app. The script will randomly choose one of the values you define here. This minimises the risk of getting blocked for making requests that look like spam.
```python
USER_AGENTS = ['Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36']
```
Next up, we’ll write an `InstagramScraper` class, which accepts the URL string for our destination page (i.e. the link to the Instagram profile page). The `InstagramScraper` object will expose two methods, `page_metrics` and `post_metrics`.
Once we instantiate an object, we’ll be able to call either of these methods to get the page or post data we want.
```python
class InstagramScraper:
    def __init__(self, url, user_agents=None):
        self.url = url
        self.user_agents = user_agents

    def __random_agent(self):
        if self.user_agents and isinstance(self.user_agents, list):
            return choice(self.user_agents)
        return choice(USER_AGENTS)

    def __request_url(self):
        try:
            response = requests.get(
                self.url,
                headers={'User-Agent': self.__random_agent()})
            response.raise_for_status()
        except requests.HTTPError:
            raise requests.HTTPError('Received non-200 status code.')
        except requests.RequestException:
            raise requests.RequestException
        else:
            return response.text

    @staticmethod
    def extract_json(html):
        soup = BeautifulSoup(html, 'html.parser')
        body = soup.find('body')
        script_tag = body.find('script')
        raw_string = script_tag.text.strip().replace('window._sharedData =', '').replace(';', '')
        return json.loads(raw_string)

    def page_metrics(self):
        results = {}
        try:
            response = self.__request_url()
            json_data = self.extract_json(response)
            metrics = json_data['entry_data']['ProfilePage'][0]['graphql']['user']
        except Exception as e:
            raise e
        else:
            for key, value in metrics.items():
                if key != 'edge_owner_to_timeline_media':
                    if value and isinstance(value, dict):
                        value = value['count']
                    results[key] = value
        return results

    def post_metrics(self):
        results = []
        try:
            response = self.__request_url()
            json_data = self.extract_json(response)
            metrics = json_data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']['edges']
        except Exception as e:
            raise e
        else:
            for node in metrics:
                node = node.get('node')
                if node and isinstance(node, dict):
                    results.append(node)
        return results
```
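As a side note, the constructor also accepts an optional `user_agents` list, so if you would rather have the scraper rotate through your own pool of user agent strings instead of the module-level default, you can pass it in explicitly, for example:

```python
# Optional: supply your own user agent pool when creating the scraper.
instagram = InstagramScraper('https://www.instagram.com/said_tezel/?hl=en', user_agents=USER_AGENTS)
```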
Now that we have defined our class, we can create a scraper object and call either of the methods.
```python
# Define the URL for the profile page.
url = 'https://www.instagram.com/said_tezel/?hl=en'

# Initiate a scraper object and call one of the methods.
instagram = InstagramScraper(url)
post_metrics = instagram.post_metrics()
```

Now a list of the most recent 12 posts, together with their metrics, is assigned to the `post_metrics` variable. If you wish, you can issue a `print(post_metrics)` command here to review the data.

Next up, we will push the key post metrics, such as the update time (when we fetched the data), post ID, post time, likes, comments and direct image link, to a MySQL database. Before you go on, make sure that you have created a MySQL table with the necessary fields set up. If you need any help with setting up a database and table, see the [MySQL documentation](https://dev.mysql.com/doc/refman/8.0/en/creating-database.html).

## Updating MySQL Table With Data

Within the Python script, we'll start a connection to the MySQL database, iterate through all the post metrics we have collected in `post_metrics` and create new rows in the table.

```python
# First, create a connection to your MySQL server.
# Make sure you change the host, user, password and database name values.
connection = pymysql.connect(host=SERVER_URL,
                             user=DB_USER,
                             password=DB_PASSWORD,
                             db=DB_NAME,
                             charset='utf8mb4',
                             cursorclass=pymysql.cursors.DictCursor)
```
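If you haven't created the `data` table yet, here is a minimal sketch of a schema that matches the columns we insert below. The table name comes from the script itself; the column types are my own assumptions, so adjust them to your needs:

```python
# A minimal, assumed schema for the `data` table; adjust types and sizes as needed.
create_sql = """
CREATE TABLE IF NOT EXISTS `data` (
    `update_time` DATETIME,
    `post_id` VARCHAR(64),
    `post_time` DATETIME,
    `post_likes` INT,
    `post_comments` INT,
    `post_media` VARCHAR(2048),
    `post_is_video` TINYINT(1)
)
"""

with connection.cursor() as cursor:
    cursor.execute(create_sql)
connection.commit()
```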
Next up, we will iterate through the metrics for each post and write them to the database.
```python
# Set a datetime object to label the update time for the data in the database.
update_time = datetime.now().strftime('%Y-%m-%d %H:%M:%S')

# Iterate through the metrics and write them to the database.
for m in post_metrics:
    i_id = str(m['id'])
    i_post_time = datetime.fromtimestamp(m['taken_at_timestamp']).strftime('%Y-%m-%d %H:%M:%S')
    i_likes = int(m['edge_liked_by']['count'])
    i_comments = int(m['edge_media_to_comment']['count'])
    i_media = m['display_url']
    i_video = bool(m['is_video'])

    with connection.cursor() as cursor:
        sql = "INSERT INTO `data` (`update_time`, `post_id`, `post_time`, `post_likes`, `post_comments`, `post_media`, `post_is_video`) VALUES (%s, %s, %s, %s, %s, %s, %s)"
        cursor.execute(sql, (update_time, i_id, i_post_time, i_likes, i_comments, i_media, i_video))
        connection.commit()

connection.close()
```
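Since the goal also includes profile-level data such as followers, you can log those numbers with the same scraper object. The sketch below is assumption-heavy: it relies on the `edge_followed_by` and `edge_follow` keys that Instagram's profile payload exposed at the time, and it writes to a hypothetical `page_data` table you would need to create separately. If you use it, slot it in just before the `connection.close()` call:

```python
# Fetch profile-level metrics (follower/following counts) with the same scraper.
page_metrics = instagram.page_metrics()

# These key names are assumptions about Instagram's payload;
# run print(page_metrics) first to confirm what your profile actually returns.
followers = int(page_metrics.get('edge_followed_by', 0))
following = int(page_metrics.get('edge_follow', 0))

with connection.cursor() as cursor:
    # Assumes a hypothetical `page_data` table with these three columns.
    sql = "INSERT INTO `page_data` (`update_time`, `followers`, `following`) VALUES (%s, %s, %s)"
    cursor.execute(sql, (update_time, followers, following))
    connection.commit()
```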
That’s it! Now every time you run this script, it will automatically fetch your latest post metrics and collect them in a MySQL database. To automate the process, you can set up a cron job on your server to update the database at set intervals.
If you’d like to create a cron job that updates the data every hour, for example, simply open the crontab editor with `crontab -e` on your server and add the following line at the bottom. Make sure you correctly define the path to your Python script file.
```bash
0 * * * * python PATH_TO_YOUR_PYTHON_SCRIPT
```
You can find the full script, with everything included, on my GitHub repo.