BLOG: Jan 5, 2022
Scraping data from social media platforms
When doing digital investigations, data scraping is an important skill to have. Usually, digital investigators leave it to intelligence analysts to scrape data, but it would be much more efficient for investigators to learn the skill themselves.
We have spoken to Joseph Jones, a former British Military and Law Enforcement Intelligence officer with more than 15 years of intelligence-gathering and investigative experience. In this article, he will give his best advice about how to scrape data on social media platforms.
There are some advanced tools and also some easy-to-use tools for data scraping. Joseph says that either way, anyone can learn how to use them.
– When I was teaching a group in Albania about data scraping, none of them had worked in the command-line interface before. But after two or three training sessions with me, they downloaded all kinds of tools and experimented with them, says Joseph.
There are some tools that you probably need an expert to configure; we will talk about that further down in the article.
But the basic tools for scraping one social media platform at a time are pretty straightforward. And if you are a digital investigator, this is a skill you should learn.
– At the moment, digital investigators rely on intelligence analysts to scrape data. But analysts already have a massive workload, so the minor cases have to wait. That’s why I train digital investigators so they can do the scraping themselves. In this way, both analysts and investigators get a smoother, more streamlined workflow, says Joseph.
Python – An easy to learn programming language
Python is a high-level, general-purpose programming language and one of the easiest programming languages for beginners. You need to download and install Python to use most of the data scraping tools we mention in this article.
– Learning Python is essential for digital investigators, but unfortunately, it is not a part of their formal training. That’s one of the reasons why I have teamed up with Paliscope to teach digital investigators to use Python in their daily work, says Joseph.
There are also many tutorials on how to use it on YouTube and other websites. For example, you can learn a lot on Pythonprogramming.net.
Once you have Python installed on your computer, you can start using the data scraping tools.
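A quick way to confirm that the installation worked is to ask Python which version it is running. The short script below is only a sketch: the minimum version of 3.8 is an assumption made for this example, so check the documentation of the tool you plan to use for its real requirement.

```python
import sys

# Assumed minimum version for this example only; each scraping tool
# documents its own requirement.
MINIMUM_VERSION = (3, 8)


def python_is_recent_enough(version_info=sys.version_info, minimum=MINIMUM_VERSION):
    """Return True if the given interpreter version meets the assumed minimum."""
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    print("Running Python", sys.version.split()[0])
    if python_is_recent_enough():
        print("This version meets the assumed minimum.")
    else:
        print("Consider installing a newer version of Python.")
```

Save it as a file, run it from the command line, and you have also taken your first step in the command-line interface.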
Scraping data from Twitter
If you want to scrape data from Twitter, there is a simple tool called Twitter Scraper that you can use.
– Quite simply, with this tool, you can scrape a target Twitter account and get all the tweets from it. It is simple to use, and it will create a Microsoft Excel spreadsheet with all the tweets collected with dates, times, how many likes, how many shares, and so on, says Joseph.
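To give an idea of what such a spreadsheet looks like, here is a minimal sketch that writes a handful of already collected tweets to a CSV file, a format that Excel can open. It does not contact Twitter, and it is not how Twitter Scraper itself is implemented. The column names and the sample tweets are made up for this example.

```python
import csv
import io

# Column names are assumptions for this example; a real tool may use others.
FIELDS = ["date", "time", "text", "likes", "shares"]


def tweets_to_csv(tweets, out):
    """Write a list of tweet dictionaries to an open text file as CSV."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for tweet in tweets:
        # Missing values become empty cells instead of raising an error.
        writer.writerow({field: tweet.get(field, "") for field in FIELDS})


# Made-up sample data standing in for scraped tweets.
sample_tweets = [
    {"date": "2022-01-03", "time": "09:15", "text": "First example tweet", "likes": 12, "shares": 3},
    {"date": "2022-01-04", "time": "18:40", "text": "Second example tweet", "likes": 7},
]

buffer = io.StringIO()
tweets_to_csv(sample_tweets, buffer)
print(buffer.getvalue())
```

To write a real file instead of printing, open one with `open("tweets.csv", "w", newline="", encoding="utf-8")` and pass it to `tweets_to_csv`.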
There is one catch with Twitter, though: you need an API key, which is issued by Twitter. Without that key, you won’t be able to scrape any data at all.
– To get the API key, you need to answer all sorts of questions about what you need it for and why, says Joseph.
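Once you have a key, a common habit is to keep it out of your scripts and read it from an environment variable instead. The sketch below shows one way to do that and to stop early if the key is missing. The variable name `TWITTER_API_KEY` is a hypothetical choice for this example, not something a particular tool requires.

```python
import os

# Hypothetical variable name chosen for this example.
API_KEY_VARIABLE = "TWITTER_API_KEY"


def load_api_key(environ=os.environ, variable=API_KEY_VARIABLE):
    """Return the API key from the environment, or fail with a clear message."""
    key = environ.get(variable, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {variable} environment variable before scraping."
        )
    return key
```

Failing early like this saves you from starting a long scraping job that cannot collect anything.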
Twitter made it more difficult to scrape data after the Cambridge Analytica data scandal, in which a British consulting firm collected personal data from millions of Facebook users and used it for political advertising.
Scraping data from Facebook
Facebook was at the heart of the Cambridge Analytica data scandal. It was on Facebook that the personal data was collected, and since then, it has been quite tricky to scrape data from the platform.
– Facebook doesn’t want anyone to scrape data from them. They see when a tool hits the platform and apply immediate configurations to prevent those tools from working, says Joseph.
So to scrape data from Facebook, you need to do it manually. There are a few different tools you can use, and Joseph has previously written an article on how to extract a friends list on Facebook using two Google Chrome extensions.
– Manual data scraping is not considered a significant risk for Facebook. They are more concerned about large-scale data gathering than about targeted, low-quantity scraping, says Joseph.
If you work in law enforcement and need more information from Facebook, you can send them a request. In some countries, for example in the US and the UK, Facebook is compelled to answer every request. But the problem is that this takes a lot of time, which could be spent catching criminals instead.
– When it comes to kidnapping, missing people, or child abuse, to name a few, law enforcement doesn’t have time to wait for social media companies to respond to their requests. They need to work fast to save lives, says Joseph.
Scraping data from Instagram
On Instagram, it is possible to scrape data such as videos, pictures, and Instagram TV feeds, in addition to follower and following lists. However, you can’t obtain any hidden information, which means that if the target has their Instagram account set to private, you have to connect with them to extract their data.
– You have to be connected to get the information. If you are not connected, you can only get the profile picture and name, nothing else, says Joseph.
There are many different tools to scrape an Instagram account. You can find them on GitHub, and you will need Python to use all of them.
If you are going to connect with the target, you need to create a sock puppet, a fake account that is not linked to your real identity.
Scraping data from TikTok
TikTok is basically the same as Instagram. You can scrape data such as videos, likes, comments, and so on, as long as the profile settings are not private.
– It is usually not a problem with TikTok since most people don’t set their account to private. The whole idea with TikTok is to get many followers, so generally speaking, most people have an open account, says Joseph.
The same goes for Instagram, even though it has slightly more private accounts than TikTok. You can use a free tool called TikTok scraper to extract data from TikTok.
Scraping data from other social media platforms
We have now gone through the most used social media platforms. But there are other platforms used in various countries around the world.
– In Russia, for example, they use a platform called VKontakte. There are no specific tools made for that, so you will have to scrape it in the same manual way as you scrape Facebook data, says Joseph.
Another example is Telegram, an instant messaging application that is widely used by people who want more privacy than WhatsApp offers. However, criminals prefer Telegram for the same reason, and this is especially true of those who distribute child sexual abuse material (CSAM).
– Telegram has private groups, so you have to create a sock puppet and become a member of the group before scraping data from it. Adding to the complexity, some groups are invite-only, and digital investigators need to work hard to obtain access to such groups in order to scrape from them. The groups used to distribute CSAM contain truly awful material, says Joseph.
To extract data from Telegram, you use the same method as on Facebook.
Use Scrapy to extract data from more than one social media platform at a time
Until now, we have told you about some easy-to-use tools that each scrape data from a single social media platform. But there is also a Python framework called Scrapy, which is more or less an all-in-one tool.
Scrapy is a powerful tool that can be used on various social media platforms, such as Reddit and Twitter. You can even use it to scrape data on the dark web.
– Scrapy is more advanced than the other tools, and it takes a long time to configure. But when it’s up and running, you can just let Scrapy run in the background automatically while you do other things, says Joseph.
With Scrapy, you can extract specific data by telling the system what you want. For example, you can configure it to scrape a Twitter account once a day or tell it to extract a specific page on the dark web twice a day.
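The idea of "once a day" and "twice a day" jobs can be sketched in plain Python. The example below does not use Scrapy's own API and is not how Scrapy is configured; it only illustrates how a list of scraping jobs with different frequencies can be checked to see which ones are due. The class name, job names, and target addresses are all placeholders invented for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class ScrapeJob:
    """One scraping task and how often it should run (illustrative only)."""
    name: str
    target: str
    runs_per_day: int
    last_run: Optional[datetime] = None

    @property
    def interval(self):
        # Once a day gives 24 hours, twice a day gives 12 hours, and so on.
        return timedelta(days=1) / self.runs_per_day

    def is_due(self, now):
        # A job that has never run is always due.
        return self.last_run is None or now - self.last_run >= self.interval


def due_jobs(jobs, now):
    """Return the jobs whose interval has passed since their last run."""
    return [job for job in jobs if job.is_due(now)]


# Placeholder jobs: both last ran 13 hours before "now".
now = datetime(2022, 1, 5, 12, 0)
jobs = [
    ScrapeJob("twitter-account", "https://example.com/target-account", 1,
              last_run=now - timedelta(hours=13)),
    ScrapeJob("dark-web-page", "http://example.onion/page", 2,
              last_run=now - timedelta(hours=13)),
]

for job in due_jobs(jobs, now):
    print("Due:", job.name)
```

In this sample, only the twice-a-day job is due, because 13 hours is more than its 12-hour interval but less than the 24 hours of the daily job.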
– It is a great tool, but it is not easy to configure. To do it, you will probably need an expert. Helping Paliscope’s customers to implement an effective scraping platform such as Scrapy is one of many initiatives that I am undertaking, says Joseph.
Now you hopefully know more about how to scrape data from different social media platforms. But how do you turn all the data you have collected into insights you can act on?
– The data you have extracted comes in various forms. It can be PDF, Excel, pictures, videos and so on. And you might have usernames and activities that you want to cross-reference. So you need a tool to do that, says Joseph.
– I think Paliscope has the perfect tool for analyzing data. You can put all of these different files into YOSE, which uses AI to cross-reference data. And then you can visualize your findings with Discovry, says Joseph.
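To make the idea of cross-referencing usernames concrete, here is a small sketch in plain Python. It is not how YOSE works; it only shows the basic principle of finding the same handle in data collected from several platforms. The platform names and usernames are made up for this example.

```python
from collections import defaultdict


def normalize(username):
    """Strip spaces and a leading @, and ignore upper and lower case."""
    return username.strip().lstrip("@").lower()


def cross_reference(usernames_by_platform):
    """Return handles seen on more than one platform, with those platforms."""
    platforms_by_username = defaultdict(set)
    for platform, usernames in usernames_by_platform.items():
        for username in usernames:
            platforms_by_username[normalize(username)].add(platform)
    return {
        username: sorted(platforms)
        for username, platforms in platforms_by_username.items()
        if len(platforms) > 1
    }


# Made-up usernames standing in for scraped data.
collected = {
    "twitter": ["@Night_Owl", "jdoe"],
    "instagram": ["night_owl", "someone_else"],
    "tiktok": ["Night_Owl "],
}

print(cross_reference(collected))
```

A matching handle is only a lead, not proof that the accounts belong to the same person, so it still needs to be verified against other findings.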