This article isn’t meant to discuss what web scraping is, or why it’s valuable to do. What I intend to focus on instead, is how modern web application architecture is changing how web scraping can/must be performed. A nice article discussing traditional web scraping just appeared in Hacker Newsletter #375 by Vinko Kodžoman. His article tipped my motivation to write this.
Up until recently, data was typically harvested by parsing a site’s markup. Browser automation frameworks allowed this to be achieved in various ways, and I’ve used both Beautiful Soup and Selenium to achieve what I needed to in the past. Vinko discusses in his article another library lxml, which I’ve not tried. His explanation of lxml and how it interacts with the DOM is good enough to allow general understanding of the way scraping is performed. Essentially, your bot reads the markup, and categorizes relevant data for you.
Entering Front-End & REST APIs
Earlier this year I was tasked with developing a scraper to collect data from a website which used Angular 2 on its front end. The requirement was that it collect business listings from each region on the site, however, the listings were only shown based on the boundaries of a Google Map frame users were able to drag around. There was no way to iterate the markup via pagination, and no directory/list of all existing regions. Traditional scraping of the markup wouldn’t work, because the content was rendered dynamically. I couldn’t figure out how to use BS4 for this task, and I was stumped.
I began to browse the site in question and examine the requests that it made via Postman. What I discovered was that life was much easier under this new type of web app architecture. AJAX calls were made via GET requests to return the data I was looking to collect. After spending a little time browsing the site to map the required endpoints of the REST API and their parameters, harvesting the data was substantially easier via their own API than it ever was in the days of parsing markup. Data was returned in JSON format, and could be added directly into a Mongo collection. Essentially, all the data could be copied from the website in a matter of half a minute or so.
Cool Story Bro, But Example?
What inspired me to write this article was a project at work. Since paperwork and money are involved, I can’t provide the examples I wish I could. However, here’s an example using a site in the Marijuana industry 😉
Example Scenario: You’re wanting to know every single place to buy marijuana in the country. Because you love marijuana!
Here is a Python script to make a POST request to the API endpoints on Leafly with the parameters that we just discovered. We’ve ignored sending any headers at all, and set the “Take” parameter to be absurd.
The output will be something like…
Sierra Vista Phytoceuticals
Hamilton Wellness Center
If this wasn’t a proof of concept, you could do much better than this by adjusting the coordinates a few times and taking multiple samples. There are a lot things to change if you were trying to make a legitimate Leafly scraper, I know. I’m just demonstrating that the ability to grab over 4k listings in a few seconds is pretty neat.
It’s getting easier to scrape large amounts of data when front-end frameworks talk to API’s with no authentication, especially if they have no limit to the request size.
P.S. If you’re ethical you will obey the terms of service posted on any site. You’ll then determine that running the provided proof of concept is wrong (don’t do it).