I’ve been intending to migrate my blog to a static site for quite some time, for security, speed, and general minimalism. The primary issue preventing me from making the change was fear of commitment. Hugo looked like an excellent option, as did Hyde (not to be confused with the Jekyll template). I decided to go with Jekyll simply because of the size of its community, and because of GitHub Pages (which I’m still not using until they support HTTPS for custom domains).
This article isn’t meant to discuss what web scraping is, or why it’s valuable to do. What I intend to focus on instead is how modern web application architecture is changing how web scraping can, and sometimes must, be performed. A nice article by Vinko Kodžoman discussing traditional web scraping just appeared in Hacker Newsletter #375. His article is what motivated me to write this.
Traditional Scraping
Up until recently, data was typically harvested by parsing a site’s markup. Parsing libraries and browser automation frameworks allowed this to be achieved in various ways, and I’ve used both Beautiful Soup and Selenium to get what I needed in the past. In his article, Vinko discusses another library, lxml, which I’ve not tried. His explanation of lxml and how it interacts with the DOM is good enough to give a general understanding of the way scraping is performed. Essentially, your bot reads the markup and categorizes the relevant data for you.
It’s been a little over a year since I started work on my main side project. The app is a motocross track directory. Track directories already exist, but I felt the existing ones were lacking a lot of features. This led me to create MapMoto.
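To make "reads the markup and categorizes relevant data" a bit more concrete, here is a minimal sketch using only Python's standard-library `html.parser`. It is not taken from Vinko's article and doesn't use Beautiful Soup, Selenium, or lxml. The sample markup, the `title` class name, and the `ClassTextParser` and `scrape_titles` names are all made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical markup standing in for a page a bot has already downloaded.
SAMPLE_HTML = """
<ul>
  <li class="title">First post</li>
  <li class="date">2017-01-01</li>
  <li class="title">Second post</li>
</ul>
"""


class ClassTextParser(HTMLParser):
    """Collect the text of every element carrying a given class.

    Assumes flat markup: nested tags inside a matching element are not handled.
    """

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.capturing = False
        self.results = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; class may hold several names.
        classes = (dict(attrs).get("class") or "").split()
        if self.wanted_class in classes:
            self.capturing = True
            self.results.append("")

    def handle_endtag(self, tag):
        # Any closing tag ends the current capture (flat-markup assumption).
        self.capturing = False

    def handle_data(self, data):
        if self.capturing:
            self.results[-1] += data


def scrape_titles(html):
    """Return the stripped text of every element with class 'title'."""
    parser = ClassTextParser("title")
    parser.feed(html)
    return [text.strip() for text in parser.results]


if __name__ == "__main__":
    print(scrape_titles(SAMPLE_HTML))
```

The point is only that the data lives in the markup itself, so the bot's whole job is to walk the tags and pick out the parts it cares about. A real scraper would more likely lean on one of the libraries mentioned above than on a hand-rolled parser like this.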
An Idea
I ride motocross a lot, not as much as a few years ago, but a lot. I’m always looking up weather before I ride, looking for hotline numbers to call to confirm days to ride, and looking for new tracks altogether, especially when traveling. I wrote down everything I wished a motocross track directory would have, and came up with the following list.
Note: If you actually want support on the unlimited feed, and don’t want to do any hacky tricks, go support their hard work and purchase Woocommerce product feed manager.
I came across this info somewhat by accident today while working on an XML feed generator for a WooCommerce installation. I’ll often review the code of a couple of plugins with similar functions to what I’m developing. While looking through Woocommerce Google Feed Manager, I guess I found a gremlin.
UPDATE 12/5/2016:
If you’re going to attempt to integrate this into the WordPress platform, please consider using my WP Drinking Age Plugin.
Background
So outside the normal grind I’ve been working on a website for a tequila brand. After a meeting with marketing, I gathered it was important to add a drinking age gateway to the site. You see some type of these gateways on just about every alcohol brand’s site. I asked if they’d prefer to simply ask “Are you of Legal Drinking Age?” and then have “Yes/No” buttons determine a user’s fate (1),(2), or if they’d rather have the user input their birthday (3). Apparently, and I’m not a business guy or a lawyer so don’t comment and argue this with me, the yes/no gateways hold slightly less legitimacy than the ones where a user inputs their birthday to enter the site.