I’ve been intending to migrate my blog to a static site for quite some time, for security, speed, and a generally more minimalist setup. The primary issue preventing me from making the change was fear of commitment. Hugo looked like an excellent option, as did Hyde (not to be confused with the Jekyll Template). I decided to go with Jekyll simply because of the size of its community, and because of GitHub Pages (which I’m still not using until they support HTTPS for custom domains).
This article isn’t meant to discuss what web scraping is or why it’s valuable. What I intend to focus on instead is how modern web application architecture is changing how web scraping can, and must, be performed. A nice article by Vinko Kodžoman discussing traditional web scraping just appeared in Hacker Newsletter #375. His article gave me the push to write this.
Traditional Scraping
Up until recently, data was typically harvested by parsing a site’s markup. Parsing libraries and browser automation frameworks allowed this to be achieved in various ways, and I’ve used both Beautiful Soup and Selenium to do what I needed in the past. In his article, Vinko discusses another library, lxml, which I’ve not tried. His explanation of lxml and how it interacts with the DOM is good enough to give a general understanding of how scraping is performed. Essentially, your bot reads the markup and categorizes the relevant data for you.
It’s been a little over a year since I started work on my main side project. The app is a motocross track directory, which is hardly a new idea, but I felt existing track directories were lacking a lot of features. This led me to create MapMoto.
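To make the "bot reads the markup" idea concrete, here is a minimal sketch. It uses Python’s standard-library html.parser instead of Beautiful Soup, Selenium, or lxml, only so that it runs without installing anything. The sample markup, the "title" class, and the names TitleParser and scrape_titles are all made up for illustration.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside every <h2 class="title"> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag == "h2" and ("class", "title") in attrs:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_title = False

    def handle_data(self, data):
        # Keep only non-empty text found inside a matching <h2>.
        if self.in_title and data.strip():
            self.titles.append(data.strip())


# Hypothetical markup standing in for a page fetched from a site.
SAMPLE_HTML = """
<div>
  <h2 class="title">First post</h2>
  <h2 class="sponsored">An advert</h2>
  <h2 class="title">Second post</h2>
</div>
"""


def scrape_titles(html):
    """Parse the markup and return the titles we care about."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.titles


print(scrape_titles(SAMPLE_HTML))
```

The catch is that this only works when the data is actually present in the markup the server sends back.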
An Idea
I ride motocross a lot. Not as much as a few years ago, but a lot. I’m always looking up weather before I ride, looking for hotline numbers to call to confirm days to ride, and looking for new tracks altogether, especially when traveling. I wrote down everything I wished a motocross track directory would have, and came up with the following list.
Note: If you actually want support on the unlimited feed, and don’t want to do any hacky tricks, go support their hard work and Purchase Woocommerce product feed manager.
I came across this info somewhat by accident today while working on an XML feed generator for a WooCommerce installation. I’ll often review the code of a couple of plugins with similar functions to what I’m developing. While looking through Woocommerce Google Feed Manager, I guess I found a gremlin.
UPDATE 12/5/2016:
If you’re going to attempt to integrate this into the WordPress platform, please consider using my WP Drinking Age Plugin.
Background
So outside the normal grind I’ve been working on a website for a tequila brand. After a meeting with marketing, I gathered it was important to add a drinking age gateway to the site. You see some type of these gateways on just about every alcohol brand’s site. I asked if they’d prefer to simply ask “Are you of Legal Drinking Age?”, and then have “Yes/No” buttons determine a user’s fate (1),(2), or if they’d rather have the user input their birthday (3). Apparently, and I’m not a business guy or a lawyer so don’t comment and argue this with me, the yes/no gateways hold slightly less legitimacy than the ones where a user inputs their birthday to enter the site.
I’ve been meaning to play with honeypots for quite some time, and if I’d given it just a little more research, I’d have started much sooner. This is because shortly after deciding upon glastopf as the first on my list of honeypots to try out, I came across mhn, an open source project by Threat Stream.
The Modern Honeypot Network (mhn) not only makes launching honeypots insanely easy, but also serves as a nice way of monitoring multiple honeypots. Digital Ocean Droplets seemed like a cheap and safe way of getting started, and I quickly found this post by Lenny Zeltser, which provides pretty good directions for anyone wanting to do this themselves.
WordPress powers a large percentage of the web. As I’m writing this, web development firms all over the world are churning out WordPress sites for their clients. Some of these installs are vanilla and basic, yet some come with exceedingly complicated plugin/theme combinations. WordPress’ ease of use is a double-edged sword. The positive side is that a developer may complete a feature-rich, members-only website in one day. The negative is that a multitude of plugins and code snippets written by other developers are included in these projects (otherwise they wouldn’t be completed within a day). A good developer will make good choices as to what plugins to use, but a novice developer may not be able to tell, and things can become dangerous.
Things began to change as of version 6.78: first the developers removed the upload feature and the wp-cli functionality. Newer versions of the plugin are really stripped down and include less functionality than 6.77 did. The new version exists on the WordPress repository almost exclusively to upsell you to the paid version. The new plugin is still hackable, but more steps are required than what’s described below and, like I said, it doesn’t work with wp-cli. The steps below to increase the plugin size don’t work on the new version.
Within the same week, my girlfriend and I both found ourselves without phones. Her Galaxy took a soaking in the ladies’ room, and my late Nexus 5 had ceased to charge despite all repair efforts. So now I find myself with two fresh Nexus 5s, a white one for my girlfriend and a black one for myself, running Android Lollipop 5.0.1.
I’m going to walk through the process of what I’ve done setting up the devices. They are almost completely open source, with additional security and privacy features to be installed in Part 2. This is written as a fairly high-level overview of the process, so I’ll try not to get into the nitty-gritty. This isn’t intended as a walk-through.
Although I’ve not actually been inside yet, I’m on the email list for the Sacramento Hacker Lab. A few weeks ago they put out an email alerting local developers that their new location in Rocklin is hosting an event for Intel’s RealSense 3D camera technology. It’s not really my field, but I love learning new things, and I love any kind of conference, so I applied. A few weeks later I got a call from an event organizer, and they were nice enough to grant me a spot.