MakeUseOf: What Is Web Scraping? How to Collect Data From Websites. “Think of a type of data and you can probably collect it by scraping the web. Real estate listings, sports data, email addresses of businesses in your area, and even the lyrics from your favorite artist can all be sought out and saved by writing a small script.” This article has a couple of good examples, but it’s mostly an overview (this is not meant as a criticism; it’s an incredibly broad topic that nobody could cover in one article!)
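For a sense of what such a "small script" might look like, here is a minimal sketch in Python using only the standard library. It is not taken from the MakeUseOf article. The sample HTML, the `listing` class name, and the names `ListingParser` and `scrape_listings` are all made up for illustration. A real scraper would fetch the page over the network and would need to respect the site's terms.

```python
from html.parser import HTMLParser

# Hypothetical page content (an assumption for this sketch); a real script
# would download the page, for example with urllib.request.
SAMPLE_HTML = """
<ul>
  <li class="listing"><a href="/homes/1">2-bed bungalow</a></li>
  <li class="listing"><a href="/homes/2">Loft apartment</a></li>
  <li class="ad"><a href="/promo">Sponsored link</a></li>
</ul>
"""


class ListingParser(HTMLParser):
    """Collect (title, href) pairs for links inside <li class="listing">."""

    def __init__(self):
        super().__init__()
        self.in_listing = False   # True while inside a matching <li>
        self.current_href = None  # href of the <a> currently open, if any
        self.results = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li":
            self.in_listing = attrs.get("class") == "listing"
        elif tag == "a" and self.in_listing:
            self.current_href = attrs.get("href")

    def handle_data(self, data):
        # Keep link text only while inside an <a> within a listing item.
        if self.current_href is not None and data.strip():
            self.results.append((data.strip(), self.current_href))

    def handle_endtag(self, tag):
        if tag == "a":
            self.current_href = None
        elif tag == "li":
            self.in_listing = False


def scrape_listings(html):
    """Return a list of (title, href) tuples found in the given HTML."""
    parser = ListingParser()
    parser.feed(html)
    return parser.results


listings = scrape_listings(SAMPLE_HTML)
for title, href in listings:
    print(title, href)
```

The sponsored link is skipped because its list item does not carry the `listing` class. Most of the work in a real scraper is finding selectors like that one for the site in question.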
Graham Cluley: Facebook knew for years scammers were harvesting users’ details with phone number searches. Did nothing. “Facebook ignored a widely-known privacy flaw for years, allowing scammers, spammers, and other malicious parties to scoop up virtually all users’ names and profile details. As I explained way back in 2012, when I was writing for the Sophos Naked Security blog, simply entering someone’s phone number or email address into Facebook’s search box would perform a reverse look-up and tell you who it belonged to, with any information they shared publicly on their Facebook profile.”
Techdirt: Court Says Scraping Websites And Creating Fake Profiles Can Be Protected By The First Amendment. “It’s no secret that the Computer Fraud and Abuse Act (CFAA) is a mess. Originally written by a confused and panicked Congress in the wake of the 1980s movie War Games, it was supposed to be an ‘anti-hacking’ law, but was written so broadly that it has been used over and over again against any sort of ‘things that happen on a computer.’ It has been (not so jokingly) referred to as ‘the law that sticks,’ because when someone has done something ‘icky’ using a computer, if no other law is found to be broken, someone can almost always find some weird way to interpret the CFAA to claim it’s been violated. The two most problematic parts of the CFAA are the fact that it applies to ‘unauthorized access’ or to ‘exceeding authorized access’ on any ‘computer… which is used in or affecting interstate or foreign commerce or communications.’ In 1986 that may have seemed limited. But, today, that means any computer on the internet. Which means basically any computer.”
Wolfram Blog: Web Scraping with the Wolfram Language, Part 1: Importing and Interpreting. “Do you want to do more with data available on the web? Meaningful data exploration requires computation—and the Wolfram Language is well suited to the tasks of acquiring and organizing data. I’ll walk through the process of importing information from a webpage into a Wolfram Notebook and extracting specific parts for basic computation.” Ooh!
Kaylin Walker: Tidy Text Mining Beer Reviews. “BeerAdvocate.com was scraped for a sample of beer reviews, resulting in a dataset of 31,550 beers and their brewery, beer style, ABV, total numerical ratings, number of text reviews, and a sample of review text. Review text was gathered only for beers with at least 5 text reviews. A minimum of 2000 characters of review text were collected for those beers, with total length ranging from 2000 to 5000 characters.”
UpGuard: Dark Cloud: Inside The Pentagon’s Leaked Internet Surveillance Archive. “While a cursory examination of the data reveals loose correlations of some of the scraped data to regional US security concerns, such as with posts concerning Iraqi and Pakistani politics, the apparently benign nature of the vast number of captured global posts, as well as the origination of many of them from within the US, raises serious concerns about the extent and legality of known Pentagon surveillance against US citizens. In addition, it remains unclear why and for what reasons the data was accumulated, presenting the overwhelming likelihood that the majority of posts captured originate from law-abiding civilians across the world.”
Online Journalism Blog: The 2nd edition of Scraping for Journalists is now live. “When I began publishing Scraping for Journalists in 2012, one of the reasons for choosing to publish online was the ability to publish chapters as I wrote them, and update the book in response to readers’ feedback. The book was finally ‘finished’ in 2013 — but earlier this year I decided to go through it from cover to cover and update everything. The result — a ‘second edition’ of Scraping for Journalists — is now live.”