Hackaday: Tired Of Web Scraping? Make The AI Do It

Hackaday: Tired Of Web Scraping? Make The AI Do It. “[James Turk] has a novel approach to the problem of scraping web content in a structured way without needing to write the kind of page-specific code web scrapers usually have to deal with. How? Just enlist the help of a natural language AI. Scrapeghost relies on OpenAI’s GPT API to parse a web page’s content, pull out and classify any salient bits, and format it in a useful way.”
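For readers wondering what "make the AI do it" looks like in practice, here is a minimal Python sketch of the general idea: describe the fields you want, hand the page and that description to a language model, and parse the JSON that comes back. This is not Scrapeghost's actual API. The names `build_prompt`, `scrape_with_model`, and `complete` are hypothetical, and `complete` stands in for whatever function sends a prompt to a model and returns its text reply.

```python
import json


def build_prompt(schema, html):
    """Combine the wanted fields and the raw page into one instruction."""
    return (
        "Extract data from the HTML below and reply with JSON only, "
        "using exactly these keys and value types:\n"
        + json.dumps(schema, indent=2)
        + "\n\nHTML:\n"
        + html
    )


def scrape_with_model(html, schema, complete):
    """Ask the model for structured data and check that every key came back.

    `complete` is a hypothetical callable: prompt string in, reply string out.
    """
    reply = complete(build_prompt(schema, html))
    data = json.loads(reply)  # raises ValueError if the reply is not valid JSON
    missing = [key for key in schema if key not in data]
    if missing:
        raise ValueError(f"model reply is missing keys: {missing}")
    # Keep only the requested keys, in schema order.
    return {key: data[key] for key in schema}


def fake_complete(prompt):
    """Canned stand-in for a real model call, so the sketch runs offline."""
    return '{"name": "Ada Example", "role": "Editor", "extra": "ignored"}'


if __name__ == "__main__":
    page = "<div><h1>Ada Example</h1><p>Editor</p></div>"
    wanted = {"name": "string", "role": "string"}
    print(scrape_with_model(page, wanted, fake_complete))
```

Swapping `fake_complete` for a real model call is where the cost and reliability questions come in, since a model can return malformed JSON or leave out fields, which is why the sketch checks the reply before trusting it.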

Hongkiat: 5 Best Web Scraping Tools to Extract Online Data

Hongkiat: 5 Best Web Scraping Tools to Extract Online Data. “These software look for new data manually or automatically, fetching the new or updated data and storing them for easy access. For example, one may collect info about products and their prices from Amazon using a scraping tool. In this post, we’re listing the use cases of web scraping tools and the top 5 web scraping tools to collect information with zero codings.”

Noupe: An Introductory Guide On How To Do Web Scraping: Extracting Data From Your Website

Noupe: An Introductory Guide On How To Do Web Scraping: Extracting Data From Your Website. “Also known as web extraction, web scraping is a tool that helps you to gain information on products, contacts, and a lot more, even when a website doesn’t have an API (application programming interface), or grants limited access to its data. Web scraping offers a faster, more practical solution for extracting data from a website, instead of having to use the same format as the website in question, or even just copying and pasting information manually.” The headline is a little misleading because this article isn’t really a how-to. It is, however, an excellent overview of, and orientation to, Web scraping.

VIDEO: An introduction to HTML and CSS for data journalists (Online Journalism Blog)

Online Journalism Blog: VIDEO: An introduction to HTML and CSS for data journalists. “In this video — first made for students on the MA in Data Journalism at Birmingham City University and shared as part of a series of video posts — I provide an introduction to the aspects of HTML and CSS that are helpful for those starting out with data journalism. It is best watched alongside the previous video on responsive web design.” The video is hosted on YouTube and the captions are auto-generated. The English ones are pretty good with only a few errors.

Engadget: Meta sues a site cloner who allegedly scraped over 350,000 Instagram profiles

Engadget: Meta sues a site cloner who allegedly scraped over 350,000 Instagram profiles. “Meta is taking legal action against two prolific data scrapers. On Tuesday, the company filed separate federal lawsuits against a company called Octopus and an individual named Ekrem Ateş. According to Meta, the former is the US subsidiary of a Chinese multinational tech firm that offers data scraping-for-hire services to individuals and companies.”

Spotted Via Reddit: ISEF Database

Spotted on Reddit and hosted on GitHub: ISEF Database. In this case ISEF is the International Science and Engineering Fair. “This is a simple web scraper which gets all of the projects and abstract information from Science for Society’s website… I want someone to get inspired to do a ‘meta’ science fair project.” Looks like it’s available either as a Kaggle notebook or a delimited text file of information.

CNN: Meta wants researchers to help it avoid having users’ personal data exposed online

CNN: Meta wants researchers to help it avoid having users’ personal data exposed online. “Meta, the company formerly known as Facebook, is asking for help in avoiding having personal data about its users scraped from its platforms and posted to the web. The social media giant announced Wednesday that it is expanding its bug bounty program — which offers rewards for helping identify and fix vulnerabilities in its apps — to include scraping, in a move Meta (FB) is calling an ‘industry first’ to address an ‘internet-wide’ challenge.”

Complete Music Update: Genius tries to get its lyric lifting lawsuit against Google reinstated

Complete Music Update: Genius tries to get its lyric lifting lawsuit against Google reinstated. “Legal reps for lyrics site Genius were in the Second Circuit appeals court in the US yesterday seeking to get their client’s big old lawsuit against Google reinstated. They insisted that Genius had a legitimate legal claim against Google because the tech giant breached its terms of service.”

Kyiv Post: Facebook sues Ukrainian hacker for selling millions of users’ data

Kyiv Post: Facebook sues Ukrainian hacker for selling millions of users’ data. “Facebook is suing a Ukrainian national suspected of scraping and selling information from 178 million users on the platform in 2018-2019, according to American publication Insider. According to the court documents, the hacker accessed and sold user IDs and phone numbers, violating the terms of service of Facebook.”

The Verge: Facebook’s justification for banning third-party researchers ‘inaccurate,’ says FTC

The Verge: Facebook’s justification for banning third-party researchers ‘inaccurate,’ says FTC. “When Facebook banned the personal accounts of academics researching ad transparency and misinformation on its platform this week, it justified the decision in part by saying it was only following rules set out by the Federal Trade Commission. But the FTC itself says this is ‘inaccurate’ and that its rules require no such action, reports The Washington Post.”

Search Engine Journal: How to Use Google Sheets for Web Scraping & Campaign Building

Search Engine Journal: How to Use Google Sheets for Web Scraping & Campaign Building. “According to Google’s support page, IMPORTXML ‘imports data from any of various structured data types including XML, HTML, CSV, TSV, and RSS and ATOM XML feeds.’ Essentially, IMPORTXML is a function allows you to scrape structured data from webpages — no coding knowledge required. For example, it’s quick and easy to extract data such as page titles, descriptions, or links, but also more complex information.”
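In Google Sheets, IMPORTXML takes a URL and an XPath query, so pulling a page title looks something like `=IMPORTXML(url, "//title")`. For anyone curious what that kind of extraction involves outside a spreadsheet, below is a rough Python equivalent using only the standard library. It is a sketch, not what Google runs: `PageSummaryParser`, `summarize`, and the sample page are all made up for illustration, and it reads from an inline string rather than fetching a live site.

```python
from html.parser import HTMLParser


class PageSummaryParser(HTMLParser):
    """Collects the <title> text, the meta description, and all link targets."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = None
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content")
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text only counts toward the title while inside <title>...</title>.
        if self._in_title:
            self.title += data


def summarize(html):
    """Return the title, description, and links found in an HTML string."""
    parser = PageSummaryParser()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "description": parser.description,
        "links": parser.links,
    }


# Made-up page used only for illustration.
SAMPLE_HTML = """
<html><head><title>Example Shop</title>
<meta name="description" content="A made-up page for illustration.">
</head><body>
<a href="/products">Products</a>
<a href="https://example.com/contact">Contact</a>
</body></html>
"""

if __name__ == "__main__":
    print(summarize(SAMPLE_HTML))
```

The spreadsheet function hides all of this behind one formula, which is the appeal of the no-coding approach the article describes.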

Engadget: Facebook disables accounts of NYU team looking into political ad targeting

Engadget: Facebook disables accounts of NYU team looking into political ad targeting. “Before the US election last year, a team of researchers from New York University’s engineering school launched a project to gather more data on political ads. In particular, the team wanted to know how political advertisers choose the demographic their ads target and don’t target. Shortly after the project called the NYU Ad Observatory went live, however, Facebook notified the researchers that their efforts violate its terms of service related to bulk data collection. Now, the social network has announced that it has ‘disabled the accounts, apps, Pages and platform access associated with NYU’s Ad Observatory Project and its operators…’”