Kaylin Walker: Tidy Text Mining Beer Reviews. “BeerAdvocate.com was scraped for a sample of beer reviews, resulting in a dataset of 31,550 beers and their brewery, beer style, ABV, total numerical ratings, number of text reviews, and a sample of review text. Review text was gathered only for beers with at least 5 text reviews. A minimum of 2000 characters of review text were collected for those beers, with total length ranging from 2000 to 5000 characters.”
Library of Congress: Digital Scholarship Resource Guide: Text analysis (part 4 of 7). “Clean OCR, good metadata, and richly encoded text open up the possibility for different kinds of computer-assisted text analysis. With instructions from humans (“code”), computers can identify information and patterns across large sets of texts that human researchers would be hard-pressed to discover unaided. For example, computers can find out which words in a corpus are used most and least frequently, which words occur near each other often, what linguistic features are typical of a particular author or genre, or how the mood of a plot changes throughout a novel. Franco Moretti describes this kind of analysis as ‘distant reading’, a play on the traditional critical method ‘close reading’. Distant reading implies not the page-by-page study of a few texts, but the aggregation and analysis of large amounts of data.”
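The word-frequency and co-occurrence counting that the guide describes is simple to sketch. Below is a minimal, hypothetical Python illustration (not from the Library of Congress guide) that counts the most frequent words in a tiny corpus and tallies which word pairs appear near each other; the toy `corpus` and the `window` parameter are my own assumptions for demonstration.

```python
import re
from collections import Counter

def word_counts(texts):
    """Count word frequencies across a corpus (a list of document strings)."""
    counts = Counter()
    for doc in texts:
        counts.update(re.findall(r"[a-z']+", doc.lower()))
    return counts

def cooccurrence(texts, window=2):
    """Count how often two words appear within `window` words of each other."""
    pairs = Counter()
    for doc in texts:
        words = re.findall(r"[a-z']+", doc.lower())
        for i, w in enumerate(words):
            for v in words[i + 1 : i + 1 + window]:
                pairs[tuple(sorted((w, v)))] += 1
    return pairs

corpus = ["the whale moved through the dark sea",
          "the sea was dark and the whale was gone"]
print(word_counts(corpus).most_common(3))
print(cooccurrence(corpus).most_common(2))
```

Real distant-reading work layers more on top of this (stop-word removal, stemming, genre metadata), but frequency and co-occurrence tables like these are the building blocks.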
Science: Want to analyze millions of scientific papers all at once? Here’s the best way to do it. “There is a long-standing debate among text and data miners: whether sifting through full research papers, rather than much shorter and simpler research summaries, or abstracts, is worth the extra effort. Though it may seem obvious that full papers would give better results, some researchers say that a lot of information they contain is redundant, and that abstracts contain all that’s needed. Given the challenges of obtaining and formatting full papers for mining, stick with abstracts, they say. In an attempt to settle the debate, Søren Brunak, a bioinformatician at the Technical University of Denmark in Kongens Lyngby, and colleagues analyzed more than 15 million scientific articles published in English from 1823 to 2016.”
From Joyce Valenza at School Library Journal: JSTOR Text Analyzer. “JSTOR Labs recently announced Text Analyzer. If you have access to the JSTOR database, you’ll want to share this new search strategy with your students and faculty. Upload or drag a document–an article, a Google document, a paper you are writing, a PDF or even an image–into what JSTOR is calling its magic box, and Text Analyzer will analyze it to identify prioritized terms.”
Dato Capital: Dato Capital Announces First Tool for Extracting Company Information from Documents (PRESS RELEASE). “The Company Information Extractor can process documents by entering a website URL, uploading a file or entering text directly. Accepted formats include PDF, Word, Excel, HTML and TXT files. The system scans the document and searches for mentions of companies and directors against a daily updated database of 14 million companies and 12 million directors from the United Kingdom, Spain, Luxembourg, Panama, Gibraltar, BVI, Cayman Islands and the Netherlands.” The direct link for the tool is https://en.datocapital.com/CompanyInformationExtractor .
MIT News: Cutting down the clutter in online conversations. “From Reddit to Quora, discussion forums can be equal parts informative and daunting. We’ve all fallen down rabbit holes of lengthy threads that are impossible to sift through. Comments can be redundant, off-topic or even inaccurate, but all that content is ultimately still there for us to try and untangle. Sick of the clutter, a team from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) has developed ‘Wikum,’ a system that helps users construct concise, expandable summaries that make it easier to navigate unruly discussions.”
Ars Technica: The art of the troll: New tool reveals egg users’—and Trump’s—posting patterns. “Tweets_analyzer requires a Twitter account for authentication, as well as Twitter API credentials and, of course, a tweaked Python environment. It’s not exactly something to be handed over blindly to the average tweeter. But in the right hands (and with a little patience due to Twitter API rate-limiting), it can help analyze accounts to identify networks of Twitter bots or trolls concealing their actual location and identity.”
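The core idea behind posting-pattern analysis is just bucketing an account’s tweet timestamps by hour of day: a long daily quiet stretch suggests when the poster sleeps, and hence their likely time zone. This is a minimal sketch of that idea, not tweets_analyzer’s actual code; the `sample` timestamps are hypothetical stand-ins for data you would pull via the Twitter API.

```python
from collections import Counter
from datetime import datetime

def hourly_activity(timestamps):
    """Return a 24-slot list of tweet counts, one per hour of the day."""
    hours = Counter(ts.hour for ts in timestamps)
    return [hours.get(h, 0) for h in range(24)]

# Hypothetical timestamps standing in for tweets fetched from the API.
sample = [datetime(2017, 3, 1, h) for h in (13, 13, 14, 20, 21, 21, 21)]
print(hourly_activity(sample))
```

The empty slots in the resulting histogram are the interesting part: an account that is silent from, say, 02:00 to 09:00 UTC is probably posting from a time zone where that span falls overnight.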