Science: Want to analyze millions of scientific papers all at once? Here’s the best way to do it. “There is long-standing debate among text and data miners: whether sifting through full research papers, rather than much shorter and simpler research summaries, or abstracts, is worth the extra effort. Though it may seem obvious that full papers would give better results, some researchers say that a lot of information they contain is redundant, and that abstracts contain all that’s needed. Given the challenges of obtaining and formatting full papers for mining, stick with abstracts, they say. In an attempt to settle the debate, Søren Brunak, a bioinformatician at the Technical University of Denmark in Kongens Lyngby, and colleagues analyzed more than 15 million scientific articles published in English from 1823 to 2016.”
The Guardian: How can we stop algorithms telling lies? “The recent proliferation in big data models has gone largely unnoticed by the average person, but it’s safe to say that most important moments where people interact with large bureaucratic systems now involve an algorithm in the form of a scoring system. Getting into college, getting a job, being assessed as a worker, getting a credit card or insurance, voting, and even policing are in many cases done algorithmically. Moreover, the technology introduced into these systematic decisions is largely opaque, even to their creators, and has so far largely escaped meaningful regulation, even when it fails. That makes the question of which of these algorithms are working on our behalf even more important and urgent.” This is not one of those “oo oo algos are bad hide the children” articles; it’s in-depth, with a lot of examples. Recommended read.
University at Buffalo: The end of sneakernet? “For researchers and companies sharing extremely large datasets, such as genome maps or satellite imagery, it can be quicker to send documents by truck or airplane. The slowdown leads to everything from lost productivity to the inability to quickly warn people of natural disasters. The University at Buffalo has received a $584,469 National Science Foundation grant to address this problem.”
Wired: AI and ‘Enormous Data’ Could Make Tech Giants Harder to Topple. “Another week, another record-breaking AI research study released by Google—this time with results that are a reminder of a crucial business dynamic of the current AI boom. The ecosystem of tech companies that consumers and the economy increasingly depend on is traditionally said to be kept innovative and un-monopolistic by disruption, the process whereby smaller companies upend larger ones. But when competition in tech depends on machine learning systems powered by huge stockpiles of data, slaying a tech giant may be harder than ever.”
From The New Stack, with a hat tip to Angela G.: Big Data Simpsons. “Thanks to the work of Benjamin M. Schmidt, an assistant professor of history at Northeastern University, 25 years of dialogue from The Simpsons have been smashed into a giant data set, connected to a user-friendly search window.”
Phys.org: Open imaging data for biology. “A picture may be worth a thousand words, but only if you understand what you are looking at. The life sciences rely increasingly on 2-D, 3-D and 4-D image data, but its staggering heterogeneity and size make it extremely difficult to collate into a central resource, link to other data types and share with the research community. To address this challenge, scientists at the University of Dundee, the European Bioinformatics Institute (EMBL-EBI), the University of Bristol and the University of Cambridge have launched a prototype repository for imaging data: the Image Data Resource (IDR). This free resource, described in Nature Methods, is the first general biological image repository that stores and integrates data from multiple modalities and laboratories.”
Stanford: New database allows Stanford researchers to find disparities in officers’ treatment of minority motorists. “…These findings are based on a nationwide database – which the Stanford researchers created – of state patrol stops. The database contains key details from millions of records collected from 2011 to 2015 and is part of an effort to statistically analyze police practices. Along with the findings they are sharing today, the researchers are releasing their entire dataset, complete with online tutorials, so that policy makers, journalists and citizens can do their own analyses through this new Stanford Open Policing Project.”