University of Wisconsin-Milwaukee: UWM Team Receives Prestigious Mellon Grant for “Archive Mining”. “The ‘LGBTQ+ Audio Archive Mining Project’ will use machine learning tools and data analysis and visualization to build and process text datasets extracted from a variety of AV materials in these collections, including collections of oral histories, local television news and radio broadcasts, and early LGBTQ+ community cable programming.”
Science Blog: Simplifying How Scientists Share Data. “…often, sharing that data with other scientists – or with peer-reviewed journal editors, or funders – is difficult. The software might be proprietary, and prohibitively expensive to purchase. It might take years of training for a person to be able to manage and understand the software. Or the company that created the software might have gone out of business. A research team has developed an open-source data-management system that the scientists hope will solve all of those problems.”
Google Cloud: Big data, big world: new NOAA datasets available on Google Cloud. “A vast trove of NOAA’s environmental data is now available on Google Cloud as part of the Google Cloud Public Datasets Program and NOAA’s Big Data Project, opening up possibilities for scientific and economic advances. We are thrilled to make this valuable data available for your exploration. Google Cloud will host 5 PB of this data across our products, including BigQuery, Cloud Storage, Google Earth Engine, and Kaggle. The stored data is available at no cost, though usual charges may still apply (processing, egress of user-owned data, for example).”
New York Times: Twelve Million Phones, One Dataset, Zero Privacy. “Each piece of information in this file represents the precise location of a single smartphone over a period of several months in 2016 and 2017. The data was provided to Times Opinion by sources who asked to remain anonymous because they were not authorized to share it and could face severe penalties for doing so. The sources of the information said they had grown alarmed about how it might be abused and urgently wanted to inform the public and lawmakers.”
The Conversation: How tattoos became fashionable in Victorian England. “…we carried out the largest analysis of tattoos ever undertaken, examining 75,688 descriptions of tattoos, on 57,990 convicts in Britain and Australia from 1793 to 1925. We used data-mining techniques to extract information embedded within broader descriptive fields of criminal records, and we linked this information with extensive evidence about the personal characteristics and backgrounds of our subjects. Because the meanings of tattoos are often so difficult to fathom, we used visualisations to identify patterns of use and juxtapositions of particular designs.” This tattoo database is one of the new datasets from Digital Panopticon, which has also added a feature that lets you search convicts by occupation.
Library of Congress: In the Library’s Web Archives: Dig If You Will the Pictures. “The Digital Content Management section has been working on a project to extract and make available sets of files from the Library’s significant Web Archives holdings. This is another step to explore the Web Archives and make them more widely accessible and usable. Our aim in creating these sets is to identify reusable, ‘real world’ content in the Library’s digital collections, which we can provide for public access. The outcome of the project will be a series of datasets, each containing 1,000 files of related media types selected from .gov domains. We will announce and explore these datasets here on The Signal, and the data will be made available through LC Labs.”
Gengo: The 50 Best Free Datasets for Machine Learning. “What are some open datasets for machine learning? We at Gengo decided to create the ultimate cheat sheet for high quality datasets. These range from the vast (looking at you, Kaggle) or the highly specific (data for self-driving cars).”