National Library of New Zealand: Papers Past data has been set free

National Library of New Zealand: Papers Past data has been set free. “Papers Past is the National Library’s fully text searchable website containing over 150 newspapers from New Zealand and the Pacific, as well as magazines, journals and government reports. As a result of the data being released, people can now access the data from 78 New Zealand newspapers from the Albertland Gazette to the Victoria Times, all published before 1900. The data itself consists of the METS/ALTO XML files for each issue. The XML files sit in the back of Papers Past and are what allows you to locate keywords within articles.”
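As a rough illustration of how ALTO files make keyword location possible: ALTO stores each recognised word as a `String` element, with the text in a `CONTENT` attribute and the word's position on the page in `HPOS`, `VPOS`, `WIDTH` and `HEIGHT`. The Python sketch below runs against a made-up fragment, not a real Papers Past file. The function name `find_keyword` is hypothetical, and real files may declare a different ALTO namespace version, which is why the code compares local tag names only.

```python
import xml.etree.ElementTree as ET

# Made-up ALTO fragment for illustration; not taken from Papers Past.
SAMPLE_ALTO = """\
<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="Albertland" HPOS="100" VPOS="200" WIDTH="310" HEIGHT="40"/>
            <String CONTENT="Gazette," HPOS="420" VPOS="200" WIDTH="220" HEIGHT="40"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def find_keyword(alto_xml, keyword):
    """Return the page coordinates of every String whose CONTENT matches keyword."""
    wanted = keyword.casefold()
    hits = []
    for elem in ET.fromstring(alto_xml).iter():
        # Tags look like "{namespace}String"; compare only the local name.
        if elem.tag.rsplit("}", 1)[-1] != "String":
            continue
        # Ignore surrounding punctuation and letter case when comparing.
        word = elem.get("CONTENT", "").strip(".,;:!?\"'()").casefold()
        if word == wanted:
            hits.append({
                "hpos": float(elem.get("HPOS", 0)),
                "vpos": float(elem.get("VPOS", 0)),
                "width": float(elem.get("WIDTH", 0)),
                "height": float(elem.get("HEIGHT", 0)),
            })
    return hits


hits = find_keyword(SAMPLE_ALTO, "gazette")
print(hits)
```

A viewer could use the returned coordinates to draw a highlight box over the scanned page image, which is roughly what a search interface needs in order to show where a keyword appears.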

The Next Web: COVID-19 made your data set worthless. Now what?

The Next Web: COVID-19 made your data set worthless. Now what? “The COVID-19 pandemic has perplexed data scientists and creators of machine learning tools as the sudden and major change in consumer behavior has made predictions based on historical data nearly useless. There is also very little point in trying to train new prediction models during the crisis, as one simply cannot predict chaos. While these challenges could shake our perception of what artificial intelligence really is (and is not), they might also foster the development of tools that could automatically adjust.”

Nature: Migrating big astronomy data to the cloud

Nature: Migrating big astronomy data to the cloud. “Astronomers typically work by asking observatories for time on a telescope and downloading the resulting data. But as the amount of data that telescopes produce grows, well, astronomically, old methods can’t keep pace. The Vera C. Rubin Observatory in Chile is geared up to collect 20 terabytes per night as part of its 10-year Legacy Survey of Space and Time (LSST), once it becomes operational in 2022. That’s as much as the Sloan Digital Sky Survey — which created the most detailed 3D maps of the Universe so far — collected in total between 2000 and 2010.”

Phys.org: An open-source data platform for researchers studying archaea

Phys.org: An open-source data platform for researchers studying archaea. “To foster scientific exchange and to advance discovery, biologists in the School of Arts & Sciences led by postdoc Stefan Schulze and professor Mecky Pohlschroder have launched the Archaeal Proteome Project (ArcPP), a web-based database to collect and make available datasets to further the work of all scientists interested in archaea, a domain of life composed of microorganisms that can dwell anywhere from deep-sea vents to the human gut.”

Bing Blogs: Extracting Covid-19 insights from Bing search data

Bing Blogs: Extracting Covid-19 insights from Bing search data. “As is true for many other topics, search engine query logs may be able to give insight into the information gaps associated with Covid-19…. We are pleased to announce that we have already made Covid-19 query data freely available on GitHub as the Bing search dataset for Coronavirus intent, with scheduled updates every month over the course of the pandemic. This dataset includes explicit Covid-19 search queries containing terms such as corona, coronavirus, and covid, as well as implicit Covid-19 queries that are used to access the same set of web page search results (using the technique of random walks on the click graph).”
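For readers unfamiliar with the idea, here is a minimal Python sketch of a random walk on a click graph, assuming a toy log of (query, URL) click counts. It is not Bing's actual pipeline, which the quoted post does not describe in detail. The function name `two_step_walk`, the sample queries and the `.example` URLs are all made up for illustration. The walk starts on explicit seed queries, steps to the URLs clicked for them, then steps back to every query that led to those URLs; queries that end up with probability mass but are not seeds are candidate implicit queries.

```python
from collections import defaultdict


def two_step_walk(clicks, seeds):
    """Spread probability from seed queries to URLs and back to queries.

    clicks: {(query, url): click_count}; seeds: set of explicit queries.
    """
    if not seeds:
        return {}
    query_total = defaultdict(float)
    url_total = defaultdict(float)
    for (query, url), count in clicks.items():
        query_total[query] += count
        url_total[url] += count

    # Step 1: each seed query passes its mass to URLs in proportion to clicks.
    url_mass = defaultdict(float)
    start = 1.0 / len(seeds)
    for (query, url), count in clicks.items():
        if query in seeds:
            url_mass[url] += start * count / query_total[query]

    # Step 2: each reached URL passes its mass back to the queries that hit it.
    query_mass = defaultdict(float)
    for (query, url), count in clicks.items():
        if url in url_mass:
            query_mass[query] += url_mass[url] * count / url_total[url]
    return dict(query_mass)


# Made-up click log for illustration only.
clicks = {
    ("coronavirus symptoms", "cdc.example/symptoms"): 8,
    ("loss of smell", "cdc.example/symptoms"): 2,
    ("loss of smell", "ent.example/anosmia"): 2,
    ("pizza near me", "maps.example/pizza"): 5,
}
scores = two_step_walk(clicks, {"coronavirus symptoms"})
implicit = {q: s for q, s in scores.items() if q != "coronavirus symptoms"}
print(implicit)
```

In this toy log, "loss of smell" picks up some mass because it shares a clicked page with the seed query, while "pizza near me" is never reached. A real system would presumably run more steps and apply a score threshold.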

Library of Congress: Selected Datasets: A New Library of Congress Collection

Library of Congress: Selected Datasets: A New Library of Congress Collection. “Friends, data wranglers, lend me your ears; The Library of Congress’ Selected Datasets Collection is now live! You can now download datasets of the Simple English Wikipedia, the Atlas of Historical County Boundaries, sports economic data, half a million emails from Enron, and urban soil lead abatement from this online collection. This initial set of 20 datasets represents the public start of an ongoing collecting program tied to the Library’s plan to support emerging styles of data-driven research, such as text mining and machine learning.”

The Register: MIT apologizes, permanently pulls offline huge dataset that taught AI systems to use racist, misogynistic slurs

The Register: MIT apologizes, permanently pulls offline huge dataset that taught AI systems to use racist, misogynistic slurs. “The training set, built by the university, has been used to teach machine-learning models to automatically identify and list the people and objects depicted in still images. For example, if you show one of these systems a photo of a park, it might tell you about the children, adults, pets, picnic spreads, grass, and trees present in the snap. Thanks to MIT’s cavalier approach when assembling its training set, though, these systems may also label women as whores or bitches, and Black and Asian people with derogatory language. The database also contained close-up pictures of female genitalia labeled with the C-word.”

TechCrunch: Aclima and Google release a new air quality data set for researchers to investigate California pollution

TechCrunch: Aclima and Google release a new air quality data set for researchers to investigate California pollution. “As part of the Collision from Home conference, Aclima chief executive Davida Herzl released a new data set made in conjunction with Google. Free to the scientific community, the data is the culmination of four years of data collection and aggregation resulting in 42 million air quality measurements throughout the state of California.”

Centers for Medicare & Medicaid Services: Medicare COVID-19 Data Release Blog

Centers for Medicare & Medicaid Services: Medicare COVID-19 Data Release Blog. “Today, the Centers for Medicare & Medicaid Services (CMS) released preliminary data on COVID-19 derived from Medicare claims. The data provides a highly instructive picture of the impact of COVID-19 on the Medicare population, further confirming a number of long understood patterns in the disease such as the elevated risk for seniors with underlying health conditions.”

CNET: Your face mask selfies could be training the next facial recognition tool

CNET: Your face mask selfies could be training the next facial recognition tool. “Your face mask selfies aren’t just getting seen by your friends and family — they’re also getting collected by researchers looking to use them to improve facial recognition algorithms. CNET found thousands of face-masked selfies up for grabs in public data sets, with pictures taken directly from Instagram.”

Berkeley Haas: Open-source smartphone database offers a new tool for tracking coronavirus exposure

Berkeley Haas: Open-source smartphone database offers a new tool for tracking coronavirus exposure. “The Covid-19 Exposure Indices, created by Berkeley Haas Asst. Prof. Victor Couture and researchers from Yale, Princeton, the University of Chicago, and the University of Pennsylvania in collaboration with location data company PlaceIQ, is aimed at academic investigators studying the spread of the pandemic. The data sets allow researchers to visualize how people can potentially be exposed to those infected with the virus, based on cell-phone movements to and from businesses and other locations where a great deal of the exposure happens.”

FierceBiotech: Life science companies combine to form COVID-19 research database

FierceBiotech: Life science companies combine to form COVID-19 research database. “A group of major CRO, life science, data analytics, publishing and healthcare companies joined forces to release a pro bono research database to build up and integrate a central hub on the latest data out for COVID-19. On the technical side, it’s a secure repository of HIPAA-compliant, de-identified and limited patient-level data sets that will be ‘made available to public health and policy researchers to extract insights to help combat the COVID-19 pandemic,’ according to the group.”

Analytics India: A Beginner’s Guide To Using Google Colab

Analytics India: A Beginner’s Guide To Using Google Colab. “We are all familiar with the pop-up alerts of ‘memory-error’ while trying to work with a large dataset of machine learning (ML) or deep learning algorithms on Jupyter notebooks. On top of that, owning a decent GPU from an existing cloud provider has remained out of bounds due to the financial investment it entails. The machines at our disposal, unfortunately, do not have the unlimited computational ability. But the wait is finally over as we can now build large ML models without selling our properties. The credit goes to Google for launching the Colab – an online platform that allows anyone to train models with large datasets, absolutely free.”

EdScoop: Researchers publish social media data early for pandemic response

EdScoop: Researchers publish social media data early for pandemic response. “To help represent the spread and impact of the coronavirus pandemic, researchers at the Georgia State University on Monday released a data set of more than 140 million tweets related to COVID-19 as a resource for the global research community. The work is part of research that collects and tracks social media chatter to understand mobility patterns during natural disasters, but researchers decided to release their data before finalizing their own results to assist other researchers studying the current pandemic.”

Los Angeles Times: To aid coronavirus fight, The Times releases database of California cases

Los Angeles Times: To aid coronavirus fight, The Times releases database of California cases. “In an effort to aid scientists and researchers in the fight against COVID-19, The Times has released its database of California coronavirus cases to the public. To follow the virus’ spread, The Times is conducting an independent survey of dozens of local health agencies across the state. The effort, run continually throughout the day, supplies the underlying data for this site’s coronavirus tracker.”