From Avocet to Zebra Finch: big data study finds more than 50 billion birds in the world (Phys.org)

Phys.org: From Avocet to Zebra Finch: big data study finds more than 50 billion birds in the world. “There are roughly 50 billion individual birds in the world, a new big data study by UNSW Sydney suggests—about six birds for every human on the planet. The study—which bases its findings on citizen science observations and detailed algorithms—estimates how many birds belong to 9700 different bird species, including flightless birds like emus and penguins.”

NARA: NARA Datasets on the AWS Registry of Open Data

NARA: NARA Datasets on the AWS Registry of Open Data. “The metadata index for the 1940 Census dataset is 251 megabytes, and all of the 3.7 million images from the population schedules, the enumeration district maps, and the enumeration district descriptions total over 15 terabytes. This dataset reflects the 1940 Census records that are also available on NARA’s 1940 Census website and in the National Archives Catalog.”
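
If you want to poke at the holdings directly, here is a minimal Python sketch that lists a few objects from the public bucket with boto3. The bucket name nara-1940-census is an assumption for illustration; the Registry of Open Data entry lists the actual bucket and prefixes. Open-data buckets can generally be read without AWS credentials:

```python
# Minimal sketch: list a few objects from NARA's 1940 Census open data bucket.
# ASSUMPTION: the bucket name "nara-1940-census" is illustrative only; see the
# AWS Registry of Open Data entry for the real bucket name and prefixes.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Public open-data buckets allow anonymous (unsigned) reads.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

resp = s3.list_objects_v2(Bucket="nara-1940-census", MaxKeys=10)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```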

VentureBeat: MIT study finds ‘systematic’ labeling errors in popular AI benchmark datasets

VentureBeat: MIT study finds ‘systematic’ labeling errors in popular AI benchmark datasets. “The field of AI and machine learning is arguably built on the shoulders of a few hundred papers, many of which draw conclusions using data from a subset of public datasets. Large, labeled corpora have been critical to the success of AI in domains ranging from image classification to audio classification. That’s because their annotations expose comprehensible patterns to machine learning algorithms, in effect telling machines what to look for in future datasets so they’re able to make predictions. But while labeled data is usually equated with ground truth, datasets can — and do — contain errors.”
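
The technique behind the paper is confident learning, which flags examples where a model's out-of-sample predicted probabilities strongly disagree with the given label. Here is a minimal sketch using the open-source cleanlab library, which implements it; the toy arrays are mine, and the study's exact pipeline may differ:

```python
# Minimal sketch: flag likely label errors with confident learning (cleanlab).
# The toy labels/probabilities below are illustrative; in practice pred_probs
# come from a classifier's cross-validated predictions on the dataset.
import numpy as np
from cleanlab.filter import find_label_issues

labels = np.array([0, 0, 1, 1, 1])   # given (possibly noisy) labels
pred_probs = np.array([              # out-of-sample class probabilities
    [0.9, 0.1],
    [0.8, 0.2],
    [0.2, 0.8],
    [0.9, 0.1],                      # model strongly disagrees with label 1
    [0.3, 0.7],
])

suspects = find_label_issues(labels=labels, pred_probs=pred_probs,
                             return_indices_ranked_by="self_confidence")
print(suspects)  # indices of the examples most likely to be mislabeled
```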

Cal Poly Pomona: Signs of Habitability in Venus’ Clouds Found Using 1978 Probe Data

Cal Poly Pomona: Signs of Habitability in Venus’ Clouds Found Using 1978 Probe Data. “Signs of biologically relevant chemicals, including phosphine, have been found in the clouds of Venus by a team led by Rakesh Mogul, professor of biological chemistry at Cal Poly Pomona. The evidence was discovered in archived data from NASA’s Pioneer Venus Multiprobe, which arrived at Venus and collected data almost 42 years ago.” We stan archived data.

Nature: Large socio-economic, geographic and demographic disparities exist in exposure to school closures

Nature: Large socio-economic, geographic and demographic disparities exist in exposure to school closures. “This study introduces and analyses a U.S. School Closure and Distance Learning Database that tracks in-person visits to the vast majority of K–12 public schools in the United States from January 2019 through December 2020. Specifically, we measure year-over-year change in visits to each school throughout 2020 to determine whether the school is engaged in distance learning after the onset of the pandemic.”
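
The core measurement, year-over-year change in visits per school, is straightforward to reproduce. A minimal pandas sketch, with hypothetical column names (the database's actual schema may differ):

```python
# Minimal sketch: year-over-year change in visits per school and month.
# Column names are hypothetical, not the database's actual schema.
import pandas as pd

visits = pd.DataFrame({
    "school_id": ["A", "A", "B", "B"],
    "month":     [10, 10, 10, 10],
    "year":      [2019, 2020, 2019, 2020],
    "visits":    [1200, 150, 900, 880],
})

wide = visits.pivot_table(index=["school_id", "month"],
                          columns="year", values="visits")
wide["yoy_change"] = (wide[2020] - wide[2019]) / wide[2019]
print(wide)  # a steep drop (school A here) is consistent with distance learning
```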

Internet Archive Blog: Early Web Datasets & Researcher Opportunities

Internet Archive Blog: Early Web Datasets & Researcher Opportunities. “In July, we announced our partnership with the Archives Unleashed project as part of our ongoing effort to make new services available for scholars and students to study the archived web…. As part of our partnership, we are releasing a series of publicly available datasets created from archived web collections. Alongside these efforts, the project is also launching a Cohort Program providing funding and technical support for research teams interested in studying web archive collections.”

Yale: Yale study shows limitations of applying artificial intelligence to registry databases

Yale: Yale study shows limitations of applying artificial intelligence to registry databases. “Artificial intelligence will play a pivotal role in the future of health care, medical experts say, but so far, the industry has been unable to fully leverage this tool. A Yale study has illuminated the limitations of these analytics when applied to traditional medical databases — suggesting that the key to unlocking their value may be in the way datasets are prepared.”

Phys.org: New dataset opens Estonian soil information for versatile use

Phys.org: New dataset opens Estonian soil information for versatile use. “A comprehensive database of Estonian soils and a map application have been completed in cooperation with researchers of the University of Tartu and the Estonian University of Life Sciences. The database makes Estonian soil information easily accessible and can be used at scales ranging from the local farm to national-level big data statistical analysis and machine-learning models.”

New York University: NYU Professor Creates COVID-19 Dashboard to Compare Country and State Data

New York University: NYU Professor Creates COVID-19 Dashboard to Compare Country and State Data. “A new online dashboard, created by NYU Professor Alexej Jerschow, brings together COVID-19 data from U.S. states and countries around the world to compare cases, deaths, vaccines, and testing in a visual, user-friendly format. The tool also integrates a range of policies governments have implemented to limit the spread of COVID-19—including school closings, stay-at-home orders, and mask mandates—in an effort to compare policy responses with COVID-19 outcomes.”

CNET: Twitter wants to make it easier for researchers to analyze tweets

CNET: Twitter wants to make it easier for researchers to analyze tweets. “Researchers who qualify will get free access to public tweets that are older than a week and be able to retrieve a higher amount of data every month. The company said it’s improving the ways researchers filter the data so they can get more precise information from public accounts. The features are part of a new version of Twitter’s application programming interface launched last year that gives developers access to the site’s public data.”
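
The academic track's headline feature is full-archive search, exposed at the v2 endpoint /2/tweets/search/all. A minimal sketch of a query; the bearer token is a placeholder, and access requires an approved academic researcher account:

```python
# Minimal sketch: query Twitter API v2 full-archive search (academic track).
# BEARER_TOKEN is a placeholder; the query below is illustrative only.
import requests

BEARER_TOKEN = "YOUR_BEARER_TOKEN"
url = "https://api.twitter.com/2/tweets/search/all"
params = {
    "query": "from:NASA -is:retweet",
    "start_time": "2020-01-01T00:00:00Z",
    "max_results": 10,
}
headers = {"Authorization": f"Bearer {BEARER_TOKEN}"}

resp = requests.get(url, headers=headers, params=params)
resp.raise_for_status()
for tweet in resp.json().get("data", []):
    print(tweet["id"], tweet["text"][:80])
```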

BBC: How to investigate a firm with 60 million documents

BBC: How to investigate a firm with 60 million documents. “Ms [Yousr] Khalil and a 70-strong team faced an ocean of files, transaction data and emails spanning worldwide activities, most of them entirely innocuous. So how did they plot a course through? Artificial intelligence (AI) and a bespoke computer unlike any PC you have ever worked on played a big part in this epic data trawl. A daunting collection of 500 million documents and transactions had to be whittled down.”

Jacobs Technion-Cornell Institute: VoterFraud2020

Jacobs Technion-Cornell Institute: VoterFraud2020. “We are making publicly available VoterFraud2020, a multi-modal Twitter dataset with 7.6M tweets and 25.6M retweets from 2.6M users that includes key phrases and hashtags related to voter fraud claims between October 23rd and December 16th. The dataset also includes the full set of links and YouTube videos shared in these tweets, with data about their spread in different Twitter sub-communities.”
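
Tweet datasets like this are usually distributed as tweet IDs that researchers re-fetch (“hydrate”) themselves, per Twitter's terms of service. A minimal sketch of preparing the IDs for hydration; the file and column names are hypothetical:

```python
# Minimal sketch: extract tweet IDs for hydration.
# The file and column names are hypothetical; check the dataset's README.
import pandas as pd

ids = pd.read_csv("voterfraud2020_tweet_ids.csv")
ids["tweet_id"].to_csv("ids.txt", index=False, header=False)
# Hydrate with a tool such as twarc:
#   twarc2 hydrate ids.txt tweets.jsonl
```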

Pacific Northwest National Laboratory: New Machine Learning Tool Tracks Urban Traffic Congestion

Pacific Northwest National Laboratory: New Machine Learning Tool Tracks Urban Traffic Congestion. “Currently, publicly available traffic information at the street level is sparse and incomplete. Traffic engineers generally have relied on isolated traffic counts, collision statistics and speed data to determine roadway conditions. The new tool uses traffic datasets collected from Uber drivers and other publicly available traffic sensor data to map street-level traffic flow over time. It creates a big picture of city traffic using machine learning tools and the computing resources available at a national laboratory.”

US Equal Employment Opportunity Commission: EEOC Launches New Data Tool to Track Employment Trends

US Equal Employment Opportunity Commission: EEOC Launches New Data Tool to Track Employment Trends. “EEOC Explore allows users to analyze aggregate data associated with more than 56 million employees and 73,000 employers nationwide. The user-friendly tool enables stakeholders to explore and compare data trends across a number of categories, including location, sex, race and ethnicity, and industry sector without the need for experience in computer programming or statistical analysis.”