South China Morning Post: China makes ‘world’s largest satellite image database’ to train AI better

South China Morning Post: China makes ‘world’s largest satellite image database’ to train AI better. “A satellite imaging database containing detailed information of more than a million locations has been launched in China to help reduce artificial intelligence’s errors when identifying objects from space, the Chinese Academy of Sciences said on Wednesday. The fine-grained object recognition in high-resolution remote sensing imagery (FAIR1M) database was tens or even hundreds of times larger than similar data sets used in other countries, it said.”

Johns Hopkins University: Next-generation database will democratize access to massive amounts of turbulence data

Johns Hopkins University: Next-generation database will democratize access to massive amounts of turbulence data. “Led by Johns Hopkins University, a team of 10 researchers from three institutions is using a new $4 million, five-year grant from the National Science Foundation to create a next-generation turbulence database that will enable groundbreaking research in engineering and the atmospheric and ocean sciences. This powerful tool will let researchers from all over the world access data from some of the largest world-class numerical simulations of turbulent flows. Such simulations are very costly and their outputs are traditionally very difficult to share among researchers due to the data sets’ massive size.”

University of Iowa: The Program Era Project: Limning the depths of the Iowa Writers’ Workshop’s literary influence

University of Iowa: The Program Era Project: Limning the depths of the Iowa Writers’ Workshop’s literary influence. “The Program Era Project, or PEP, uses data visualization and other computer-assisted methods to track the aesthetic and cultural influence of the Workshop since its founding in 1936. In particular, writers affiliated with the Workshop, both as alumni and/or professors, have gone on to found or teach at many other creative writing programs around the nation…. The PEP, supported by the Digital Scholarship and Publishing Studio at UI Libraries, has compiled extensive datasets that track those networks of Workshop-affiliated writers.”

NARA: NARA Datasets on the AWS Registry of Open Data

NARA: NARA Datasets on the AWS Registry of Open Data. “The metadata index for the 1940 Census dataset is 251 megabytes, and all of the 3.7 million images from the population schedules, the enumeration district maps, and the enumeration district descriptions total over 15 terabytes. This dataset reflects the 1940 Census records that are also available on NARA’s 1940 Census website and in the National Archives Catalog.”

PubMed: #Election2020: the first public Twitter dataset on the 2020 US Presidential election

PubMed: #Election2020: the first public Twitter dataset on the 2020 US Presidential election. “The study of online chatter is paramount, especially in the wake of important voting events like the recent November 3, 2020 U.S. Presidential election and the inauguration on January 21, 2021. Limited access to social media data is often the primary obstacle that limits our abilities to study and understand online political discourse. To mitigate this impediment and empower the Computational Social Science research community, we are publicly releasing a massive-scale, longitudinal dataset of U.S. politics- and election-related tweets. This multilingual dataset encompasses over 1.2 billion tweets and tracks all salient U.S. political trends, actors, and events from 2019 to the time of this writing.”

University of Warwick: World’s largest public scenario database for testing and assuring safe Autonomous Vehicle deployments

University of Warwick: World’s largest public scenario database for testing and assuring safe Autonomous Vehicle deployments. “The Safety Pool™ Scenario Database, the largest public repository of scenarios for testing autonomous vehicles in the world, has been launched today by WMG at the University of Warwick, and Deepen AI. The database provides a diverse set of scenarios in different operational design domains (ODDs i.e. operating conditions) that can be leveraged by governments, industry and academia alike to test and benchmark Automated Driving Systems (ADSs) and use insights to inform policy and regulatory guidelines.”

VentureBeat: MIT study finds ‘systematic’ labeling errors in popular AI benchmark datasets

VentureBeat: MIT study finds ‘systematic’ labeling errors in popular AI benchmark datasets. “The field of AI and machine learning is arguably built on the shoulders of a few hundred papers, many of which draw conclusions using data from a subset of public datasets. Large, labeled corpora have been critical to the success of AI in domains ranging from image classification to audio classification. That’s because their annotations expose comprehensible patterns to machine learning algorithms, in effect telling machines what to look for in future datasets so they’re able to make predictions. But while labeled data is usually equated with ground truth, datasets can — and do — contain errors.”

Health Analytics: NIH Funds National Project to Promote COVID-19 Data Sharing

Health Analytics: NIH Funds National Project to Promote COVID-19 Data Sharing. “UC hospitals have received a $500,000 grant from NIH to enable COVID-19 data sharing on a national scale, allowing collaborations among researchers, providers, and patients. Led by the University of California, Irvine (UCI), leaders will manage a transfer of UC data on COVID-19 cases into the National COVID Cohort Collaborative’s (N3C) centralized data resource at the NIH’s National Center for Advancing Translational Sciences.”

Scientific Data: AI-assisted tracking of worldwide non-pharmaceutical interventions for COVID-19

Scientific Data: AI-assisted tracking of worldwide non-pharmaceutical interventions for COVID-19. “We present the Worldwide Non-pharmaceutical Interventions Tracker for COVID-19 (WNTRAC), a comprehensive dataset consisting of over 6,000 NPIs implemented worldwide since the start of the pandemic. WNTRAC covers NPIs implemented across 261 countries and territories, and classifies NPIs into a taxonomy of 16 NPI types. NPIs are automatically extracted daily from Wikipedia articles using natural language processing techniques and then manually validated to ensure accuracy and veracity.”

Vanderbilt: Vanderbilt scientists sketch rare star system using more than a century of astronomical observations

Vanderbilt: Vanderbilt scientists sketch rare star system using more than a century of astronomical observations. “Vanderbilt astronomers have painted their best picture yet of an RV Tauri variable—a rare type of stellar binary, in which two stars orbit each other within a sprawling disk of dust. To sketch its characteristics, the scientists mined a 130-year dataset that spans the widest range of light yet collected for one of these systems, from radio waves to X-rays.”

University of Virginia: Why Everything We Thought We Knew About Corporate Governance Is Wrong

University of Virginia: Why Everything We Thought We Knew About Corporate Governance Is Wrong. “Nearly two decades of influential scholarship on how corporations are governed and valued is based on bad data, according to new research co-authored by Cathy Hwang of the University of Virginia School of Law. The paper, ‘Cleaning Corporate Governance,’ reveals that an index cited thousands of times by scholars to measure corporate governance and shareholder rights is riddled with errors. Written by Hwang, Columbia Law School postdoctoral fellow Jens Frankenreiter, Wisconsin law professor Yaron Nili and Columbia law professor Eric L. Talley, the new research also offers a dataset with pilot data to rectify the problem, creating a clearer picture about the power dynamics that control corporations and what that might imply in terms of profit potential, valuation and long-term prospects, among other business factors.”

Harvard Business Review: 4 Ways to Democratize Data Science in Your Organization

Harvard Business Review: 4 Ways to Democratize Data Science in Your Organization. “Many organizations have begun their data science journeys by starting ‘centers of excellence,’ hiring the best data scientists they can and focusing their efforts where there is lots of data. In some respects, this makes good sense — after all, they don’t want to be late to the artificial intelligence or machine learning party. Plus, data scientists want to show off their latest tools. But is this the best way to deploy this rare resource? For most companies, we think it unlikely. Rather, we advise companies to see data science both more strategically and broadly.”

DataHorde: YouTube Community Contributions Archive Now Available: A Look at the Stats

DataHorde: YouTube Community Contributions Archive Now Available: A Look at the Stats. “The YouTube Community Contributions Archive is now available on the Internet Archive! You can download the entire collection, or simply search for and download files for a particular video. The collection is composed of 4096 ZIP archives which contain 406,394 folders and 1,361,998 files. Compressed, the collection is 3.83GB, and once decompressed, the collection is 9.46GB.”
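
For a rough sense of scale, the figures quoted above can be turned into a few derived ratios. This is a minimal sketch that uses only the numbers in the quote; the constant names and the `collection_stats` helper are hypothetical, introduced here for illustration, and the averages say nothing about how files are actually distributed across archives.

```python
# Figures quoted from the DataHorde post; everything derived below is plain arithmetic.
ZIP_ARCHIVES = 4096
FOLDERS = 406_394
FILES = 1_361_998
COMPRESSED_GB = 3.83
DECOMPRESSED_GB = 9.46


def collection_stats():
    """Return average-based ratios derived from the quoted totals (hypothetical helper)."""
    return {
        # How many times larger the collection is once unpacked.
        "compression_ratio": DECOMPRESSED_GB / COMPRESSED_GB,
        # Mean number of files and folders per ZIP archive.
        "files_per_archive": FILES / ZIP_ARCHIVES,
        "folders_per_archive": FOLDERS / ZIP_ARCHIVES,
        # Mean number of files per folder.
        "files_per_folder": FILES / FOLDERS,
    }


if __name__ == "__main__":
    for name, value in collection_stats().items():
        print(f"{name}: {value:.2f}")
```

On these numbers, the collection expands to roughly two and a half times its compressed size, with an average of a little over three files per folder.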

Jacobs Technion-Cornell Institute: VoterFraud2020

Jacobs Technion-Cornell Institute: VoterFraud2020. “We are making publicly available VoterFraud2020, a multi-modal Twitter dataset with 7.6M tweets and 25.6M retweets from 2.6M users that includes key phrases and hashtags related to voter fraud claims between October 23rd and December 16th. The dataset also includes the full set of links and YouTube videos shared in these tweets, with data about their spread in different Twitter sub-communities.”