Data Center Dynamics: Arctic World Archive adds latest data deposit in Svalbard facility

Data Center Dynamics: Arctic World Archive adds latest data deposit in Svalbard facility. “In a bi-annual ceremony (that was delayed by the pandemic), Piql added reels of data from the Norwegian Armed Forces Museum, Natural History Museum, Guttormsgaards Arkiv, The Saga Heritage Foundation, Tronrud Engineering, National Széchényi Library (National Library of Hungary), Indira Gandhi National Centre for Arts and Ministry of Culture in India, Sapio Analytics, Artemis Arts, and others. It joins data from the National Archive of Brazil, Mexico, and a huge deposit from GitHub.”

BioSpectrum Asia: Korea to establish national digital library on health and genome data by 2028

BioSpectrum Asia: Korea to establish national digital library on health and genome data by 2028. “The second pilot project will analyze the genetic makeup of 12,500 donated DNA samples from Korean patients living with a rare disease. Over the next year, the resulting data will be used by the Illumina-backed consortium to prepare for the main project in analyzing and comparing the genes of 1 million Koreans to advance the country’s medical technology and improve future public health.”

US Department of Energy: DOE invests $13.7 million for research in data reduction for science

US Department of Energy: DOE invests $13.7 million for research in data reduction for science. “Today, the U.S. Department of Energy (DOE) announced $13.7 million in funding for nine research projects that will advance the state of the art in computer science and applied mathematics. The projects – led by five universities and five DOE National Laboratories across eight states – will address the challenges of moving, storing, and processing the massive data sets produced by scientific observatories, experimental facilities, and supercomputers, accelerating the pace of scientific discoveries.”

The Register: We spoke to a Stanford prof on the tech and social impact of AI’s powerful, emerging ‘foundation models’

The Register: We spoke to a Stanford prof on the tech and social impact of AI’s powerful, emerging ‘foundation models’. “Typically, these models are giant neural networks made up of millions and billions of parameters, trained on massive amounts of data and later fine-tuned for specific tasks. For example, OpenAI’s enormous GPT-3 model is known for generating prose from prompts, though it can be adapted to translate between languages and output source code for developers. These models – drawing from vast datasets – can therefore sit at the heart of powerful tools that may disrupt business and industries, life and work. Yet right now they’re difficult to understand and control; they are imperfect; and they exhibit all sorts of biases that could harm us. And it has already been demonstrated that all of these problems can grow with model size.”
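
If you want to see the "one pre-trained model, many tasks" pattern in miniature, here is a hedged Python sketch using the openly downloadable GPT-2 as a stand-in for GPT-3 (which is not publicly available) via the Hugging Face transformers library. The model name and generation parameters are illustrative choices, not anything from the article.

```python
# A minimal sketch of prompting a pre-trained foundation model, assuming the
# Hugging Face `transformers` package is installed. GPT-2 stands in for the
# much larger GPT-3 described in the article.
from transformers import pipeline

# One pre-trained model can back many task-specific pipelines; text
# generation is the base capability GPT-3 is known for.
generator = pipeline("text-generation", model="gpt2")

print(generator("Foundation models are", max_new_tokens=20,
                num_return_sequences=1))
```

The same checkpoint could then be fine-tuned on a narrower dataset for a downstream task, which is the adaptation step the article describes.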

Scientific Data: The Upworthy Research Archive, a time series of 32,487 experiments in U.S. media

Scientific Data: The Upworthy Research Archive, a time series of 32,487 experiments in U.S. media. “This archive records the stimuli and outcome for every A/B test fielded by Upworthy between January 24, 2013 and April 30, 2015. In total, the archive includes 32,487 experiments, 150,817 experiment arms, and 538,272,878 participant assignments. The open access dataset is organized to support exploratory and confirmatory research, as well as meta-scientific research on ways that scientists make use of the archive.”
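
For anyone who wants to poke at data shaped like this, here is a hedged pandas sketch computing per-arm click-through rates within each A/B test. The filename is a placeholder, and the column names (clickability_test_id, impressions, clicks) follow the archive's published "packages" file, but treat them as assumptions and check your own download.

```python
# A minimal sketch of exploratory analysis on the Upworthy archive. The path
# and column names are assumptions based on the published dataset.
import pandas as pd

packages = pd.read_csv("upworthy-archive-packages.csv")  # placeholder path

# Each row is one experiment arm (a headline/image "package");
# clickability_test_id groups arms into their parent A/B test.
packages["ctr"] = packages["clicks"] / packages["impressions"]
per_test = (packages.groupby("clickability_test_id")["ctr"]
                    .agg(["mean", "max", "count"]))
print(per_test.head())
```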

Analytics India: Tech Behind Storywrangler, The Analytics Tool Crawling Billions Of Social Media Posts

Analytics India: Tech Behind Storywrangler, The Analytics Tool Crawling Billions Of Social Media Posts. “In a research paper, ‘Storywrangler: A massive exploratorium for sociolinguistic, cultural, socioeconomic, and political timelines using Twitter,’ researchers from the University of Vermont, in collaboration with Charles River Analytics, and MassMutual Data Science, detailed the working of a tool that curated over 150 billion tweets containing 1 trillion 1-grams from 2008 to 2021.”
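
For the curious: a 1-gram is just a single token. Here is a toy Python sketch of the core counting idea; it is nothing like Storywrangler's actual pipeline, just an illustration of the unit it aggregates at Twitter scale.

```python
# Toy 1-gram counting: split each post into single tokens and tally them.
from collections import Counter

tweets = [
    "big data meets social media",
    "social media data at scale",
]

counts = Counter(token for tweet in tweets
                 for token in tweet.lower().split())
print(counts.most_common(3))  # [('data', 2), ('social', 2), ('media', 2)]
```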

Google Blog: Using AI to map Africa’s buildings

Google Blog: Using AI to map Africa’s buildings. “Google’s Open Buildings is a new open access dataset containing the locations and geometry of buildings across most of Africa. From Lagos’ Makoko settlement to Dodoma’s refugee camps, millions of previously invisible buildings have popped up in our dataset. This improved building data helps refine the understanding of where people and communities live, providing actionable information for state and non-state actors looking to provide services from sanitation to education and vaccination.”
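
Here is a hedged sketch of loading one CSV tile from a dataset like Open Buildings, in Python with pandas and shapely. The column names (latitude, longitude, area_in_meters, confidence, geometry) match the dataset's documented schema at the time of writing, but verify against the current release; the file path and confidence threshold are placeholders.

```python
# A minimal sketch of reading one Open Buildings CSV tile. Column names are
# assumptions based on the dataset's documentation.
import pandas as pd
from shapely import wkt  # pip install shapely

tile = pd.read_csv("open_buildings_tile.csv")  # placeholder path

# Keep higher-confidence detections and parse footprint polygons from WKT.
confident = tile[tile["confidence"] >= 0.75].copy()
confident["footprint"] = confident["geometry"].apply(wkt.loads)

print(len(confident), "buildings,",
      confident["area_in_meters"].sum(), "m^2 total footprint")
```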

The Conversation: Low- and middle-income countries lack access to big data analysis – here’s how to fill the gap

The Conversation: Low- and middle-income countries lack access to big data analysis – here’s how to fill the gap. “We are two mathematicians at the University of Colorado Boulder and are part of a project called the Laboratory for Interdisciplinary Statistical Analysis that is working to develop statistical infrastructure across the world. The goal of the program is to help build data science infrastructure in developing nations. In 10 countries and counting, we have started ‘stat labs’ – academic centers that train young statisticians to collaborate on important local statistics projects.”

Phys.org: From Avocet to Zebra Finch: big data study finds more than 50 billion birds in the world

Phys.org: From Avocet to Zebra Finch: big data study finds more than 50 billion birds in the world. “There are roughly 50 billion individual birds in the world, a new big data study by UNSW Sydney suggests—about six birds for every human on the planet. The study—which bases its findings on citizen science observations and detailed algorithms—estimates how many birds belong to 9700 different bird species, including flightless birds like emus and penguins.”

NARA: NARA Datasets on the AWS Registry of Open Data

NARA: NARA Datasets on the AWS Registry of Open Data. “The metadata index for the 1940 Census dataset is 251 megabytes, and all of the 3.7 million images from the population schedules, the enumeration district maps, and the enumeration district descriptions total over 15 terabytes. This dataset reflects the 1940 Census records that are also available on NARA’s 1940 Census website and in the National Archives Catalog.”
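
Datasets on the AWS Registry of Open Data generally live in public S3 buckets that allow anonymous access. Here is a hedged boto3 sketch of listing objects; the bucket name below is a placeholder, so look up the real one on the dataset's registry page before running.

```python
# A minimal sketch of browsing an AWS Registry of Open Data bucket
# anonymously with boto3. The bucket name is a placeholder.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Open-data buckets permit unsigned (anonymous) requests.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

resp = s3.list_objects_v2(Bucket="nara-1940-census-example",  # placeholder
                          MaxKeys=10)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```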

VentureBeat: MIT study finds ‘systematic’ labeling errors in popular AI benchmark datasets

VentureBeat: MIT study finds ‘systematic’ labeling errors in popular AI benchmark datasets. “The field of AI and machine learning is arguably built on the shoulders of a few hundred papers, many of which draw conclusions using data from a subset of public datasets. Large, labeled corpora have been critical to the success of AI in domains ranging from image classification to audio classification. That’s because their annotations expose comprehensible patterns to machine learning algorithms, in effect telling machines what to look for in future datasets so they’re able to make predictions. But while labeled data is usually equated with ground truth, datasets can — and do — contain errors.”
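
The study's approach, confident learning, is implemented in the authors' open-source cleanlab package. Here is a hedged sketch of flagging suspect labels with it; the API shown is cleanlab 2.x (treat exact names as assumptions if your version differs), and the toy numbers are ours.

```python
# A minimal sketch of confident learning with cleanlab: compare a dataset's
# labels against out-of-sample predicted probabilities from any classifier
# and rank examples most likely to be mislabeled.
import numpy as np
from cleanlab.filter import find_label_issues  # pip install cleanlab

labels = np.array([0, 1, 1, 0, 1, 0])          # possibly noisy labels
pred_probs = np.array([[0.90, 0.10],           # agrees with label 0
                       [0.20, 0.80],           # agrees with label 1
                       [0.95, 0.05],           # strongly disagrees with 1
                       [0.70, 0.30],           # agrees with label 0
                       [0.10, 0.90],           # agrees with label 1
                       [0.60, 0.40]])          # weakly agrees with label 0

ranked = find_label_issues(labels=labels, pred_probs=pred_probs,
                           return_indices_ranked_by="self_confidence")
print(ranked)  # indices of likely mislabeled examples, worst first
```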

Cal Poly Pomona: Signs of Habitability in Venus’ Clouds Found Using 1978 Probe Data

Cal Poly Pomona: Signs of Habitability in Venus’ Clouds Found Using 1978 Probe Data. “Signs of biologically relevant chemicals, including phosphine, have been found in the clouds of Venus by a team led by Rakesh Mogul, professor of biological chemistry at Cal Poly Pomona. The signs were discovered in archived data from NASA’s Pioneer Venus Multiprobe, which arrived at Venus and collected data almost 42 years ago.” We stan archived data.