Eos: Deluges of Data Are Changing Astronomical Science

Eos: Deluges of Data Are Changing Astronomical Science. “For scientists who study the cosmos, hard-to-grasp numbers are par for the course. But the sheer quantity of data flowing from modern research telescopes, to say nothing of the promised deluges of upcoming astronomical surveys, is astounding even astronomers. That embarrassment of riches has necessitated some serious data wrangling by myself and my colleagues, and it’s changing astronomical science forever.”

Federal Reserve Bank of New York: Insights from Newly Digitized Banking Data, 1867-1904

Federal Reserve Bank of New York: Insights from Newly Digitized Banking Data, 1867-1904. “Call reports—regulatory filings in which commercial banks report their assets, liabilities, income, and other information—are one of the most-used data sources in banking and finance. Though call reports were collected as far back as 1867, the underlying data are only easily accessible for the recent past: the mid-1980s onward in the case of the FDIC’s FFIEC call reports. To help researchers look farther back in time, we’ve begun creating a complete digital record of this ‘missing’ call report data.”

Scientific Data: Ten lessons for data sharing with a data commons

Scientific Data: Ten lessons for data sharing with a data commons. “A data commons is a cloud-based data platform with a governance structure that allows a community to manage, analyze and share its data. Data commons provide a research community with the ability to manage and analyze large datasets using the elastic scalability provided by cloud computing and to share data securely and compliantly, and, in this way, accelerate the pace of research. Over the past decade, a number of data commons have been developed and we discuss some of the lessons learned from this effort.”

Google Research Blog: Datasets at your fingertips in Google Search

Google Research Blog: Datasets at your fingertips in Google Search. “To facilitate discovery of content with this level of statistical detail and better distill this information from across the web, Google now makes it easier to search for datasets. You can click on any of the top three results (see below) to get to the dataset page or you can explore further by clicking ‘More datasets.’”

Bureau of Transportation Statistics: BTS Updates Datasets to National Transportation Atlas Database

Bureau of Transportation Statistics: BTS Updates Datasets to National Transportation Atlas Database. “The U.S. Department of Transportation’s Bureau of Transportation Statistics today released its winter 2023 update to the National Transportation Atlas Database (NTAD), a set of nationwide geographic databases of transportation facilities, networks, and associated infrastructure.”

Search Engine Journal: How to Block ChatGPT From Using Your Website Content

Search Engine Journal: How to Block ChatGPT From Using Your Website Content. “There is concern about the lack of an easy way to opt out of having one’s content used to train large language models (LLMs) like ChatGPT. There is a way to do it, but it’s neither straightforward nor guaranteed to work.” Unlike a lot of the “how to” articles I index, this one is fairly speculative. Useful with lots of good information, but speculative.
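One commonly suggested route, which may be what the article has in mind, is a robots.txt rule; a minimal sketch, assuming the goal is to block Common Crawl’s CCBot crawler (Common Crawl being one of the datasets reportedly used to train these models), would look like this:

User-agent: CCBot
Disallow: /

Whether a rule like that actually keeps already-published content out of future training runs is exactly the part that remains speculative.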

The Distant Librarian: Jeremy Singer-Vine’s Data Liberation Project

The Distant Librarian: Jeremy Singer-Vine’s Data Liberation Project. “Not to be confused with Canada’s Data Liberation Initiative, Jeremy Singer-Vine is spending his time on the Data Liberation Project, ‘an initiative to identify, obtain, reformat, clean, document, publish, and disseminate government datasets of public interest.’ There’s not yet a lot to look at there, but there’s plenty in the pipeline.”

Scientific Data: Caravan – A global community dataset for large-sample hydrology

Scientific Data: Caravan – A global community dataset for large-sample hydrology. “This paper introduces a dataset called Caravan (a series of CAMELS [Catchment Attributes and Meteorology for Large-sample Studies]) that standardizes and aggregates seven existing large-sample hydrology datasets. Caravan includes meteorological forcing data, streamflow data, and static catchment attributes (e.g., geophysical, sociological, climatological) for 6830 catchments. Most importantly, Caravan is both a dataset and open-source software that allows members of the hydrology community to extend the dataset to new locations by extracting forcing data and catchment attributes in the cloud.”

Data Descriptor: A crowdsourced dataset of aerial images with annotated solar photovoltaic arrays and installation metadata

Data Descriptor: A crowdsourced dataset of aerial images with annotated solar photovoltaic arrays and installation metadata. “Overhead imagery is increasingly being used to improve the knowledge of rooftop PV installations with machine learning models capable of automatically mapping these installations. However, these models cannot be reliably transferred from one region or imagery source to another without incurring a decrease in accuracy. To address this issue, known as distribution shift, and foster the development of PV array mapping pipelines, we propose a dataset containing aerial images, segmentation masks, and installation metadata (i.e., technical characteristics).”

Nature: A large dataset of scientific text reuse in Open-Access publications

Nature: A large dataset of scientific text reuse in Open-Access publications. “We present the Webis-STEREO-21 dataset, a massive collection of Scientific Text Reuse in Open-access publications. It contains 91 million cases of reused text passages found in 4.2 million unique open-access publications. Cases range from overlap of as few as eight words to near-duplicate publications and include a variety of reuse types, ranging from boilerplate text to verbatim copying to quotations and paraphrases.”

WIRED: Public Programs Are Only as Good as Their Data

WIRED: Public Programs Are Only as Good as Their Data. “Bad data is why people in the UK have been wrongly deported and accused of being illegal immigrants, as happened during the Windrush scandal. Bad data was behind a childcare benefits scandal in the Netherlands, where benefit claimants were wrongly accused of fraud because a government algorithm had been programmed to identify people with dual nationalities as more likely to commit the crime. The reality is, when it comes to collecting and analyzing national statistics, many governments around the world are severely underresourced.”

Purdue University: GTAP Database Expanded With New Features

Purdue University: GTAP Database Expanded With New Features. “The new version of the database captures economic flows across 160 countries and regions, 141 of which represent individual countries accounting for 99% of global output and 96% of global population. The economic flows are categorized into 65 economic sectors: 20 in agriculture and food, 25 in manufacturing and 20 in services. The latest version of GTAP reflects these flows for five reference years (2004, 2007, 2011, 2014 and 2017).”

Nature: Hunting for the best bioscience software tool? Check this database

New-to-me, from Nature: Hunting for the best bioscience software tool? Check this database. “Developed by the Chan Zuckerberg Initiative (CZI), a scientific funder based in Redwood City, California, the CZ Software Mentions data set does not catalogue formal citations, but rather mentions of software in the text of scientific articles. With 67 million mentions from nearly 20 million full-text research articles, the data set — announced on 28 September last year — is the largest-ever database of scientific-software mentions, says Dario Taraborelli, a science program officer at CZI.”