Inside Internet Archive: 10PB+ of storage in a church… oh, and a little fight to preserve truth (The Register)

The Register: Inside Internet Archive: 10PB+ of storage in a church… oh, and a little fight to preserve truth. “At the Internet Archive’s headquarters in San Francisco, California, on Wednesday, technologists, educators, archivists, and other fact-oriented folks gathered to discuss how they and the like-minded can save news from the memory hole – a conceit conjured by George Orwell to describe a political mechanism for altering the truth.”

OCLC: OCLC and Internet Archive collaborate to expand library access to digital collections

OCLC: OCLC and Internet Archive collaborate to expand library access to digital collections. “OCLC and Internet Archive are working together to make the Archive’s collection of 2.5 million digitized books easier to find and access online and through local libraries. OCLC will process metadata from the Internet Archive for its digital collection, matching to existing records in WorldCat, the world’s most comprehensive database of information about library collections, or adding a new record if one does not exist. The WorldCat record will include a link leading back to the Archive.org record. From there, searchers can examine or potentially borrow the related digital item.” This is TERRIFIC news.

BPL: Boston Public Library Transfers Sound Archives Collection to Internet Archive for Digitization, Preservation, and Public Access

A tip o’ the nib to Penny C., who tipped me to this great announcement from Boston Public Library: Boston Public Library Transfers Sound Archives Collection to Internet Archive for Digitization, Preservation, and Public Access. “Boston Public Library has approved the transfer of significant holdings from its Sound Archives Collection to the Internet Archive, a nonprofit digital library offering permanent access to historical collections for researchers, historians, and the general public. This project will catalog and digitize a major component of the BPL’s Sound Archives Collection, which will be available where rights allow to all for free online upon the project’s completion. The BPL Sound Archives Collection contains the Library’s collection of non-circulating commercial sound recordings in a variety of historical formats, including 78 rpms and LPs. The collection includes American popular music of many genres, including classical, pop, rock, jazz, and opera from the early 1900s on the 78 rpms and through the 1980s on the LPs. The collection has remained in its current state for several decades, in storage, uncataloged and inaccessible to the public.”

Spotted on Reddit: Archivarix

My IFTTT-based Reddit-monitoring tool spotted this. Unfortunately I don’t know how recent it is, but it’s interesting: an online tool called Archivarix. It’s designed for downloading Web sites from the Wayback Machine; the first 200 files are free, additional files are $5 per thousand, or a half-cent per file. From the tutorial page: “Archivarix provides complete restructuring and arrangement of the content of websites that are publicly shared in the Internet Archive. Archivarix proceeds and arranges data in such a way that all the addresses of web pages become available at previous addresses, including also the dynamic ones. The pages code can be fully processed to be brought into full conformity with all applicable standards; all missing or unclosed tags will be fixed. All counters, trackers, suspicious third-party frames and advertisements are cleaned out; CSS styles and JavaScripts are compressed if needed. Images are optimized and reduced in size without loss of quality, backlinks are cleared, 404 errors are repaired through substituting the necessary files. All this and more you can get in a single ZIP file, the content of which is adaptable to most stringent hosting requirements.”
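For a rough sense of what that pricing means for a given site, here is a minimal sketch of the arithmetic. It assumes the 200 free files are simply subtracted from the total and that the remainder is billed pro rata at a half-cent per file; Archivarix's actual billing rules may differ, and the function name is made up for illustration.

```python
from decimal import Decimal

FREE_FILES = 200                      # first 200 files are free (from the pricing above)
PRICE_PER_FILE = Decimal("0.005")     # $5 per thousand files = half a cent per file


def estimated_cost(file_count: int) -> Decimal:
    """Estimate the cost in dollars for restoring `file_count` files.

    Assumption: billing is pro rata per file after the free allowance.
    """
    if file_count < 0:
        raise ValueError("file_count must be non-negative")
    billable = max(0, file_count - FREE_FILES)
    return billable * PRICE_PER_FILE


if __name__ == "__main__":
    for n in (150, 1200, 10200):
        print(f"{n} files: ${estimated_cost(n):.2f}")
```

Under those assumptions, a 1,200-file site would have 1,000 billable files.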

Internet Archive: Books from 1923 to 1941 Now Liberated!

Internet Archive: Books from 1923 to 1941 Now Liberated! “The Internet Archive is now leveraging a little known, and perhaps never used, provision of US copyright law, Section 108h, which allows libraries to scan and make available materials published 1923 to 1941 if they are not being actively sold. Elizabeth Townsend Gard, a copyright scholar at Tulane University calls this “Library Public Domain.” She and her students helped bring the first scanned books of this era available online in a collection named for the author of the bill making this necessary: The Sonny Bono Memorial Collection. Thousands more books will be added in the near future as we automate. We hope this will encourage libraries that have been reticent to scan beyond 1923 to start mass scanning their books and other works, at least up to 1942.”

Simon Willison: Recovering missing content from the Internet Archive

Simon Willison: Recovering missing content from the Internet Archive. “When I restored my blog last weekend I used the most recent SQL backup of my blog’s database from back in 2010. I thought it had all of my content from before I started my 7 year hiatus, but in watching the 404 logs I started seeing the occasional hit to something that really should have been there but wasn’t. Turns out the SQL backup I was working from was missing some content. Thank goodness then for the Wayback Machine at the Internet Archive! I tried some of the missing URLs there and found they had been captured and preserved. But how to get them back?” Neat!
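The excerpt stops before explaining the approach, so this is not Simon Willison's method. It is only a minimal sketch of one way to check whether a missing URL has a capture, using the Wayback Machine's public availability endpoint (`https://archive.org/wayback/available`). The helper names are hypothetical, and the response handling assumes the JSON shape that endpoint normally returns (`archived_snapshots` → `closest` → `url`).

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_url(page_url: str, timestamp: Optional[str] = None) -> str:
    """Build the query URL; `timestamp` (e.g. "2010") asks for the capture closest to that date."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(payload: dict) -> Optional[str]:
    """Return the closest capture's URL from a decoded response, or None if there is none."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def find_snapshot(page_url: str, timestamp: Optional[str] = None) -> Optional[str]:
    """Query the endpoint over the network and return the closest capture URL, if any."""
    with urlopen(availability_url(page_url, timestamp), timeout=30) as resp:
        return closest_snapshot(json.load(resp))


if __name__ == "__main__":
    # Hypothetical list of paths seen in a 404 log.
    for missing in ["example.com/2008/old-post/"]:
        print(missing, "->", find_snapshot(missing, "2010"))
```

Looping over the URLs from a 404 log and fetching each returned capture would be the obvious next step; cleaning the archived HTML back into database rows is a separate job.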