GDELT Swallows and Digests 3.5 Million Books

GDELT has swallowed and digested 3.5 million books from the Internet Archive and HathiTrust. What’s bigger than big data?

“Today we are enormously excited to announce that more than 3.5 million digitized books stretching back two centuries, encompassing the complete English-language public domain collections of the Internet Archive (1.3 million volumes) and HathiTrust (2.2 million volumes), have been processed using the GDELT Global Knowledge Graph and are now available in Google BigQuery. More than a billion pages stretching back 215 years have been examined to compile a list of all people, organizations, and other names; fulltext geocoding renders them fully mappable; and more than 4,500 emotions and themes have been compiled. All of this computed metadata is combined with all available book-level metadata, including title, author, publisher, and subject tags as provided by the contributing libraries. Even more excitingly, the complete fulltext of all Internet Archive books published 1800–1922 is included to allow you to perform your own near-realtime analyses. All of this is housed in Google BigQuery, making it possible to perform sophisticated analyses across 122 years of history in just seconds. A single line of SQL can execute even the most complex regular expression or a complete JavaScript algorithm over nearly half a terabyte of fulltext in just 11 seconds and combine it with all of the extracted data above. Track emotions or themes over time or map the geography of the world as seen through books – the sky is the limit!”
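To give a flavor of what that “single line of SQL” might look like, here is a minimal sketch in BigQuery standard SQL that runs a regular expression over the fulltext of one year’s volumes. The dataset path (gdelt-bq.internetarchivebooks), the per-year table name (1905), and the FullText column are assumptions based on GDELT’s public BigQuery naming conventions, not a confirmed schema:

    -- Minimal sketch; table and column names are assumptions, not a confirmed schema.
    -- Counts 1905 Internet Archive volumes whose fulltext mentions famine or drought.
    SELECT COUNT(*) AS matching_volumes
    FROM `gdelt-bq.internetarchivebooks.1905`
    WHERE REGEXP_CONTAINS(FullText, r'(?i)\b(famine|drought)\b');

Because BigQuery scans the entire column in parallel across its storage nodes rather than relying on indexes, a regular expression over hundreds of gigabytes of fulltext can finish in seconds, which is what makes the half-terabyte-in-11-seconds claim plausible.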
