OCR, Dirty OCR, and Digitized Collections

A talk from Ryan Cordell: ‘Q i-jtb the Raven’: Taking Dirty OCR Seriously. “I would assert that the digitized edition of the November 28, 1849 Lewisburg Chronicle, and the West Branch Farmer comprises at least six parts: an archival TIFF, a JPG, a PDF, an OCR-derived text file, an XML file, and the web interface. The image files might be classed as a species of facsimile edition, while the OCR-derived text and XML files are new editions; all of these come together in a kind of digital variorum. Bibliographic clues are scattered among the artifact’s parts, not all of which are available through CA’s public interface. The details gleaned from these files, however, are only one part of a full bibliographic account, which should also concern itself with the institutional, financial, social, and governmental structures that lead one historical textual object to be digitized, while another is not.”

I will be thinking about this piece for days. Professor Cordell goes past the mechanics of digitizing to the whole context around a digitized collection, and then past that to consider how digitized collections come to be assessed and created in the first place.

