Undark: In ToxicDocs.org, a Treasure Trove of Industry Secrets

Undark: In ToxicDocs.org, a Treasure Trove of Industry Secrets. “The site officially launched last Friday with an initial 20 million pages of material focused on six toxic substances: asbestos, benzene, lead, polychlorinated biphenyl (PCB), polyvinyl chloride, and silica, and millions more pages are coming.” The whole article is worth a read, in particular for the problems that had to be solved to process five million pages of documents with OCR. “A recent batch of about 1.5 million pages only required about three days to convert to OCR.” Yow!