Parsing Scanned Images to Text: How Digitized is Your Organization?

November 4, 2020
Leave a comment

The world is digitizing its records. Remote work during the pandemic has only accelerated this trend, with workplaces realizing that the ability to physically access records is not a given. As we move into a new era of hybrid and remote work, the drive to convert hard-copy records to online versions — where staff can easily access, search, and analyze the data — is picking up steam.

Paper to pixels

Many workplaces, especially in heavily regulated industries — government, energy/utilities, accounting, law — keep their records for years. Currently, much of that is in paper form, with documents piling up over decades in cabinets, rooms, entire floors. Companies have to keep these copies long-term, whether mandated by law or out of pure necessity. But not only are hard-copy forms liable to be lost, damaged, or even forgotten — it’s “hellish” to retrieve them, as described by one U.S. EPA official who deals with public FOIA requests. People have to walk through rooms, digging through boxes and cabinets, physically reading forms to find what they want.

Data entry takes up 75% of the time and resources of a digitization project

Digitization not a trivial task, though. Imaging the records is itself a daunting task when faced with millions of records. But that pales in comparison to transcribing the images. Data entry takes up 75% of the time and resources of a digitization project, which often runs into decades and several millions of dollars. But it’s the key step that will actually unlock the data.

Image for post — Historically, humans are needed for data entry from unstructured images — but that is changing. Tolstoy has tagged this image without human intervention.

The power and limits of OCR

Optical character recognition, or OCR, is the main answer to these data entry challenges. It’s the technology that recognizes and extracts text from raw images. Google famously digitized and OCR’d millions of books from libraries around the world in the 2000’s. Their technology is still used today, embedded in many OCR services.

However, there’s an important catch. Google’s OCR, and that of many other providers out there, is built to read the text of a book — single column, top-down printed text. Introduce columns, fields, boxes, images, handwriting, and it struggles to comprehend it without further work. Yet complex, unstructured images are the bulk of the text data in the world — the iceberg’s vast submerged heft.

Complex, unstructured images are the bulk of the text data in the world — the iceberg’s vast submerged heft.

And it’s a frontier of interesting questions. How do you successfully transcribe century-old records with faded text? How do you search records for fields that variably occur anywhere in the image? What if a record included cursive handwriting — in another language?

AI-assisted digitization

These are questions Tolstoy has plumbed in projects for libraries, newspapers, and the private sector. We’ve written about them here, here, and here. What we’ve learned is that with the right AI models and training, it’s not only possible — it’s the best path forward. AI reduces the time and resources organizations need for digitization projects by orders of magnitude.

For example, a prominent university needed 3–4 staff to spend several months processing a historical document set. It involved indexing thousands of pages of legal court transcripts. Normally, someone reads through the transcripts — human dialogue — to find and tag document references. After they got in touch with us, we developed and refined the right AI model to parse the 150,000+ pages of legal transcripts in an afternoon.

In another project, a venerable newspaper was struggling to transcribe their historical broadsheets in time for their 130th Anniversary edition. Normally, someone would take months manually transcribing these papers. None of the OCR software the newspaper tried worked well, or at all. We wrote a custom OCR parser and returned all transcribed images within a week.

The future is digital

Our goal is to reduce both the cost and time barriers to entry for organizations to truly digitize their archives. With true digitization, an organization has a much more complete and powerful view of their recorded projects, assets, and clients. They are able to instantly access, share, and compare records; analyze specific aspects of the contents; track their holdings; find what they seek.

What is the state of digitization at your organization? And what thorny questions do you have about your text records that you’ll like us to tackle next?

Parsing Scanned Images to Text: How Digitized is Your Organization?

Paper to pixels

The power and limits of OCR

AI-assisted digitization

The future is digital

Leave a Reply