Case Studies

Multi-label Classification

NASA has a corpus of 50,000 scientific papers, with 12 possible categories for each paper. The lowest-frequency label has around 1,000 occurrences in the dataset, while the highest has around 10,000. Each paper can have multiple labels.

We were able to assign labels with 83-85% accuracy, depending on the model used, and an F1 score above 70%, in less than a minute.
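The case study doesn't name the model, but the setup is standard multi-label text classification: one binary decision per category for each paper. Here is a minimal sketch with scikit-learn; the papers, labels, and classifier choice below are illustrative, not the pipeline we shipped.

```python
# A minimal multi-label classification sketch (scikit-learn).
# The papers, labels, and model choice are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

papers = [
    "Thermal analysis of spacecraft re-entry shielding materials.",
    "Spectroscopy of exoplanet atmospheres from transit photometry.",
]
labels = [["materials", "thermodynamics"], ["astrophysics"]]

# Binarize label sets: each paper maps to a 0/1 vector over all categories.
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(labels)

# TF-IDF features + one binary classifier per label (one-vs-rest).
X = TfidfVectorizer(max_features=50_000).fit_transform(papers)
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

# predict() returns a 0/1 matrix; inverse_transform recovers label names.
print(mlb.inverse_transform(clf.predict(X)))
```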

A prestigious university digitized a historical collection of one million documents from the Nuremberg War Crimes Trial. The university partnered with Tolstoy to help index the documents by tagging all document references in its 150,000+ pages of court transcripts.

This involved two parts: 1) identifying and extracting all document mentions from the dialogue, and 2) tagging each mention as a prosecution, defense, or evidence file.

Previously, the university employed staff to do this manually, as the work required reading complex human dialogue and tagging nuanced mentions.

We were able to tag document references with 99%+ accuracy and 92-95% recall (the share of all true mentions captured). This saved the university several months of staff work.
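As a toy illustration of the two-part task, extraction and tagging might look like the sketch below. The document-numbering prefixes and rules are made up for illustration; the transcripts' real numbering conventions, and the model we actually used, are more involved than this.

```python
import re

# Toy extraction of document mentions from transcript dialogue.
# The prefixes and tagging rules below are hypothetical.
MENTION = re.compile(r"\b(?:Document|Exhibit)\s+(USA|D|PS)-(\d+)\b")

PREFIX_TO_TAG = {
    "USA": "prosecution",  # hypothetical prosecution-exhibit prefix
    "D": "defense",        # hypothetical defense-exhibit prefix
    "PS": "evidence",      # hypothetical evidence-file prefix
}

transcript = (
    "THE PRESIDENT: The Tribunal will consider Document PS-1014, "
    "Exhibit USA-23, and Exhibit D-4 submitted by defense counsel."
)

# Part 1: extract every mention; Part 2: tag it by its prefix.
for match in MENTION.finditer(transcript):
    print(match.group(0), "->", PREFIX_TO_TAG[match.group(1)])
```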

Optical Character Recognition

The Wall Street Journal celebrated their 130th anniversary in 2019. As part of the celebration, they wanted to digitize and reprint articles from their entire history in a special edition.

Since many of the articles were very old, with poor, spotty scans, traditional OCR software picked up the text poorly or not at all. Furthermore, many of the articles contained multiple columns and images, which off-the-shelf OCR also struggled with.

We wrote a custom OCR script that parsed their newspaper broadsheets and clippings with 95%+ accuracy. This saved them several weeks of manual transcription.
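As a rough sketch of the kind of pipeline involved (our actual implementation is described in the article linked below), preprocessing plus OCR for degraded multi-column scans might look like this, assuming OpenCV and Tesseract via pytesseract; the file name is hypothetical.

```python
import cv2
import pytesseract

def ocr_broadsheet(path: str) -> str:
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    # Denoise spotty scans, then binarize with an adaptive threshold
    # so uneven lighting and faded print survive.
    img = cv2.fastNlMeansDenoising(img, h=30)
    img = cv2.adaptiveThreshold(
        img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY, blockSize=31, C=15,
    )
    # --psm 1: full automatic page segmentation, which handles
    # multi-column newspaper layouts better than the default.
    return pytesseract.image_to_string(img, config="--psm 1")

print(ocr_broadsheet("wsj_1889_front_page.png"))  # hypothetical file
```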

See the final published Special Edition here.

Here’s an article about how we did it.

Entity Extraction

The NY-based herbarium has a training dataset of ~300,000 OCR text samples from specimen labels, manually tagged with fields such as location, collector, date, and species information. They also have ~500,000 OCR text samples without tags; their goal is to tag the fields in these untagged specimen labels.

Currently, they are able to process around 120K labels a year, employing 2-3 people at a time. Their entire corpus includes 7.8 million specimens; at this rate, the work would take about 65 years (7.8 million ÷ 120K).

Based on a small training sample (~5,000 examples), we're able to extract species information (family, genus, species) with 99%+ accuracy for printed samples and ~80% accuracy for handwritten samples, in minutes.
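Framed as named-entity recognition, the task can be sketched as follows. The entity labels, training example, and use of spaCy are illustrative; we're not describing our actual stack here.

```python
import spacy
from spacy.training import Example

# One training example in spaCy's (text, annotations) format, with
# character offsets for each tagged field. Labels are illustrative.
TRAIN = [
    ("Quercus alba L. Collected by J. Smith, 12 May 1904, Bronx, New York.",
     {"entities": [(0, 12, "SPECIES"), (29, 37, "COLLECTOR"),
                   (39, 50, "DATE"), (52, 67, "LOCATION")]}),
]

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
for _, ann in TRAIN:
    for _, _, label in ann["entities"]:
        ner.add_label(label)

optimizer = nlp.initialize()
for _ in range(30):  # a real run would use ~5,000 examples in shuffled batches
    for text, ann in TRAIN:
        nlp.update([Example.from_dict(nlp.make_doc(text), ann)], sgd=optimizer)

# Inference on an unseen label transcription.
doc = nlp("Acer saccharum Marsh. Coll. M. Jones, 3 June 1921, Ithaca, NY.")
print([(ent.label_, ent.text) for ent in doc.ents])
```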

See a demo of our tool here.

Multi-label Classification

The World Bank has nearly 30,000 projects from its storied history. Each of them has multiple associated documents, including a 50-100 page document called the Project Appraisal Document (PAD). The World Bank employs a team of 40 people to assign sector and theme codes to these projects based on reading their PADs. It takes them several months to process a few thousand.

We took 10,000 PADs as training data and tuned our model to predict 11 theme codes with 80% accuracy and a 78% F1 score. Per-label accuracy ranges from 80% to 99%. By comparison, the Bank team's own accuracy, measured against expert-labelled data, is 55%.

This takes our model less than a minute, versus four months of manual work with their contractors.
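For reference, the per-label accuracy and aggregate F1 figures above fall straight out of a multi-label prediction matrix. A small sketch with made-up values (not the Bank's data, and only 3 of the 11 theme codes shown):

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy 0/1 matrices: rows = projects, columns = theme codes
# (3 shown here; the real task had 11). Values are made up.
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
y_pred = np.array([[1, 0, 1], [0, 1, 1], [1, 0, 0], [0, 0, 1]])

# Per-label accuracy: fraction of projects where each code is correct.
print("per-label accuracy:", (y_true == y_pred).mean(axis=0))

# Micro-averaged F1 over all label decisions (one common choice;
# the case study does not say which averaging was reported).
print("micro F1:", f1_score(y_true, y_pred, average="micro"))
```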