NASA has a corpus of 50,000 scientific papers, each of which can be assigned any of 12 categories, and each paper can carry multiple labels. The lowest-frequency label occurs around 1,000 times in the dataset, while the highest occurs around 10,000 times.
Depending on the model used, we were able to assign labels with 83-85% accuracy and an F1 score above 70%, in less than a minute.
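For multi-label tagging like this, "accuracy" and "F1" are usually computed over individual (paper, label) decisions rather than whole papers. The sketch below shows one standard way to score such output; the label names and predictions are invented for illustration, and only the metric definitions are standard.

```python
# Scoring a batch of multi-label predictions (hypothetical data).

def micro_f1(true_sets, pred_sets):
    """Micro-averaged F1 over all (paper, label) predictions."""
    tp = sum(len(t & p) for t, p in zip(true_sets, pred_sets))
    fp = sum(len(p - t) for t, p in zip(true_sets, pred_sets))
    fn = sum(len(t - p) for t, p in zip(true_sets, pred_sets))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def label_accuracy(true_sets, pred_sets, labels):
    """Fraction of per-label yes/no decisions that are correct."""
    correct = total = 0
    for t, p in zip(true_sets, pred_sets):
        for label in labels:
            total += 1
            correct += (label in t) == (label in p)
    return correct / total

labels = ["astrophysics", "heliophysics", "planetary"]  # made-up labels
truth = [{"astrophysics"}, {"heliophysics", "planetary"}]
preds = [{"astrophysics"}, {"heliophysics"}]

print(micro_f1(truth, preds))                   # 0.8
print(round(label_accuracy(truth, preds, labels), 2))  # 0.83
```

A micro-averaged F1 weights frequent labels more heavily, which matters here given the 1,000-vs-10,000 label imbalance noted above.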
The NY Botanical Garden has a training dataset of ~300,000 OCR text samples from specimen labels, manually tagged with fields such as location, collector, date, and species information. They also have ~500,000 OCR text samples without tags. NYBG’s goal is to tag the fields in these untagged specimen labels.
Currently, employing 2-3 people at a time, they are able to process around 120,000 specimens a year. Their entire corpus includes 7.8 million specimens; at this rate, finishing it would take about 65 years.
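The 65-year figure follows directly from the numbers above, assuming the current ~120,000-per-year rate holds:

```python
# Backlog arithmetic for the NYBG corpus at the current tagging rate.
corpus_size = 7_800_000      # total specimens
tagged_per_year = 120_000    # current manual throughput

years_remaining = corpus_size / tagged_per_year
print(years_remaining)  # 65.0
```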
Based on a small training sample (~5,000 examples), we’re able to extract species information (order, family, genus) with 70-80% accuracy, in minutes. We are currently working on extracting geographic and collector information.
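One way to report accuracy for this kind of extraction task is field by field, comparing each extracted value against the manual tag. The records below are fabricated for illustration; only the scoring pattern is meant to carry over.

```python
# Per-field scoring of extracted specimen fields (hypothetical records).

def per_field_accuracy(gold, extracted, fields):
    """Accuracy per field, counting exact (case-insensitive) matches."""
    scores = {}
    for field in fields:
        correct = sum(
            g.get(field, "").lower() == e.get(field, "").lower()
            for g, e in zip(gold, extracted)
        )
        scores[field] = correct / len(gold)
    return scores

gold = [
    {"order": "Lamiales", "family": "Lamiaceae", "genus": "Salvia"},
    {"order": "Rosales", "family": "Rosaceae", "genus": "Rubus"},
]
extracted = [
    {"order": "Lamiales", "family": "Lamiaceae", "genus": "Salvia"},
    {"order": "Rosales", "family": "Rosaceae", "genus": "Rosa"},  # wrong genus
]

print(per_field_accuracy(gold, extracted, ["order", "family", "genus"]))
# {'order': 1.0, 'family': 1.0, 'genus': 0.5}
```

Breaking the score out per field shows where the model struggles; genus is typically harder than order or family because there are far more distinct values.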
The World Bank has nearly 30,000 projects from its storied history. Each of them has multiple associated documents, including a 50-100 page document called the Project Appraisal Document (PAD). The World Bank employs a team of 40 people to assign sector and theme codes to these projects based on reading their PADs. It takes them several months to process a few thousand.
We took 10,000 PADs as training data and tuned our model to predict 11 theme codes with 80% accuracy and a 78% F1 score. Per-label accuracy ranges from 80% to 99%. This takes us less than a minute.
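Because each project can carry several theme codes, every code gets its own one-vs-rest binary accuracy, which is why a single overall number can hide a wide 80-99% spread. A minimal sketch of that breakdown, with fabricated theme codes and predictions:

```python
# Per-theme-code accuracy for multi-label code assignment (made-up data).

def accuracy_by_label(true_sets, pred_sets, labels):
    """One-vs-rest accuracy for each label across all projects."""
    result = {}
    for label in labels:
        correct = sum(
            (label in t) == (label in p)
            for t, p in zip(true_sets, pred_sets)
        )
        result[label] = correct / len(true_sets)
    return result

codes = ["environment", "gender", "governance"]  # hypothetical theme codes
truth = [{"environment"}, {"gender", "governance"}, {"governance"}, set()]
preds = [{"environment"}, {"gender"}, {"governance"}, {"gender"}]

scores = accuracy_by_label(truth, preds, codes)
print(scores)  # {'environment': 1.0, 'gender': 0.75, 'governance': 0.75}
print(min(scores.values()), max(scores.values()))  # 0.75 1.0
```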