NASA has a corpus of 50,000 scientific papers, each of which can be assigned any of 12 categories, and each paper can carry multiple labels. The lowest-frequency label occurs around 1,000 times in the dataset, while the highest occurs around 10,000 times.
Depending on the model used, we were able to assign labels with 83-85% accuracy and an F1 score above 70%, in less than a minute.
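For multi-label tagging like this, "accuracy" and "F1" are usually computed over individual (paper, label) decisions rather than whole papers. The sketch below shows one standard way to score such output; the label names and predictions are invented for illustration, and only the metric definitions are standard.

```python
# Scoring a batch of multi-label predictions (hypothetical data).

def micro_f1(true_sets, pred_sets):
    """Micro-averaged F1 over all (paper, label) predictions."""
    tp = sum(len(t & p) for t, p in zip(true_sets, pred_sets))
    fp = sum(len(p - t) for t, p in zip(true_sets, pred_sets))
    fn = sum(len(t - p) for t, p in zip(true_sets, pred_sets))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def label_accuracy(true_sets, pred_sets, labels):
    """Fraction of per-label yes/no decisions that are correct."""
    correct = total = 0
    for t, p in zip(true_sets, pred_sets):
        for label in labels:
            total += 1
            correct += (label in t) == (label in p)
    return correct / total

labels = ["astrophysics", "heliophysics", "planetary"]  # made-up labels
truth = [{"astrophysics"}, {"heliophysics", "planetary"}]
preds = [{"astrophysics"}, {"heliophysics"}]

print(micro_f1(truth, preds))                   # 0.8
print(round(label_accuracy(truth, preds, labels), 2))  # 0.83
```

A micro-averaged F1 weights frequent labels more heavily, which matters here given the 1,000-vs-10,000 label imbalance noted above.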
The NY Botanical Garden has a training dataset of ~300,000 OCR text samples from specimen labels, manually tagged with fields such as location, collector, date, and species information. They also have ~500,000 OCR text samples without tags. NYBG’s goal is to tag the fields in these untagged specimen labels.
Currently, employing 2-3 people at a time, they are able to process around 120,000 specimens a year. Their entire corpus includes 7.8 million specimens; at this rate, finishing it would take about 65 years.
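The 65-year figure follows directly from the numbers above, assuming the current ~120,000-per-year rate holds:

```python
# Backlog arithmetic for the NYBG corpus at the current tagging rate.
corpus_size = 7_800_000      # total specimens
tagged_per_year = 120_000    # current manual throughput

years_remaining = corpus_size / tagged_per_year
print(years_remaining)  # 65.0
```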
Based on a small training sample (~5,000 examples), we’re able to extract species information (order, family, genus) with 70-80% accuracy, in minutes. We are currently working on extracting geographic and collector information.
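One way to report accuracy for this kind of extraction task is field by field, comparing each extracted value against the manual tag. The records below are fabricated for illustration; only the scoring pattern is meant to carry over.

```python
# Per-field scoring of extracted specimen fields (hypothetical records).

def per_field_accuracy(gold, extracted, fields):
    """Accuracy per field, counting exact (case-insensitive) matches."""
    scores = {}
    for field in fields:
        correct = sum(
            g.get(field, "").lower() == e.get(field, "").lower()
            for g, e in zip(gold, extracted)
        )
        scores[field] = correct / len(gold)
    return scores

gold = [
    {"order": "Lamiales", "family": "Lamiaceae", "genus": "Salvia"},
    {"order": "Rosales", "family": "Rosaceae", "genus": "Rubus"},
]
extracted = [
    {"order": "Lamiales", "family": "Lamiaceae", "genus": "Salvia"},
    {"order": "Rosales", "family": "Rosaceae", "genus": "Rosa"},  # wrong genus
]

print(per_field_accuracy(gold, extracted, ["order", "family", "genus"]))
# {'order': 1.0, 'family': 1.0, 'genus': 0.5}
```

Breaking the score out per field shows where the model struggles; genus is typically harder than order or family because there are far more distinct values.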
The World Bank has nearly 30,000 projects from its storied history. Each of them has multiple associated documents, including a 50-100 page document called the Project Appraisal Document (PAD). The World Bank employs a team of 40 people to assign sector and theme codes to these projects based on reading their PADs. It takes them several months to process a few thousand.
We took 10,000 PADs as training data and tuned our model to predict 11 theme codes with 80% accuracy and a 78% F1 score. Per-label accuracy ranges from 80% to 99%. This takes us less than a minute.
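Because each project can carry several theme codes, every code gets its own one-vs-rest binary accuracy, which is why a single overall number can hide a wide 80-99% spread. A minimal sketch of that breakdown, with fabricated theme codes and predictions:

```python
# Per-theme-code accuracy for multi-label code assignment (made-up data).

def accuracy_by_label(true_sets, pred_sets, labels):
    """One-vs-rest accuracy for each label across all projects."""
    result = {}
    for label in labels:
        correct = sum(
            (label in t) == (label in p)
            for t, p in zip(true_sets, pred_sets)
        )
        result[label] = correct / len(true_sets)
    return result

codes = ["environment", "gender", "governance"]  # hypothetical theme codes
truth = [{"environment"}, {"gender", "governance"}, {"governance"}, set()]
preds = [{"environment"}, {"gender"}, {"governance"}, {"gender"}]

scores = accuracy_by_label(truth, preds, codes)
print(scores)  # {'environment': 1.0, 'gender': 0.75, 'governance': 0.75}
print(min(scores.values()), max(scores.values()))  # 0.75 1.0
```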