Using Machine Learning to Extract Nuremberg Document Citations
By Rosa Lin, Scott Jones, and Paul Deschner
In Harvard’s Nuremberg Trials Project, being able to link to cited documents in each trial’s transcript is a key feature of site navigation.
Each document submitted into evidence by prosecution and defense lawyers is introduced and discussed in the transcript, and at each mention the site user can click to open the document and view its contents and attendant metadata.
While document references generally follow several standard patterns, deviations large and small are numerous, and correctly identifying the type of reference (is this a prosecution or a defense exhibit, for example?) can be quite tricky, often requiring contextual clues to be teased out.
Manual linkage is highly accurate, but across a corpus of 153,000 transcript pages and more than 100,000 document references it becomes infeasible to tag and classify each mention of a document by hand, whether it is a prosecution or defense trial exhibit or a source document from which those exhibits were often chosen.
Read the rest of the post here: https://lil.law.harvard.edu/blog/2019/11/12/using-machine-learning-to-extract-nuremberg-trials-transcript-document-citations/