BY ROSA LIN, FOUNDER OF TOLSTOY
Artificial intelligence has been all the rage these past few years. It’s cropped up in every corner of computerized life, from the recommendations on your Amazon account, to Facebook’s face recognition in photos, to self-driving cars. Despite its ubiquity, the actual design and application of machine learning has largely been the domain of computer scientists and expert programmers – all but inaccessible to technical professionals without the requisite background, let alone the average layperson. However, tools are now being built to change that and open up the immense powers of AI to a much larger swathe of users. In this post, we’ll detail some applications of machine learning relevant to professionals without a computer science background, and show how you can use Tolstoy’s tools to carry out your own projects.
Artificial intelligence, and especially text analysis, is not a new concept – it’s been around since the 1950s. What is different in the last decade is a series of breakthroughs in a method of AI called machine learning: the ability of an algorithm to pick up nuanced patterns from many, many examples, and apply those patterns to new examples. This technique – described by some as statistics on steroids – has been so successful that it is the basis for how machines can recognize a face in a crowd, drive cars with human skill, and of course, as is relevant here, analyze text at an unprecedented level.
Researchers, academics, and industry professionals are now discovering ways to apply machine learning in their work to bring out novel insights and incredible efficiency gains – we’re talking tens or hundreds of thousands of times faster. Here are a few examples, to give you a sense of possible applications.
Organizations often have to deal with large numbers of documents that must be tagged or categorized in some form. Traditionally, and still today, this is the work of a team of humans reading and manually tagging the documents. But large organizations can have massive numbers of documents – for example, the World Bank has tens of thousands, while NASA has millions of papers. The task can take anywhere from several months to years to complete – and sometimes, depending on the number of tags, is barely feasible. NASA has 25,000 official tags – the size of the average person’s vocabulary. Additionally, the accuracy of people who have to tag documents day in and day out is often far lower than ideal – around 50-60% accuracy is typical across industries. This is a natural application for machine learning. With machine learning algorithms, one person can categorize tens of thousands of papers in less than a minute, and with better than human accuracy. In fact, Tolstoy’s tools were used to do this at the World Bank and NASA.
To try it for yourself on a sample of 50,000 scientific papers from NASA, click here and select “Houston, we have a lot of documents”.
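For readers curious about what happens under the hood, the tagging task described above can be sketched in a few lines of Python. This is only an illustration, not Tolstoy’s actual method: it assumes a simple, well-known technique called naive Bayes, and the paper titles and the two tags ("planetary" and "propulsion") are invented for the example. The class name NaiveBayesTagger and the tokenize helper are likewise hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a document and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesTagger:
    """Toy multinomial naive Bayes tagger with add-one smoothing.

    Illustrative only: a stand-in technique, not the model used by any
    particular product.
    """

    def __init__(self):
        self.tag_counts = Counter()              # documents seen per tag
        self.word_counts = defaultdict(Counter)  # word frequencies per tag
        self.vocabulary = set()                  # every word seen in training

    def train(self, labeled_docs):
        """Learn word patterns from (text, tag) example pairs."""
        for text, tag in labeled_docs:
            self.tag_counts[tag] += 1
            for word in tokenize(text):
                self.word_counts[tag][word] += 1
                self.vocabulary.add(word)

    def predict(self, text):
        """Return the tag whose learned word patterns best match the text."""
        total_docs = sum(self.tag_counts.values())
        vocab_size = len(self.vocabulary)
        words = tokenize(text)
        best_tag, best_score = None, -math.inf
        for tag, doc_count in self.tag_counts.items():
            # Start from how common the tag is, then add evidence per word.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[tag].values())
            for word in words:
                count = self.word_counts[tag][word]
                # Add-one smoothing keeps unseen words from zeroing the score.
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_tag, best_score = tag, score
        return best_tag


# Invented training examples: a handful of titles that a person tagged by hand.
EXAMPLES = [
    ("Rover wheel traction tests on simulated Martian soil", "planetary"),
    ("Orbital imaging of dust storms on Mars", "planetary"),
    ("Crater dating from lunar surface samples", "planetary"),
    ("Turbine blade fatigue in jet engine compressors", "propulsion"),
    ("Combustion stability in liquid rocket engine injectors", "propulsion"),
    ("Fuel efficiency gains from engine nozzle redesign", "propulsion"),
]

tagger = NaiveBayesTagger()
tagger.train(EXAMPLES)

# Tag a title the model has never seen before.
prediction = tagger.predict("Dust accumulation on Mars rover solar panels")
print(prediction)  # prints "planetary"
```

The pattern is the same one described earlier: the algorithm learns from examples that were tagged by hand, then applies what it learned to documents it has never seen. A real project would use far more examples per tag and, most likely, a more capable model.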
DECADES TO A DAY
Other organizations struggle to extract meaningful information from text data – say, the who, what, when, and where of a narrative – when they have outsized quantities of it. For example, the U.S. Department of State receives thousands of cables per week from staff and contacts located around the globe, with highly contextual and descriptive information. Museums deal with tens of thousands of handwritten field notes from scientists on species and habitat data. This information – unstructured and untagged – must be combed through to track names, locations, and changes over time. Usually, people have to pore over the text manually. However, with advancements in machine learning, it is possible to extract entities automatically. One person can now either use pre-trained models to find standard entities – say, names – in a text, or train a model themselves to find specific entities – say, bird species names. And instead of taking decades – as is currently projected given the size of museum collections – it’ll take a day.
PROPAGANDA AND COMMENTS
There are many other applications for machine learning in different fields. At the American Enterprise Institute, a Washington, DC-based think tank, research economists used machine learning to track Chinese propaganda across the last seven decades of the People’s Daily newspaper (Chinese newspapers, unlike American ones, arrange their pages by importance rather than topic – thus one can track which topics received higher billing across the decades). The New York Times uses machine learning to sort through the thousands of comments it gets on news stories every day (ever wonder how they have time to pick out top comments, or weed out trolls? – that’s how). Applications for machine learning can be found in widely differing professions, and are really only limited by one’s imagination.
To sort out 2,000 public comments submitted to the U.S. Environmental Protection Agency in response to a proposed rule, click here and select “Public 2 Cents”.
We are in the vanguard of a whole new way of conducting research and analyses – much more powerful and extensive than the methods of the past. With AI and natural language processing, textual data that would’ve taken a small army months or years to process can now take a single person a few minutes – an improvement in efficiency of not just one order of magnitude, but several. We at Tolstoy believe anyone with interesting datasets and information should have access to these new abilities, regardless of background or profession. And we hope that if you are such a person, you’ll explore and see it for yourself.