Traditionally, computers have been programmed with step-by-step instructions to solve tasks. Certain skills like processing images or text are too complex to be described by a set of rules. Language in particular, is highly ambiguous, contextual, and contains too many exceptions.
Machine learning is a computing paradigm where computers learn by example. Machine learning involves providing input-output pairs so that the machine learns how to solve the task by understanding the relationship between the input and output.
Natural Language Processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence. NLP deals with processing, understanding, and generating text. In recent years, this field has undergone rapid developments and widespread adoption due to its embrace of machine-learning based large language models (LLMs).
Modern language models are characterised by their adherence to the ‘distributional hypothesis’. The distributional hypothesis can be best described by the adage “You shall know a word by the company it keeps”. The meaning of each word can be inferred by the meaning of the words that surround it, in context. This has been operationalized using the “self-attention” framework. Self-attention means each word “attends” to all other words in the sentence to generate its own representation - a vector (list of numbers) that encapsulates meaning.
Read "Designing Large Language Model Applications" written by Bedrock AI founder, Suhas Pai, and published by O'Reilly Media.
Modern success stories in NLP include grammar correction tools like Grammarly, translation software like Google Translate, and the auto-complete used by Gmail, Outlook and other email providers. NLP is also increasingly used in customer-service chatbots and in search engines. State-of-the-art models include Open AI’s GPT-3, Meta’s OPT, and the open-source BigScience model BLOOM, of which Bedrock AI’s CTO is a project co-chair.
Limitations of Language Models
Like all automation processes, language models still suffer from limitations. Language models are limited by the length of the input they can process at a time (typically less than 3,000 words, thus limiting the contextual information it has access to while making a decision). Financial documents like annual reports usually run into 100s of pages making financial text processing a particularly challenging field.
Language models are computationally prohibitive to train from scratch. The current approach in the field is to use open-source language models trained and published by Google, Meta, Microsoft, and other big-tech companies, and adapt or ‘fine-tune’ them according to the individual application’s needs. The base model has learned more general properties of language like grammar and the subsequent fine-tuning phase leverages this knowledge to help the model learn more fine-grained tasks.
Most open-source language models are primarily trained on web text, and struggle to adapt to text that has different characteristics from web text. Corporate disclosure is linguistically and semantically very different from web text. Market announcements and financial reports are characterised by repetitive boilerplate, financial jargon and legalese. The average sentence in an annual report of a public company is much longer than the average sentence on the web.
The Hudson Labs advantage - financial language modeling
At Hudson Labs, we went through a year-long financial NLP research phase before we launched our first product in April 2021. Our research team continues to innovate and bring forward new advances in NLP to improve the quality of our products.
We have innovated several techniques for effectively processing long-form financial text in-house that help us achieve the high quality of our products. A summary of our unique capabilities is listed below.
We have developed techniques to adapt open-source language models to the domain of securities filings and complex financial text. The initial domain adaptation process involved the collection and processing of over 1.3 terabytes of financial data. This process enables our models to understand terms like ‘goodwill impairment’, a phrase not commonly seen on the web.
Boilerplate sentences are linguistically very similar to interesting text, and are visually indistinguishable even to human non-domain experts. Our in-house boilerplate identification model can correctly classify >99 percent of sentences as being boilerplate or not. Filtering out boilerplate reduces noise and improves the quality of our input data.
A fundamental truism of data-oriented applications is the adage ‘Garbage in- Garbage out’. We have extensive processes to ensure we feed high-quality inputs to our models. Sentences are represented in vector form (a list of numbers that encode meaning, syntax and other relevant information about a sentence). The quality of the input vectors determines the extent to which a language model can be helpful in solving tasks. Our algorithms ensure the generated vectors are more amenable to modelling.
Modern machine learning techniques are extremely data-hungry. They need a lot of labelled training examples to be effective. Labelled training data is expensive to acquire, especially if the labelling requires domain expertise, as is true in the case of highly-specialised domains like corporate disclosure. On average, it takes more than a minute to annotate each example.
In the past year, adopting a new paradigm called “few-shot learning” helped alleviate this problem. High performance models can now learn to solve select tasks using just a few examples. The few-shot learning algorithms that we developed for our core product allows us to extract 328 different types of red flag types with just 1,625 labelled sentences.
While our red flag extraction models solve the needle-in-a-haystack problem of finding interesting information from disclosure text, our ranking models rank-order them in terms of relative and absolute importance. Our ranking algorithms are currently able to make fine-grained distinctions even within a particular red flag category and produce an importance score for each identified red flag. The ranking model also takes into consideration their freshness, the time period the sentence is referring to, and so on.
Training Set Selection
Current language models are susceptible to shortcut learning - a phenomenon where spurious characteristics of the training data are used as cues for making decisions. Consider an example where the model spuriously used the word ‘banana’ as a cue for predicting if a sentence were an impairment indicator, solely because the example sentences were disproportionately sourced from a banana producer’s corporate filings.
We use our in-house algorithms for selecting training sets that reduce the chances of shortcut learning. Our algorithms select training examples that give the best bang-for-the-buck in terms of the number of real-world examples that they could help the model learn to classify correctly.