An introduction to Hudson Labs co-founder Suhas Pai and his new book
The tech world is currently abuzz with talk of Large Language Models (LLMs) and Natural Language Processing (NLP). People are scrambling for voice, credibility, and money in this new landscape. And while there are plenty of credible experts, Hudson Labs punches well above its weight as a company that was innovating with LLMs long before they came into vogue.
Our CTO and his new book on LLMs
Hudson Labs' pioneering approach is made possible by our co-founder and chief technology officer, Suhas Pai.
Suhas was born in India, and moved to Europe for college, earning a Master’s degree in Computer Science from Eindhoven University of Technology in The Netherlands. He has had a long career in software and machine learning, including working as a senior software engineer at IBM for over five years in information security.
Suhas first worked in ML in 2010, but at that time, in his words, “AI was terrible,” so he returned to security before coming back to ML and AI in 2016.
Suhas then co-founded Hudson Labs, a Toronto-based startup, with Kris Bennatti in 2020 and serves as the company’s CTO. His work at Hudson Labs includes text ranking, representation learning, and productionizing LLMs. He is active in the ML community: he has chaired the Toronto Machine Learning Summit (TMLS) conference since 2021 and is the NLP lead at Aggregate Intellect (AISC). He also co-led the privacy working group at BigScience as part of the BLOOM project.
Suhas never eats at the same restaurant twice, enjoys taking very long walks, sometimes for six hours or more, and has a strong impulse to reject or ignore any recommendations made by algorithms. He’s also a recovering book addict. From 2014 to 2016, he read more than 200 books a year. He has since cut back to about 10-15 books per year.
Suhas has put many of his ideas and experiences into a new book, Designing Large Language Model Applications, which comes out next year from O’Reilly Media. Chapters 3 and 4 are currently available on O’Reilly’s platform.
What the book teaches and how Hudson Labs applies it
Suhas’s book explains how to put these powerful models to work in applications that people will actually want to use. This is especially valuable when you need an LLM not for general knowledge but for a specific area of expertise, which is what we’ve done at Hudson Labs.
Most notably, we have built techniques to adapt open-source LLMs to securities filings and financial text. Our initial approach involved collecting and processing over 65,000 securities filings. This has allowed our models to understand terms like “goodwill impairment,” a phrase not commonly used online and, therefore, only nominally understood by a general model.
A further example is the identification and filtering of “boilerplate text.” Boilerplate text is text that doesn’t provide real information, either because it has been copied (or nearly copied) from another company’s disclosure or because it is too vague or indistinct to be valuable. To the untrained eye, boilerplate language is indistinguishable from original, meaningful text that would be interesting to the reader. Our boilerplate identification model correctly classifies more than 99 percent of sentences as boilerplate or not. This means Hudson Labs users deal with less noise and can focus on the correct information.
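To make the "copied or nearly copied" idea concrete, here is a minimal sketch of one common way to spot near-duplicate sentences: comparing overlapping word trigrams with Jaccard similarity. This is an illustrative heuristic only, not the Hudson Labs boilerplate model. The function names, the 0.4 threshold, and the example sentences are all assumptions made up for this sketch, and it does not attempt to cover the "too vague" kind of boilerplate.

```python
import re


def shingles(sentence, n=3):
    """Return the set of lowercase word n-grams (shingles) in a sentence."""
    words = re.findall(r"[a-z0-9']+", sentence.lower())
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity: shared shingles divided by all distinct shingles."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def looks_copied(sentence, other_filings, threshold=0.4):
    """Flag a sentence that nearly matches any sentence from other filings.

    The threshold is a hypothetical value chosen for this example.
    """
    target = shingles(sentence)
    return any(jaccard(target, shingles(other)) >= threshold
               for other in other_filings)


# Invented example sentences, not taken from real filings.
OTHER_FILINGS = [
    "Our results may be adversely affected by general economic conditions.",
    "We face intense competition in the markets in which we operate.",
]

generic = "Our results could be adversely affected by general economic conditions."
specific = "We recorded a goodwill impairment charge on our European retail segment."

print(looks_copied(generic, OTHER_FILINGS))   # near-copy of the first sentence
print(looks_copied(specific, OTHER_FILINGS))  # shares no trigrams with either
```

A surface-overlap check like this only catches text that is literally reused. Judging whether a sentence is too vague to be useful requires a learned model of the kind described above.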
Another approach discussed in the book that we have applied at Hudson Labs is what’s known as “sample-efficient fine-tuning.” Typically, machine learning (ML) models need lots of labeled training data to produce effective results. “Labeling” data is the process of tagging raw data with labels that add context and information to help the model learn a particular domain.
A highly specialized and technical area like securities filings and disclosures would normally need a lot of labeled training data to develop an effective model. But at Hudson Labs, we have developed sample-efficient fine-tuning algorithms for our core product, allowing us to extract 328 different types of red flags in securities filings with just 1,625 labeled sentences.
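A quick back-of-the-envelope calculation shows how little labeled data that is per category. The sketch below uses only the two figures quoted above. It assumes, purely for illustration, that each labeled sentence carries exactly one red-flag type, which the text does not state.

```python
# Figures quoted in the text above.
RED_FLAG_TYPES = 328
LABELED_SENTENCES = 1_625

# Assumption for illustration: one red-flag type per labeled sentence.
avg_per_type = LABELED_SENTENCES / RED_FLAG_TYPES

print(f"{avg_per_type:.2f} labeled sentences per red-flag type on average")
# → 4.95 labeled sentences per red-flag type on average
```

Under that assumption, the model sees roughly five examples per red-flag type on average.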
These are just a few of the principles in Suhas’s book that we’ve applied at Hudson Labs.
New LLM features at Hudson Labs:
Our summaries reliably highlight the most relevant information thanks to our boilerplate filters. Unlike the output of popular generative models, our summaries are 100% accurate and comprehensive, so you don’t have to second-guess them. We also summarize every analyst question so that nothing is left out.
Our investment memos discuss revenue drivers, business segments, customers, competitors, supply chains, and more in an understandable, structured format. Our memos read like a human wrote them. Unlike the output of popular generative tools, our memos are linked directly to source documents and maintain factual accuracy.