Refinitiv Labs have pre-trained a natural language processing (NLP) model with financial and business news so that it is able to understand the nuances of financial services terminology. Data practitioners can expect greater accuracy from this model than a similar model trained with a generic corpus.
- A growing area of artificial intelligence (AI), NLP can be used to filter data to enable companies in the financial services industry to find insights and to create additional value.
- NLP is benefitting from algorithmic improvements and open-source resources, such as Google’s language model, BERT, which can provide users with a competitive advantage when analysing financial content.
- Refinitiv Labs have extended BERT’s training through the Reuters News Archive. As a result, BERT has attained a greater understanding of the nuances of financial services terminology, improving the accuracy of its predictions.
Companies in the financial services industry face a common problem: a firehose of unstructured data that includes research reports, company filings and transcripts of quarterly earnings calls.
Such volumes of unstructured content are under-used because they are time-consuming to analyse, but a branch of AI called natural language processing (NLP) offers opportunities to uncover meaningful insights and generate additional value by informing decision-making processes.
NLP helps machines make sense of language
An NLP model extracts meaning from text, so that software applications and services effectively “understand” human language. Under the hood, an NLP model breaks language into words (tokens) and notes the relationships and context between those tokens.
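As a rough illustration of the idea of tokens and their context, the minimal Python sketch below splits text into tokens and counts which tokens appear near each other. It is a deliberate simplification and not how BERT works internally: BERT uses sub-word tokens and learns contextual relationships during training rather than counting neighbours. The function names `tokenize` and `context_pairs` and the window size are assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into word and punctuation tokens."""
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?|[^\sa-z0-9]", text.lower())


def context_pairs(tokens, window=2):
    """Count ordered token pairs that occur within `window` positions of each other."""
    counts = Counter()
    for i, left in enumerate(tokens):
        # Look only at the next `window` tokens to the right of position i.
        for right in tokens[i + 1 : i + 1 + window]:
            counts[(left, right)] += 1
    return counts


tokens = tokenize("The bank raised rates. The river bank flooded.")
pairs = context_pairs(tokens)

print(tokens)
print(pairs[("bank", "raised")], pairs[("river", "bank")])
```

The word "bank" appears next to "raised" in one sentence and "river" in the other, which is the kind of contextual signal a language model relies on to tell the two meanings apart.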
NLP is a growing area of AI, in part assisted by rapid growth in infrastructure, such as computing power and data handling capacity. In addition, there have been several key algorithmic improvements and a proliferation of open-source libraries and models, such as the BERT language model released by Google in 2018.
BERT was pre-trained on 3.3 billion words from general-domain corpora (Wikipedia and the open BookCorpus dataset) and has a broad understanding of the English language.
While BERT can be used directly in downstream tasks – such as classification, question answering and sentiment analysis – the team at Refinitiv Labs spotted an opportunity to create a domain-specific variant of BERT and give their clients a competitive advantage when analysing financial content.
“The field is highly competitive, so we see a model that understands the terminology within our customers’ data as key to giving them a competitive edge”, says Geoff Horrell, Global Head of Innovation and Labs at Refinitiv.
Teaching BERT to understand financial terminology
Refinitiv Labs took BERT and extended it with additional training, using a filtered version of a finance-specific corpus, the Reuters News Archive, which totalled a further 715 million words from about two million articles published between 1996 and 2019.
The result was a pre-trained model that understands the nuances of terminology used in business and financial news and dramatically improves the accuracy of its predictions.
The goal is that data practitioners, such as those within Refinitiv’s client companies, will ultimately be able to use it to understand the domain-specific language found within their unstructured content, although there is always the option to further fine-tune the model with additional data for their specific tasks.
There are a number of potential use cases for NLP in financial services.
For example, a model that understands business terminology can be used to perform sentiment analysis on unstructured financial data such as company transcripts or filings, news articles or research.
Lessons learned from creating a domain-specific BERT
Tim Nugent from Refinitiv Labs describes the process of pre-training BERT with financial news in more detail in a paper published on arxiv.org.
The team learned that pre-training the model is an intensive task, and that it needed to be scaled up from on-premises GPUs to Google Cloud Tensor Processing Units (TPUs).
The Refinitiv Labs team sanitised the corpus to keep only English language articles with specific Reuters topic codes, such as those for company news, corporate events, and economic news.
The team also used topic codes and headline keywords to exclude news summaries, highlights, digests and market round-ups, because these typically contain lists of unrelated news headlines, which are unsuitable for next sentence prediction.
Many financial news articles also contain ‘structured’ data in ASCII tables and tags, and the team removed these, too.
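The clean-up steps described above could be sketched roughly as follows. This is an illustrative sketch only, not the team's actual pipeline: the topic code names, the keyword list, the article dictionary layout and the heuristic for spotting ASCII table lines are all assumptions made for the example, since the article does not give the real values.

```python
import re

# Hypothetical values for illustration; the real Reuters topic codes and
# headline keywords used by the team are not given in the article.
ALLOWED_TOPIC_CODES = {"COMPANY_NEWS", "CORPORATE_EVENTS", "ECONOMIC_NEWS"}
DIGEST_KEYWORDS = ("summary", "highlights", "digest", "round-up")

# Matches markup-style tags such as <b> or </table>.
TAG_RE = re.compile(r"<[^>]+>")
# Heuristic (an assumption): a line belongs to an ASCII table if it contains
# a column separator "|" or a run of three or more dashes or equals signs.
TABLE_LINE_RE = re.compile(r"\||-{3,}|={3,}")


def keep_article(article):
    """Keep English articles with an allowed topic code that are not digests."""
    if article["language"] != "en":
        return False
    if not ALLOWED_TOPIC_CODES & set(article["topic_codes"]):
        return False
    headline = article["headline"].lower()
    return not any(keyword in headline for keyword in DIGEST_KEYWORDS)


def clean_body(body):
    """Drop ASCII table lines, strip tags and remove empty lines."""
    kept = []
    for line in body.splitlines():
        if TABLE_LINE_RE.search(line):
            continue
        line = TAG_RE.sub("", line).strip()
        if line:
            kept.append(line)
    return "\n".join(kept)


articles = [
    {
        "language": "en",
        "topic_codes": ["COMPANY_NEWS"],
        "headline": "Acme posts record profit",
        "body": "Acme Corp said <b>profit</b> rose.\n"
                "| Q1 | Q2 |\n|----|----|\n| 10 | 12 |\n"
                "Shares gained 3%.",
    },
    {
        "language": "en",
        "topic_codes": ["COMPANY_NEWS"],
        "headline": "Morning digest: top stories",
        "body": "Acme profit up. Rates on hold. Oil slips.",
    },
    {
        "language": "fr",
        "topic_codes": ["ECONOMIC_NEWS"],
        "headline": "La banque centrale maintient ses taux",
        "body": "La banque centrale a maintenu ses taux.",
    },
]

corpus = [clean_body(a["body"]) for a in articles if keep_article(a)]
print(corpus)
```

In this toy data only the first article survives: the second is dropped as a digest and the third as non-English, and the table rows and tag are stripped from the text that remains.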
The Refinitiv team also used back-translation to deal with insufficient sample content.
They translated the text into French and back into English multiple times, using neural machine translation, to create augmented examples of the original text.
The approach generates diverse paraphrases that still preserve the semantics of the original text, and it yields significant improvements in tasks such as question answering.
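The back-translation loop can be sketched in a few lines of Python. The sketch below is a hedged illustration rather than the team's implementation: the `back_translate` function is hypothetical, and the two dictionary-based "translators" are toy stand-ins for a neural translation system, used here only so that the example runs on its own.

```python
from typing import Callable, List

Translator = Callable[[str], str]


def back_translate(text: str, to_pivot: Translator, from_pivot: Translator,
                   rounds: int = 1) -> List[str]:
    """Round-trip `text` through a pivot language and collect new paraphrases."""
    paraphrases: List[str] = []
    current = text
    for _ in range(rounds):
        current = from_pivot(to_pivot(current))
        # Keep only paraphrases that differ from the original and are not repeats.
        if current != text and current not in paraphrases:
            paraphrases.append(current)
    return paraphrases


# Toy stand-ins for a neural English-French translation system (assumption).
EN_TO_FR = {"shares": "actions", "rose": "ont bondi", "sharply": "fortement"}
FR_TO_EN = {"actions": "stocks", "ont bondi": "jumped", "fortement": "strongly"}


def toy_en_to_fr(text: str) -> str:
    for english, french in EN_TO_FR.items():
        text = text.replace(english, french)
    return text


def toy_fr_to_en(text: str) -> str:
    for french, english in FR_TO_EN.items():
        text = text.replace(french, english)
    return text


augmented = back_translate("shares rose sharply", toy_en_to_fr, toy_fr_to_en, rounds=2)
print(augmented)
```

With a real translation model, each round trip tends to introduce slightly different wording, so running several rounds can produce several distinct paraphrases of the same sentence to add to a small training set.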
The future for Refinitiv Labs’ domain-specific BERT model
Ievgen Goichuk, Senior Data Engineer at Refinitiv Labs, explained the next steps to further establish the model for use in financial services.
He said: “In 2021, the world is a very different place to what it was in 2019. As a result, it’s important to keep Refinitiv Labs’ domain-specific BERT model up to date and train it with more recent articles from the Reuters News Archive.”
His colleague, Stanimir Vichev, Senior Data Engineer at Refinitiv Labs, added: “There have been different iterations of BERT since it launched to the open-source community in 2018, and these may also provide extra levels of accuracy to the model.”
Refinitiv’s customers will benefit in due course from the improved accuracy of BERT in this specific domain as the Labs team productise their research for client data practitioners.
In the shorter term, there are already opportunities to build on Refinitiv’s data, tools and analytics services, such as the Refinitiv Data Platform, which provides a free data exploration tool across real-time, time series and reference data, as well as company, fund and sentiment data, and over 3.5 million Reuters News headlines.