Problem and opportunity
Understanding financial markets through language is the next frontier in digital transformation - and ultimately in securing competitive advantage.
To scale ‘understanding’ and problem solving with Natural Language Processing (NLP), financial firms need to train and monitor many models. For a mid-sized firm that wants to deploy NLP across five or more use-cases, this adds up to roughly 100 or more models.
Training each model to an accurate and reliable standard is also expensive - in five years’ time, a company could invest $1bn in compute time to train a single language model. Each model must also be maintained and tuned over time, which adds further talent and compute costs.
However, what if upfront use of a single, domain-specific language model could cut those training costs? This is the hypothesis Refinitiv Labs set out to test.
Refinitiv Labs’ Financial Language Modelling project helps financial firms build state-of-the-art and scalable services in the financial domain using NLP.
The team trained two domain-specific flavours of Google’s language model, Bidirectional Encoder Representations from Transformers (BERT). The models generate high-quality embeddings from user-supplied texts to understand language. Embeddings are numeric representations of words; they capture the meaning of, and relationships between, words in a given document and can be used to solve downstream NLP tasks such as text classification, sentiment analysis, information retrieval and question answering.
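To make the idea of embeddings as numeric representations concrete, here is a minimal sketch of comparing vectors with cosine similarity, a common way to measure how close two embeddings are. The four-dimensional vectors are invented for illustration only; they are not output from BERT-RNA or BERT-TRAN, whose embeddings have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors, made up for this example (assumption, not real model output).
toy_embeddings = {
    "profit": [0.9, 0.1, 0.3, 0.0],
    "earnings": [0.8, 0.2, 0.4, 0.1],
    "weather": [0.0, 0.9, 0.1, 0.7],
}

related = cosine_similarity(toy_embeddings["profit"], toy_embeddings["earnings"])
unrelated = cosine_similarity(toy_embeddings["profit"], toy_embeddings["weather"])
print(f"profit vs earnings: {related:.3f}")
print(f"profit vs weather:  {unrelated:.3f}")
```

With vectors like these, related terms score close to 1 and unrelated terms score much lower, which is the property downstream tasks such as retrieval and classification rely on.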
These new BERT models grew from research into using domain-specific language models conducted by Refinitiv Labs and can integrate into NLP training pipelines - turning text training data into numeric data, on which further training (or fine-tuning) can happen.
Refinitiv Labs is currently working with a group of customers to validate and improve the new models and API. One customer recently saw a 2-4% improvement in model performance with each of the two models, BERT-RNA and BERT-TRAN.
Financial Language Modelling in action
Refinitiv Labs’ API returns a single document embedding, or a vector of word embeddings, for two pre-trained BERT models based on the cased BERT-Base architecture.
BERT-Base is trained on English Wikipedia and the BookCorpus and consists of 12 layers, 768 hidden units and 12 attention heads per layer, for a total of 110m parameters.
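As a rough check on the parameter figure, the sketch below counts the weights of a BERT-Base-shaped encoder from the layer count and hidden size. The vocabulary size (28,996 for the cased WordPiece vocabulary), the 512 position embeddings, the two segment types and the 3,072-unit feed-forward layer are assumptions taken from the standard BERT-Base configuration, not from this article.

```python
def bert_parameter_count(
    layers=12,
    hidden=768,
    vocab_size=28_996,    # assumption: cased BERT-Base WordPiece vocabulary
    max_positions=512,    # assumption: standard BERT position embeddings
    type_vocab=2,         # assumption: two segment (token type) embeddings
    intermediate=3_072,   # assumption: feed-forward width of 4 x hidden
):
    """Count weights and biases in a BERT-style encoder with a pooler."""
    layer_norm = 2 * hidden  # scale and bias

    # Token, position and segment embeddings, followed by a layer norm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm

    # Query, key, value and output projections, followed by a layer norm.
    # The number of attention heads splits these matrices but adds no weights.
    attention = 4 * (hidden * hidden + hidden) + layer_norm

    # Two dense layers (hidden -> intermediate -> hidden) and a layer norm.
    feed_forward = (
        (hidden * intermediate + intermediate)
        + (intermediate * hidden + hidden)
        + layer_norm
    )

    # Dense layer applied to the first token's output.
    pooler = hidden * hidden + hidden

    return embeddings + layers * (attention + feed_forward) + pooler


total = bert_parameter_count()
print(f"{total:,} parameters (~{total / 1e6:.0f}m)")
```

Under these assumptions the count comes to about 108m, in line with the rounded 110m figure usually quoted for BERT-Base.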
BERT-RNA was pre-trained using the Reuters News Archive, which consists of all Reuters articles published between 1996 and 2019. Refinitiv Labs filtered the corpus using metadata so that only English-language articles with Reuters topic codes matching company news, corporate events, government finances or economic news were retained.
Additionally, the team used topic codes and headline keywords to exclude articles that were news summaries, highlights, digests and market round-ups. Such articles typically contain lists or bullet points of unrelated news headlines, which are unsuitable for the next sentence prediction task within the BERT pre-training loss function.
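The filtering rules described above could be sketched roughly as follows. The topic code names, the keyword list and the `Article` structure are hypothetical placeholders; the article does not give the actual Reuters codes or keywords the team used.

```python
from dataclasses import dataclass, field

# Hypothetical topic codes and keywords, invented for this sketch.
ALLOWED_TOPICS = {
    "COMPANY_NEWS",
    "CORPORATE_EVENTS",
    "GOVERNMENT_FINANCES",
    "ECONOMIC_NEWS",
}
EXCLUDED_TOPICS = {"NEWS_SUMMARY", "NEWS_DIGEST"}
EXCLUDED_HEADLINE_KEYWORDS = ("highlights", "digest", "round-up", "summary")


@dataclass
class Article:
    headline: str
    language: str
    topic_codes: set = field(default_factory=set)


def keep_article(article):
    """Return True if the article passes all metadata and headline filters."""
    if article.language != "en":
        return False
    # Must carry at least one in-scope topic code.
    if not (article.topic_codes & ALLOWED_TOPICS):
        return False
    # Must not be tagged as a summary-style item.
    if article.topic_codes & EXCLUDED_TOPICS:
        return False
    # Must not look like a digest or round-up from its headline.
    headline = article.headline.lower()
    return not any(keyword in headline for keyword in EXCLUDED_HEADLINE_KEYWORDS)


corpus = [
    Article("Acme Corp raises full-year guidance", "en", {"COMPANY_NEWS"}),
    Article("Morning digest: top business stories", "en", {"COMPANY_NEWS"}),
    Article("Acme Corp relève ses prévisions", "fr", {"COMPANY_NEWS"}),
    Article("Local team wins cup final", "en", {"SPORT"}),
]
filtered = [article for article in corpus if keep_article(article)]
print([article.headline for article in filtered])
```

Keeping each rule as a separate early return makes it easy to log how many articles each filter removes, which helps when tuning the keyword list.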
The resulting filtered corpus consists of 2.2m articles and 715m words. Refinitiv Labs performed pre-training using a maximum sequence length of 128 for 5m steps, using 50,000 warm-up steps, a learning rate of 1e-5 and a batch size of 256. The team then pre-trained using a maximum sequence length of 512 for a further 1m steps, since long sequences are mostly needed to learn positional embeddings, which can be learned relatively quickly.
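To illustrate what the warm-up steps do, here is a minimal sketch of a learning-rate schedule using the figures quoted above for the first pre-training phase. The linear ramp followed by a linear decay to zero is an assumption based on the schedule commonly used for BERT pre-training; the article states only the warm-up length and the learning rate.

```python
def learning_rate(step, peak_lr=1e-5, warmup_steps=50_000, total_steps=5_000_000):
    """Linear warm-up to peak_lr, then linear decay to zero (assumed shape)."""
    if not 0 <= step <= total_steps:
        raise ValueError("step must be between 0 and total_steps")
    if step < warmup_steps:
        # Ramp up from zero so early, noisy updates stay small.
        return peak_lr * (step / warmup_steps)
    # Decay linearly from the peak to zero over the remaining steps.
    return peak_lr * ((total_steps - step) / (total_steps - warmup_steps))


for step in (0, 25_000, 50_000, 2_525_000, 5_000_000):
    print(f"step {step:>9,}: lr = {learning_rate(step):.2e}")
```

The rate reaches its peak of 1e-5 exactly at step 50,000 and falls back to zero at the final step.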
BERT-TRAN was pre-trained using a large corpus of earnings call transcripts, consisting of 390,000 transcripts totalling 2.9bn words.
Pre-training was run using a batch size of 512 for 2.5m steps at a maximum sequence length of 128 tokens, and then 500,000 steps at a maximum sequence length of 512 tokens. Pre-training for both models was run on Google Cloud Tensor Processing Units (TPUs).
Key learnings for data scientists, engineers and researchers
Refinitiv Labs’ work on Financial Language Modelling makes it easier for financial firms to make the best use of their data and efficiently train and deploy NLP and ML projects. Based on this work, the team has the following practical advice for fellow practitioners:
- Make sure you have enough training samples to achieve good accuracy. In classification, you need enough samples in each category and can use backtranslation to synthesize more if needed. Backtranslation is machine translation into another language and back again, for example English-French-English. This paraphrases the text while preserving its meaning.
- Take time to understand your textual data and properly filter and pre-process it to ensure you get the best embeddings and results.
- Make sure you clean your text data, e.g. removing ASCII tables and other structured data inside the text.
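- Be aware that a standard BERT model can process a maximum of 512 tokens per sample. Tokens are WordPiece sub-word units rather than whole words, so a text usually contains more tokens than words.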
- Review and understand the costs involved in working with these models. They often require GPU instances for both training and inference.
- Think about how you could automate your training and inference pipelines, including model validation and drift detection. This will help you scale to more models in production.
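Two of the points above, cleaning structured residue out of text and respecting the 512-token limit, can be sketched as small helper functions. This is an illustrative sketch under stated assumptions: the table heuristic is deliberately simple, the function names are hypothetical, and `chunk_tokens` expects a list of tokens already produced by a WordPiece tokenizer. It reserves two positions per chunk on the assumption that the special [CLS] and [SEP] tokens are added later.

```python
def is_table_line(line):
    """Heuristic: treat border rows and pipe-delimited rows as ASCII tables."""
    stripped = line.strip()
    if stripped.count("|") >= 2:
        return True
    # Border rows are built only from characters such as + - = | _
    return len(stripped) >= 3 and set(stripped) <= set("+-=|_ ")


def strip_ascii_tables(text):
    """Drop lines that look like ASCII-table rows or borders."""
    return "\n".join(line for line in text.splitlines() if not is_table_line(line))


def chunk_tokens(tokens, max_len=512, reserved=2):
    """Split a token list into windows that fit the model's input limit."""
    window = max_len - reserved
    if window <= 0:
        raise ValueError("max_len must be larger than reserved")
    return [tokens[i:i + window] for i in range(0, len(tokens), window)]


raw = (
    "Revenue rose in the quarter.\n"
    "+------+------+\n"
    "| Q1   | Q2   |\n"
    "+------+------+\n"
    "The outlook is unchanged."
)
print(strip_ascii_tables(raw))

# Placeholder tokens standing in for real WordPiece output.
tokens = [f"tok{i}" for i in range(1200)]
print([len(chunk) for chunk in chunk_tokens(tokens)])
```

Non-overlapping windows are the simplest option; overlapping windows or truncation may suit some tasks better, depending on how much context each chunk needs.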
Explore the data Refinitiv Labs used to train their BERT models
Free datasets and notebooks for data scientists, developers and quants
The data exploration tool was built by Refinitiv Labs and includes Refinitiv knowledge graph content with Jupyter notebooks.