The data science benefits of Python are now being felt across financial markets. Dedicated ‘Pythonista’ Saeed Amen takes us on a tour of the best Python tools and libraries.
- The growing importance of Python tools for financial markets reflects the large ecosystem of data science libraries, such as NumPy or pandas.
- Many funds use Python to model financial markets, with banks including JP Morgan and Bank of America also hosting extensive Python-based infrastructure.
- Refinitiv offers a Python API allowing users to seamlessly access Eikon data from any in-house or third-party application, as well as integrate with Python libraries.
For anyone number crunching in financial markets, it used to be the case that traders looked no further than Excel. While quants might have used MATLAB for heavier duty prototyping, Excel was great for working with smaller datasets.
Larger datasets, however, exposed the limitations of Excel, even if these could be mitigated to some extent by using VBA. These days, the open-source programming languages R and Python have made significant inroads in financial markets.
Python — a multi-purpose language used for tasks such as web development and data science — has come into its own because of its role in the adoption of artificial intelligence and machine learning.
It is relatively easy to learn, and users have access to a vast number of online communities that offer support.
Many funds now use Python extensively to model financial markets, including the well-known quant firm AHL. Large banks, such as JP Morgan and Bank of America, also have extensive Python-based infrastructure.
Refinitiv offers an easy-to-use Python API that lets users seamlessly access Eikon data from any in-house or third-party application, and integrate it with Python libraries.
Once you’ve got the market data into your application, you need to analyze it.
I find Python invaluable for analysis of financial markets, whether that’s backtesting trading strategies or any other sort of number crunching.
Python and pandas
The main reason Python has grown in importance is its large ecosystem of data science libraries. These include well-known libraries such as NumPy, for dealing with matrices, and pandas, for dealing with time series.
On the machine learning side, there’s scikit-learn, and in more recent years, libraries such as TensorFlow (from Google) and PyTorch (from Facebook) have become very popular.
However, these aren’t the only useful tools for Python data scientists. In fact, there’s a whole plethora of lesser-known Python tools for financial markets that can be just as invaluable.
One great use case for Python is analyzing text, from basic tasks such as text matching and replacing through to full natural language processing (NLP), for which there are many libraries.
These include the oldest NLP Python library, NLTK. TextBlob is a wrapper on top of NLTK, which makes it easier to use in my experience. There are also newer and faster libraries like spaCy.
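As a rough illustration of basic text matching and replacing, here is a minimal sketch using only Python’s built-in `re` module rather than any of the NLP libraries named above. The headline and the currency-pair pattern are made up for the example.

```python
import re

# Hypothetical news headline, for illustration only.
headline = "Fed holds rates; EUR/USD slips while GBP/USD and USD/JPY rally"

# Match currency pairs written as three capitals, a slash, three capitals.
pair_pattern = re.compile(r"\b([A-Z]{3})/([A-Z]{3})\b")

# Matching: collect every (base, quote) pair mentioned in the text.
pairs = pair_pattern.findall(headline)

# Replacing: rewrite each pair in its slash-free form, e.g. EUR/USD -> EURUSD.
normalized = pair_pattern.sub(r"\1\2", headline)

print(pairs)
print(normalized)
```

A pattern like this only covers simple, well-formatted cases. Anything that depends on grammar or meaning is where libraries such as NLTK, TextBlob or spaCy come in.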
Ways to speed up Python
While Python is easier to write than many other languages, which explains its popularity, it can be slow during execution. Even as a dedicated Pythonista, I have to admit this.
Luckily, there are many solutions we can draw upon to make Python code much quicker. We can speed up many mathematical computations by vectorizing operations using NumPy.
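To make the idea of vectorizing concrete, the sketch below computes simple returns twice: once with a Python loop and once with a single NumPy array expression. The prices are invented for the example, and the actual speed-up will depend on the size of your data.

```python
import numpy as np

# Hypothetical daily closing prices, for illustration only.
prices = np.array([100.0, 101.0, 99.0, 102.0, 104.0])

# Loop version: one Python-level iteration per observation.
loop_returns = []
for i in range(1, len(prices)):
    loop_returns.append(prices[i] / prices[i - 1] - 1.0)

# Vectorized version: one array expression, evaluated in compiled code.
vec_returns = prices[1:] / prices[:-1] - 1.0

# Both approaches give the same numbers.
print(np.allclose(loop_returns, vec_returns))
```

On five data points the difference is negligible, but on long price histories the vectorized form avoids the per-element overhead of the Python interpreter.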
Another alternative is using Cython to write Python-like code that converts to C code, which can significantly speed up loops in Python.
There are also many Python libraries for parallelization. The threading library can help speed up I/O-bound processes, while the standard multiprocessing library is a starting point for CPU-bound processes.
In practice, it is possible to use higher-level libraries like concurrent.futures, asyncio and Joblib, which provide more convenient abstractions over the lower-level threading and multiprocessing libraries.
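As a minimal sketch of the higher-level approach, the example below uses `ThreadPoolExecutor` from `concurrent.futures` to run several simulated I/O-bound calls at once. The `fetch_close` function is a hypothetical stand-in for a slow request such as a market data download. It sleeps briefly and returns a dummy value, not real data.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_close(ticker):
    """Hypothetical stand-in for a slow I/O call, e.g. a market data request."""
    time.sleep(0.05)  # simulate network latency
    return ticker, len(ticker)  # dummy payload, not real market data


tickers = ["EURUSD", "GBPUSD", "USDJPY", "AUDUSD"]

# Threads suit I/O-bound work: while one call waits, the others can proceed.
with ThreadPoolExecutor(max_workers=4) as executor:
    # map() returns results in the same order as the inputs.
    results = dict(executor.map(fetch_close, tickers))

print(results)
```

For CPU-bound work, `ProcessPoolExecutor` from the same module offers a near-identical interface, although the worker function then needs to be importable by the child processes.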
Split tasks across machines
For more flexibility around distributing tasks, Celery is a good solution, and one I’ve used extensively. It offers plenty of flexibility in terms of configuration and the ability to split tasks across multiple machines.
Celery uses a message broker, such as Redis, to communicate between the various workers. The main caveat is that Celery can have a steep learning curve, and it can take time to work out the various tricks needed to run it effectively.
Pandas is a very commonly used library within finance, given the prevalence of time series within markets. However, pandas can sometimes be limiting, particularly if your datasets are larger than the available memory. This forces you to split up tasks into smaller batches. Pandas also doesn’t always take advantage of the parallelization offered by modern multi-core CPUs.
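To show what splitting a task into batches looks like in plain pandas, here is a small sketch that computes an average price by reading a CSV in chunks via the `chunksize` argument of `pd.read_csv`. The in-memory CSV and its `price` column are invented for the example. In practice, the source would be a file too large to load in one go.

```python
import io

import pandas as pd

# Hypothetical price data held in memory so the example is self-contained.
csv_data = io.StringIO("price\n100\n101\n99\n102\n104\n103\n")

total = 0.0
count = 0

# Read two rows at a time and keep only running totals in memory.
for chunk in pd.read_csv(csv_data, chunksize=2):
    total += chunk["price"].sum()
    count += len(chunk)

mean_price = total / count
print(mean_price)
```

This works for simple aggregates, but the bookkeeping quickly becomes awkward for operations that span batch boundaries, such as rolling windows.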
Python libraries such as Dask and Modin look like pandas on the surface. Underneath, however, they implement many of the same operations so that they run in parallel, which can speed up pandas-style time series processing significantly.
Dask allows you to run pandas-like operations on datasets that far exceed the available memory, by cleverly doing all the batching in the background.
It is worth noting, though, that Modin is a less mature library and not as commonly used as Dask. The newer Vaex project is a set of libraries that is similar to Dask, but also includes visualization tools. In a sense, it is a one-stop shop for out-of-core DataFrames.
Visualization and Python tools
When it comes to showing your results to other people, visualization is key. Python is stacked full of different visualization tools. There’s Matplotlib, which is part of the SciPy stack.
There are so many Python tools for working with data in financial markets.
It is often worth checking GitHub before you start building specific functionality, to make sure you aren’t reinventing the wheel. Whether it’s machine learning, text analysis or time series analysis, it is likely there will be something for you.