Python is becoming a more popular programming language across the financial services industry because it is relatively easy to code. However, it is slow compared with other languages. In this guest blog, Saeed Amen, a quant and co-author of The Book of Alternative Data, explores approaches that you can adopt to speed up your Python code.
- While Python is simpler to write than languages such as Java or C++, it tends to be far slower. There are ways to speed up your Python code, but each will require some element of rewriting your code.
- Techniques include replacing for loops with vectorized code using Pandas or NumPy, parallelizing your code with Python libraries, and shifting heavy data computation outside Python.
- Cython can be used to write Python-like code that generates C code, which compiles down into fast machine code. Numba can also convert Python and NumPy code into fast machine code.
Among the reasons for this growth is the plethora of data science libraries in Python. For example, years ago, you’d have to spend months writing a time series library if you wanted to do any financial analysis. Today, we have Pandas, a very comprehensive time series library.
Python is also easier to write than languages such as C++. As a consequence, many people in financial organizations besides technologists, such as traders and risk managers, are learning Python.
However, while Python is relatively intuitive and quick to write, it is also slow compared with other languages. It is generally faster than R, but much slower than Java and C++.
The flipside, however, is that writing something in C++ will take much longer.
Tips for quicker Python code
So how can you speed up Python?
Code profiling identifies which parts of your code are bottlenecks, in terms of both computation time and memory usage. Once you find a bottleneck, you can try several solutions.
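As a minimal sketch of profiling with the standard library's cProfile (the function below is a made-up stand-in for a real bottleneck):

```python
import cProfile
import io
import pstats

def slow_sum(n):
    """A deliberately loop-heavy stand-in for a real bottleneck."""
    total = 0
    for i in range(n):
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
slow_sum(100_000)
profiler.disable()

# Print the five most expensive calls, sorted by cumulative time
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

For line-by-line timing or memory breakdowns, third-party tools such as line_profiler and memory_profiler go further than cProfile.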
If you have a lot of for loops in data computations, the first thing you should try is to vectorize your code using libraries like Pandas (or NumPy, which can be faster).
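A toy illustration of the difference (the price series here is invented): computing period-on-period returns with a for loop versus Pandas' vectorized pct_change.

```python
import numpy as np
import pandas as pd

prices = pd.Series([100.0, 101.0, 99.0, 102.0, 104.0])

# Loop version: every iteration goes through the Python interpreter
returns_loop = []
for i in range(1, len(prices)):
    returns_loop.append(prices.iloc[i] / prices.iloc[i - 1] - 1.0)

# Vectorized version: one call, executed in optimized C under the hood
returns_vec = prices.pct_change().dropna()

# Both give the same numbers; the vectorized form is far faster on large series
assert np.allclose(returns_loop, returns_vec.values)
```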
You can also parallelize your code using Python libraries such as threading and multiprocessing (and there are many high-level wrappers that simplify this process, such as concurrent.futures).
Threading is ideal for I/O-bound operations such as reading from disk or downloading data. Multiprocessing, meanwhile, is better suited to computationally heavy tasks, such as risk calculations.
You can then choose to run these tasks across more computing cores to speed them up. Running on the cloud can scale the code even more. There are also other ways to parallelize your code such as by using Dask, which also enables you to work with datasets that are much bigger than memory.
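A small sketch with Dask arrays (the shapes here are arbitrary): the array is built lazily from chunks, so it never has to fit in memory all at once, and the reduction is parallelized across chunks.

```python
import dask.array as da

# A lazily evaluated random array, split into ten 1,000 x 1,000 chunks;
# only a few chunks at a time need to be in memory
x = da.random.random((10_000, 1_000), chunks=(1_000, 1_000))

# Nothing runs until .compute(); the mean is then reduced chunk by chunk,
# potentially across several cores (or a cluster, via dask.distributed)
col_means = x.mean(axis=0).compute()
```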
Do more computation in a database
Alternatively, you can run your heaviest data computations outside Python, for example inside a SQL database. The SQLAlchemy library lets you build SQL queries in a Pythonic way.
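As a hedged sketch (table and column names invented; an in-memory SQLite database stands in for a production one), SQLAlchemy Core lets you express an aggregation in Python while the database does the work:

```python
from sqlalchemy import (Column, Float, MetaData, String, Table,
                        create_engine, func, select)

engine = create_engine("sqlite:///:memory:")  # stand-in for a real database
metadata = MetaData()
trades = Table("trades", metadata,
               Column("symbol", String),
               Column("price", Float))
metadata.create_all(engine)

with engine.begin() as conn:
    conn.execute(trades.insert(), [
        {"symbol": "AAPL", "price": 150.0},
        {"symbol": "AAPL", "price": 152.0},
        {"symbol": "MSFT", "price": 300.0},
    ])

# The average is computed inside the database, not in Python
query = (select(trades.c.symbol, func.avg(trades.c.price).label("avg_price"))
         .group_by(trades.c.symbol)
         .order_by(trades.c.symbol))

with engine.connect() as conn:
    rows = conn.execute(query).fetchall()
```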
If you are using high-frequency tick data, then a KDB database might be the answer, although you'll need to get your head around q, which is a challenging language. The idea is to do the computation inside KDB and then pull the results into Python via qPython.
Obviously, the more of the computation you offload to KDB (and the less you do in Python), the more q you’ll need to write. Unlike many of the other solutions discussed here, KDB is not open source.
The Spark analytics engine can also be used for data processing, and the Koalas library gives you a Pandas-like interface for working with time series on Spark, which makes things easier.
Cython and C code
Rewriting specific parts of your code in Cython could also help. The Cython language is a superset of Python that also lets you call C functions and declare C types. Cython generates C code from it, which is then compiled ahead of time into fast machine code.
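A minimal sketch of what that looks like (a made-up module; in practice you'd build it with, for example, `cythonize -i fast_math.pyx` and then import it from Python like any other module):

```cython
# fast_math.pyx -- Python-like code with C type declarations
def sum_of_squares(long n):
    cdef long i
    cdef long total = 0
    for i in range(n):
        total += i * i
    return total  # the typed loop runs as compiled C, not interpreted Python
```

The `cdef` declarations are what unlock the speed-up: without them, Cython still has to fall back on generic Python objects.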
Numba is a just-in-time compiler, which can convert Python and NumPy code into much faster machine code. As with Cython, you will often need to rewrite your code to make Numba speed it up.
PyPy is an alternative to CPython, the standard Python interpreter, and is often much faster. However, it isn't compatible with every Python library, although it has recently started to support Pandas and NumPy.
There are many ways of speeding up Python code, but they usually require rewriting some of your codebase. The trick is to spend just enough time to achieve the speed-up you need. If you end up spending a very long time, you might as well have written it in a language like C++ or Java in the first place.