Python, along with the numpy/pandas libraries, has essentially become the language of choice for the data science profession (…I’ll add a quick nod to R here).

However, it is well known that Python, although quick and easy to write, is slow to execute. Hence the need for excellent libraries like numpy to increase efficiency…but what if there were a better alternative?

Julia claims to be at least as easy and intuitive to use as Python, whilst being significantly faster to execute. Let’s put that claim to the test…

What is Julia? #

Just in case you have no idea what Julia is, here is a quick primer.

Julia is an open source language that is dynamically typed, intuitive, and easy to use like Python, but with the speed of execution of a language like C.

It has been around approximately 10 years (born in 2012), so it is a relatively new language. However, it is at a stage of maturity where you wouldn’t call it a fad.

The original creators of the language are active in a relevant field of work:

For the work we do — scientific computing, machine learning, data mining, large-scale linear algebra, distributed and parallel computing — …
julialang.org — Jeff Bezanson, Stefan Karpinski, Viral B. Shah, Alan Edelman

All in all, it is a modern language specifically designed to be used in the field of data science. The aims of the creators themselves tell you a great deal:

We want the speed of C with the dynamism of Ruby. We want a language that’s homoiconic, with true macros like Lisp, but with obvious, familiar mathematical notation like Matlab. We want something as usable for general programming as Python, as easy for statistics as R, as natural for string processing as Perl, as powerful for linear algebra as Matlab, as good at gluing programs together as the shell. Something that is dirt simple to learn, yet keeps the most serious hackers happy. We want it interactive and we want it compiled.
(Did we mention it should be as fast as C?)
julialang.org — Jeff Bezanson, Stefan Karpinski, Viral B. Shah, Alan Edelman

Sounds quite exciting, right?

The basis of the speed test #

I have written an article previously looking at vectorization using the numpy library in Python:

How to Speed up Data Processing with Numpy Vectorization

The speed test conducted in this article is essentially an extension of, and a comparison with, that article.

How will the test work? #

A comparison will be made between the speed of execution of a simple mathematical statement:

Function 1 — Simple summation #

#Python
def sum_nums(a, b):
    return a + b

#Julia
function sum_nums(x,y)
    x + y
end

and a more complicated conditional statement:

Function 2 — More complex (logic and arithmetic) #

#Python
def categorise(a, b):
    if a < 0:
        return a * 2 + b
    elif b < 0:
        return a + 2 * b
    else:
        return None

#Julia
function categorise(a, b)::Float32
    if a < 0
        return a * 2 + b
    elseif b < 0
        return a + 2 * b
    else
        return 0
    end
end

When run through the following methods:

Python- pandas.itertuples()
Python- list comprehension
Python- numpy.vectorize()
Python- native pandas method
Python- native numpy method
Julia- native method
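
As a rough sketch of what the five Python methods look like in practice, the snippet below applies Function 1 to two columns in each of the ways listed. The DataFrame `df`, its column names `a` and `b`, the seed, and the small array size are assumptions made for illustration; this is not the code from the notebooks.

```python
import numpy as np
import pandas as pd

def sum_nums(a, b):
    return a + b

# Hypothetical input: two columns of random numbers (kept small here).
rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.standard_normal(1_000),
                   "b": rng.standard_normal(1_000)})

# 1. pandas itertuples: loop over the rows as named tuples
res_itertuples = [sum_nums(row.a, row.b) for row in df.itertuples()]

# 2. list comprehension over the two columns zipped together
res_listcomp = [sum_nums(a, b) for a, b in zip(df["a"], df["b"])]

# 3. numpy.vectorize: wraps the Python function so it accepts arrays
res_vectorize = np.vectorize(sum_nums)(df["a"].to_numpy(), df["b"].to_numpy())

# 4. native pandas: operate on whole Series at once
res_pandas = df["a"] + df["b"]

# 5. native numpy: operate on the underlying arrays
res_numpy = df["a"].to_numpy() + df["b"].to_numpy()
```

All five produce the same values; only the route taken to compute them differs, which is what the timings measure.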

The notebooks for this article #

Photo by Tirachard Kumtanom

The previous article included a Jupyter notebook written in Python. I have taken this notebook (unchanged) from the previous article, and re-run it using a deepnote instance which utilises Python 3.10.

Both the Python runs and the Julia runs use deepnote instances with exactly the same basic CPU configuration (i.e. hardware). This ensures that the timed results included with this article are directly comparable.

Note: I have made sure to include the CPU information in each notebook so you can see what exact hardware was used, and that they were in fact exactly the same.

Running the Julia notebook #

It is worth noting that whether you wish to use the notebooks in deepnote as I have, or in colab, you will need to set up Julia in the respective environment. This is mainly because most public online instances are currently set up for Python only (at least out of the box).

Environment Setup #

Deepnote #

As deepnote utilises docker instances, you can very easily set up a ‘local’ dockerfile to contain the install instructions for Julia. This means you don’t have to pollute the Jupyter notebook with install code, as you will have to do in Colab.

In the environment section select “Local ./Dockerfile”. This will open the actual Dockerfile where you should add the following:

FROM deepnote/python:3.10
RUN wget https://julialang-s3.julialang.org/bin/linux/x64/1.8/julia-1.8.2-linux-x86_64.tar.gz && \
    tar -xvzf julia-1.8.2-linux-x86_64.tar.gz && \
    mv julia-1.8.2 /usr/lib/ && \
    ln -s /usr/lib/julia-1.8.2/bin/julia /usr/bin/julia && \
    rm julia-1.8.2-linux-x86_64.tar.gz && \
    julia  -e "using Pkg;pkg\"add IJulia\""
ENV DEFAULT_KERNEL_NAME "julia-1.8"

You can update the above to the latest Julia version from this page, but at the time of writing 1.8.2 is the latest version.

Colab #

For colab, all the download and install code has to be included in the notebook itself, and you will need to refresh the page once the install code has run.

Fortunately, Aurélien Géron (…that name will be familiar to a few here, I reckon) has made a starter notebook for Julia in colab available on his GitHub, which is probably the best way to get started.

The notebooks #

The raw notebooks can be found here:

notebooks/julia-python-comparison at main · thetestspecimen/notebooks (GitHub)

…or get kickstarted in either deepnote or colab.

Python Notebook:

Julia Notebook:

The Results #

Photo by Lukas

If you haven’t read my previous article on numpy vectorization I would encourage you to (obviously!), as it will help you get an idea of how the Python methods stack up before we jump into the Julia results.

All will be summarised and compared at the end of the article, so don’t worry too much if you don’t have the time.

The input data #

Define a random number generator, and two columns of one million random numbers taken from a normal distribution, just like in the numpy vectorization article:
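
On the Python side, the input data described above could be generated along the following lines. This is a minimal sketch: the seed and the variable names `rng`, `n`, `a` and `b` are assumptions, and the notebooks may set things up differently (for example by placing the two arrays in a pandas DataFrame).

```python
import numpy as np

# Assumed seed, used here only to make the sketch reproducible.
rng = np.random.default_rng(seed=42)

n = 1_000_000
a = rng.standard_normal(n)  # first column: one million draws from N(0, 1)
b = rng.standard_normal(n)  # second column: another one million draws
```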

Function 1 — Simple summation #

Using Julia’s BenchmarkTools it is possible to get a fair estimate of the function’s performance automatically, as the “@benchmark” macro decides for itself how many times to evaluate the function to attain a fair estimate of runtime. It also provides a wealth of statistics, as can be seen below:

To gain a fair comparison to the Python methods the mean time will be used, which in this case is 711.7 microseconds (or 0.71 milliseconds) to sum a 1 million element array with another 1 million element array.
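
For the Python side, a comparable mean can be obtained with the standard-library `timeit` module. The sketch below is only an assumption about how such a measurement could be made (the notebooks may use a different mechanism, such as the `%timeit` magic), and the `number` and `repeat` values are arbitrary.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(1_000_000)
b = rng.standard_normal(1_000_000)

def sum_nums(a, b):
    return a + b

# Each sample times `number` consecutive calls; `repeat` samples are taken.
number, repeat = 10, 5
samples = timeit.repeat(lambda: sum_nums(a, b), number=number, repeat=repeat)

# Convert each sample to a per-call time in milliseconds, then average.
per_call_ms = [s / number * 1_000 for s in samples]
mean_ms = sum(per_call_ms) / len(per_call_ms)
print(f"mean: {mean_ms:.3f} ms per call")
```

The absolute number will depend on the hardware, so it is only meaningful when compared against other methods timed on the same machine.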

Function 2 — More complex (logic and arithmetic) #

An example of what the method returns:
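
As an illustration on the Python side, the Python version of Function 2 defined earlier behaves as follows on each of its three branches. The input values are arbitrary and chosen only for this sketch; note that the Julia version returns 0 rather than None in the final branch.

```python
def categorise(a, b):
    if a < 0:
        return a * 2 + b
    elif b < 0:
        return a + 2 * b
    else:
        return None

print(categorise(-1.0, 3.0))  # first branch:  -1.0 * 2 + 3.0 = 1.0
print(categorise(1.0, -2.0))  # second branch: 1.0 + 2 * -2.0 = -3.0
print(categorise(1.0, 2.0))   # neither value is negative: None
```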

The benchmark:

So the more involved method results in a mean execution time of 7.62 milliseconds.

How does this compare to Python? #

Photo by Maksim Goncharenok

Now for the actual comparison. Firstly, let’s see what the results look like all together:

Method	Function 1 (Simple) [ms]	Function 2 (Complex) [ms]
Python(Pandas): iter tuples	419.14	419.22
Python: list comprehension	179.64	188.33
Python(Numpy): vectorize	163.07	140.78
Python(Pandas): native	0.96	-
Python(Numpy): native	0.81	-
Julia: native	0.71	7.62

Table — All Results

Figure 1 — All Results

Then we can proceed to break it down a little.

Results: Function 1 — Simple #

It is quite clear from Figure 1 for the simple summation function that there is a lot to be gained from ensuring you are using optimised libraries such as numpy where Python is concerned. The difference is so large the faster methods almost look to be zero.

I covered the reasoning for this in my previous article on numpy vectorization, so if you want more details please refer to that.

Figure 2 — Simple Function Fastest Three Results

However, even optimised Python libraries are insufficient to elevate the execution speed to the level of a language that is designed to be fast from the ground up.

As you can see in Figure 2, in this specific test Julia’s native built-in implementation is about 14% faster (0.71 ms against 0.81 ms) than an optimised Python library (numpy) that executes C under the hood.

Results: Function 2-Complex #

Although the previous result is already impressive, in the second test Julia blows the competition out of the water.

As explained in my previous article it is not possible to implement a ‘native’ version of the complex function in numpy, so we instantly lose the closest competitors from the previous round.

Even the method ‘Vectorize’ from numpy can’t hold a candle to Julia in this instance.

Figure 3— Complex Function Fastest Two Results

Julia is a full 18 times faster than numpy vectorize at completing the more complex calculation.
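
Both headline figures (14% for Function 1 and 18 times for Function 2) can be checked directly against the results table. The short calculation below uses the mean times copied from that table; the variable names are only for illustration.

```python
# Mean run times in milliseconds, copied from the results table.
numpy_native_f1 = 0.81
julia_f1 = 0.71
numpy_vectorize_f2 = 140.78
julia_f2 = 7.62

# Function 1: ratio of numpy's time to Julia's time.
f1_ratio = numpy_native_f1 / julia_f1
print(f"Function 1: Julia is {(f1_ratio - 1) * 100:.0f}% faster")  # 14%

# Function 2: ratio of numpy.vectorize's time to Julia's time.
f2_ratio = numpy_vectorize_f2 / julia_f2
print(f"Function 2: Julia is {f2_ratio:.1f}x faster")  # 18.5x
```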

So what happened? #

One of the reasons numpy is so fast in certain circumstances is that it uses pre-compiled and optimised C functions to execute the calculations. As you may be aware, C is extremely fast if used correctly. However, the important point is this: if something is pre-compiled, then it is inherently fixed.

What this illustrates is that if your calculation is simple (like Function 1), and has a predefined optimised function within the numpy library, execution times are almost the same as Julia.

However, if the calculation you want to perform is a bit more convoluted or bespoke, and not covered by an optimised numpy function, you will be out of luck when it comes to speed. This is because you will have to rely on standard Python to fill the gap, which results in the large disparity we see in Figure 3 for the ‘complex’ function.
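
The reliance on standard Python can be made visible with a small experiment. The sketch below counts how many times the body of the Python function runs when it is wrapped in `numpy.vectorize`, which the NumPy documentation describes as a convenience that is essentially a for loop. The `counter` dictionary, the array size, and the use of 0.0 in the final branch (mirroring the Julia version rather than returning None) are assumptions made for this illustration.

```python
import numpy as np

counter = {"calls": 0}

def categorise(a, b):
    counter["calls"] += 1  # count every execution of the Python function body
    if a < 0:
        return a * 2 + b
    elif b < 0:
        return a + 2 * b
    else:
        return 0.0  # assumption: mirror the Julia version instead of None

rng = np.random.default_rng(0)
a = rng.standard_normal(10_000)
b = rng.standard_normal(10_000)

# otypes fixes the output dtype up front, so no trial call is needed.
result = np.vectorize(categorise, otypes=[float])(a, b)

print(counter["calls"], "Python-level calls for", a.size, "elements")
```

The function is executed once per element by the interpreter, so none of the work is handed to a pre-compiled C routine in the way that `a + b` is.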

Conclusion #

In terms of answering: Is Julia really faster than Python and Numpy?

Well, yes it is, and in some cases by a large margin.

As always, it is important to take these results for what they are: a small comparison of some specific functions. Even so, the results are real and relevant all the same. Julia is a fast language, and although we haven’t touched on it much in this article, it is actually very intuitive to use as well.

If you want a more general guide as to how fast Julia is compared to a wider array of languages then you can take a look at the general benchmarks on Julia’s own website:

Julia Micro-Benchmarks (julialang.org)

The result is the same: it is in a league of its own, especially in the data science world. Unless of course you write all your code in C.

A final question… #

Image by Arek Socha from Pixabay

If Julia is so good, why doesn’t it have the same traction and recognition as Python/Pandas/NumPy/R?

I think the answer to that question is mainly time. It just hasn’t been around long enough, but the reality is that Julia is on the up, and at some point it is likely (at least in my opinion) that it will take over in the field of data science. It is already being used by the likes of Microsoft, Google, Amazon, Intel, IBM and NASA, for example.

Python and R are industry standards at this stage, and that will take a lot of momentum to change, regardless of how good the new upstart is.

An additional factor is the availability of learning resources. Again, just due to time, and the sheer volume of people using Python and R for data science, resources to learn from are plentiful. For Julia, although there is plenty of documentation, it can’t really compete on overall learning resources (yet!).

…but if you are feeling adventurous I encourage you to give Julia a go and see what you think.

Published 11 Nov 2022