Pre-processing data before feeding it into a machine learning or deep learning model is one of the most important phases of the whole process. Without properly pre-processed data it won’t matter how advanced and slick your model is: it will ultimately be inefficient and inaccurate.
One-hot encoding is probably the most commonly utilised pre-processing method for independent categorical data, ensuring that the model can interpret the input data fairly and without bias.
This article will explore the three most common methods of encoding categorical data using the one-hot method, and discuss why you would want to use this technique in the first place.
Introduction #
The following methods are going to be compared and discussed in this article:
- Pandas — get_dummies()
- Scikit-Learn — OneHotEncoder()
- Keras — to_categorical()
All three methods achieve essentially the same outcome. However, they go about it in completely different ways, and have different features and options.
So which of these methods would be best suited for your specific circumstances?
What is One-Hot Encoding? #
Before we dive in, I thought it might be worthwhile giving a quick primer on why you might want to use this method in the first place.
One-hot encoding is a way of preparing categorical data so that the categories are treated as independent of each other by the machine learning or deep learning model.
A solid example #
Let’s use a practical example to really drive the idea home. We have three categories:
- Chicken
- Rock
- Gun
They are in no way related to each other, completely independent.
To feed these categories into a machine learning model we need to turn them into numerical values, as machine and deep learning models can only deal with numerical input. So how best to do this?
A chicken on the run. Photo by James Wainscoat on Unsplash
The simplest way is just to assign each category a number:

| Category | Value |
|---|---|
| Chicken | 1 |
| Rock | 2 |
| Gun | 3 |
The problem with this approach (called ordinal encoding) is that the numbers follow on from one another, so the model can infer a relationship or ordering between the categories.
Is a gun more important than a chicken because it has a higher number? Is a chicken half a rock? If you have three chickens, is that the same as a gun?
What if these values are labels and the model outputs 1.5 as the answer? Is that some sort of chicken-rock? All of these statements are nonsense, but as the model only sees numbers, not the names that we see, inferring these things is perfectly feasible for the model.
To avoid this we need complete separation of the categories. This is what One-Hot Encoding achieves:
| Chicken | Rock | Gun |
|---|---|---|
| 1 | 0 | 0 |
| 0 | 1 | 0 |
| 0 | 0 | 1 |
The values are only ever one or zero (on or off). One means it is that thing, and zero means it isn’t.
So on the first row you have a chicken (no rock and no gun), the second row has a rock (no chicken and no gun) etc. As the values are either on or off, there is no possibility of relating one to the other in any way.
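To make this concrete, here is a minimal sketch that produces exactly this table with Pandas (note that get_dummies() orders the output columns alphabetically):

```python
import pandas as pd

categories = pd.Series(["Chicken", "Rock", "Gun"])
pd.get_dummies(categories)

# Expected output (columns ordered alphabetically):
#    Chicken  Gun  Rock
# 0        1    0    0
# 1        0    0    1
# 2        0    1    0
```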
A quick note #
Before I dive into each of the three methods in more detail I just wanted to point out that I will be using an alternative to Colab, which I would typically use to make code available for the article.
The alternative I will be using is deepnote. It is essentially the same as Colab in that it lets you run Jupyter notebooks in an online environment (there are some differences, which I won’t go into here, but check out the website to learn more).
The main reason for this is that to demonstrate the latest features for some of the methods in this article, I needed access to Pandas 1.5.0 (the latest release at the time of writing), and I can’t seem to achieve this in Colab.
However, in deepnote I can specify a Python version (in this case 3.10), and also make my own requirements.txt to ensure the environment installs Pandas 1.5.0, not the default version.
It also allows very simple live embeds directly from the Jupyter notebook into this article (as you will see), which is very useful.
I will still make the notebook available in Colab as usual, but some of the code won’t run there, so just bear that in mind.
The data #
The data¹²³ relates to the effects of alcohol consumption on student exam results. Not something that you need to remember, but in case you are interested…
Photo by Vinicius “amnx” Amano on Unsplash
As ever I have made the data available in a Jupyter notebook. You can access this in either deepnote:
(as previously mentioned, Pandas 1.5.0 is available in deepnote. Just activate Python 3.10 in the "Environment" section on the right, and create a text file in the "Files" section on the right called "requirements.txt" with the line "pandas==1.5.0" in it. Then run the notebook.)
or Colab:
(some methods that follow will not work as Pandas 1.5.0 is required)
I have selected a dataset that includes a wide range of different categorical and non-categorical columns so that it is easy to see how each of the methods works depending on the datatype. The columns are as follows:
- sex — binary string (‘M’ for male and ‘F’ for female)
- age — standard numerical column (int)
- Medu — Mother’s education — Multiclass integer representation (0 [none], 1 [primary education], 2 [5th to 9th grade], 3 [secondary education] or 4 [higher education])
- Mjob — Mother’s job — Multiclass string representation — (‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
- Dalc — Workday alcohol consumption — Multiclass graduated integer representation (from 1 [very low] to 5 [very high])
- Walc — Weekend alcohol consumption — Multiclass graduated integer representation (from 1 [very low] to 5 [very high])
- G3 — Final grade (the label) — Multiclass graduated integer representation (numeric: from 0 to 20)
An example of the top five rows is as follows:
and the data types:
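In case the live embeds above don’t render for you, here is a minimal sketch of loading and inspecting the data (the file name is an assumption; the dataset is the student performance data cited above):

```python
import pandas as pd

# hypothetical file name for the student performance dataset
df = pd.read_csv("student-mat.csv")

# keep only the columns described above
df = df[["sex", "age", "Medu", "Mjob", "Dalc", "Walc", "G3"]]

df.head()    # the top five rows
df.dtypes    # the data type of each column
```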
Pandas — get_dummies() #
Photo by Pascal Müller on Unsplash
pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)
I would describe the get_dummies() method from Pandas as a middle-of-the-road one-hot encoder. It keeps things simple, while giving a reasonable amount of options to cover the most common use cases.
You can simply pass a Pandas DataFrame to get_dummies() and it will work out which columns are most suitable for one-hot encoding. However, this is not the best way to approach things, as you will see:
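A minimal sketch of that naive call, with df being the DataFrame from the data section:

```python
# no columns specified: get_dummies() decides which columns to encode
dummies = pd.get_dummies(df)
dummies.head()
```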
If you review the output above you will see that only columns with dtype ‘object’ have been one-hot encoded (sex and Mjob). Any integer columns have been ignored, which in our case is not ideal.
However, you can specify the columns you wish to encode as follows:
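For example, targeting all of the categorical columns described in the data section:

```python
# explicitly list the columns to encode, including the integer ones
dummies = pd.get_dummies(df, columns=["sex", "Medu", "Mjob", "Dalc", "Walc"])
dummies.head()
```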
One thing to note is that get_dummies() keeps everything within the DataFrame: there are no extra arrays to deal with, and it is all neatly kept in one place. This is not the case with the OneHotEncoder() and to_categorical() methods, which will be discussed in subsequent sections.
There may be specific circumstances where it is advisable, or useful, to drop the first column of each one-hot encoded series (for example, to avoid multicollinearity). get_dummies() has this ability built in:
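```python
# drop the first dummy column of each encoded series
dummies_dropped = pd.get_dummies(
    df,
    columns=["sex", "Medu", "Mjob", "Dalc", "Walc"],
    drop_first=True,
)
dummies_dropped.head()
```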
Note in the above how, for example, “Medu_0” is now missing.
The way this now works is that if Medu_1 to Medu_4 are all zero, this effectively means Medu_0 (the only other alternative) is “selected”.
Previously, when Medu_0 was included (i.e. drop_first wasn’t used), there would never have been a case where all the values were zero. So by dropping the column we don’t lose any information about the categories, but we do reduce the overall number of columns, and therefore the processing power needed to run the model.
There are more nuanced things to consider when deciding whether dropping a column is appropriate, but as that discussion would warrant a whole article of its own, I will leave it for you to look into.
Additional options #
Apart from ‘drop_first’ there are also additional parameters, such as ‘sparse’ to produce a sparse matrix, and ‘dummy_na’ to help deal with NaN values that may be in your data.
There are also a couple of customisations available for the prefix and separators, should you need that level of flexibility.
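A quick sketch of these options, all of which are documented parameters of get_dummies():

```python
# store the dummy columns as memory-efficient SparseArrays,
# and add an explicit indicator column for NaN values
pd.get_dummies(df, columns=["Mjob"], sparse=True, dummy_na=True)

# customise the prefix and separator of the generated column names
pd.get_dummies(df, columns=["Mjob"], prefix="mother_job", prefix_sep="-")
```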
Reversing get_dummies() with from_dummies() #
Until very recently there was no method in the Pandas library for reversing get_dummies(); you would have had to do this manually.
However, as of Pandas 1.5.0 there is a new method called from_dummies():
pandas.from_dummies(data, sep=None, default_category=None)
This allows the reversal to be achieved without writing your own method. It can even handle the reversal of a one-hot encoding that utilised ‘drop_first’ with the use of the ‘default_category’ parameter as you will see below:
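First, a minimal sketch of a straightforward reversal. Note that from_dummies() expects to receive only dummy columns, so the non-encoded columns are filtered out first:

```python
dummies = pd.get_dummies(df, columns=["sex", "Medu", "Mjob", "Dalc", "Walc"])

# select just the dummy columns for the reversal
prefixes = ("sex_", "Medu_", "Mjob_", "Dalc_", "Walc_")
dummy_cols = [c for c in dummies.columns if c.startswith(prefixes)]

# sep="_" tells from_dummies() how the prefixes were attached
original = pd.from_dummies(dummies[dummy_cols], sep="_")
```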
To reverse when you have used ‘drop_first’ in the encoding you must specify the dropped items:
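For example, assuming drop_first removed the first (alphabetical) category of each column, something like:

```python
# cast to string so both columns are picked up, then drop the first category
dropped = pd.get_dummies(df[["Medu", "Mjob"]].astype(str), drop_first=True)

# map each original column to the category that was dropped
original = pd.from_dummies(
    dropped,
    sep="_",
    default_category={"Medu": "0", "Mjob": "at_home"},
)
```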
scikit-learn — OneHotEncoder() #
Photo by Kelly Sikkema on Unsplash
The OneHotEncoder() method from Scikit-Learn is probably the most comprehensive of all the available methods for one hot encoding.
sklearn.preprocessing.OneHotEncoder(*, categories='auto', drop=None, sparse=True, dtype=<class 'numpy.float64'>, handle_unknown='error', min_frequency=None, max_categories=None)
As you can see from the method signature above, it can handle:
- automatically picking out the categories for one-hot encoding
- dropping columns (not just the first; there are more extensive options available)
- producing sparse matrices
- handling categories that may appear in future datasets (handle_unknown)
- limiting the number of categories returned from the encoding, based on frequency or a maximum number of categories (min_frequency and max_categories; see the sketch after this list)
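As a sketch of those last two points (min_frequency and handle_unknown='infrequent_if_exist' require scikit-learn 1.1 or later; the toy data here is made up):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([["teacher"], ["health"], ["other"], ["other"], ["at_home"]])

# categories seen fewer than twice are grouped into a single "infrequent"
# column, and unseen categories at transform time also fall into it
enc = OneHotEncoder(min_frequency=2, handle_unknown="infrequent_if_exist", sparse=False)
enc.fit(X)

enc.transform([["services"]])  # unseen category -> the infrequent column
```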
The method also uses scikit-learn’s fit/transform pattern, which makes it very useful for machine and deep learning input pipelines.
The encoder #
One of the differences between this method and all the others is that you create an encoder ‘object’, which stores all the parameters that will be used to encode the data.
This can therefore be referred back to, re-used and adjusted at later points in your code, making it a very flexible approach.
Once the encoder has been instantiated we can one-hot encode some data:
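A minimal sketch of that flow, using the same DataFrame as before:

```python
from sklearn.preprocessing import OneHotEncoder

# create the encoder object; sparse=False returns a plain NumPy array
skencoder = OneHotEncoder(sparse=False)

# fit the encoder and transform the data in one step
encoded = skencoder.fit_transform(df)

# equivalently, in two separate steps:
# skencoder.fit(df)
# encoded = skencoder.transform(df)
```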
In this case I have used the ‘fit_transform’ method, but as with all sklearn methods that follow the ‘fit’/’transform’ pattern you can also fit and transform the data in separate steps.
OneHotEncoder encodes the columns you pass it and returns the result as a new array.
This is different to get_dummies(), which keeps the output in the same DataFrame. If you want to keep all your data contained within a DataFrame with minimal effort, then that is something worth considering.
It should also be noted that, left on ‘auto’, OneHotEncoder will encode more of the input columns than get_dummies() does, as you can see below.
Regardless, it is still good practice to specify the columns you wish to target.
For consistency moving forward, I will encode the same columns as we looked at previously with get_dummies():
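Something like:

```python
columns_to_encode = ["sex", "Medu", "Mjob", "Dalc", "Walc"]

skencoder = OneHotEncoder(sparse=False)
encoded = skencoder.fit_transform(df[columns_to_encode])
```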
Columns that have been encoded, and the parameters of the encoder:
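For example:

```python
# the categories found for each encoded column
skencoder.categories_

# the generated output column names (available in scikit-learn 1.0+)
skencoder.get_feature_names_out()
```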
Reversing OneHotEncoder #
There is a very simple method for reversing the encoding. As the encoder is saved as its own object (in this case ‘skencoder’), all the original parameters used to do the one-hot encoding are stored within that object. This makes reversal very easy:
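A minimal sketch using the encoder’s inverse_transform() method:

```python
# recover the original categorical values from the encoded array
original = skencoder.inverse_transform(encoded)
```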