Pre-processing data before feeding it into a machine learning or deep learning model is one of the most important phases of the whole process. Without properly pre-processed data it won’t matter how advanced and slick your model is: it will ultimately be inefficient and inaccurate.
One-hot encoding is probably the most commonly utilised pre-processing method for independent categorical data, ensuring that the model can interpret the input data fairly and without bias.
This article will explore the three most common methods of encoding categorical data using the one-hot method, and discuss why you would want to use this technique in the first place.
Introduction #
The following methods are going to be compared and discussed in this article:
- Pandas — get_dummies()
- Scikit-Learn — OneHotEncoder()
- Keras — to_categorical()
All three methods achieve essentially the same outcome. However, they go about it in completely different ways, and have different features and options.
So which of these methods would be best suited for your specific circumstances?
What is One-Hot Encoding? #
Before we dive in, I thought it might be worthwhile giving a quick primer on why you might want to use this method in the first place.
One-hot encoding is a way of preparing categorical data so that the categories are treated as independent of each other by the machine learning or deep learning model.
A solid example #
Let’s use a practical example to really drive the idea home. We have three categories:
- Chicken
- Rock
- Gun
They are in no way related to each other, completely independent.
To feed these categories into a machine learning model we need to turn them into numerical values, as machine and deep learning models can only deal with numerical input. So how best to do this?
A chicken on the run. Photo by James Wainscoat on Unsplash
The simplest way is just to assign each category a number:

| Category | Value |
|---|---|
| Chicken | 1 |
| Rock | 2 |
| Gun | 3 |
The problem with this approach (called ordinal encoding) is that the numbers follow on from one another, so the model can infer a relationship or ordering between the categories.
Is a gun more important than a chicken because it has a higher number? Is a chicken half a rock? If you have three chickens, is that the same as a gun?
What if these values are labels and the model outputs 1.5 as the answer? Is that some sort of chicken-rock? All of these statements are nonsense, but as the model only sees numbers, not the names that we see, inferring these things is perfectly feasible for the model.
To avoid this we need complete separation of the categories. This is what One-Hot Encoding achieves:
| Chicken | Rock | Gun |
|---|---|---|
| 1 | 0 | 0 |
| 0 | 1 | 0 |
| 0 | 0 | 1 |
The values are only ever one or zero (on or off). One means it is that thing, and zero means it isn’t.
So on the first row you have a chicken (no rock and no gun), the second row has a rock (no chicken and no gun) etc. As the values are either on or off, there is no possibility of relating one to the other in any way.
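To make this concrete, here is a minimal sketch that produces exactly this table with Pandas (note that get_dummies() orders the output columns alphabetically):

```python
import pandas as pd

categories = pd.Series(["Chicken", "Rock", "Gun"])
pd.get_dummies(categories)

# Expected output (columns ordered alphabetically):
#    Chicken  Gun  Rock
# 0        1    0    0
# 1        0    0    1
# 2        0    1    0
```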
A quick note #
Before I dive into each of the three methods in more detail I just wanted to point out that I will be using an alternative to Colab, which I would typically use to make code available for the article.
The alternative I will be using is deepnote. It is essentially the same as Colab in that it lets you run Jupyter notebooks in an online environment (there are some differences, which I won’t go into here, but check out the website to learn more).
The main reason for this is that to demonstrate the latest features for some of the methods in this article, I needed access to Pandas 1.5.0 (the latest release at the time of writing), and I can’t seem to achieve this in Colab.
However, in deepnote I can specify a Python version (in this case 3.10), and also make my own requirements.txt to ensure the environment installs Pandas 1.5.0, not the default version.
It also allows very simple live embeds directly from the Jupyter notebook into this article (as you will see), which is very useful.
I will still make the notebook available in Colab as usual, but some of the code won’t run there, so just bear that in mind.
The data #
The data¹²³ relates to the effects of alcohol consumption on student exam results. Not something that you need to remember, but in case you are interested…
Photo by Vinicius “amnx” Amano on Unsplash
As ever I have made the data available in a Jupyter notebook. You can access this in either deepnote:
(as previously mentioned, Pandas 1.5.0 is available in deepnote. Just activate Python 3.10 in the "Environment" section on the right, and create a text file in the "Files" section on the right called "requirements.txt" with the line "pandas==1.5.0" in it. Then run the notebook.)
or Colab:
(some methods that follow will not work as Pandas 1.5.0 is required)
I have selected a dataset that includes a wide range of different categorical and non-categorical columns so that it is easy to see how each of the methods works depending on the datatype. The columns are as follows:
- sex — binary string (‘M’ for male and ‘F’ for female)
- age — standard numerical column (int)
- Medu — Mother’s education — Multiclass integer representation (0 [none], 1 [primary education], 2 [5th to 9th grade], 3 [secondary education] or 4 [higher education])
- Mjob — Mother’s job — Multiclass string representation — (‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
- Dalc — Workday alcohol consumption — Multiclass graduated integer representation (from 1 [very low] to 5 [very high])
- Walc — Weekend alcohol consumption — Multiclass graduated integer representation (from 1 [very low] to 5 [very high])
- G3 — Final grade (the label) — Multiclass graduated integer representation (numeric: from 0 to 20)
An example of the top five rows is as follows:
and the data types:
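In case the live embeds above don’t render for you, here is a minimal sketch of loading and inspecting the data (the file name is an assumption; the dataset is the student performance data cited above):

```python
import pandas as pd

# hypothetical file name for the student performance dataset
df = pd.read_csv("student-mat.csv")

# keep only the columns described above
df = df[["sex", "age", "Medu", "Mjob", "Dalc", "Walc", "G3"]]

df.head()    # the top five rows
df.dtypes    # the data type of each column
```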
Pandas — get_dummies() #
Photo by Pascal Müller on Unsplash
pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)
I would describe the get_dummies() method from Pandas as a middle-of-the-road one-hot encoder. It keeps things simple, while giving a reasonable amount of options to cover the most common use cases.
You can simply pass a Pandas DataFrame to get_dummies() and it will work out which columns are most suitable for one-hot encoding. However, this is not the best way to approach things, as you will see:
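A minimal sketch of that naive call, with df being the DataFrame from the data section:

```python
# no columns specified: get_dummies() decides which columns to encode
dummies = pd.get_dummies(df)
dummies.head()
```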
If you review the output above you will see that only columns with dtype ‘object’ have been one-hot encoded (sex and Mjob). Any integer columns have been ignored, which in our case is not ideal.
However, you can specify the columns you wish to encode as follows:
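For example, targeting all of the categorical columns described in the data section:

```python
# explicitly list the columns to encode, including the integer ones
dummies = pd.get_dummies(df, columns=["sex", "Medu", "Mjob", "Dalc", "Walc"])
dummies.head()
```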
One thing to note is that get_dummies() keeps everything within the DataFrame: there are no extra arrays to deal with, and it is all neatly kept in one place. This is not the case with the OneHotEncoder() and to_categorical() methods, which will be discussed in subsequent sections.
There may be specific circumstances where it is advisable, or useful, to drop the first column of each one-hot encoded series (for example, to avoid multicollinearity). get_dummies() has this ability built in:
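```python
# drop the first dummy column of each encoded series
dummies_dropped = pd.get_dummies(
    df,
    columns=["sex", "Medu", "Mjob", "Dalc", "Walc"],
    drop_first=True,
)
dummies_dropped.head()
```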
Note in the above how, for example, “Medu_0” is now missing.
The way this now works is that if Medu_1 to Medu_4 are all zero, this effectively means Medu_0 (the only other alternative) is “selected”.
Previously, when Medu_0 was included (i.e. drop_first wasn’t used), there would never have been a case where all the values were zero. So by dropping the column we don’t lose any information about the categories, but we do reduce the overall number of columns, and therefore the processing power needed to run the model.
There are more nuanced things to consider when deciding whether dropping a column is appropriate, but as that discussion would warrant a whole article of its own, I will leave it for you to look into.
Additional options #
Apart from ‘drop_first’ there are also additional parameters, such as ‘sparse’ to produce a sparse matrix, and ‘dummy_na’ to help deal with NaN values that may be in your data.
There are also a couple of customisations available for the prefix and separators, should you need that level of flexibility.
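A quick sketch of these options, all of which are documented parameters of get_dummies():

```python
# store the dummy columns as memory-efficient SparseArrays,
# and add an explicit indicator column for NaN values
pd.get_dummies(df, columns=["Mjob"], sparse=True, dummy_na=True)

# customise the prefix and separator of the generated column names
pd.get_dummies(df, columns=["Mjob"], prefix="mother_job", prefix_sep="-")
```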
Reversing get_dummies() with from_dummies() #
Until very recently there was no method in the Pandas library for reversing get_dummies(); you would have had to do this manually.
However, as of Pandas 1.5.0 there is a new method called from_dummies():
pandas.from_dummies(data, sep=None, default_category=None)
This allows the reversal to be achieved without writing your own method. It can even handle the reversal of a one-hot encoding that utilised ‘drop_first’ with the use of the ‘default_category’ parameter as you will see below:
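First, a minimal sketch of a straightforward reversal. Note that from_dummies() expects to receive only dummy columns, so the non-encoded columns are filtered out first:

```python
dummies = pd.get_dummies(df, columns=["sex", "Medu", "Mjob", "Dalc", "Walc"])

# select just the dummy columns for the reversal
prefixes = ("sex_", "Medu_", "Mjob_", "Dalc_", "Walc_")
dummy_cols = [c for c in dummies.columns if c.startswith(prefixes)]

# sep="_" tells from_dummies() how the prefixes were attached
original = pd.from_dummies(dummies[dummy_cols], sep="_")
```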
To reverse when you have used ‘drop_first’ in the encoding you must specify the dropped items:
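For example, assuming drop_first removed the first (alphabetical) category of each column, something like:

```python
# cast to string so both columns are picked up, then drop the first category
dropped = pd.get_dummies(df[["Medu", "Mjob"]].astype(str), drop_first=True)

# map each original column to the category that was dropped
original = pd.from_dummies(
    dropped,
    sep="_",
    default_category={"Medu": "0", "Mjob": "at_home"},
)
```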
scikit-learn — OneHotEncoder() #
Photo by Kelly Sikkema on Unsplash
The OneHotEncoder() method from Scikit-Learn is probably the most comprehensive of all the available methods for one hot encoding.
sklearn.preprocessing.OneHotEncoder(*, categories='auto', drop=None, sparse=True, dtype=<class 'numpy.float64'>, handle_unknown='error', min_frequency=None, max_categories=None)
As you can see from the method signature above, it can handle:
- automatically picking out the categories for one-hot encoding
- dropping columns (not just the first; there are more extensive options available)
- producing sparse matrices
- handling categories that may appear in future datasets (handle_unknown)
- limiting the number of categories returned from the encoding, based on frequency or a maximum number of categories (min_frequency and max_categories; see the sketch after this list)
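As a sketch of those last two points (min_frequency and handle_unknown='infrequent_if_exist' require scikit-learn 1.1 or later; the toy data here is made up):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([["teacher"], ["health"], ["other"], ["other"], ["at_home"]])

# categories seen fewer than twice are grouped into a single "infrequent"
# column, and unseen categories at transform time also fall into it
enc = OneHotEncoder(min_frequency=2, handle_unknown="infrequent_if_exist", sparse=False)
enc.fit(X)

enc.transform([["services"]])  # unseen category -> the infrequent column
```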
The method also uses scikit-learn’s fit/transform pattern, which makes it very useful for machine and deep learning input pipelines.
The encoder #
One of the differences between this method and all the others is that you create an encoder ‘object’, which stores all the parameters that will be used to encode the data.
This can therefore be referred back to, re-used and adjusted at later points in your code, making it a very flexible approach.
Once the encoder has been instantiated we can one-hot encode some data:
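A minimal sketch of that flow, using the same DataFrame as before:

```python
from sklearn.preprocessing import OneHotEncoder

# create the encoder object; sparse=False returns a plain NumPy array
skencoder = OneHotEncoder(sparse=False)

# fit the encoder and transform the data in one step
encoded = skencoder.fit_transform(df)

# equivalently, in two separate steps:
# skencoder.fit(df)
# encoded = skencoder.transform(df)
```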
In this case I have used the ‘fit_transform’ method, but as with all sklearn methods that follow the ‘fit’/’transform’ pattern you can also fit and transform the data in separate steps.
OneHotEncoder encodes the columns you pass it and returns the result as a new array.
This is different to get_dummies(), which keeps the output in the same DataFrame. If you want to keep all your data contained within a DataFrame with minimal effort, then that is something worth considering.
It should also be noted that, left on ‘auto’, OneHotEncoder will encode more of the input columns than get_dummies() does, as you can see below.
Regardless, it is still good practice to specify the columns you wish to target.
For consistency moving forward, I will encode the same columns as we looked at previously with get_dummies():
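Something like:

```python
columns_to_encode = ["sex", "Medu", "Mjob", "Dalc", "Walc"]

skencoder = OneHotEncoder(sparse=False)
encoded = skencoder.fit_transform(df[columns_to_encode])
```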
Columns that have been encoded, and the parameters of the encoder:
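For example:

```python
# the categories found for each encoded column
skencoder.categories_

# the generated output column names (available in scikit-learn 1.0+)
skencoder.get_feature_names_out()
```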
Reversing OneHotEncoder #
There is a very simple method for reversing the encoding. As the encoder is saved as its own object (in this case ‘skencoder’), all the original parameters used to do the one-hot encoding are stored within that object. This makes reversal very easy:
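A minimal sketch using the encoder’s inverse_transform() method:

```python
# recover the original categorical values from the encoded array
original = skencoder.inverse_transform(encoded)
```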