
Categorical Data in Pandas

Examples of working with categorical data in Pandas

Real-world data often contains text columns whose values repeat. Fields such as gender, country/region, and codes take the same values over and over. These are examples of categorical data.
Categorical variables can take only a limited, and usually fixed, number of possible values. Besides having a fixed set of values, categorical data may also have an order, but numerical operations cannot be performed on it. Categorical is a Pandas data type.

Categorical data types are very useful in the following cases:

A string variable that contains only a few different values. Converting such a string variable to a categorical variable will save some memory.

The lexical order of the variable is different from its logical order ("one", "two", "three"). By converting to a categorical and specifying an order on the categories, sorting and minimum/maximum will use the logical order instead of the alphabetical order.

As a signal to other Python libraries that this column should be treated as a categorical variable (for example, so that they use appropriate statistical methods or plot types).
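The first two cases can be illustrated with a minimal sketch. The sample data and variable names below are made up for illustration, and the exact byte counts depend on the pandas version, so none are shown.

```python
import pandas as pd

# Hypothetical sample data: 30,000 strings drawn from only three distinct values.
labels = pd.Series(["one", "two", "three"] * 10_000)
as_category = labels.astype("category")

# deep=True also counts the string payloads, not just the references to them.
string_bytes = labels.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(string_bytes, category_bytes)

# Give the categories a logical order so that sorting is not alphabetical.
order = pd.CategoricalDtype(categories=["one", "two", "three"], ordered=True)
ordered = pd.Series(["three", "one", "two"], dtype=order)
sorted_labels = ordered.sort_values().tolist()
print(sorted_labels)  # ['one', 'two', 'three']
```

The categorical version stores each distinct string once plus a small integer code per row, which is where the saving comes from.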

Object creation

Categorical objects can be created in various ways. The following describes different methods:

Category

By specifying the dtype as "category" when creating the Pandas object.

 import pandas as pd
 s = pd.Series(["a","b","c","a"], dtype="category")
 print(s)

The running results are as follows:

 0 a
 1 b
 2 c
 3 a
 dtype: category
 Categories (3, object): [a, b, c]

The number of elements passed to the Series object is 4, but there are only 3 categories. The last line of the output shows this.
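The difference between the number of elements and the number of categories can be checked directly. The following is a small sketch; the variable names are chosen here for illustration.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

n_values = len(s)                      # number of elements in the Series
n_categories = len(s.cat.categories)   # number of distinct categories
codes = s.cat.codes.tolist()           # integer position of each value in the categories

print(n_values, n_categories)  # 4 3
print(codes)                   # [0, 1, 2, 0]
```

Each element is stored as a code pointing into the category list, so the repeated "a" shares code 0.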

pd.Categorical

Using the standard Pandas categorical constructor, we can create a categorical object.

pandas.Categorical(values, categories, ordered)

Let's look at an example:

 import pandas as pd
 cat = pd.Categorical(['a', 'b', 'c', 'a', 'b', 'c'])
 print(cat)

The running results are as follows:

 [a, b, c, a, b, c]
 Categories (3, object): [a, b, c]

Let's look at another example:

 import pandas as pd
 cat = pd.Categorical(['a', 'b', 'c', 'a', 'b', 'c', 'd'], ['c', 'b', 'a'])
 print(cat)

The running results are as follows:

 [a, b, c, a, b, c, NaN]
 Categories (3, object): [c, b, a]

Here, the second argument specifies the categories. Any value that is not among these categories is treated as NaN.
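One way to see which values fell outside the categories is to inspect the integer codes, where a missing value is stored as -1. This is a short sketch using the same data as the example above.

```python
import pandas as pd

cat = pd.Categorical(['a', 'b', 'c', 'a', 'b', 'c', 'd'], categories=['c', 'b', 'a'])

# Codes are positions in the categories list ['c', 'b', 'a']; -1 marks a missing value.
codes = cat.codes.tolist()
missing = cat.isna().tolist()

print(codes)    # [2, 1, 0, 2, 1, 0, -1]
print(missing)  # [False, False, False, False, False, False, True]
```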
Now, let's look at the following example, which additionally passes ordered=True:

 import pandas as pd
 cat = pd.Categorical(['a', 'b', 'c', 'a', 'b', 'c', 'd'], ['c', 'b', 'a'], ordered=True)
 print(cat)

The running results are as follows:

 [a, b, c, a, b, c, NaN]
 Categories (3, object): [c < b < a]

Logically, this order means a is greater than b and b is greater than c.
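The effect of this ordering on sorting and on minimum/maximum can be sketched as follows. The data here is a made-up example without missing values, wrapped in a Series.

```python
import pandas as pd

# Ordered categories: c < b < a
values = pd.Series(
    pd.Categorical(['b', 'a', 'c', 'a'], categories=['c', 'b', 'a'], ordered=True)
)

sorted_values = values.sort_values().tolist()  # follows the category order, not the alphabet
smallest = values.min()
largest = values.max()

print(sorted_values)      # ['c', 'b', 'a', 'a']
print(smallest, largest)  # c a
```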

Description

Calling .describe() on categorical data produces output similar to that for a Series or DataFrame of strings.

 import pandas as pd
 import numpy as np
 cat = pd.Categorical(["a", "c", "c", np.nan], categories=["b", "a", "c"])
 df = pd.DataFrame({"cat": cat, "s": ["a", "c", "c", np.nan]})
 print(df.describe())
 print(df["cat"].describe())

The running results are as follows:

        cat  s
 count    3  3
 unique   2  2
 top      c  c
 freq     2  2
 count     3
 unique    2
 top       c
 freq      2
 Name: cat, dtype: object
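As a complement to describe(), value_counts() on a categorical Series reports every category, including those that never occur. A brief sketch with the same data:

```python
import pandas as pd
import numpy as np

cat = pd.Categorical(["a", "c", "c", np.nan], categories=["b", "a", "c"])

# NaN is dropped by default; the unused category "b" is still listed with a count of 0.
counts = pd.Series(cat).value_counts().to_dict()
print(counts)  # {'c': 2, 'a': 1, 'b': 0}
```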

Get the attributes of the category

The categories attribute is used to obtain the categories of an object: obj.categories on a Categorical, or obj.cat.categories on a Series.

 import pandas as pd
 import numpy as np
 s = pd.Categorical(["a", "c", "c", np.nan], categories=["b", "a", "c"])
 print(s.categories)

The running results are as follows:

  Index([u'b', u'a', u'c'], dtype='object')

The obj.ordered attribute reports whether the categories of the object are ordered.

 import pandas as pd
 import numpy as np
 cat = pd.Categorical(["a", "c", "c", np.nan], categories=["b", "a", "c"])
 print(cat.ordered)

The running results are as follows:

   False

The attribute is False because we did not specify any order.
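An existing unordered categorical can be given an order afterwards. The sketch below uses as_ordered(), which returns a new object and treats the current category sequence as the order.

```python
import pandas as pd

cat = pd.Categorical(["a", "c", "c"], categories=["b", "a", "c"])

# as_ordered() returns a new Categorical; the original stays unordered.
ordered_cat = cat.as_ordered()

print(cat.ordered, ordered_cat.ordered)  # False True
```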

Rename category

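Categories are renamed with the Series.cat.rename_categories() method, which returns a new object. Older pandas versions also allowed assigning directly to the series.cat.categories attribute, but that assignment was removed in pandas 2.0.

 import pandas as pd
 s = pd.Series(["a","b","c","a"], dtype="category")
 # rename_categories returns a new Series, so assign the result back
 s = s.cat.rename_categories(["Group %s" % g for g in s.cat.categories])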
 print(s.cat.categories)

The running results are as follows:

Index([u'Group a', u'Group b', u'Group c'], dtype='object')

The initial categories [a, b, c] are replaced by the new names.
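As a sketch of another way to rename, the rename_categories() method also accepts a dict that maps old names to new ones. It returns a new Series and leaves the original untouched.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# Map each old category name to its new name.
renamed = s.cat.rename_categories({"a": "Group a", "b": "Group b", "c": "Group c"})

print(renamed.cat.categories.tolist())  # ['Group a', 'Group b', 'Group c']
print(renamed.tolist())                 # ['Group a', 'Group b', 'Group c', 'Group a']
print(s.cat.categories.tolist())        # ['a', 'b', 'c']
```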

Append new category

The Categorical.add_categories() method can be used to append new categories.

 import pandas as pd
 s = pd.Series(["a","b","c","a"], dtype="category")
 s = s.cat.add_categories([4])
 print(s.cat.categories)

The running results are as follows:

Index([u'a', u'b', u'c', 4], dtype='object')
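A typical reason to append a category is to be able to assign a value that is not yet allowed. The sketch below works on a Categorical directly; the exception type raised for an unknown value has differed between pandas versions, so both TypeError and ValueError are caught here as an assumption.

```python
import pandas as pd

cat = pd.Categorical(["a", "b", "c", "a"])

# Assigning a value that is not an existing category is rejected.
try:
    cat[0] = "d"
    rejected = False
except (TypeError, ValueError):
    rejected = True

# After appending the category, the same assignment is accepted.
cat = cat.add_categories(["d"])
cat[0] = "d"

print(rejected)
print(list(cat))
```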

Remove category

The Categorical.remove_categories() method can be used to remove unnecessary categories.

 import pandas as pd
 s = pd.Series(["a","b","c","a"], dtype="category")
 print("Original object:")
 print(s)
 print("After removal:")
 print(s.cat.remove_categories("a"))

The running results are as follows:

 Original object:
 0 a
 1 b
 2 c
 3 a
 dtype: category
 Categories (3, object): [a, b, c]
 After removal:
 0 NaN
 1 b
 2 c
 3 NaN
 dtype: category
 Categories (2, object): [b, c]
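A related situation is a category that remains defined after all of its values have been filtered out. As a sketch, remove_unused_categories() drops such categories without having to name them.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# Filtering rows does not change the set of categories.
subset = s[s != "a"]
before = subset.cat.categories.tolist()

# Drop every category that no longer has any values.
trimmed = subset.cat.remove_unused_categories()
after = trimmed.cat.categories.tolist()

print(before)  # ['a', 'b', 'c']
print(after)   # ['b', 'c']
```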

Categorical Data Comparison

There are three cases where categorical data can be compared with other objects:

Equality comparisons (== and !=) with list-like objects (list, Series, array, ...) of the same length as the categorical data.

All comparisons (==, !=, >, >=, < and <=) of categorical data with another categorical Series, when ordered == True and the categories are the same.

All comparisons of categorical data with a scalar.

See the following example:

 import pandas as pd
 # Current pandas rejects categories/ordered keywords in astype; pass a CategoricalDtype instead
 dtype = pd.CategoricalDtype(categories=[1,2,3], ordered=True)
 cat = pd.Series([1,2,3]).astype(dtype)
 cat1 = pd.Series([2,2,2]).astype(dtype)
 print(cat>cat1)

The running results are as follows:

 0    False
 1    False
 2     True
 dtype: bool
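The other two cases, equality with a list-like object and comparison with a scalar, can be sketched in the same way. The variable names are illustrative; the last part shows that an unordered categorical rejects > and <.

```python
import pandas as pd

dtype = pd.CategoricalDtype(categories=[1, 2, 3], ordered=True)
cat = pd.Series([1, 2, 3], dtype=dtype)

# Equality with a list of the same length.
equal_to_list = (cat == [1, 2, 2]).tolist()

# Ordering comparison with a scalar that is one of the categories.
above_two = (cat > 2).tolist()

# Without an order, only == and != are allowed.
unordered = pd.Series([1, 2, 3], dtype="category")
try:
    unordered > 2
    unordered_rejected = False
except TypeError:
    unordered_rejected = True

print(equal_to_list)       # [True, True, False]
print(above_two)           # [False, False, True]
print(unordered_rejected)  # True
```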