Pandas Missing Data Examples
In real-world data, missing values are a constant problem. In machine learning and data mining they degrade data quality and, with it, the accuracy of model predictions, so handling missing values is a major part of making models more accurate and effective.
Consider an online survey for a product. People often do not share all the information asked for. Some share their experience but not how long they have used the product; others share how long they have used it and their experience but not their contact information. In one way or another some data is always missing, which is very common in real-world situations.
Now let's see how pandas handles missing values (such as NA or NaN).
# import the pandas library
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
# Reindexing with labels that do not exist yet ('b', 'd', 'g') creates rows of NaN
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df)
The running result is as follows (the values are random, so they differ on each run):
        one       two     three
a  0.077988  0.476149  0.965836
b       NaN       NaN       NaN
c -0.390208 -0.551605 -2.301950
d       NaN       NaN       NaN
e -2.000303 -0.788201  1.510072
f -0.930230 -0.670473  1.146615
g       NaN       NaN       NaN
h  0.085100  0.532791  0.887415
Using reindexing, we have created a DataFrame with missing values. In the output, NaN stands for "Not a Number".
To make detecting missing values easier (and consistent across different array dtypes), pandas provides the isnull() and notnull() functions, which are also methods on Series and DataFrame objects:
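One practical consequence of NaN being a floating-point marker is that it never compares equal to anything, including itself, so an equality test cannot find missing values. The following is a minimal sketch of that behaviour; the Series s and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative Series with one missing entry
s = pd.Series([1.0, np.nan, 3.0], index=['a', 'b', 'c'])

# NaN does not compare equal, not even to itself
nan_equals_itself = (np.nan == np.nan)
print(nan_equals_itself)

# So an equality test finds no missing values at all
eq_hits = int((s == np.nan).sum())
print(eq_hits)

# A dedicated missing-value check finds the one missing entry
null_hits = int(s.isnull().sum())
print(null_hits)
```

This is why a dedicated detection function is needed rather than a plain comparison.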
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
# True where the value in column 'one' is missing
print(df['one'].isnull())
The running result is as follows:
a    False
b     True
c    False
d     True
e    False
f    False
g     True
h    False
Name: one, dtype: bool
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
# True where the value in column 'one' is present
print(df['one'].notnull())
The running result is as follows:
a     True
b    False
c     True
d    False
e     True
f     True
g    False
h     True
Name: one, dtype: bool
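Because isnull() and notnull() return boolean results, they combine naturally with sum(), any() and boolean indexing. The sketch below, which reuses the same DataFrame construction, counts the missing values per column and selects rows; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

# True counts as 1, so summing the mask counts missing values per column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Labels of the rows that contain at least one missing value
rows_with_missing = df.index[df.isnull().any(axis=1)].tolist()
print(rows_with_missing)

# Boolean indexing: keep only the rows where column 'one' is present
complete_rows = df[df['one'].notnull()]
print(complete_rows)
```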
Calculations with missing data
When summing data, NA is treated as zero.
If all the data are NA, the sum is 0 in pandas 0.22 and later (older versions returned NaN); pass min_count=1 to sum() to get NaN instead.
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
# The NaN entries in rows 'b', 'd' and 'g' are skipped
print(df['one'].sum())
The running result is as follows (the value is random, so it differs on each run):
2.02357685917
import pandas as pd
import numpy as np

# A DataFrame created without data contains only NaN
df = pd.DataFrame(index=[0, 1, 2, 3, 4, 5], columns=['one', 'two'])
print(df['one'].sum())
The running result is as follows:
0
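How missing values enter a sum can also be controlled explicitly. The sketch below uses the skipna and min_count parameters of Series.sum() on two small hand-made Series (the data and variable names are illustrative).

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 2.0])
all_missing = pd.Series([np.nan, np.nan])

# Default behaviour: missing values are skipped, so 1.0 + 2.0
total = s.sum()
print(total)

# skipna=False lets a single NaN propagate into the result
strict_total = s.sum(skipna=False)
print(strict_total)

# min_count=1 requires at least one valid value, otherwise the result is NaN
guarded_total = all_missing.sum(min_count=1)
print(guarded_total)

# mean() also skips NaN: (1.0 + 2.0) / 2
average = s.mean()
print(average)
```

Using min_count=1 is one way to distinguish "the values add up to zero" from "there were no values at all".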
Fill in missing data
Replace NaN with a scalar value
The following program shows how to replace NaN with 0.
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c'])
print(df)
print("NaN replaced with '0':")
print(df.fillna(0))
The running result is as follows:
        one       two     three
a -0.576991 -0.741695  0.553172
b       NaN       NaN       NaN
c  0.744328 -1.735166  1.749580
NaN replaced with '0':
        one       two     three
a -0.576991 -0.741695  0.553172
b  0.000000  0.000000  0.000000
c  0.744328 -1.735166  1.749580
Here we fill with zero; any other value can be used in the same way.
The filling methods discussed in the 'Reindexing' chapter can also be used to fill missing values forward or backward.
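Besides a single scalar, fillna() also accepts a dict or a Series keyed by column name, so each column can receive its own fill value. The following is a small sketch on a hand-made DataFrame; the data and the chosen fill values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1.0, np.nan, 3.0],
                   'two': [np.nan, 5.0, 7.0]},
                  index=['a', 'b', 'c'])

# A dict gives each column its own fill value
per_column = df.fillna({'one': 0.0, 'two': -1.0})
print(per_column)

# df.mean() is a Series indexed by column name,
# so each column is filled with its own mean
mean_filled = df.fillna(df.mean())
print(mean_filled)
```

Filling with a column statistic such as the mean is a common choice when a constant like 0 would distort the data.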
Method | Operation |
pad/ffill | Fill values forward |
bfill/backfill | Fill values backward |
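import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
# Forward fill: each NaN takes the last valid value above it.
# ffill() replaces fillna(method='pad'), which is deprecated since pandas 2.1.
print(df.ffill())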
The running result is as follows:
        one       two     three
a  0.077988  0.476149  0.965836
b  0.077988  0.476149  0.965836
c -0.390208 -0.551605 -2.301950
d -0.390208 -0.551605 -2.301950
e -2.000303 -0.788201  1.510072
f -0.930230 -0.670473  1.146615
g -0.930230 -0.670473  1.146615
h  0.085100  0.532791  0.887415
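import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
# Backward fill: each NaN takes the next valid value below it.
# bfill() replaces fillna(method='backfill'), which is deprecated since pandas 2.1.
print(df.bfill())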
The running result is as follows:
        one       two     three
a  0.077988  0.476149  0.965836
b -0.390208 -0.551605 -2.301950
c -0.390208 -0.551605 -2.301950
d -2.000303 -0.788201  1.510072
e -2.000303 -0.788201  1.510072
f -0.930230 -0.670473  1.146615
g  0.085100  0.532791  0.887415
h  0.085100  0.532791  0.887415
To simply exclude missing values, use the dropna function together with the axis argument. By default axis=0, which works along the rows: if any value in a row is NA, the entire row is excluded.
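Forward and backward filling can also be capped so that long gaps are not filled completely. The sketch below uses the limit parameter of ffill() and bfill() on a hand-made Series with a gap of three missing values (the data is illustrative).

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# Without a limit, the whole gap takes the last valid value
full = s.ffill()
print(full.tolist())

# limit=1 fills at most one consecutive missing value going forward
limited = s.ffill(limit=1)
print(limited.tolist())

# The same limit for a backward fill, working from the next valid value
limited_back = s.bfill(limit=1)
print(limited_back.tolist())
```

A limit is useful when a value can reasonably be carried over one step, but not across a long run of missing observations.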
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
# Drop every row that contains at least one missing value
print(df.dropna())
The running result is as follows:
        one       two     three
a  0.077988  0.476149  0.965836
c -0.390208 -0.551605 -2.301950
e -2.000303 -0.788201  1.510072
f -0.930230 -0.670473  1.146615
h  0.085100  0.532791  0.887415
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
# axis=1 drops every column that contains at least one missing value;
# here every column has NaN in rows 'b', 'd' and 'g', so none survive
print(df.dropna(axis=1))
The running result is as follows:
Empty DataFrame
Columns: []
Index: [a, b, c, d, e, f, g, h]
Often we need to replace a generic value with a specific one. The replace method does this.
Replacing NA with a scalar value using replace gives the same result as the fillna() function.
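Dropping on "any missing value" can be too aggressive, as the empty result above shows. dropna() has further parameters that relax the rule; the sketch below tries how, thresh and subset on a small hand-made DataFrame (the data is illustrative).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one':   [1.0, np.nan, np.nan, 4.0],
                   'two':   [1.0, 2.0,    np.nan, 4.0],
                   'three': [1.0, 2.0,    np.nan, np.nan]},
                  index=['a', 'b', 'c', 'd'])

# how='all' drops a row only when every value in it is missing
drop_all_missing = df.dropna(how='all')
print(drop_all_missing)

# thresh=3 keeps only rows with at least three non-missing values
at_least_three = df.dropna(thresh=3)
print(at_least_three)

# subset restricts the check to the listed columns
by_subset = df.dropna(subset=['one'])
print(by_subset)
```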
import pandas as pd
import numpy as np

df = pd.DataFrame({'one': [10, 20, 30, 40, 50, 2000],
                   'two': [1000, 0, 30, 40, 50, 60]})
# Replace 1000 with 10 and 2000 with 60 wherever they occur
print(df.replace({1000: 10, 2000: 60}))
The running result is as follows:
   one  two
0   10   10
1   20    0
2   30   30
3   40   40
4   50   50
5   60   60
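The equivalence between replace and fillna() for missing values can be checked directly. The following sketch compares the two on a hand-made DataFrame and also shows a nested dict that restricts a replacement to one column; the data and variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [10.0, np.nan, 30.0],
                   'two': [np.nan, 0.0, 2000.0]})

# Replacing NaN with a scalar via replace() ...
via_replace = df.replace(np.nan, 0)
# ... and via fillna()
via_fillna = df.fillna(0)

# Both approaches produce the same DataFrame
same_result = via_replace.equals(via_fillna)
print(same_result)

# A nested dict limits the replacement to column 'two'
per_column = df.replace({'two': {2000.0: 60.0}})
print(per_column)
```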