
Missing Data in Pandas

Examples of handling missing data in pandas

Missing data is a constant problem in real-world work. In machine learning and data mining, missing values lower data quality and therefore degrade the accuracy of model predictions, so handling them is a major part of making a model accurate and effective.

When and why is data missing?

Consider an online survey about a product. People often do not share everything they could: some describe their experience but not how long they have used the product; others share their experience and how long they have used the product, but not their contact information. One way or another, some data is always missing, which is very common in real-world situations.
Now let's see how pandas handles missing values (such as NA or NaN).

# import the pandas library
 import pandas as pd
 import numpy as np
 df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
 'h'], columns=['one', 'two', 'three'])
 df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
 print(df)

The running result is as follows (the values are random, so they differ on every run):

         one        two      three
a   0.077988   0.476149   0.965836
b        NaN        NaN        NaN
c  -0.390208  -0.551605  -2.301950
d        NaN        NaN        NaN
e  -2.000303  -0.788201   1.510072
f  -0.930230  -0.670473   1.146615
g        NaN        NaN        NaN
h   0.085100   0.532791   0.887415

By reindexing, we create a DataFrame with missing values: the labels 'b', 'd' and 'g' do not exist in the original index, so their rows are filled with NaN. In the output, NaN stands for "Not a Number".
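As a small reproducible sketch of the same idea (the frame `scores` and its values are made up for illustration, not taken from the example above), reindexing to a label that does not exist produces a row of NaN. The sketch also shows that NaN never compares equal to itself, which is why a dedicated check such as `pd.isna` is needed.

```python
import numpy as np
import pandas as pd

# Hypothetical data, fixed so the result is reproducible.
scores = pd.DataFrame({"one": [1.0, 2.0], "two": [3.0, 4.0]}, index=["a", "c"])

# Label "b" is not in the original index, so its row is filled with NaN.
expanded = scores.reindex(["a", "b", "c"])
print(expanded)

# NaN does not compare equal to anything, including itself.
nan_equals_itself = np.nan == np.nan
print(nan_equals_itself)

# pd.isna is the reliable way to test a single value.
b_is_missing = pd.isna(expanded.loc["b", "one"])
print(b_is_missing)
```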

Check for missing values

To make detecting missing values easier (and consistent across different array dtypes), pandas provides the isnull() and notnull() functions, which are also available as methods on Series and DataFrame objects.

Instance 1

 import pandas as pd
 import numpy as np
  
 df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
 'h'], columns=['one', 'two', 'three'])
 df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
 print(df['one'].isnull())

The running result is as follows:

 a    False
 b     True
 c    False
 d     True
 e    False
 f    False
 g     True
 h    False
 Name: one, dtype: bool

Instance 2

 import pandas as pd
 import numpy as np
 df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
 'h'], columns=['one', 'two', 'three'])
 df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
 print(df['one'].notnull())

The running result is as follows:

 a     True
 b    False
 c     True
 d    False
 e     True
 f     True
 g    False
 h     True
 Name: one, dtype: bool
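A common follow-up is to count the missing values or to keep only the rows where a value is present. The sketch below assumes a small hand-made frame (its values are illustrative only) and relies on the fact that summing a boolean result counts its True entries.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few gaps.
df = pd.DataFrame(
    {"one": [1.0, np.nan, 3.0, np.nan], "two": [np.nan, 2.0, 3.0, 4.0]},
    index=["a", "b", "c", "d"],
)

# isnull() returns a boolean frame; sum() counts the True values per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Boolean indexing with notnull() keeps only rows where 'one' is present.
present = df[df["one"].notnull()]
print(present)
```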

Calculations with missing data

When summing data, NA is treated as zero.
If all the values are NA, the sum is 0 by default (pandas 0.22 and later); pass min_count=1 to get NaN instead.

Instance 1

 import pandas as pd
 import numpy as np
 df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
 'h'], columns=['one', 'two', 'three'])
 df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
 print(df['one'].sum())

The running result is as follows:

 2.02357685917

Instance 2

 import pandas as pd
 import numpy as np
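 # Every value in the column is NaN. min_count=1 makes sum() return NaN
 # instead of the default 0 when there are no valid values.
 df = pd.DataFrame(index=[0, 1, 2, 3, 4, 5], columns=['one', 'two'])
 print(df['one'].sum(min_count=1))

The running result is as follows:

 nan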
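The two rules for sums can be checked side by side. This is a minimal sketch on made-up Series; `skipna` and `min_count` are existing parameters of `Series.sum`, and the default of 0 for an all-NA sum applies to pandas 0.22 and later.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 2.0])
all_missing = pd.Series([np.nan, np.nan])

# By default NaN is skipped, which is equivalent to treating it as zero in a sum.
total = s.sum()

# skipna=False propagates NaN instead of skipping it.
strict_total = s.sum(skipna=False)

# With no valid values the default sum is 0.0; min_count=1 demands at least
# one valid value and returns NaN otherwise.
default_empty = all_missing.sum()
guarded_empty = all_missing.sum(min_count=1)

print(total, strict_total, default_empty, guarded_empty)
```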

Cleaning and filling missing data

Pandas provides various methods for cleaning missing values. The fillna function can fill NA values with non-null data in the following ways:

Replace NaN with a scalar value

The following program shows how to replace 'NaN' with '0'.

 import pandas as pd
 import numpy as np
 df = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'], columns=['one',
 'two', 'three'])
 df = df.reindex(['a', 'b', 'c'])
 print(df)
 print("NaN replaced with '0':")
 print(df.fillna(0))

The running result is as follows:

       one            two            three
a  -0.576991  -0.741695  0.553172
b                NaN                NaN                NaN
c      0.744328  -1.735166  1.749580
NaN replaced with '0':
         one            two            three
a  -0.576991  -0.741695  0.553172
b      0.000000      0.000000      0.000000
c      0.744328  -1.735166  1.749580

Here we fill with zero, but any other value can be used instead.
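The fill value does not have to be the same for every column. As a sketch on made-up data, `fillna` also accepts a dict keyed by column name, or a Series such as the column means; the specific fill values below are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one gap in each column.
df = pd.DataFrame(
    {"one": [1.0, np.nan, 3.0], "two": [np.nan, 10.0, 20.0]},
    index=["a", "b", "c"],
)

# A dict maps each column name to its own fill value.
by_column = df.fillna({"one": 0.0, "two": -1.0})
print(by_column)

# A Series indexed by column name works too; here each gap gets its column mean.
by_mean = df.fillna(df.mean())
print(by_mean)
```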

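Filling NA forward and backward

Using the filling concepts discussed in the 'Reindexing' chapter, we will fill in the missing values.

Method            Operation
pad / ffill       Fill forward
bfill / backfill  Fill backward

These names are values for the method parameter of fillna. Since pandas 2.1 that parameter is deprecated in favor of the ffill() and bfill() methods, which perform the same fills.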

Instance 1

 import pandas as pd
 import numpy as np
 df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
 'h'], columns=['one', 'two', 'three'])
 df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
 # ffill() fills forward; fillna(method='pad') does the same but is
 # deprecated since pandas 2.1.
 print(df.ffill())

The running result is as follows:

         one        two      three
a   0.077988   0.476149   0.965836
b   0.077988   0.476149   0.965836
c  -0.390208  -0.551605  -2.301950
d  -0.390208  -0.551605  -2.301950
e  -2.000303  -0.788201   1.510072
f  -0.930230  -0.670473   1.146615
g  -0.930230  -0.670473   1.146615
h   0.085100   0.532791   0.887415

Instance 2

 import pandas as pd
 import numpy as np
 df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
 'h'], columns=['one', 'two', 'three'])
 df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
 # bfill() fills backward; fillna(method='backfill') does the same but is
 # deprecated since pandas 2.1.
 print(df.bfill())

The running result is as follows:

         one        two      three
a   0.077988   0.476149   0.965836
b  -0.390208  -0.551605  -2.301950
c  -0.390208  -0.551605  -2.301950
d  -2.000303  -0.788201   1.510072
e  -2.000303  -0.788201   1.510072
f  -0.930230  -0.670473   1.146615
g   0.085100   0.532791   0.887415
h   0.085100   0.532791   0.887415
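When a gap spans several rows, it can help to cap how far a value is carried. The sketch below uses a made-up Series and the `limit` parameter of the `ffill()` method, which bounds the number of consecutive NaN values filled from one observation.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with a two-row gap.
s = pd.Series([1.0, np.nan, np.nan, 4.0], index=["a", "b", "c", "d"])

# Forward fill carries the last valid value (1.0) into the gap.
forward = s.ffill()

# Backward fill pulls the next valid value (4.0) back into the gap.
backward = s.bfill()

# limit=1 fills only the first NaN after each valid value; 'c' stays NaN.
limited = s.ffill(limit=1)

print(forward, backward, limited, sep="\n\n")
```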

Delete missing values

If you simply want to exclude missing values, use the dropna function together with the axis parameter. By default, axis=0, which works along the rows: if any value in a row is NA, the entire row is dropped.

Instance 1

 import pandas as pd
 import numpy as np
 df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
 'h'], columns=['one', 'two', 'three'])
 df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
 print(df.dropna())

The running result is as follows:

  
         one        two      three
a   0.077988   0.476149   0.965836
c  -0.390208  -0.551605  -2.301950
e  -2.000303  -0.788201   1.510072
f  -0.930230  -0.670473   1.146615
h   0.085100   0.532791   0.887415

Instance 2

 import pandas as pd
 import numpy as np
 df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
 'h'], columns=['one', 'two', 'three'])
 df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
 print(df.dropna(axis=1))

The running result is as follows:

 Empty DataFrame
 Columns: []
 Index: [a, b, c, d, e, f, g, h]
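Dropping every row or column that contains any NA can discard a lot of data, as the empty result above suggests. As a sketch on made-up data, the existing `how`, `thresh` and `subset` parameters of `dropna` allow a more selective rule.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 'a' is complete, 'b' is entirely NA, 'c' has one gap.
df = pd.DataFrame(
    {
        "one": [1.0, np.nan, np.nan],
        "two": [2.0, np.nan, 5.0],
        "three": [3.0, np.nan, 6.0],
    },
    index=["a", "b", "c"],
)

# Default (how='any'): drop a row if any value is NA. Only 'a' survives.
any_missing = df.dropna()

# how='all': drop a row only if every value is NA. Only 'b' is removed.
all_missing = df.dropna(how="all")

# thresh=2: keep rows that have at least two non-NA values.
at_least_two = df.dropna(thresh=2)

# subset: look for NA only in the listed columns.
by_subset = df.dropna(subset=["two"])

print(any_missing, all_missing, at_least_two, by_subset, sep="\n\n")
```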

Replacing missing (or) generic values

Often we need to replace a generic value with a specific value. We can do this with the replace method.
Replacing NA with a scalar value is equivalent to using the fillna() function.

Instance 1

 import pandas as pd
 import numpy as np
 df = pd.DataFrame({'one': [10,20,30,40,50,2000], 'two': [1000,0,30,40,50,60]})
 # Map each old value to its replacement: 1000 -> 10 and 2000 -> 60
 print(df.replace({1000:10,2000:60}))

The running result is as follows:

   one two
 0 10 10
 1 20 0
 2 30 30
 3 40 40
 4 50 50
 5 60 60

Instance 2

 import pandas as pd
 import numpy as np
 df = pd.DataFrame({'one': [10,20,30,40,50,2000], 'two': [1000,0,30,40,50,60]})
 # Map each old value to its replacement: 1000 -> 10 and 2000 -> 60
 print(df.replace({1000:10,2000:60}))

The running result is as follows:

   one two
 0 10 10
 1 20 0
 2 30 30
 3 40 40
 4 50 50
 5 60 60
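The equivalence between replace and fillna for NA values can be checked directly. The sketch below uses made-up data; it also shows the nested-dict form of `replace`, which restricts a replacement to a single column.

```python
import numpy as np
import pandas as pd

# Hypothetical data containing both NaN and sentinel values.
df = pd.DataFrame({"one": [10.0, np.nan, 2000.0], "two": [1000.0, 0.0, np.nan]})

# Replacing NaN with a scalar gives the same frame as fillna().
via_replace = df.replace(np.nan, 0)
via_fillna = df.fillna(0)
same_result = via_replace.equals(via_fillna)
print(same_result)

# A nested dict limits the replacement to one column: 1000 -> 10 only in 'two'.
scoped = df.replace({"two": {1000.0: 10.0}})
print(scoped)
```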