
Missing Data in Pandas

Examples of handling missing data in pandas

In practice, missing data is a constant problem. Machine learning and data mining in particular suffer in prediction accuracy because missing values lower data quality, so handling missing values is a major focus in making models more accurate and effective.

When and why is data missing?

Consider an online survey for a product. People often do not share all the information asked of them. Some share their experience but not how long they have used the product; others share how long they have used the product and their experience but not their contact information. In one way or another, some data is always missing, which is very common in real-world situations.
Now let's see how pandas handles missing values (such as NA or NaN).

Example

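    # import the pandas library
    import pandas as pd
    import numpy as np
    # five rows of random numbers, labeled a, c, e, f, h
    df = pd.DataFrame(np.random.randn(5, 3),
                      index=['a', 'c', 'e', 'f', 'h'],
                      columns=['one', 'two', 'three'])
    # reindexing adds the labels b, d and g, whose rows are filled with NaN
    df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
    print(df)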

The running result is as follows:

            one       two     three
    a  0.077988  0.476149  0.965836
    b       NaN       NaN       NaN
    c -0.390208 -0.551605 -2.301950
    d       NaN       NaN       NaN
    e -2.000303 -0.788201  1.510072
    f -0.930230 -0.670473  1.146615
    g       NaN       NaN       NaN
    h  0.085100  0.532791  0.887415

The numbers are random, so they differ from run to run; the NaN rows are always b, d and g.

Using reindexing, we created a DataFrame with missing values. In the output, NaN stands for "Not a Number".
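To make the effect of reindexing repeatable, the following minimal sketch uses fixed values instead of random numbers. The names small, expanded and missing_labels are chosen here for illustration only and do not come from the tutorial.

```python
import numpy as np
import pandas as pd

# Fixed values (arange instead of random numbers) so the result is repeatable.
small = pd.DataFrame(np.arange(6.0).reshape(3, 2),
                     index=['a', 'c', 'e'],
                     columns=['one', 'two'])

# Labels 'b' and 'd' are not in the original index,
# so reindex() creates rows for them filled with NaN.
expanded = small.reindex(['a', 'b', 'c', 'd', 'e'])

# Collect the labels whose 'one' value is missing.
missing_labels = expanded.index[expanded['one'].isna()].tolist()

print(expanded)
print(missing_labels)
```

Only the labels that did not exist before reindexing end up in missing_labels; the original rows keep their values.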

Check for missing values

To make detecting missing values easier (across different array dtypes), pandas provides the isnull() and notnull() functions, which are also available as methods on Series and DataFrame objects.

Instance 1

Example

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(5, 3),
                      index=['a', 'c', 'e', 'f', 'h'],
                      columns=['one', 'two', 'three'])
    df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
    # True where the value in column 'one' is missing
    print(df['one'].isnull())

The running result is as follows:

    a    False
    b     True
    c    False
    d     True
    e    False
    f    False
    g     True
    h    False
    Name: one, dtype: bool

Instance 2

Example

    import pandas as pd
    import numpy as np
    df = pd.DataFrame(np.random.randn(5, 3),
                      index=['a', 'c', 'e', 'f', 'h'],
                      columns=['one', 'two', 'three'])
    df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
    # True where the value in column 'one' is present
    print(df['one'].notnull())

The running result is as follows:

    a     True
    b    False
    c     True
    d    False
    e     True
    f     True
    g    False
    h     True
    Name: one, dtype: bool
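The instances above check a single column. As a small sketch of the same check applied to a whole DataFrame, the example below counts missing values per column and selects the rows that contain at least one. The frame df_check and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few missing entries.
df_check = pd.DataFrame({'one': [1.0, np.nan, 3.0],
                         'two': [np.nan, np.nan, 6.0]},
                        index=['a', 'b', 'c'])

mask = df_check.isnull()            # DataFrame of booleans, True where a value is missing
missing_per_column = mask.sum()     # True counts as 1, so this counts NaN per column
rows_with_missing = df_check[mask.any(axis=1)]  # rows with at least one NaN

print(missing_per_column)
print(rows_with_missing)
```

Summing the boolean mask is a common way to get a quick overview of where data is missing before deciding whether to fill or drop it.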

Calculations with missing data

When summing data, NA is treated as zero.
If all the data are NA, the result is NA in pandas versions before 0.22; newer versions return 0 unless min_count=1 is passed to sum().

Instance 1

Example

    import pandas as pd
    import numpy as np
    df = pd.DataFrame(np.random.randn(5, 3),
                      index=['a', 'c', 'e', 'f', 'h'],
                      columns=['one', 'two', 'three'])
    df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
    # NaN entries are skipped when summing
    print(df['one'].sum())

The running result is as follows:

    2.02357685917

Instance 2

Example

    import pandas as pd
    import numpy as np
    # no data is supplied, so every cell is NaN
    df = pd.DataFrame(index=[0, 1, 2, 3, 4, 5], columns=['one', 'two'])
    print(df['one'].sum())

The running result is as follows:

    nan

This is the output on older pandas; version 0.22 and later print 0.
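The following sketch illustrates how sum() treats missing values, assuming pandas 0.22 or later, where the sum of all-NA data is 0 by default. The Series values and all_na are invented for this example; skipna and min_count are existing parameters of Series.sum().

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, np.nan, 2.5])
all_na = pd.Series([np.nan, np.nan])

total = values.sum()                      # NaN is skipped, so this is 1.0 + 2.5
strict_total = values.sum(skipna=False)   # NaN propagates when skipping is turned off
empty_total = all_na.sum()                # 0.0 on pandas >= 0.22
empty_total_nan = all_na.sum(min_count=1) # NaN, because fewer than 1 valid value exists

print(total, strict_total, empty_total, empty_total_nan)
```

Passing min_count=1 is one way to recover the older "all NA gives NA" behavior when a silent 0 would be misleading.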

Fill in missing data

Pandas provides various methods for cleaning missing values. The fillna function can fill NA values with non-null data in the following ways:

Replace NaN with a scalar value

The following program shows how to replace 'NaN' with '0'.

Example

    import pandas as pd
    import numpy as np
    df = pd.DataFrame(np.random.randn(3, 3),
                      index=['a', 'c', 'e'],
                      columns=['one', 'two', 'three'])
    # label 'b' is new, so its row is all NaN
    df = df.reindex(['a', 'b', 'c'])
    print(df)
    print("NaN replaced with '0':")
    print(df.fillna(0))

The running result is as follows:

            one       two     three
    a -0.576991 -0.741695  0.553172
    b       NaN       NaN       NaN
    c  0.744328 -1.735166  1.749580
    NaN replaced with '0':
            one       two     three
    a -0.576991 -0.741695  0.553172
    b  0.000000  0.000000  0.000000
    c  0.744328 -1.735166  1.749580

Here we filled with zeros; we could equally fill with any other value.
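As an illustration of filling with values other than zero, the sketch below fills each column with its own constant and, alternatively, with the column mean. The frame df_fill and the chosen fill values are assumptions made for this example; passing a dict or a Series to fillna() is standard pandas behavior.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing value in each column.
df_fill = pd.DataFrame({'one': [1.0, np.nan, 3.0],
                        'two': [np.nan, 4.0, 8.0]})

# A dict maps column names to the fill value used for that column.
by_column = df_fill.fillna({'one': 0.0, 'two': -1.0})

# df_fill.mean() is a Series indexed by column name,
# so each column is filled with its own mean (NaN is ignored by mean()).
by_mean = df_fill.fillna(df_fill.mean())

print(by_column)
print(by_mean)
```

Filling with the mean keeps the column average unchanged, which is sometimes preferable to a constant such as zero, although the right choice depends on the data.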

Forward and backward fill NA

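Using the concept of filling discussed in the 'Reindexing' chapter, we will fill in the missing values.

Method	Operation
pad/ffill	Forward fill
bfill/backfill	Backward fill

In recent pandas versions, the method argument of fillna() is deprecated in favor of the equivalent ffill() and bfill() methods.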

Instance 1

Example

    import pandas as pd
    import numpy as np
    df = pd.DataFrame(np.random.randn(5, 3),
                      index=['a', 'c', 'e', 'f', 'h'],
                      columns=['one', 'two', 'three'])
    df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
    # forward fill: each NaN takes the last valid value above it
    # (older pandas: df.fillna(method='pad'))
    print(df.ffill())

The running result is as follows:

            one       two     three
    a  0.077988  0.476149  0.965836
    b  0.077988  0.476149  0.965836
    c -0.390208 -0.551605 -2.301950
    d -0.390208 -0.551605 -2.301950
    e -2.000303 -0.788201  1.510072
    f -0.930230 -0.670473  1.146615
    g -0.930230 -0.670473  1.146615
    h  0.085100  0.532791  0.887415

Instance 2

Example

    import pandas as pd
    import numpy as np
    df = pd.DataFrame(np.random.randn(5, 3),
                      index=['a', 'c', 'e', 'f', 'h'],
                      columns=['one', 'two', 'three'])
    df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
    # backward fill: each NaN takes the next valid value below it
    # (older pandas: df.fillna(method='backfill'))
    print(df.bfill())

The running result is as follows:

            one       two     three
    a  0.077988  0.476149  0.965836
    b -0.390208 -0.551605 -2.301950
    c -0.390208 -0.551605 -2.301950
    d -2.000303 -0.788201  1.510072
    e -2.000303 -0.788201  1.510072
    f -0.930230 -0.670473  1.146615
    g  0.085100  0.532791  0.887415
    h  0.085100  0.532791  0.887415
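Because the instances above use random numbers, the direction of each fill is easier to see on fixed data. The sketch below uses a small hand-made Series (the name s and its values are illustrative only) and also shows the limit parameter, which caps how many consecutive NaN values are filled.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan], index=list('abcde'))

forward = s.ffill()         # b and c take 1.0 from a; e takes 4.0 from d
backward = s.bfill()        # b and c take 4.0 from d; e stays NaN (nothing after it)
limited = s.ffill(limit=1)  # only the first NaN in each gap is filled, so c stays NaN

print(forward)
print(backward)
print(limited)
```

Note that a backward fill cannot fill a trailing NaN and a forward fill cannot fill a leading one, since there is no neighboring value to copy from.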

Delete missing values

If you simply want to exclude missing values, use the dropna function together with the axis parameter. By default, axis=0, which operates along rows: if any value in a row is NA, the entire row is dropped.

Instance 1

Example

    import pandas as pd
    import numpy as np
    df = pd.DataFrame(np.random.randn(5, 3),
                      index=['a', 'c', 'e', 'f', 'h'],
                      columns=['one', 'two', 'three'])
    df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
    # drop every row that contains at least one NaN
    print(df.dropna())

The running result is as follows:

　　
            one       two     three
    a  0.077988  0.476149  0.965836
    c -0.390208 -0.551605 -2.301950
    e -2.000303 -0.788201  1.510072
    f -0.930230 -0.670473  1.146615
    h  0.085100  0.532791  0.887415

Instance 2

Example

    import pandas as pd
    import numpy as np
    df = pd.DataFrame(np.random.randn(5, 3),
                      index=['a', 'c', 'e', 'f', 'h'],
                      columns=['one', 'two', 'three'])
    df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
    # drop every column that contains at least one NaN (here: all of them)
    print(df.dropna(axis=1))

The running result is as follows:

    Empty DataFrame
    Columns: []
    Index: [a, b, c, d, e, f, g, h]
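Dropping every row or column with any NA can discard a lot of data, as Instance 2 shows. The sketch below, on made-up data, illustrates three other existing dropna() parameters (how, thresh and subset) that give finer control. The frame df_drop and the result names are chosen here for illustration.

```python
import numpy as np
import pandas as pd

df_drop = pd.DataFrame({'one':   [1.0, np.nan, np.nan, 4.0],
                        'two':   [5.0, np.nan, np.nan, 8.0],
                        'three': [np.nan, np.nan, 11.0, 12.0]})

any_na = df_drop.dropna()                        # default how='any': keep only complete rows
all_na = df_drop.dropna(how='all')               # drop a row only if every value is NaN
at_least_two = df_drop.dropna(thresh=2)          # keep rows with at least 2 non-NA values
three_present = df_drop.dropna(subset=['three']) # look only at column 'three'

print(any_na.index.tolist())
print(all_na.index.tolist())
print(at_least_two.index.tolist())
print(three_present.index.tolist())
```

Which variant fits depends on how much incomplete data the later analysis can tolerate.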

Replacing missing (or) generic values

Often we need to replace a generic value with a specific one. We can do this with the replace method.
Replacing NA with a scalar value is equivalent to the behavior of the fillna() function.
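The equivalence between replace() and fillna() for NA values can be checked with a short sketch. The frame df_rep and its values are invented for this example.

```python
import numpy as np
import pandas as pd

df_rep = pd.DataFrame({'one': [10.0, np.nan, 30.0],
                       'two': [np.nan, 0.0, 60.0]})

via_replace = df_rep.replace(np.nan, 0)  # treat NaN as the value to be replaced
via_fillna = df_rep.fillna(0)            # fill NaN directly

# equals() compares shape, dtypes and values of the two frames.
same = via_replace.equals(via_fillna)
print(same)
```

For plain NaN filling, fillna() states the intent more directly; replace() is the more general tool because it also handles ordinary values.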

Instance 1

Example

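    import pandas as pd
    import numpy as np
    df = pd.DataFrame({'one': [10,20,30,40,50,2000], 'two': [1000,0,30,40,50,60]})
    # the dict maps each old value to its replacement: 1000 -> 10, 2000 -> 60
    print(df.replace({1000:10,2000:60}))

The running result is as follows:

       one  two
    0   10   10
    1   20    0
    2   30   30
    3   40   40
    4   50   50
    5   60   60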

Instance 2

Example

    import pandas as pd
    import numpy as np
    df = pd.DataFrame({'one': [10,20,30,40,50,2000], 'two': [1000,0,30,40,50,60]})
    print(df.replace({1000:10,2000:60}))

The running result is as follows:

       one  two
    0   10   10
    1   20    0
    2   30   30
    3   40   40
    4   50   50
    5   60   60
