This tutorial is for readers who want to learn the basics of pandas and its main features. It is especially useful for people who work on data cleaning and analysis. After completing it, you should have an intermediate working knowledge of pandas, from which you can move on to more advanced topics.
Before learning pandas, you should be familiar with basic computer programming terminology; working knowledge of any programming language is a plus. Because pandas builds on NumPy and uses many of its features, reading a NumPy tutorial first is recommended.
Pandas is suitable for processing the following types of data:
- Tabular data with heterogeneous columns, similar to SQL or Excel tables. (By contrast, the elements of a NumPy array must all have the same data type, and therefore the same size in memory.)
- Ordered and unordered time series data, not necessarily at a fixed frequency.
- Matrix data with row and column labels, whether homogeneous or heterogeneous.
- Any other form of observational or statistical dataset. The data does not need to be labeled before it is placed into a pandas data structure.
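As a rough illustration of the data types listed above, the following sketch contrasts a single-dtype NumPy array with a DataFrame whose columns have different types, and builds a small irregular time series. The column names and values are made up for the example.

```python
import numpy as np
import pandas as pd

# A NumPy array holds one dtype: mixing ints and floats upcasts everything to float64.
arr = np.array([1, 2.5, 3])

# A DataFrame can hold a different dtype per column, like an SQL or Excel table.
# (Hypothetical sample data.)
table = pd.DataFrame(
    {
        "name": ["Ann", "Bob", "Cid"],
        "age": [31, 45, 27],
        "height_m": [1.62, 1.80, 1.75],
    }
)

# Time series data: values labeled by timestamps that need not be evenly spaced.
ts = pd.Series(
    [10.0, 12.5, 11.0],
    index=pd.to_datetime(["2024-01-01", "2024-01-03", "2024-01-10"]),
)

print(arr.dtype)     # one dtype for the whole array
print(table.dtypes)  # one dtype per column
print(ts)
```

Printing `table.dtypes` shows that each column keeps its own type, which is the main difference from the plain array.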
Pandas' main data structures are Series (one-dimensional) and DataFrame (two-dimensional), which are sufficient for most typical use cases in finance, statistics, social sciences, and engineering. For R users, DataFrame provides more features than R's data.frame. Pandas is built on NumPy and integrates well with other third-party scientific computing libraries. Pandas is something of a Swiss Army knife; the following lists only some of its strengths:
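A minimal sketch of the two structures, using made-up labels and values, might look like this:

```python
import pandas as pd

# Series: one-dimensional labeled data.
prices = pd.Series([10.5, 11.0, 9.8], index=["mon", "tue", "wed"], name="price")

# DataFrame: two-dimensional labeled data; each column is a Series.
frame = pd.DataFrame(
    {"price": [10.5, 11.0, 9.8], "volume": [100, 150, 120]},
    index=["mon", "tue", "wed"],
)

print(prices["tue"])          # label-based lookup on a Series
print(frame.shape)            # (rows, columns)
print(type(frame["price"]))   # selecting one column gives back a Series

# Pandas is built on NumPy, so converting to a plain array is direct.
values = frame.to_numpy()
print(values.shape)
```

Selecting a single column of a DataFrame returns a Series that shares the row labels, which is why the two structures work together so naturally.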
- Handling of missing data, represented as NaN, in both floating-point and non-floating-point data.
- Size mutability: columns can be inserted into or deleted from DataFrame and higher-dimensional objects.
- Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the labels can be ignored and Series and DataFrame will align the data automatically in computations.
- Powerful, flexible group by functionality: perform split-apply-combine operations on datasets to aggregate and transform data.
- Easy conversion of ragged, differently indexed data in Python and NumPy data structures into DataFrame objects.
- Label-based slicing, fancy indexing, and subsetting of large datasets.
- Intuitive merging and joining of datasets.
- Flexible reshaping and pivoting of datasets.
- Hierarchical axis labels: a single axis can carry multiple levels of labels.
- Mature IO tools: reading data from text files (CSV and other delimited files), Excel files, databases, and other sources, and saving and loading data in the fast HDF5 format.
- Time series functionality: date range generation, frequency conversion, moving window statistics, moving window linear regression, date shifting, and more.
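Several of the features above can be sketched in a few lines. The tables, column names, and values below are hypothetical and chosen only to keep the example small; it is an illustration, not a complete tour of the API.

```python
import pandas as pd

# Automatic alignment: arithmetic matches on index labels, not positions.
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0], index=["y", "z"])
total = a + b  # "x" has no partner in b, so the result there is NaN

# Missing data: detect it, then fill it.
missing = total.isna()
filled = total.fillna(0.0)

# Group by (split-apply-combine): split by key, aggregate each group, combine.
sales = pd.DataFrame(
    {
        "region": ["east", "west", "east", "west"],
        "quarter": ["q1", "q1", "q2", "q2"],
        "amount": [100, 80, 120, 90],
    }
)
per_region = sales.groupby("region")["amount"].sum()

# Merge/join: combine two tables on a shared key column.
managers = pd.DataFrame({"region": ["east", "west"], "manager": ["Ann", "Bob"]})
merged = sales.merge(managers, on="region", how="left")

# Reshape/pivot: one row per region, one column per quarter.
wide = sales.pivot(index="region", columns="quarter", values="amount")

# Size mutability: add a column, then delete it again.
sales["amount_k"] = sales["amount"] / 1000
del sales["amount_k"]

# Time series: date range generation, moving window statistics, date shifting.
days = pd.date_range("2024-01-01", periods=4, freq="D")
daily = pd.Series([1.0, 2.0, 3.0, 4.0], index=days)
rolling_mean = daily.rolling(window=2).mean()
shifted = daily.shift(1)

print(total)
print(per_region)
print(merged)
print(wide)
print(rolling_mean)
```

Note that `a + b` never raises for the unmatched label; it produces NaN there, which is how alignment and missing-data handling work together.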
These features are designed mainly to address pain points found in other programming languages and research environments. Data processing generally involves several stages: data cleaning and preparation, data analysis and modeling, and data visualization and tabulation. Pandas is well suited to all of them.
Pandas is fast: many of its underlying algorithms are optimized with Cython. Generality comes at some cost in performance, however, so a dedicated tool focused on a single function can be faster than pandas. Pandas is a dependency of statsmodels, which makes it an important part of Python's statistical computing ecosystem. It is also widely used in finance.
    $ pip install pandas
    $ python -i
    >>> import pandas as pd
    >>> df = pd.DataFrame()
    >>> print(df)
    Empty DataFrame
    Columns: []
    Index: []