# Pandas

pandas is a Python data analysis library built around labeled data structures for tabular and time series data. It covers data cleaning, transformation, and analysis, including index-based alignment, merging, reshaping, grouping, and statistical operations.

## Package Information

- **Package Name**: pandas
- **Package Type**: library
- **Language**: Python
- **Installation**: `pip install pandas`

## Core Imports

```python
import pandas as pd
```

Common imports for specific functionality:

```python
import pandas as pd
from pandas import DataFrame, Series, Index
```

## Basic Usage

```python
import pandas as pd

# Create a DataFrame from a dictionary of columns
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'city': ['New York', 'London', 'Tokyo', 'Paris'],
    'salary': [50000, 60000, 70000, 55000]
}
df = pd.DataFrame(data)

# Basic operations
print(df.head())      # Display the first 5 rows
df.info()             # Prints a summary itself and returns None
print(df.describe())  # Statistical summary of numeric columns

# Data selection and filtering with boolean masks
young_employees = df[df['age'] < 30]
high_earners = df[df['salary'] > 55000]

# Create a Series
ages = pd.Series([25, 30, 35, 28], name='ages')
print(ages.mean())  # Calculate mean age

# Read data from files (assumes data.csv and data.xlsx exist;
# read_excel also needs an Excel engine such as openpyxl)
df_csv = pd.read_csv('data.csv')
df_excel = pd.read_excel('data.xlsx')

# Basic data manipulation
df['bonus'] = df['salary'] * 0.1                  # Add new column
df_sorted = df.sort_values('salary')              # Sort by salary
df_grouped = df.groupby('city')['salary'].mean()  # Group and aggregate
```

## Architecture

Pandas is built around three fundamental data structures:

- **Series**: One-dimensional labeled array capable of holding any data type
- **DataFrame**: Two-dimensional labeled data structure with heterogeneous columns
- **Index**: Immutable sequence used for indexing and alignment

The library builds on NumPy and gets its performance from vectorized operations. It also serves as a foundation for the Python data science ecosystem, working with Jupyter notebooks, matplotlib, scikit-learn, and many domain-specific analysis libraries.
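
As a rough illustration of how the three structures relate, the sketch below builds a small DataFrame, pulls out a Series, and shows label-based alignment and vectorized arithmetic. The fruit data and variable names are made up for the example.

```python
import pandas as pd

# A DataFrame is a set of columns (Series) that share one Index.
df = pd.DataFrame(
    {"price": [1.5, 2.0, 0.5], "qty": [10, 4, 25]},
    index=pd.Index(["apple", "pear", "plum"], name="fruit"),
)

# Selecting a column yields a Series carrying the same Index.
price = df["price"]

# Arithmetic aligns on index labels, not on position.
# "pear" has no discount, so the result is NaN for that label.
discount = pd.Series({"plum": 0.1, "apple": 0.5})
discounted = price - discount

# Vectorized column arithmetic, with no explicit Python loop.
total = (df["price"] * df["qty"]).sum()

# An Index is immutable: item assignment raises TypeError.
index_is_immutable = False
try:
    df.index[0] = "kiwi"
except TypeError:
    index_is_immutable = True
```

Alignment by label is why two objects with differently ordered indexes can still be combined safely.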

## Capabilities

### Core Data Structures

The fundamental data structures that form the foundation of pandas: DataFrame, Series, and various Index types. These structures provide the building blocks for all data manipulation operations.

```python { .api }
class DataFrame:
    def __init__(self, data=None, index=None, columns=None, dtype=None, copy=None): ...

class Series:
    def __init__(self, data=None, index=None, dtype=None, name=None, copy=None, fastpath=False): ...

class Index:
    def __init__(self, data=None, dtype=None, copy=False, name=None, tupleize_cols=True): ...
```

[Core Data Structures](./core-data-structures.md)
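
A minimal sketch of the constructor parameters listed above, using invented sample values. It only exercises `data`, `index`, `columns`, `dtype`, and `name`.

```python
import pandas as pd

# Series: data plus optional index labels, dtype, and name.
s = pd.Series([1, 2, 3], index=["a", "b", "c"], dtype="float64", name="score")

# DataFrame: with dict data, `columns` selects and orders the keys.
df = pd.DataFrame(
    {"x": [1, 2], "y": [3, 4]},
    index=["r1", "r2"],
    columns=["y", "x"],
)

# Index: a labeled sequence that can carry its own name.
idx = pd.Index(["r1", "r2"], name="row")
df.index = idx
```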

### Data Input/Output

I/O functions for reading and writing data in many formats, including CSV, Excel, JSON, SQL databases, HDF5, Parquet, and several statistical file formats.

```python { .api }
def read_csv(filepath_or_buffer, **kwargs): ...
def read_excel(io, **kwargs): ...
def read_json(path_or_buf, **kwargs): ...
def read_sql(sql, con, **kwargs): ...
def read_parquet(path, **kwargs): ...
```

[Data Input/Output](./data-io.md)
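
The readers accept file-like objects as well as paths, so a round trip can be sketched entirely in memory. The CSV text below is invented for the example, and `io.StringIO` stands in for a real file.

```python
import io

import pandas as pd

csv_text = "name,age\nAlice,25\nBob,30\n"

# read_csv accepts a path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# Serialize to JSON (one object per row) and read it back.
json_text = df.to_json(orient="records")
df_back = pd.read_json(io.StringIO(json_text), orient="records")
```

Wrapping the JSON text in `StringIO` avoids passing a literal JSON string to `read_json`, which recent pandas releases warn about.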

### Data Manipulation and Reshaping

Functions for combining, reshaping, and transforming data, including merging, concatenation, pivoting, melting, and advanced data restructuring operations.

```python { .api }
def concat(objs, axis=0, join='outer', **kwargs): ...
def merge(left, right, how='inner', on=None, **kwargs): ...
def pivot_table(data, values=None, index=None, columns=None, **kwargs): ...
def melt(data, id_vars=None, value_vars=None, **kwargs): ...
```

[Data Manipulation](./data-manipulation.md)
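
One way the four functions above might fit together, shown on made-up sales data: stack two quarters, join a lookup table, pivot to wide form, then melt back to long form.

```python
import pandas as pd

sales_q1 = pd.DataFrame({"store": ["A", "B"], "quarter": ["Q1", "Q1"], "revenue": [100, 150]})
sales_q2 = pd.DataFrame({"store": ["A", "B"], "quarter": ["Q2", "Q2"], "revenue": [120, 130]})
regions = pd.DataFrame({"store": ["A", "B"], "region": ["North", "South"]})

# concat: stack rows from both quarters and renumber the index.
sales = pd.concat([sales_q1, sales_q2], ignore_index=True)

# merge: SQL-style inner join on the shared "store" column.
merged = pd.merge(sales, regions, how="inner", on="store")

# pivot_table: long to wide, one row per store and one column per quarter.
wide = pd.pivot_table(merged, values="revenue", index="store", columns="quarter", aggfunc="sum")

# melt: wide back to long, one row per (store, quarter) pair.
long = pd.melt(
    wide.reset_index(),
    id_vars=["store"],
    value_vars=["Q1", "Q2"],
    var_name="quarter",
    value_name="revenue",
)
```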

### Time Series and Date Handling

Time series functionality including date/time parsing, time zone handling, frequency conversion, resampling, and specialized time-based operations.

```python { .api }
def date_range(start=None, end=None, periods=None, freq=None, **kwargs): ...
def to_datetime(arg, **kwargs): ...
class Timestamp:
    # The `freq` argument was removed from the constructor in pandas 2.0
    def __init__(self, ts_input=None, tz=None, **kwargs): ...
```

[Time Series](./time-series.md)
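
A short sketch of the time series pieces named above: generating a date range, resampling, parsing strings, and converting time zones. The dates and values are arbitrary.

```python
import pandas as pd

# Six daily timestamps starting 2024-01-01, used as a Series index.
days = pd.date_range(start="2024-01-01", periods=6, freq="D")
ts = pd.Series(range(6), index=days)

# Resample: downsample to 2-day bins and sum each bin.
two_day = ts.resample("2D").sum()

# Parse date strings into a DatetimeIndex.
parsed = pd.to_datetime(["2024-01-01", "2024-03-15"])

# Time zone handling: build a tz-aware Timestamp and convert it.
stamp = pd.Timestamp("2024-01-01 12:00", tz="UTC")
in_tokyo = stamp.tz_convert("Asia/Tokyo")
```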

### Data Types and Missing Data

Extension data types, missing data handling, and type conversion utilities, including nullable integer/boolean types, categorical data, and advanced missing value operations.

```python { .api }
def isna(obj): ...
def notna(obj): ...
class Categorical:
    def __init__(self, values, categories=None, ordered=None, dtype=None, fastpath=False): ...
```

[Data Types](./data-types.md)
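
The following sketch shows missing-value detection on a nullable integer Series and an ordered `Categorical`. The values and category names are illustrative only.

```python
import pandas as pd

# The nullable "Int64" dtype keeps integers alongside missing values.
s = pd.Series([1, None, 3], dtype="Int64")
mask = pd.isna(s)          # True where a value is missing
present = pd.notna(s).sum()
filled = s.fillna(0)       # replace missing values, dtype stays Int64

# An ordered categorical stores integer codes plus a category list.
sizes = pd.Categorical(
    ["small", "large", "medium", "small"],
    categories=["small", "medium", "large"],
    ordered=True,
)
codes = sizes.codes        # position of each value in `categories`
largest = sizes.max()      # ordering makes min/max meaningful
```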

### Statistical and Mathematical Operations

Built-in statistical functions, mathematical operations, and data analysis utilities, including descriptive statistics, correlation analysis, and numerical computations.

```python { .api }
def cut(x, bins, **kwargs): ...
def qcut(x, q, **kwargs): ...
def factorize(values, **kwargs): ...
def value_counts(values, **kwargs): ...  # top-level form is deprecated in recent releases; prefer Series.value_counts()
```

[Statistics and Math](./statistics-math.md)
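
A small sketch of binning and encoding with the functions above, on invented ages. It uses the `Series.value_counts()` method rather than the top-level function.

```python
import pandas as pd

ages = pd.Series([5, 17, 30, 45, 70])

# cut: fixed bin edges, right-inclusive by default, i.e. (0, 18], (18, 65], (65, 120].
groups = pd.cut(ages, bins=[0, 18, 65, 120], labels=["child", "adult", "senior"])

# qcut: bins chosen from quantiles; q=2 splits at the median.
halves = pd.qcut(ages, q=2, labels=["low", "high"])

# factorize: encode values as integer codes in order of first appearance.
codes, uniques = pd.factorize(["b", "a", "b", "c"])

# Count how many rows fall into each bin.
counts = groups.value_counts()
```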

### Configuration and Options

Pandas configuration system for controlling display options, computational behavior, and library-wide settings.

```python { .api }
def get_option(pat): ...
def set_option(pat, value): ...
def reset_option(pat): ...
def option_context(*args): ...
```

[Configuration](./configuration.md)
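
The option functions can be sketched as follows, using the `display.max_rows` and `display.precision` options. The specific values chosen are arbitrary.

```python
import pandas as pd

default_rows = pd.get_option("display.max_rows")
precision_before = pd.get_option("display.precision")

# Change an option globally, read it back, then restore the default.
pd.set_option("display.max_rows", 10)
changed = pd.get_option("display.max_rows")
pd.reset_option("display.max_rows")

# option_context applies a setting only inside the with-block.
with pd.option_context("display.precision", 2):
    inside = pd.get_option("display.precision")
    text = repr(pd.Series([3.14159]))
outside = pd.get_option("display.precision")
```

`option_context` is usually the safer choice in library code, since it cannot leak a changed setting to callers.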

### Plotting and Visualization

Plotting capabilities built on matplotlib, including basic plot types, statistical visualizations, and multivariate analysis plots. The functions below are exposed under `pandas.plotting`.

```python { .api }
def scatter_matrix(frame, **kwargs): ...
def parallel_coordinates(frame, class_column, **kwargs): ...
def andrews_curves(frame, class_column, **kwargs): ...
def radviz(frame, class_column, **kwargs): ...
```

[Plotting](./plotting.md)
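
A minimal sketch of `scatter_matrix`, assuming matplotlib is installed. The random data, the column names, and the choice of the non-interactive `Agg` backend are assumptions made for the example.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is required

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Fifty rows of synthetic data in three numeric columns.
rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])

# One subplot per pair of columns; histograms on the diagonal.
axes = scatter_matrix(frame, figsize=(6, 6), diagonal="hist")
grid_shape = axes.shape

plt.close("all")  # release the figure once it is no longer needed
```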

### API and Type Checking

Type checking utilities and dtype validation functions for working with pandas data structures, available from `pandas.api.types`.

```python { .api }
def is_numeric_dtype(arr_or_dtype): ...
def is_datetime64_dtype(arr_or_dtype): ...
def is_categorical_dtype(arr_or_dtype): ...  # deprecated in recent releases; prefer isinstance(dtype, pd.CategoricalDtype)
def infer_dtype(value, **kwargs): ...
```

[API Types](./api-types.md)
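
As a sketch of how these checks might be used, the example below sorts the columns of a made-up DataFrame by kind and infers the type of a few plain lists.

```python
import pandas as pd
from pandas.api.types import infer_dtype, is_datetime64_dtype, is_numeric_dtype

df = pd.DataFrame({
    "n": [1, 2, 3],
    "when": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
    "label": ["x", "y", "z"],
})

# Pick columns by dtype family instead of hard-coding names.
numeric_cols = [c for c in df.columns if is_numeric_dtype(df[c])]
datetime_cols = [c for c in df.columns if is_datetime64_dtype(df[c])]

# infer_dtype inspects the values and returns a descriptive string.
kinds = (
    infer_dtype([1, 2, 3]),
    infer_dtype(["a", "b"]),
    infer_dtype([1, "a"]),
)
```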

### Error Handling

Exception and warning classes for error handling in pandas applications, including parsing errors, performance warnings, and data validation errors. They are defined in `pandas.errors`.

```python { .api }
class ParserError(ValueError): ...
class PerformanceWarning(Warning): ...
class SettingWithCopyWarning(Warning): ...
class DtypeWarning(Warning): ...
```

[Errors](./errors.md)
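
A small sketch of catching a parsing failure. The malformed CSV text is invented: its third line has more fields than the header, which the default parser rejects.

```python
import io

import pandas as pd
from pandas.errors import ParserError

# Header declares two columns, but the last row has three fields.
bad_csv = "a,b\n1,2\n3,4,5\n"

parse_failed = False
message = ""
try:
    pd.read_csv(io.StringIO(bad_csv))
except ParserError as exc:
    parse_failed = True
    message = str(exc)

# ParserError subclasses ValueError, so a broader handler also catches it.
caught_as_value_error = issubclass(ParserError, ValueError)
```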

## Types

```python { .api }
from typing import Sequence, Union

import numpy as np
from pandas import Index, Series

# Core scalar types
class Timestamp:
    """Pandas timestamp object."""
    pass

class Timedelta:
    """Pandas timedelta object."""
    pass

class Period:
    """Pandas period object."""
    pass

class Interval:
    """Pandas interval object."""
    pass

# Missing value sentinels
NA: object   # Pandas missing value sentinel
NaT: object  # Not-a-Time for datetime/timedelta

# Common type aliases
Scalar = Union[str, int, float, bool, Timestamp, Timedelta, Period, Interval]
ArrayLike = Union[list, tuple, np.ndarray, Series, Index]
Axes = Union[int, str, Sequence[Union[int, str]]]
```
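
To make the scalar types and sentinels above concrete, here is a brief sketch of how they behave in practice. The dates and bounds are arbitrary.

```python
import pandas as pd

# Timestamp plus Timedelta gives another Timestamp.
stamp = pd.Timestamp("2024-01-31")
delta = pd.Timedelta(days=1)
next_day = stamp + delta

# A Period is a span of time; an Interval is a bounded range of values.
period = pd.Period("2024-01", freq="M")
interval = pd.Interval(0, 10, closed="right")  # the range (0, 10]

# pd.NA propagates through arithmetic; NaT is the datetime counterpart.
na_sum = pd.NA + 1
nat_is_missing = pd.isna(pd.NaT)
```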