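DataFrame
A DataFrame is the tabular data structure in pandas. It holds an ordered set of columns, and each column can have a different value type (numeric, string, boolean, and so on). A DataFrame has both a row index and a column index, and it can be viewed as a dictionary of Series.
Series
A Series is an object similar to a one-dimensional array. It consists of a set of data (of any NumPy data type) and an associated set of data labels (the index). A simple Series can also be created from the data alone.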
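The two descriptions above can be sketched in a few lines. This is a minimal illustration, not the only way to build these objects; the variable names (`prices`, `labeled`, `frame`) and the `double` column are made up for the example.

```python
import pandas as pd

# A Series built from data alone gets a default integer index (0, 1, 2, ...).
prices = pd.Series([0, 1, 2])

# Passing index= attaches explicit labels to the same data.
labeled = pd.Series([0, 1, 2], index=["a", "b", "c"])

# A dict of Series becomes a DataFrame: each key is a column name.
frame = pd.DataFrame({"price": labeled, "double": labeled * 2})

print(prices.index.tolist())   # [0, 1, 2]
print(labeled["b"])            # 1
print(frame.columns.tolist())  # ['price', 'double']
```

Because both columns share the labels `a`, `b`, `c`, those labels become the row index of the resulting DataFrame.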
import pandas as pd
import numpy as np
Creating a DataFrame
- From a dictionary: each key-value pair is treated as one Series (one column)
data = {
    'name': ['A', 'B', 'C', 'D', 'D'],
    'year': [2018, None, 2020, 2021, 2022],
    'price': [0, 1, 2, 3, 4]
}
df = pd.DataFrame(data)
df
| | name | year | price |
| --- | --- | --- | --- |
| 0 | A | 2018.0 | 0 |
| 1 | B | NaN | 1 |
| 2 | C | 2020.0 | 2 |
| 3 | D | 2021.0 | 3 |
| 4 | D | 2022.0 | 4 |
# Specify the index values
df = pd.DataFrame(data, index=['one', 'two', 'three', 'four', 'five'])
df
| | name | year | price |
| --- | --- | --- | --- |
| one | A | 2018.0 | 0 |
| two | B | NaN | 1 |
| three | C | 2020.0 | 2 |
| four | D | 2021.0 | 3 |
| five | D | 2022.0 | 4 |
- From a NumPy ndarray
# Note: np.array stores all elements with one dtype, so the numbers below are kept as strings
df_1 = pd.DataFrame(
    np.array([['A', 2018, 0], ['B', 2019, 1], ['C', 2020, 2], ['D', 2021, 3], ['E', 2022, 4]]),
    columns=['name', 'year', 'price']
)
df_1
| | name | year | price |
| --- | --- | --- | --- |
| 0 | A | 2018 | 0 |
| 1 | B | 2019 | 1 |
| 2 | C | 2020 | 2 |
| 3 | D | 2021 | 3 |
| 4 | E | 2022 | 4 |
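When the rows mix strings and numbers, going through `np.array` first has a side effect worth seeing. The sketch below compares it with passing the nested list directly; `rows`, `from_array` and `from_list` are illustrative names, and the data is a shortened version of the example above.

```python
import numpy as np
import pandas as pd

rows = [["A", 2018, 0], ["B", 2019, 1], ["C", 2020, 2]]
columns = ["name", "year", "price"]

# np.array needs one dtype for all elements, so the integers become strings.
arr = np.array(rows)
from_array = pd.DataFrame(arr, columns=columns)

# Passing the nested list directly lets pandas infer a dtype per column.
from_list = pd.DataFrame(rows, columns=columns)

print(arr.dtype.kind)   # U (a NumPy unicode string dtype)
print(from_list.dtypes)
```

If the numeric columns are needed for arithmetic, the list form (or a dictionary of columns) avoids a later conversion step.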
Basic DataFrame attributes
- df.index: returns the index of df, i.e. the row labels
df.index
Index(['one', 'two', 'three', 'four', 'five'], dtype='object')
for i in range(len(df.index)):
    print(df.index[i])
one
two
three
four
five
- df.columns: returns the column names of df, i.e. the column labels
df.columns
Index(['name', 'year', 'price'], dtype='object')
for i in range(len(df.columns)):
    print(df.columns[i])
name
year
price
- df.dtypes: returns the data type of each column of df
df.dtypes
name object
year float64
price int64
dtype: object
- df.values: returns the values in df as a NumPy array
df.values
array([['A', 2018.0, 0],
       ['B', nan, 1],
       ['C', 2020.0, 2],
       ['D', 2021.0, 3],
       ['D', 2022.0, 4]], dtype=object)
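The attributes above can be combined in a short sketch. It uses a three-row version of the example table, and it also shows `shape` and `to_numpy()`, which are not covered in the text above but are standard pandas members; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["A", "B", "C"], "year": [2018, None, 2020], "price": [0, 1, 2]},
    index=["one", "two", "three"],
)

# shape gives (number of rows, number of columns).
print(df.shape)  # (3, 3)

# Index objects are iterable, so the labels can be collected directly.
row_labels = list(df.index)
column_labels = list(df.columns)

# to_numpy() returns the data as a NumPy array, like .values.
# Mixed column types give dtype object; a single numeric column keeps its own dtype.
whole = df.to_numpy()
price_only = df["price"].to_numpy()
```

Iterating over `df.index` or `df.columns` directly is usually shorter than indexing with `range(len(...))`, though both give the same labels.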
DataFrame methods
- df.astype: converts columns to the specified data type
df.astype({'price': 'int32'}).dtypes
name object
year float64
price int32
dtype: object
- df.convert_dtypes: automatically converts each column to the best possible data type (requires pandas 1.0.0 or later)
df.convert_dtypes().dtypes
name string
year Int64
price Int64
dtype: object
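The `year` column shows why the nullable `Int64` type in the output above matters. The following is a small sketch, assuming a column that contains one missing value; `cast_failed` and `year_nullable` are names chosen for the example.

```python
import pandas as pd

df = pd.DataFrame({"year": [2018, None, 2020], "price": [0, 1, 2]})

# The None forces "year" to float64, and a plain int64 cannot hold NaN.
try:
    df["year"].astype("int64")
    cast_failed = False
except ValueError:
    cast_failed = True

# The nullable "Int64" extension dtype (capital I) keeps the integers
# and marks the missing entry as <NA>.
year_nullable = df["year"].astype("Int64")

print(cast_failed)          # True
print(year_nullable.dtype)  # Int64
```

`convert_dtypes` picks such nullable types automatically, while `astype` leaves the choice to the caller.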
- df.isna/df.notna: detect missing values and non-missing values
df.isna()
| | name | year | price |
| --- | --- | --- | --- |
| one | False | False | False |
| two | False | True | False |
| three | False | False | False |
| four | False | False | False |
| five | False | False | False |
df.notna()
| | name | year | price |
| --- | --- | --- | --- |
| one | True | True | True |
| two | True | False | True |
| three | True | True | True |
| four | True | True | True |
| five | True | True | True |
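The boolean tables returned by `isna` and `notna` are mostly useful as input to other operations. A minimal sketch, using a three-row version of the example data; the fill value 0 is an arbitrary placeholder chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["A", "B", "C"], "year": [2018, None, 2020], "price": [0, 1, 2]}
)

# True counts as 1, so summing the mask counts missing values per column.
missing_per_column = df.isna().sum()

# Keep only the rows where "year" is present.
complete_rows = df[df["year"].notna()]

# Or fill the gap instead of dropping the row.
filled = df.fillna({"year": 0})

print(missing_per_column["year"])  # 1
print(len(complete_rows))          # 2
```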
- df.head: returns the first few rows of the table
df.head(3)
| | name | year | price |
| --- | --- | --- | --- |
| one | A | 2018.0 | 0 |
| two | B | NaN | 1 |
| three | C | 2020.0 | 2 |
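A few variations on `head` may be useful alongside the call above. This sketch uses a throwaway ten-row table; `tail` is the counterpart of `head` and is not discussed in the text above.

```python
import pandas as pd

df = pd.DataFrame({"price": range(10)})

first_default = df.head()         # no argument: the first 5 rows
last_two = df.tail(2)             # the last 2 rows
all_but_last_three = df.head(-3)  # negative n: everything except the last 3 rows

print(len(first_default))          # 5
print(last_two["price"].tolist())  # [8, 9]
```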
- df.at: gets a single value from the table by row and column label
df.at['two', 'name']
'B'
- df.iat: gets a single value from the table by row and column position
df.iat[1,1]
nan
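# Assign a value: set the year of row 'two' to None, which pandas stores as NaN
df.iat[1, 1] = None
df
| | name | year | price |
| --- | --- | --- | --- |
| one | A | 2018.0 | 0 |
| two | B | NaN | 1 |
| three | C | 2020.0 | 2 |
| four | D | 2021.0 | 3 |
| five | D | 2022.0 | 4 |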
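To make the difference between the two accessors concrete, the sketch below reads the same cell both ways and then assigns through each. It uses a small two-row table, and the new values (10 and "Z") are arbitrary.

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["A", "B"], "price": [0, 1]},
    index=["one", "two"],
)

# at uses labels, iat uses integer positions; both address exactly one cell.
same_cell = df.at["two", "price"] == df.iat[1, 1]

# Both accessors also support assignment.
df.at["two", "price"] = 10
df.iat[0, 0] = "Z"

print(same_cell)  # True
```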
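- df.loc: accesses a group of rows and columns by label or by boolean array. It supports many more access patterns than shown here; see the official documentation for the full list
df.loc['one']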
name A
year 2018
price 0
Name: one, dtype: object
df.loc[['one', 'four']]
| | name | year | price |
| --- | --- | --- | --- |
| one | A | 2018.0 | 0 |
| four | D | 2021.0 | 3 |
df.loc['one', 'name']
'A'
df.loc['one':'three', 'name']
one A
two B
three C
Name: name, dtype: object
# Select rows with a boolean list: only the rows marked True are returned
df.loc[[True, True, True, False, True]]
| | name | year | price |
| --- | --- | --- | --- |
| one | A | 2018.0 | 0 |
| two | B | NaN | 1 |
| three | C | 2020.0 | 2 |
| five | D | 2022.0 | 4 |
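In practice the boolean list is rarely typed by hand; it usually comes from a comparison on a column. A minimal sketch under that assumption, with a four-row version of the example data and illustrative names (`mask`, `expensive`, `first_two`):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["A", "B", "C", "D"],
        "year": [2018, None, 2020, 2021],
        "price": [0, 1, 2, 3],
    },
    index=["one", "two", "three", "four"],
)

# A comparison yields a boolean Series that can be used as the row selector.
mask = df["price"] >= 2
expensive = df.loc[mask, ["name", "price"]]

# Label slices in loc include both endpoints.
first_two = df.loc["one":"two", "name"]

print(expensive.index.tolist())  # ['three', 'four']
print(first_two.tolist())        # ['A', 'B']
```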
- df.iloc: selects data by integer position
df.iloc[2]
name C
year 2020
price 2
Name: three, dtype: object
df.iloc[2:4]
| | name | year | price |
| --- | --- | --- | --- |
| three | C | 2020.0 | 2 |
| four | D | 2021.0 | 3 |
df.iloc[[2,4]]
| | name | year | price |
| --- | --- | --- | --- |
| three | C | 2020.0 | 2 |
| five | D | 2022.0 | 4 |
df.iloc[[2,4],[2]]
| | price |
| --- | --- |
| three | 2 |
| five | 4 |
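Positional slicing follows normal Python rules, which differs from label slicing with `loc`. The sketch below rebuilds the example table so that it runs on its own; `block` and `last_row` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["A", "B", "C", "D", "D"],
        "year": [2018, None, 2020, 2021, 2022],
        "price": [0, 1, 2, 3, 4],
    },
    index=["one", "two", "three", "four", "five"],
)

# Rows at positions 1 and 2, columns at positions 0 and 1.
# As with Python lists, the end of an iloc slice is excluded.
block = df.iloc[1:3, 0:2]

# Negative positions count from the end.
last_row = df.iloc[-1]

print(block.index.tolist())    # ['two', 'three']
print(block.columns.tolist())  # ['name', 'year']
```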
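- df.isin: checks, element by element, whether each value in the DataFrame is contained in the given values
df.isin([2,3])
| | name | year | price |
| --- | --- | --- | --- |
| one | False | False | False |
| two | False | False | False |
| three | False | False | True |
| four | False | False | True |
| five | False | False | False |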
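Two common variations on `isin` are sketched here: filtering rows through a single column, and restricting the check to one column with a dictionary. The table is rebuilt so the block runs on its own, and the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["A", "B", "C", "D", "D"],
        "year": [2018, None, 2020, 2021, 2022],
        "price": [0, 1, 2, 3, 4],
    }
)

# Series.isin gives one boolean per row, which can filter the table.
a_or_d = df[df["name"].isin(["A", "D"])]

# A dict limits the check to the listed columns; other columns are all False.
per_column = df.isin({"price": [2, 3]})

print(a_or_d["name"].tolist())       # ['A', 'D', 'D']
print(per_column["price"].tolist())  # [False, False, True, True, False]
```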
- df.groupby: groups the rows of the DataFrame
df.groupby(['name'])
df.groupby(['name']).mean()
| name | year | price |
| --- | --- | --- |
| A | 2018.0 | 0.0 |
| B | NaN | 1.0 |
| C | 2020.0 | 2.0 |
| D | 2021.5 | 3.5 |
df.groupby(['name']).sum()
| name | year | price |
| --- | --- | --- |
| A | 2018.0 | 0 |
| B | 0.0 | 1 |
| C | 2020.0 | 2 |
| D | 4043.0 | 7 |
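The `sum` result reports 0.0 for group B because the only `year` value in that group is missing and missing values are skipped. The sketch below shows one way to keep that gap visible with `min_count`, along with named aggregation for computing several statistics at once; the output column names `mean_price` and `rows` are made up for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["A", "B", "C", "D", "D"],
        "year": [2018, None, 2020, 2021, 2022],
        "price": [0, 1, 2, 3, 4],
    }
)

# Named aggregation: output column = (input column, aggregation function).
summary = df.groupby("name").agg(
    mean_price=("price", "mean"),
    rows=("price", "size"),
)

# min_count=1 requires at least one non-missing value per group,
# so group B yields NaN instead of 0.0.
year_sum = df.groupby("name")["year"].sum(min_count=1)

print(summary.loc["D", "mean_price"])  # 3.5
print(summary.loc["D", "rows"])        # 2
```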
- df.drop: removes the specified labels from rows or columns
# Drop a column
df.drop(['name'], axis=1)
| | year | price |
| --- | --- | --- |
| one | 2018.0 | 0 |
| two | NaN | 1 |
| three | 2020.0 | 2 |
| four | 2021.0 | 3 |
| five | 2022.0 | 4 |
df.drop(columns=['name'])
| | year | price |
| --- | --- | --- |
| one | 2018.0 | 0 |
| two | NaN | 1 |
| three | 2020.0 | 2 |
| four | 2021.0 | 3 |
| five | 2022.0 | 4 |
# Drop a row by its index label
df.drop(['one'])
| | name | year | price |
| --- | --- | --- | --- |
| two | B | NaN | 1 |
| three | C | 2020.0 | 2 |
| four | D | 2021.0 | 3 |
| five | D | 2022.0 | 4 |
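One detail that the outputs above imply but do not state: `drop` returns a new DataFrame and leaves the original untouched. A short sketch on a two-row table; the column name `missing` is deliberately nonexistent to show the `errors` parameter.

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["A", "B"], "year": [2018, 2019], "price": [0, 1]},
    index=["one", "two"],
)

# drop returns a new object; df itself still has the "name" column.
smaller = df.drop(columns=["name"])

# index= and columns= can be combined; errors="ignore" skips labels
# that do not exist instead of raising KeyError.
kept = df.drop(index=["one"], columns=["missing"], errors="ignore")

print("name" in df.columns)       # True
print("name" in smaller.columns)  # False
print(kept.index.tolist())        # ['two']
```

To keep the result, assign it to a variable (for example `df = df.drop(...)`).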
- Drawing bar charts from the table
# Draw all columns in a single chart
ax = df.plot.bar(rot=0)
# Draw each column in its own chart
axes = df.plot.bar(rot=0, subplots=True)
axes[1].legend(loc=2)
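Outside a notebook, the chart has to be rendered or saved explicitly. The sketch below assumes matplotlib is installed and uses its non-interactive `Agg` backend; it plots only `price` against `name` and writes the PNG to an in-memory buffer, and a file path could be passed to `savefig` instead.

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, no window is opened

import pandas as pd

df = pd.DataFrame(
    {"name": ["A", "B", "C"], "year": [2018, 2019, 2020], "price": [0, 1, 2]}
)

# x="name" puts the names on the x-axis; only "price" is drawn as bars.
ax = df.plot.bar(x="name", y="price", rot=0)
ax.set_ylabel("price")

# Render the figure to PNG bytes held in memory.
buffer = io.BytesIO()
ax.get_figure().savefig(buffer, format="png")
```

Choosing the columns explicitly keeps `year` and `price`, which have very different scales, from sharing one axis.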