Published: 2024-03-09 15:01
import numpy as np
import pandas as pd
# import the numpy and pandas modules
Series data structure (a labeled one-dimensional array)
# A Series is a one-dimensional labeled array that can hold any data type (integers, strings, floats, Python objects, ...); the axis labels are collectively called the index
s = pd.Series(np.random.rand(5))
print(s)
print(type(s))
# inspect the data and its type
print(s.index,type(s.index))
print(s.values,type(s.values))
# .index returns the Series index, of type RangeIndex
# .values returns the Series values, of type ndarray
# Key point: compared with an ndarray, a Series carries its own index → a 1-D array + matching index
# So looking only at the values of a Series, you see a plain ndarray
# A Series is quite similar to an ndarray; indexing and slicing work much the same way
# Compared with a dict, a Series is like an ordered dictionary (a dict itself has no order); its lookup works on the same principle (one uses keys, the other uses the index)
0 0.467169
1 0.054509
2 0.965586
3 0.507866
4 0.694997
dtype: float64
<class 'pandas.core.series.Series'>
RangeIndex(start=0, stop=5, step=1) <class 'pandas.core.indexes.range.RangeIndex'>
[0.46716932 0.05450883 0.96558627 0.50786619 0.69499677] <class 'numpy.ndarray'>
Series creation method 1: from a dict — the dict's keys become the index, its values the values
dic = {'a':1, 'b':2, 'c':3, '4':4, '5':5}
s = pd.Series(dic)
print(s)
# Note: the keys here are all strings. What happens if the values are of mixed types? → dic = {'a':1, 'b':'hello', 'c':3, '4':4, '5':5}
4 4
5 5
a 1
b 2
c 3
dtype: int64
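The note above asks what happens when the dict holds values of more than one type. A minimal sketch of the answer: the Series can no longer keep a single numeric dtype and falls back to object.

```python
import pandas as pd

# same dict as above, but 'b' now maps to a string instead of a number
dic = {'a': 1, 'b': 'hello', 'c': 3, '4': 4, '5': 5}
s = pd.Series(dic)
print(s)
print(s.dtype)  # mixed int/str values are stored with dtype object
```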
Series creation method 2: from a (one-dimensional) array
arr = np.random.randn(5)
s = pd.Series(arr)
print(arr)
print(s)
# the default index starts at 0 with step 1
s = pd.Series(arr, index = ['a','b','c','d','e'], dtype = object)
print(s)
# index parameter: sets the index; its length must match the data
# dtype parameter: sets the value type (np.object is deprecated; use the builtin object)
[-2.56328023 0.87233579 0.47630666 1.91715736 -1.26924024]
0 -2.563280
1 0.872336
2 0.476307
3 1.917157
4 -1.269240
dtype: float64
a -2.56328
b 0.872336
c 0.476307
d 1.91716
e -1.26924
dtype: object
Series name attribute: name
s1 = pd.Series(np.random.randn(5))
print(s1)
print('-----')
s2 = pd.Series(np.random.randn(5), name = 'test')
print(s2)
print(s1.name, s2.name, type(s2.name))
# name is a Series parameter that gives the array a name
# the .name attribute returns that name as a str; if no name was defined, it returns None
s3 = s2.rename('hehehe')
print(s3)
print(s3.name, s2.name)
# .rename() renames the Series and returns a new object; the original is unchanged
0 -1.285306
1 -0.586416
2 -1.966362
3 -1.507387
4 0.622088
dtype: float64
-----
0 -0.763427
1 -1.588831
2 -1.676116
3 0.453159
4 -0.874990
Name: test, dtype: float64
None test
0 -0.763427
1 -1.588831
2 -1.676116
3 0.453159
4 -0.874990
Name: hehehe, dtype: float64
hehehe test
Dataframe data structure (a two-dimensional array with row and column labels)
# A Dataframe is a tabular data structure: a "labeled two-dimensional array"
# A Dataframe has an index (row labels) and columns (column labels)
data = {'name':['Jack','Tom','Mary'],
        'age':[18,19,20],
        'gender':['m','m','w']}
frame = pd.DataFrame(data)
print(frame)
print(type(frame))
print(frame.index,'\ntype:',type(frame.index))
print(frame.columns,'\ntype:',type(frame.columns))
print(frame.values,'\ntype:',type(frame.values))
# inspect the data; its type is DataFrame
# .index returns the row labels
# .columns returns the column labels
# .values returns the values as an ndarray
age gender name
0 18 m Jack
1 19 m Tom
2 20 w Mary
<class 'pandas.core.frame.DataFrame'>
RangeIndex(start=0, stop=3, step=1)
type: <class 'pandas.core.indexes.range.RangeIndex'>
Index(['age', 'gender', 'name'], dtype='object')
type: <class 'pandas.core.indexes.base.Index'>
[[18 'm' 'Jack']
 [19 'm' 'Tom']
 [20 'w' 'Mary']]
type: <class 'numpy.ndarray'>
Three ways to create one
Dataframe creation method 1: from a dict of arrays/lists
# creation: pd.DataFrame()
data1 = {'a':[1,2,3],
         'b':[3,4,5],
         'c':[5,6,7]}
data2 = {'one':np.random.rand(3),
         'two':np.random.rand(3)} # what happens if you try 'two':np.random.rand(4) here?
print(data1)
print(data2)
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
print(df1)
print(df2)
# a Dataframe built from a dict of arrays/lists: columns are the dict keys, index is the default numeric labels
# the dict values must all have the same length!
df1 = pd.DataFrame(data1, columns = ['b','c','a','d'])
print(df1)
df1 = pd.DataFrame(data1, columns = ['b','c'])
print(df1)
# columns parameter: reorders the columns; takes a list; a column absent from the data (such as 'd') is filled with NaN
# when columns is re-specified, it may list fewer columns than the original data
df2 = pd.DataFrame(data2, index = ['f1','f2','f3']) # what happens if you try index = ['f1','f2','f3','f4'] here?
print(df2)
# index parameter: redefines the index; takes a list whose length must match the data
{'a': [1, 2, 3], 'b': [3, 4, 5], 'c': [5, 6, 7]}
{'one': array([0.62316529, 0.75502024, 0.07149083]), 'two': array([0.18033469, 0.90637716, 0.6449568 ])}
a b c
0 1 3 5
1 2 4 6
2 3 5 7
one two
0 0.623165 0.180335
1 0.755020 0.906377
2 0.071491 0.644957
b c a d
0 3 5 1 NaN
1 4 6 2 NaN
2 5 7 3 NaN
b c
0 3 5
1 4 6
2 5 7
one two
f1 0.623165 0.180335
f2 0.755020 0.906377
f3 0.071491 0.644957
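The comment above asks what happens when one of the arrays is longer ('two': np.random.rand(4)). A small sketch of the answer: pandas refuses to line the columns up and raises a ValueError.

```python
import numpy as np
import pandas as pd

data = {'one': np.random.rand(3),
        'two': np.random.rand(4)}  # lengths differ on purpose
try:
    pd.DataFrame(data)
except ValueError as e:
    # pandas reports that all arrays must be of the same length
    print('ValueError:', e)
```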
Dataframe creation method 2: from a dict of Series
data1 = {'one':pd.Series(np.random.rand(2)),
         'two':pd.Series(np.random.rand(3))} # Series without an explicit index
data2 = {'one':pd.Series(np.random.rand(2), index = ['a','b']),
         'two':pd.Series(np.random.rand(3), index = ['a','b','c'])} # Series with an explicit index
print(data1)
print(data2)
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
print(df1)
print(df2)
# a Dataframe built from a dict of Series: columns are the dict keys, index comes from the Series labels (default numeric labels if none were given)
# the Series may have different lengths; the resulting Dataframe fills the gaps with NaN
{'two': 0 0.331382
1 0.508265
2 0.615997
dtype: float64, 'one': 0 0.857739
1 0.165800
dtype: float64}
{'two': a 0.826446
b 0.983392
c 0.187749
dtype: float64, 'one': a 0.920073
b 0.215178
dtype: float64}
one two
0 0.857739 0.331382
1 0.165800 0.508265
2 NaN 0.615997
one two
a 0.920073 0.826446
b 0.215178 0.983392
c NaN 0.187749
Dataframe creation method 3: directly from a 2-D array
ar = np.random.rand(9).reshape(3,3)
print(ar)
df1 = pd.DataFrame(ar)
df2 = pd.DataFrame(ar, index = ['a', 'b', 'c'], columns = ['one','two','three']) # try what happens when the index or columns length does not match the array
print(df1)
print(df2)
# a Dataframe built directly from a 2-D array keeps the array's shape; without index and columns, both default to numeric labels
# the specified index and columns lengths must match the array
[[0.33940056 0.77384698 0.25308293]
[0.28151251 0.02875986 0.7516066 ]
[0.34746659 0.25245068 0.68979615]]
0 1 2
0 0.339401 0.773847 0.253083
1 0.281513 0.028760 0.751607
2 0.347467 0.252451 0.689796
one two three
a 0.339401 0.773847 0.253083
b 0.281513 0.028760 0.751607
c 0.347467 0.252451 0.689796
Indexing and slicing (rows use (i)loc, columns do not; a string selects one column, a list selects several)
Series and Dataframe indexing work on the same principles; we focus on Dataframe indexing here
Column indexing: df['column name'] (Series has no column index)
Row indexing: df.loc[], df.iloc[]
Selecting columns / selecting rows / slicing / boolean filtering
import numpy as np
import pandas as pd
# import the numpy and pandas modules
Selecting rows and columns
df = pd.DataFrame(np.random.rand(12).reshape(3,4)*100,
                  index = ['one','two','three'],
                  columns = ['a','b','c','d'])
print(df)
data1 = df['a']
data2 = df[['a','c']]
print(data1,type(data1))
print(data2,type(data2))
print('-----')
# select columns by column name: one column yields a Series, several yield a Dataframe
data3 = df.loc['one']
data4 = df.loc[['one','two']]
print(data3,type(data3))
print(data4,type(data4))
# select rows by index label: one row yields a Series, several yield a Dataframe
a b c d
one 12.091343 5.285528 50.926279 43.411861
two 49.502460 68.476758 96.639658 13.522464
three 2.368430 72.006476 65.543572 95.790480
one 12.091343
two 49.502460
three 2.368430
Name: a, dtype: float64
a c
one 12.091343 50.926279
two 49.502460 96.639658
three 2.368430 65.543572
-----
a 12.091343
b 5.285528
c 50.926279
d 43.411861
Name: one, dtype: float64
a b c d
one 12.091343 5.285528 50.926279 43.411861
two 49.502460 68.476758 96.639658 13.522464
# df[] selects columns by default, with column names inside the brackets (hence columns are usually given explicit names rather than the default numeric labels, to avoid confusion with the index)
# selecting a single column returns a Series, printed in Series format
# selecting several columns returns a Dataframe, printed in Dataframe format
# key note: df[col] is generally used to select columns, with the column name inside []
# key note: df.loc[label] selects rows by index, supporting both explicit labels and the default numeric index
# difference between loc and iloc: loc includes the slice endpoint, iloc excludes it
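The loc/iloc endpoint difference is easy to verify on a tiny frame: a label slice with loc includes the stop label, while a positional slice with iloc excludes the stop position.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  index = ['one', 'two', 'three'],
                  columns = ['a', 'b', 'c', 'd'])
print(df.loc['one':'two'])  # label slice: the endpoint 'two' IS included
print(df.iloc[0:2])         # position slice: position 2 is NOT included
# both expressions select the same two rows here, for different reasons
```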
Boolean indexing
# mostly used to index rows
df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                  index = ['one','two','three','four'],
                  columns = ['a','b','c','d'])
print(df)
print('------')
b1 = df < 20
print(b1,type(b1))
print(df[b1]) # equivalent to df[df < 20]
print('------')
# without selecting a subset first, the comparison is applied to every value
# the result keeps all the data: True keeps the original value, False becomes NaN
b2 = df['a'] > 50
print(b2,type(b2))
print(df[b2]) # equivalent to df[df['a'] > 50]
print('------')
# condition on a single column
# the result keeps the rows where that column is True, including the other columns
b3 = df[['a','b']] > 50
print(b3,type(b3))
print(df[b3]) # equivalent to df[df[['a','b']] > 50]
print('------')
# condition on several columns
# the result keeps all the data: True keeps the original value, False becomes NaN
# if this raises an error, update pandas → conda update pandas
b4 = df.loc[['one','three']] < 50
print(b4,type(b4))
print(df[b4]) # equivalent to df[df.loc[['one','three']] < 50]
print('------')
# condition on several rows
# the result keeps all the data: True keeps the original value, False becomes NaN
a b c d
one 9.970915 10.403036 61.548723 33.807531
two 64.147530 20.837293 47.027831 24.937798
three 79.262523 6.433300 6.151607 96.142251
four 97.723110 8.283644 70.768830 6.523142
------
a b c d
one True True False False
two False False False False
three False True True False
four False True False True
a b c d
one 9.970915 10.403036 NaN NaN
two NaN NaN NaN NaN
three NaN 6.433300 6.151607 NaN
four NaN 8.283644 NaN 6.523142
------
one False
two True
three True
four True
Name: a, dtype: bool
a b c d
two 64.147530 20.837293 47.027831 24.937798
three 79.262523 6.433300 6.151607 96.142251
four 97.723110 8.283644 70.768830 6.523142
------
a b
one False False
two True False
three True False
four True False
a b c d
one NaN NaN NaN NaN
two 64.147530 NaN NaN NaN
three 79.262523 NaN NaN NaN
four 97.723110 NaN NaN NaN
------
a b c d
one True True False True
three False True True False
a b c d
one 9.970915 10.403036 NaN 33.807531
two NaN NaN NaN NaN
three NaN 6.433300 6.151607 NaN
four NaN NaN NaN NaN
------
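Single-column conditions can also be combined. A small sketch using & (and); the parentheses are required because & binds tighter than the comparison operators, and | works the same way for "or":

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # fixed seed so the filter keeps some rows
df = pd.DataFrame(np.random.rand(16).reshape(4, 4) * 100,
                  index = ['one', 'two', 'three', 'four'],
                  columns = ['a', 'b', 'c', 'd'])
# rows where column a > 30 AND column b < 80
sub = df[(df['a'] > 30) & (df['b'] < 80)]
print(sub)
```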
Multiple indexing: e.g. selecting rows and columns at the same time
# select columns first, then rows — for a dataset this means filtering fields first, then picking records
df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                  index = ['one','two','three','four'],
                  columns = ['a','b','c','d'])
print(df)
print('------')
print(df['a'].loc[['one','three']]) # select rows one and three of column a
print(df[['b','c','d']].iloc[::2]) # select rows one and three of columns b, c, d
print(df[df['a'] < 50].iloc[:2]) # select the first two rows that satisfy the boolean filter
a b c d
one 98.661560 29.514835 54.308770 85.895547
two 27.937505 7.272639 38.820131 93.830862
three 75.479305 80.195558 16.024623 63.068741
four 48.927145 38.935594 18.076788 48.773935
------
one 98.661560
three 75.479305
Name: a, dtype: float64
b c d
one 29.514835 54.308770 85.895547
three 80.195558 16.024623 63.068741
a b c d
two 27.937505 7.272639 38.820131 93.830862
four 48.927145 38.935594 18.076788 48.773935
Basic techniques
Inspecting data and transposing / adding, modifying and deleting values / alignment / sorting
Inspecting and transposing
df = pd.DataFrame(np.random.rand(16).reshape(8,2)*100,
                  columns = ['a','b'])
print(df.head(2))
print(df.tail())
# .head() shows the first rows
# .tail() shows the last rows
# both show 5 rows by default
print(df.T)
# .T transposes
a b
0 64.231620 24.222954
1 3.004779 92.549576
a b
3 54.787062 17.264577
4 13.106864 5.500618
5 8.631310 79.109355
6 22.107241 94.901685
7 29.034599 54.156278
0 1 2 3 4 5 \
a 64.231620 3.004779 25.002825 54.787062 13.106864 8.631310
b 24.222954 92.549576 87.818090 17.264577 5.500618 79.109355
6 7
a 22.107241 29.034599
b 94.901685 54.156278
Adding and modifying
df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                  columns = ['a','b','c','d'])
print(df)
df['e'] = 10
df.loc[4] = 20
print(df)
# add a new column/row and assign it a value
df['e'] = 20
df[['a','c']] = 100
print(df)
# index first, then assign, to modify values directly
a b c d
0 14.342082 52.604100 26.561995 60.441731
1 20.331108 43.537490 1.020098 7.171418
2 35.226542 9.573718 99.273254 0.867227
3 47.511549 56.783730 47.580639 67.007725
a b c d e
0 14.342082 52.604100 26.561995 60.441731 10
1 20.331108 43.537490 1.020098 7.171418 10
2 35.226542 9.573718 99.273254 0.867227 10
3 47.511549 56.783730 47.580639 67.007725 10
4 20.000000 20.000000 20.000000 20.000000 20
a b c d e
0 100 52.604100 100 60.441731 20
1 100 43.537490 100 7.171418 20
2 100 9.573718 100 0.867227 20
3 100 56.783730 100 67.007725 20
4 100 20.000000 100 20.000000 20
Deleting: del for columns / drop() for rows
df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                  columns = ['a','b','c','d'])
print(df)
del df['a']
print(df)
print('-----')
# the del statement deletes a column (in place)
print(df.drop(0))
print(df.drop([1,2]))
print(df)
print('-----')
# drop() deletes rows; with inplace=False → it returns a new object and leaves the original unchanged
print(df.drop(['d'], axis = 1))
print(df)
# drop() deletes columns when given axis = 1; with inplace=False → it returns a new object and leaves the original unchanged
a b c d
0 71.238538 6.121303 77.988034 44.047009
1 34.018365 78.192855 50.467246 81.162337
2 86.311980 44.341469 49.789445 35.657665
3 78.073272 31.457479 74.385014 24.655976
b c d
0 6.121303 77.988034 44.047009
1 78.192855 50.467246 81.162337
2 44.341469 49.789445 35.657665
3 31.457479 74.385014 24.655976
-----
b c d
1 78.192855 50.467246 81.162337
2 44.341469 49.789445 35.657665
3 31.457479 74.385014 24.655976
b c d
0 6.121303 77.988034 44.047009
3 31.457479 74.385014 24.655976
b c d
0 6.121303 77.988034 44.047009
1 78.192855 50.467246 81.162337
2 44.341469 49.789445 35.657665
3 31.457479 74.385014 24.655976
-----
b c
0 6.121303 77.988034
1 78.192855 50.467246
2 44.341469 49.789445
3 31.457479 74.385014
b c d
0 6.121303 77.988034 44.047009
1 78.192855 50.467246 81.162337
2 44.341469 49.789445 35.657665
3 31.457479 74.385014 24.655976
Alignment
df1 = pd.DataFrame(np.random.randn(10, 4), columns=['A', 'B', 'C', 'D'])
df2 = pd.DataFrame(np.random.randn(7, 3), columns=['A', 'B', 'C'])
print(df1 + df2)
# arithmetic between DataFrame objects automatically aligns on both columns and index (row labels)
A B C D
0 2.815874 -0.151485 -0.309208 NaN
1 2.294022 -0.355077 1.810512 NaN
2 1.947371 -0.550546 0.782233 NaN
3 0.893228 -0.961349 0.901582 NaN
4 -0.256011 -0.383676 -0.699367 NaN
5 1.343078 -0.797174 1.437726 NaN
6 0.727636 0.894328 -2.924819 NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 NaN NaN NaN NaN
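When the NaN produced by misaligned cells is unwanted, the arithmetic methods take a fill_value that treats a value missing from one operand as a constant. A sketch using .add():

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.ones((3, 3)), columns = ['A', 'B', 'C'])
df2 = pd.DataFrame(np.ones((2, 2)), columns = ['A', 'B'])
print(df1 + df2)                   # cells present in only one frame become NaN
print(df1.add(df2, fill_value=0))  # the missing side is treated as 0 instead
```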
Sorting 1 - by values: .sort_values
# also works for Series
df1 = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                   columns = ['a','b','c','d'])
print(df1)
print(df1.sort_values(['a'], ascending = True)) # ascending
print(df1.sort_values(['a'], ascending = False)) # descending
print('------')
# ascending parameter: chooses ascending or descending order; ascending by default
# single-column sort
df2 = pd.DataFrame({'a':[1,1,1,1,2,2,2,2],
                    'b':list(range(8)),
                    'c':list(range(8,0,-1))})
print(df2)
print(df2.sort_values(['a','c']))
# multi-column sort, applied in the order the columns are listed
# note the inplace parameter
a b c d
0 3.570340 86.589786 98.001894 50.984826
1 46.210527 83.412502 74.492003 73.251724
2 91.115360 6.083352 34.314697 56.784097
3 24.781722 67.813376 65.004312 31.788198
a b c d
0 3.570340 86.589786 98.001894 50.984826
3 24.781722 67.813376 65.004312 31.788198
1 46.210527 83.412502 74.492003 73.251724
2 91.115360 6.083352 34.314697 56.784097
a b c d
2 91.115360 6.083352 34.314697 56.784097
1 46.210527 83.412502 74.492003 73.251724
3 24.781722 67.813376 65.004312 31.788198
0 3.570340 86.589786 98.001894 50.984826
------
a b c
0 1 0 8
1 1 1 7
2 1 2 6
3 1 3 5
4 2 4 4
5 2 5 3
6 2 6 2
7 2 7 1
a b c
3 1 3 5
2 1 2 6
1 1 1 7
0 1 0 8
7 2 7 1
6 2 6 2
5 2 5 3
4 2 4 4
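For multi-column sorts, ascending can also be a list with one flag per sort column. A sketch based on df2 above:

```python
import pandas as pd

df2 = pd.DataFrame({'a': [1, 1, 2, 2],
                    'b': [4, 3, 2, 1],
                    'c': [1, 2, 3, 4]})
# 'a' ascending and 'c' descending, in one call
res = df2.sort_values(['a', 'c'], ascending = [True, False])
print(res)
```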
Sorting 2 - by index: .sort_index
df1 = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                   index = [5,4,3,2],
                   columns = ['a','b','c','d'])
df2 = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                   index = ['h','s','x','g'],
                   columns = ['a','b','c','d'])
print(df1)
print(df1.sort_index())
print(df2)
print(df2.sort_index())
# sorts by index
# defaults: ascending=True, inplace=False
a b c d
5 80.932585 71.991854 64.582943 23.443231
4 82.054030 87.459058 12.108433 83.047490
3 56.329863 14.926822 47.884418 59.880352
2 0.347007 69.794103 74.375345 12.736429
a b c d
2 0.347007 69.794103 74.375345 12.736429
3 56.329863 14.926822 47.884418 59.880352
4 82.054030 87.459058 12.108433 83.047490
5 80.932585 71.991854 64.582943 23.443231
a b c d
h 53.041921 93.834097 13.423132 82.702020
s 0.003814 75.721426 73.086606 20.597472
x 32.678307 58.369155 70.487505 24.833117
g 46.232889 19.365147 9.872537 98.246438
a b c d
g 46.232889 19.365147 9.872537 98.246438
h 53.041921 93.834097 13.423132 82.702020
s 0.003814 75.721426 73.086606 20.597472
x 32.678307 58.369155 70.487505 24.833117
Numerical computation and basic statistics
Common mathematical and statistical methods
Basic parameters: axis (default 0, i.e. down the columns) and skipna (default True, i.e. NaN is skipped)
import numpy as np
import pandas as pd
df = pd.DataFrame({'key1':[4,5,3,np.nan,2],
                   'key2':[1,2,np.nan,4,5],
                   'key3':[1,2,3,'j','k']},
                  index = ['a','b','c','d','e'])
print(df)
print(df['key1'].dtype,df['key2'].dtype,df['key3'].dtype)
print('-----')
m1 = df.mean()
print(m1,type(m1))
print('mean of a single column:',df['key2'].mean())
print('-----')
# np.nan: a missing value
# .mean() computes the mean
# only numeric columns are included
# a single column can be selected by indexing and summarized on its own
m2 = df.mean(axis=1)
print(m2)
print('-----')
# axis parameter: default 0, computes down each column; axis=1 computes across each row, as done here
m3 = df.mean(skipna=False)
print(m3)
print('-----')
# skipna parameter: whether to ignore NaN, default True; with False, any column containing NaN yields NaN
key1 key2 key3
a 4.0 1.0 1
b 5.0 2.0 2
c 3.0 NaN 3
d NaN 4.0 j
e 2.0 5.0 k
float64 float64 object
-----
key1 3.5
key2 3.0
dtype: float64
mean of a single column: 3.0
-----
a 2.5
b 3.5
c 3.0
d 4.0
e 3.5
dtype: float64
-----
key1 NaN
key2 NaN
dtype: float64
-----
Main mathematical methods, usable on Series and DataFrame (1)
df = pd.DataFrame({'key1':np.arange(10),
                   'key2':np.random.rand(10)*10})
print(df)
print('-----')
print(df.count(),'→ count: number of non-NaN values\n')
print(df.min(),'→ min: minimum\n',df['key2'].max(),'→ max: maximum\n')
print(df.quantile(q=0.75),'→ quantile: the quantile at position q\n')
print(df.sum(),'→ sum: total\n')
print(df.mean(),'→ mean: average\n')
print(df.median(),'→ median: the arithmetic median, i.e. the 50% quantile\n')
print(df.std(),'\n',df.var(),'→ std, var: standard deviation and variance\n')
print(df.skew(),'→ skew: sample skewness\n')
print(df.kurt(),'→ kurt: sample kurtosis\n')
key1 key2
0 0 6.792638
1 1 1.049023
2 2 5.441224
3 3 4.667631
4 4 2.053692
5 5 9.813006
6 6 5.074884
7 7 1.526651
8 8 8.519215
9 9 3.543486
-----
key1 10
key2 10
dtype: int64 → count: number of non-NaN values
key1 0.000000
key2 1.049023
dtype: float64 → min: minimum
9.81300585173231 → max: maximum
key1 6.750000
key2 6.454785
Name: 0.75, dtype: float64 → quantile: the quantile at position q
key1 45.00000
key2 48.48145
dtype: float64 → sum: total
key1 4.500000
key2 4.848145
dtype: float64 → mean: average
key1 4.500000
key2 4.871257
dtype: float64 → median: the arithmetic median, i.e. the 50% quantile
key1 3.027650
key2 2.931062
dtype: float64
key1 9.166667
key2 8.591127
dtype: float64 → std, var: standard deviation and variance
key1 0.000000
key2 0.352466
dtype: float64 → skew: sample skewness
key1 -1.20000
key2 -0.79798
dtype: float64 → kurt: sample kurtosis
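All of these summaries come bundled in .describe(), which returns count, mean, std, min, the quartiles and max in one DataFrame. A quick sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key1': np.arange(10),
                   'key2': np.random.rand(10) * 10})
d = df.describe()
print(d)  # one row per summary statistic, one column per numeric column
```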
Main mathematical methods, usable on Series and DataFrame (2)
df['key1_s'] = df['key1'].cumsum()
df['key2_s'] = df['key2'].cumsum()
print(df,'→ cumsum: cumulative sum\n')
df['key1_p'] = df['key1'].cumprod()
df['key2_p'] = df['key2'].cumprod()
print(df,'→ cumprod: cumulative product\n')
print(df.cummax(),'\n',df.cummin(),'→ cummax, cummin: cumulative maximum and minimum\n')
# this adds new columns alongside key1 and key2
key1 key2 key1_s key2_s
0 0 6.792638 0 6.792638
1 1 1.049023 1 7.841661
2 2 5.441224 3 13.282885
3 3 4.667631 6 17.950515
4 4 2.053692 10 20.004208
5 5 9.813006 15 29.817213
6 6 5.074884 21 34.892097
7 7 1.526651 28 36.418749
8 8 8.519215 36 44.937963
9 9 3.543486 45 48.481450 → cumsum: cumulative sum
key1 key2 key1_s key2_s key1_p key2_p
0 0 6.792638 0 6.792638 0 6.792638
1 1 1.049023 1 7.841661 0 7.125633
2 2 5.441224 3 13.282885 0 38.772160
3 3 4.667631 6 17.950515 0 180.974131
4 4 2.053692 10 20.004208 0 371.665151
5 5 9.813006 15 29.817213 0 3647.152301
6 6 5.074884 21 34.892097 0 18508.874743
7 7 1.526651 28 36.418749 0 28256.595196
8 8 8.519215 36 44.937963 0 240724.006055
9 9 3.543486 45 48.481450 0 853002.188425 → cumprod: cumulative product
key1 key2 key1_s key2_s key1_p key2_p
0 0.0 6.792638 0.0 6.792638 0.0 6.792638
1 1.0 6.792638 1.0 7.841661 0.0 7.125633
2 2.0 6.792638 3.0 13.282885 0.0 38.772160
3 3.0 6.792638 6.0 17.950515 0.0 180.974131
4 4.0 6.792638 10.0 20.004208 0.0 371.665151
5 5.0 9.813006 15.0 29.817213 0.0 3647.152301
6 6.0 9.813006 21.0 34.892097 0.0 18508.874743
7 7.0 9.813006 28.0 36.418749 0.0 28256.595196
8 8.0 9.813006 36.0 44.937963 0.0 240724.006055
9 9.0 9.813006 45.0 48.481450 0.0 853002.188425
key1 key2 key1_s key2_s key1_p key2_p
0 0.0 6.792638 0.0 6.792638 0.0 6.792638
1 0.0 1.049023 0.0 6.792638 0.0 6.792638
2 0.0 1.049023 0.0 6.792638 0.0 6.792638
3 0.0 1.049023 0.0 6.792638 0.0 6.792638
4 0.0 1.049023 0.0 6.792638 0.0 6.792638
5 0.0 1.049023 0.0 6.792638 0.0 6.792638
6 0.0 1.049023 0.0 6.792638 0.0 6.792638
7 0.0 1.049023 0.0 6.792638 0.0 6.792638
8 0.0 1.049023 0.0 6.792638 0.0 6.792638
9 0.0 1.049023 0.0 6.792638 0.0 6.792638 → cummax, cummin: cumulative maximum and minimum
Unique values: .unique()
s = pd.Series(list('asdvasdcfgg'))
sq = s.unique()
print(s)
print(sq,type(sq))
print(pd.Series(sq))
# returns an array of the unique values
# pd.Series turns it back into a new Series
sq.sort()
print(sq)
# sort in place
0 a
1 s
2 d
3 v
4 a
5 s
6 d
7 c
8 f
9 g
10 g
dtype: object
['a' 's' 'd' 'v' 'c' 'f' 'g']
0 a
1 s
2 d
3 v
4 c
5 f
6 g
dtype: object
['a' 'c' 'd' 'f' 'g' 's' 'v']
Value counts: .value_counts()
sc = s.value_counts(sort = False) # can also be written: pd.value_counts(s, sort = False)
print(sc)
# returns a new Series counting how often each distinct value occurs
# sort parameter: sorts by count, default True
d 2
a 2
s 2
c 1
f 1
g 2
v 1
dtype: int64
Membership: .isin()
s = pd.Series(np.arange(10,15))
df = pd.DataFrame({'key1':list('asdcbvasd'),
                   'key2':np.arange(4,13)})
print(s)
print(df)
print('-----')
print(s.isin([5,14]))
print(df.isin(['a','bc','10',8]))
# the argument is a list
# returns a boolean Series or Dataframe
0 10
1 11
2 12
3 13
4 14
dtype: int32
key1 key2
0 a 4
1 s 5
2 d 6
3 c 7
4 b 8
5 v 9
6 a 10
7 s 11
8 d 12
-----
0 False
1 False
2 False
3 False
4 True
dtype: bool
key1 key2
0 True False
1 False False
2 False False
3 False False
4 False True
5 False False
6 True False
7 False False
8 False False
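A very common pattern builds on this: use one column's .isin() result as a boolean row filter. A sketch with the same df:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key1': list('asdcbvasd'),
                   'key2': np.arange(4, 13)})
# keep only the rows whose key1 is 'a' or 's'
sub = df[df['key1'].isin(['a', 's'])]
print(sub)
```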
Text data (.str)
Pandas provides a suite of string methods that make it easy to operate on every element of an array
They are accessed through .str, and missing/NA values are skipped automatically
s = pd.Series(['A','b','C','bbhello','123',np.nan,'hj'])
df = pd.DataFrame({'key1':list('abcdef'),
                   'key2':['hee','fv','w','hija','123',np.nan]})
print(s)
print(df)
print('-----')
print(s.str.count('b'))
print(df['key2'].str.upper())
print('-----')
# string methods are called directly through .str
# works on Series and Dataframe columns
# NaN values are skipped automatically
df.columns = df.columns.str.upper()
print(df)
# df.columns is an Index object, so .str works on it too
0 A
1 b
2 C
3 bbhello
4 123
5 NaN
6 hj
dtype: object
key1 key2
0 a hee
1 b fv
2 c w
3 d hija
4 e 123
5 f NaN
-----
0 0.0
1 1.0
2 0.0
3 2.0
4 0.0
5 NaN
6 0.0
dtype: float64
0 HEE
1 FV
2 W
3 HIJA
4 123
5 NaN
Name: key2, dtype: object
-----
KEY1 KEY2
0 a hee
1 b fv
2 c w
3 d hija
4 e 123
5 f NaN
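A few more .str methods that follow the same NaN-skipping rule: contains, startswith and len. A sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series(['A', 'b', 'C', 'bbhello', '123', np.nan, 'hj'])
print(s.str.contains('b'))    # boolean mask; the NaN element stays NaN
print(s.str.startswith('b'))  # element-wise prefix test
print(s.str.len())            # element-wise string length
```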
Common string methods (2) - strip (removing whitespace)
s = pd.Series([' jack', 'jill ', ' jesse ', 'frank'])
df = pd.DataFrame(np.random.randn(3, 2), columns=[' Column A ', ' Column B '],
                  index=range(3))
print(s)
print(df)
print('-----')
print(s.str.strip()) # strip whitespace from both ends
print(s.str.lstrip()) # strip whitespace from the left
print(s.str.rstrip()) # strip whitespace from the right
df.columns = df.columns.str.strip()
print(df)
# this removes the leading and trailing spaces from the column names, but not the internal ones
0 jack
1 jill
2 jesse
3 frank
dtype: object
Column A Column B
0 -1.110964 -0.607590
1 2.043887 0.713886
2 0.840672 -0.854777
-----
0 jack
1 jill
2 jesse
3 frank
dtype: object
0 jack
1 jill
2 jesse
3 frank
dtype: object
0 jack
1 jill
2 jesse
3 frank
dtype: object
Column A Column B
0 -1.110964 -0.607590
1 2.043887 0.713886
2 0.840672 -0.854777
Common string methods (3) - replace (substitution)
df = pd.DataFrame(np.random.randn(3, 2), columns=[' Column A ', ' Column B '],
                  index=range(3))
df.columns = df.columns.str.replace(' ','-')
print(df)
# substitution
df.columns = df.columns.str.replace('-','hehe',n=1)
print(df)
# n: number of replacements
-Column-A- -Column-B-
0 -0.148261 -1.821841
1 0.920267 -2.327533
2 0.083585 -0.613041
heheColumn-A- heheColumn-B-
0 -0.148261 -1.821841
1 0.920267 -2.327533
2 0.083585 -0.613041
Common string methods (4) - split, rsplit (splitting; the latter works right to left)
s = pd.Series(['a,b,c','1,2,3',['a,,,c'],np.nan])
print(s.str.split(','))
print('-----')
# like the string split method
print(s.str.split(',')[0])
print('-----')
# direct indexing returns a list
print(s.str.split(',').str[0]) # take only the first element
print(s.str.split(',').str.get(1))
print('-----')
# elements of the split lists can be accessed with get or []
print(s.str.split(',', expand=True))
print(s.str.split(',', expand=True, n = 1))
print(s.str.rsplit(',', expand=True, n = 1))
print('-----')
# expand=True extends the operation to return a DataFrame
# the n parameter limits the number of splits
# rsplit is like split but works in reverse, from the end of the string towards the beginning
df = pd.DataFrame({'key1':['a,b,c','1,2,3',[':,., ']],
                   'key2':['a-b-c','1-2-3',[':-.- ']]})
print('-----')
print(df)
print(df['key2'].str.split('-'))
# split on a Dataframe column
0 [a, b, c]
1 [1, 2, 3]
2 NaN
3 NaN
dtype: object
-----
['a', 'b', 'c']
-----
0 a
1 1
2 NaN
3 NaN
dtype: object
0 b
1 2
2 NaN
3 NaN
dtype: object
-----
0 1 2
0 a b c
1 1 2 3
2 NaN NaN NaN
3 NaN NaN NaN
0 1
0 a b,c
1 1 2,3
2 NaN NaN
3 NaN NaN
0 1
0 a,b c
1 1,2 3
2 NaN NaN
3 NaN NaN
-----
-----
key1 key2
0 a,b,c a-b-c
1 1,2,3 1-2-3
2 [:,., ] [:-.- ]
0 [a, b, c]
1 [1, 2, 3]
2 NaN
Name: key2, dtype: object
String indexing
s = pd.Series(['A','b','C','bbhello','123',np.nan,'hj'])
df = pd.DataFrame({'key1':list('abcdef'),
                   'key2':['hee','fv','w','hija','123',np.nan]})
print("-----")
print(s.str[0]) # take the first character
print("-----")
print(s.str[:2]) # take the first two characters
print(df['key2'].str[0])
# after .str, indexing works the same as on a plain string
-----
0 A
1 b
2 C
3 b
4 1
5 NaN
6 h
dtype: object
-----
0 A
1 b
2 C
3 bb
4 12
5 NaN
6 hj
dtype: object
0 h
1 f
2 w
3 h
4 1
5 NaN
Name: key2, dtype: object
Merging, joining, deduplication and replacement
Pandas has full-featured, high-performance in-memory join operations, very similar to relational databases such as SQL
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=True, suffixes=('_x', '_y'), copy=True, indicator=False)
pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, copy=True)
.replace()
.duplicated()
merge → like Excel's vlookup
df1 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                    'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3']})
df2 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']})
df3 = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                    'key2': ['K0', 'K1', 'K0', 'K1'],
                    'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3']})
df4 = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                    'key2': ['K0', 'K0', 'K0', 'K0'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']})
print(pd.merge(df1, df2, on='key'))
print('------')
# left: the first df
# right: the second df
# on: the join key
print(pd.merge(df3, df4, on=['key1','key2']))
# multiple join keys
A B key C D
0 A0 B0 K0 C0 D0
1 A1 B1 K1 C1 D1
2 A2 B2 K2 C2 D2
3 A3 B3 K3 C3 D3
------
A B key1 key2 C D
0 A0 B0 K0 K0 C0 D0
1 A2 B2 K1 K0 C1 D1
2 A2 B2 K1 K0 C2 D2
The how parameter → merge method
print(pd.merge(df3, df4, on=['key1','key2'], how = 'inner'))
print('------')
# inner: the default; intersection of the keys
print(pd.merge(df3, df4, on=['key1','key2'], how = 'outer'))
print('------')
# outer: union of the keys; missing data becomes NaN
print(pd.merge(df3, df4, on=['key1','key2'], how = 'left'))
print('------')
# left: merge using df3 as the reference; missing data becomes NaN
print(pd.merge(df3, df4, on=['key1','key2'], how = 'right'))
# right: merge using df4 as the reference; missing data becomes NaN
A B key1 key2 C D
0 A0 B0 K0 K0 C0 D0
1 A2 B2 K1 K0 C1 D1
2 A2 B2 K1 K0 C2 D2
------
A B key1 key2 C D
0 A0 B0 K0 K0 C0 D0
1 A1 B1 K0 K1 NaN NaN
2 A2 B2 K1 K0 C1 D1
3 A2 B2 K1 K0 C2 D2
4 A3 B3 K2 K1 NaN NaN
5 NaN NaN K2 K0 C3 D3
------
A B key1 key2 C D
0 A0 B0 K0 K0 C0 D0
1 A1 B1 K0 K1 NaN NaN
2 A2 B2 K1 K0 C1 D1
3 A2 B2 K1 K0 C2 D2
4 A3 B3 K2 K1 NaN NaN
------
A B key1 key2 C D
0 A0 B0 K0 K0 C0 D0
1 A2 B2 K1 K0 C1 D1
2 A2 B2 K1 K0 C2 D2
3 NaN NaN K2 K0 C3 D3
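When auditing a merge, indicator=True is useful: it adds a _merge column recording whether each row came from the left frame only, the right frame only, or both. A sketch with df3/df4 as above:

```python
import pandas as pd

df3 = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                    'key2': ['K0', 'K1', 'K0', 'K1'],
                    'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3']})
df4 = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                    'key2': ['K0', 'K0', 'K0', 'K0'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']})
m = pd.merge(df3, df4, on=['key1', 'key2'], how='outer', indicator=True)
print(m)  # the _merge column holds 'left_only', 'right_only' or 'both'
```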
Parameters left_on, right_on, left_index, right_index → when the key is not a shared column, the left and right keys can be set separately
df1 = pd.DataFrame({'lkey':list('bbacaab'),
                    'data1':range(7)})
df2 = pd.DataFrame({'rkey':list('abd'),
                    'date2':range(3)})
print(pd.merge(df1, df2, left_on='lkey', right_on='rkey'))
print('------')
# df1 joins on 'lkey', df2 on 'rkey'
df1 = pd.DataFrame({'key':list('abcdfeg'),
                    'data1':range(7)})
df2 = pd.DataFrame({'date2':range(100,105)},
                   index = list('abcde'))
print(pd.merge(df1, df2, left_on='key', right_index=True))
# df1 joins on 'key', df2 on its index
# left_index: when True, the first df joins on its index; default False
# right_index: when True, the second df joins on its index; default False
# so left_on, right_on, left_index, right_index can be combined freely:
# left_on + right_on, left_on + right_index, left_index + right_on, left_index + right_index
lkey data1 rkey date2
0 b 0 b 1
1 b 1 b 1
2 b 6 b 1
3 a 2 a 0
4 a 4 a 0
5 a 5 a 0
------
key data1 date2
0 a 0 100
1 b 1 101
2 c 2 102
3 d 3 103
5 e 5 104
Joining: concat
s1 = pd.Series([1,2,3])
s2 = pd.Series([2,3,4])
print(pd.concat([s1,s2]))
print('-----')
# default axis=0: rows stacked vertically
s3 = pd.Series([1,2,3],index = ['a','c','h'])
s4 = pd.Series([2,3,4],index = ['b','e','d'])
print(pd.concat([s3,s4]).sort_index())
print(pd.concat([s3,s4], axis=1))
print('-----')
# axis=1: columns side by side, producing a Dataframe
0 1
1 2
2 3
0 2
1 3
2 4
dtype: int64
-----
a 1
b 2
c 2
d 4
e 3
h 3
dtype: int64
0 1
a 1.0 NaN
b NaN 2.0
c 2.0 NaN
d NaN 4.0
e NaN 3.0
h 3.0 NaN
-----
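concat also accepts a keys parameter, which labels each input piece and builds a hierarchical index, so the pieces can be recovered afterwards. A sketch with s3/s4 from above:

```python
import pandas as pd

s3 = pd.Series([1, 2, 3], index = ['a', 'c', 'h'])
s4 = pd.Series([2, 3, 4], index = ['b', 'e', 'd'])
r = pd.concat([s3, s4], keys = ['s3', 's4'])
print(r)        # outer index level: 's3'/'s4'; inner level: the original labels
print(r['s3'])  # the outer level selects the original piece back
```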
Deduplication: .duplicated
s = pd.Series([1,1,1,1,2,2,2,3,4,5,5,5,5])
print(s.duplicated())
print(s[s.duplicated() == False])
print('-----')
# flags whether each value is a duplicate
# a boolean filter then yields the non-duplicated values
s_re = s.drop_duplicates()
print(s_re)
print('-----')
# .drop_duplicates() removes the duplicates
# inplace parameter: whether to modify the original, default False
df = pd.DataFrame({'key1':['a','a',3,4,5],
                   'key2':['a','a','b','b','c']})
print(df.duplicated())
print(df['key2'].duplicated())
# duplicated on a Dataframe
0 False
1 True
2 True
3 True
4 False
5 True
6 True
7 False
8 False
9 False
10 True
11 True
12 True
dtype: bool
0 1
4 2
7 3
8 4
9 5
dtype: int64
-----
0 1
4 2
7 3
8 4
9 5
dtype: int64
-----
0 False
1 True
2 False
3 False
4 False
dtype: bool
0 False
1 True
2 False
3 True
4 False
Name: key2, dtype: bool
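On a DataFrame, .drop_duplicates() also takes subset (which columns to compare) and keep ('first', 'last' or False). A sketch with the same df:

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 3, 4, 5],
                   'key2': ['a', 'a', 'b', 'b', 'c']})
print(df.drop_duplicates())                              # drops fully duplicated rows
print(df.drop_duplicates(subset=['key2']))               # compares key2 only
print(df.drop_duplicates(subset=['key2'], keep='last'))  # keep the last occurrence
```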
Replacement: .replace
s = pd.Series(list('ascaazsd'))
print(s.replace('a', np.nan))
print(s.replace(['a','s'], np.nan))
print(s.replace({'a':'hello world!','s':123}))
# one value or several can be replaced in a single call
# accepts a list or a dict
0 NaN
1 s
2 c
3 NaN
4 NaN
5 z
6 s
7 d
dtype: object
0 NaN
1 NaN
2 c
3 NaN
4 NaN
5 z
6 NaN
7 d
dtype: object
0 hello world!
1 123
2 c
3 hello world!
4 hello world!
5 z
6 123
7 d
dtype: object
Grouping data
Grouped statistics - the groupby facility
A Dataframe is grouped along rows (axis=0) or columns (axis=1), a function is applied to each group to produce a new value, and the function's results are combined into a final result object.
df.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, **kwargs)
Grouping
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})
print(df)
print('------')
print(df.groupby('A'), type(df.groupby('A')))
print('------')
# grouping alone yields a GroupBy object, an intermediate result on which no computation has been done yet
a = df.groupby('A').mean()
b = df.groupby(['A','B']).mean()
c = df.groupby(['A'])['D'].mean() # group by A, take the mean of D
print(a,type(a),'\n',a.columns)
print(b,type(b),'\n',b.columns)
print(c,type(c))
# a computation on the grouped object yields a new dataframe
# default axis = 0, grouping by rows
# can group by one column or several (using a list)
A B C D
0 foo one 1.135465 0.129953
1 bar one 0.369776 -0.764419
2 foo two 0.127738 0.094981
3 bar three -1.089670 0.840816
4 foo two 0.093698 -0.559945
5 bar two -0.616583 -0.363614
6 foo one 0.670724 0.301891
7 foo three 1.034713 -0.132406
------
------
C D
A
bar -0.445493 -0.095739
foo 0.612468 -0.033105
Index(['C', 'D'], dtype='object')
C D
A B
bar one 0.369776 -0.764419
three -1.089670 0.840816
two -0.616583 -0.363614
foo one 0.903094 0.215922
three 1.034713 -0.132406
two 0.110718 -0.232482
Index(['C', 'D'], dtype='object')
A
bar -0.095739
foo -0.033105
Name: D, dtype: float64
Grouping - an iterable object
df = pd.DataFrame({'X' : ['A', 'B', 'A', 'B'], 'Y' : [1, 4, 3, 2]})
print(df)
print(df.groupby('X'), type(df.groupby('X')))
print('-----')
print(list(df.groupby('X')), '→ an iterable; can be turned directly into a list\n')
print(list(df.groupby('X'))[0], '→ displayed as a tuple\n')
for n,g in df.groupby('X'):
    print(n)
    print(g)
    print('###')
print('-----')
# n is the group name, g is the grouped Dataframe
print(df.groupby(['X']).get_group('A'),'\n')
print(df.groupby(['X']).get_group('B'),'\n')
print('-----')
# .get_group() extracts a single group
grouped = df.groupby(['X'])
print(grouped.groups)
print(grouped.groups['A']) # can also be written: df.groupby('X').groups['A']
print('-----')
# .groups turns the groups into a dict
# its entries can then be inspected with dict-style indexing
sz = grouped.size()
print(sz,type(sz))
print('-----')
# .size(): the length of each group
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})
grouped = df.groupby(['A','B']).groups
print(df)
print(grouped)
print(grouped[('foo', 'three')])
# grouping by two columns
X Y
0 A 1
1 B 4
2 A 3
3 B 2
-----
[('A', X Y
0 A 1
2 A 3), ('B', X Y
1 B 4
3 B 2)] → an iterable; can be turned directly into a list
('A', X Y
0 A 1
2 A 3) → displayed as a tuple
A
X Y
0 A 1
2 A 3
###
B
X Y
1 B 4
3 B 2
###
-----
X Y
0 A 1
2 A 3
X Y
1 B 4
3 B 2
-----
{'A': Int64Index([0, 2], dtype='int64'), 'B': Int64Index([1, 3], dtype='int64')}
Int64Index([0, 2], dtype='int64')
-----
X
A 2
B 2
dtype: int64
-----
A B C D
0 foo one -0.304245 0.660739
1 bar one 1.264029 -1.135984
2 foo two 0.305469 0.865855
3 bar three 0.003346 -0.804999
4 foo two 0.584652 0.289705
5 bar two 0.903023 -1.586324
6 foo one 0.260300 0.994691
7 foo three 0.293223 0.864264
{('bar', 'one'): Int64Index([1], dtype='int64'), ('bar', 'three'): Int64Index([3], dtype='int64'), ('bar', 'two'): Int64Index([5], dtype='int64'), ('foo', 'one'): Int64Index([0, 6], dtype='int64'), ('foo', 'three'): Int64Index([7], dtype='int64'), ('foo', 'two'): Int64Index([2, 4], dtype='int64')}
Int64Index([7], dtype='int64')
Grouped computation methods
s = pd.Series([1, 2, 3, 10, 20, 30], index = [1, 2, 3, 1, 2, 3])
grouped = s.groupby(level=0) # with a non-unique index, .groupby(level=0) puts values sharing the same index label into one group
print(grouped)
print(grouped.first(),'→ first: the first non-NaN value\n')
print(grouped.last(),'→ last: the last non-NaN value\n')
print(grouped.sum(),'→ sum: sum of non-NaN values\n')
print(grouped.mean(),'→ mean: mean of non-NaN values\n')
print(grouped.median(),'→ median: arithmetic median of non-NaN values\n')
print(grouped.count(),'→ count: number of non-NaN values\n')
print(grouped.min(),'→ min, max: minimum and maximum of non-NaN values\n')
print(grouped.std(),'→ std, var: standard deviation and variance of non-NaN values\n')
print(grouped.prod(),'→ prod: product of non-NaN values\n')
1 1
2 2
3 3
dtype: int64 → first: the first non-NaN value
1 10
2 20
3 30
dtype: int64 → last: the last non-NaN value
1 11
2 22
3 33
dtype: int64 → sum: sum of non-NaN values
1 5.5
2 11.0
3 16.5
dtype: float64 → mean: mean of non-NaN values
1 5.5
2 11.0
3 16.5
dtype: float64 → median: arithmetic median of non-NaN values
1 2
2 2
3 2
dtype: int64 → count: number of non-NaN values
1 1
2 2
3 3
dtype: int64 → min, max: minimum and maximum of non-NaN values
1 6.363961
2 12.727922
3 19.091883
dtype: float64 → std, var: standard deviation and variance of non-NaN values
1 10
2 40
3 90
dtype: int64 → prod: product of non-NaN values
Multiple functions at once: agg()
df = pd.DataFrame({'a':[1,1,2,2],
                   'b':np.random.rand(4),
                   'c':np.random.rand(4),
                   'd':np.random.rand(4)})
print(df)
print(df.groupby('a').agg(['mean',np.sum]))
print(df.groupby('a')['b'].agg({'result1':np.mean,
                                'result2':np.sum}))
# functions can be given as strings or as np. functions
# they can be passed in a list or a dict; with a dict, the keys become the column names → newer pandas versions emit a warning for this
# prefer passing a list
a b c d
0 1 0.758848 0.375900 0.962917
1 1 0.430484 0.322437 0.402809
2 2 0.285699 0.230663 0.525483
3 2 0.676740 0.191693 0.874899
b c d
mean sum mean sum mean sum
a
1 0.594666 1.189331 0.349169 0.698337 0.682863 1.365727
2 0.481219 0.962438 0.211178 0.422356 0.700191 1.400382
result1 result2
a
1 0.594666 1.189331
2 0.481219 0.962438
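The dict form above is deprecated; since pandas 0.25 the usual replacement is named aggregation, where each keyword names a result column and maps to a (column, function) pair. A sketch (the result column names are my own choice):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 2],
                   'b': np.random.rand(4)})
# result1/result2 are arbitrary names for the output columns
r = df.groupby('a').agg(result1=('b', 'mean'),
                        result2=('b', 'sum'))
print(r)
```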
Reading data
Reading plain delimited data: read_table
# reads txt and csv files
import os
os.chdir('C:/Users/数据分析库pandas')
data1 = pd.read_table('data1.txt', delimiter=',', header = 0, index_col=2)
print(data1)
# delimiter: the character to split on; sep works as well: sep = ','
# header: the row number to use for the column names, default 0 (the first row)
# index_col: uses the given column as the row index; otherwise a default 0, 1, ... index is created
# read_table is mainly for reading simple txt/csv data
va1 va2 va4
va3
3 1 2 4
4 2 3 5
5 3 4 6
6 4 5 7
Reading csv data: read_csv
# first get familiar with exporting a csv from Excel
data2 = pd.read_csv('data2.csv', encoding = 'utf-8')
print(data2.head())
# encoding: the character encoding, usually 'utf-8'
# in most cases, export the Excel file to csv first and then read that
省级政区代码 省级政区名称 地市级政区代码 地市级政区名称 年份 党委书记姓名 出生年份 出生月份 籍贯省份代码 籍贯省份名称 \
0 130000 河北省 130100 石家庄市 2000 陈来立 NaN NaN NaN NaN
1 130000 河北省 130100 石家庄市 2001 吴振华 NaN NaN NaN NaN
2 130000 河北省 130100 石家庄市 2002 吴振华 NaN NaN NaN NaN
3 130000 河北省 130100 石家庄市 2003 吴振华 NaN NaN NaN NaN
4 130000 河北省 130100 石家庄市 2004 吴振华 NaN NaN NaN NaN
... 民族 教育 是否是党校教育(是=1,否=0) 专业:人文 专业:社科 专业:理工 专业:农科 专业:医科 入党年份 工作年份
0 ... NaN 硕士 1.0 NaN NaN NaN NaN NaN NaN NaN
1 ... NaN 本科 0.0 0.0 0.0 1.0 0.0 0.0 NaN NaN
2 ... NaN 本科 0.0 0.0 0.0 1.0 0.0 0.0 NaN NaN
3 ... NaN 本科 0.0 0.0 0.0 1.0 0.0 0.0 NaN NaN
4 ... NaN 本科 0.0 0.0 0.0 1.0 0.0 0.0 NaN NaN
[5 rows x 23 columns]
Reading excel data: read_excel
data3 = pd.read_excel('地市级党委书记数据库(2000-10).xlsx', sheet_name='中国人民共和国地市级党委书记数据库(2000-10)', header=0)
print(data3.head())
# io: the file path
# sheet_name: use sheet_name=[0,1] for several sheets, sheet_name=None for all sheets → ① an int/string returns a dataframe ② None or a list returns a dict
# header: the row to use for the column names, default 0, i.e. the first row
# index_col: uses the given column as the index; strings can also be used
省级政区代码 省级政区名称 地市级政区代码 地市级政区名称 年份 党委书记姓名 出生年份 出生月份 籍贯省份代码 籍贯省份名称 \
0 130000 河北省 130100 石家庄市 2000 陈来立 NaN NaN NaN NaN
1 130000 河北省 130100 石家庄市 2001 吴振华 NaN NaN NaN NaN
2 130000 河北省 130100 石家庄市 2002 吴振华 NaN NaN NaN NaN
3 130000 河北省 130100 石家庄市 2003 吴振华 NaN NaN NaN NaN
4 130000 河北省 130100 石家庄市 2004 吴振华 NaN NaN NaN NaN
... 民族 教育 是否是党校教育(是=1,否=0) 专业:人文 专业:社科 专业:理工 专业:农科 专业:医科 入党年份 工作年份
0 ... NaN 硕士 1.0 NaN NaN NaN NaN NaN NaN NaN
1 ... NaN 本科 0.0 0.0 0.0 1.0 0.0 0.0 NaN NaN
2 ... NaN 本科 0.0 0.0 0.0 1.0 0.0 0.0 NaN NaN
3 ... NaN 本科 0.0 0.0 0.0 1.0 0.0 0.0 NaN NaN
4 ... NaN 本科 0.0 0.0 0.0 1.0 0.0 0.0 NaN NaN
[5 rows x 23 columns]