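pandas is a library built on top of NumPy; together with NumPy it is mainly used for scientific computing and data processing.
It is also the library that let me forget about pricey MATLAB and forced me to brush up on my SQL..
The conventional imports: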
In [105]: from pandas import Series,DataFrame
In [106]: import pandas as pd
In [107]: import numpy as np
Series
Similar to a one-dimensional array: it consists of a set of data and an associated set of index labels.
In [68]: o2 = Series([4,-7,35,99])
In [69]: o2
Out[69]:
0 4
1 -7
2 35
3 99
dtype: int64
In [70]: o2 = Series([4,-7,35,99],index=['a','b','c','d'])
In [71]: o2
Out[71]:
a 4
b -7
c 35
d 99
dtype: int64
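To make the "data plus index" idea concrete, here is a small sketch that reuses the labeled Series from above. It only uses standard attributes (values, index) and label-based lookup; the variable names are just for illustration.

```python
import pandas as pd

# Same data as the example above, with explicit labels.
o2 = pd.Series([4, -7, 35, 99], index=['a', 'b', 'c', 'd'])

data_part = o2.values      # the underlying NumPy array
label_part = o2.index      # the index labels
one_value = o2['c']        # look up a single value by its label
positive = o2[o2 > 0]      # boolean filtering keeps the labels attached

print(data_part)
print(label_part)
print(one_value)
print(positive)
```

The point of the last line is that the labels travel with the values, which is the main difference from a bare NumPy array.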
DataFrame
A tabular data structure; it can be viewed as a dict of Series that all share the same index.
In [21]: frame = DataFrame(np.arange(9).reshape(3,3),index=['a','b','c'],
...: columns=['Ohio','Texas','Califor'])
In [22]: frame
Out[22]:
Ohio Texas Califor
a 0 1 2
b 3 4 5
c 6 7 8
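The "dict of Series" view can be taken literally. The sketch below builds a smaller frame from two Series that share an index (the names ohio and texas are made up for this example) and shows that both a column and a row come back as a Series.

```python
import pandas as pd

# Two Series with the same index; the dict keys become the column names.
ohio = pd.Series([0, 3, 6], index=['a', 'b', 'c'])
texas = pd.Series([1, 4, 7], index=['a', 'b', 'c'])
frame = pd.DataFrame({'Ohio': ohio, 'Texas': texas})

col = frame['Texas']     # selecting a column returns a Series
row = frame.loc['b']     # selecting a row by label also returns a Series

print(frame)
print(col)
print(row)
```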
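- Fill values in arithmetic methods
1.1. When two objects with different shapes are added directly, the operands are aligned on their labels, and any position missing from either one becomes NaN.
1.2. To avoid those NaN values, call the arithmetic method (here add) with fill_value, which substitutes for a value missing from one of the operands.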
In [31]: df2 = DataFrame(np.arange(20.).reshape(4,5),columns=list('abcde'))
In [32]: df1 = DataFrame(np.arange(12.).reshape(3,4),columns=list('abcd'))
In [33]: df1
Out[33]:
a b c d
0 0.0 1.0 2.0 3.0
1 4.0 5.0 6.0 7.0
2 8.0 9.0 10.0 11.0
In [34]: df2
Out[34]:
a b c d e
0 0.0 1.0 2.0 3.0 4.0
1 5.0 6.0 7.0 8.0 9.0
2 10.0 11.0 12.0 13.0 14.0
3 15.0 16.0 17.0 18.0 19.0
In [35]: df1 + df2
Out[35]:
a b c d e
0 0.0 2.0 4.0 6.0 NaN
1 9.0 11.0 13.0 15.0 NaN
2 18.0 20.0 22.0 24.0 NaN
3 NaN NaN NaN NaN NaN
In [36]: df1.add(df2,fill_value=0)
Out[36]:
a b c d e
0 0.0 2.0 4.0 6.0 4.0
1 9.0 11.0 13.0 15.0 9.0
2 18.0 20.0 22.0 24.0 14.0
3 15.0 16.0 17.0 18.0 19.0
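One detail that the frames above do not show: fill_value only helps when a value is missing from one side. If it is missing from both, the result stays NaN. A minimal sketch with two made-up Series:

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1.0, 2.0, np.nan], index=['a', 'b', 'c'])
s2 = pd.Series([10.0, np.nan, 30.0], index=['a', 'c', 'd'])

plain = s1 + s2                      # 'b', 'c', 'd' all end up NaN
filled = s1.add(s2, fill_value=0)    # 'b' and 'd' are filled; 'c' is NaN on both sides

print(plain)
print(filled)
```

Here 'c' is NaN in both s1 and s2, so it is still NaN after add with fill_value=0.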
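- Operations between DataFrame and Series: broadcasting
2.1. By default, the Series index is matched against the DataFrame's columns and the operation is broadcast down the rows.
2.2. To match on the row index and broadcast across the columns instead, use an arithmetic method (e.g. sub) with axis=0.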
In [41]: arr = np.arange(12.).reshape(3,4)
In [42]: arr
Out[42]:
array([[ 0., 1., 2., 3.],
[ 4., 5., 6., 7.],
[ 8., 9., 10., 11.]])
In [43]: arr - arr[0]
Out[43]:
array([[ 0., 0., 0., 0.],
[ 4., 4., 4., 4.],
[ 8., 8., 8., 8.]])
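The NumPy example has a direct pandas counterpart. The following sketch uses a made-up 4x3 frame to show both directions from 2.1 and 2.2: the plain operator matches the Series against the columns, while sub with axis=0 matches it against the row index.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape(4, 3),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

first_row = frame.iloc[0]            # Series indexed by the column labels b, d, e
by_row = frame - first_row           # matched on columns, repeated for every row

col_d = frame['d']                   # Series indexed by the row labels
by_col = frame.sub(col_d, axis=0)    # matched on the row index instead

print(by_row)
print(by_col)
```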
- Function mapping and application
Usually done by passing a lambda or a named function to apply.
#lambda
In [56]: frame
Out[56]:
b d e
Utah 0.073770 -0.264937 1.085603
Ohio 1.274547 0.820050 0.056422
Texas 1.346414 1.786314 -0.311222
Oregon 0.571323 -0.731404 0.502011
In [57]: f = lambda x : x.max() - x.min()
In [58]: frame.apply(f)
Out[58]:
b 1.272643
d 2.517719
e 1.396825
dtype: float64
In [59]: frame.apply(f,axis=1)
Out[59]:
Utah 1.350540
Ohio 1.218125
Texas 2.097536
Oregon 1.302727
dtype: float64
#f(x)
In [60]: def f(x):
...: return Series([x.min(),x.max()],index=['min','max'])
In [61]: frame.apply(f)
Out[61]:
b d e
min 0.073770 -0.731404 -0.311222
max 1.346414 1.786314 1.085603
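The examples above call the function once per column or once per row. For the "mapping" half of this section, where the function runs once per element, Series.map is one option. A small sketch follows; the frame is a cut-down, made-up version of the one above, and fmt is a hypothetical helper.

```python
import pandas as pd

frame = pd.DataFrame({'b': [0.07377, 1.274547], 'e': [1.085603, 0.056422]},
                     index=['Utah', 'Ohio'])

def fmt(x):
    """Format one number with two decimal places (hypothetical helper)."""
    return f'{x:.2f}'

e_text = frame['e'].map(fmt)                               # one call per element
spread = frame.apply(lambda col: col.max() - col.min())    # one call per column

print(e_text)
print(spread)
```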
- Summarizing and computing descriptive statistics
In [70]: df
Out[70]:
0 1 2
a 1.037884 0.932937 0.480702
a -1.453084 -1.039968 0.306588
b 0.352103 0.083231 -0.264383
b 0.628823 -0.454043 -0.993764
In [71]: df.describe()
Out[71]:
0 1 2
count 4.000000 4.000000 4.000000
mean 0.141432 -0.119461 -0.117714
std 1.099703 0.838233 0.665109
min -1.453084 -1.039968 -0.993764
25% -0.099194 -0.600524 -0.446728
50% 0.490463 -0.185406 0.021103
75% 0.731088 0.295658 0.350117
max 1.037884 0.932937 0.480702
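describe is the all-in-one summary; the individual reductions are also available as methods. This sketch, on a small made-up frame with some NaN values, shows the axis and skipna options as well as idxmax.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

col_sums = df.sum()                       # per column; NaN is skipped by default
row_means = df.mean(axis=1)               # per row; an all-NaN row gives NaN
strict = df.mean(axis=1, skipna=False)    # any NaN in the row makes the result NaN
top = df.idxmax()                         # row label of the maximum in each column

print(col_sums)
print(row_means)
print(strict)
print(top)
```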
- Handling missing data
In [89]: df1
Out[89]:
0 1 2
0 1.700089 NaN NaN
1 0.209934 NaN NaN
2 -1.300037 NaN NaN
3 -0.044868 NaN 1.712725
4 0.624518 NaN -0.559871
5 -1.036317 1.075744 1.267794
6 -0.201066 0.268681 -0.356206
In [90]: df1.fillna(0)
Out[90]:
0 1 2
0 1.700089 0.000000 0.000000
1 0.209934 0.000000 0.000000
2 -1.300037 0.000000 0.000000
3 -0.044868 0.000000 1.712725
4 0.624518 0.000000 -0.559871
5 -1.036317 1.075744 1.267794
6 -0.201066 0.268681 -0.356206
In [91]: df1
Out[91]:
0 1 2
0 1.700089 NaN NaN
1 0.209934 NaN NaN
2 -1.300037 NaN NaN
3 -0.044868 NaN 1.712725
4 0.624518 NaN -0.559871
5 -1.036317 1.075744 1.267794
6 -0.201066 0.268681 -0.356206
In [92]: df1.fillna({1:0.5,2:33})
Out[92]:
0 1 2
0 1.700089 0.500000 33.000000
1 0.209934 0.500000 33.000000
2 -1.300037 0.500000 33.000000
3 -0.044868 0.500000 1.712725
4 0.624518 0.500000 -0.559871
5 -1.036317 1.075744 1.267794
6 -0.201066 0.268681 -0.356206
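As the transcript shows, fillna returns a new object and leaves df1 unchanged. The sketch below makes that explicit on a small made-up frame and adds dropna, which discards incomplete rows instead of filling them.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({0: [1.0, 2.0, 3.0],
                    1: [np.nan, np.nan, 5.0],
                    2: [np.nan, 7.0, 8.0]})

filled = df1.fillna({1: 0.5, 2: 33})        # a different fill value per column
still_missing = df1.isna().sum().sum()      # the original still holds its 3 NaN values
complete_rows = df1.dropna()                # keep only rows without any NaN

df1 = df1.fillna(0)                         # rebind the name to keep the result

print(filled)
print(still_missing)
print(complete_rows)
```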
- Hierarchical indexing / MultiIndex
6.1. The basic idea is an index with more than one level on a single axis.
In [100]: data = Series(np.random.rand(10),index=[['a','a','a','b','b','b','c',
...: 'c','d','d'],[1,2,3,1,2,3,1,2,2,3]])
In [101]: data
Out[101]:
a 1 0.676413
2 0.623518
3 0.414257
b 1 0.434586
2 0.905924
3 0.726079
c 1 0.693546
2 0.708168
d 2 0.667362
3 0.789808
dtype: float64
In [102]: data.index
Out[102]:
MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 1, 2]])
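What the two levels buy you is partial indexing: you can select by the outer label alone, or by the inner label across all outer labels. A sketch with the same index as above, but with np.arange values instead of random ones so the results are easy to check:

```python
import numpy as np
import pandas as pd

data = pd.Series(np.arange(10.),
                 index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])

outer = data['b']          # everything under outer label 'b', indexed by the inner level
inner = data.loc[:, 2]     # inner label 2 under every outer label that has it

print(outer)
print(inner)
```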
6.2. unstack pivots the inner index level into columns, turning the Series into a DataFrame.
In [114]: data.unstack()
Out[114]:
1 2 3
a 0.676413 0.623518 0.414257
b 0.434586 0.905924 0.726079
c 0.693546 0.708168 NaN
d NaN 0.667362 0.789808
6.3. The inverse of unstack is stack.
In [115]: data.unstack().stack()
Out[115]:
a 1 0.676413
2 0.623518
3 0.414257
b 1 0.434586
2 0.905924
3 0.726079
c 1 0.693546
2 0.708168
d 2 0.667362
3 0.789808
dtype: float64
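By default unstack moves the innermost level; the level argument picks a different one. The sketch below uses a made-up Series in which every outer/inner combination is present, so no NaN appears and the round trip through stack reproduces the original.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(6.),
              index=[['a', 'a', 'a', 'b', 'b', 'b'], [1, 2, 3, 1, 2, 3]])

wide = s.unstack()                 # inner level (1, 2, 3) becomes the columns
wide_outer = s.unstack(level=0)    # outer level ('a', 'b') becomes the columns
back = wide.stack()                # columns go back to being the inner level

print(wide)
print(wide_outer)
print(back)
```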
6.4. Either axis of a DataFrame can have a hierarchical index.
In [118]: frame =DataFrame(np.arange(12).reshape(4,3),
...: index = [['a','a','b','b'],[1,2,1,2]],
...: columns = [['city1','city1','city2'],['G','R','G']])
In [120]: frame
Out[120]:
city1 city2
G R G
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
In [121]: frame.index.names = ['key1','key2']
In [122]: frame.columns.names = ['citys','color']
In [123]: frame
Out[123]:
citys city1 city2
color G R G
key1 key2
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
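With levels on both axes, selection works the same way as for a Series: an outer column label gives a sub-frame, and a pair of tuples addresses a single cell. A sketch that rebuilds the frame above and adds swaplevel to reorder the named index levels:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape(4, 3),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['city1', 'city1', 'city2'], ['G', 'R', 'G']])
frame.index.names = ['key1', 'key2']
frame.columns.names = ['citys', 'color']

city1 = frame['city1']                        # sub-frame with columns G and R
cell = frame.loc[('b', 1), ('city1', 'R')]    # single value: row (b, 1), column (city1, R)
swapped = frame.swaplevel('key1', 'key2')     # exchange the two row-index levels

print(city1)
print(cell)
print(swapped)
```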
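- Using DataFrame columns as the index
7.1. set_index moves one or more columns into the index; pass drop=False to keep the original columns as well.
7.2. reset_index undoes 7.1., moving the index levels back into columns (placed first).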
#7.1. set_index
In [134]: f
Out[134]:
a b c d
0 0 7 one 0
1 1 6 one 1
2 2 5 one 2
3 3 4 two 0
4 4 3 two 1
5 5 2 two 2
6 6 1 two 3
In [135]: f.set_index(['c','d'])
Out[135]:
a b
c d
one 0 0 7
1 1 6
2 2 5
two 0 3 4
1 4 3
2 5 2
3 6 1
In [136]: f.set_index(['c','d'],drop=False)
Out[136]:
a b c d
c d
one 0 0 7 one 0
1 1 6 one 1
2 2 5 one 2
two 0 3 4 two 0
1 4 3 two 1
2 5 2 two 2
3 6 1 two 3
# 7.2. reset_index example
In [137]: frame2= f.set_index(['c','d'])
In [139]: frame2
Out[139]:
a b
c d
one 0 0 7
1 1 6
2 2 5
two 0 3 4
1 4 3
2 5 2
3 6 1
In [140]: frame2.reset_index()
Out[140]:
c d a b
0 one 0 0 7
1 one 1 1 6
2 one 2 2 5
3 two 0 3 4
4 two 1 4 3
5 two 2 5 2
6 two 3 6 1
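The round trip can be checked directly. This sketch rebuilds the frame f from above and confirms that set_index followed by reset_index gives back the same data, with only the column order changed.

```python
import pandas as pd

f = pd.DataFrame({'a': range(7),
                  'b': range(7, 0, -1),
                  'c': ['one'] * 3 + ['two'] * 4,
                  'd': [0, 1, 2, 0, 1, 2, 3]})

frame2 = f.set_index(['c', 'd'])       # c and d become the two index levels
restored = frame2.reset_index()        # c and d come back as the first two columns

# Reorder the columns to the original order before comparing.
same_data = restored[f.columns].equals(f)

print(restored)
print(same_data)
```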
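- Panel data / the three-dimensional DataFrame
The book mentions that it is rarely used and can usually be reduced to two dimensions.
To me, this pandas functionality also feels a lot like Excel's VBA. Languages really are all alike: underneath it is matrices and logic. I'll look things up in a reference book when I need them.
By the way, data analysis turns out to be handy for troubleshooting too, which I never expected.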
2018.7.20