A Summary of the Python pandas Package

Last updated on Mar 6, 2020 9 min read Python

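numpy and pandas are the two basic tools for data analysis in Python. numpy is used mostly for linear algebra, while pandas is used mainly to analyze tabular data. Both numpy and pandas provide one-dimensional and two-dimensional data structures.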

1 Installing `pandas`

pandas can be installed quickly with pip.

Run the following in a terminal.

pip install pandas

import pandas as pd

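The numpy package is required as well; pip installs it automatically as a dependency of pandas.

You can check the version of each package.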

pd.__version__

## '0.24.2'

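2 Introduction to pandas data structures

2.1 Series

A Series is a one-dimensional array, similar to a vector in R. It is built with the Series() constructor, and a list or a dictionary can be converted into one. A dictionary is looked up by key only, but once it is converted to a Series the elements have a fixed order, so they can be accessed by position as well as by label. This makes a Series even more like an R vector.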

data = pd.Series([1,2,3,4])
data

## 0    1
## 1    2
## 2    3
## 3    4
## dtype: int64

population_dict = {'California': 38332521,
'Texas': 26448193,
'New York': 19651127,
'Florida': 19552860,
'Illinois': 12882135}
data3 = pd.Series(population_dict)
data3

## California    38332521
## Texas         26448193
## New York      19651127
## Florida       19552860
## Illinois      12882135
## dtype: int64

data3[0]

## 38332521
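
The plain integer lookup `data3[0]` relies on pandas falling back to positional access when the index holds strings, and recent pandas releases emit a deprecation warning for this fallback. A minimal sketch of the explicit alternatives, assuming a shortened copy of the population data (the variable names here are illustrative only):

```python
import pandas as pd

# Shortened copy of the population data, rebuilt so the snippet runs on its own.
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127})

by_label = population.loc['Texas']   # label-based lookup
by_position = population.iloc[1]     # position-based lookup (second element)
print(by_label, by_position)
```

Both lookups return the same value; `loc` and `iloc` simply state which kind of access is meant.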

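A Series carries its own index. In the printed output, the first "column" is the index and the second is the values. Both are attributes of the Series (index and values), so they can be accessed directly.

The default index is an integer index starting at 0, but a custom index can be supplied as well.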

data2 = pd.Series([0,1,2,3], index=['a', 'b', 'c', 'd'])
data2

## a    0
## b    1
## c    2
## d    3
## dtype: int64

index = data.index
value = data.values
index

## RangeIndex(start=0, stop=4, step=1)

value

## array([1, 2, 3, 4], dtype=int64)

type(index)

## <class 'pandas.core.indexes.range.RangeIndex'>

type(value)

## <class 'numpy.ndarray'>

As with vectors in R, square brackets are used to index and slice a Series. When the index consists of strings, slicing by label is supported too.

data[0]

## 1

data[1:3]

## 1    2
## 2    3
## dtype: int64

data2['a']

## 0

data3['California':'Illinois']

## California    38332521
## Texas         26448193
## New York      19651127
## Florida       19552860
## Illinois      12882135
## dtype: int64

2.2 DataFrame

A DataFrame closely resembles a data frame (or a matrix) in R: it holds two-dimensional data.

A DataFrame can be created from a dictionary.

area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area

## California    423967
## Texas         695662
## New York      141297
## Florida       170312
## Illinois      149995
## dtype: int64

data3

## California    38332521
## Texas         26448193
## New York      19651127
## Florida       19552860
## Illinois      12882135
## dtype: int64

area

## California    423967
## Texas         695662
## New York      141297
## Florida       170312
## Illinois      149995
## dtype: int64

type(data3)

## <class 'pandas.core.series.Series'>

type(area)

## <class 'pandas.core.series.Series'>

states = pd.DataFrame({'population': data3,
'area': area})

states

##             population    area
## California    38332521  423967
## Texas         26448193  695662
## New York      19651127  141297
## Florida       19552860  170312
## Illinois      12882135  149995

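This is a DataFrame with two columns, whose column names are the keys of the original dictionary.

A DataFrame also has index and values attributes.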

states.index

## Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')

states.values

## array([[38332521,   423967],
##        [26448193,   695662],
##        [19651127,   141297],
##        [19552860,   170312],
##        [12882135,   149995]], dtype=int64)

states.columns

## Index(['population', 'area'], dtype='object')

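The columns attribute holds the column names of the DataFrame, and the index attribute holds its row names.

2.2.1 Indexing and slicing a DataFrame

How are DataFrames indexed and sliced?

Selecting rows of a DataFrame.

Use square brackets with an integer slice. As elsewhere in Python, the start is included and the stop is excluded. The result of selecting rows is still a DataFrame; no dimension is dropped.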

states[0:1]

##             population    area
## California    38332521  423967

states[0:2]

##             population    area
## California    38332521  423967
## Texas         26448193  695662

type(states[0:1])

## <class 'pandas.core.frame.DataFrame'>

Selecting columns of a DataFrame.

states["area"]

## California    423967
## Texas         695662
## New York      141297
## Florida       170312
## Illinois      149995
## Name: area, dtype: int64

type(states["area"])

## <class 'pandas.core.series.Series'>

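To extract a single column, select it by its column name. The result drops a dimension and becomes a Series.

More complex slicing requires the DataFrame indexers loc and iloc. The difference is that loc slices and indexes by row and column labels, while iloc uses integer positions. Both are followed by square brackets, in which the first argument selects rows and the second selects columns.

A single row or column, or a contiguous slice (written with a colon), does not need to be written as a list, but several separate rows or columns must be passed as a list.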

states.loc

## <pandas.core.indexing._LocIndexer object at 0x00000000521F5318>

states.loc["California", ["area"]]

## area    423967
## Name: California, dtype: int64

type(states.loc["California", ["area"]])

## <class 'pandas.core.series.Series'>

states.loc["California":"Texas", ["area"]]

##               area
## California  423967
## Texas       695662

type(states.loc["California":"Texas", ["area"]])

## <class 'pandas.core.frame.DataFrame'>

states.iloc

## <pandas.core.indexing._iLocIndexer object at 0x00000000533D84A8>

states.iloc[0,0]

## 38332521

type(states.iloc[0,0])

## <class 'numpy.int64'>

states.iloc[0:1,0:1]

##             population
## California    38332521

states.iloc[[0,2],[0,1]]

##             population    area
## California    38332521  423967
## New York      19651127  141297

type(states.iloc[[0,2],[0,1]])

## <class 'pandas.core.frame.DataFrame'>
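
The rule about scalars versus lists also decides the type of the result. The following is a small sketch on a two-row copy of the states data (rebuilt here so the snippet is self-contained); it illustrates the behaviour rather than listing every case.

```python
import pandas as pd

states = pd.DataFrame(
    {'population': [38332521, 26448193], 'area': [423967, 695662]},
    index=['California', 'Texas'])

scalar = states.loc['California', 'area']      # two scalars -> a single value
row = states.loc['California', ['area']]       # scalar row, list of columns -> Series
frame = states.loc[['California'], ['area']]   # two lists -> DataFrame

print(type(scalar).__name__, type(row).__name__, type(frame).__name__)
```

Each scalar selector drops one dimension, while a list (even of length one) keeps it.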

2.3 The Index object

Both Series and DataFrame have an Index object that labels the data; for a DataFrame it holds the row names.

ind = pd.Index([2, 3, 5, 7, 11])
ind

## Int64Index([2, 3, 5, 7, 11], dtype='int64')

type(ind)

## <class 'pandas.core.indexes.numeric.Int64Index'>

2.3.1 Index as immutable array

An Index object can be handled like an array, for example with indexing and slicing.

ind

## Int64Index([2, 3, 5, 7, 11], dtype='int64')

ind[0]

## 2

ind[:2]

## Int64Index([2, 3], dtype='int64')

An Index object also has many of the attributes of an array.

print(ind.size, ind.shape, ind.ndim, ind.dtype)

## 5 (5,) 1 int64

Here ndim gives the number of dimensions. The number of rows and columns of a DataFrame is given by shape.
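
A short sketch of what ndim and shape report for a DataFrame, using a small made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'area': [423967, 695662, 141297],
                   'pop': [38332521, 26448193, 19651127]})

print(df.ndim)    # number of dimensions
print(df.shape)   # (number of rows, number of columns)
n_rows, n_cols = df.shape
```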

However, the values of an Index object cannot be changed, which is where it differs from an array.
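
The immutability can be checked directly. This is a minimal sketch; `raised` and `new_ind` are names introduced only for illustration, and `Index.insert` is shown as one possible way to obtain a modified copy.

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

raised = False
try:
    ind[1] = 0              # item assignment is not supported on an Index
except TypeError as err:
    raised = True
    print('TypeError:', err)

# A changed index has to be built as a new object instead.
new_ind = ind.insert(1, 4)  # returns a new Index; ind itself is unchanged
print(list(ind), list(new_ind))
```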

2.3.2 Index as an ordered set

In many ways an Index behaves like Python's built-in set: it supports union, intersection, symmetric difference and similar operations.

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])
indA & indB # intersection

## Int64Index([3, 5, 7], dtype='int64')

indA | indB # union

## Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')

indA ^ indB # symmetric difference: elements found in only one of the two

## Int64Index([1, 2, 9, 11], dtype='int64')
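
The operator forms above reflect the pandas version used in this post (0.24.2). According to the pandas release notes, newer releases deprecated them and later changed `&`, `|` and `^` on an Index to element-wise logical operations. The method forms below are a sketch of the same three operations that should not depend on the version:

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

common = indA.intersection(indB)            # elements present in both
combined = indA.union(indB)                 # elements present in either
only_one = indA.symmetric_difference(indB)  # elements present in exactly one
print(list(common), list(combined), list(only_one))
```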

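3 Indexing and selection

This section covers indexing and selection on pandas Series and DataFrame objects. The previous section introduced them briefly; here they are described systematically: indexing, slicing, masking and so on.

3.1 Data selection in a Series

A Series is very similar to Python's built-in dictionary, and also very similar to a NumPy array.

3.1.1 Series as a dictionary

Values can be selected by key.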

import pandas as pd
data = pd.Series([0.25, 0.5, 0.75, 1.0],
index=['a', 'b', 'c', 'd'])

data["a"]

## 0.25

Other dictionary-style operations work on a Series as well.

'a' in data

## True

data.keys()

## Index(['a', 'b', 'c', 'd'], dtype='object')

list(data.items())

## [('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

data["e"] = 1.25
data

## a    0.25
## b    0.50
## c    0.75
## d    1.00
## e    1.25
## dtype: float64

3.1.2 Series as a one-dimensional array

A Series can be sliced in the same ways as an array.

data['a':'c']

## a    0.25
## b    0.50
## c    0.75
## dtype: float64

data[0:2]

## a    0.25
## b    0.50
## dtype: float64

data > 0.3

## a    False
## b     True
## c     True
## d     True
## e     True
## dtype: bool

data < 0.8

## a     True
## b     True
## c     True
## d    False
## e    False
## dtype: bool

data + 1

## a    1.25
## b    1.50
## c    1.75
## d    2.00
## e    2.25
## dtype: float64

data[(data > 0.3) & (data < 0.8)]
## non-contiguous selection: see data[['a', 'e']] below

## b    0.50
## c    0.75
## dtype: float64

data[['a', 'e']]

## a    0.25
## e    1.25
## dtype: float64

Note that when slicing by key (label), the last element is included in the result, whereas when slicing by integer position, the last element is excluded.
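
The two slicing conventions can be put side by side. This sketch rebuilds the four-element Series and uses loc and iloc to make explicit which convention applies:

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

by_label = data.loc['a':'c']     # label slice: 'c' is included
by_position = data.iloc[0:2]     # positional slice: position 2 is excluded
print(list(by_label.index), list(by_position.index))
```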

3.1.3 Indexers: `loc`, `iloc` and `ix`

One confusing point with a Series: if the index consists of integers, indexing with a single integer uses the index labels, but slicing with integers uses positions. For example:

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data
# explicit index when indexing

## 1    a
## 3    b
## 5    c
## dtype: object

data[1]

## 'a'

data[1:3]

## 3    b
## 5    c
## dtype: object

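To avoid this confusion when selecting data, a Series also provides indexer attributes for indexing and slicing.

With loc, indexing and slicing always refer to the index labels, not to positions.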

data.loc[1]

## 'a'

data.loc[1:3]

## 1    a
## 3    b
## dtype: object

data.loc[[1,3]]

## 1    a
## 3    b
## dtype: object

With iloc, indexing and slicing always refer to positions.

data.iloc[1]

## 'b'

data.iloc[1:3]

## 3    b
## 5    c
## dtype: object

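The third indexer, ix, is a hybrid of the two. For a Series it behaves like standard []-based indexing; it is mainly intended for DataFrame objects. Note that ix is deprecated and was removed in pandas 1.0, so it applies only to old versions such as the 0.24.2 used in this post.

Following the Python principle that "explicit is better than implicit", it is better to state which kind of indexing is meant, using loc for labels and iloc for positions.

3.2 Data selection in a DataFrame

In most cases a DataFrame behaves like a two-dimensional array; at other times it can be treated as a dictionary.

3.2.1 DataFrame as a dictionary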

area = pd.Series({'California': 423967, 'Texas': 695662,
'New York': 141297, 'Florida': 170312,
'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
'New York': 19651127, 'Florida': 19552860,
'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':pop})
data

##               area       pop
## California  423967  38332521
## Texas       695662  26448193
## New York    141297  19651127
## Florida     170312  19552860
## Illinois    149995  12882135

data.ndim

## 2

data.columns

## Index(['area', 'pop'], dtype='object')

data.index

## Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')

A column can be retrieved directly by its column name. The result is a Series.

data['area']

## California    423967
## Texas         695662
## New York      141297
## Florida       170312
## Illinois      149995
## Name: area, dtype: int64

type(data['area'])

## <class 'pandas.core.series.Series'>

Attribute-style access can also be used to retrieve a column.

data.area

## California    423967
## Texas         695662
## New York      141297
## Florida       170312
## Illinois      149995
## Name: area, dtype: int64

type(data.area)

## <class 'pandas.core.series.Series'>

The result is again a Series.

Both forms return exactly the same object.

data.area is data['area']

## True

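This approach has drawbacks, however. It does not work when, for example:

the column name is not a string, such as a number;
the column name clashes with a built-in DataFrame method or attribute.

If a column name conflicts with a method, dot access returns the method rather than the column.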

data.pop is data['pop']

## False
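
The clash can be inspected directly. This is a small sketch on a two-row frame; the column name 'pop density' is made up here to show a name that could never be reached with dot access.

```python
import pandas as pd

data = pd.DataFrame({'area': [423967, 695662], 'pop': [38332521, 26448193]},
                    index=['California', 'Texas'])

print(callable(data.pop))            # data.pop is the DataFrame.pop method
print(type(data['pop']).__name__)    # data['pop'] is the column, a Series

# Bracket access also works for names that are not valid attributes,
# such as names containing spaces.
data['pop density'] = data['pop'] / data['area']
print(list(data.columns))
```

For this reason bracket access is the safer default, especially when assigning new columns.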

data['density'] = data['pop'] / data['area']
data

##               area       pop     density
## California  423967  38332521   90.413926
## Texas       695662  26448193   38.018740
## New York    141297  19651127  139.076746
## Florida     170312  19552860  114.806121
## Illinois    149995  12882135   85.883763

3.2.2 DataFrame as a two-dimensional array

data.values

## array([[4.23967000e+05, 3.83325210e+07, 9.04139261e+01],
##        [6.95662000e+05, 2.64481930e+07, 3.80187404e+01],
##        [1.41297000e+05, 1.96511270e+07, 1.39076746e+02],
##        [1.70312000e+05, 1.95528600e+07, 1.14806121e+02],
##        [1.49995000e+05, 1.28821350e+07, 8.58837628e+01]])

A DataFrame supports array-like operations, such as transposing:

data.T

##            California         Texas      New York       Florida      Illinois
## area     4.239670e+05  6.956620e+05  1.412970e+05  1.703120e+05  1.499950e+05
## pop      3.833252e+07  2.644819e+07  1.965113e+07  1.955286e+07  1.288214e+07
## density  9.041393e+01  3.801874e+01  1.390767e+02  1.148061e+02  8.588376e+01

T is an attribute, not a method, so no parentheses are needed.

data.values[0]

## array([4.23967000e+05, 3.83325210e+07, 9.04139261e+01])

type(data.values[0])

## <class 'numpy.ndarray'>

data['area']

## California    423967
## Texas         695662
## New York      141297
## Florida       170312
## Illinois      149995
## Name: area, dtype: int64

type(data['area'])

## <class 'pandas.core.series.Series'>

Selecting with the loc, iloc and ix indexers is more efficient.

data.iloc[:3, :2]

##               area       pop
## California  423967  38332521
## Texas       695662  26448193
## New York    141297  19651127

data.loc[:'Illinois', :'pop']

##               area       pop
## California  423967  38332521
## Texas       695662  26448193
## New York    141297  19651127
## Florida     170312  19552860
## Illinois    149995  12882135

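The ix indexer allows mixed slicing, that is, integer positions for the rows and labels for the columns (or the other way round). As the warning in the output below says, it is deprecated in favour of loc and iloc.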

data.ix[:3, :'pop']

##               area       pop
## California  423967  38332521
## Texas       695662  26448193
## New York    141297  19651127
## 
## D:\software\python\python.exe:1: DeprecationWarning: 
## .ix is deprecated. Please use
## .loc for label based indexing or
## .iloc for positional indexing
## 
## See the documentation here:
## http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
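
Since the warning recommends loc and iloc, the mixed selection `data.ix[:3, :'pop']` can be rewritten in either style. The following is one possible sketch, with the states data rebuilt so the snippet runs on its own; `via_loc`, `via_iloc` and `last_col` are illustrative names.

```python
import pandas as pd

data = pd.DataFrame(
    {'area': [423967, 695662, 141297, 170312, 149995],
     'pop': [38332521, 26448193, 19651127, 19552860, 12882135]},
    index=['California', 'Texas', 'New York', 'Florida', 'Illinois'])
data['density'] = data['pop'] / data['area']

# Option 1: turn the row positions into labels, then use loc.
via_loc = data.loc[data.index[:3], :'pop']

# Option 2: turn the column label into a position, then use iloc.
last_col = data.columns.get_loc('pop')     # integer position of 'pop'
via_iloc = data.iloc[:3, :last_col + 1]    # +1 because iloc excludes the stop

print(via_loc.equals(via_iloc))
```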

data.density

## California     90.413926
## Texas          38.018740
## New York      139.076746
## Florida       114.806121
## Illinois       85.883763
## Name: density, dtype: float64

data.density > 100

## California    False
## Texas         False
## New York       True
## Florida        True
## Illinois      False
## Name: density, dtype: bool

data.loc[data.density > 100, ['pop', 'density']]

##                pop     density
## New York  19651127  139.076746
## Florida   19552860  114.806121

The same indexers can also be used to modify a DataFrame.

data.iloc[0, 2] = 90
data

##               area       pop     density
## California  423967  38332521   90.000000
## Texas       695662  26448193   38.018740
## New York    141297  19651127  139.076746
## Florida     170312  19552860  114.806121
## Illinois    149995  12882135   85.883763
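
Assignment works with labels and boolean masks as well as positions. A minimal sketch on a two-row frame; the threshold of 50 is an arbitrary value chosen for illustration.

```python
import pandas as pd

data = pd.DataFrame({'area': [423967, 695662],
                     'density': [90.413926, 38.018740]},
                    index=['California', 'Texas'])

data.loc['California', 'density'] = 90           # one cell, addressed by labels
data.loc[data['density'] < 50, 'density'] = 50   # every row matching a mask
print(data)
```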

4 Operating on data in pandas

4.1 Index alignment in Series

import numpy as np
import pandas as pd
area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193,
'New York': 19651127}, name='population')
population / area

## Alaska              NaN
## California    90.413926
## New York            NaN
## Texas         38.018740
## dtype: float64

area.index | population.index

## Index(['Alaska', 'California', 'New York', 'Texas'], dtype='object')

Series can also be added together.

A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
A + B

## 0    NaN
## 1    5.0
## 2    9.0
## 3    NaN
## dtype: float64

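When adding, values with the same label are summed. If a label is present on only one side, the result for that label is NaN.

A method can be used in place of the operator; its fill_value argument supplies a value for entries that are missing on one side.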

A.add(B, fill_value=0)

## 0    2.0
## 1    5.0
## 2    9.0
## 3    5.0
## dtype: float64
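
Alignment follows the labels, not the order of the elements. A small sketch with two made-up Series that hold the same labels in opposite order:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['c', 'b', 'a'])

total = s1 + s2    # 'a' pairs with 'a', not with the first element of s2
print(total)
```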

4.2 Index alignment in DataFrame

Likewise, DataFrames can be added; the values are aligned on both row and column labels.

rng = np.random.RandomState(42)
ser = pd.Series(rng.randint(0, 10, 4))
A = pd.DataFrame(rng.randint(0, 20, (2, 2)),
columns=list('AB'))
A

##     A   B
## 0   6  18
## 1  10  10

B = pd.DataFrame(rng.randint(0, 10, (3, 3)),
columns=list('BAC'))
B

##    B  A  C
## 0  7  4  3
## 1  7  7  2
## 2  5  4  1

A + B

##       A     B   C
## 0  10.0  25.0 NaN
## 1  17.0  17.0 NaN
## 2   NaN   NaN NaN

The add method can be used here as well. In this example the fill value is the mean of all values in A.

A.stack()

## 0  A     6
##    B    18
## 1  A    10
##    B    10
## dtype: int32

fill = A.stack().mean()
fill

## 11.0

A.add(B, fill_value=fill)

##       A     B     C
## 0  10.0  25.0  14.0
## 1  17.0  17.0  13.0
## 2  15.0  16.0  12.0

The following table lists Python operators and their equivalent pandas object methods:

Python operator	Pandas method(s)
+	add()
-	sub(), subtract()
*	mul(), multiply()
/	truediv(), div(), divide()
//	floordiv()
%	mod()
**	pow()

4.3 Operations between a DataFrame and a Series

A = rng.randint(10, size=(3, 4))
A

## array([[7, 5, 1, 4],
##        [0, 9, 5, 8],
##        [0, 9, 2, 6]])

type(A)

## <class 'numpy.ndarray'>

A[0]

## array([7, 5, 1, 4])

A - A[0]

## array([[ 0,  0,  0,  0],
##        [-7,  4,  4,  4],
##        [-7,  4,  1,  2]])

df = pd.DataFrame(A, columns=list('QRST'))
df

##    Q  R  S  T
## 0  7  5  1  4
## 1  0  9  5  8
## 2  0  9  2  6

df.iloc[0]  # first row

## Q    7
## R    5
## S    1
## T    4
## Name: 0, dtype: int32

df - df.iloc[0]

##    Q  R  S  T
## 0  0  0  0  0
## 1 -7  4  4  4
## 2 -7  4  1  2
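
Subtracting a row works because the labels of the Series match the column labels of the DataFrame. To operate column-wise instead, the method form takes an axis argument. A sketch using the same values as df above, typed in by hand so the result does not depend on the random state:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[7, 5, 1, 4],
                            [0, 9, 5, 8],
                            [0, 9, 2, 6]]), columns=list('QRST'))

row_wise = df - df.iloc[0]            # Series labels match the column labels
col_wise = df.sub(df['R'], axis=0)    # align on the row index instead
print(col_wise)
```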

5 Importing data

As in R, pandas can read local data in a variety of formats. The functions are summarized below:

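Function	Meaning
`pd.read_csv(filename)`	Import data from a CSV file
`pd.read_table(filename)`	Import data from a delimited text file
`pd.read_excel(filename)`	Import data from an Excel file
`pd.read_sql(query, connection_object)`	Import data from a SQL table or database
`pd.read_json(json_string)`	Import data from a JSON-formatted string
`pd.read_html(url)`	Parse a URL, string or HTML file and extract its tables
`pd.read_clipboard()`	Take the clipboard contents and pass them to read_table()
`pd.DataFrame(dict)`	Build a DataFrame from a dictionary: keys are column names, values are the data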
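
As a minimal sketch of `pd.read_csv`, the example below reads CSV text held in memory rather than a file on disk. The CSV content and the choice of `state` as the index column are assumptions made for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content, kept in memory so no file on disk is needed.
csv_text = "state,area,pop\nCalifornia,423967,38332521\nTexas,695662,26448193\n"

df = pd.read_csv(io.StringIO(csv_text), index_col='state')
print(df)
print(df.dtypes)
```

With a real file, the path is passed in place of the `io.StringIO` object.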

As in R, the full list of arguments of each function, with their meaning, is available through the help() function.

import os
os.getcwd()

## 'D:\\my github\\shen\\content\\en\\post\\2019-11-23-python-pandas-summary'

test = {"a":1, "b":2}
test

## {'a': 1, 'b': 2}

test2 = pd.DataFrame(test, index = [1,2])
# help(pd.DataFrame)
test2

##    a  b
## 1  1  2
## 2  1  2
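
The index argument is needed above because the dictionary holds only scalars. The sketch below contrasts this with a dictionary of lists; `needs_index` is an illustrative flag, not part of pandas.

```python
import pandas as pd

# Values that are lists become columns; no explicit index is required.
from_lists = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# With scalar values only, pandas cannot infer the number of rows.
try:
    pd.DataFrame({'a': 1, 'b': 2})
    needs_index = False
except ValueError:
    needs_index = True

print(from_lists)
print(needs_index)
```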

Xiaotao Shen

Postdoctoral Research Fellow

Metabolomics, Multi-omics, Bioinformatics, Systems Biology.