Python中与dplyr类似的数据处理包

最近更新于 Mar 6, 2020 Python

1 dplython
- 1.1 安装
- 1.2 简单介绍
2 dfply
- 2.1 安装
- 2.2 函数概览

1 dplython

dplython是python借鉴R中dplyr包开发的,用于数据tidy的包.基本语法跟R非常像,而且还支持管道函数(操作符>>).对于我这种从R转python的人来说,用起来会更加熟悉和方便.

官网在这.

1.1 安装

pip install dplython

也可以从github上安装最新开发版本.

pip install git+https://github.com/dodger487/dplython.git

1.2 简单介绍

加载所需要的包.

import pandas as pd
import numpy as np
import dplython as dply
from dplython import (DplyFrame, X, diamonds, select, sift,
  sample_n, sample_frac, head, arrange, mutate, group_by,
  summarize, DelayFunction)

支持管道符操作,但是如果要将每一个操作写作一行,需要使用小括号将其括起来.

diamonds >> head

##    Unnamed: 0  carat      cut color clarity  ...  table  price     x     y     z
## 0           1   0.23    Ideal     E     SI2  ...   55.0    326  3.95  3.98  2.43
## 1           2   0.21  Premium     E     SI1  ...   61.0    326  3.89  3.84  2.31
## 2           3   0.23     Good     E     VS1  ...   65.0    327  4.05  4.07  2.31
## 3           4   0.29  Premium     I     VS2  ...   58.0    334  4.20  4.23  2.63
## 4           5   0.31     Good     J     SI2  ...   58.0    335  4.34  4.35  2.75
## 
## [5 rows x 11 columns]

(
diamonds >>
head(5)
)

##    Unnamed: 0  carat      cut color clarity  ...  table  price     x     y     z
## 0           1   0.23    Ideal     E     SI2  ...   55.0    326  3.95  3.98  2.43
## 1           2   0.21  Premium     E     SI1  ...   61.0    326  3.89  3.84  2.31
## 2           3   0.23     Good     E     VS1  ...   65.0    327  4.05  4.07  2.31
## 3           4   0.29  Premium     I     VS2  ...   58.0    334  4.20  4.23  2.63
## 4           5   0.31     Good     J     SI2  ...   58.0    335  4.34  4.35  2.75
## 
## [5 rows x 11 columns]

再dplyr中,进行行筛选的函数为filter(),但是在dplython中,函数名为sift(),暂时不知道为什么要改名,猜想可能是由于和python自带函数冲突.

列名不能直接引用,需要使用dplython中的X来引用.

X.foo will refer to the “foo” column of the DataFrame in which it is later applied.

(
diamonds >> 
sift(X.carat > 4) >> 
select(X.carat, X.cut, X.depth, X.price)
)

##        carat      cut  depth  price
## 25998   4.01  Premium   61.0  15223
## 25999   4.01  Premium   62.5  15223
## 27130   4.13     Fair   64.8  17329
## 27415   5.01     Fair   65.5  18018
## 27630   4.50     Fair   65.8  18531

(diamonds >>
  sample_n(10) >>
  arrange(X.carat) >>
  select(X.carat, X.cut, X.depth, X.price))

##        carat        cut  depth  price
## 34904   0.34    Premium   59.2    880
## 41186   0.39      Ideal   61.6   1201
## 40502   0.43  Very Good   60.7   1143
## 52362   0.54      Ideal   61.1   2502
## 53392   0.77       Good   63.8   2665
## 6651    0.90    Premium   62.7   4092
## 6643    0.90      Ideal   62.7   4091
## 5532    0.92  Very Good   62.1   3854
## 9229    1.02    Premium   61.9   4558
## 27546   2.05    Premium   61.8  18318

(diamonds >>
  mutate(carat_bin=X.carat.round()) >>
  group_by(X.cut, X.carat_bin) >>
  summarize(avg_price=X.price.mean()))

##           cut  carat_bin     avg_price
## 0        Fair        0.0   1027.979275
## 1        Fair        1.0   3305.754579
## 2        Fair        2.0   9588.269737
## 3        Fair        3.0  13466.823529
## 4        Fair        4.0  15842.666667
## 5        Fair        5.0  18018.000000
## 6        Good        0.0    786.054191
## 7        Good        1.0   3815.307879
## 8        Good        2.0  11096.950321
## 9        Good        3.0  15924.176471
## 10      Ideal        0.0    863.908535
## 11      Ideal        1.0   4213.864948
## 12      Ideal        2.0  12838.984078
## 13      Ideal        3.0  16156.681818
## 14      Ideal        4.0  12587.000000
## 15    Premium        0.0    863.329085
## 16    Premium        1.0   4382.906453
## 17    Premium        2.0  12337.020064
## 18    Premium        3.0  15636.047619
## 19    Premium        4.0  16335.000000
## 20  Very Good        0.0    766.354590
## 21  Very Good        1.0   4135.271007
## 22  Very Good        2.0  12281.851661
## 23  Very Good        3.0  15053.555556
## 24  Very Good        4.0  15984.000000

需要注意的是,使用管道函数,第一行为要处理的数据,第二行开始,都要缩进.

但是作者已经3年没有维护这个包了,所以感觉有点悬,有可能废掉了,所以还是别用了.

2 dfply

另外一个将dplyr移植到python的包是dfply,可能更加可靠一些.

官网如下:

https://github.com/kieferk/dfply#the--and--pipe-operators

https://github.com/kieferk/dfply

2.1 安装

pip install dfply

2.2 函数概览

2.2.1 `>>`和`>>=`管道操作符

在R中,管道操作符为%>%,在python中,改为>>.

from dfply import *
diamonds >> head(5)

##    carat      cut color clarity  depth  table  price     x     y     z
## 0   0.23    Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
## 1   0.21  Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31
## 2   0.23     Good     E     VS1   56.9   65.0    327  4.05  4.07  2.31
## 3   0.29  Premium     I     VS2   62.4   58.0    334  4.20  4.23  2.63
## 4   0.31     Good     J     SI2   63.3   58.0    335  4.34  4.35  2.75

(
diamonds >>
  head(5)
)

##    carat      cut color clarity  depth  table  price     x     y     z
## 0   0.23    Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
## 1   0.21  Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31
## 2   0.23     Good     E     VS1   56.9   65.0    327  4.05  4.07  2.31
## 3   0.29  Premium     I     VS2   62.4   58.0    334  4.20  4.23  2.63
## 4   0.31     Good     J     SI2   63.3   58.0    335  4.34  4.35  2.75

同样的,如果每一个操作都开始新的一行的话,需要使用小括号将他们括起来.

将处理之后的数据框赋予一个新的变量名的时候,需要注意变量名等号和小括号的左半部分需要在同一行.

new = (
diamonds >>
  tail(5)
)
new

##        carat        cut color clarity  depth  table  price     x     y     z
## 53935   0.72      Ideal     D     SI1   60.8   57.0   2757  5.75  5.76  3.50
## 53936   0.72       Good     D     SI1   63.1   55.0   2757  5.69  5.75  3.61
## 53937   0.70  Very Good     D     SI1   62.8   60.0   2757  5.66  5.68  3.56
## 53938   0.86    Premium     H     SI2   61.0   58.0   2757  6.15  6.12  3.74
## 53939   0.75      Ideal     D     SI2   62.2   55.0   2757  5.83  5.87  3.64

另外一个管道操作符>>=意味着对原始数据直接进行修改后付给这个变量.

diamonds2 = diamonds.copy()
diamonds2 >> head(5)

##    carat      cut color clarity  depth  table  price     x     y     z
## 0   0.23    Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
## 1   0.21  Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31
## 2   0.23     Good     E     VS1   56.9   65.0    327  4.05  4.07  2.31
## 3   0.29  Premium     I     VS2   62.4   58.0    334  4.20  4.23  2.63
## 4   0.31     Good     J     SI2   63.3   58.0    335  4.34  4.35  2.75

diamonds2 >>= head(5)
diamonds2

##    carat      cut color clarity  depth  table  price     x     y     z
## 0   0.23    Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
## 1   0.21  Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31
## 2   0.23     Good     E     VS1   56.9   65.0    327  4.05  4.07  2.31
## 3   0.29  Premium     I     VS2   62.4   58.0    334  4.20  4.23  2.63
## 4   0.31     Good     J     SI2   63.3   58.0    335  4.34  4.35  2.75

diamonds >> head(10)

##    carat        cut color clarity  depth  table  price     x     y     z
## 0   0.23      Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
## 1   0.21    Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31
## 2   0.23       Good     E     VS1   56.9   65.0    327  4.05  4.07  2.31
## 3   0.29    Premium     I     VS2   62.4   58.0    334  4.20  4.23  2.63
## 4   0.31       Good     J     SI2   63.3   58.0    335  4.34  4.35  2.75
## 5   0.24  Very Good     J    VVS2   62.8   57.0    336  3.94  3.96  2.48
## 6   0.24  Very Good     I    VVS1   62.3   57.0    336  3.95  3.98  2.47
## 7   0.26  Very Good     H     SI1   61.9   55.0    337  4.07  4.11  2.53
## 8   0.22       Fair     E     VS2   65.1   61.0    337  3.87  3.78  2.49
## 9   0.23  Very Good     H     VS1   59.4   61.0    338  4.00  4.05  2.39

2.2.2 数据框标志`X`

在管道操作符中,将数据框传到后面的函数中,需要使用X标志来代表.在R中,我们可以直接使用数据框的列名来引用每一列,在python中,需要使用X来表示.

diamonds >> select(X.carat, X.cut) >> head(5)

##    carat      cut
## 0   0.23    Ideal
## 1   0.21  Premium
## 2   0.23     Good
## 3   0.29  Premium
## 4   0.31     Good

2.2.3 选择某列或者删除某列(selecting and dropping)

在R中,使用select函数可以完成选择和删除,在删除某列时,直接在前面加上减号就可以了.但是在python中,select也可以完成删除列,只是减号改为~.但是对于删除列,还有一个专门的函数,drop().

选择列的时候,既可以使用列的index,也可以使用X符合以及引号括起来的列名进行选择.可惜的是,暂时看到是不支持切片的.

(
diamonds >> 
  select(0, X.price, "x", ['y', 'z']) >>
  head(5)
)

##    carat  price     x     y     z
## 0   0.23    326  3.95  3.98  2.43
## 1   0.21    326  3.89  3.84  2.31
## 2   0.23    327  4.05  4.07  2.31
## 3   0.29    334  4.20  4.23  2.63
## 4   0.31    335  4.34  4.35  2.75

(
diamonds >>
  select(['carat','price']) >>
  head(5)
)

##    carat  price
## 0   0.23    326
## 1   0.21    326
## 2   0.23    327
## 3   0.29    334
## 4   0.31    335

(
diamonds >>
  drop(1, X.price, ['x', 'y']) >>
  head()
)

##    carat color clarity  depth  table     z
## 0   0.23     E     SI2   61.5   55.0  2.43
## 1   0.21     E     SI1   59.8   61.0  2.31
## 2   0.23     E     VS1   56.9   65.0  2.31
## 3   0.29     I     VS2   62.4   58.0  2.63
## 4   0.31     J     SI2   63.3   58.0  2.75

(
diamonds >>
  select(~0, ~X.price) >>
  head(5)
)

##       z
## 0  2.43
## 1  2.31
## 2  2.31
## 3  2.63
## 4  2.75

2.2.3.1 Selection filter函数

和dplyr类似,也有很多配合使用的函数.

Function	Meaning
`starts_with(prefix)`	find columns that start with a string prefix.
`ends_with(suffix)`	find columns that end with a string suffix.
`contains(substr)`	find columns that contain a substring in their name.
`everything()`	all columns.
`columns_between(start_col, end_col, inclusive=True)`	find columns between a specified start and end column. The inclusive boolean keyword argument indicates whether the end column should be included or not.
`columns_to(end_col, inclusive=True)`	get columns up to a specified end column. The inclusive argument indicates whether the ending column should be included or not.
`columns_from(start_col)`	get the columns starting at a specified column.

因为dfply并不支持切片,所以可以看到,分别提供了三个函数用来代替切片的功能.

(
diamonds >>
  select(columns_between("carat", "color")) >>
  head(5)
)

##    carat      cut color
## 0   0.23    Ideal     E
## 1   0.21  Premium     E
## 2   0.23     Good     E
## 3   0.29  Premium     I
## 4   0.31     Good     J

(
diamonds >>
  select(columns_to("color")) >>
  head(5)
)

##    carat      cut
## 0   0.23    Ideal
## 1   0.21  Premium
## 2   0.23     Good
## 3   0.29  Premium
## 4   0.31     Good

(
diamonds >>
  select(columns_from("color")) >>
  head(5)
)

##   color clarity  depth  table  price     x     y     z
## 0     E     SI2   61.5   55.0    326  3.95  3.98  2.43
## 1     E     SI1   59.8   61.0    326  3.89  3.84  2.31
## 2     E     VS1   56.9   65.0    327  4.05  4.07  2.31
## 3     I     VS2   62.4   58.0    334  4.20  4.23  2.63
## 4     J     SI2   63.3   58.0    335  4.34  4.35  2.75

diamonds >> select(columns_to(1, inclusive=True), 'depth', columns_from(-2)) >> head(2)

##    carat      cut  depth     y     z
## 0   0.23    Ideal   61.5  3.98  2.43
## 1   0.21  Premium   59.8  3.84  2.31

2.2.4 对列进行筛选

2.2.4.1 `row_slice()`

与pandas的iloc方法类似,使用index对列筛选.

问题是,仍然不支持切片.但是可以和group_by联合使用,从而选出每一组的固定行.

diamonds >> row_slice([10,15])

##     carat      cut color clarity  depth  table  price     x     y     z
## 10   0.30     Good     J     SI1   64.0   55.0    339  4.25  4.28  2.73
## 15   0.32  Premium     E      I1   60.9   58.0    345  4.38  4.42  2.68

diamonds >> group_by('cut') >> row_slice(5)

##      carat        cut color clarity  depth  table  price     x     y     z
## 128   0.91       Fair     H     SI2   64.4   57.0   2763  6.11  6.09  3.93
## 20    0.30       Good     I     SI2   63.3   56.0    351  4.26  4.30  2.71
## 40    0.33      Ideal     I     SI2   61.2   56.0    403  4.49  4.50  2.75
## 26    0.24    Premium     I     VS1   62.5   57.0    355  3.97  3.94  2.47
## 21    0.23  Very Good     E     VS2   63.8   55.0    352  3.85  3.92  2.48

2.2.4.2 `sample()`函数

和pandas的.sample()方法类似,可以用于随机抽取列.

diamonds >> sample(frac=0.0001, replace=False)

##        carat        cut color clarity  depth  table  price     x     y     z
## 35971   0.40    Premium     F     SI1   61.7   62.0    920  4.76  4.71  2.92
## 12741   1.01      Ideal     H     VS2   60.6   57.0   5317  6.54  6.49  3.95
## 23321   1.03      Ideal     F      IF   61.3   56.0  11365  6.53  6.55  4.01
## 37784   0.53  Very Good     G     SI2   62.5   55.0   1000  5.14  5.19  3.23
## 8715    0.37      Ideal     G     SI1   62.0   55.0    586  4.62  4.64  2.87

diamonds >> sample(n=3, replace=True)

##        carat        cut color clarity  depth  table  price     x     y     z
## 4188    0.83      Ideal     F     SI1   62.2   55.0   3560  6.01  6.03  3.74
## 19620   1.26      Ideal     H     VS1   60.9   57.0   8238  6.97  6.99  4.25
## 43303   0.61  Very Good     I     VS2   63.4   55.0   1401  5.46  5.42  3.45

2.2.4.3 `distinct()`函数

只保留唯一不重复的行.传入的参数为列名,可以是多个列.与pandas的.drop_duolicates()方法类似.

diamonds >> distinct(X.color, X.price) >> head(5)

##    carat        cut color clarity  depth  table  price     x     y     z
## 0   0.23      Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
## 2   0.23       Good     E     VS1   56.9   65.0    327  4.05  4.07  2.31
## 3   0.29    Premium     I     VS2   62.4   58.0    334  4.20  4.23  2.63
## 4   0.31       Good     J     SI2   63.3   58.0    335  4.34  4.35  2.75
## 5   0.24  Very Good     J    VVS2   62.8   57.0    336  3.94  3.96  2.48

2.2.4.4 `mask()`和`filter_by()`函数

类似于dplyr中的filter()函数.这两个函数是等价的.

diamonds >> mask(X.cut == 'Ideal') >> head(4)

##     carat    cut color clarity  depth  table  price     x     y     z
## 0    0.23  Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
## 11   0.23  Ideal     J     VS1   62.8   56.0    340  3.93  3.90  2.46
## 13   0.31  Ideal     J     SI2   62.2   54.0    344  4.35  4.37  2.71
## 16   0.30  Ideal     I     SI2   62.0   54.0    348  4.31  4.34  2.68

diamonds >> mask(X.cut == 'Ideal', X.color == 'E', X.table < 55, X.price < 500)

##        carat    cut color clarity  depth  table  price     x     y     z
## 26683   0.33  Ideal     E     SI2   62.2   54.0    427  4.44  4.46  2.77
## 32297   0.34  Ideal     E     SI2   62.4   54.0    454  4.49  4.52  2.81
## 40928   0.30  Ideal     E     SI1   61.6   54.0    499  4.32  4.35  2.67
## 50623   0.30  Ideal     E     SI2   62.1   54.0    401  4.32  4.35  2.69
## 50625   0.30  Ideal     E     SI2   62.0   54.0    401  4.33  4.35  2.69

diamonds >> filter_by(X.cut == 'Ideal', X.color == 'E', X.table < 55, X.price < 500)

##        carat    cut color clarity  depth  table  price     x     y     z
## 26683   0.33  Ideal     E     SI2   62.2   54.0    427  4.44  4.46  2.77
## 32297   0.34  Ideal     E     SI2   62.4   54.0    454  4.49  4.52  2.81
## 40928   0.30  Ideal     E     SI1   61.6   54.0    499  4.32  4.35  2.67
## 50623   0.30  Ideal     E     SI2   62.1   54.0    401  4.32  4.35  2.69
## 50625   0.30  Ideal     E     SI2   62.0   54.0    401  4.33  4.35  2.69

2.2.4.5 `pull()`函数

与dplyr中函数同名且功能相同.

(diamonds
 >> filter_by(X.cut == 'Ideal', X.color == 'E', X.table < 55, X.price < 500)
 >> pull('carat'))

## 26683    0.33
## 32297    0.34
## 40928    0.30
## 50623    0.30
## 50625    0.30
## Name: carat, dtype: float64
## 
## D:\software\python\lib\site-packages\dfply\subset.py:87: DeprecationWarning: 
## .ix is deprecated. Please use
## .loc for label based indexing or
## .iloc for positional indexing
## 
## See the documentation here:
## http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
##   return df.ix[:, column]

2.2.5 数据框转变

2.2.5.1 `mutate()`函数

添加新的变量(列)

diamonds >> mutate(x_plus_y=X.x + X.y) >> select(columns_from('x')) >> head(3)

##       x     y     z  x_plus_y
## 0  3.95  3.98  2.43      7.93
## 1  3.89  3.84  2.31      7.73
## 2  4.05  4.07  2.31      8.12

diamonds >> mutate(x_plus_y=X.x + X.y, y_div_z=(X.y / X.z)) >> select(columns_from('x')) >> head(3)

##       x     y     z  x_plus_y   y_div_z
## 0  3.95  3.98  2.43      7.93  1.637860
## 1  3.89  3.84  2.31      7.73  1.662338
## 2  4.05  4.07  2.31      8.12  1.761905

2.2.5.2 `transmutate()`函数

只保留下新创建的列(变量).

diamonds >> transmute(x_plus_y=X.x + X.y, y_div_z=(X.y / X.z)) >> head(3)

##    x_plus_y   y_div_z
## 0      7.93  1.637860
## 1      7.73  1.662338
## 2      8.12  1.761905

2.2.6 分组(grouping)

2.2.6.1 `group_by()`和`ungroup()`函数

分组非常有用.利用某个或某几个变量将数据框分成不同的部分,然后对每一组分别进行处理.结束之后,需要使用ungroup()函数结束分组状态.

比如,如果想知道对于不同的cut的钻石的最高价和最低价,可以使用下面的代码.

(diamonds >> group_by(X.cut) >>
 mutate(price_lead=lead(X.price), price_lag=lag(X.price)) >>
 head(2) >> select(X.cut, X.price, X.price_lead, X.price_lag))

##           cut  price  price_lead  price_lag
## 8        Fair    337      2757.0        NaN
## 91       Fair   2757      2759.0      337.0
## 2        Good    327       335.0        NaN
## 4        Good    335       339.0      327.0
## 0       Ideal    326       340.0        NaN
## 11      Ideal    340       344.0      326.0
## 1     Premium    326       334.0        NaN
## 3     Premium    334       342.0      326.0
## 5   Very Good    336       336.0        NaN
## 6   Very Good    336       337.0      336.0

2.2.7 重塑数据框(reshaping)

2.2.7.1 `arrange()`函数

这个函数是用来sort的.跟pandas的.sort_valiues()方法是一样的.

diamonds >> arrange(X.table, ascending=False) >> head(5)

##        carat   cut color clarity  depth  table  price     x     y     z
## 24932   2.01  Fair     F     SI1   58.6   95.0  13387  8.32  8.31  4.87
## 50773   0.81  Fair     F     SI2   68.8   79.0   2301  5.26  5.20  3.58
## 51342   0.79  Fair     G     SI1   65.3   76.0   2362  5.52  5.13  3.35
## 52860   0.50  Fair     E     VS2   79.0   73.0   2579  5.21  5.18  4.09
## 49375   0.70  Fair     H     VS1   62.0   73.0   2100  5.65  5.54  3.47

(diamonds >> group_by(X.cut) >> arrange(X.price) >>
 head(3) >> ungroup() >> mask(X.carat < 0.23))

##     carat      cut color clarity  depth  table  price     x     y     z
## 8    0.22     Fair     E     VS2   65.1   61.0    337  3.87  3.78  2.49
## 1    0.21  Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31
## 12   0.22  Premium     F     SI1   60.4   61.0    342  3.88  3.84  2.33

2.2.7.2 `rename()`函数

对列进行重新命名.

diamonds >> rename(CUT=X.cut, COLOR='color') >> head(2)

##    carat      CUT COLOR clarity  depth  table  price     x     y     z
## 0   0.23    Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
## 1   0.21  Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31

2.2.7.3 `gather()`函数和`spread()`函数

对长宽数据进行转换的函数.gather()将宽数据转边为长数据,spread()将长数据转换为宽数据.

diamonds >> gather('variable', 'value', ['price', 'depth','x','y','z']) >> head(5)

##    carat      cut color clarity  table variable  value
## 0   0.23    Ideal     E     SI2   55.0    price  326.0
## 1   0.21  Premium     E     SI1   61.0    price  326.0
## 2   0.23     Good     E     VS1   65.0    price  327.0
## 3   0.29  Premium     I     VS2   58.0    price  334.0
## 4   0.31     Good     J     SI2   58.0    price  335.0

elongated = diamonds >> gather('variable', 'value', add_id=True)
elongated >> head(5)

##    _ID variable value
## 0    0    carat  0.23
## 1    1    carat  0.21
## 2    2    carat  0.23
## 3    3    carat  0.29
## 4    4    carat  0.31

widened = elongated >> spread(X.variable, X.value)
widened >> head(5)

##    _ID carat clarity color      cut depth price table     x     y     z
## 0    0  0.23     SI2     E    Ideal  61.5   326    55  3.95  3.98  2.43
## 1    1  0.21     SI1     E  Premium  59.8   326    61  3.89  3.84  2.31
## 2    2  0.23     VS1     E     Good  56.9   327    65  4.05  4.07  2.31
## 3    3  0.29     VS2     I  Premium  62.4   334    58   4.2  4.23  2.63
## 4    4  0.31     SI2     J     Good  63.3   335    58  4.34  4.35  2.75

2.2.8 合并(joining)

这个用处非常多.

Funcation	Meaning
inner_join(other, by=‘column’)	合并两个数据框共有的部分
outer_join(other, by=‘column’)	合并两个数据框并集部分
full_join(other, by=‘column’)	合并两个数据框并集部分
right_join(other, by=‘column’)	以右边数据框为准进行合并
left_join(other, by=‘column’)	以左边数据框为准进行合并
semi_join(other, by=‘column’)	只保留左边数据框能够匹配上右边数据框的行
anti_join(other, by=‘column’)	只保留两个数据框都互不相同的行

a = pd.DataFrame({
        'x1':['A','B','C'],
        'x2':[1,2,3]
    })
b = pd.DataFrame({
    'x1':['A','B','D'],
    'x3':[True,False,True]
})

a >> inner_join(b, by='x1')

##   x1  x2     x3
## 0  A   1   True
## 1  B   2  False

a >> outer_join(b, by='x1')

##   x1   x2     x3
## 0  A  1.0   True
## 1  B  2.0  False
## 2  C  3.0    NaN
## 3  D  NaN   True

a >> full_join(b, by='x1')

##   x1   x2     x3
## 0  A  1.0   True
## 1  B  2.0  False
## 2  C  3.0    NaN
## 3  D  NaN   True

a >> semi_join(b, by='x1')

##   x1  x2
## 0  A   1
## 1  B   2

a >> anti_join(b, by='x1')

##   x1  x2
## 2  C   3

未完待续….

Blog Chinese

Python中与dplyr类似的数据处理包

1 dplython

1.1 安装

1.2 简单介绍

2 dfply

2.1 安装

2.2 函数概览

2.2.1 >>和>>=管道操作符

2.2.2 数据框标志X

2.2.3 选择某列或者删除某列(selecting and dropping)

2.2.3.1 Selection filter函数

2.2.4 对列进行筛选

2.2.4.1 row_slice()

2.2.4.2 sample()函数

2.2.4.3 distinct()函数

2.2.4.4 mask()和filter_by()函数

2.2.4.5 pull()函数