我的R语言小伙伴最近分享了自己使用R来做工业级数据清洗的经验,最近我自己在不断测试我的新包tidyfst,因此就拿这个data.table的案例来尝试一下。

测试数据构造

本次测试,将不会加载data.table包,但是其实tidyfst里面无处不是data.table的元素,而且也导出了很多内置的data.table函数,比如as.data.table和data.table。所以这些代码在tidyfst中就可以自如地使用。

library(tidyfst)
#> 
#> Life's short, use R.
diamonds <- ggplot2::diamonds
n = 1e5  #如果想做工业级测试,可以继续增加数量
set.seed(2020)
dtranges <- seq.Date(from = as.Date("2011-01-01"),
                     to = as.Date("2020-01-01"),
                     by = 1)
n1 <- sample(nrow(diamonds), n, replace = TRUE)
dat1 <- as.data.table(diamonds[n1, ])
dat1[, "dt"] <- sample(dtranges, n, replace = TRUE)  # 增加dt列
n2 <- sample(nrow(dat1), nrow(dat1)/1000)
dat1[n2, "price"] <- NA # price列构造千分之一缺失值
dat2 <- data.table(dt = sample(dtranges, min(n/1000, length(dtranges))),
                   price1 = sample(1000, min(n/1000, length(dtranges)), replace = TRUE))

dat3 <- data.table(dt = sample(dtranges, min(n/1000, length(dtranges))),
                   price2 = sample(1000, min(n/1000, length(dtranges)), replace = TRUE))

print(dat1)
#>         carat       cut color clarity depth table price     x     y     z
#>         <num>     <ord> <ord>   <ord> <num> <num> <int> <num> <num> <num>
#>      1:  2.23   Premium     J     VS2  61.0    58 14867  8.39  8.36  5.11
#>      2:  0.46     Ideal     F     SI2  61.5    54   758  4.98  5.01  3.07
#>      3:  0.70     Ideal     H    VVS1  60.9    57  3611  5.70  5.78  3.50
#>      4:  1.50   Premium     F     VS2  61.5    58 14719  7.34  7.32  4.51
#>      5:  0.40 Very Good     G      IF  61.0    59  1154  4.75  4.79  2.91
#>     ---                                                                  
#>  99996:  1.14     Ideal     F     VS2  61.4    56  8017  6.73  6.78  4.14
#>  99997:  0.39     Ideal     F     SI1  61.8    55   886  4.70  4.72  2.91
#>  99998:  2.00      Good     F     SI1  63.0    64 17869  7.67  7.76  4.86
#>  99999:  1.00   Premium     H     VS2  59.7    59  5139  6.57  6.52  3.91
#> 100000:  0.30     Ideal     G    VVS2  60.6    57   878  4.35  4.33  2.63
#>                 dt
#>             <Date>
#>      1: 2016-09-29
#>      2: 2011-10-22
#>      3: 2013-03-06
#>      4: 2014-06-11
#>      5: 2015-02-23
#>     ---           
#>  99996: 2013-11-22
#>  99997: 2014-11-10
#>  99998: 2019-02-21
#>  99999: 2017-01-15
#> 100000: 2019-05-19

基础

聚合

1.求每种切割类型、每种颜色钻石的平均价格、中位数价格与最高价格

在tidyfst中,我设置了一个sys_time_print函数,可以方便地输出system.time()函数返回的结果。

sys_time_print({
  r1_1 <- dat1 %>% 
    summarise_dt(
      by = .(cut,color),
      mean_price = mean(price, na.rm = TRUE),
      median_price = median(price, na.rm = TRUE),
      max_price = max(price, na.rm = TRUE)
    )
})
#> [1] "Finished in 0.000s elapsed (0.000s cpu)"
r1_1
#>           cut color mean_price median_price max_price
#>         <ord> <ord>      <num>        <num>     <int>
#>  1: Very Good     H   4390.470       3394.0     18803
#>  2:      Good     G   3999.886       3303.0     18788
#>  3:     Ideal     F   3309.061       1751.0     18780
#>  4:   Premium     E   3481.628       1881.0     18426
#>  5:      Fair     D   4237.899       3205.0     16386
#>  6:   Premium     J   6416.235       5110.5     18706
#>  7:   Premium     F   4347.158       2874.0     18791
#>  8:     Ideal     D   2608.030       1550.5     18693
#>  9: Very Good     G   3900.558       2403.0     18818
#> 10: Very Good     E   3301.873       2051.0     18731
#> 11:   Premium     G   4543.181       2780.0     18741
#> 12:     Ideal     G   3751.035       1881.0     18806
#> 13:     Ideal     E   2558.501       1429.0     18729
#> 14:   Premium     H   5215.019       4451.0     18795
#> 15:   Premium     I   5921.498       4545.5     18823
#> 16: Very Good     F   3792.863       2473.0     18777
#> 17:      Good     D   3369.352       2661.0     18468
#> 18:      Fair     G   4186.403       2797.0     18574
#> 19:      Good     E   3293.289       2407.0     18236
#> 20:   Premium     D   3796.609       2348.0     18286
#> 21:     Ideal     H   3924.682       2320.0     18659
#> 22: Very Good     I   5475.587       4013.0     18500
#> 23:     Ideal     I   4434.030       2631.0     18779
#> 24:      Fair     J   5029.502       3422.0     18531
#> 25: Very Good     D   3437.133       2242.0     18526
#> 26:      Good     F   3518.548       2755.0     18686
#> 27:      Good     I   4966.745       3484.0     18707
#> 28:      Good     H   4098.280       3323.0     18640
#> 29:      Fair     H   5454.118       4134.5     18308
#> 30:      Fair     F   3790.703       2961.0     17995
#> 31:     Ideal     J   4972.801       4135.0     18508
#> 32: Very Good     J   5107.638       4126.5     18430
#> 33:      Fair     I   4278.831       3022.0     18242
#> 34:      Good     J   4578.490       3690.0     18325
#> 35:      Fair     E   3834.358       2996.5     15584
#>           cut color mean_price median_price max_price

tidyfst是永远不可能比data.table快的,但是如果你觉得上面的代码更容易掌握、更容易读懂,而在日常工作中多花零点几秒的运行时间没有太大问题(实际上节省了大家的交流时间,甚至就是节省将来自己再次读懂自己代码的时间),tidyfst就值得拥有。

join

长宽表转换

1.长表转宽表

sys_time_print({
  mean1 <- function(x) mean(x, na.rm = TRUE)
  max1 <- function(x) max(x, na.rm = TRUE)
  r3_1 <-dat1 %>% 
    wider_dt(cut,
             value = c("depth", "price"),
             name = "color",
             fun = list(mean1,max1))
})
#> [1] "Finished in 0.050s elapsed (0.040s cpu)"
r3_1
#> Key: <cut>
#>          cut depth_mean1_D depth_mean1_E depth_mean1_F depth_mean1_G
#>        <ord>         <num>         <num>         <num>         <num>
#> 1:      Fair      64.15688      63.09882      63.41129      64.33663
#> 2:      Good      62.32176      62.16065      62.18220      62.60212
#> 3: Very Good      61.70454      61.73751      61.73138      61.83043
#> 4:   Premium      61.18875      61.16850      61.25098      61.27894
#> 5:     Ideal      61.67241      61.68693      61.65616      61.70585
#>    depth_mean1_H depth_mean1_I depth_mean1_J price_mean1_D price_mean1_E
#>            <num>         <num>         <num>         <num>         <num>
#> 1:      64.44778      64.24571      64.75682      4237.899      3834.358
#> 2:      62.50826      62.56427      62.44288      3369.352      3293.289
#> 3:      61.97838      61.90052      61.94553      3437.133      3301.873
#> 4:      61.31882      61.34123      61.42092      3796.609      3481.628
#> 5:      61.72374      61.79243      61.80705      2608.030      2558.501
#>    price_mean1_F price_mean1_G price_mean1_H price_mean1_I price_mean1_J
#>            <num>         <num>         <num>         <num>         <num>
#> 1:      3790.703      4186.403      5454.118      4278.831      5029.502
#> 2:      3518.548      3999.886      4098.280      4966.745      4578.490
#> 3:      3792.863      3900.558      4390.470      5475.587      5107.638
#> 4:      4347.158      4543.181      5215.019      5921.498      6416.235
#> 5:      3309.061      3751.035      3924.682      4434.030      4972.801
#>    depth_max1_D depth_max1_E depth_max1_F depth_max1_G depth_max1_H
#>           <num>        <num>        <num>        <num>        <num>
#> 1:         71.6         79.0         70.8         72.9         71.8
#> 2:         67.0         65.9         66.5         65.8         66.1
#> 3:         64.7         64.4         64.2         64.9         64.7
#> 4:         63.0         63.0         63.0         63.0         63.0
#> 5:         64.5         65.5         65.3         64.1         65.1
#>    depth_max1_I depth_max1_J price_max1_D price_max1_E price_max1_F
#>           <num>        <num>        <int>        <int>        <int>
#> 1:         71.3         73.6        16386        15584        17995
#> 2:         65.9         66.0        18468        18236        18686
#> 3:         64.4         64.1        18526        18731        18777
#> 4:         63.0         63.0        18286        18426        18791
#> 5:         66.7         65.4        18693        18729        18780
#>    price_max1_G price_max1_H price_max1_I price_max1_J
#>           <int>        <int>        <int>        <int>
#> 1:        18574        18308        18242        18531
#> 2:        18788        18640        18707        18325
#> 3:        18818        18803        18500        18430
#> 4:        18741        18795        18823        18706
#> 5:        18806        18659        18779        18508

高阶

添加子维度聚合结果为新列