I love the tidy syntax of dplyr and the ultimate speed of data.table. Why not take the both? That is why I have started the work of tidyfst, bridging the tidy syntax and computation performance via translation. This tool is especially friendly for dplyr users who want to learn some data.table, but data.table could also benefit from it (more or less).

A great comparison of data.table and dplyr was displayed at https://atrebas.github.io/post/2019-03-03-datatable-dplyr/ (thanks to Atrebas). I love this tutorial very much because it dig rather deep into many features from both packages. Here I’ll try to implement all operations from that tutorial, and the potential users could find why they would prefer tidyfst for some (if not most) tasks.

The below examples have all been checked with tidyfst. Now let’s begin.

Create example data

library(tidyfst)

set.seed(1L)

## Create a data table
DF <- data.table(V1 = rep(c(1L, 2L), 5)[-10],
                V2 = 1:9,
                V3 = c(0.5, 1.0, 1.5),
                V4 = rep(LETTERS[1:3], 3))
copy(DF) -> DT

class(DF)
DF

Basic operations

Filter rows

### Filter rows using indices
slice_dt(DF, 3:4)

### Discard rows using negative indices
slice_dt(DF, -(3:7))

### Filter rows using a logical expression
filter_dt(DF, V2 > 5)
filter_dt(DF, V4 %in% c("A", "C"))
filter_dt(DF, V4 %chin% c("A", "C")) # fast %in% for character

### Filter rows using multiple conditions
filter_dt(DF, V1 == 1, V4 == "A")
# equals to
filter_dt(DF, V1 == 1 & V4 == "A")

### Filter unique rows
distinct_dt(DF) # unique(DF)
distinct_dt(DF, V1,V4)

### Discard rows with missing values
drop_na_dt(DF) # na.omit(DF)

### Other filters
sample_n_dt(DF, 3)      # n random rows
sample_frac_dt(DF, 0.5) # fraction of random rows
top_n_dt(DF, 1, V1)     # top n entries (includes equals)

filter_dt(DT,V4 %like% "^B")
filter_dt(DT,V2 %between% c(3, 5))
filter_dt(DT,between(V2, 3, 5, incbounds = FALSE))
filter_dt(DT,V2 %inrange% list(-1:1, 1:3))  # see also ?inrange

Select columns

### Select one column using an index (not recommended)
pull_dt(DT,3) # returns a vector
select_dt(DT,3) # returns a data.table

### Select one column using column name
select_dt(DF, V2) # returns a data.table
pull_dt(DF, V2)   # returns a vector

### Select several columns
select_dt(DF, V2, V3, V4)
select_dt(DF, V2:V4) # select columns between V2 and V4

### Exclude columns
select_dt(DF, -V2, -V3)

### Select/Exclude columns using a character vector
cols <- c("V2", "V3")
select_dt(DF,cols = cols)
select_dt(DF,cols = cols,negate = TRUE)

### Other selections
select_dt(DF, cols = paste0("V", 1:2))
relocate_dt(DF, V4) # reorder columns
select_dt(DF, "V")
select_dt(DF, "3$")
select_dt(DF, ".2")
select_dt(DF, "V1|X")
select_dt(DF, -"^V2")
# remove variables using "-" prior to function

Going further

Miscellaneous

Reshape data

### Melt data (from wide to long)
fsetequal(DT,DF)

mDT = DT %>% longer_dt(V4)
mDF = DF %>% longer_dt(-"V1|V2")

fsetequal(mDT,mDF)
mDT

### Cast data (from long to wide)
mDT %>% 
  wider_dt(V4,name = "name",value = "value")
# below is a special case and could only be done in tidyfst
mDT %>% 
  wider_dt(V4,name = "name",value = "value",fun = list)

mDT %>% 
  wider_dt(V4,name = "name",value = "value",fun = sum)

### Split
split(DT, by = "V4")

Other

### Lead/Lag
lag_dt(1:10,n = 1)
lag_dt(1:10,n = 1:2)
lead_dt(1:10,n = 1)

Join/Bind data sets

Join

x <- data.table(Id  = c("A", "B", "C", "C"),
                X1  = c(1L, 3L, 5L, 7L),
                XY  = c("x2", "x4", "x6", "x8"),
                key = "Id")

y <- data.table(Id  = c("A", "B", "B", "D"),
                Y1  = c(1L, 3L, 5L, 7L),
                XY  = c("y1", "y3", "y5", "y7"),
                key = "Id")

### left join
left_join_dt(x, y, by = "Id")

### right join
right_join_dt(x, y, by = "Id")

### inner join
inner_join_dt(x, y, by = "Id")

### full join
full_join_dt(x, y, by = "Id")

### semi join
semi_join_dt(x, y, by = "Id")

### anti join
anti_join_dt(x, y, by = "Id")

Bind

Set operations

x <- data.table(c(1, 2, 2, 3, 3))
y <- data.table(c(2, 2, 3, 4, 4))
### Intersection
fintersect(x, y)
fintersect(x, y, all = TRUE)

### Difference
fsetdiff(x, y)
fsetdiff(x, y, all = TRUE)

### Union
funion(x, y)
funion(x, y, all = TRUE)

### Equality
fsetequal(x, x[order(-V1),])
all.equal(x, x) # S3 method
setequal(x, x[order(-V1),])

Summary

To break all these codes through, tidyfst has improved bit by bit. If you are using tidyfst frequently, you’ll find that while it enjoys a tidy syntax, it is more like you are using data.table in another style. Compared with many other packages with similar goals, tidyfst sticks to many principles of data.table (and is more like data.table in many ways). However, the ultimate goal is still clear: providing users with state-of-the-art data manipulation tools with least pain. Therefore, keep it simple and make it fast. Enjoy tidyfst~