Learning data.table (2)

Posted by Rui Ying on Sunday, April 25, 2021
#download data as usual
library(data.table)
library(magrittr)
file_url <- 'https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv'
df <- fread(file_url)

Introduction

This post studies the use of := operator which add/update/delete columns. The difference between := and default data.frame operation is the former does by reference and therefore is not deep copy and not assign any value to a variable. This causes faster speed, less memory use but some side effects.

Two forms of := ways

# LHS := RHS
DT[, c("colA", "colB") := list(valA, valB)]

# function form
DT[, `:=`(colA = valA, # valA is assigned to colA
          colB = valB)]

Add/update/delete columns

df[, `:=`(speed = distance / (air_time/60),
   delay = arr_delay + dep_delay)] %>% head()
##    year month day dep_delay arr_delay carrier origin dest air_time distance
## 1: 2014     1   1        14        13      AA    JFK  LAX      359     2475
## 2: 2014     1   1        -3        13      AA    JFK  LAX      363     2475
## 3: 2014     1   1         2         9      AA    JFK  LAX      351     2475
## 4: 2014     1   1        -8       -26      AA    LGA  PBI      157     1035
## 5: 2014     1   1         2         1      AA    JFK  LAX      350     2475
## 6: 2014     1   1         4         0      AA    EWR  LAX      339     2454
##    hour    speed delay
## 1:    9 413.6490    27
## 2:   11 409.0909    10
## 3:   19 423.0769    11
## 4:    7 395.5414   -34
## 5:   13 424.2857     3
## 6:   18 434.3363     4

Subassign - conditional update

df[hour == 24L, hour := 0L][] #replace all hour 24 to 0
##         year month day dep_delay arr_delay carrier origin dest air_time
##      1: 2014     1   1        14        13      AA    JFK  LAX      359
##      2: 2014     1   1        -3        13      AA    JFK  LAX      363
##      3: 2014     1   1         2         9      AA    JFK  LAX      351
##      4: 2014     1   1        -8       -26      AA    LGA  PBI      157
##      5: 2014     1   1         2         1      AA    JFK  LAX      350
##     ---                                                                
## 253312: 2014    10  31         1       -30      UA    LGA  IAH      201
## 253313: 2014    10  31        -5       -14      UA    EWR  IAH      189
## 253314: 2014    10  31        -8        16      MQ    LGA  RDU       83
## 253315: 2014    10  31        -4        15      MQ    LGA  DTW       75
## 253316: 2014    10  31        -5         1      MQ    LGA  SDF      110
##         distance hour    speed delay
##      1:     2475    9 413.6490    27
##      2:     2475   11 409.0909    10
##      3:     2475   19 423.0769    11
##      4:     1035    7 395.5414   -34
##      5:     2475   13 424.2857     3
##     ---                             
## 253312:     1416   14 422.6866   -29
## 253313:     1400    8 444.4444   -19
## 253314:      431   11 311.5663     8
## 253315:      502   11 401.6000    11
## 253316:      659    8 359.4545    -4
# note := returns the result invisibly, so we have extra [] here

Delete columns

df[, c("delay") := NULL]

Using by

df[, max_speed := max(speed), by = .(origin, dest)]

Multiple columns

in_cols  = c("dep_delay", "arr_delay")
out_cols = c("max_dep_delay", "max_arr_delay")
df[, c(out_cols) := lapply(.SD, max), by = month, .SD = in_cols][] %>% head()
##    year month day dep_delay arr_delay carrier origin dest air_time distance
## 1: 2014     1   1        14        13      AA    JFK  LAX      359     2475
## 2: 2014     1   1        -3        13      AA    JFK  LAX      363     2475
## 3: 2014     1   1         2         9      AA    JFK  LAX      351     2475
## 4: 2014     1   1        -8       -26      AA    LGA  PBI      157     1035
## 5: 2014     1   1         2         1      AA    JFK  LAX      350     2475
## 6: 2014     1   1         4         0      AA    EWR  LAX      339     2454
##    hour    speed max_speed max_dep_delay max_arr_delay
## 1:    9 413.6490  526.5957           973           996
## 2:   11 409.0909  526.5957           973           996
## 3:   19 423.0769  526.5957           973           996
## 4:    7 395.5414  517.5000           973           996
## 5:   13 424.2857  526.5957           973           996
## 6:   18 434.3363  518.4507           973           996
#note the c() in out_cols because we can not do out_cols := lapply(.SD, max). That would result in adding one new column named out_col.

Side effects of :=

Because := refer to the memory directly, it can be dangerous in writing function: any operation would change the origin data set. Thus a better way is use copy() first and do other things in the duplicate one.