learning data.table (5)

Posted by Rui Ying on Friday, May 7, 2021

What’s secondary index?

Secondary index is similar to key but differ in

  • Only reorder specific columns instead of entire data.table, and record the order vector in attribute index
  • There can be > 1 secondary index for a data.table (though you can have multiple keys per row).

I treat secondary index as fast and simple key for subsetting.

Set secondary index

setindex(flights, origin) #or setindexv()
head(flights)
##    year month day dep_delay arr_delay carrier origin dest air_time distance
## 1: 2014     1   1        14        13      AA    JFK  LAX      359     2475
## 2: 2014     1   1        -3        13      AA    JFK  LAX      363     2475
## 3: 2014     1   1         2         9      AA    JFK  LAX      351     2475
## 4: 2014     1   1        -8       -26      AA    LGA  PBI      157     1035
## 5: 2014     1   1         2         1      AA    JFK  LAX      350     2475
## 6: 2014     1   1         4         0      AA    EWR  LAX      339     2454
##    hour
## 1:    9
## 2:   11
## 3:   19
## 4:    7
## 5:   13
## 6:   18
names(attributes(flights))
## [1] "names"             "row.names"         "class"            
## [4] ".internal.selfref" "index"

Get secondary index

setindex(flights, origin, dest)
indices(flights)
## [1] "origin"       "origin__dest"

Why secondary index?

Case 1: reordering is expensive

Using setkey() does two things:

  • Computing order vector

  • Reordering entire data.table

But the second one can be consuming, so secondary index do the first thing only and store the output vector into index.

Case 2: there is only 1 key for 1 data.table

If we have set one key, and need to reset one, we have to re-computer and re-order entire data.table. But the secondary index allows more.

Case 3: reuse index

The index attribute avoid repeated computing if an index already exists. Simply add verbose=TRUE in data.table arguments.

Application of secondary index

Fast subset using on

flights["JFK", on="origin"] #works for both key and secondary index
##        year month day dep_delay arr_delay carrier origin dest air_time distance
##     1: 2014     1   1        14        13      AA    JFK  LAX      359     2475
##     2: 2014     1   1        -3        13      AA    JFK  LAX      363     2475
##     3: 2014     1   1         2         9      AA    JFK  LAX      351     2475
##     4: 2014     1   1         2         1      AA    JFK  LAX      350     2475
##     5: 2014     1   1        -2       -18      AA    JFK  LAX      338     2475
##    ---                                                                         
## 81479: 2014    10  31        -4       -21      UA    JFK  SFO      337     2586
## 81480: 2014    10  31        -2       -37      UA    JFK  SFO      344     2586
## 81481: 2014    10  31         0       -33      UA    JFK  LAX      320     2475
## 81482: 2014    10  31        -6       -38      UA    JFK  SFO      343     2586
## 81483: 2014    10  31        -6       -38      UA    JFK  LAX      323     2475
##        hour
##     1:    9
##     2:   11
##     3:   19
##     4:   13
##     5:   21
##    ---     
## 81479:   17
## 81480:   18
## 81481:   17
## 81482:    9
## 81483:   11

Other features as normal cases

flights[.("LGA", "TPA"), .(arr_delay), on = c("origin", "dest")]
##       arr_delay
##    1:         1
##    2:        14
##    3:       -17
##    4:        -4
##    5:       -12
##   ---          
## 1848:        39
## 1849:       -24
## 1850:       -12
## 1851:        21
## 1852:       -11
#find more in vignettes