Learning data.table (3)

Posted by Rui Ying on Thursday, April 29, 2021

What is a key

A key is very similar to the group_by concept, or to a telephone book in the real world: when you want to find someone's phone number, you look up their last name first, then their first name. The names here are the key.

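Or, as the data.table vignette puts it, a key in data.table can be thought of as a supercharged version of the row names in a data.frame. Compared with row names, keys have several advantages:

  • a key can span several columns in a data.table, whereas each row of a data.frame has exactly one row name
  • keys need not be unique (duplicate values are allowed in the key columns)
  • key columns can be of different types (for example integer, numeric, character, or factor)
  • setting a key physically sorts the rows by the key columns, in increasing order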
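
The properties above can be checked on a small example. This is only a sketch: the table `dt` and its values are made up for illustration and are not part of the flights data used later.

```r
library(data.table)

# Toy data (hypothetical), deliberately unsorted and with a repeated origin
dt <- data.table(
  origin = c("LGA", "JFK", "LGA", "EWR"),
  month  = c(2L, 1L, 1L, 3L),
  delay  = c(10, -3, 25, 0)
)

# Key on two columns of different types (character and integer)
setkey(dt, origin, month)

key(dt)     # the key columns, in order
haskey(dt)  # TRUE once a key is set
dt          # rows are now sorted by origin, then by month
```

Note that "LGA" appears twice in the key column origin, which row names in a data.frame would not allow.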

How to set key

Use setkey() or setkeyv(): the former takes bare column names and is more convenient for interactive use, while the latter takes a character vector of column names and is better suited to use inside functions.

setkey(flights, origin, dest)
# is equivalent to
setkeyv(flights, c("origin", "dest"))

Features of setting a key

  • setkey() modifies the data.table in place and returns the result invisibly
  • it works by reference, like the := operator and the rest of the set* family of functions (setkey, setnames, etc.)
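
A minimal sketch of what "by reference" and "invisibly" mean in practice. The table `dt` and the variable `key_cols` are hypothetical names introduced only for this example.

```r
library(data.table)

# Toy table (hypothetical data)
dt <- data.table(id = c(3L, 1L, 2L), value = c("c", "a", "b"))

# Key columns held in a variable, as they would be inside a function
key_cols <- c("id")
setkeyv(dt, key_cols)  # no assignment needed: dt itself is modified and sorted

# The keyed table is returned invisibly, so nothing is printed,
# but the return value can still be captured
res <- setkeyv(dt, key_cols)
identical(key(res), "id")

# setnames() follows the same by-reference pattern
setnames(dt, "value", "label")
names(dt)
```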

Do some work

setkey(flights, origin, dest)
# select j
flights[.("LGA", "TPA"), .(arr_delay)]

# compute in j
flights[.("LGA", "TPA"), max(arr_delay)]

# use by
flights["JFK", max(dep_delay), keyby = month]
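
For readers without the flights data at hand, the same kinds of query can be tried on a toy table. This is a sketch only: `flights_toy` and its values are invented and simply mimic a few columns of flights.

```r
library(data.table)

# Toy stand-in for flights (hypothetical values)
flights_toy <- data.table(
  origin    = c("LGA", "LGA", "JFK", "JFK", "LGA"),
  dest      = c("TPA", "TPA", "MIA", "TPA", "MIA"),
  month     = c(1L, 2L, 1L, 1L, 2L),
  arr_delay = c(5, 40, -2, 12, 7)
)
setkey(flights_toy, origin, dest)

# Match both key columns: origin == "LGA" and dest == "TPA"
flights_toy[.("LGA", "TPA"), .(arr_delay)]

# Match the first key column only, then aggregate by month
flights_toy["JFK", max(arr_delay), keyby = month]

# Match the second key column only: supply every value of the first one
flights_toy[.(unique(origin), "MIA")]
```

The values inside .() are matched against the key columns in order, so a lookup on dest alone still has to provide values for origin.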

mult and nomatch arguments

Put simply, mult chooses which of the matching rows to return ("all", "first" or "last"), and nomatch decides whether a query with no match returns a row of NA or is skipped.

The defaults are mult = "all" and nomatch = NA. Setting nomatch = NULL skips queries with no matches.
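
A short sketch of both arguments on a made-up table (the data here is hypothetical and chosen so that one query has two matches and another has none):

```r
library(data.table)

dt <- data.table(
  origin = c("JFK", "JFK", "LGA"),
  dest   = c("MIA", "MIA", "TPA"),
  delay  = c(1, 2, 3)
)
setkey(dt, origin, dest)

# Default mult = "all": both JFK/MIA rows are returned
dt[.("JFK", "MIA")]

# Only the first or the last matching row
dt[.("JFK", "MIA"), mult = "first"]
dt[.("JFK", "MIA"), mult = "last"]

# ("EWR", "MIA") does not exist in the table
dt[.(c("JFK", "EWR"), "MIA")]                  # default: EWR row with NA delay
dt[.(c("JFK", "EWR"), "MIA"), nomatch = NULL]  # the EWR query is dropped
```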

Advantages of using a key

A keyed subset is based on binary search, which has O(log N) complexity. The traditional vector scan, as the name suggests, simply checks every row, so its complexity is O(N).

However, the documentation notes that recent versions optimize ordinary subsets such as flights[origin == "JFK"] so that they can use binary search as well. If you remove the key with setkey(flights, NULL), such subsets can no longer rely on it and may fall back to the slower vector scan.
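
The difference can be observed with a rough timing comparison. This is a sketch under stated assumptions: the table `big` is randomly generated, the logical vector is built outside of [ ] so that the first subset is definitely a plain vector scan, and the timings will vary by machine.

```r
library(data.table)

set.seed(1)
n <- 1e6L
big <- data.table(
  x   = sample(letters, n, replace = TRUE),
  y   = sample(1000L, n, replace = TRUE),
  val = runif(n)
)

# (1) Vector scan: every row is compared, O(N)
t_scan <- system.time({
  keep <- big$x == "g" & big$y == 877L
  ans_scan <- big[keep]
})

# (2) Binary search on the key, O(log N) per lookup
setkey(big, x, y)
t_key <- system.time(ans_key <- big[.("g", 877L)])

# Both approaches select the same rows
all.equal(sort(ans_scan$val), sort(ans_key$val))

# Elapsed seconds for each approach
c(scan = t_scan[["elapsed"]], key = t_key[["elapsed"]])
```

The one-off cost of setkey() is not included in the second timing, so a key pays off mainly when the same table is subset repeatedly.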