A key is similar in spirit to the group_by concept, or to a telephone book: when you want to find someone's phone number, you look the person up by first name, then by second name. The names here act as the key.
As the data.table vignette shows, a key in a data.table can be thought of as an enhanced version of row names in a data.frame. However, keys have several advantages:
- a key can be made of several columns in a data.table, whereas each row in a data.frame has only one row name
- key values do not have to be unique (key columns may contain duplicate values)
- key columns can be of different types
- setting a key sorts the rows by the key columns
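The properties listed above can be checked on a small table. This is a minimal sketch on made-up data (the table `dt` and its values are hypothetical, not the flights dataset):

```r
library(data.table)

# Toy data, invented for illustration only
dt <- data.table(
  origin = c("LGA", "JFK", "LGA", "EWR", "JFK"),
  month  = c(3L, 1L, 1L, 2L, 1L),
  delay  = c(10, -2, 35, 4, 7)
)

# One key made of two columns of different types (character and integer)
setkey(dt, origin, month)

key(dt)      # the key columns: "origin" "month"
dt$origin    # rows are now sorted by the key: EWR, JFK, JFK, LGA, LGA

# Key values need not be unique: the pair (JFK, 1) appears twice
anyDuplicated(dt, by = key(dt)) > 0
```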
How to set key
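Use setkey() or setkeyv(): the former takes unquoted column names and is more convenient for interactive use, whilst the latter takes a character vector of column names and is better suited to use inside functions.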
setkey(flights, origin, dest)  # equivalent to setkeyv(flights, c("origin", "dest"))
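To make the equivalence concrete, here is a small sketch on toy data (the tables `dt1`, `dt2` and the variable `key_cols` are hypothetical names used only for illustration). It shows why setkeyv() fits programmatic use: the column names can live in a variable.

```r
library(data.table)

dt1 <- data.table(origin = c("LGA", "EWR"), dest = c("TPA", "BOS"), x = 1:2)
dt2 <- copy(dt1)                  # independent copy, so the two calls do not interfere

setkey(dt1, origin, dest)         # column names given unquoted

key_cols <- c("origin", "dest")   # column names held in a character vector
setkeyv(dt2, key_cols)

identical(key(dt1), key(dt2))     # both tables end up with the same key
```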
Features of setting a key
- setkey() modifies the data.table and returns the result invisibly
- it works by reference, like the := operator and all set* family functions (setkey, setnames, etc.)
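A short sketch of what "by reference" means in practice, again on a made-up table (`dt` and its columns are hypothetical): none of the calls below needs an assignment back to `dt`, yet `dt` changes each time.

```r
library(data.table)

dt <- data.table(id = c("b", "a", "c"), v = 1:3)

setkey(dt, id)               # prints nothing; dt itself is re-ordered and keyed
key(dt)                      # "id"
dt$id                        # "a" "b" "c"

dt[, w := v * 2L]            # := adds a column in place
setnames(dt, "v", "value")   # set* functions also modify dt in place
names(dt)                    # "id" "value" "w"
```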
Do some work
setkey(flights, origin, dest)
# select a column in j
flights[.("LGA", "TPA"), .(arr_delay)]
# compute something in j
flights[.("LGA", "TPA"), max(arr_delay)]
# use by
flights["JFK", max(dep_delay), keyby = month]
mult chooses which of the matching rows to return ("all", "first" or "last"), and nomatch decides whether a query with no match returns a row of NA or is skipped. The defaults of these two arguments are "all" and NA. Setting nomatch=NULL skips queries with no matches.
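The effect of the two arguments is easiest to see on a tiny keyed table. This is a sketch with invented data; "XNA" is simply a value that does not occur in the table.

```r
library(data.table)

dt <- data.table(origin = c("JFK", "JFK", "LGA"), delay = c(5, 12, 3))
setkey(dt, origin)

all_rows   <- dt[.("JFK")]                   # default mult = "all": both JFK rows
first_row  <- dt[.("JFK"), mult = "first"]   # only the first matching row
last_row   <- dt[.("JFK"), mult = "last"]    # only the last matching row

# "XNA" has no match: by default it yields a row with NA in delay
with_na    <- dt[.(c("JFK", "XNA"))]

# nomatch = NULL drops the unmatched query instead
without_na <- dt[.(c("JFK", "XNA")), nomatch = NULL]
```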
Advantage of using key
Key-based subsetting uses binary search, with O(log N) complexity, while the traditional vector scan method, as the name suggests, simply scans every row, so its complexity is O(N).
However, the documentation says that in recent versions a vector-scan-style subset has been optimized and is also based on binary search. To get back to the slow vector scan, you have to remove the key by setting it to NULL.
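The two approaches can be compared with a rough sketch like the one below. Everything here is an assumption made for illustration: the table `DT`, its size, and the lookup values. The vector scan is forced by building the logical index outside of `[`, so no optimization can apply. Timings depend on the machine and the data.table version, so none are quoted.

```r
library(data.table)

set.seed(1)
n  <- 1e5
DT <- data.table(
  x   = sample(letters, n, replace = TRUE),
  y   = sample(1000L, n, replace = TRUE),
  val = runif(n)
)

# Vector scan: every row is compared, O(N)
t_scan <- system.time({
  idx      <- DT$x == "g" & DT$y == 877L
  ans_scan <- DT[idx]
})

# Binary search on the key, O(log N) per lookup
setkey(DT, x, y)
t_key <- system.time(ans_key <- DT[.("g", 877L), nomatch = NULL])

# Both approaches return the same rows
nrow(ans_scan) == nrow(ans_key)
isTRUE(all.equal(ans_scan$val, ans_key$val))
```

Note that the one-off cost of setkey() is not included in `t_key`; a key pays off when the same table is queried repeatedly.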