Learning data.table (4)

Posted by Rui Ying on Friday, April 30, 2021

melt and dcast

Actually I don’t know the difference of these two functions between data.table and reshape2, but the use should be same.

melt converts wide to long

library(data.table)
s1 <- "family_id age_mother dob_child1 dob_child2 dob_child3
1         30 1998-11-26 2000-01-29         NA
2         27 1996-06-22         NA         NA
3         26 2002-07-11 2004-04-05 2007-09-02
4         32 2004-10-10 2009-08-27 2012-07-21
5         29 2000-12-05 2005-02-28         NA"
DT <- fread(s1)
DT
##    family_id age_mother dob_child1 dob_child2 dob_child3
## 1:         1         30 1998-11-26 2000-01-29       <NA>
## 2:         2         27 1996-06-22       <NA>       <NA>
## 3:         3         26 2002-07-11 2004-04-05 2007-09-02
## 4:         4         32 2004-10-10 2009-08-27 2012-07-21
## 5:         5         29 2000-12-05 2005-02-28       <NA>
str(DT)
## Classes 'data.table' and 'data.frame':   5 obs. of  5 variables:
##  $ family_id : int  1 2 3 4 5
##  $ age_mother: int  30 27 26 32 29
##  $ dob_child1: IDate, format: "1998-11-26" "1996-06-22" ...
##  $ dob_child2: IDate, format: "2000-01-29" NA ...
##  $ dob_child3: IDate, format: NA NA ...
##  - attr(*, ".internal.selfref")=<externalptr>
DT.melt <- melt(DT, id.vars = c("family_id","age_mother"),
                measure.vars = c("dob_child1", "dob_child2", "dob_child3"),
                variable.name = "child",
                value.name = "dob")
head(DT.melt)
##    family_id age_mother      child        dob
## 1:         1         30 dob_child1 1998-11-26
## 2:         2         27 dob_child1 1996-06-22
## 3:         3         26 dob_child1 2002-07-11
## 4:         4         32 dob_child1 2004-10-10
## 5:         5         29 dob_child1 2000-12-05
## 6:         1         30 dob_child2 2000-01-29
  • If neither id.var and measure.var are specified , the id will assign all non-integer, logical, numeric columns.
  • By default, var and value name is simply var and value.

dcast converts long to wide

dcast use formula as below

dcast(DT.melt, family_id + age_mother ~ child, value.var = "dob") #value.var is the variable to fill all cells
##    family_id age_mother dob_child1 dob_child2 dob_child3
## 1:         1         30 1998-11-26 2000-01-29       <NA>
## 2:         2         27 1996-06-22       <NA>       <NA>
## 3:         3         26 2002-07-11 2004-04-05 2007-09-02
## 4:         4         32 2004-10-10 2009-08-27 2012-07-21
## 5:         5         29 2000-12-05 2005-02-28       <NA>

Limitation in combine multiple columns

s2 <- "family_id age_mother dob_child1 dob_child2 dob_child3 gender_child1 gender_child2 gender_child3
1         30 1998-11-26 2000-01-29         NA             1             2            NA
2         27 1996-06-22         NA         NA             2            NA            NA
3         26 2002-07-11 2004-04-05 2007-09-02             2             2             1
4         32 2004-10-10 2009-08-27 2012-07-21             1             1             1
5         29 2000-12-05 2005-02-28         NA             2             1            NA"
DT <- fread(s2)
DT
##    family_id age_mother dob_child1 dob_child2 dob_child3 gender_child1
## 1:         1         30 1998-11-26 2000-01-29       <NA>             1
## 2:         2         27 1996-06-22       <NA>       <NA>             2
## 3:         3         26 2002-07-11 2004-04-05 2007-09-02             2
## 4:         4         32 2004-10-10 2009-08-27 2012-07-21             1
## 5:         5         29 2000-12-05 2005-02-28       <NA>             2
##    gender_child2 gender_child3
## 1:             2            NA
## 2:            NA            NA
## 3:             2             1
## 4:             1             1
## 5:             1            NA

An untidy way to implement this (not run)

DT.m1 = melt(DT, id = c("family_id", "age_mother"))
DT.m1[, c("variable", "child") := tstrsplit(variable, "_", fixed = TRUE)] #I don't even understand this weird function tstrsplit
DT.c1 = dcast(DT.m1, family_id + age_mother + child ~ variable, value.var = "value")
DT.c1

Enhanced ways

Pass a list to measure.vars where each element of the list contains the column that should be combined together.

colA = paste("dob_child", 1:3, sep = "")
colB = paste("gender_child", 1:3, sep = "")
DT.m2 = melt(DT, measure = list(colA, colB), value.name = c("dob", "gender")) #notice the measure paramter instead of measure.var
DT.m2
##     family_id age_mother variable        dob gender
##  1:         1         30        1 1998-11-26      1
##  2:         2         27        1 1996-06-22      2
##  3:         3         26        1 2002-07-11      2
##  4:         4         32        1 2004-10-10      1
##  5:         5         29        1 2000-12-05      2
##  6:         1         30        2 2000-01-29      2
##  7:         2         27        2       <NA>     NA
##  8:         3         26        2 2004-04-05      2
##  9:         4         32        2 2009-08-27      1
## 10:         5         29        2 2005-02-28      1
## 11:         1         30        3       <NA>     NA
## 12:         2         27        3       <NA>     NA
## 13:         3         26        3 2007-09-02      1
## 14:         4         32        3 2012-07-21      1
## 15:         5         29        3       <NA>     NA

The additional variable column is good to do dcast later and can be removed if necessary!

Using patterns()

DT.m2 = melt(DT, measure = patterns("^dob", "^gender"), value.name = c("dob", "gender"))
DT.m2
##     family_id age_mother variable        dob gender
##  1:         1         30        1 1998-11-26      1
##  2:         2         27        1 1996-06-22      2
##  3:         3         26        1 2002-07-11      2
##  4:         4         32        1 2004-10-10      1
##  5:         5         29        1 2000-12-05      2
##  6:         1         30        2 2000-01-29      2
##  7:         2         27        2       <NA>     NA
##  8:         3         26        2 2004-04-05      2
##  9:         4         32        2 2009-08-27      1
## 10:         5         29        2 2005-02-28      1
## 11:         1         30        3       <NA>     NA
## 12:         2         27        3       <NA>     NA
## 13:         3         26        3 2007-09-02      1
## 14:         4         32        3 2012-07-21      1
## 15:         5         29        3       <NA>     NA
#the ^ here means letters beginning with "dob" and "gender"
#similarly, _ means end with xxx

You can also do this using stat::reshape() in base R

Enhanced dcast

DT.c2 <- dcast(DT.m2, family_id + age_mother ~ variable, value.var = c("dob", "gender") )
DT.c2
##    family_id age_mother      dob_1      dob_2      dob_3 gender_1 gender_2
## 1:         1         30 1998-11-26 2000-01-29       <NA>        1        2
## 2:         2         27 1996-06-22       <NA>       <NA>        2       NA
## 3:         3         26 2002-07-11 2004-04-05 2007-09-02        2        2
## 4:         4         32 2004-10-10 2009-08-27 2012-07-21        1        1
## 5:         5         29 2000-12-05 2005-02-28       <NA>        2        1
##    gender_3
## 1:       NA
## 2:       NA
## 3:        1
## 4:        1
## 5:       NA