Review data analysis in R

Posted by Rui Ying on Wednesday, January 6, 2021

Use tibble to replace “old” data.frame

Differences:

  • print() by specifying rows;
  • subset (you can do slice like in Pandas).
library(tidyverse)
library(magrittr)
#mtcars %>% print(n=10,width = Inf) #always print all columns, regardless of the width of the screen.
mtcars[[1]] == mtcars[['mpg']] #first column/variable
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [31] TRUE TRUE
mtcars[,1] == mtcars$mpg #the same
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [31] TRUE TRUE
mtcars %>% select(mpg) %>% head(5) #select() is also used in subset but a little different
##                    mpg
## Mazda RX4         21.0
## Mazda RX4 Wag     21.0
## Datsun 710        22.8
## Hornet 4 Drive    21.4
## Hornet Sportabout 18.7

convert with data.frame

df <- tibble(foo = 1, bar = 'bar')
class(as.data.frame(df)) #[1] "data.frame"
## [1] "data.frame"
class(as_tibble(df)) #[1] "tbl_df"     "tbl"        "data.frame"
## [1] "tbl_df"     "tbl"        "data.frame"

Reshape data

What is Tidy data

Column -> variable, row -> observation. Ad: 1. uniformity; 2. Placing variable in columns (or a “vector” in R) allows R and tidyverse to work naturally. e.g.

mutate(mtcars,new_col = cyl+1) %>% head(5)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb new_col
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4       7
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4       7
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1       5
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1       7
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2       9

dplyr basics

  • Pick observations by their values: filter()
  • Reorder the rows: arrange()
  • Pick variables by their names: select()
  • Create new variable by computing based on existing variables: mutate()
  • A simple summary: summarise()
  • Group data by one or more variables: group_by() (always ungroup() after grouping)

Pivot

Longer table

Question: Column names are not names of variables, but values of a variable. For example:

head(table4a)
## # A tibble: 3 × 3
##   country     `1999` `2000`
##   <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766

We can divide the two untidy columns into a variable column (with col_name ‘year’) and a value column (with col_name ‘cases’)

table4a %>% pivot_longer(c('1999','2000'), names_to = "year", values_to = "cases")
## # A tibble: 6 × 3
##   country     year   cases
##   <chr>       <chr>  <int>
## 1 Afghanistan 1999     745
## 2 Afghanistan 2000    2666
## 3 Brazil      1999   37737
## 4 Brazil      2000   80488
## 5 China       1999  212258
## 6 China       2000  213766
table4a %>% gather(`1999`,`2000`,key = 'year',value = 'cases') #the gather() in tidyr is equal
## # A tibble: 6 × 3
##   country     year   cases
##   <chr>       <chr>  <int>
## 1 Afghanistan 1999     745
## 2 Brazil      1999   37737
## 3 China       1999  212258
## 4 Afghanistan 2000    2666
## 5 Brazil      2000   80488
## 6 China       2000  213766

Wider pivot

Question: Observations is scattered across multiple rows e.g. the cases and population are both observation, which should be put in one row.

 head(table2,5)
## # A tibble: 5 × 4
##   country      year type          count
##   <chr>       <int> <chr>         <int>
## 1 Afghanistan  1999 cases           745
## 2 Afghanistan  1999 population 19987071
## 3 Afghanistan  2000 cases          2666
## 4 Afghanistan  2000 population 20595360
## 5 Brazil       1999 cases         37737
table2 %>% pivot_wider(names_from = type, values_from = count) # here's 2 paramters only
## # A tibble: 6 × 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583
table2 %>% spread(type,count) #spread() is equal as well
## # A tibble: 6 × 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

Joins

The variables used to connect each pair of tables are called keys, if it is primary key in one table (i.e. define the observation or row name), then it’s a foreign key in other tables. Primary key are not necessary for a table.

Four mutate joins (add new matched variable to primary): inner (smallest subset), left (as primary), right (as primary), full (collection). Two filter joins (filter matched observations to save or drop).

reshap2 package

melt()

melt() breaks data into a data.frame with only two columns (variable and value), which is exactly transforming a “wide format” data into a “long format” data. parameters: - id - variable.name (change the default variable column name) - value.name (change the default value column name)

library(reshape2)
table1_melt <- table1 %>% melt(id=c("country","year"))
table1_melt
##        country year   variable      value
## 1  Afghanistan 1999      cases        745
## 2  Afghanistan 2000      cases       2666
## 3       Brazil 1999      cases      37737
## 4       Brazil 2000      cases      80488
## 5        China 1999      cases     212258
## 6        China 2000      cases     213766
## 7  Afghanistan 1999 population   19987071
## 8  Afghanistan 2000 population   20595360
## 9       Brazil 1999 population  172006362
## 10      Brazil 2000 population  174504898
## 11       China 1999 population 1272915272
## 12       China 2000 population 1280428583

dcast()

dcast() do the opposite thing as melt(), transforming long to wide data. Additional parameter including mean, The difference between dcast and acast is the output in data.frame (2-D only) or vector/array/matrix.

table1_melt %>% dcast(country+year~variable) #turn it back
##       country year  cases population
## 1 Afghanistan 1999    745   19987071
## 2 Afghanistan 2000   2666   20595360
## 3      Brazil 1999  37737  172006362
## 4      Brazil 2000  80488  174504898
## 5       China 1999 212258 1272915272
## 6       China 2000 213766 1280428583

This post is built by rmarkdown. Seemingly Rstudio and Rmd are good tool to manage this site.