R Data Reshaping

Merge data frames

In R, data frames are merged with the merge() function.

The syntax of the merge() function is as follows:

# S3 generic
merge(x, y, ...)
# S3 method for class 'data.frame'
merge(x, y, by = intersect(names(x), names(y)),
      by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all,
      sort = TRUE, suffixes = c(".x",".y"), no.dups = TRUE,
      incomparables = NULL, ...)

Common parameter descriptions:

x, y: The data frames to merge.
by, by.x, by.y: Names of the columns to match on in the two data frames. By default, the columns whose names appear in both data frames are used.
all: Logical value; all = L is shorthand for all.x = L and all.y = L, where L is TRUE or FALSE.
all.x: Logical value, FALSE by default. If TRUE, all rows of x are kept, even those with no match in y; for such rows, the columns that come from y are filled with NA.
all.y: Logical value, FALSE by default. If TRUE, all rows of y are kept, even those with no match in x; for such rows, the columns that come from x are filled with NA.
sort: Logical value indicating whether the result is sorted on the by columns.
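
The by.x, by.y, all.x and suffixes parameters described above can be sketched as follows. This is a minimal illustration that assumes two small, made-up data frames (sites and visits are hypothetical names, not part of the tutorial's data) whose key columns are named differently and which share one non-key column name.

```r
# Hypothetical example data: the key column has a different name in each frame.
sites  <- data.frame(SiteId = c(1, 2, 3),
                     Name   = c("A", "B", "C"))
visits <- data.frame(Id   = c(2, 3, 4),
                     Name = c("b", "c", "d"))

# Match sites$SiteId against visits$Id and keep unmatched rows of sites (all.x).
# The shared non-key column "Name" is disambiguated with custom suffixes.
merged <- merge(sites, visits,
                by.x = "SiteId", by.y = "Id",
                all.x = TRUE,
                suffixes = c(".site", ".visit"))
print(merged)
```

The key column in the result takes its name from x (SiteId here), and the row of sites without a match in visits gets NA in the column that comes from visits.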

The merge() function works much like SQL's JOIN:

Natural join or INNER JOIN: Return only the rows whose key has a match in both tables (the default, all = FALSE)
Left outer join or LEFT JOIN: Return all rows from the left table, even if there is no match in the right table (all.x = TRUE)
Right outer join or RIGHT JOIN: Return all rows from the right table, even if there is no match in the left table (all.y = TRUE)
Full outer join or FULL JOIN: Return all rows from both tables, whether or not they have a match (all = TRUE)

Example

# Data frame 1
df1 = data.frame(SiteId = c(1:6), Site = c("Google","w3codebox","Taobao","Facebook","Zhihu","Weibo"))
# Data frame 2
df2 = data.frame(SiteId = c(2, 4, 6, 7, 8), Country = c("CN","USA","CN","USA","IN"))
# INNER JOIN (each result gets its own name so that df1 and df2 stay unchanged)
inner = merge(x=df1,y=df2,by="SiteId")
print("----- INNER JOIN -----")
print(inner)
# FULL JOIN
full = merge(x=df1,y=df2,by="SiteId",all=TRUE)
print("----- FULL JOIN -----")
print(full)
# LEFT JOIN
left = merge(x=df1,y=df2,by="SiteId",all.x=TRUE)
print("----- LEFT JOIN -----")
print(left)
# RIGHT JOIN
right = merge(x=df1,y=df2,by="SiteId",all.y=TRUE)
print("----- RIGHT JOIN -----")
print(right)

The output of the above code is:

[1] "----- INNER JOIN -----"
  SiteId      Site Country
1      2 w3codebox      CN
2      4  Facebook     USA
3      6     Weibo      CN
[1] "----- FULL JOIN -----"
  SiteId      Site Country
1      1    Google    <NA>
2      2 w3codebox      CN
3      3    Taobao    <NA>
4      4  Facebook     USA
5      5     Zhihu    <NA>
6      6     Weibo      CN
7      7      <NA>     USA
8      8      <NA>      IN
[1] "----- LEFT JOIN -----"
  SiteId      Site Country
1      1    Google    <NA>
2      2 w3codebox      CN
3      3    Taobao    <NA>
4      4  Facebook     USA
5      5     Zhihu    <NA>
6      6     Weibo      CN
[1] "----- RIGHT JOIN -----"
  SiteId      Site Country
1      2 w3codebox      CN
2      4  Facebook     USA
3      6     Weibo      CN
4      7      <NA>     USA
5      8      <NA>      IN

Data integration and splitting

R uses the melt() and cast() functions to integrate and split data.

melt(): Convert wide-form data to long-form.
cast(): Convert long-form data to wide-form.
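
The two conversions can be sketched as a round trip. This is a minimal illustration that assumes the reshape2 package is installed and uses its dcast() function for the long-to-wide step; the data frame wide is a made-up example, and the calls are namespace-qualified so that they do not depend on which packages are attached.

```r
# Hypothetical wide-form data: one row per id, one column per measurement.
wide <- data.frame(id = c(1, 2), x1 = c(5, 6), x2 = c(6, 1))

# Wide -> long: one row per (id, variable) pair.
long <- reshape2::melt(wide, id.vars = "id")

# Long -> wide: one column per level of `variable`.
back <- reshape2::dcast(long, id ~ variable, value.var = "value")

print(long)
print(back)
```

Because each id and variable combination occurs exactly once in long, no aggregation is needed and back holds the same values as wide.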

The following figure illustrates what melt() and cast() do (detailed examples are provided later):

melt() stacks the columns of a dataset into a single column. Its syntax is:

melt(data, ..., na.rm = FALSE, value.name = "value")

Parameter description:

data: The dataset to melt.
...: Further arguments passed to or from other methods.
na.rm: Whether to remove NA values from the dataset.
value.name: Name of the variable used to store the values.
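
The effect of na.rm and value.name can be sketched as follows. This is a minimal illustration that assumes the reshape2 package is installed and calls its data frame method with the id.vars argument; scores is a made-up data frame containing one missing value.

```r
# Hypothetical data with one missing measurement.
scores <- data.frame(id = c(1, 2), x1 = c(5, NA), x2 = c(6, 1))

# na.rm = TRUE drops the row holding the NA value;
# value.name renames the value column from the default "value" to "score".
long <- reshape2::melt(scores, id.vars = "id",
                       na.rm = TRUE, value.name = "score")
print(long)
```

Four values go in, but only three rows come out, because the NA in x1 is removed during melting.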

Before performing the following operations, we first install the required packages:

# Install libraries; MASS includes many statistical functions, tools, and datasets
install.packages("MASS", repos = "https://mirrors.ustc.edu.cn/CRAN/")

# The melt() and cast() functions require these libraries
install.packages("reshape2", repos = "https://mirrors.ustc.edu.cn/CRAN/")
install.packages("reshape", repos = "https://mirrors.ustc.edu.cn/CRAN/")

Test example:

Example

# Load libraries
library(MASS)
library(reshape2)
library(reshape)

# Create data frame
id <- c(1, 1, 2, 2)
time <- c(1, 2, 1, 2)
x1 <- c(5, 3, 6, 2)
x2 <- c(6, 5, 1, 4)
mydata <- data.frame(id, time, x1, x2)

# Original data frame
cat("Original data frame:\n")
print(mydata)
# Integration
md <- melt(mydata, id = c("id","time"))

cat("\nAfter integration:\n")
print(md)

The output of the above code is:

Original data frame:
  id time x1 x2
1  1    1  5  6
2  1    2  3  5
3  2    1  6  1
4  2    2  2  4

After integration:
  id time variable value
1  1    1       x1     5
2  1    2       x1     3
3  2    1       x1     6
4  2    2       x1     2
5  1    1       x2     6
6  1    2       x2     5
7  2    1       x2     1
8  2    2       x2     4

The cast functions reshape a melted data frame back into wide form: dcast() returns a data frame, and acast() returns a vector, matrix, or array.

Syntax of the dcast() and acast() functions:

dcast(
  data,
  formula,
  fun.aggregate = NULL,
  ...,
  margins = NULL,
  subset = NULL,
  fill = NULL,
  drop = TRUE,
  value.var = guess_value(data)
)
acast(
  data,
  formula,
  fun.aggregate = NULL,
  ...,
  margins = NULL,
  subset = NULL,
  fill = NULL,
  drop = TRUE,
  value.var = guess_value(data)
)

Parameter description:

data: The melted data frame.
formula: The shape of the reshaped data, written as x ~ y, where x supplies the row labels and y the column labels.
fun.aggregate: Aggregation function, applied when several values fall into the same cell of the result.
margins: A vector of variable names (can include "grand_col" and "grand_row") for which margins are computed; set to TRUE to compute all margins.
subset: Condition used to filter the data, in a format such as subset = .(variable=="length").
fill: Value used to fill cells that have no data.
drop: Whether missing combinations are dropped (TRUE) or kept (FALSE).
value.var: Name of the column that stores the values.
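
The role of fun.aggregate and value.var can be sketched as follows. This is a minimal illustration that assumes the reshape2 package is installed; the long-form data frame is built by hand here (a hypothetical stand-in for a melted data frame) so that the snippet is self-contained.

```r
# Hand-built long-form data: two time points per id for each variable.
long <- data.frame(
  id       = c(1, 1, 2, 2, 1, 1, 2, 2),
  time     = c(1, 2, 1, 2, 1, 2, 1, 2),
  variable = rep(c("x1", "x2"), each = 4),
  value    = c(5, 3, 6, 2, 6, 5, 1, 4)
)

# id ~ variable leaves `time` out of the formula, so two values land in
# every cell; fun.aggregate = mean collapses them into one number.
by_id <- reshape2::dcast(long, id ~ variable,
                         fun.aggregate = mean, value.var = "value")
print(by_id)
```

When a formula leaves several values per cell and no aggregation function is given, dcast() falls back to counting the values with length and prints a message saying so, which is why an explicit fun.aggregate is usually the safer choice.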

Example

# Load libraries
library(MASS)
library(reshape2)
library(reshape)

# Create data frame
id <- c(1, 1, 2, 2)
time <- c(1, 2, 1, 2)
x1 <- c(5, 3, 6, 2)
x2 <- c(6, 5, 1, 4)
mydata <- data.frame(id, time, x1, x2)
# Integration
md <- melt(mydata, id = c("id","time"))
# Print recasted dataset using cast() function
cast.data <- cast(md, id~variable, mean)

print(cast.data)

cat("\n")
time.cast <- cast(md, time~variable, mean)
print(time.cast)
cat("\n")
id.time <- cast(md, id~time, mean)
print(id.time)
cat("\n")
id.time.cast <- cast(md, id+time~variable)
print(id.time.cast)
cat("\n")
id.variable.time <- cast(md, id+variable~time)
print(id.variable.time)
cat("\n")
id.variable.time2 <- cast(md, id~variable+time)
print(id.variable.time2)

The output of the above code is:

  id x1  x2
1  1  4 5.5
2  2  4 2.5

  time  x1  x2
1    1 5.5 3.5
2    2 2.5 4.5

  id   1 2
1  1 5.5 4
2  2 3.5 3

  id time x1 x2
1  1    1  5  6
2  1    2  3  5
3  2    1  6  1
4  2    2  2  4

  id variable 1 2
1  1       x1 5 3
2  1       x2 6 5
3  2       x1 6 2
4  2       x2 1 4

  id x1_1 x1_2 x2_1 x2_2
1  1    5    3    6    5
2  2    6    2    1    4
