How can I calculate the correlation between each pair of variables within each group in R using the dplyr package?

Question

Let's say I have a data frame in R that looks like this:

library(tibble)

var2 = c(rep("A",3),rep("B",3),rep("C",3),rep("D",3),rep("E",3),rep("F",3),
         rep("H",3),rep("I",3))

y2 = c(-1.23, -0.983, 1.28, -0.268, -0.46, -1.23,
       1.87, 0.416, -1.99, 0.289, 1.7, -0.455,
       -0.648, 0.376, -0.887, 0.534, -0.679, -0.923,
       0.987, 0.324, -0.783, -0.679, 0.326, 0.998)
group2 = c(rep(1,6),rep(2,6),rep(3,6),rep(1,6))
data2 = tibble(var2,group2,y2)

with output:

# A tibble: 24 × 3
   var2  group2     y2
   <chr>  <dbl>  <dbl>
 1 A          1 -1.23 
 2 A          1 -0.983
 3 A          1  1.28 
 4 B          1 -0.268
 5 B          1 -0.46 
 6 B          1 -1.23 
 7 C          2  1.87 
 8 C          2  0.416
 9 C          2 -1.99 
10 D          2  0.289
11 D          2  1.7  
12 D          2 -0.455
13 E          3 -0.648
14 E          3  0.376
15 E          3 -0.887
16 F          3  0.534
17 F          3 -0.679
18 F          3 -0.923
19 H          1  0.987
20 H          1  0.324
21 H          1 -0.783
22 I          1 -0.679
23 I          1  0.326
24 I          1  0.998

I want to calculate the correlation of each distinct pair of variables within each group using dplyr. Ideally, the resulting tibble should look like this, with the fourth column containing the correlation of each pair:

group	var1	var2	value
1	A	B	cor(A,B)
1	A	H	cor(A,H)
1	A	I	cor(A,I)
1	B	H	cor(B,H)
1	B	I	cor(B,I)
1	H	I	cor(H,I)
2	C	D	cor(C,D)
3	E	F	cor(E,F)

How can I do that in R? Any help?

PaulS · Accepted Answer · 2022-07-15 16:47:23Z

A possible solution:

library(tidyverse)

data2 %>%
  group_by(group2) %>% 
  group_split() %>% 
  map(\(x) x %>% group_by(var2) %>% 
  group_map(~ data.frame(.x[-1]) %>% set_names(.y)) %>% 
  bind_cols() %>% cor %>% 
  {data.frame(row = rownames(.)[row(.)[upper.tri(.)]], 
              col = colnames(.)[col(.)[upper.tri(.)]], 
              corr = .[upper.tri(.)])}) %>% 
  imap_dfr(~ data.frame(group = .y, .x))

#>   group row col       corr
#> 1     1   A   B -0.9949738
#> 2     1   A   H -0.9581357
#> 3     1   B   H  0.9819901
#> 4     1   A   I  0.8533855
#> 5     1   B   I -0.9012948
#> 6     1   H   I -0.9669093
#> 7     2   C   D  0.4690460
#> 8     3   E   F -0.1864518
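The `upper.tri()` step in the pipeline above is what keeps each unordered pair exactly once. A minimal base R sketch of that step in isolation, using group 1 of the question's data (the names `m`, `cm`, `keep` and `pairs` are illustrative, not part of the answer):

```r
# Group 1 of the question's data, one column per variable (3 observations each)
m <- cbind(
  A = c(-1.23, -0.983, 1.28),
  B = c(-0.268, -0.46, -1.23),
  H = c(0.987, 0.324, -0.783),
  I = c(-0.679, 0.326, 0.998)
)

cm   <- cor(m)         # 4 x 4 symmetric correlation matrix
keep <- upper.tri(cm)  # TRUE strictly above the diagonal, so each pair appears once

# row() and col() give the row/column index of every cell; subsetting them
# with `keep` yields the two variable names for each retained correlation
pairs <- data.frame(
  row  = rownames(cm)[row(cm)[keep]],
  col  = colnames(cm)[col(cm)[keep]],
  corr = cm[keep]
)
pairs
```

Because R stores matrices column by column, the pairs come out in the order A-B, A-H, B-H, A-I, B-I, H-I, which is why the output above is not sorted by the first variable.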

What if there were an NA value somewhere? Say the NA is the last value of the y vector; then the correlation must be pairwise complete. What modifications to your solution would I need to make in order to calculate it?
– Homer Jay Simpson
Commented Jul 15, 2022 at 17:05
By pairwise complete correlations, do you mean that the correlations with I should appear as NA, @HomerJaySimpson? Or should the NA be ignored while computing the correlation, in order to obtain a number for the correlation (and not NA)?
– PaulS
Commented Jul 15, 2022 at 17:12
Yes, you are right, I am sorry.
– Homer Jay Simpson
Commented Jul 15, 2022 at 17:16
If you want the correlations to be NA when I is involved, then my code already does that. If you need to ignore the NA and compute those correlations in order to obtain numbers, we need to change the code a little bit.
– PaulS
Commented Jul 15, 2022 at 17:20
Thank you very much, Paul, for your effort and your useful and precious answers.
– Homer Jay Simpson
Commented Jul 15, 2022 at 18:04
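For the second case discussed in the comments, where the NA should be ignored rather than propagated, one option is the `use` argument of `cor()`. A minimal sketch, assuming the last value of `y2` is set to NA; the helper `cor_pairs`, the `obs` column and the `pivot_wider()` reshaping are illustrative choices and not the code from the answer:

```r
library(dplyr)
library(tidyr)
library(purrr)

data2 <- tibble(
  var2   = rep(c("A", "B", "C", "D", "E", "F", "H", "I"), each = 3),
  group2 = rep(c(1, 2, 3, 1), each = 6),
  y2     = c(-1.23, -0.983, 1.28, -0.268, -0.46, -1.23,
             1.87, 0.416, -1.99, 0.289, 1.7, -0.455,
             -0.648, 0.376, -0.887, 0.534, -0.679, -0.923,
             0.987, 0.324, -0.783, -0.679, 0.326, 0.998)
)
data2$y2[24] <- NA  # assumption: the last value is missing

# Hypothetical helper: correlations for all pairs within one group's rows
cor_pairs <- function(df) {
  wide <- df %>%
    group_by(var2) %>%
    mutate(obs = row_number()) %>%   # observation index within each variable
    ungroup() %>%
    pivot_wider(id_cols = obs, names_from = var2, values_from = y2) %>%
    select(-obs)
  # For each pair, drop only the observations missing in that pair
  cm   <- cor(wide, use = "pairwise.complete.obs")
  keep <- upper.tri(cm)
  tibble(var1  = rownames(cm)[row(cm)[keep]],
         var2  = colnames(cm)[col(cm)[keep]],
         value = cm[keep])
}

result <- data2 %>%
  split(.$group2) %>%
  map_dfr(cor_pairs, .id = "group")
result
```

With only three observations per variable, dropping one leaves two points for every pair that involves I, so those correlations are exactly 1 or -1. The option is more informative on longer series.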

Onyambu · Accepted Answer · 2022-07-15 16:53:52Z

If you are okay with repeating the functions, you can do:

fun <- function(x, y){
  a <- split(x, y)
  col1 <- combn(names(a), 2, paste, collapse = '_')
  col2 <- combn(unname(a), 2, do.call, what='cor')
  data.frame(vars = col1, cor = col2)
}

data2 %>%
  group_by(group2)%>%
  summarise(fun(y2, var2), .groups = 'drop')

# A tibble: 8 x 3
  group2 vars     cor
   <dbl> <chr>  <dbl>
1      1 A_B   -0.995
2      1 A_H   -0.958
3      1 A_I    0.853
4      1 B_H    0.982
5      1 B_I   -0.901
6      1 H_I   -0.967
7      2 C_D    0.469
8      3 E_F   -0.186
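In dplyr 1.1.0 and later, returning more than one row per group from `summarise()` is deprecated and produces a warning; `reframe()` is the intended replacement. A sketch of the same approach with `reframe()`, assuming dplyr >= 1.1.0 (the helper name `pair_cors` is hypothetical and simply mirrors `fun` above):

```r
library(dplyr)

data2 <- tibble(
  var2   = rep(c("A", "B", "C", "D", "E", "F", "H", "I"), each = 3),
  group2 = rep(c(1, 2, 3, 1), each = 6),
  y2     = c(-1.23, -0.983, 1.28, -0.268, -0.46, -1.23,
             1.87, 0.416, -1.99, 0.289, 1.7, -0.455,
             -0.648, 0.376, -0.887, 0.534, -0.679, -0.923,
             0.987, 0.324, -0.783, -0.679, 0.326, 0.998)
)

# Same logic as `fun` in the answer: split the values by variable, then
# build the pair labels and the pairwise correlations with combn()
pair_cors <- function(x, y) {
  a <- split(x, y)
  data.frame(
    vars = combn(names(a), 2, paste, collapse = "_"),
    cor  = combn(unname(a), 2, do.call, what = "cor")
  )
}

# reframe() accepts any number of rows per group and returns an ungrouped tibble
result <- data2 %>%
  reframe(pair_cors(y2, var2), .by = group2)
result
```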

If you do not want to repeat the functions as the process might be expensive, you can do:

data2 %>%
  group_by(group2) %>%
  summarise(s = combn(split(y2, var2), 2,
                      \(x) stack(setNames(cor(x[[1]], x[[2]]),
                                          paste(names(x), collapse = '_'))),
                      simplify = FALSE),
            .groups = 'drop') %>%
  unnest(s)

# A tibble: 8 x 3
  group2 values ind  
   <dbl>  <dbl> <fct>
1      1 -0.995 A_B  
2      1 -0.958 A_H  
3      1  0.853 A_I  
4      1  0.982 B_H  
5      1 -0.901 B_I  
6      1 -0.967 H_I  
7      2  0.469 C_D  
8      3 -0.186 E_F

I want the vars column in your first solution and the ind column in the second to be split into two separate columns.
– Homer Jay Simpson
Commented Jul 17, 2022 at 9:22
@HomerJaySimpson That's easy. Just use the separate function to do that.
– Onyambu
Commented Jul 17, 2022 at 9:40
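The `separate()` suggestion can be sketched as follows, assuming the pair labels use `_` as the delimiter as in the first solution. The input tibble `res` is typed in from that solution's printed output, and the new column names `var1` and `var2` are chosen here for illustration:

```r
library(tibble)
library(tidyr)

# Result of the first solution, copied from its printed (rounded) output
res <- tibble(
  group2 = c(1, 1, 1, 1, 1, 1, 2, 3),
  vars   = c("A_B", "A_H", "A_I", "B_H", "B_I", "H_I", "C_D", "E_F"),
  cor    = c(-0.995, -0.958, 0.853, 0.982, -0.901, -0.967, 0.469, -0.186)
)

# Split the combined label into two columns at the underscore
out <- separate(res, vars, into = c("var1", "var2"), sep = "_")
out
```

For the second solution the same call should work with `ind` in place of `vars`. In tidyr 1.3.0 and later, `separate()` is superseded by `separate_wider_delim()`, although it still works.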

stefan · Accepted Answer · 2022-07-15 16:51:14Z

Another option would be widyr::pairwise_cor, which requires first adding an identifier for the "observation":

library(widyr)
library(dplyr)

data2 %>%
  group_by(var2, group2) %>%
  mutate(obs = row_number()) %>% 
  ungroup() %>% 
  split(.$group2) %>%
  lapply(function(x) widyr::pairwise_cor(x, var2, obs, y2, upper = FALSE)) %>%
  bind_rows(.id = "group2")
#> # A tibble: 8 × 4
#>   group2 item1 item2 correlation
#>   <chr>  <chr> <chr>       <dbl>
#> 1 1      A     B          -0.995
#> 2 1      A     H          -0.958
#> 3 1      B     H           0.982
#> 4 1      A     I           0.853
#> 5 1      B     I          -0.901
#> 6 1      H     I          -0.967
#> 7 2      C     D           0.469
#> 8 3      E     F          -0.186

What if there were an NA value somewhere? Say the NA is the last value of the y vector; then the correlation must be pairwise complete. What modifications to your solution would I need to make in order to calculate it?
– Homer Jay Simpson
Commented Jul 15, 2022 at 17:05
