
Let's say I have a data frame in R that looks like this:

library(tibble)

var2 = c(rep("A",3),rep("B",3),rep("C",3),rep("D",3),rep("E",3),rep("F",3),
         rep("H",3),rep("I",3))

y2 = c(-1.23, -0.983, 1.28, -0.268, -0.46, -1.23,
       1.87, 0.416, -1.99, 0.289, 1.7, -0.455,
       -0.648, 0.376, -0.887, 0.534, -0.679, -0.923,
       0.987, 0.324, -0.783, -0.679, 0.326, 0.998)
group2 = c(rep(1,6),rep(2,6),rep(3,6),rep(1,6))
data2 = tibble(var2,group2,y2)

with output:

# A tibble: 24 × 3
   var2  group2     y2
   <chr>  <dbl>  <dbl>
 1 A          1 -1.23 
 2 A          1 -0.983
 3 A          1  1.28 
 4 B          1 -0.268
 5 B          1 -0.46 
 6 B          1 -1.23 
 7 C          2  1.87 
 8 C          2  0.416
 9 C          2 -1.99 
10 D          2  0.289
11 D          2  1.7  
12 D          2 -0.455
13 E          3 -0.648
14 E          3  0.376
15 E          3 -0.887
16 F          3  0.534
17 F          3 -0.679
18 F          3 -0.923
19 H          1  0.987
20 H          1  0.324
21 H          1 -0.783
22 I          1 -0.679
23 I          1  0.326
24 I          1  0.998

I want to calculate the correlation of each distinct pair of variables within each group using dplyr. Ideally, the resulting tibble should look like this, with the fourth column containing the correlation for each pair:

group var1 var2 value
1 A B cor(A,B)
1 A H cor(A,H)
1 A I cor(A,I)
1 B H cor(B,H)
1 B I cor(B,I)
1 H I cor(H,I)
2 C D cor(C,D)
3 E F cor(E,F)

How can I do that in R?

3 Answers

A possible solution:

library(tidyverse)

data2 %>%
  group_by(group2) %>% 
  group_split() %>% 
  map(\(x) x %>% 
        group_by(var2) %>% 
        group_map(~ data.frame(.x[-1]) %>% set_names(.y)) %>% 
        bind_cols() %>% 
        cor %>% 
        {data.frame(row = rownames(.)[row(.)[upper.tri(.)]], 
                    col = colnames(.)[col(.)[upper.tri(.)]], 
                    corr = .[upper.tri(.)])}) %>% 
  imap_dfr(~ data.frame(group = .y, .x))

#>   group row col       corr
#> 1     1   A   B -0.9949738
#> 2     1   A   H -0.9581357
#> 3     1   B   H  0.9819901
#> 4     1   A   I  0.8533855
#> 5     1   B   I -0.9012948
#> 6     1   H   I -0.9669093
#> 7     2   C   D  0.4690460
#> 8     3   E   F -0.1864518
  • If I had an NA value somewhere, say as the last value of the y vector, the correlation would need to be pairwise complete. What modifications must I make to your solution to calculate it? Commented Jul 15, 2022 at 17:05
  • By pairwise complete correlations, do you mean that the correlations involving I should appear as NA, @HomerJaySimpson? Or should the NA be ignored when computing the correlation, so that you obtain a number (and not NA)?
    – PaulS
    Commented Jul 15, 2022 at 17:12
  • Yes, you are right, I am sorry. Commented Jul 15, 2022 at 17:16
  • If you want the correlations to be NA when I is involved, then my code already does that. If you need to ignore the NA and compute those correlations to obtain numbers, we need to change the code a little bit.
    – PaulS
    Commented Jul 15, 2022 at 17:20
  • Thank you very much, Paul, for your effort and your useful and precious answers. Commented Jul 15, 2022 at 18:04
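
For the pairwise-complete case raised in the comments above, one possible route is to pass use = "pairwise.complete.obs" to cor(). The block below is a minimal base-R sketch, not the modified code PaulS alludes to (the thread does not show it). The helper name pairwise_cor_by_group and the NA placed in the last y2 value are assumptions made for illustration, and the sketch assumes every var2 has the same number of rows within its group.

```r
# Data from the question, with the last y2 value replaced by NA
# (illustrative assumption, following the scenario in the comments).
var2 <- rep(c("A", "B", "C", "D", "E", "F", "H", "I"), each = 3)
y2 <- c(-1.23, -0.983, 1.28, -0.268, -0.46, -1.23,
        1.87, 0.416, -1.99, 0.289, 1.7, -0.455,
        -0.648, 0.376, -0.887, 0.534, -0.679, -0.923,
        0.987, 0.324, -0.783, -0.679, 0.326, NA)
group2 <- rep(c(1, 2, 3, 1), each = 6)
data2 <- data.frame(var2, group2, y2, stringsAsFactors = FALSE)

# Hypothetical helper: one row per distinct pair of var2 within each group2.
pairwise_cor_by_group <- function(df) {
  pieces <- lapply(split(df, df$group2), function(g) {
    # One column per var2; assumes equal row counts per var2 in the group.
    m <- do.call(cbind, split(g$y2, g$var2))
    # Each pair uses only the rows where both variables are non-missing.
    cm <- cor(m, use = "pairwise.complete.obs")
    # Positions of the upper triangle, i.e. each distinct pair once.
    idx <- which(upper.tri(cm), arr.ind = TRUE)
    data.frame(group = g$group2[1],
               var1 = rownames(cm)[idx[, "row"]],
               var2 = colnames(cm)[idx[, "col"]],
               value = cm[idx],
               stringsAsFactors = FALSE)
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

res <- pairwise_cor_by_group(data2)
print(res)
```

With only three observations per variable, dropping the incomplete row leaves two points for every pair involving I, so those correlations come out as exactly 1 or -1. On data with more rows per variable the result would be more informative.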

If you are okay with repeating the functions, you can do:

fun <- function(x, y){
  a <- split(x, y)
  col1 <- combn(names(a), 2, paste, collapse = '_')
  col2 <- combn(unname(a), 2, do.call, what='cor')
  data.frame(vars = col1, cor = col2)
}

data2 %>%
  group_by(group2)%>%
  summarise(fun(y2, var2), .groups = 'drop')

# A tibble: 8 x 3
  group2 vars     cor
   <dbl> <chr>  <dbl>
1      1 A_B   -0.995
2      1 A_H   -0.958
3      1 A_I    0.853
4      1 B_H    0.982
5      1 B_I   -0.901
6      1 H_I   -0.967
7      2 C_D    0.469
8      3 E_F   -0.186
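
A side note on the summarise() call above: from dplyr 1.1.0 onward, returning more than one row per group from summarise() is deprecated, and reframe() is the suggested replacement. Here is a minimal sketch of the same idea with reframe(), assuming dplyr >= 1.1.0 is installed. The data and fun are repeated so the block runs on its own, and res is just an illustrative name.

```r
library(dplyr)

# Data from the question (tibble() is re-exported by dplyr).
data2 <- tibble(
  var2 = rep(c("A", "B", "C", "D", "E", "F", "H", "I"), each = 3),
  group2 = rep(c(1, 2, 3, 1), each = 6),
  y2 = c(-1.23, -0.983, 1.28, -0.268, -0.46, -1.23,
         1.87, 0.416, -1.99, 0.289, 1.7, -0.455,
         -0.648, 0.376, -0.887, 0.534, -0.679, -0.923,
         0.987, 0.324, -0.783, -0.679, 0.326, 0.998)
)

# Same helper as in the answer: correlations for every pair of var2 levels.
fun <- function(x, y) {
  a <- split(x, y)
  col1 <- combn(names(a), 2, paste, collapse = "_")
  col2 <- combn(unname(a), 2, do.call, what = "cor")
  data.frame(vars = col1, cor = col2)
}

# reframe() allows any number of rows per group; .by groups for this call only.
res <- data2 %>%
  reframe(fun(y2, var2), .by = group2)
res
```

The result is ungrouped, so no .groups argument is needed.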

If you do not want to repeat the functions as the process might be expensive, you can do:

data2 %>%
  group_by(group2) %>%
  summarise(s = combn(split(y2, var2), 2,
                      \(x) stack(setNames(cor(x[[1]], x[[2]]), paste(names(x), collapse = '_'))),
                      simplify = FALSE), .groups = 'drop') %>%
  unnest(s)

# A tibble: 8 x 3
  group2 values ind  
   <dbl>  <dbl> <fct>
1      1 -0.995 A_B  
2      1 -0.958 A_H  
3      1  0.853 A_I  
4      1  0.982 B_H  
5      1 -0.901 B_I  
6      1 -0.967 H_I  
7      2  0.469 C_D  
8      3 -0.186 E_F  
  • I want the vars column in your first solution and the ind column in the second to be two separate columns. Commented Jul 17, 2022 at 9:22
  • @HomerJaySimpson That's easy. Just use the separate function to do that.
    – Onyambu
    Commented Jul 17, 2022 at 9:40
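
The separate suggestion in the comment above can be sketched as follows with tidyr::separate(). The small res data frame is a hand-typed stand-in for the first solution's output (rounded values copied from the printed result), and the new column names var1 and var2 are assumptions chosen to match the layout requested in the question.

```r
library(tidyr)

# Stand-in for part of the first solution's output (illustrative subset).
res <- data.frame(
  group2 = c(1, 1, 2),
  vars = c("A_B", "A_H", "C_D"),
  cor = c(-0.995, -0.958, 0.469),
  stringsAsFactors = FALSE
)

# Split "A_B" into two columns at the underscore; vars itself is dropped.
out <- separate(res, vars, into = c("var1", "var2"), sep = "_")
out
```

For the second solution the same call would target ind instead of vars. Since ind is a factor there, converting it with as.character() first should be a safe choice.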

Another option would be widyr::pairwise_cor which requires to first add an identifier for the "observation":

library(widyr)
library(dplyr)

data2 %>%
  group_by(var2, group2) %>%
  mutate(obs = row_number()) %>% 
  ungroup() %>% 
  split(.$group2) %>%
  lapply(function(x) widyr::pairwise_cor(x, var2, obs, y2, upper = FALSE)) %>%
  bind_rows(.id = "group2")
#> # A tibble: 8 × 4
#>   group2 item1 item2 correlation
#>   <chr>  <chr> <chr>       <dbl>
#> 1 1      A     B          -0.995
#> 2 1      A     H          -0.958
#> 3 1      B     H           0.982
#> 4 1      A     I           0.853
#> 5 1      B     I          -0.901
#> 6 1      H     I          -0.967
#> 7 2      C     D           0.469
#> 8 3      E     F          -0.186
  • If I had an NA value somewhere, say as the last value of the y vector, the correlation would need to be pairwise complete. What modifications must I make to your solution to calculate it? Commented Jul 15, 2022 at 17:05
