Split a vector into chunks

Question

I have to split a vector into n chunks of equal size in R. I couldn't find any base function to do that. Also Google didn't get me anywhere. Here is what I came up with so far:

x <- 1:10
n <- 3
chunk <- function(x,n) split(x, factor(sort(rank(x)%%n)))
chunk(x,n)
$`0`
[1] 1 2 3

$`1`
[1] 4 5 6 7

$`2`
[1]  8  9 10

Yes, it's very unclear that what you get is the solution to "n chunks of equal size". But maybe this gets you there too: x <- 1:10; n <- 3; split(x, cut(x, n, labels = FALSE)) — mdsumner, Commented Jul 23, 2010 at 14:08
Both the solution in the question and the solution in the preceding comment are incorrect, in that they might not work if the vector has repeated entries. Try this: > foo <- c(rep(1, 12), rep(2,3), rep(3,3)) [1] 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 3 3 3 > chunk(foo, 2) (gives wrong result) > chunk(foo, 3) (also wrong) – mathheadinclouds, Commented Apr 29, 2013 at 9:21
(Continuing the preceding comment.) Why? rank(x) doesn't need to be an integer: > rank(c(1,1,2,3)) [1] 1.5 1.5 3.0 4.0, so that's why the method in the question fails. This one works (thanks to Harlan below): > chunk2 <- function(x,n) split(x, cut(seq_along(x), n, labels = FALSE)) – mathheadinclouds, Commented Apr 29, 2013 at 9:33
As @mathheadinclouds suggests, the example data is a very special case. Examples that are more general would be more useful and better tests. E.g. x <- c(NA, 4, 3, NA, NA, 2, 1, 1, NA ); y <- letters[x]; z <- factor(y) gives examples with missing data, repeated values, that are not already sorted, and are in different classes (integer, character, factor). — Kalin, Commented Feb 21, 2018 at 17:39
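
To make the problem with repeated entries concrete, here is a small sketch. It reuses chunk() from the question, chunk2() from the comment above, and the vector foo from that same comment; nothing else is assumed.

```r
# chunk() from the question: relies on rank(x), which returns
# fractional (averaged) ranks when x contains ties
chunk <- function(x, n) split(x, factor(sort(rank(x) %% n)))

# chunk2() from the comment: relies on positions only, so ties do not matter
chunk2 <- function(x, n) split(x, cut(seq_along(x), n, labels = FALSE))

foo <- c(rep(1, 12), rep(2, 3), rep(3, 3))

rank(foo)[1]             # 6.5, a non-integer rank caused by the ties
length(chunk(foo, 2))    # 3 groups although 2 were requested
lengths(chunk2(foo, 2))  # two groups of 9 elements each
```

Because chunk2() never looks at the values of x, it should behave the same way for character vectors, factors, or data containing NA.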

dfrankow · Accepted Answer · 2013-12-23 18:41:05Z

403

Answer recommended by R Language Collective

A one-liner splitting d into chunks of size 20:

split(d, ceiling(seq_along(d)/20))

More details: I think all you need is seq_along(), split() and ceiling():

> d <- rpois(73,5)
> d
 [1]  3  1 11  4  1  2  3  2  4 10 10  2  7  4  6  6  2  1  1  2  3  8  3 10  7  4
[27]  3  4  4  1  1  7  2  4  6  0  5  7  4  6  8  4  7 12  4  6  8  4  2  7  6  5
[53]  4  5  4  5  5  8  7  7  7  6  2  4  3  3  8 11  6  6  1  8  4
> max <- 20
> x <- seq_along(d)
> d1 <- split(d, ceiling(x/max))
> d1
$`1`
 [1]  3  1 11  4  1  2  3  2  4 10 10  2  7  4  6  6  2  1  1  2

$`2`
 [1]  3  8  3 10  7  4  3  4  4  1  1  7  2  4  6  0  5  7  4  6

$`3`
 [1]  8  4  7 12  4  6  8  4  2  7  6  5  4  5  4  5  5  8  7  7

$`4`
 [1]  7  6  2  4  3  3  8 11  6  6  1  8  4
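
As the comments on this answer point out, the one-liner fixes the chunk size, not the number of chunks. A minimal sketch of both readings follows; chunk_by_size is a hypothetical helper name, and deriving the size from n is only one possible way to get close to n chunks.

```r
# Hypothetical helper (name not from the answer): chunks of a fixed size
chunk_by_size <- function(x, size) split(x, ceiling(seq_along(x) / size))

d <- 1:73
lengths(chunk_by_size(d, 20))    # 20 20 20 13: the last chunk is shorter

# If n chunks are wanted instead, one option is to derive the size first.
# Caveat: this can yield fewer than n chunks when length(d) is small relative to n.
n <- 4
size <- ceiling(length(d) / n)   # 19
lengths(chunk_by_size(d, size))  # 19 19 19 16
```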

edited Dec 23, 2013 at 18:41

dfrankow


answered Jul 23, 2010 at 19:22

Harlan


50

The question asks for n chunks of equal size. This gets you an unknown number of chunks of size n. I had the same problem and used the solutions from @mathheadinclouds.
– rrs
Commented Apr 21, 2014 at 18:26
5

As one can see from the output of d1, this answer does not split d into groups of equal size (4 is obviously shorter). Thus it does not answer the question.
– Calimo
Commented Jan 23, 2015 at 16:39
9

@rrs : split(d, ceiling(seq_along(d)/(length(d)/n)))
– gkcn
Commented Jun 5, 2015 at 11:45
1

I know this is quite old but it may be of help to those who stumble here. Although the OP's question was to split into chunks of equal size, if the length of the vector is not a multiple of the divisor, the last chunk will have a different size than the others. To split into n chunks I used max <- length(d)%/%n. I used this with a vector of 31 strings and obtained a list of 3 vectors of 10 sentences and one of 1 sentence.
– salvu
Commented Feb 4, 2017 at 12:59
@Harlan Is there a way to shuffle the split as well? your solution worked well for me but I would like to make sure the splits are randomly assigned and not just consecutive
– Spooked
Commented Oct 21, 2020 at 23:22
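
Regarding the shuffling question: one possible approach, sketched here under the assumption that only random assignment matters and the original order does not, is to permute the vector with sample() before splitting. The set.seed() call is there only to make the example reproducible.

```r
set.seed(1)            # only to make the shuffle reproducible
d <- 1:73
shuffled <- sample(d)  # random permutation of d

chunks <- split(shuffled, ceiling(seq_along(shuffled) / 20))
lengths(chunks)        # same chunk sizes as before: 20 20 20 13
```

Note that sample(x) treats a single number x as 1:x, so for vectors that may have length one it is safer to shuffle the indices instead, e.g. d[sample(length(d))].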


Dis Shishkov · Accepted Answer · 2013-04-29 10:10:29Z

117

chunk2 <- function(x,n) split(x, cut(seq_along(x), n, labels = FALSE))
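
For illustration, here is what chunk2() returns on the example from the question and on a character vector. This is just a usage sketch of the function above.

```r
chunk2 <- function(x, n) split(x, cut(seq_along(x), n, labels = FALSE))

# cut() bins the positions 1..length(x) into n equal-width intervals
lengths(chunk2(1:10, 3))       # 4 3 3

# Only positions are used, so non-numeric vectors work too
chunk2(letters[1:10], 3)[[1]]  # "a" "b" "c" "d"
```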

edited Apr 29, 2013 at 10:10

Dis Shishkov


answered Apr 29, 2013 at 9:37

mathheadinclouds


This is the fastest way I've tried so far! Setting labels = FALSE makes it twice as fast, and using cut() is 4 times faster than using ceiling(seq_along(x) / n) on my data.
– Drumy
Commented Oct 21, 2020 at 6:25
1

Correction: this is the fastest among the split() approaches. @verbamour's answer below is the fastest overall. It is blazing fast because it doesn't have to work with factors, nor does it need to sort. That answer deserves a lot more upvotes.
– Drumy
Commented Oct 21, 2020 at 7:05


andschar · Accepted Answer · 2021-09-07 11:31:15Z

54

A simplified version:

n = 3
split(x, sort(x%%n))

NB: This will only work on numeric vectors.
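
A short sketch of why the values of x matter here, and of the position-based variant suggested in the comments (seq_along(x) is equivalent to 1:length(x) for non-empty vectors). The vectors y and z are made up for the example.

```r
n <- 3

# The grouping depends on the values: multiples of n collapse into one group
y <- c(3, 6, 9, 12)
length(split(y, sort(y %% n)))              # 1

# Using positions instead of values works for any vector type
z <- letters[1:10]
lengths(split(z, sort(seq_along(z) %% n)))  # 3 4 3
```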

edited Sep 7, 2021 at 11:31

andschar


answered Apr 20, 2016 at 21:03

zhan2383


I like this as it gives you chunks that are as equally sized as possible (good for dividing up large task e.g. to accommodate limited RAM or to run a task across multiple threads).
– alexvpickering
Commented Jul 21, 2016 at 22:13
7

This is useful, but keep in mind this will only work on numeric vectors.
– Keith Hughitt
Commented Aug 24, 2016 at 17:49
@KeithHughitt this can be solved with factors and returning the levels as numeric. Or at least this is how I implemented it.
– drmariod
Commented Apr 5, 2018 at 7:02
2

@drmariod can also be extended by doing split(x, sort(1:length(x) %% n))
– Richard DiSalvo
Commented Sep 14, 2020 at 19:28
2

@JessicaBurnett I think split() is the slowest part of this code (because it calls as.factor). So maybe consider using a data.frame and do something like data$group <- sort(1:length(data) %% n), then use the group column in the rest of your code.
– Richard DiSalvo
Commented Dec 14, 2021 at 19:40


FXQuantTrader · Accepted Answer · 2018-11-01 04:47:07Z

27

Using base R's rep_len:

x <- 1:10
n <- 3

split(x, rep_len(1:n, length(x)))
# $`1`
# [1]  1  4  7 10
# 
# $`2`
# [1] 2 5 8
# 
# $`3`
# [1] 3 6 9

And as already mentioned if you want sorted indices, simply:

split(x, sort(rep_len(1:n, length(x))))
# $`1`
# [1] 1 2 3 4
# 
# $`2`
# [1] 5 6 7
# 
# $`3`
# [1]  8  9 10

answered Nov 1, 2018 at 4:47

FXQuantTrader


Sam Firke · Accepted Answer · 2017-01-12 02:01:52Z

23

Try the ggplot2 function, cut_number:

library(ggplot2)
x <- 1:10
n <- 3
cut_number(x, n) # labels = FALSE if you just want an integer result
#>  [1] [1,4]  [1,4]  [1,4]  [1,4]  (4,7]  (4,7]  (4,7]  (7,10] (7,10] (7,10]
#> Levels: [1,4] (4,7] (7,10]

# if you want it split into a list:
split(x, cut_number(x, n))
#> $`[1,4]`
#> [1] 1 2 3 4
#> 
#> $`(4,7]`
#> [1] 5 6 7
#> 
#> $`(7,10]`
#> [1]  8  9 10

edited Jan 12, 2017 at 2:01

Sam Firke


answered Jan 9, 2015 at 13:41

Scott Worland


2

This does not work for splitting up the x, y, or z defined in this comment. In particular, it sorts the results, which may or may not be okay, depending on the application.
– Kalin
Commented Feb 21, 2018 at 17:42
Rather, this comment.
– Kalin
Commented Feb 21, 2018 at 17:48


verbamour · Accepted Answer · 2014-12-23 18:26:24Z

21

If you don't like split() and you don't like matrix() (with its dangling NAs), there's this:

chunk <- function(x, n) (mapply(function(a, b) (x[a:b]), seq.int(from=1, to=length(x), by=n), pmin(seq.int(from=1, to=length(x), by=n)+(n-1), length(x)), SIMPLIFY=FALSE))

Like split(), it returns a list, but it doesn't waste time or space with labels, so it may be more performant.
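
The one-liner above is dense, so here is the same idea spelled out as a sketch. chunk_by_index is a hypothetical name, Map() stands in for mapply(..., SIMPLIFY=FALSE), and size is the chunk length (not the number of chunks). It assumes x has at least one element, because seq.int() fails for an empty range.

```r
# Hypothetical, more readable variant of the index-based approach
chunk_by_index <- function(x, size) {
  starts <- seq.int(from = 1, to = length(x), by = size)  # first index of each chunk
  ends   <- pmin(starts + size - 1, length(x))            # last index, capped at length(x)
  Map(function(a, b) x[a:b], starts, ends)                # unnamed list of slices
}

res <- chunk_by_index(1:10, 3)
lengths(res)  # 3 3 3 1
```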

answered Dec 23, 2014 at 18:26

verbamour


1

This is blazing fast!
– Drumy
Commented Oct 21, 2020 at 7:03
1

This also does chunks of size n rather than n chunks.
– nelliott
Commented Dec 8, 2021 at 0:41
1

Just what I needed to prevent an "out of memory" error. Thanks!
– Jeff
Commented Feb 28, 2023 at 16:25
Great. Is there a way to quickly set this up to make each group randomized each time?
– theforestecologist
Commented Aug 28, 2023 at 18:46


15 revs, 2 users 100% · Accepted Answer · 2020-09-29 16:13:00Z

This will split it differently to what you have, but is still quite a nice list structure I think:

chunk.2 <- function(x, n, force.number.of.groups = TRUE, len = length(x), groups = trunc(len/n), overflow = len%%n) { 
  if(force.number.of.groups) {
    f1 <- as.character(sort(rep(1:n, groups)))
    f <- as.character(c(f1, rep(n, overflow)))
  } else {
    f1 <- as.character(sort(rep(1:groups, n)))
    f <- as.character(c(f1, rep("overflow", overflow)))
  }
  
  g <- split(x, f)
  
  if(force.number.of.groups) {
    g.names <- names(g)
    g.names.ordered <- as.character(sort(as.numeric(g.names)))
  } else {
    g.names <- names(g[-length(g)])
    g.names.ordered <- as.character(sort(as.numeric(g.names)))
    g.names.ordered <- c(g.names.ordered, "overflow")
  }
  
  return(g[g.names.ordered])
}

Which will give you the following, depending on how you want it formatted:

> x <- 1:10; n <- 3
> chunk.2(x, n, force.number.of.groups = FALSE)
$`1`
[1] 1 2 3

$`2`
[1] 4 5 6

$`3`
[1] 7 8 9

$overflow
[1] 10

> chunk.2(x, n, force.number.of.groups = TRUE)
$`1`
[1] 1 2 3

$`2`
[1] 4 5 6

$`3`
[1]  7  8  9 10

Running a couple of timings using these settings:

set.seed(42)
x <- rnorm(1e7)
n <- 3

Then we have the following results:

> system.time(chunk(x, n)) # your function 
   user  system elapsed 
 29.500   0.620  30.125 

> system.time(chunk.2(x, n, force.number.of.groups = TRUE))
   user  system elapsed 
  5.360   0.300   5.663

Note: Changing as.factor() to as.character() made my function twice as fast.

Richard Herron · Accepted Answer · 2010-07-23 14:38:42Z

A few more variants to the pile...

> x <- 1:10
> n <- 3

Note that you don't need to use the factor function here, but you still want to sort; otherwise the chunks are not consecutive (the first one would be 3 6 9):

> chunk <- function(x, n) split(x, sort(rank(x) %% n))
> chunk(x,n)
$`0`
[1] 1 2 3
$`1`
[1] 4 5 6 7
$`2`
[1]  8  9 10

Or you can assign character indices instead of the numbers in backticks above:

> my.chunk <- function(x, n) split(x, sort(rep(letters[1:n], each=n, len=length(x))))
> my.chunk(x, n)
$a
[1] 1 2 3 4
$b
[1] 5 6 7
$c
[1]  8  9 10

Or you can use plain-word names stored in a vector. Note that using sort to get consecutive values in x alphabetizes the labels:

> my.other.chunk <- function(x, n) split(x, sort(rep(c("tom", "dick", "harry"), each=n, len=length(x))))
> my.other.chunk(x, n)
$dick
[1] 1 2 3
$harry
[1] 4 5 6
$tom
[1]  7  8  9 10

Matifou · Accepted Answer · 2022-08-02 12:07:22Z

11

Yet another possibility is the splitIndices function from package parallel:

library(parallel)
splitIndices(20, 3)

Gives:

[[1]]
[1] 1 2 3 4 5 6 7

[[2]]
[1]  8  9 10 11 12 13

[[3]]
[1] 14 15 16 17 18 19 20

NB: this works only with numeric values though. If you want to split a character vector, you would need to do some indexing: lapply(splitIndices(20, 3), \(x) letters[1:20][x])
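
Since splitIndices() returns positions and not values, a general pattern is to compute the index groups from length(x) and then subset. A minimal sketch, with x being a made-up character vector:

```r
library(parallel)  # ships with R, no installation needed

x <- letters[1:20]
idx <- splitIndices(length(x), 3)        # list of index vectors
chunks <- lapply(idx, function(i) x[i])  # apply the indices to x

lengths(chunks)  # chunk sizes follow the index groups shown above
```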

edited Aug 2, 2022 at 12:07

answered Sep 10, 2018 at 21:31

Matifou


Only works with numeric values
– Julien
Commented Jul 31, 2022 at 15:48


SiggyF · Accepted Answer · 2010-07-23 14:22:55Z

10

You could combine the split/cut, as suggested by mdsumner, with quantile to create even groups:

split(x,cut(x,quantile(x,(0:n)/n), include.lowest=TRUE, labels=FALSE))

This gives the same result for your example, but not for skewed variables.
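
To illustrate the difference on a skewed variable, here is a sketch with made-up data containing one large outlier. Equal-width bins from cut(x, n) put almost everything into the first bin, while quantile-based breaks aim for groups of similar size.

```r
x <- c(1:9, 1000)  # made-up skewed data
n <- 3

# Equal-width bins: the outlier stretches the range
by_width <- split(x, cut(x, n, labels = FALSE))
lengths(by_width)     # 9 1 (the middle bin is empty and therefore dropped)

# Quantile-based breaks: group sizes are much more even
by_quantile <- split(x, cut(x, quantile(x, (0:n)/n), include.lowest = TRUE, labels = FALSE))
lengths(by_quantile)
```

One caveat: the quantile breaks must be unique, so data with many tied values can make cut() stop with an error.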

answered Jul 23, 2010 at 14:22

SiggyF


frankc · Accepted Answer · 2010-07-23 18:10:28Z

7

split(x,matrix(1:n,n,length(x))[1:length(x)])

Perhaps this is clearer, but it is the same idea:
split(x,rep(1:n, ceiling(length(x)/n),length.out = length(x)))

If you want it ordered, throw a sort around it.

edited Jul 23, 2010 at 18:10

answered Jul 23, 2010 at 16:30

frankc


Ferdinand.kraft · Accepted Answer · 2013-09-14 23:08:43Z

7

Here's another variant.

NOTE: with this sample you're specifying the CHUNK SIZE in the second parameter

all chunks are uniform, except for the last;
the last will at worst be smaller, never bigger than the chunk size.

chunk <- function(x,n)
{
    f <- sort(rep(1:(trunc(length(x)/n)+1),n))[1:length(x)]
    return(split(x,f))
}

#Test
n<-c(1,2,3,4,5,6,7,8,9,10,11)

c<-chunk(n,5)

q<-lapply(c, function(r) cat(r,sep=",",collapse="|") )
#output
1,2,3,4,5,|6,7,8,9,10,|11,|

edited Sep 14, 2013 at 23:08

Ferdinand.kraft


answered Sep 14, 2013 at 16:41

eAndy


Zak D · Accepted Answer · 2013-06-23 07:41:00Z

I needed the same function and read the previous solutions. However, I also needed the unbalanced chunk to be at the end, i.e. if I have 10 elements to split into vectors of 3 each, then my result should have vectors with 3, 3 and 4 elements respectively. So I used the following (I left the code unoptimised for readability; otherwise there is no need for so many variables):

chunk <- function(x,n){
  numOfVectors <- floor(length(x)/n)
  elementsPerVector <- c(rep(n,numOfVectors-1),n+length(x) %% n)
  elemDistPerVector <- rep(1:numOfVectors,elementsPerVector)
  split(x,factor(elemDistPerVector))
}
set.seed(1)
x <- rnorm(10)
n <- 3
chunk(x,n)
$`1`
[1] -0.6264538  0.1836433 -0.8356286

$`2`
[1]  1.5952808  0.3295078 -0.8204684

$`3`
[1]  0.4874291  0.7383247  0.5757814 -0.3053884

Philip Michaelsen · Accepted Answer · 2018-02-08 14:30:34Z

5

Simple function for splitting a vector by simply using indexes - no need to over complicate this

vsplit <- function(v, n) {
    l = length(v)
    r = l/n
    return(lapply(1:n, function(i) {
        s = max(1, round(r*(i-1))+1)
        e = min(l, round(r*i))
        return(v[s:e])
    }))
}
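
A quick usage sketch of vsplit() on the example from the question. The function is repeated so the snippet runs on its own; it appears to assume n is no larger than length(v), since otherwise the start index can exceed the end index.

```r
vsplit <- function(v, n) {
    l = length(v)
    r = l/n                                # ideal (fractional) chunk length
    return(lapply(1:n, function(i) {
        s = max(1, round(r*(i-1))+1)       # start index of chunk i
        e = min(l, round(r*i))             # end index of chunk i
        return(v[s:e])
    }))
}

lengths(vsplit(1:10, 3))  # 3 4 3
```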

answered Feb 8, 2018 at 14:30

Philip Michaelsen


Laura Paladini · Accepted Answer · 2018-08-21 13:29:08Z

3

Sorry if this answer comes so late, but maybe it can be useful for someone else. Actually there is a very useful solution to this problem, explained at the end of ?split.

> testVector <- c(1:10) #I want to divide it into 5 parts
> VectorList <- split(testVector, 1:5)
> VectorList
$`1`
[1] 1 6

$`2`
[1] 2 7

$`3`
[1] 3 8

$`4`
[1] 4 9

$`5`
[1]  5 10

answered Aug 21, 2018 at 13:29

Laura Paladini


3

this will break if there are unequal number of values in each group!
– Matifou
Commented Sep 10, 2018 at 21:31


Community · Accepted Answer · 2017-05-23 12:02:51Z

2

Credit to @Sebastian for this function

chunk <- function(x,y){
         split(x, factor(sort(rank(row.names(x))%%y)))
         }

edited May 23, 2017 at 12:02

CommunityBot


answered Dec 5, 2014 at 15:24

user1587280


verbamour · Accepted Answer · 2014-12-23 17:42:01Z

2

If you don't like split() and you don't mind NAs padding out your short tail:

chunk <- function(x, n) { if((length(x)%%n)==0) {return(matrix(x, nrow=n))} else {return(matrix(append(x, rep(NA, n-(length(x)%%n))), nrow=n))} }

The columns of the returned matrix ([,1:ncol]) are the droids you are looking for.

answered Dec 23, 2014 at 17:42

verbamour


rferrisx · Accepted Answer · 2017-03-26 21:24:53Z

I need a function that takes the argument of a data.table (in quotes) and another argument that is the upper limit on the number of rows in the subsets of that original data.table. This function produces whatever number of data.tables that upper limit allows for:

library(data.table)
split_dt <- function(x,y)
    {
    dt <- get(x)   # look up the data.table by its quoted name
    for(i in seq(from=1,to=nrow(dt),by=y))
        {
        # rows i..(i + y - 1), capped at the last row so chunks neither overlap nor run past the end
        assign(paste0("df_",i),dt[i:min(i + y - 1, nrow(dt))],envir=globalenv())
        }
    }

This function gives me a series of data.tables named df_[number] with the starting row from the original data.table in the name. The last data.table can be shorter than the others. This type of function is useful because certain GIS software have limits on how many address pins you can import, for example. So slicing up data.tables into smaller chunks may not be recommended, but it may not be avoidable.

M-- · Accepted Answer · 2020-09-29 16:14:47Z

1

I have come up with this solution:

require(magrittr)
create.chunks <- function(x, elements.per.chunk){
    # plain R version
    # split(x, rep(seq_along(x), each = elements.per.chunk)[seq_along(x)])
    # magrittr version - because that's what people use now
    x %>% seq_along %>% rep(., each = elements.per.chunk) %>% extract(seq_along(x)) %>% split(x, .) 
}
create.chunks(letters[1:10], 3)
$`1`
[1] "a" "b" "c"

$`2`
[1] "d" "e" "f"

$`3`
[1] "g" "h" "i"

$`4`
[1] "j"

The key is to use the each argument of rep() to make it work. Using seq_along acts like rank(x) in my previous solution, but is actually able to produce the correct result with duplicated entries.

edited Sep 29, 2020 at 16:14

M--


answered Sep 19, 2018 at 11:08

Sebastian


1

For those concerned that rep(seq_along(x), each = elements.per.chunk) might be too much of a strain on memory: yes, it is. You could try a modified version of my previous suggestion: chunk <- function(x,n) split(x, factor(seq_along(x)%%n))
– Sebastian
Commented Sep 19, 2018 at 11:13
For me, it produces the following error: no applicable method for 'extract_' applied to an object of class "c('integer', 'numeric')
– sharchaea
Commented Nov 6, 2020 at 9:02


Vitor Hugo Moreau · Accepted Answer · 2024-04-26 09:47:29Z

As there were many good answers, I benchmarked the solutions that worked properly:

The code is below:

require(microbenchmark)
require(ggplot2)
require(parallel)

# The helper functions must exist before microbenchmark() is called.
# chunk.2() (labelled "Tony Breyal" below) is not defined here;
# take it from the answer above that defines it.
split_to_chunks <- function(x, n, keep.order=TRUE){
  if(keep.order){
    return(split(x, sort(rep(1:n, length.out = length(x)))))
  }else{
    return(split(x, rep(1:n, length.out = length(x))))
  }
}

vsplit <- function(v, n) {
  l = length(v)
  r = l/n
  return(lapply(1:n, function(i) {
    s = max(1, round(r*(i-1))+1)
    e = min(l, round(r*i))
    return(v[s:e])
  }))
}

x <- rpois(75,5)
result <- microbenchmark(times = 1000,
                         "Harlan" = split(x, ceiling(seq_along(x)/(length(x)/5))),
                         "mathheadinclouds" = split(x, cut(seq_along(x), 5, labels = FALSE)),
                         "Richard DiSalvo on zhan2383" = split(x, sort(1:length(x) %% 5)),
                         "FXQuantTrader" = split(x, sort(rep_len(1:5, length(x)))),
                         "verbamour" = {mapply(function(a, b) (x[a:b]), seq.int(from=1, to=length(x), by=floor(length(x)/5)),
                                               pmin(seq.int(from=1, to=length(x), by=floor(length(x)/5))+floor((length(x)/5-1)), length(x)),
                                               SIMPLIFY=FALSE)},
                         "Tony Breyal" = chunk.2(x, 5),
                         "Richard Herron"  = split(x, sort(rep(letters[1:5], each=5, len=length(x)))),
                         "Philip Michaelsen" = vsplit(x, 5),
                         "Iyar Lin" =  split_to_chunks(x, 5)
)

Your code does not contain the chunk.2 function. Moreover, the functions should be defined before the call to microbenchmark. – Julien, Commented Jul 9 at 11:27

Iyar Lin · Accepted Answer · 2021-10-05 10:38:08Z

Here's yet another one, allowing you to control if you want the result ordered or not:

split_to_chunks <- function(x, n, keep.order=TRUE){
  if(keep.order){
    return(split(x, sort(rep(1:n, length.out = length(x)))))
  }else{
    return(split(x, rep(1:n, length.out = length(x))))
  }
}

split_to_chunks(x = 1:11, n = 3)
$`1`
[1] 1 2 3 4

$`2`
[1] 5 6 7 8

$`3`
[1]  9 10 11

split_to_chunks(x = 1:11, n = 3, keep.order=FALSE)

$`1`
[1]  1  4  7 10

$`2`
[1]  2  5  8 11

$`3`
[1] 3 6 9

cmilando · Accepted Answer · 2022-10-25 23:08:24Z

0

Not sure if this answers OP's question, but I think the %% can be useful here

df # some data.frame
N_CHUNKS <- 10
I_VEC <- 1:nrow(df)
df_split <- split(df, sort(I_VEC %% N_CHUNKS))
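
A self-contained version of the same idea, as a sketch: the data.frame below is made up, and seq_len(nrow(df)) replaces 1:nrow(df) so that an empty data.frame does not produce the sequence 1 0.

```r
df <- data.frame(id = 1:25, value = letters[1:25])  # made-up data
N_CHUNKS <- 10

# Group rows by position; sort() keeps neighbouring rows together
df_split <- split(df, sort(seq_len(nrow(df)) %% N_CHUNKS))

length(df_split)             # 10
sum(sapply(df_split, nrow))  # 25, no rows lost
```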

answered Oct 25, 2022 at 23:08

cmilando


Valentas · Accepted Answer · 2019-07-31 08:27:04Z

-1

This splits into chunks of size ⌊n/k⌋+1 or ⌊n/k⌋ and does not use the O(n log n) sort.

get_chunk_id<-function(n, k){
    r <- n %% k
    s <- n %/% k
    i<-seq_len(n)
    1 + ifelse (i <= r * (s+1), (i-1) %/% (s+1), r + ((i - r * (s+1)-1) %/% s))
}

split(1:10, get_chunk_id(10,3))
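
For reference, a sketch of what the call above produces. The function is repeated so the snippet runs on its own; here n is the vector length and k the number of chunks.

```r
get_chunk_id <- function(n, k){
    r <- n %% k    # number of chunks that get one extra element
    s <- n %/% k   # base chunk size
    i <- seq_len(n)
    1 + ifelse(i <= r * (s+1), (i-1) %/% (s+1), r + ((i - r * (s+1)-1) %/% s))
}

ids <- get_chunk_id(10, 3)
ids                        # 1 1 1 1 2 2 2 3 3 3
lengths(split(1:10, ids))  # 4 3 3
```

The larger chunks come first, and the ids are computed arithmetically, so no sorting is involved.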

answered Jul 31, 2019 at 8:27

Valentas
