Best Way to Subset a Large Dataframe into a List of Variables?

Question

I currently have a data frame of ~83000 rows (13 columns) with crime data from the years 2000-2012. Each row is a crime with its zip code reported, so a given zip code xxxxx can appear in several years (for example 2001, 2003, and 2007).

Here is an example of my data:

 Year Quarter   Zip MissingZip BusCode LossCode NumTheftsPQ  DUL 
 2000       1 99502          1       3        5           2 9479           
 2009       2 99502          2       3        4           3 3220
 2000       1 11111          1       3        5           2 3479           
 2004       2 11111          2       3        4           3 1020

Right now, I am able to assign a global variable for each of my zip codes. (I am using RStudio, and the resulting list of data objects is very long and has significantly slowed the program.) Here is how I have assigned the global variables:

   for (n in all.data$Zip) {
     x <- subset(all.data, n == all.data$Zip) #subsets the data
     u <- x[1,3] #gets the zip code value
     assign(paste0("Zip", u), x, envir = .GlobalEnv)  #assigns it to the global environment
     #need something here, MasterList <<- ?
   }

I would like to contain all of these variables in a list. For example, if all my zip code variables were stored in list "MasterList":

   MasterList["Zip11111"]

would yield the data frame:

 Year Quarter   Zip MissingZip BusCode LossCode NumTheftsPQ  DUL 
 2000       1 11111          1       3        5           2 3479           
 2004       2 11111          2       3        4           3 1020

Is this possible? What would be an alternative/faster/better way to do this? I was hoping that storing these variables in a list would be more efficient.

Bonus points: I know in my for loop I am reassigning variables that already exist to the exact same thing, wasting processing time. Any quick line I could add to speed this up?

Thanks in advance for your help!

Can't you just do all.data[all.data$Zip == 11111,]? I don't understand what you're trying to achieve with the for loop... — alexwhan, Commented Aug 11, 2013 at 3:15

joran · Accepted Answer · 2013-08-11 03:25:27Z

With only base R:

> dat <- read.table(text = "Year Quarter   Zip MissingZip BusCode LossCode NumTheftsPQ  DUL
+  2000       1 99502          1       3        5           2 9479
+  2009       2 99502          2       3        4           3 3220
+  2000       1 11111          1       3        5           2 3479
+  2004       2 11111          2       3        4           3 1020",header = TRUE,sep = "")

> dats <- split(dat,dat$Zip)
> dats
$`11111`
  Year Quarter   Zip MissingZip BusCode LossCode NumTheftsPQ  DUL
3 2000       1 11111          1       3        5           2 3479
4 2004       2 11111          2       3        4           3 1020

$`99502`
  Year Quarter   Zip MissingZip BusCode LossCode NumTheftsPQ  DUL
1 2000       1 99502          1       3        5           2 9479
2 2009       2 99502          2       3        4           3 3220

> names(dats) <- paste0('Zip',names(dats))
> dats
$Zip11111
  Year Quarter   Zip MissingZip BusCode LossCode NumTheftsPQ  DUL
3 2000       1 11111          1       3        5           2 3479
4 2004       2 11111          2       3        4           3 1020

$Zip99502
  Year Quarter   Zip MissingZip BusCode LossCode NumTheftsPQ  DUL
1 2000       1 99502          1       3        5           2 9479
2 2009       2 99502          2       3        4           3 3220
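
One detail worth spelling out when indexing the resulting list, as a small sketch that reuses the four sample rows from the question: single brackets return a one-element list, while double brackets return the data frame itself. The names `sub.list` and `one.zip` are only illustrative.

```r
# Rebuild the four sample rows from the question.
dat <- read.table(text = "Year Quarter Zip MissingZip BusCode LossCode NumTheftsPQ DUL
2000 1 99502 1 3 5 2 9479
2009 2 99502 2 3 4 3 3220
2000 1 11111 1 3 5 2 3479
2004 2 11111 2 3 4 3 1020", header = TRUE)

# Split by zip code and prefix the list names with "Zip".
dats <- split(dat, dat$Zip)
names(dats) <- paste0("Zip", names(dats))

# Single brackets keep the list wrapper: a list of length one.
sub.list <- dats["Zip11111"]

# Double brackets extract the element: the data frame for that zip code.
one.zip <- dats[["Zip11111"]]
```

So `MasterList["Zip11111"]` as written in the question would give a list containing the data frame; `MasterList[["Zip11111"]]` or `MasterList$Zip11111` gives the data frame directly.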

James Pringle · Accepted Answer · 2013-08-11 03:12:16Z

You could change for (n in all.data$Zip) to for (n in unique(all.data$Zip)), which would cut down on redundancy. You could also create a list before the loop with MasterList <- list() and then add to it inside the loop with:

MasterList[[paste0("Zip", n)]] <- x

Note that I used n as the zip code because the loop assigns n each value of the vector you iterate over (all.data$Zip in your case, unique(all.data$Zip) in mine).
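
Put together, the loop might look like the sketch below. It uses the four sample rows from the question as a stand-in for the real `all.data`, and it subsets with logical indexing rather than `subset()`, which is a substitution on my part and not something the question requires.

```r
# Stand-in for the real data: the four sample rows from the question.
all.data <- read.table(text = "Year Quarter Zip MissingZip BusCode LossCode NumTheftsPQ DUL
2000 1 99502 1 3 5 2 9479
2009 2 99502 2 3 4 3 3220
2000 1 11111 1 3 5 2 3479
2004 2 11111 2 3 4 3 1020", header = TRUE)

# Create the list once, before the loop.
MasterList <- list()

# Visit each distinct zip code only once.
for (n in unique(all.data$Zip)) {
  x <- all.data[all.data$Zip == n, ]      # rows for this zip code
  MasterList[[paste0("Zip", n)]] <- x     # store them under a name like "Zip11111"
}
```

Afterwards `MasterList[["Zip11111"]]` holds the rows for that zip code, and nothing is written to the global environment besides `MasterList` itself.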

David · Accepted Answer · 2013-08-11 03:22:19Z

Probably the easiest way to make your list is with the dlply function from the plyr package, like so:

> set.seed(2)
> dat <- data.frame(zip=as.factor(sample(11111:22222,1000,replace=T)),var1=rnorm(1000),var2=rnorm(1000))
> head(dat)
    zip       var1        var2
1 13165 -0.4597894 -0.84724423
2 18915  0.6179261  0.07042928
3 17481 -0.7204224  1.58119491
4 12978 -0.5835119  0.02059799
5 21598  0.2163245 -0.12337051
6 21594  1.2449912 -1.25737890
> library(plyr)
> MasterList <- dlply(dat,.(zip))
> MasterList[["13165"]]
    zip       var1       var2
1 13165 -0.4597894 -0.8472442

However, it sounds like speed is your motivation here. If so, you would probably be much better off not storing the data in a separate list object at all, and instead converting your data frame to a data.table:

> library(data.table)
> dat.dt <- data.table(dat)
> dat.dt[zip==13165]
     zip       var1       var2
1: 13165 -0.4597894 -0.8472442

I'm not sure if this is faster or not, but I'm sure it works. I ran it after I ran @joran's example above, so I think it only "seems" to be running slower because my computer is a bit slow with all the splitting. Thanks for the help, I appreciate it! – James, Commented Aug 11, 2013 at 5:12

No problem. If you want to do things faster I'd really suggest using data.table; you can do even more to speed it up, like setting zip as a key. Also, you can run the dlply function in parallel by setting .parallel = T. – David, Commented Aug 11, 2013 at 14:24
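
The key suggestion in the last comment could look roughly like this. It is a minimal sketch, assuming the data.table package is installed and that zip is stored as an integer column rather than a factor; the names `target` and `one.zip` are illustrative only.

```r
library(data.table)

# Simulated data along the lines of the answer, with zip kept as an integer.
set.seed(2)
dat.dt <- data.table(
  zip  = sample(11111:22222, 1000, replace = TRUE),
  var1 = rnorm(1000),
  var2 = rnorm(1000)
)

# Sort the table by zip and mark zip as its key.
setkey(dat.dt, zip)

# Pick a zip code that is known to be present, then look it up through the key.
target  <- dat.dt$zip[1]
one.zip <- dat.dt[.(target)]
```

With a key set, the lookup uses a binary search on the sorted column instead of scanning every row, which is where the speed-up on a larger table would come from.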
