
I have a data frame of about 83,000 rows and 13 columns that holds crime data from 2000 to 2012. Each row is one crime, and its zip code is recorded, so the same zip code (say xxxxx) can appear in several years, for example 2001, 2003, and 2007.

Here is an example of my data:

 Year Quarter   Zip MissingZip BusCode LossCode NumTheftsPQ  DUL 
 2000       1 99502          1       3        5           2 9479           
 2009       2 99502          2       3        4           3 3220
 2000       1 11111          1       3        5           2 3479           
 2004       2 11111          2       3        4           3 1020

Right now I can assign a global variable for each of my zip codes. (I am using RStudio, and the resulting list of variables is very long and has significantly slowed the program.) Here is how I assign them:

   for (n in all.data$Zip) {
     x <- subset(all.data, n == all.data$Zip) #subsets the data
     u <- x[1,3] #gets the zip code value
     assign(paste0("Zip", u), x, envir = .GlobalEnv)  #assigns it to a global environment
     #need something here, MasterList <<- ?
   }

I would like to contain all of these variables in a list. For example, if all my zip code variables were stored in list "MasterList":

   MasterList["Zip11111"]

would yield the data frame:

 Year Quarter   Zip MissingZip BusCode LossCode NumTheftsPQ  DUL 
 2000       1 11111          1       3        5           2 3479           
 2004       2 11111          2       3        4           3 1020

Is this possible? Is there an alternative, faster, or better way to do this? I was hoping that storing these variables in a list would be more efficient.

Bonus points: I know that my for loop reassigns variables that already exist to exactly the same value, which wastes processing time. Is there a quick line I could add to speed this up?

Thanks in advance for your help!

3
  • Are you just trying to subset your dataframe based on zip?
    – alexwhan
    Commented Aug 11, 2013 at 3:04
  • I am just trying to subset the dataframe based on zip.
    – James
    Commented Aug 11, 2013 at 3:06
  • Can't you just do all.data[all.data$Zip == 11111,]? I don't understand what you're trying to achieve with the for loop...
    – alexwhan
    Commented Aug 11, 2013 at 3:15

3 Answers

2

With only base R:

> dat <- read.table(text = "Year Quarter   Zip MissingZip BusCode LossCode NumTheftsPQ  DUL
+  2000       1 99502          1       3        5           2 9479
+  2009       2 99502          2       3        4           3 3220
+  2000       1 11111          1       3        5           2 3479
+  2004       2 11111          2       3        4           3 1020",header = TRUE,sep = "")

> dats <- split(dat,dat$Zip)
> dats
$`11111`
  Year Quarter   Zip MissingZip BusCode LossCode NumTheftsPQ  DUL
3 2000       1 11111          1       3        5           2 3479
4 2004       2 11111          2       3        4           3 1020

$`99502`
  Year Quarter   Zip MissingZip BusCode LossCode NumTheftsPQ  DUL
1 2000       1 99502          1       3        5           2 9479
2 2009       2 99502          2       3        4           3 3220

> names(dats) <- paste0('Zip',names(dats))
> dats
$Zip11111
  Year Quarter   Zip MissingZip BusCode LossCode NumTheftsPQ  DUL
3 2000       1 11111          1       3        5           2 3479
4 2004       2 11111          2       3        4           3 1020

$Zip99502
  Year Quarter   Zip MissingZip BusCode LossCode NumTheftsPQ  DUL
1 2000       1 99502          1       3        5           2 9479
2 2009       2 99502          2       3        4           3 3220
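
One detail worth checking when reading from the split list is the difference between `[[` and `[`. The sketch below rebuilds the question's four rows with only a few of the columns (an assumption made for brevity) and shows that `[[` returns the data frame itself, while `[` returns a one-element list wrapping it. The names `one_df` and `one_list` are illustrative only.

```r
# Rebuild the four example rows (a subset of the columns, for brevity).
dat <- data.frame(
  Year    = c(2000, 2009, 2000, 2004),
  Quarter = c(1, 2, 1, 2),
  Zip     = c(99502, 99502, 11111, 11111),
  DUL     = c(9479, 3220, 3479, 1020)
)

# One data frame per zip code, then prefix the names with "Zip".
dats <- split(dat, dat$Zip)
names(dats) <- paste0("Zip", names(dats))

one_df   <- dats[["Zip11111"]]  # the data frame for that zip code
one_list <- dats["Zip11111"]    # a list of length 1 that contains it

class(one_df)
class(one_list)
nrow(one_df)
```

If the goal is to work with the rows for one zip code directly, `[[` is probably the form you want.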
2

You could change for (n in all.data$Zip) to for (n in unique(all.data$Zip)), so that each zip code is processed once instead of once per row. You could also create an empty list before the loop with MasterList <- list() and then add each subset to it inside the loop:

MasterList[[paste0("Zip", n)]] <- x

I used n for the zip code because the loop assigns n each value of the vector you iterate over (all.data$Zip in your case, unique(all.data$Zip) in mine), so there is no need to look it up again with x[1,3].
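
Put together, the loop might look like the sketch below. The small `all.data` frame is a stand-in built from the question's example rows (only three columns, which is an assumption for brevity); the real data frame would be used in its place.

```r
# Stand-in for the real data: the question's four example rows.
all.data <- data.frame(
  Year = c(2000L, 2009L, 2000L, 2004L),
  Zip  = c(99502L, 99502L, 11111L, 11111L),
  DUL  = c(9479L, 3220L, 3479L, 1020L)
)

# Create the list once, before the loop.
MasterList <- list()

# Visit each distinct zip code exactly once.
for (n in unique(all.data$Zip)) {
  # Keep the rows for this zip code and store them under "Zip<code>".
  MasterList[[paste0("Zip", n)]] <- all.data[all.data$Zip == n, ]
}

names(MasterList)
MasterList[["Zip11111"]]
```

Nothing is written to the global environment here; every subset lives inside MasterList.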

2

Probably the easiest way to make your list is to use the dlply function from the plyr package, like so:

> set.seed(2)
> dat <- data.frame(zip=as.factor(sample(11111:22222,1000,replace=T)),var1=rnorm(1000),var2=rnorm(1000))
> head(dat)
    zip       var1        var2
1 13165 -0.4597894 -0.84724423
2 18915  0.6179261  0.07042928
3 17481 -0.7204224  1.58119491
4 12978 -0.5835119  0.02059799
5 21598  0.2163245 -0.12337051
6 21594  1.2449912 -1.25737890
> library(plyr)
> MasterList <- dlply(dat,.(zip))
> MasterList[["13165"]]
    zip       var1       var2
1 13165 -0.4597894 -0.8472442

However, it sounds like speed is your motivation here. If so, you would probably be much better off converting your data frame to a data.table instead of storing the data in a separate list object:

> library(data.table)
> dat.dt <- data.table(dat)
> dat.dt[zip==13165]
     zip       var1       var2
1: 13165 -0.4597894 -0.8472442
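
If the same table is queried by zip code many times, setting a key is one option to consider. The following is a minimal sketch, assuming the data.table package is installed and that zip is stored as an integer rather than a factor; the small table and the names `one_zip` and `counts` are made up for illustration.

```r
library(data.table)

# Illustrative data: the question's rows with an integer zip column.
dat.dt <- data.table(
  zip  = c(99502L, 99502L, 11111L, 11111L),
  Year = c(2000L, 2009L, 2000L, 2004L),
  DUL  = c(9479L, 3220L, 3479L, 1020L)
)

# Sort the table by zip and mark zip as its key.
setkey(dat.dt, zip)

# Keyed lookup: all rows whose zip equals 11111.
one_zip <- dat.dt[.(11111L)]

# Per-zip row counts without creating any separate objects.
counts <- dat.dt[, .N, by = zip]

one_zip
counts
```

Whether this is noticeably faster than `split` at around 83,000 rows would need to be measured on the real data.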
2
  • I'm not sure whether this is faster, but I'm sure it works. I ran it after @joran's example above, so I think it only "seems" slower because my computer is a bit slow after all the splitting. Thanks for the help, I appreciate it!
    – James
    Commented Aug 11, 2013 at 5:12
  • No problem. If you want things to run faster, I'd really suggest using data.table; you can speed it up even more, for example by setting zip as a key. You can also run the dlply function in parallel by setting .parallel = T.
    – David
    Commented Aug 11, 2013 at 14:24
