I am trying to predict sales for retail stores. Here are my variables (apart from ZipZone, the specific values are largely irrelevant to this question):
storeId  sales  meanTemperature  meanHumidity  ZipZone
      1   1350            56.78         61.12        0
      2   1230            59.90         45.67        3
      3   8476            63.54         49.87        3
      4   4357            62.12         65.09        4
      5   2314            69.78         68.99        4
      6   7812            74.90         59.78        4
      7   1350            56.78         61.12        6
      8   1230            59.90         45.67        6
      9   8476            63.54         49.87        6
     10   4357            62.12         65.09        7
     11   2314            69.78         68.99        7
     12   7812            74.90         59.78        8
...
There are 50 unique storeId values (i.e. there are fifty stores). I built a regression model of the form:
model <- lm(sales ~ meanTemperature * meanHumidity + ZipZone, data = inSample)
I'm currently testing this model's in-sample and out-of-sample predictive performance, so I've created inSample and outSample data frames (the former has 40 stores; the latter has 10). The issue, though, is that several ZipZone values contain only a single store. For example, inSample has store 1 (the only store in ZipZone 0), while outSample has store 12 (the only store in ZipZone 8). When I run the following:
pred <- predict(model, newdata = outSample)
I get the following error:
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
factor ZipZone has new levels 8
I assume this is because inSample doesn't have a store in ZipZone 8, while outSample does. How can I avoid this problem?
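For reference, the assumption above can be checked with a minimal sketch like the one below, which compares the factor levels stored in the fitted model against the levels present in the hold-out data. It uses only the twelve rows shown earlier, and it assumes that ZipZone is stored as a factor and that the model is fit on inSample; the split itself (holding out only store 12) is purely illustrative and not my actual 40/10 split.

```r
# The twelve rows shown in the question; the real data has 50 stores.
stores <- data.frame(
  storeId = 1:12,
  sales = c(1350, 1230, 8476, 4357, 2314, 7812,
            1350, 1230, 8476, 4357, 2314, 7812),
  meanTemperature = c(56.78, 59.90, 63.54, 62.12, 69.78, 74.90,
                      56.78, 59.90, 63.54, 62.12, 69.78, 74.90),
  meanHumidity = c(61.12, 45.67, 49.87, 65.09, 68.99, 59.78,
                   61.12, 45.67, 49.87, 65.09, 68.99, 59.78),
  ZipZone = factor(c(0, 3, 3, 4, 4, 4, 6, 6, 6, 7, 7, 8))  # assumed to be a factor
)

# Illustrative split: store 12 (the only store in ZipZone 8) is held out.
inSample  <- stores[stores$storeId != 12, ]
outSample <- stores[stores$storeId == 12, ]

model <- lm(sales ~ meanTemperature * meanHumidity + ZipZone, data = inSample)

# Levels the model saw during fitting (lm drops unused levels when fitting).
trainedLevels <- model$xlevels$ZipZone

# Levels that appear in the hold-out data but were never seen in training.
unseenLevels <- setdiff(as.character(unique(outSample$ZipZone)), trainedLevels)
unseenLevels
```

If unseenLevels is non-empty, predict() has no coefficient for those ZipZone values, which would be consistent with the error above.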