Note: After lots of experimenting with the code, I have completely rewritten this question.
I'm trying to use user-input values in a 1-row data object to predict the user's category with randomForest; however, I get an error indicating NA/Inf values in my data object.
I have a randomForest classifier, which I've trained on a training dataset and validated on a validation dataset. This was done in my file analysis.R on GitHub, and the object is saved as rf.rds, which is read in by server.R.
In server.R I read in the training data which is called x (i.e. x.rds) and then extract only the first row into userdf.
In ui.R I let users enter values which reactively update this object:
values <- reactiveValues()
values$df <- userdf
newEntry <- observe({
values$df$bron_badges <- input$bron_badges
values$df$silv_badges <- input$silv_badges
values$df$gold_badges <- input$gold_badges
values$df$reputation <- input$reputation
values$df$views <- input$views
values$df$votes <- input$votes
})
This appears to work. I say so because I can run:
output$table <- renderTable({data.frame(values$df)})
and watch the values update beautifully in my UI.
However, when I try to run the following code to generate a prediction for the user, I get an error message saying that there are NAs:
output$results <- renderText({
{ ds1 <- values$df
x <- x[,sort(names(x))]
ds1 <- ds1[,sort(names(ds1))]
names(ds1) <- colnames(x)
predict(rf, newdata = data.frame(ds1))
}
})
This happens even though I "know" the data is not NA: I have watched values$df update in the UI through the renderTable line mentioned above, and all of the initial values, which come from x, are not NA. I've also tried it without the data.frame() wrapper in the predict statement.
Interestingly, if I replace the predict statement above with table(is.na(ds1)) it tells me that all 1,033 values are NA.
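A per-column check is more informative here than one overall table, since it shows which columns hold the NA or non-finite values. Below is a minimal sketch in base R; the one-row data frame and its values are made up for illustration and stand in for the reactive ds1:

```r
# Hypothetical stand-in for the one-row data frame (values are invented)
ds1 <- data.frame(bron_badges = 3, silv_badges = NA, gold_badges = 1,
                  reputation = 250, views = NA, votes = 12)

# Number of NA values in each column
na_per_col <- colSums(is.na(ds1))

# Names of the columns that contain at least one NA
na_cols <- names(na_per_col)[na_per_col > 0]
print(na_cols)

# The error mentions NA/Inf, so also flag columns with any non-finite value
all_finite <- sapply(ds1, function(col) all(is.finite(col)))
print(names(all_finite)[!all_finite])
```

The same two checks could be dropped into the renderText block in place of the predict statement.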
Also interesting, if I replace ds1 with userdf in the predict statement, then everything runs fine (userdf is the non-reactive object).
If I replace the predict statement with setdiff(colnames(x), colnames(ds1)), it does not show any mismatched column names (it did until I added the colnames statements above, due to some weird conversion of _ to . in the reactive data frame's column names).
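Since setdiff() only reports names missing from its second argument, and says nothing about column types, a fuller comparison may help. This is a sketch under assumptions: train and newrow are hypothetical stand-ins for x and ds1, and the character-valued reputation column is invented to show what a type mismatch would look like:

```r
# Hypothetical stand-ins for the training predictors and the one-row input
train  <- data.frame(reputation = c(10, 20), views = c(5L, 7L), votes = c(1, 2))
newrow <- data.frame(reputation = "15", views = 6L, votes = 3,
                     stringsAsFactors = FALSE)

# Names present on one side only (setdiff is not symmetric, so check both ways)
missing_in_new <- setdiff(names(train), names(newrow))
extra_in_new   <- setdiff(names(newrow), names(train))

# First class of every shared column, side by side
shared <- intersect(names(train), names(newrow))
class_cmp <- data.frame(
  column = shared,
  train  = sapply(train[shared],  function(col) class(col)[1]),
  new    = sapply(newrow[shared], function(col) class(col)[1]),
  stringsAsFactors = FALSE
)

# Columns whose class differs between the two data frames
mismatched <- class_cmp$column[class_cmp$train != class_cmp$new]
print(mismatched)
```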
Finally, I find that if I access the names from rf via rf$forest$ncat I get "incorrect number of dimensions" as my error:
output$results <- renderTable({
{ ds1 <- values$df
cn <- rf$forest$ncat
cn <- cn[,sort(names(cn))]
ds1 <- ds1[,sort(names(ds1))]
names(ds1) <- names(cn)#x #rf$forest$xlevels
predict(rf, newdata = data.frame(ds1))
}
})
However, with the following modification:
output$results <- renderTable({
{ ds1 <- values$df
cn <- as.data.frame(t(rf$forest$ncat))
cn <- cn[,sort(names(cn))]
ds1 <- ds1[,sort(names(ds1))]
names(ds1) <- names(cn)#x #rf$forest$xlevels
predict(rf, newdata = data.frame(ds1))
}
})
My error goes back to "variables in the training data missing in newdata".
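The difference between the two versions can be reproduced without the model. The sketch below assumes rf$forest$ncat is a named vector with one entry per predictor; the ncat object and its names here are invented for illustration:

```r
# Hypothetical named vector, shaped like what rf$forest$ncat is assumed to be
ncat <- c(votes = 1, reputation = 1, views = 1)

# Two-dimensional indexing on a plain vector raises an error
err <- tryCatch(ncat[, sort(names(ncat))],
                error = function(e) conditionMessage(e))
print(err)

# One-dimensional indexing by sorted name works on the vector directly
ncat_sorted <- ncat[sort(names(ncat))]
print(names(ncat_sorted))

# Transposing gives a 1-row matrix, which as.data.frame() turns into columns,
# so the [, ...] indexing from the modified version then succeeds
cn <- as.data.frame(t(ncat))
cn <- cn[, sort(names(cn))]
print(names(cn))
```

If that assumption holds, sort(names(rf$forest$ncat)) would give the sorted predictor names without the data frame conversion.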
Minimal, reproducible example: https://github.com/hack-r/troubleshooting_predictor_minimal
Here's the full reproducible code and data: https://github.com/hack-r/coursera_shiny