Problem
Working with a data frame in R, I want to change variables represented as characters into variables represented as numbers (i.e. from class chr to num).
For an entire data set, this is a straightforward problem (different flavors of solutions here, here, here, and here). However, I have one variable that needs to stay as characters.
Example Data
Using this example data (df), let's say I want to change only var1 from class chr to num, leaving "chrOK" as a chr variable. In my real data set, there are many variables to change, so manual approaches like df$var1 = as.numeric(df$var1) are too laborious.
df = data.frame(var1 = c("1","2","3","4"),
var2 = c(1,2,3,4),
chrOK = c("rick", "summer","beth", "morty"),
stringsAsFactors = FALSE)
str(df)
'data.frame': 4 obs. of 3 variables:
$ var1 : chr "1" "2" "3" "4"
$ var2 : num 1 2 3 4
$ chrOK: chr "rick" "summer" "beth" "morty"
Partial Solutions
I've tried several approaches that seem close, but none does exactly what I want.
Attempt 1 — introduces NAs
Most of my columns are characters that should be numeric, like "var1". So, using apply() to convert the class works for those columns. However, this approach induces NA values in "chrOK".
df = as.data.frame(apply(df, 2, function(x) as.numeric(x)))
Warning message:
In FUN(newX[, i], ...) : NAs introduced by coercion
str(df)
'data.frame': 4 obs. of 3 variables:
$ var1 : num 1 2 3 4
$ var2 : num 1 2 3 4
$ chrOK: num NA NA NA NA
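For context on why every column is affected: apply() converts a data frame to a matrix before looping, and a matrix holds a single type, so one character column turns the whole matrix into character. A minimal sketch of that coercion step, rebuilding the example data under the hypothetical name df_demo:

```r
df_demo <- data.frame(var1 = c("1", "2", "3", "4"),
                      var2 = c(1, 2, 3, 4),
                      chrOK = c("rick", "summer", "beth", "morty"),
                      stringsAsFactors = FALSE)

# apply() works on a matrix, so mimic that coercion explicitly
m <- as.matrix(df_demo)
m_type <- typeof(m)   # "character": the numeric column was converted too

# Names cannot be parsed as numbers, hence the NAs (warning suppressed here)
chr_as_num <- suppressWarnings(as.numeric(m[, "chrOK"]))
```

Because the conversion is applied to every column of that matrix, there is no way for apply() alone to skip "chrOK".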
Attempt 2 — split, convert, cbind
Using apply() on the subset of chr variables, excluding "chrOK", doesn't induce NAs, but requires using cbind() to re-include "chrOK".
This solution is not ideal because cbind() results are hard to check for data mutations. (Also, "chrOK" is returned as a factor. Using df = cbind(changed,as.character(unchanged)) doesn't work. [a])
changed = as.data.frame(apply(df[-(which(colnames(df)=="chrOK"))],2,function(x) as.numeric(x)))
unchanged = (df$chrOK)
df = cbind(changed,unchanged)
str(df)
'data.frame': 4 obs. of 3 variables:
$ var1 : num 1 2 3 4
$ var2 : num 1 2 3 4
$ unchanged: Factor w/ 4 levels "beth","morty",..: 3 4 1 2 #[a]
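A likely explanation for [a]: when one argument is a data frame, cbind() builds its result through data.frame(), which converts character vectors to factors whenever stringsAsFactors is TRUE (the default before R 4.0.0). Wrapping the column in as.character() does not help, because the conversion happens again inside cbind(). A minimal sketch of one workaround, assuming the extra argument is passed through to data.frame(); df_demo, changed, and out are hypothetical names:

```r
df_demo <- data.frame(var1 = c("1", "2", "3", "4"),
                      var2 = c(1, 2, 3, 4),
                      chrOK = c("rick", "summer", "beth", "morty"),
                      stringsAsFactors = FALSE)

# Convert everything except "chrOK", column by column
changed <- as.data.frame(lapply(df_demo[names(df_demo) != "chrOK"], as.numeric))

# stringsAsFactors = FALSE is forwarded to data.frame(), so chrOK stays chr
out <- cbind(changed, chrOK = df_demo$chrOK, stringsAsFactors = FALSE)
str(out)
```

Naming the argument (chrOK = ...) also keeps the original column name instead of "unchanged".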
Attempt 3 — correct subset, but error when converting
Using setdiff(), I get the subset of chr class variables excluding "chrOK".
df[setdiff(names(df[sapply(df,is.character)]),"chrOK")]
var1
1 1
2 2
3 3
4 4
But plugging this into an apply function, so that only the subset is changed from chr to num, returns an error (see [b]).
apply(as.numeric(df[setdiff(names(df[sapply(df,is.character)]),"chrOK")]),
2,function(x) as.numeric(x))
Error in apply(as.numeric(df[setdiff(names(df[sapply(df, is.character)]), :
(list) object cannot be coerced to type 'double' #[b]
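The error at [b] probably comes from the inner as.numeric() call rather than from apply(): a data frame is a list of columns, and as.numeric() cannot coerce a list of length-4 vectors in one step. A minimal sketch of how the same setdiff() selection could be converted in place with lapply() instead, so no cbind() is needed; df_demo and to_convert are hypothetical names:

```r
df_demo <- data.frame(var1 = c("1", "2", "3", "4"),
                      var2 = c(1, 2, 3, 4),
                      chrOK = c("rick", "summer", "beth", "morty"),
                      stringsAsFactors = FALSE)

# Character columns, minus the one(s) that should stay as they are
to_convert <- setdiff(names(df_demo)[sapply(df_demo, is.character)], "chrOK")

# lapply() converts one column at a time; assigning back into
# df_demo[to_convert] leaves all other columns untouched
df_demo[to_convert] <- lapply(df_demo[to_convert], as.numeric)
str(df_demo)
```

More columns can be protected by adding their names to the second argument of setdiff().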
Questions
- What is the best solution for converting a data frame's character variables to numeric, while excluding a specified subset?
- Which of my attempts is on the right path, or is there a better approach?
- [bonus] What mechanism causes the unexpected results at [a] and [b], above?