One way of addressing high cardinality in a column is frequency encoding. However, if you use a cross-validated analysis plan, then you need to re-encode the column within each resampling iteration, using only that iteration's training rows.
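To make the re-encoding concern concrete, here is a minimal base-R sketch of what I mean by encoding inside each fold. The data are made up, and the helper names fit_freq() and apply_freq() are my own illustrations, not functions from any package. The only real name carried over is the column "watchlist".

```r
# Toy data: the values are invented; only the column name comes from my problem.
set.seed(1)
df <- data.frame(
  watchlist = sample(c("a", "b", "c", "d"), 20, replace = TRUE),
  stringsAsFactors = FALSE
)

# Hypothetical helper: learn relative frequencies from training rows only.
fit_freq <- function(x) {
  tab <- prop.table(table(x))
  setNames(as.numeric(tab), names(tab))
}

# Hypothetical helper: map levels to the learned frequencies.
# Levels not seen in the training rows get 0 (an assumption, other choices exist).
apply_freq <- function(x, freq) {
  out <- unname(freq[x])
  out[is.na(out)] <- 0
  out
}

# Assign each row to one of 5 folds, then encode each held-out fold
# with frequencies estimated from the other 4 folds.
folds <- sample(rep(1:5, length.out = nrow(df)))
encoded <- numeric(nrow(df))
for (k in 1:5) {
  test_idx <- folds == k
  freq <- fit_freq(df$watchlist[!test_idx])
  encoded[test_idx] <- apply_freq(df$watchlist[test_idx], freq)
}
```

The point of the sketch is that the frequencies are never computed from the held-out rows, which is the behaviour I want the resampling procedure to reproduce.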
It's been suggested that step_lencode_mixed() from the R package "embed" (version 1.1.2) could be used. The examples that I have seen look like this:
classify.knn <- train(target ~ ., data = data.trn, method = "knn",
                      trControl = ctrl,
                      preProcess = c("center", "scale"),
                      tuneGrid = data.frame(k = seq(5, 100, by = 15))) %>%
  step_lencode_mixed(watchlist, outcome = vars(target))
My concern is that step_lencode_mixed() is invoked after the train() call, so the encoding would not be recomputed inside each resample. ("watchlist" is the column I wish to frequency encode.)
Is this correct?