Set handleInvalid to control how unseen labels are handled:
StringIndexerModel.from_labels(labels, inputCol=categoricalCol, outputCol=categoricalCol + 'Index', handleInvalid="keep")
Options
- 'error': throws an exception (which is the default)
- 'skip': skips the rows containing the unseen labels entirely (removes the rows on the output!)
- 'keep': puts unseen labels in a special additional bucket, at index numLabels
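To make the three modes concrete, here is a minimal plain-Python sketch of the semantics listed above. It is not Spark code and not how Spark implements it: index_labels is a hypothetical helper, and the sample labels and values are made up for illustration.

```python
def index_labels(values, labels, handle_invalid="error"):
    """Hypothetical helper that mimics the three handleInvalid modes.

    Returns a list of (value, index) pairs, where index is the position of
    the value in labels.
    """
    if handle_invalid not in ("error", "skip", "keep"):
        raise ValueError(f"unknown handle_invalid: {handle_invalid!r}")

    lookup = {label: float(i) for i, label in enumerate(labels)}
    unseen_index = float(len(labels))  # the extra bucket, at index numLabels

    rows = []
    for value in values:
        if value in lookup:
            rows.append((value, lookup[value]))
        elif handle_invalid == "keep":
            # Unseen label goes into the additional bucket.
            rows.append((value, unseen_index))
        elif handle_invalid == "skip":
            # The row is dropped from the output.
            continue
        else:
            # 'error' (the default): fail on the first unseen label.
            raise ValueError(f"unseen label: {value!r}")
    return rows


labels = ["a", "b", "c"]   # labels known to the model
values = ["a", "d", "c"]   # "d" was never seen

kept = index_labels(values, labels, handle_invalid="keep")
skipped = index_labels(values, labels, handle_invalid="skip")
```

With 'keep' the output has the same number of rows as the input and "d" gets index 3.0; with 'skip' the row containing "d" is missing, so downstream row counts change.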
Ref
https://stackoverflow.com/questions/34681534/spark-ml-stringindexer-handling-unseen-labels