Target Encoding in Sparkling Water — H2O Sparkling Water 3.36.1.4-1-2.3 documentation

Target Encoding in Sparkling Water is a mechanism for converting categorical features to continuous features based on the mean of the label (target) column values calculated for each category. See also Parameters of H2OTargetEncoder.

An example of converting a categorical feature to a continuous one with Target Encoder (Town_te is the produced column):

Town	Label	Town_te
Chennai	1	0.8
Prague	0	0.286
Chennai	0	0.8
Mountain View	1	0.714
Chennai	1	0.8
Prague	1	0.286
Mountain View	1	0.714
Chennai	1	0.8
Mountain View	0	0.714
Prague	1	0.286
Prague	0	0.286
Mountain View	1	0.714
Prague	0	0.286
Mountain View	0	0.714
Chennai	1	0.8
Mountain View	1	0.714
Prague	0	0.286
Prague	0	0.286
Mountain View	1	0.714
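
The Town_te values in the table can be reproduced by averaging Label per town. The following is a minimal plain-Python sketch of that arithmetic only; it is not the Sparkling Water API, the helper name mean_target_encode is hypothetical, and it assumes no noise, blending, or holdout strategy is applied:

```python
from collections import defaultdict

# Rows copied from the table above: (Town, Label).
rows = [
    ("Chennai", 1), ("Prague", 0), ("Chennai", 0), ("Mountain View", 1),
    ("Chennai", 1), ("Prague", 1), ("Mountain View", 1), ("Chennai", 1),
    ("Mountain View", 0), ("Prague", 1), ("Prague", 0), ("Mountain View", 1),
    ("Prague", 0), ("Mountain View", 0), ("Chennai", 1), ("Mountain View", 1),
    ("Prague", 0), ("Prague", 0), ("Mountain View", 1),
]


def mean_target_encode(rows):
    """Return {category: mean label}. Hypothetical helper, not library API."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, label in rows:
        sums[category] += label
        counts[category] += 1
    return {category: sums[category] / counts[category] for category in sums}


encoding = mean_target_encode(rows)

# One encoded value per input row, rounded as in the table.
town_te = [round(encoding[town], 3) for town, _ in rows]
```

Chennai has four positive labels out of five rows (0.8), Prague two out of seven (0.286), and Mountain View five out of seven (0.714).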

Target Encoding can help improve the accuracy of machine learning algorithms when columns with high cardinality are used as features during training.

Using Target Encoder

Sparkling Water exposes an API for Target Encoder in Scala and Python. Before using Target Encoder, we need to start and prepare the environment:

Scala

First, let’s start Sparkling Shell (use :paste mode when you try to copy-paste examples):

./bin/sparkling-shell

Start H2O cluster inside the Spark environment:

import ai.h2o.sparkling._
import java.net.URI
val hc = H2OContext.getOrCreate()

Load the data into a Spark DataFrame and split it into training and testing sets:

import org.apache.spark.SparkFiles
spark.sparkContext.addFile("")
val sparkDF = spark.read.option("header", "true").option("inferSchema", "true").csv(SparkFiles.get("prostate.csv"))
val Array(trainingDF, testingDF) = sparkDF.randomSplit(Array(0.8, 0.2))

Python

First, let’s start PySparkling Shell:

./bin/pysparkling

Start H2O cluster inside the Spark environment:

from pysparkling import *
hc = H2OContext.getOrCreate()

Parse the data using H2O, convert it to a Spark DataFrame, and split it into training and testing sets:

import h2o
frame = h2o.import_file("")
sparkDF = hc.asSparkFrame(frame)
[trainingDF, testingDF] = sparkDF.randomSplit([0.8, 0.2])

Target Encoder in ML Pipeline

Target Encoder in Sparkling Water is implemented as a regular estimator and can therefore be used as a stage in a Spark ML Pipeline.

Scala

Let’s create an instance of Target Encoder and configure it:

import ai.h2o.sparkling.ml.features.H2OTargetEncoder
val targetEncoder = new H2OTargetEncoder()
  .setInputCols(Array("RACE", "DPROS", "DCAPS"))
  .setProblemType("Classification")
  .setLabelCol("CAPSULE")

Next, create an instance of an algorithm that consumes the encoded columns and define the pipeline:

import ai.h2o.sparkling.ml.algos.classification.H2OGBMClassifier
import org.apache.spark.ml.Pipeline
val gbm = new H2OGBMClassifier()
  .setFeaturesCols(targetEncoder.getOutputCols())
  .setLabelCol("CAPSULE")
val pipeline = new Pipeline().setStages(Array(targetEncoder, gbm))

Train the created pipeline:

val pipelineModel = pipeline.fit(trainingDF)

Make predictions with the pipeline model, which includes the Target Encoder model:

pipelineModel.transform(testingDF).show()

The model of Target Encoder is persistable to MOJO, so you can save and load the whole pipeline model:

import org.apache.spark.ml.PipelineModel
pipelineModel.write.save("somePathForStoringPipelineModel")
val loadedPipelineModel = PipelineModel.load("somePathForStoringPipelineModel")
loadedPipelineModel.transform(testingDF).show()

Python

Let’s create an instance of Target Encoder and configure it:

from pysparkling.ml import H2OTargetEncoder
targetEncoder = H2OTargetEncoder()\
    .setInputCols(["RACE", "DPROS", "DCAPS"])\
    .setLabelCol("CAPSULE")\
    .setProblemType("Classification")

Next, create an instance of an algorithm that consumes the encoded columns and define the pipeline:

from pysparkling.ml import H2OGBMClassifier
from pyspark.ml import Pipeline
gbm = H2OGBMClassifier()\
    .setFeaturesCols(targetEncoder.getOutputCols())\
    .setLabelCol("CAPSULE")
pipeline = Pipeline(stages=[targetEncoder, gbm])

Train the created pipeline:

pipelineModel = pipeline.fit(trainingDF)

Make predictions with the pipeline model, which includes the Target Encoder model:

pipelineModel.transform(testingDF).show()

The model of Target Encoder is persistable to MOJO, so you can save and load the whole pipeline model:

from pyspark.ml import PipelineModel
pipelineModel.save("somePathForStoringPipelineModel")
loadedPipelineModel = PipelineModel.load("somePathForStoringPipelineModel")
loadedPipelineModel.transform(testingDF).show()

Standalone Target Encoder

Target Encoder parameters such as noise and holdoutStrategy are relevant only for the training dataset. The transform method of H2OTargetEncoderModel therefore has to treat the training dataset differently from other datasets and ignore these parameters for the latter.
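
To illustrate why training rows need separate treatment, here is a conceptual plain-Python sketch of a leave-one-out holdout: each training row is encoded without its own label, while other rows get the plain category mean. It shows the idea only and is not the H2O implementation; all function names are hypothetical:

```python
def fit_stats(rows):
    """Collect per-category (label sum, row count) from (category, label) pairs."""
    stats = {}
    for category, label in rows:
        total, count = stats.get(category, (0.0, 0))
        stats[category] = (total + label, count + 1)
    return stats


def encode_training_row(stats, category, label):
    """Leave-one-out: exclude the row's own label from its category mean."""
    total, count = stats[category]
    if count == 1:
        # No other rows in this category; this sketch simply returns None.
        return None
    return (total - label) / (count - 1)


def encode_other_row(stats, category):
    """Rows outside the training set are encoded with the plain category mean."""
    total, count = stats[category]
    return total / count


# The five Chennai rows from the example table.
training = [("Chennai", 1), ("Chennai", 0), ("Chennai", 1),
            ("Chennai", 1), ("Chennai", 1)]
stats = fit_stats(training)

loo_for_positive = encode_training_row(stats, "Chennai", 1)  # (4 - 1) / (5 - 1)
loo_for_negative = encode_training_row(stats, "Chennai", 0)  # (4 - 0) / (5 - 1)
plain_mean = encode_other_row(stats, "Chennai")              # 4 / 5
```

Because the training-time value depends on the row's own label, applying the same rule to data without trustworthy labels would make no sense, which is why the two cases are kept apart.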

When Target Encoder is part of an ML pipeline, this differentiation is done automatically. If you train an algorithm without an ML pipeline, call the transformTrainingDataset method on the Target Encoder model to encode the training dataset and get appropriate results.

Edge Cases
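
The standalone flow described above can be sketched as a small helper. The function name encode_standalone is hypothetical; transformTrainingDataset is the method named in the text, and fit and transform are the usual Spark ML estimator and model methods. The helper assumes it receives an already configured H2OTargetEncoder and two Spark DataFrames:

```python
def encode_standalone(target_encoder, training_df, testing_df):
    """Fit a Target Encoder outside an ML pipeline and encode both data sets.

    Assumption: target_encoder is a configured H2OTargetEncoder whose fitted
    model exposes transformTrainingDataset and transform.
    """
    # Learn the per-category statistics from the training data.
    model = target_encoder.fit(training_df)

    # Training data: training-only parameters (noise, holdoutStrategy) apply.
    encoded_training = model.transformTrainingDataset(training_df)

    # Any other data: the training-only parameters are ignored.
    encoded_testing = model.transform(testing_df)

    return model, encoded_training, encoded_testing
```

With the objects from the earlier examples, this would be called as encode_standalone(targetEncoder, trainingDF, testingDF), and the algorithm would then be trained on the encoded training DataFrame.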

The label column can’t contain any null values.
Input columns transformed by Target Encoder can contain null values.
Novel values in a testing/production dataset and null values belong to the same category. In other words, if a given column of the training dataset did not contain any null values, Target Encoder returns the prior average for all novel values. Otherwise, it returns the posterior average of the training rows that have a null value in that column.
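
The rule for novel and null values can be sketched in plain Python as follows. This is an illustration of the behavior described above, not the library's code; the function names and the tiny data sets are hypothetical, and None stands in for a null value:

```python
def fit_encoding(rows):
    """Return (per-category means, prior) from (category, label) pairs."""
    sums, counts = {}, {}
    for category, label in rows:
        sums[category] = sums.get(category, 0.0) + label
        counts[category] = counts.get(category, 0) + 1
    # Prior average: mean of the label over all training rows.
    prior = sum(sums.values()) / sum(counts.values())
    means = {category: sums[category] / counts[category] for category in sums}
    return means, prior


def encode(means, prior, category):
    """Encode one value; novel values share the null category."""
    if category in means:
        return means[category]
    # Novel (or null) value not seen in training: use the posterior average
    # of the null rows if training had any, otherwise the prior average.
    return means.get(None, prior)


# Training data without nulls: a novel town gets the prior average.
means, prior = fit_encoding(
    [("Chennai", 1), ("Chennai", 1), ("Prague", 0), ("Prague", 1)]
)
novel_without_nulls = encode(means, prior, "Brno")

# Training data with nulls: a novel town gets the null rows' average.
means_n, prior_n = fit_encoding(
    [("Chennai", 1), ("Chennai", 1), ("Prague", 0), ("Prague", 1),
     (None, 0), (None, 0)]
)
novel_with_nulls = encode(means_n, prior_n, "Brno")
```

In the first case the novel town is encoded as 0.75, the mean of all four labels. In the second it is encoded as 0.0, the mean of the two null rows, even though the prior average is 0.5.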