Target Encoding in Sparkling Water — H2O Sparkling Water 3.36.1.4-1-2.3 documentation

Target Encoding in Sparkling Water is a mechanism for converting categorical features to continuous features based on the mean calculated from the values of the label (target) column. See also Parameters of H2OTargetEncoder.

An example of converting a categorical feature to a continuous one with Target Encoder (Town_te is the produced column):

Town           Label  Town_te
Chennai        1      0.8
Prague         0      0.286
Chennai        0      0.8
Mountain View  1      0.714
Chennai        1      0.8
Prague         1      0.286
Mountain View  1      0.714
Chennai        1      0.8
Mountain View  0      0.714
Prague         1      0.286
Prague         0      0.286
Mountain View  1      0.714
Prague         0      0.286
Mountain View  0      0.714
Chennai        1      0.8
Mountain View  1      0.714
Prague         0      0.286
Prague         0      0.286
Mountain View  1      0.714
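
Each Town_te value in the table is the mean of Label over all rows with the same Town. The following is a minimal sketch in plain Python that reproduces those values from the rows of the table. It does not use the Sparkling Water API, and the helper name mean_target_encoding is hypothetical.

```python
from collections import defaultdict

# (Town, Label) pairs copied from the table above.
rows = [
    ("Chennai", 1), ("Prague", 0), ("Chennai", 0), ("Mountain View", 1),
    ("Chennai", 1), ("Prague", 1), ("Mountain View", 1), ("Chennai", 1),
    ("Mountain View", 0), ("Prague", 1), ("Prague", 0), ("Mountain View", 1),
    ("Prague", 0), ("Mountain View", 0), ("Chennai", 1), ("Mountain View", 1),
    ("Prague", 0), ("Prague", 0), ("Mountain View", 1),
]


def mean_target_encoding(pairs):
    """Return {category: mean of label} for an iterable of (category, label)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, label in pairs:
        sums[category] += label
        counts[category] += 1
    return {category: sums[category] / counts[category] for category in sums}


encoding = mean_target_encoding(rows)
for town in sorted(encoding):
    # Prints one line per town with the mean rounded to three decimals.
    print(f"{town}: {encoding[town]:.3f}")
```

Rounded to three decimals, the results match the Town_te column: 4/5 for Chennai, 5/7 for Mountain View, and 2/7 for Prague.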

Target Encoding can help to improve accuracy of machine learning algorithms when columns with high cardinality are used as features during a training phase.

Using Target Encoder

Sparkling Water exposes an API for Target Encoder in Scala and Python. Before using Target Encoder, we need to start and prepare the environment:

Scala

First, let’s start Sparkling Shell (use :paste mode when you try to copy-paste examples):

./bin/sparkling-shell

Start an H2O cluster inside the Spark environment:

import ai.h2o.sparkling._
import java.net.URI
val hc = H2OContext.getOrCreate()

Load the data into a Spark DataFrame and split it into training and testing sets:

import org.apache.spark.SparkFiles
spark.sparkContext.addFile("")
val sparkDF = spark.read.option("header", "true").option("inferSchema", "true").csv(SparkFiles.get("prostate.csv"))
val Array(trainingDF, testingDF) = sparkDF.randomSplit(Array(0.8, 0.2))

Python

First, let’s start PySparkling Shell:

./bin/pysparkling

Start an H2O cluster inside the Spark environment:

from pysparkling import *
hc = H2OContext.getOrCreate()

Parse the data using H2O, convert it to a Spark DataFrame, and split it into training and testing sets:

import h2o
frame = h2o.import_file("")
sparkDF = hc.asSparkFrame(frame)
[trainingDF, testingDF] = sparkDF.randomSplit([0.8, 0.2])

Target Encoder in ML Pipeline

Target Encoder in Sparkling Water is implemented as a regular estimator and can therefore be used as a stage in a Spark ML Pipeline.

Scala

Let’s create an instance of Target Encoder and configure it:

import ai.h2o.sparkling.ml.features.H2OTargetEncoder
val targetEncoder = new H2OTargetEncoder()
  .setInputCols(Array("RACE", "DPROS", "DCAPS"))
  .setProblemType("Classification")
  .setLabelCol("CAPSULE")

Next, create an instance of an algorithm that consumes the encoded columns and define the pipeline:

import ai.h2o.sparkling.ml.algos.classification.H2OGBMClassifier
import org.apache.spark.ml.Pipeline
val gbm = new H2OGBMClassifier()
  .setFeaturesCols(targetEncoder.getOutputCols())
  .setLabelCol("CAPSULE")
val pipeline = new Pipeline().setStages(Array(targetEncoder, gbm))

Train the created pipeline:

val pipelineModel = pipeline.fit(trainingDF)

Make predictions with the pipeline model, which includes the Target Encoder model:

pipelineModel.transform(testingDF).show()

The model of Target Encoder is persistable to MOJO, so you can save and load the whole pipeline model:

import org.apache.spark.ml.PipelineModel
pipelineModel.write.save("somePathForStoringPipelineModel")
val loadedPipelineModel = PipelineModel.load("somePathForStoringPipelineModel")
loadedPipelineModel.transform(testingDF).show()

Python

Let’s create an instance of Target Encoder and configure it:

from pysparkling.ml import H2OTargetEncoder
targetEncoder = H2OTargetEncoder()\
  .setInputCols(["RACE", "DPROS", "DCAPS"])\
  .setLabelCol("CAPSULE")\
  .setProblemType("Classification")

Next, create an instance of an algorithm that consumes the encoded columns and define the pipeline:

from pysparkling.ml import H2OGBMClassifier
from pyspark.ml import Pipeline
gbm = H2OGBMClassifier()\
  .setFeaturesCols(targetEncoder.getOutputCols())\
  .setLabelCol("CAPSULE")
pipeline = Pipeline(stages=[targetEncoder, gbm])

Train the created pipeline:

pipelineModel = pipeline.fit(trainingDF)

Make predictions with the pipeline model, which includes the Target Encoder model:

pipelineModel.transform(testingDF).show()

The model of Target Encoder is persistable to MOJO, so you can save and load the whole pipeline model:

from pyspark.ml import PipelineModel
pipelineModel.save("somePathForStoringPipelineModel")
loadedPipelineModel = PipelineModel.load("somePathForStoringPipelineModel")
loadedPipelineModel.transform(testingDF).show()

Standalone Target Encoder

Target Encoder parameters such as noise and holdoutStrategy are relevant only to the training dataset. The transform method of H2OTargetEncoderModel therefore has to treat the training dataset differently from other datasets, and it ignores these parameters for the latter.

When Target Encoder is part of an ML pipeline, this distinction is made automatically. If you train an algorithm without an ML pipeline, call the transformTrainingDataset method on the Target Encoder model to transform the training dataset and get appropriate results.
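
To see why training rows need separate treatment, consider a leave-one-out holdout strategy, in which each training row is encoded without its own label so that the label does not leak into the feature. The sketch below is plain Python for illustration only. It is not how Sparkling Water implements the strategy, the function names are hypothetical, and falling back to the prior for a category with a single row is an assumption of this sketch.

```python
from collections import defaultdict


def fit_stats(pairs):
    """Collect per-category label sums and counts from (category, label) pairs."""
    sums, counts = defaultdict(float), defaultdict(int)
    for category, label in pairs:
        sums[category] += label
        counts[category] += 1
    return sums, counts


def encode_training_row(category, label, sums, counts, prior):
    """Leave-one-out: exclude the row's own label from its category mean."""
    remaining = counts[category] - 1
    if remaining == 0:
        return prior  # assumption: a singleton category falls back to the prior
    return (sums[category] - label) / remaining


def encode_other_row(category, sums, counts, prior):
    """Non-training data: use the full category mean, with no holdout."""
    if counts.get(category, 0) == 0:
        return prior
    return sums[category] / counts[category]


training = [("Chennai", 1), ("Chennai", 0), ("Chennai", 1), ("Prague", 0)]
sums, counts = fit_stats(training)
prior = sum(label for _, label in training) / len(training)

train_encoded = [encode_training_row(c, y, sums, counts, prior) for c, y in training]
test_encoded = encode_other_row("Chennai", sums, counts, prior)
print(train_encoded)
print(test_encoded)
```

In this toy data, the three Chennai training rows get different encodings depending on their own labels, while a Chennai row in a testing dataset always gets the full category mean of 2/3. The same value therefore encodes differently depending on whether the row belongs to the training dataset.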

Edge Cases

  • The label column can’t contain any null values.

  • Input columns transformed by Target Encoder can contain null values.

  • Novel values in a testing/production dataset and null values belong to the same category. In other words, if a given column of the training dataset did not contain any null values, Target Encoder returns the prior average for all novel values. Otherwise, it returns the posterior average of the rows that have null values in that column.
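
The handling of novel and null values described above can be sketched in plain Python. This is a toy model of the stated rule, not the Sparkling Water implementation: None stands in for a null value, the function names fit and encode are hypothetical, and the prior is taken here to be the mean of the label over all training rows.

```python
from collections import defaultdict


def fit(pairs):
    """Compute the prior (global label mean) and per-category posterior means.

    None stands in for a null value and is treated as its own category.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for category, label in pairs:
        sums[category] += label
        counts[category] += 1
    prior = sum(sums.values()) / sum(counts.values())
    posteriors = {category: sums[category] / counts[category] for category in counts}
    return prior, posteriors


def encode(value, prior, posteriors):
    """Encode one value; novel values share the null category."""
    if value in posteriors:
        return posteriors[value]
    # Novel value (or null with no nulls seen in training): use the null
    # category's posterior if it exists, otherwise the prior.
    return posteriors.get(None, prior)


# Training column without nulls: novel and null values get the prior.
prior_a, post_a = fit([("a", 1), ("a", 0), ("b", 1), ("b", 1)])
novel_without_nulls = encode("c", prior_a, post_a)
null_without_nulls = encode(None, prior_a, post_a)

# Training column with nulls: novel values get the posterior of the null rows.
prior_b, post_b = fit([("a", 1), ("a", 0), (None, 0), (None, 0), ("b", 1), ("b", 1)])
novel_with_nulls = encode("c", prior_b, post_b)
null_with_nulls = encode(None, prior_b, post_b)
```

In the first dataset the novel value "c" is encoded as the prior, 0.75. In the second dataset it is encoded as 0.0, the mean label of the rows whose value is null, even though the prior there is 0.5.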
