How to Use Split Validation
Split validation is a learning-time regulariser that lets you withhold a subset of training rows from being used as candidate split points while still letting them influence impurity scoring. It is useful when you suspect that pathological x-values in the training data (outliers, near-duplicates, or measurement noise) would otherwise drive the tree to make splits that do not generalise.
Two complementary mechanisms are available:
- split_validation_mask: marks each row as either a training or an evaluation sample;
- split_validation_mode: controls which rows contribute to the impurity score at each candidate split.
Basic Usage
Pass a boolean or uint8 mask to jpt.trees.JPT.fit() via the split_validation_mask argument.
True (or 1) marks a row whose feature value may serve as a
split candidate; False (0) marks a row that is excluded
from the candidate set but still contributes to target statistics.
import numpy as np
import pandas as pd
from jpt.trees import JPT
from jpt.variables import NumericVariable, SymbolicVariable
from jpt.distributions import Bool
rng = np.random.RandomState(0)
n = 1000
x = rng.uniform(0, 1, n)
y = x > 0.5
df = pd.DataFrame({'x': x, 'y': y})
# Hold 30 % out as evaluation rows.
mask = rng.rand(n) < 0.7 # True = training
xvar = NumericVariable('x')
yvar = SymbolicVariable('y', Bool)
jpt = JPT(
    variables=[xvar, yvar],
    targets=[yvar],
    min_samples_leaf=20,
)
jpt.fit(df, split_validation_mask=mask)
With no other arguments, the evaluation rows are treated under
split_validation_mode='both' (the default): their target
values are included in the impurity score at every candidate
split, but their x-coordinates are not tried as split points.
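A mask does not have to be random. One way to build it is to keep rows with extreme feature values out of the candidate set. The sketch below is illustrative only: mask_excluding_extremes and its quantile cut-offs are hypothetical names and choices, not part of jpt, and it uses NumPy alone.

```python
import numpy as np

def mask_excluding_extremes(x, lower_q=0.01, upper_q=0.99):
    """Hypothetical helper: True = training row (may propose a split),
    False = evaluation row (excluded from the candidate set)."""
    x = np.asarray(x, dtype=float)
    lo, hi = np.quantile(x, [lower_q, upper_q])
    return (x >= lo) & (x <= hi)

rng = np.random.RandomState(0)
x = rng.uniform(0, 1, 1000)
x[:5] = [50.0, -40.0, 75.0, -90.0, 120.0]  # inject a few gross outliers

mask = mask_excluding_extremes(x)   # boolean form
mask_u8 = mask.astype(np.uint8)     # uint8 form: 1 = training, 0 = evaluation
```

Either mask or mask_u8 could then be passed as split_validation_mask, provided it has one entry per row of the data given to fit().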
Choosing a Mode
split_validation_mode determines which rows contribute to the
target impurity calculation at each split:
- 'both' (default): All rows contribute to impurity. Equivalent to a classic validation-hold-out: the training rows define the candidate splits, but every row tells the optimiser how good each candidate is.
- 'training': Only training rows contribute to impurity. The evaluation rows act purely as a "don't split on these x-values" signal.
- 'evaluation': Only evaluation rows contribute to impurity. The tree is scored exclusively on held-out rows; training rows propose splits and nothing else. Works well when the training set has many near-duplicate or extreme x-values that you want to suggest candidate boundaries but not vote on quality.
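The difference between the modes is easiest to see on a toy example. The following is a minimal NumPy sketch of the idea, not jpt's implementation: best_threshold and gini are hypothetical helpers, Gini impurity is assumed as the score, and a single numeric feature with a boolean target is assumed.

```python
import numpy as np

def gini(y):
    """Gini impurity of a boolean label vector (0.0 for an empty vector)."""
    if y.size == 0:
        return 0.0
    p = y.mean()
    return 2.0 * p * (1.0 - p)

def best_threshold(x, y, mask, mode="both"):
    """Return (threshold, score) with the lowest weighted Gini impurity.

    Candidate thresholds come only from training rows (mask == True);
    `mode` selects which rows are used to score each candidate.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=bool)
    mask = np.asarray(mask, dtype=bool)
    if mode == "both":
        scored = np.ones_like(mask)
    elif mode == "training":
        scored = mask
    elif mode == "evaluation":
        scored = ~mask
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    xs, ys = x[scored], y[scored]
    best_t, best_score = None, np.inf
    for t in np.unique(x[mask]):
        left, right = ys[xs <= t], ys[xs > t]
        if left.size == 0 or right.size == 0:
            continue  # a split must leave scored rows on both sides
        score = (left.size * gini(left) + right.size * gini(right)) / ys.size
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

x = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])
y = x > 0.5
y[3] = True  # label noise on the training row at x = 0.4
mask = np.array([False, True, False, True, False, False, True, False])

results = {m: best_threshold(x, y, mask, m)
           for m in ("both", "training", "evaluation")}
```

In this toy data the noisy training label pulls the 'training' mode towards the threshold 0.2, while 'evaluation' and 'both' settle on 0.4, the candidate closest to the true boundary at 0.5.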
min_eval_samples — Require a Minimum of Held-out Rows per Child
When split_validation_mode='evaluation' is active the
impurity is scored on a smaller set than the training set.
Splits that leave very few evaluation rows on one side yield
unreliable impurity estimates. Setting min_eval_samples in
the jpt.trees.JPT constructor rejects any candidate
split where either child partition contains fewer than
min_eval_samples evaluation rows:
jpt = JPT(
    variables=[xvar, yvar],
    targets=[yvar],
    min_samples_leaf=20,
    min_eval_samples=10,  # int: absolute count
)
jpt.fit(df, split_validation_mask=mask,
        split_validation_mode='evaluation')
As with min_samples_leaf, a float in (0, 1) is
interpreted as a fraction of the total training rows:
JPT(..., min_eval_samples=0.05) # 5 % of all rows
min_eval_samples=0 (the default) disables the check.
min_eval_samples is ignored for modes other than
'evaluation'.
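The rejection rule can be sketched in a few lines of NumPy. This is an illustration under assumptions, not the library's code: resolve_min_eval and split_allowed are hypothetical helpers, and rounding a fractional value up is a guess, since the rounding jpt applies is not stated here.

```python
import math
import numpy as np

def resolve_min_eval(value, n_rows):
    """Turn an int (absolute count) or a float in (0, 1) (fraction of
    n_rows) into an absolute count. Rounding up is an assumption."""
    if isinstance(value, float) and 0.0 < value < 1.0:
        return math.ceil(value * n_rows)
    return int(value)

def split_allowed(x, mask, threshold, min_eval):
    """True if both children keep at least min_eval evaluation rows."""
    x = np.asarray(x, dtype=float)
    is_eval = ~np.asarray(mask, dtype=bool)
    n_left = int(np.count_nonzero(is_eval & (x <= threshold)))
    n_right = int(np.count_nonzero(is_eval & (x > threshold)))
    return n_left >= min_eval and n_right >= min_eval

x = np.arange(400) / 400.0
mask = np.arange(400) % 2 == 0  # 200 training rows, 200 evaluation rows
min_eval = resolve_min_eval(60, n_rows=400)

# Check three candidate thresholds against the minimum of 60 evaluation rows.
allowed = {t: split_allowed(x, mask, t, min_eval) for t in (0.25, 0.3, 0.5)}
```

With 200 evaluation rows and a minimum of 60, the candidate at 0.25 is rejected because only 50 evaluation rows fall on its left side, while the candidates at 0.3 and 0.5 pass.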
Serialisation
Both min_eval_samples and the resulting tree structure are
preserved by jpt.trees.JPT.to_json() /
jpt.trees.JPT.from_json(). The split-validation mask and
mode are learning-time parameters only — they are not stored in
the fitted model and do not affect inference.
Troubleshooting
- All splits rejected, tree ends up with a single leaf.
  min_eval_samples is too large for your evaluation set size. If you have 200 evaluation rows and set min_eval_samples=60, a split is accepted only when each child keeps at least 60 evaluation rows, so the root split must be fairly balanced and any node holding fewer than 120 evaluation rows cannot be split again. Reduce the value.
- Training with a mask is much slower than without.
  The evaluation-only path requires a second pass over the target statistics per candidate split. For large datasets, split_validation_mode='both' (the default) is the fastest option.
- “Mask length must equal number of samples” error.
  The mask is row-aligned with the data argument to fit() after any preprocessing (dropping NaN rows, etc.). Build the mask from the cleaned DataFrame, not from the raw input.
- “Mask must contain at least one training sample” error.
  At least one row needs mask[i] == True so the tree has candidate split points to choose from.
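The two mask errors can be avoided by cleaning the data first and validating the mask before fitting. The sketch below assumes NaN rows are dropped with pandas before the mask is built; check_mask is a hypothetical pre-flight helper that mirrors the two messages above and is not part of jpt.

```python
import numpy as np
import pandas as pd

def check_mask(mask, n_rows):
    """Hypothetical pre-flight check mirroring the two mask errors."""
    mask = np.asarray(mask)
    if mask.shape != (n_rows,):
        raise ValueError("Mask length must equal number of samples")
    if not mask.astype(bool).any():
        raise ValueError("Mask must contain at least one training sample")
    return mask.astype(bool)

rng = np.random.RandomState(0)
raw = pd.DataFrame({"x": rng.uniform(0, 1, 100)})
raw["y"] = raw["x"] > 0.5
raw.iloc[::10, 0] = np.nan  # 10 rows with a missing feature value

# Clean first, then build the mask from the cleaned frame.
clean = raw.dropna().reset_index(drop=True)
mask = check_mask(rng.rand(len(clean)) < 0.7, len(clean))
```

A mask built from raw instead would have 100 entries against 90 cleaned rows and would fail the length check.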