How to Work with Variables
pyjpt uses typed variable objects to describe the columns of your
data. Every variable has a name and a domain — a distribution
class that defines the set of values the variable can take and how
they are represented internally. This guide covers:
- The relationship between variables and their domains
- The three variable types and their settings
- How to create variables manually and infer them from data
- Labels (user-facing) vs. values (internal representation)
- Variable maps and assignments for queries and results
- Impurity inversion for symbolic variables
Variables and Domains
A variable is a pairing of a name (a string identifying a column) with a domain (a distribution class that defines the legal values). The domain is always a class — not an instance — because the variable describes a type of data, not a fitted distribution. Fitted distributions are created later, inside each leaf of the tree.
from jpt.distributions import (
    SymbolicType,
    NumericType,
    Bool,
    Numeric,
)
from jpt.distributions.univariate import IntegerType
from jpt.variables import (
    SymbolicVariable,
    NumericVariable,
    IntegerVariable,
)
# The domain is a CLASS (not an instance).
Color = SymbolicType('Color', ['red', 'green', 'blue'])
color = SymbolicVariable('color', Color)
# Numeric domains can be plain or scaled.
temperature = NumericVariable('temperature')  # Numeric
height = NumericVariable(  # ScaledNumeric
    'height',
    domain=NumericType(
        'Height',
        values=[165.0, 170.0, 180.0, 190.0],
    ),
)
# Integer domains specify a range.
die = IntegerVariable(
    'die',
    IntegerType('Die', lmin=1, lmax=6),
)
# Bool is a predefined symbolic domain with
# labels {True, False}.
raining = SymbolicVariable('raining', Bool)
When the tree calls variable.distribution() internally, it
instantiates the domain class and passes relevant settings from the
variable to the new distribution instance. The resulting object
holds fitted parameters (probability vectors, quantile functions,
etc.) for a specific leaf.
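The class-versus-instance split can be illustrated without pyjpt. The following is a minimal sketch, not the library's implementation: ToyDomain, ToyVariable, and the way settings are forwarded are hypothetical stand-ins chosen only to show that one domain class yields a separate fitted object per leaf.

```python
class ToyDomain:
    """Hypothetical stand-in for a domain class (not part of pyjpt)."""

    def __init__(self, **settings):
        self.settings = settings
        self.params = None  # filled in once the distribution is fitted

    def fit(self, data):
        # A trivial "fit": remember the mean of the data in this leaf.
        self.params = sum(data) / len(data)
        return self


class ToyVariable:
    """Hypothetical variable: a name paired with a domain CLASS."""

    def __init__(self, name, domain, **settings):
        self.name = name
        self.domain = domain
        self.settings = settings

    def distribution(self):
        # Instantiate the domain class, forwarding the variable's settings.
        return self.domain(**self.settings)


x = ToyVariable('x', ToyDomain, precision=0.01)
leaf_a = x.distribution().fit([1.0, 2.0, 3.0])
leaf_b = x.distribution().fit([10.0, 20.0])
```

Both leaves share the same variable and domain class, but each holds its own fitted parameters.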
Domain Factory Functions
Domains are created with factory functions that return new classes (not instances):
- SymbolicType(name, labels)
Creates a Multinomial subclass. labels is a list of user-facing category names. Internally, each label is mapped to a zero-based integer index.
Fruit = SymbolicType('Fruit', ['apple', 'banana', 'cherry'])
Fruit.labels # {0: 'apple', 1: 'banana', 2: 'cherry'}
Fruit.values # {'apple': 0, 'banana': 1, 'cherry': 2}
Fruit.n_values # 3
- NumericType(name, values=None)
Creates a ScaledNumeric subclass. If values is provided (a sample of the expected data), the domain stores mean/scale normalization factors so that the tree works in standardized space internally while preserving the original scale externally. If no values are given, the plain Numeric class is used (no scaling).
- IntegerType(name, lmin=None, lmax=None)
Creates an Integer subclass. lmin and lmax define the label-space bounds (inclusive). Omitting a bound creates an open-ended domain. Internally, values are mapped to zero-based indices:
Score = IntegerType('Score', lmin=-2, lmax=2)
# labels (external): -2, -1, 0, 1, 2
# values (internal): 0, 1, 2, 3, 4
- Bool
A predefined Multinomial subclass with two labels, False (index 0) and True (index 1). It is a ready-made class, not a factory.
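To make the idea of a factory that returns a class concrete, here is a pure-Python sketch. It does not reproduce pyjpt's code: make_symbolic_type and make_integer_type are hypothetical helpers that only mimic the labels, values, and n_values attributes described above.

```python
def make_symbolic_type(name, labels):
    """Hypothetical factory: build a new class with label/value mappings."""
    attrs = {
        'labels': {i: label for i, label in enumerate(labels)},
        'values': {label: i for i, label in enumerate(labels)},
        'n_values': len(labels),
    }
    return type(name, (object,), attrs)


def make_integer_type(name, lmin, lmax):
    """Hypothetical factory for a bounded integer domain (inclusive bounds)."""
    span = range(lmin, lmax + 1)
    attrs = {
        'labels': {i: label for i, label in enumerate(span)},
        'values': {label: i for i, label in enumerate(span)},
        'n_values': len(span),
    }
    return type(name, (object,), attrs)


# Both calls return classes, not instances.
Fruit = make_symbolic_type('Fruit', ['apple', 'banana', 'cherry'])
Score = make_integer_type('Score', lmin=-2, lmax=2)
```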
Variable Types
All variables inherit from the abstract base class
Variable, which provides the settings
mechanism, serialization, and hashing. Variable cannot be
instantiated directly; use one of the three concrete subclasses.
NumericVariable
For continuous, real-valued columns.
temp = NumericVariable(
    'temperature',
    domain=Numeric,  # default
    blur=0.05,  # widen point evidence
    max_std=2.0,  # stop splitting below this standard deviation
    precision=0.01,  # quantile granularity
    min_impurity_improvement=0.0,  # minimum split gain
)
| Setting | Description |
|---|---|
| blur | Widens a single-value evidence point into an interval via the prior quantile function. |
| max_std | If the standard deviation in a node drops below this limit, no further splits are attempted for this variable. |
| precision | Controls the granularity of the quantile-based density approximation. |
| min_impurity_improvement | Minimum impurity reduction required for a split on this variable. |
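The effect of blur can be sketched with an empirical prior. This is an illustration under stated assumptions, not pyjpt's code: it assumes that blur is a probability mass spread symmetrically around the quantile of the evidence point, and empirical_cdf, empirical_ppf, and blur_point are hypothetical helpers.

```python
import math
from bisect import bisect_right


def empirical_cdf(sample, x):
    """Fraction of points in the sorted sample that are <= x."""
    return bisect_right(sample, x) / len(sample)


def empirical_ppf(sample, q):
    """Smallest point in the sorted sample whose CDF value is at least q."""
    q = min(max(q, 0.0), 1.0)
    index = max(math.ceil(q * len(sample)) - 1, 0)
    return sample[index]


def blur_point(sample, x, blur):
    """Widen the point x into an interval covering `blur` prior mass."""
    q = empirical_cdf(sample, x)
    lower = empirical_ppf(sample, q - blur / 2)
    upper = empirical_ppf(sample, q + blur / 2)
    return lower, upper


prior_sample = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
interval = blur_point(prior_sample, 4.0, blur=0.25)  # (3.0, 5.0)
```

With blur=0.0 the interval collapses back to the original point.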
IntegerVariable
For discrete, integer-valued columns with a finite or open-ended range.
rolls = IntegerVariable(
    'rolls',
    IntegerType('Rolls', lmin=1, lmax=20),
    min_impurity_improvement=0.0,
)
| Setting | Description |
|---|---|
| min_impurity_improvement | Minimum impurity reduction required for a split on this variable. |
SymbolicVariable
For categorical columns.
species = SymbolicVariable(
    'species',
    SymbolicType('Species', ['setosa', 'versicolor', 'virginica']),
    invert_impurity=False,
    min_impurity_improvement=0.0,
)
| Setting | Description |
|---|---|
| invert_impurity | Invert the Gini impurity for this variable, favoring mixed leaves. |
| min_impurity_improvement | Minimum impurity reduction required for a split on this variable. |
Inferring Variables from a DataFrame
For quick setup, infer_from_dataframe()
inspects column dtypes and creates one variable per column:
import pandas as pd
from jpt.variables import infer_from_dataframe
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol'],
    'age': [30.0, 25.0, 35.0],
    'score': [3, 5, 4],
})
variables = infer_from_dataframe(df)
# [SymbolicVariable('name', ...),
# NumericVariable('age', ...),
# IntegerVariable('score', ...)]
The mapping from dtype to variable type is:
| Column dtype | Variable type |
|---|---|
| object (strings) | SymbolicVariable |
| float | NumericVariable |
| integer | IntegerVariable |
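The dispatch idea behind type inference can be sketched in plain Python. This is a simplified stand-in and not infer_from_dataframe() itself: it inspects Python value types rather than pandas dtypes, and infer_kind is a hypothetical helper.

```python
def infer_kind(column):
    """Guess a variable kind from the Python types of a column's values."""
    if all(type(v) is int for v in column):
        return 'integer'
    if all(type(v) in (int, float) for v in column):
        return 'numeric'
    # Anything else (strings, bools, mixed content) is treated as
    # categorical in this sketch.
    return 'symbolic'


table = {
    'name': ['Alice', 'Bob', 'Carol'],
    'age': [30.0, 25.0, 35.0],
    'score': [3, 5, 4],
}
kinds = {column: infer_kind(values) for column, values in table.items()}
```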
Useful keyword arguments:
- scale_numeric_types (default True): use ScaledNumeric domains for float columns, storing mean/scale normalization from the column data.
- unique_domain_names (default False): append a UUID to every domain name, preventing name collisions when calling the function multiple times.
- excluded_columns: a dict mapping column names to user-provided domain classes, overriding automatic inference for those columns.
- remove_nan (default False): exclude NaN/±inf values when constructing numeric domains.
Labels vs. Values
pyjpt distinguishes two representations of data:
- Labels (user-facing / exterior)
The human-readable representation: category strings ('red', 'blue'), raw floats (23.5), or integers (-2). Labels are what you see in DataFrames, queries, and printed results.
- Values (internal / interior)
The representation used during tree learning and inference. For symbolic variables, labels are mapped to zero-based integer indices. For integer variables, labels are likewise mapped to zero-based indices. For scaled numeric variables, labels are z-normalized. For plain numeric variables, labels and values are identical (identity mapping).
Every domain class provides bidirectional conversion:
Color = SymbolicType('Color', ['red', 'green', 'blue'])
# Label → Value
Color.values['red'] # 0
Color.label2value('green') # 1
Color.label2value({'red', 'blue'}) # {0, 2}
# Value → Label
Color.labels[0] # 'red'
Color.value2label(1) # 'green'
Color.value2label({0, 2}) # {'red', 'blue'}
For numeric variables with scaling:
import numpy as np
Height = NumericType(
    'Height',
    values=np.array([165., 170., 180., 190.]),
)
# label2value normalizes (subtracts mean, divides by std)
internal = Height.label2value(180.0)
# value2label denormalizes
external = Height.value2label(internal) # ≈ 180.0
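The round trip between label space and value space for a scaled domain amounts to a standardization and its inverse. The sketch below assumes the scale is the population standard deviation of the sample; the estimate pyjpt actually uses may differ, and ToyScaledDomain is a hypothetical class.

```python
import statistics


class ToyScaledDomain:
    """Hypothetical scaled domain storing mean/scale from a data sample."""

    def __init__(self, values):
        self.mean = statistics.fmean(values)
        # Assumption: population standard deviation as the scale factor.
        self.scale = statistics.pstdev(values)

    def label2value(self, label):
        # External (label) space -> internal (standardized) space.
        return (label - self.mean) / self.scale

    def value2label(self, value):
        # Internal (standardized) space -> external (label) space.
        return value * self.scale + self.mean


toy_height = ToyScaledDomain([165.0, 170.0, 180.0, 190.0])
internal = toy_height.label2value(180.0)
external = toy_height.value2label(internal)
```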
The tree’s bind() method works in label space by default,
so you pass human-readable values and the conversion happens
automatically:
evidence = model.bind(color='red', temperature=22.5)
Variable Maps and Assignments
VariableMap
VariableMap is a dictionary-like
container that maps Variable objects to arbitrary values. It
supports lookup by both the Variable object and its name
string:
from jpt.variables import VariableMap
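# Register the variables up front so that lookup by name works.
vm = VariableMap(variables=[color, temperature])
vm[color] = 'red' # set by Variable object
vm['temperature'] = 22.5 # set by name string (requires a
# registered variable)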
print(vm[color]) # 'red'
print(vm['color']) # 'red'
VariableMap is used throughout the library: leaf distributions,
prior distributions, query results, and moment computations all
return VariableMap instances.
It supports standard dict operations — iteration (yields
Variable objects), in, del, len, keys(),
values(), items() — as well as copy(), in-place
+= / -=, and JSON serialization.
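Lookup by either object or name can be pictured as a dictionary keyed on the variable's name plus a registry of known variables. The following is a sketch of that idea only: NamedVar and ToyVariableMap are hypothetical, and the real VariableMap offers more operations than shown here.

```python
class NamedVar:
    """Hypothetical minimal variable: just a name."""

    def __init__(self, name):
        self.name = name


class ToyVariableMap:
    """Dict-like container addressable by variable object or by name."""

    def __init__(self, variables=()):
        self._variables = {v.name: v for v in variables}
        self._data = {}

    def _resolve(self, key):
        if isinstance(key, str):
            # Names only work for variables that are already registered.
            return self._variables[key]
        self._variables.setdefault(key.name, key)
        return key

    def __setitem__(self, key, value):
        self._data[self._resolve(key).name] = value

    def __getitem__(self, key):
        return self._data[self._resolve(key).name]

    def __iter__(self):
        # Iteration yields variable objects, not names.
        return (self._variables[name] for name in self._data)

    def __len__(self):
        return len(self._data)


toy_color = NamedVar('color')
toy_temperature = NamedVar('temperature')
toy_map = ToyVariableMap(variables=[toy_color, toy_temperature])
toy_map[toy_color] = 'red'
toy_map['temperature'] = 22.5
```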
VariableAssignment
VariableAssignment extends
VariableMap with type validation for the values. It is
an abstract class with two concrete subclasses that correspond to
the two data representations:
LabelAssignment — stores values in label space.
from jpt.variables import LabelAssignment
la = LabelAssignment(variables=[color, temperature])
la[color] = {'red', 'green'} # set of labels
la[temperature] = 22.5 # scalar → ContinuousSet
# Convert to internal representation:
va = la.value_assignment()
ValueAssignment — stores values in value space (integer indices for symbolic variables, normalized floats for scaled numerics).
# Convert back to labels:
la2 = va.label_assignment()
Assignments validate their inputs: setting a symbolic variable to a label that is not in its domain raises an error, as does setting a numeric variable to a non-numeric value.
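Validation and label-to-value conversion can be sketched for the symbolic case as follows. ToyLabelAssignment is a hypothetical class keyed by variable name; the error type raised by pyjpt is not stated here, so the ValueError is an assumption.

```python
class ToyLabelAssignment:
    """Hypothetical label-space assignment with domain validation."""

    def __init__(self, domains):
        # domains maps a variable name to its {label: index} dictionary.
        self._domains = domains
        self._labels = {}

    def __setitem__(self, name, labels):
        if not isinstance(labels, (set, frozenset, list, tuple)):
            labels = [labels]  # a single label becomes a one-element set
        labels = set(labels)
        unknown = labels - set(self._domains[name])
        if unknown:
            raise ValueError(f'{sorted(unknown)} not in domain of {name!r}')
        self._labels[name] = labels

    def value_assignment(self):
        # Convert every stored label set to its internal index set.
        return {
            name: {self._domains[name][label] for label in labels}
            for name, labels in self._labels.items()
        }


toy_la = ToyLabelAssignment({'color': {'red': 0, 'green': 1, 'blue': 2}})
toy_la['color'] = {'red', 'green'}
toy_va = toy_la.value_assignment()

try:
    toy_la['color'] = 'purple'  # not a label of the domain
except ValueError as err:
    rejected = str(err)
```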
In practice, you rarely construct assignments directly. The
tree’s bind() method builds a LabelAssignment from
user-friendly keyword arguments:
# bind() accepts plain keyword arguments:
evidence = model.bind(color='red', temperature=[20, 25])
# The list [20, 25] is converted to ContinuousSet(20, 25)
# automatically.
Impurity Inversion for Symbolic Variables
By default the JPT learning algorithm minimizes the Gini impurity of every target variable, producing leaves in which each symbolic variable is as pure as possible — ideally a single dominant category.
Setting invert_impurity=True on a
SymbolicVariable reverses this
objective for that variable: the learner now favors splits that
keep the variable’s distribution mixed within each leaf instead
of separating it.
Gender = SymbolicType('Gender', ['female', 'male', 'other'])
gender = SymbolicVariable(
    'gender',
    Gender,
    invert_impurity=True,
)
model = JPT(variables=[gender, ...])
model.fit(df)
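How an inverted objective changes split preferences can be seen on a toy example. The exact formula pyjpt applies is not given here; this sketch assumes inversion means scoring one minus the normalized Gini impurity, and gini and split_impurity are hypothetical helpers.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def split_impurity(left, right, n_classes, invert=False):
    """Size-weighted, normalized Gini of a split; lower is better."""
    n = len(left) + len(right)
    weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
    normalized = weighted / (1.0 - 1.0 / n_classes)
    # Assumption: inversion flips the normalized score.
    return 1.0 - normalized if invert else normalized


# One split separates the classes, the other keeps them mixed.
separating = (['f', 'f', 'f', 'f'], ['m', 'm', 'm', 'm'])
mixing = (['f', 'm', 'f', 'm'], ['f', 'm', 'f', 'm'])

plain = {
    'separating': split_impurity(*separating, n_classes=2),
    'mixing': split_impurity(*mixing, n_classes=2),
}
inverted = {
    'separating': split_impurity(*separating, n_classes=2, invert=True),
    'mixing': split_impurity(*mixing, n_classes=2, invert=True),
}
```

Under the plain objective the separating split scores best; under the inverted objective the mixing split does.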
When to Use Impurity Inversion
Fairness-aware learning.
Mark a protected attribute (e.g. gender, ethnicity) with
invert_impurity=True. The tree will avoid creating splits
that segregate by that attribute, producing leaves where the
protected attribute remains representative of the overall
population. This is a lightweight structural fairness
constraint — the model can still predict on other variables
without building discriminatory partitions.
Confound suppression. In observational data a confounding variable (e.g. hospital site in a multi-site medical study) may dominate splits even though the goal is to learn patient-level patterns. Inverting impurity on the confounder forces the tree to find splits driven by other variables while keeping the confounder mixed in every leaf.
Balanced stratification for downstream tasks. If you need per-leaf statistics that should be computed over a representative distribution of some grouping variable (e.g. product category), inversion ensures each leaf retains a mix of all groups rather than splitting them apart.
Example
import pandas as pd
from jpt.distributions import SymbolicType
from jpt.variables import SymbolicVariable
from jpt.trees import JPT
df = pd.DataFrame({
'fst': ['a', 'a', 'a', 'b', 'b', 'b'],
'snd': ['c', 'd', 'c', 'd', 'c', 'd'],
})
AT = SymbolicType('AType', labels=['a', 'b'])
BT = SymbolicType('BType', labels=['c', 'd'])
A = SymbolicVariable('fst', AT, invert_impurity=True)
B = SymbolicVariable('snd', BT)
model = JPT([A, B])
model.fit(df)
# Each leaf retains a mix of 'a' and 'b' for variable
# ``fst`` instead of splitting them into pure nodes.
for leaf in model.leaves.values():
    print(leaf.distributions['fst'])
Variable Settings
Every variable carries a settings dictionary populated from
class-level defaults and overridden by constructor arguments.
Settings are accessible as regular attributes:
v = NumericVariable('x', blur=0.1, precision=0.05)
v.blur # 0.1
v.precision # 0.05
v.settings # {'min_impurity_improvement': 0,
# 'blur': 0.1, 'max_std_lbl': 0.0,
# 'precision': 0.05}
Settings are included in equality checks and hashing, so two variables with the same name and domain but different settings are considered distinct. Settings are also preserved through JSON and pickle serialization.
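The consequence of including settings in equality and hashing is easy to demonstrate with a toy class. SettingsVar is hypothetical, and its default values are illustrative only, not pyjpt's defaults.

```python
class SettingsVar:
    """Hypothetical variable whose identity includes its settings."""

    # Illustrative defaults only.
    DEFAULTS = {'blur': 0.0, 'precision': 0.01}

    def __init__(self, name, **overrides):
        unknown = set(overrides) - set(self.DEFAULTS)
        if unknown:
            raise TypeError(f'unknown settings: {sorted(unknown)}')
        self.name = name
        self.settings = {**self.DEFAULTS, **overrides}

    def __getattr__(self, item):
        # Only called when normal lookup fails: expose settings as attributes.
        try:
            return self.__dict__['settings'][item]
        except KeyError:
            raise AttributeError(item) from None

    def _key(self):
        return (self.name, tuple(sorted(self.settings.items())))

    def __eq__(self, other):
        return isinstance(other, SettingsVar) and self._key() == other._key()

    def __hash__(self):
        return hash(self._key())


a = SettingsVar('x', blur=0.1)
b = SettingsVar('x', blur=0.1)
c = SettingsVar('x', blur=0.2)
distinct = {a, b, c}  # a and b collapse, c stays separate
```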
Serialization
All variables and variable maps support JSON and pickle round-trips:
import json
import pickle
from jpt.variables import Variable
# JSON
data = json.dumps(color.to_json())
restored = Variable.from_json(json.loads(data))
assert color == restored
# Pickle
restored = pickle.loads(pickle.dumps(color))
assert color == restored
The JSON representation includes the variable type
('numeric', 'symbolic', 'integer'), name, serialized
domain, and settings. Variable.from_json() dispatches to
the correct subclass based on the type field.
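Dispatching on a type field is a common deserialization pattern. The sketch below shows one way to do it with a subclass registry; JsonVar and its subclasses are hypothetical and carry only a name and settings, so this is not the format pyjpt writes.

```python
import json


class JsonVar:
    """Hypothetical base class with type-based JSON dispatch."""

    registry = {}
    type_name = None

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Register every subclass under its type tag.
        JsonVar.registry[cls.type_name] = cls

    def __init__(self, name, settings=None):
        self.name = name
        self.settings = dict(settings or {})

    def to_json(self):
        return {
            'type': self.type_name,
            'name': self.name,
            'settings': self.settings,
        }

    @staticmethod
    def from_json(data):
        # Pick the concrete subclass from the 'type' field.
        cls = JsonVar.registry[data['type']]
        return cls(data['name'], data['settings'])

    def __eq__(self, other):
        return type(self) is type(other) and self.to_json() == other.to_json()


class JsonNumeric(JsonVar):
    type_name = 'numeric'


class JsonSymbolic(JsonVar):
    type_name = 'symbolic'


original = JsonSymbolic('color', {'invert_impurity': False})
payload = json.dumps(original.to_json())
restored = JsonVar.from_json(json.loads(payload))
```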
See also
jpt.variables — full API reference.
How to Classify with JPTs — using symbolic targets for classification.
How to Predict Continuous Values with JPTs — working with numeric targets.