How to Work with Variables

pyjpt uses typed variable objects to describe the columns of your data. Every variable has a name and a domain — a distribution class that defines the set of values the variable can take and how they are represented internally. This guide covers:

  • The relationship between variables and their domains

  • The three variable types and their settings

  • How to create variables manually and infer them from data

  • Labels (user-facing) vs. values (internal representation)

  • Variable maps and assignments for queries and results

  • Impurity inversion for symbolic variables

Variables and Domains

A variable is a pairing of a name (a string identifying a column) with a domain (a distribution class that defines the legal values). The domain is always a class — not an instance — because the variable describes a type of data, not a fitted distribution. Fitted distributions are created later, inside each leaf of the tree.

from jpt.distributions import (
    SymbolicType,
    NumericType,
    Bool,
    Numeric,
)
from jpt.distributions.univariate import IntegerType
from jpt.variables import (
    SymbolicVariable,
    NumericVariable,
    IntegerVariable,
)

# The domain is a CLASS (not an instance).
Color = SymbolicType('Color', ['red', 'green', 'blue'])
color = SymbolicVariable('color', Color)

# Numeric domains can be plain or scaled.
temperature = NumericVariable('temperature')  # Numeric
height = NumericVariable(                     # ScaledNumeric
    'height',
    domain=NumericType(
        'Height',
        values=[165.0, 170.0, 180.0, 190.0],
    ),
)

# Integer domains specify a range.
die = IntegerVariable(
    'die',
    IntegerType('Die', lmin=1, lmax=6),
)

# Bool is a predefined symbolic domain with
# labels {True, False}.
raining = SymbolicVariable('raining', Bool)

When the tree calls variable.distribution() internally, it instantiates the domain class and passes relevant settings from the variable to the new distribution instance. The resulting object holds fitted parameters (probability vectors, quantile functions, etc.) for a specific leaf.
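
The class-versus-instance distinction can be illustrated without the library. The following is a minimal sketch in plain Python, not pyjpt's implementation: the names Categorical and make_symbolic_type are hypothetical stand-ins for a domain base class and a factory function. It only shows how a factory can return a new class and how each leaf could then hold its own instance.

```python
class Categorical:
    """Hypothetical stand-in for a symbolic domain base class (not jpt's)."""
    labels = ()

    def __init__(self, **settings):
        # A fresh instance is unfitted; parameters are filled in per leaf.
        self.settings = dict(settings)
        self.probabilities = None


def make_symbolic_type(name, labels):
    """Return a NEW CLASS whose class attribute `labels` fixes the domain."""
    return type(name, (Categorical,), {'labels': tuple(labels)})


# The factory returns a class, which is what a variable would store.
Color = make_symbolic_type('Color', ['red', 'green', 'blue'])

# Each leaf instantiates the class and fits its own parameters.
leaf_a = Color(min_impurity_improvement=0.0)
leaf_b = Color(min_impurity_improvement=0.0)
leaf_a.probabilities = [0.7, 0.2, 0.1]
```

Both instances share the label set defined on the class, while the fitted parameters stay local to each instance.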

Domain Factory Functions

Domains are created with factory functions that return new classes (not instances):

SymbolicType(name, labels)

Creates a Multinomial subclass. labels is a list of user-facing category names. Internally, each label is mapped to a zero-based integer index.

Fruit = SymbolicType('Fruit', ['apple', 'banana', 'cherry'])
Fruit.labels   # {0: 'apple', 1: 'banana', 2: 'cherry'}
Fruit.values   # {'apple': 0, 'banana': 1, 'cherry': 2}
Fruit.n_values  # 3

NumericType(name, values=None)

Creates a ScaledNumeric subclass. If values is provided (a sample of the expected data), the domain stores mean/scale normalization factors so that the tree works in standardized space internally while preserving the original scale externally. If no values are given, the plain Numeric class is used (no scaling).
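
As a rough illustration of mean/scale normalization, here is a standard-library sketch. It assumes the scale is the population standard deviation of the sample; pyjpt may compute its scale differently, and make_scaler is a hypothetical helper, not a library function.

```python
from statistics import fmean, pstdev


def make_scaler(sample):
    """Build label->value and value->label functions from a data sample."""
    mean = fmean(sample)
    # Assumption: population standard deviation; fall back to 1.0 when the
    # sample has no spread so that the division is always defined.
    scale = pstdev(sample) or 1.0

    def label2value(label):
        return (label - mean) / scale

    def value2label(value):
        return value * scale + mean

    return label2value, value2label


label2value, value2label = make_scaler([165.0, 170.0, 180.0, 190.0])
internal = label2value(180.0)      # standardized, used internally
external = value2label(internal)   # back on the original scale
```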

IntegerType(name, lmin=None, lmax=None)

Creates an Integer subclass. lmin and lmax define the label-space bounds (inclusive). Omitting a bound creates an open-ended domain. Internally, values are mapped to zero-based indices:

Score = IntegerType('Score', lmin=-2, lmax=2)
# labels (external): -2, -1, 0, 1, 2
# values (internal):  0,  1, 2, 3, 4

Bool

A predefined Multinomial subclass with two labels, False (index 0) and True (index 1). It is a ready-made class, not a factory.

Variable Types

All variables inherit from the abstract base class Variable, which provides the settings mechanism, serialization, and hashing. Variable cannot be instantiated directly; use one of the three concrete subclasses.

NumericVariable

For continuous, real-valued columns.

temp = NumericVariable(
    'temperature',
    domain=Numeric,               # default
    blur=0.05,                    # widen point evidence
    max_std=2.0,                  # stop splitting below
                                  # this standard deviation
    precision=0.01,               # quantile granularity
    min_impurity_improvement=0.0, # minimum split gain
)

Setting                     Description
blur                        Widens a single-value evidence point into an interval via the prior quantile function. Default: 0.
max_std                     If the standard deviation in a node drops below this limit, no further splits are attempted for this variable. Default: 0 (disabled).
precision                   Controls the granularity of the quantile-based density approximation. Default: 0.01.
min_impurity_improvement    Minimum impurity reduction required for a split on this variable. Default: 0.
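
One plausible reading of blur is sketched below. It assumes that blur is a probability mass centered on the evidence point's position in the prior CDF, and that the interval bounds come from the prior quantile function. The exact rule pyjpt applies is not stated here, and cdf, quantile, and blur_point are hypothetical helpers built on an empirical sample.

```python
import math
from bisect import bisect_right


def cdf(sorted_sample, x):
    """Empirical CDF: fraction of sample points <= x."""
    return bisect_right(sorted_sample, x) / len(sorted_sample)


def quantile(sorted_sample, p):
    """Empirical quantile: smallest sample point whose CDF is >= p."""
    n = len(sorted_sample)
    idx = min(n - 1, max(0, math.ceil(p * n) - 1))
    return sorted_sample[idx]


def blur_point(sorted_sample, x, blur):
    """Widen the point x into an interval covering `blur` probability mass."""
    p = cdf(sorted_sample, x)
    lower = quantile(sorted_sample, max(0.0, p - blur / 2))
    upper = quantile(sorted_sample, min(1.0, p + blur / 2))
    return lower, upper


prior_sample = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
interval = blur_point(prior_sample, 4.0, blur=0.25)   # (3.0, 5.0)
unblurred = blur_point(prior_sample, 4.0, blur=0.0)   # (4.0, 4.0)
```

With blur set to 0 the point stays a point, which matches the documented default.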

IntegerVariable

For discrete, integer-valued columns with a finite or open-ended range.

rolls = IntegerVariable(
    'rolls',
    IntegerType('Rolls', lmin=1, lmax=20),
    min_impurity_improvement=0.0,
)

Setting                     Description
min_impurity_improvement    Minimum impurity reduction required for a split on this variable. Default: 0.

SymbolicVariable

For categorical columns.

species = SymbolicVariable(
    'species',
    SymbolicType('Species', ['setosa', 'versicolor', 'virginica']),
    invert_impurity=False,
    min_impurity_improvement=0.0,
)

Setting                     Description
invert_impurity             Invert the Gini impurity for this variable, favoring mixed leaves. Default: False. See Impurity Inversion for Symbolic Variables.
min_impurity_improvement    Minimum impurity reduction required for a split on this variable. Default: 0.

Inferring Variables from a DataFrame

For quick setup, infer_from_dataframe() inspects column dtypes and creates one variable per column:

import pandas as pd
from jpt.variables import infer_from_dataframe

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol'],
    'age':  [30.0, 25.0, 35.0],
    'score': [3, 5, 4],
})

variables = infer_from_dataframe(df)
# [SymbolicVariable('name', ...),
#  NumericVariable('age', ...),
#  IntegerVariable('score', ...)]

The mapping from dtype to variable type is:

Column dtype            Variable type
bool, object, string    SymbolicVariable
float16/32/64           NumericVariable
int8/16/32/64           IntegerVariable
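
The table can be read as a simple dispatch on the dtype name. The sketch below restates it in plain Python; variable_kind is a hypothetical helper, not part of pyjpt, and raising TypeError for unlisted dtypes is an assumption, since the behavior for other dtypes is not described here.

```python
def variable_kind(dtype_name):
    """Map a dtype name (e.g. str(df[col].dtype)) to a variable type name."""
    name = dtype_name.lower()
    if name in ('bool', 'object', 'string'):
        return 'SymbolicVariable'
    if name in ('float16', 'float32', 'float64'):
        return 'NumericVariable'
    if name in ('int8', 'int16', 'int32', 'int64'):
        return 'IntegerVariable'
    # Assumption: dtypes outside the table are rejected.
    raise TypeError(f'no variable type for dtype {dtype_name!r}')


# dtype names as pandas would report them for the example frame above
column_dtypes = {'name': 'object', 'age': 'float64', 'score': 'int64'}
kinds = {col: variable_kind(dtype) for col, dtype in column_dtypes.items()}
```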

Useful keyword arguments:

  • scale_numeric_types (default True): use ScaledNumeric domains for float columns, storing mean/scale normalization from the column data.

  • unique_domain_names (default False): append a UUID to every domain name, preventing name collisions when calling the function multiple times.

  • excluded_columns: a dict mapping column names to user-provided domain classes, overriding automatic inference for those columns.

  • remove_nan (default False): exclude NaN / ±inf values when constructing numeric domains.

Labels vs. Values

pyjpt distinguishes two representations of data:

Labels (user-facing / exterior)

The human-readable representation: category strings ('red', 'blue'), raw floats (23.5), or integers (-2). Labels are what you see in DataFrames, queries, and printed results.

Values (internal / interior)

The representation used during tree learning and inference. For symbolic variables, labels are mapped to zero-based integer indices. For scaled numeric variables, labels are z-normalized. For integer variables with a lower bound lmin, labels are shifted to zero-based indices, as in the Score example above (label -2 becomes value 0). For plain numeric variables, labels and values are identical (identity mapping).

Every domain class provides bidirectional conversion:

Color = SymbolicType('Color', ['red', 'green', 'blue'])

# Label → Value
Color.values['red']         # 0
Color.label2value('green')  # 1
Color.label2value({'red', 'blue'})  # {0, 2}

# Value → Label
Color.labels[0]             # 'red'
Color.value2label(1)        # 'green'
Color.value2label({0, 2})   # {'red', 'blue'}

For numeric variables with scaling:

import numpy as np

Height = NumericType(
    'Height',
    values=np.array([165., 170., 180., 190.]),
)

# label2value normalizes (subtracts mean, divides by std)
internal = Height.label2value(180.0)

# value2label denormalizes
external = Height.value2label(internal)  # ≈ 180.0
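
Bounded integer domains follow the same two-way pattern, with a shift in place of scaling. The sketch below reproduces the Score mapping from the IntegerType section in plain Python. make_integer_maps is a hypothetical helper, and the range checks are an assumption about how out-of-domain inputs might be handled.

```python
def make_integer_maps(lmin, lmax):
    """Build label<->value conversions for the inclusive range [lmin, lmax]."""

    def label2value(label):
        if not lmin <= label <= lmax:
            raise ValueError(f'label {label} outside [{lmin}, {lmax}]')
        return label - lmin   # lmin maps to index 0

    def value2label(value):
        if not 0 <= value <= lmax - lmin:
            raise ValueError(f'value {value} outside [0, {lmax - lmin}]')
        return value + lmin

    return label2value, value2label


label2value, value2label = make_integer_maps(-2, 2)
values = [label2value(label) for label in range(-2, 3)]   # [0, 1, 2, 3, 4]
labels = [value2label(value) for value in values]         # [-2, -1, 0, 1, 2]
```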

The tree’s bind() method works in label space by default, so you pass human-readable values and the conversion happens automatically:

evidence = model.bind(color='red', temperature=22.5)

Variable Maps and Assignments

VariableMap

VariableMap is a dictionary-like container that maps Variable objects to arbitrary values. It supports lookup by both the Variable object and its name string:

from jpt.variables import VariableMap

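# Register both variables so that name-based access works.
vm = VariableMap(variables=[color, temperature])
vm[color] = 'red'         # set by Variable object
vm['temperature'] = 22.5  # set by name string (requires
                          # a registered variable)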

print(vm[color])         # 'red'
print(vm['color'])       # 'red'

VariableMap is used throughout the library: leaf distributions, prior distributions, query results, and moment computations all return VariableMap instances.

It supports standard dict operations — iteration (yields Variable objects), in, del, len, keys(), values(), items() — as well as copy(), in-place += / -=, and JSON serialization.
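
The dual lookup can be pictured as a dictionary keyed by variable objects plus a name index. The following is a simplified sketch, not the library's code: Var and NameIndexedMap are hypothetical classes that only cover item access, membership, iteration, and length.

```python
class Var:
    """Hypothetical minimal variable: just a name."""

    def __init__(self, name):
        self.name = name


class NameIndexedMap:
    """Dict-like container keyed by Var objects, also addressable by name."""

    def __init__(self, variables=()):
        self._by_name = {v.name: v for v in variables}
        self._data = {}

    def _resolve(self, key):
        # Strings are looked up in the name index; objects pass through.
        if isinstance(key, str):
            if key not in self._by_name:
                raise KeyError(f'unknown variable name: {key}')
            return self._by_name[key]
        return key

    def __setitem__(self, key, value):
        variable = self._resolve(key)
        self._by_name[variable.name] = variable
        self._data[variable] = value

    def __getitem__(self, key):
        return self._data[self._resolve(key)]

    def __contains__(self, key):
        try:
            return self._resolve(key) in self._data
        except KeyError:
            return False

    def __iter__(self):
        return iter(self._data)   # yields variable objects, not names

    def __len__(self):
        return len(self._data)


color = Var('color')
temperature = Var('temperature')
vm = NameIndexedMap(variables=[color, temperature])
vm[color] = 'red'
vm['temperature'] = 22.5
```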

VariableAssignment

VariableAssignment extends VariableMap with type validation for the values. It is an abstract class with two concrete subclasses that correspond to the two data representations:

LabelAssignment — stores values in label space.

from jpt.variables import LabelAssignment

la = LabelAssignment(variables=[color, temperature])
la[color] = {'red', 'green'}  # set of labels
la[temperature] = 22.5        # scalar → ContinuousSet

# Convert to internal representation:
va = la.value_assignment()

ValueAssignment — stores values in value space (integer indices for symbolic variables, normalized floats for scaled numerics).

# Convert back to labels:
la2 = va.label_assignment()

Assignments validate their inputs: setting a symbolic variable to a label that is not in its domain raises an error, as does setting a numeric variable to a non-numeric value.
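
The kind of validation described above might look like the sketch below. It is illustrative only: check_symbolic and check_numeric are hypothetical helpers, a plain tuple stands in for ContinuousSet, and the choice of ValueError and TypeError is an assumption, since the exception types pyjpt raises are not specified here.

```python
from numbers import Real


def check_symbolic(allowed_labels, assigned):
    """Return the assigned labels as a set, or raise if any is unknown."""
    assigned = {assigned} if isinstance(assigned, str) else set(assigned)
    unknown = assigned - set(allowed_labels)
    if unknown:
        raise ValueError(f'labels not in domain: {sorted(unknown)}')
    return assigned


def _is_number(x):
    # bool is a subclass of int, so exclude it explicitly.
    return isinstance(x, Real) and not isinstance(x, bool)


def check_numeric(assigned):
    """Return a closed interval (lower, upper) for a scalar or a pair."""
    if _is_number(assigned):
        return (float(assigned), float(assigned))   # degenerate interval
    if (isinstance(assigned, (list, tuple)) and len(assigned) == 2
            and all(_is_number(x) for x in assigned)):
        lower, upper = float(assigned[0]), float(assigned[1])
        if lower > upper:
            raise ValueError('lower bound exceeds upper bound')
        return (lower, upper)
    raise TypeError(f'not a numeric assignment: {assigned!r}')


color_labels = ['red', 'green', 'blue']
ok_labels = check_symbolic(color_labels, {'red', 'green'})
point = check_numeric(22.5)
span = check_numeric([20, 25])
```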

In practice, you rarely construct assignments directly. The tree’s bind() method builds a LabelAssignment from user-friendly keyword arguments:

# Keyword arguments are given in label space:
evidence = model.bind(color='red', temperature=[20, 25])

# The list [20, 25] is converted to ContinuousSet(20, 25)
# automatically.

Impurity Inversion for Symbolic Variables

By default the JPT learning algorithm minimizes the Gini impurity of every target variable, producing leaves in which each symbolic variable is as pure as possible — ideally a single dominant category.

Setting invert_impurity=True on a SymbolicVariable reverses this objective for that variable: the learner now favors splits that keep the variable’s distribution mixed within each leaf instead of separating it.

from jpt.trees import JPT

Gender = SymbolicType('Gender', ['female', 'male', 'other'])
gender = SymbolicVariable(
    'gender',
    Gender,
    invert_impurity=True,
)

# df is a DataFrame with one column per variable.
model = JPT(variables=[gender, ...])
model.fit(df)
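
To make the reversed objective concrete, the sketch below computes the Gini impurity of a set of labels and a flipped variant. It assumes inversion means replacing the normalized impurity g by 1 - g; the exact formula used by the learner is not given here, and split_score is a hypothetical name.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared category frequencies."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_score(labels, n_categories, invert=False):
    """Score to minimize. Normalizes Gini to [0, 1]; flips it if invert."""
    max_gini = 1.0 - 1.0 / n_categories   # reached by a uniform mix
    normalized = gini(labels) / max_gini
    return 1.0 - normalized if invert else normalized


pure = ['a', 'a', 'a', 'a']
mixed = ['a', 'a', 'b', 'b']

# Default objective: the pure node scores best (lowest).
default_scores = (split_score(pure, 2), split_score(mixed, 2))
# Inverted objective: the mixed node scores best (lowest).
inverted_scores = (split_score(pure, 2, invert=True),
                   split_score(mixed, 2, invert=True))
```

Under the default objective the pure node wins; with inversion the preference swaps, which is the behavior described above.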

When to Use Impurity Inversion

Fairness-aware learning. Mark a protected attribute (e.g. gender, ethnicity) with invert_impurity=True. The tree will avoid creating splits that segregate by that attribute, producing leaves where the protected attribute remains representative of the overall population. This is a lightweight structural fairness constraint — the model can still predict on other variables without building discriminatory partitions.

Confound suppression. In observational data a confounding variable (e.g. hospital site in a multi-site medical study) may dominate splits even though the goal is to learn patient-level patterns. Inverting impurity on the confounder forces the tree to find splits driven by other variables while keeping the confounder mixed in every leaf.

Balanced stratification for downstream tasks. If you need per-leaf statistics that should be computed over a representative distribution of some grouping variable (e.g. product category), inversion ensures each leaf retains a mix of all groups rather than splitting them apart.

Example

import pandas as pd
from jpt.distributions import SymbolicType
from jpt.variables import SymbolicVariable
from jpt.trees import JPT

df = pd.DataFrame({
    'fst': ['a', 'a', 'a', 'b', 'b', 'b'],
    'snd': ['c', 'd', 'c', 'd', 'c', 'd'],
})

AT = SymbolicType('AType', labels=['a', 'b'])
BT = SymbolicType('BType', labels=['c', 'd'])

A = SymbolicVariable('fst', AT, invert_impurity=True)
B = SymbolicVariable('snd', BT)

model = JPT([A, B])
model.fit(df)

# Each leaf retains a mix of 'a' and 'b' for variable
# ``fst`` instead of splitting them into pure nodes.
for leaf in model.leaves.values():
    print(leaf.distributions['fst'])

Variable Settings

Every variable carries a settings dictionary populated from class-level defaults and overridden by constructor arguments. Settings are accessible as regular attributes:

v = NumericVariable('x', blur=0.1, precision=0.05)
v.blur        # 0.1
v.precision   # 0.05
v.settings    # {'min_impurity_improvement': 0,
              #  'blur': 0.1, 'max_std_lbl': 0.0,
              #  'precision': 0.05}

Settings are included in equality checks and hashing, so two variables with the same name and domain but different settings are considered distinct. Settings are also preserved through JSON and pickle serialization.
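
The pattern of class-level defaults, constructor overrides, attribute access, and settings-aware equality can be sketched in plain Python as follows. SketchVariable and SketchNumericVariable are hypothetical classes written for illustration; they are not the library's Variable hierarchy and they ignore domains.

```python
class SketchVariable:
    # Class-level defaults; subclasses extend this dictionary.
    SETTINGS = {'min_impurity_improvement': 0}

    def __init__(self, name, **overrides):
        unknown = set(overrides) - set(self.SETTINGS)
        if unknown:
            raise TypeError(f'unknown settings: {sorted(unknown)}')
        self.name = name
        # Defaults first, then constructor arguments on top.
        self.settings = {**self.SETTINGS, **overrides}

    def __getattr__(self, item):
        # Called only when normal lookup fails: expose settings as attributes.
        settings = self.__dict__.get('settings', {})
        if item in settings:
            return settings[item]
        raise AttributeError(item)

    def __eq__(self, other):
        return (type(self) is type(other)
                and self.name == other.name
                and self.settings == other.settings)

    def __hash__(self):
        return hash((type(self), self.name,
                     tuple(sorted(self.settings.items()))))


class SketchNumericVariable(SketchVariable):
    SETTINGS = {**SketchVariable.SETTINGS, 'blur': 0, 'precision': 0.01}


a = SketchNumericVariable('x', blur=0.1)
b = SketchNumericVariable('x')               # same name, default settings
c = SketchNumericVariable('x', blur=0.1)     # identical to a
```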

Serialization

All variables and variable maps support JSON and pickle round-trips:

import json
import pickle
from jpt.variables import Variable

# JSON
data = json.dumps(color.to_json())
restored = Variable.from_json(json.loads(data))
assert color == restored

# Pickle
restored = pickle.loads(pickle.dumps(color))
assert color == restored

The JSON representation includes the variable type ('numeric', 'symbolic', 'integer'), name, serialized domain, and settings. Variable.from_json() dispatches to the correct subclass based on the type field.
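
Dispatching on a type field can be implemented with a small registry. The sketch below shows one way to do it in plain Python. TaggedVariable and its subclasses are hypothetical, the serialized domain is omitted for brevity, and none of this is taken from pyjpt's source.

```python
import json


class TaggedVariable:
    registry = {}      # maps type tag -> concrete class
    type_tag = None

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        if cls.type_tag is not None:
            TaggedVariable.registry[cls.type_tag] = cls

    def __init__(self, name, settings=None):
        self.name = name
        self.settings = dict(settings or {})

    def to_json(self):
        return {'type': self.type_tag,
                'name': self.name,
                'settings': self.settings}

    @staticmethod
    def from_json(data):
        # Pick the subclass registered under the 'type' field.
        cls = TaggedVariable.registry[data['type']]
        return cls(data['name'], data.get('settings'))

    def __eq__(self, other):
        return (type(self) is type(other)
                and self.to_json() == other.to_json())

    def __hash__(self):
        return hash((self.type_tag, self.name))


class TaggedNumeric(TaggedVariable):
    type_tag = 'numeric'


class TaggedSymbolic(TaggedVariable):
    type_tag = 'symbolic'


original = TaggedSymbolic('color', {'invert_impurity': False})
payload = json.dumps(original.to_json())
restored = TaggedVariable.from_json(json.loads(payload))
```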

See also

jpt.variables — full API reference.

How to Classify with JPTs — using symbolic targets for classification.

How to Predict Continuous Values with JPTs — working with numeric targets.