How to Work with Variables ========================== ``pyjpt`` uses typed variable objects to describe the columns of your data. Every variable has a **name** and a **domain** — a distribution class that defines the set of values the variable can take and how they are represented internally. This guide covers: - The relationship between variables and their domains - The three variable types and their settings - How to create variables manually and infer them from data - Labels (user-facing) vs. values (internal representation) - Variable maps and assignments for queries and results - Impurity inversion for symbolic variables .. contents:: On this page :local: :depth: 2 Variables and Domains --------------------- A variable is a pairing of a **name** (a string identifying a column) with a **domain** (a distribution *class* that defines the legal values). The domain is always a class — not an instance — because the variable describes a *type* of data, not a fitted distribution. Fitted distributions are created later, inside each leaf of the tree. .. code-block:: python from jpt.distributions import ( SymbolicType, NumericType, Bool, Numeric, ) from jpt.distributions.univariate import IntegerType from jpt.variables import ( SymbolicVariable, NumericVariable, IntegerVariable, ) # The domain is a CLASS (not an instance). Color = SymbolicType('Color', ['red', 'green', 'blue']) color = SymbolicVariable('color', Color) # Numeric domains can be plain or scaled. temperature = NumericVariable('temperature') # Numeric height = NumericVariable( # ScaledNumeric 'height', domain=NumericType( 'Height', values=[165.0, 170.0, 180.0, 190.0], ), ) # Integer domains specify a range. die = IntegerVariable( 'die', IntegerType('Die', lmin=1, lmax=6), ) # Bool is a predefined symbolic domain with # labels {True, False}. raining = SymbolicVariable('raining', Bool) When the tree calls ``variable.distribution()`` internally, it instantiates the domain class and passes relevant settings from the variable to the new distribution instance. The resulting object holds fitted parameters (probability vectors, quantile functions, etc.) for a specific leaf. Domain Factory Functions ~~~~~~~~~~~~~~~~~~~~~~~~ Domains are created with factory functions that return new *classes* (not instances): **SymbolicType(name, labels)** Creates a :py:class:`~jpt.distributions.univariate.Multinomial` subclass. ``labels`` is a list of user-facing category names. Internally, each label is mapped to a zero-based integer index. .. code-block:: python Fruit = SymbolicType('Fruit', ['apple', 'banana', 'cherry']) Fruit.labels # {0: 'apple', 1: 'banana', 2: 'cherry'} Fruit.values # {'apple': 0, 'banana': 1, 'cherry': 2} Fruit.n_values # 3 **NumericType(name, values=None)** Creates a :py:class:`~jpt.distributions.univariate.numeric.ScaledNumeric` subclass. If ``values`` is provided (a sample of the expected data), the domain stores mean/scale normalization factors so that the tree works in standardized space internally while preserving the original scale externally. If no values are given, the plain ``Numeric`` class is used (no scaling). **IntegerType(name, lmin=None, lmax=None)** Creates an :py:class:`~jpt.distributions.univariate.Integer` subclass. ``lmin`` and ``lmax`` define the label-space bounds (inclusive). Omitting a bound creates an open-ended domain. Internally, values are mapped to zero-based indices: .. code-block:: python Score = IntegerType('Score', lmin=-2, lmax=2) # labels (external): -2, -1, 0, 1, 2 # values (internal): 0, 1, 2, 3, 4 **Bool** A predefined ``Multinomial`` subclass with two labels, ``False`` (index 0) and ``True`` (index 1). It is a ready-made class, not a factory. Variable Types -------------- All variables inherit from the abstract base class :py:class:`~jpt.variables.Variable`, which provides the settings mechanism, serialization, and hashing. ``Variable`` cannot be instantiated directly; use one of the three concrete subclasses. NumericVariable ~~~~~~~~~~~~~~~ For continuous, real-valued columns. .. code-block:: python temp = NumericVariable( 'temperature', domain=Numeric, # default blur=0.05, # widen point evidence max_std=2.0, # stop splitting below # this standard deviation precision=0.01, # quantile granularity min_impurity_improvement=0.0, # minimum split gain ) ============================================ ================================ Setting Description ============================================ ================================ ``blur`` Widens a single-value evidence point into an interval via the prior quantile function. Default: ``0``. ``max_std`` If the standard deviation in a node drops below this limit, no further splits are attempted for this variable. Default: ``0`` (disabled). ``precision`` Controls the granularity of the quantile-based density approximation. Default: ``0.01``. ``min_impurity_improvement`` Minimum impurity reduction required for a split on this variable. Default: ``0``. ============================================ ================================ IntegerVariable ~~~~~~~~~~~~~~~ For discrete, integer-valued columns with a finite or open-ended range. .. code-block:: python rolls = IntegerVariable( 'rolls', IntegerType('Rolls', lmin=1, lmax=20), min_impurity_improvement=0.0, ) ============================================ ================================ Setting Description ============================================ ================================ ``min_impurity_improvement`` Minimum impurity reduction required for a split on this variable. Default: ``0``. ============================================ ================================ SymbolicVariable ~~~~~~~~~~~~~~~~ For categorical columns. .. code-block:: python species = SymbolicVariable( 'species', SymbolicType('Species', ['setosa', 'versicolor', 'virginica']), invert_impurity=False, min_impurity_improvement=0.0, ) ============================================ ================================ Setting Description ============================================ ================================ ``invert_impurity`` Invert the Gini impurity for this variable, favoring *mixed* leaves. Default: ``False``. See :ref:`impurity-inversion`. ``min_impurity_improvement`` Minimum impurity reduction required for a split on this variable. Default: ``0``. ============================================ ================================ Inferring Variables from a DataFrame ------------------------------------ For quick setup, :py:func:`~jpt.variables.infer_from_dataframe` inspects column dtypes and creates one variable per column: .. code-block:: python import pandas as pd from jpt.variables import infer_from_dataframe df = pd.DataFrame({ 'name': ['Alice', 'Bob', 'Carol'], 'age': [30.0, 25.0, 35.0], 'score': [3, 5, 4], }) variables = infer_from_dataframe(df) # [SymbolicVariable('name', ...), # NumericVariable('age', ...), # IntegerVariable('score', ...)] The mapping from dtype to variable type is: =========================== ================================== Column dtype Variable type =========================== ================================== ``bool``, ``object``, ``SymbolicVariable`` ``string`` ``float16/32/64`` ``NumericVariable`` ``int8/16/32/64`` ``IntegerVariable`` =========================== ================================== Useful keyword arguments: - **scale_numeric_types** (default ``True``): use ``ScaledNumeric`` domains for float columns, storing mean/scale normalization from the column data. - **unique_domain_names** (default ``False``): append a UUID to every domain name, preventing name collisions when calling the function multiple times. - **excluded_columns**: a dict mapping column names to user-provided domain classes, overriding automatic inference for those columns. - **remove_nan** (default ``False``): exclude ``NaN`` / ``±inf`` values when constructing numeric domains. Labels vs. Values ----------------- ``pyjpt`` distinguishes two representations of data: **Labels** (user-facing / exterior) The human-readable representation: category strings (``'red'``, ``'blue'``), raw floats (``23.5``), or integers (``-2``). Labels are what you see in DataFrames, queries, and printed results. **Values** (internal / interior) The representation used during tree learning and inference. For symbolic variables, labels are mapped to zero-based integer indices. For scaled numeric variables, labels are z-normalized. For plain numeric and integer variables, labels and values are identical (identity mapping). Every domain class provides bidirectional conversion: .. code-block:: python Color = SymbolicType('Color', ['red', 'green', 'blue']) # Label → Value Color.values['red'] # 0 Color.label2value('green') # 1 Color.label2value({'red', 'blue'}) # {0, 2} # Value → Label Color.labels[0] # 'red' Color.value2label(1) # 'green' Color.value2label({0, 2}) # {'red', 'blue'} For numeric variables with scaling: .. code-block:: python import numpy as np Height = NumericType( 'Height', values=np.array([165., 170., 180., 190.]), ) # label2value normalizes (subtracts mean, divides by std) internal = Height.label2value(180.0) # value2label denormalizes external = Height.value2label(internal) # ≈ 180.0 The tree's ``bind()`` method works in **label space** by default, so you pass human-readable values and the conversion happens automatically: .. code-block:: python evidence = model.bind(color='red', temperature=22.5) Variable Maps and Assignments ----------------------------- VariableMap ~~~~~~~~~~~ :py:class:`~jpt.variables.VariableMap` is a dictionary-like container that maps ``Variable`` objects to arbitrary values. It supports lookup by both the ``Variable`` object and its name string: .. code-block:: python from jpt.variables import VariableMap vm = VariableMap() vm[color] = 'red' # set by Variable object vm['temperature'] = 22.5 # set by name string (if # registered) print(vm[color]) # 'red' print(vm['color']) # 'red' ``VariableMap`` is used throughout the library: leaf distributions, prior distributions, query results, and moment computations all return ``VariableMap`` instances. It supports standard dict operations — iteration (yields ``Variable`` objects), ``in``, ``del``, ``len``, ``keys()``, ``values()``, ``items()`` — as well as ``copy()``, in-place ``+=`` / ``-=``, and JSON serialization. VariableAssignment ~~~~~~~~~~~~~~~~~~ :py:class:`~jpt.variables.VariableAssignment` extends ``VariableMap`` with **type validation** for the values. It is an abstract class with two concrete subclasses that correspond to the two data representations: **LabelAssignment** — stores values in label space. .. code-block:: python from jpt.variables import LabelAssignment la = LabelAssignment(variables=[color, temperature]) la[color] = {'red', 'green'} # set of labels la[temperature] = 22.5 # scalar → ContinuousSet # Convert to internal representation: va = la.value_assignment() **ValueAssignment** — stores values in value space (integer indices for symbolic variables, normalized floats for scaled numerics). .. code-block:: python # Convert back to labels: la2 = va.label_assignment() Assignments validate their inputs: setting a symbolic variable to a label that is not in its domain raises an error, as does setting a numeric variable to a non-numeric value. In practice, you rarely construct assignments directly. The tree's ``bind()`` method builds a ``LabelAssignment`` from user-friendly keyword arguments: .. code-block:: python # These are equivalent: evidence = model.bind(color='red', temperature=[20, 25]) # The list [20, 25] is converted to ContinuousSet(20, 25) # automatically. .. _impurity-inversion: Impurity Inversion for Symbolic Variables ----------------------------------------- By default the JPT learning algorithm minimizes the Gini impurity of every target variable, producing leaves in which each symbolic variable is as *pure* as possible — ideally a single dominant category. Setting ``invert_impurity=True`` on a :py:class:`~jpt.variables.SymbolicVariable` reverses this objective for that variable: the learner now favors splits that keep the variable's distribution *mixed* within each leaf instead of separating it. .. code-block:: python Gender = SymbolicType('Gender', ['female', 'male', 'other']) gender = SymbolicVariable( 'gender', Gender, invert_impurity=True, ) model = JPT(variables=[gender, ...]) model.fit(df) When to Use Impurity Inversion ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **Fairness-aware learning.** Mark a protected attribute (e.g. gender, ethnicity) with ``invert_impurity=True``. The tree will avoid creating splits that segregate by that attribute, producing leaves where the protected attribute remains representative of the overall population. This is a lightweight structural fairness constraint — the model can still predict on other variables without building discriminatory partitions. **Confound suppression.** In observational data a confounding variable (e.g. hospital site in a multi-site medical study) may dominate splits even though the goal is to learn patient-level patterns. Inverting impurity on the confounder forces the tree to find splits driven by other variables while keeping the confounder mixed in every leaf. **Balanced stratification for downstream tasks.** If you need per-leaf statistics that should be computed over a representative distribution of some grouping variable (e.g. product category), inversion ensures each leaf retains a mix of all groups rather than splitting them apart. Example ~~~~~~~ .. code-block:: python import pandas as pd from jpt.distributions import SymbolicType from jpt.variables import SymbolicVariable from jpt.trees import JPT df = pd.DataFrame({ 'fst': ['a', 'a', 'a', 'b', 'b', 'b'], 'snd': ['c', 'd', 'c', 'd', 'c', 'd'], }) AT = SymbolicType('AType', labels=['a', 'b']) BT = SymbolicType('BType', labels=['c', 'd']) A = SymbolicVariable('fst', AT, invert_impurity=True) B = SymbolicVariable('snd', BT) model = JPT([A, B]) model.fit(df) # Each leaf retains a mix of 'a' and 'b' for variable # ``fst`` instead of splitting them into pure nodes. for leaf in model.leaves.values(): print(leaf.distributions['fst']) Variable Settings ----------------- Every variable carries a ``settings`` dictionary populated from class-level defaults and overridden by constructor arguments. Settings are accessible as regular attributes: .. code-block:: python v = NumericVariable('x', blur=0.1, precision=0.05) v.blur # 0.1 v.precision # 0.05 v.settings # {'min_impurity_improvement': 0, # 'blur': 0.1, 'max_std_lbl': 0.0, # 'precision': 0.05} Settings are included in equality checks and hashing, so two variables with the same name and domain but different settings are considered distinct. Settings are also preserved through JSON and pickle serialization. Serialization ------------- All variables and variable maps support JSON and pickle round-trips: .. code-block:: python import json import pickle from jpt.variables import Variable # JSON data = json.dumps(color.to_json()) restored = Variable.from_json(json.loads(data)) assert color == restored # Pickle restored = pickle.loads(pickle.dumps(color)) assert color == restored The JSON representation includes the variable type (``'numeric'``, ``'symbolic'``, ``'integer'``), name, serialized domain, and settings. ``Variable.from_json()`` dispatches to the correct subclass based on the ``type`` field. .. seealso:: :py:mod:`jpt.variables` — full API reference. :doc:`howto_classification` — using symbolic targets for classification. :doc:`howto_regression` — working with numeric targets.