jpt.trees

Classes

`Node`	Wrapper for the nodes of the `jpt.learning.trees.Tree`.
`DecisionNode`	Represents an inner (decision) node of the `jpt.learning.trees.Tree`.
`Leaf`	Represents a leaf node of the `jpt.trees.Tree`.
`JPT`	Implementation of Joint Probability Trees (JPTs).

Module Contents

class jpt.trees.Node(idx: int, parent: DecisionNode | None = None)

Wrapper for the nodes of the jpt.learning.trees.Tree.

Create a Node.

Parameters:

idx – the identifier of a node
parent – the parent of this node

idx

parent: DecisionNode = None

samples = 0

_path = []

property path: jpt.variables.VariableMap

Returns:: the path of this Node as VariableMap

consistent_with(evidence: jpt.variables.VariableMap) → bool

Check if the node is consistent with the variable assignments in evidence.

Parameters:: evidence – A VariableMap that maps to singular values (numeric or symbolic) or ranges (continuous set, set)
Returns:: bool

format_path(fmt: str = None, precision: int = None) → str

abstract number_of_parameters() → int

__str__() → str

__repr__() → str

depth() → int

Returns:: the depth of this node

contains(samples: numpy.ndarray, variable_index_map: jpt.variables.VariableMap) → numpy.array

Check if this node contains the given samples in parallel.

Parameters:

samples – The samples to check
variable_index_map – A VariableMap mapping to the indices in ‘samples’

Returns:

numpy array with 0s and 1s

class jpt.trees.DecisionNode(idx: int | None, variable: jpt.variables.Variable, parent: 'DecisionNode' or None = None)

Bases: Node

Represents an inner (decision) node of the jpt.learning.trees.Tree.

Create a DecisionNode

Parameters:

idx – The identifier of a node
variable – The split variable
parent – The parent of this node

_splits = None

variable

children: None or List[Node] = None

__hash__()

__eq__(o) → bool

to_json() → Dict[str, Any]

Returns:: The DecisionNode as a json serializable dict.

static from_json(tree: JPT, data: Dict[str, Any]) → DecisionNode

Construct a DecisionNode from a JSON dict.

Parameters:

tree – The tree to mount the node in
data – The data describing the members of the node

Returns:

the constructed and mounted DecisionNode

property splits: List

set_child(idx: int, node: Node) → None

Set the child at index idx of this node. Also extend the path of the child node with this node’s path.

Parameters:

idx – the index of the child (0 for left, 1 for right)
node – The child

str_edge(idx_split: int) → str

Convert the edge to the child at idx_split to a string.

Parameters:: idx_split – The index of the child
Returns:: str

property str_node: str

recursive_children()

Returns:: All children of this node

__str__() → str

__repr__() → str

number_of_parameters() → int

Returns:: The number of relevant parameters in this decision node. Two parameters are necessary, since the variable and its splitting value are sufficient to describe this computation unit.

class jpt.trees.Leaf(idx: int, parent: DecisionNode or None = None, prior: float or None = None)

Bases: Node

Represents a leaf node of the jpt.trees.Tree.

Construct a Leaf.

Parameters:

idx – the index of this leaf
parent – the parent of this leaf
prior – the prior of this leaf (relative number of samples in this leaf)

distributions

prior = None

s_indices = []

property str_node: str

applies(query: jpt.variables.VariableAssignment) → bool

Checks whether this leaf is consistent with the given query.

Parameters:: query – the query to check
Returns:: bool

property value

recursive_children()

Returns:: All children of this node

__str__() → str

__repr__() → str

__hash__()

to_json() → Dict[str, Any]

Returns:: The Leaf as a JSON-serializable dict.

static from_json(tree: JPT, data: Dict[str, Any]) → Leaf

Construct a Leaf from a JSON dict.

Parameters:

tree – The tree to mount the node in
data – The data describing the members of the node

Returns:

the constructed and mounted Leaf

__eq__(o) → bool

consistent_with(evidence: jpt.variables.VariableMap) → bool

Check if the node is consistent with the variable assignments in evidence.

Parameters:: evidence – A preprocessed VariableMap that maps to singular values (numeric or symbolic) or ranges (continuous set, set)

path_consistent_with(evidence: jpt.variables.VariableMap) → bool

Check if the path of this node is consistent with the variable assignments in evidence.

Parameters:: evidence – A preprocessed VariableMap that maps to singular values (numeric or symbolic) or ranges (continuous set, set)

probability(query: jpt.variables.VariableAssignment, dirac_scaling: float = 2.0, min_distances: jpt.variables.VariableMap = None) → float

Calculate the probability of a (partial) query. Exploits the independence assumption.

Parameters:

query (VariableMap) – A preprocessed VariableMap that maps to singular values (numeric or symbolic) or ranges (continuous set, set)
dirac_scaling (float) – the minimal distance between the samples within a dimension is multiplied by this factor if a Dirac impulse is used to model the variable.
min_distances (A VariableMap from numeric variables to floats or None) – A dict mapping the variables to the minimal distances between the observations. This can be useful to use the same likelihood parameters for different test sets for example in cross validation processes.
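The independence assumption means that, within a single leaf, the probability of a partial query factorizes into a product over the queried variables. The following is a minimal sketch of that idea for symbolic variables only. The function `leaf_probability` and the dictionary layout are hypothetical and are not part of `jpt.trees`; numeric variables and the `dirac_scaling` handling are left out.

```python
from math import prod


def leaf_probability(distributions, query):
    """Probability of a partial query under a fully factorized leaf.

    distributions: {variable name: {value: probability}} (hypothetical layout)
    query: {variable name: set of admissible values}
    Variables that are missing from the query contribute a factor of 1.
    """
    return prod(
        sum(p for value, p in distributions[var].items() if value in allowed)
        for var, allowed in query.items()
    )


leaf = {
    "color": {"red": 0.5, "green": 0.25, "blue": 0.25},
    "shape": {"round": 0.75, "square": 0.25},
}
# P(color in {red, green}) * P(shape = round)
p = leaf_probability(leaf, {"color": {"red", "green"}, "shape": {"round"}})
```

An empty query yields 1, which matches the intuition that an unconstrained query is always satisfied.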

_numeric_probability(variable: jpt.variables.NumericVariable, value, dirac_scaling: float = 2.0, min_distances: jpt.variables.VariableMap = None)

Calculate the probability of an arbitrary value for a numeric variable.

Parameters:

variable – A numeric variable
dirac_scaling – the minimal distance between the samples within a dimension is multiplied by this factor if a Dirac impulse is used to model the variable.
min_distances – A dict mapping the variables to the minimal distances between the observations. This can be useful to use the same likelihood parameters for different test sets for example in cross validation processes.

likelihood(queries: pandas.DataFrame, dirac_scaling: float = 2.0, min_distances: jpt.variables.VariableMap = None, single_likelihoods: bool = False, variables: Iterable[jpt.variables.Variable | str] = None) → numpy.ndarray

Calculate the probability of a (partial) query. Exploits the independence assumption.

Parameters:

queries – An array-like object that represents variable assignments in value space.
dirac_scaling (float) – the minimal distance between the samples within a dimension is multiplied by this factor if a Dirac impulse is used to model the variable.
min_distances (A VariableMap from numeric variables to floats or None) – A dict mapping the variables to the minimal distances between the observations. This can be useful to use the same likelihood parameters for different test sets, for example in cross-validation processes.
single_likelihoods – whether the likelihoods of each variable shall be reported
variables – the variables to consider in the likelihood calculation

copy() → Leaf

Create a copy of this leaf. The copy is unaware of the tree and vice versa. Hence, no path, parent, etc. is set. The copy only provides querying functionality.

conditional_leaf(evidence: jpt.variables.VariableAssignment) → Leaf

Create a leaf that is cropped to the values described in evidence.

Parameters:: evidence – A VariableAssignment describing evidence.
Returns:: The cropped leaf, which has no parent, path, etc. set.

mpe(minimal_distances: jpt.variables.VariableMap) → tuple[jpt.variables.VariableMap, float]

Calculate the most probable explanation of this leaf as a fully factorized distribution.

Returns:: the likelihood of the maximum as a float and the configuration as a VariableMap

k_mpe() → Iterator[jpt.variables.LabelAssignment]

Compute the k most probable explanations of this leaf.

number_of_parameters() → int

Returns:: The number of relevant parameters in this leaf. Leaves require 1 + the sum of the parameters of all their distributions. The 1 extra parameter represents the prior.
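The two counting rules (2 per decision node, 1 plus the distribution parameters per leaf) can be sketched on a toy tree. The nested-dict layout and the function `count_parameters` are hypothetical, and the sketch assumes that the total of a tree is simply the sum over all of its nodes.

```python
def count_parameters(node):
    """Count parameters of a toy tree given as nested dicts (hypothetical layout).

    A decision node is {"children": [...]}; a leaf is
    {"distribution_parameters": [int, ...]} with one entry per variable.
    """
    if "children" in node:
        # 2 parameters (split variable and splitting value) plus the subtrees
        return 2 + sum(count_parameters(child) for child in node["children"])
    # 1 parameter for the leaf prior plus the parameters of all distributions
    return 1 + sum(node["distribution_parameters"])


tree = {
    "children": [
        {"distribution_parameters": [3, 2]},
        {"distribution_parameters": [4, 2]},
    ]
}
total = count_parameters(tree)  # 2 + (1 + 5) + (1 + 6)
```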

sample(amount) → numpy.ndarray

Sample amount many samples from the leaf.

Returns:: A numpy array of size (amount, self.variables) containing the samples.
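Because a leaf is a fully factorized distribution, sampling from it can be pictured as drawing every variable independently and stacking the draws column-wise. The sketch below uses plain lists instead of a numpy array and symbolic distributions only; `sample_leaf` and the data layout are hypothetical and not the library's implementation.

```python
import random


def sample_leaf(distributions, amount, rng):
    """Draw `amount` rows from a factorized leaf; one column per variable."""
    columns = []
    for dist in distributions.values():
        values = list(dist)
        weights = [dist[value] for value in values]
        # each variable is sampled independently of all others
        columns.append(rng.choices(values, weights=weights, k=amount))
    # transpose the per-variable columns into rows
    return [list(row) for row in zip(*columns)]


leaf = {"color": {"red": 0.5, "blue": 0.5}, "shape": {"round": 1.0}}
rows = sample_leaf(leaf, amount=5, rng=random.Random(0))
```

The result has `amount` rows and one column per variable of the leaf.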

class jpt.trees.JPT(variables: list[jpt.variables.Variable], targets: list[str | jpt.variables.Variable] = None, features: list[str | jpt.variables.Variable] = None, min_samples_leaf: float | int = 1, min_impurity_improvement: float | None = None, max_leaves: int | None = None, max_depth: int | None = None, dependencies=None, min_eval_samples: float | int = 0)

Implementation of Joint Probability Trees (JPTs).

Create a JPT.

Parameters:

variables – The variables represented by this model.
targets – The variables where the information gain will be computed on.
features – The variables where splits are chosen from.
min_samples_leaf – If int, the minimum number of samples required to form a leaf. If float, the minimum fraction of samples.
min_eval_samples – Minimum number of EVALUATION samples required in each child partition when split validation is active in 'evaluation' mode. Only enforced when a split_validation_mask is passed to learn() and split_validation_mode='evaluation'. If int, the absolute minimum. If a float in (0, 1), the minimum fraction of the total training rows (same convention as min_samples_leaf). 0 disables the check (default).
min_impurity_improvement – The minimal information gain to justify a split.
max_leaves – The maximum number of leaves (deprecated).
max_depth – The maximum depth the tree may have.
dependencies –
Specifies which targets depend on which features. Accepts three forms:
- None: every target depends on every feature (default, fully connected).
- dict[Variable, list[Variable]]: explicit mapping from features to their dependent targets.
- A DependencyDiscovery instance: a callable strategy that discovers dependencies from training data during learn(). The strategy is re-invoked on each call to learn() and its configuration is preserved during serialization.

logger

_variables

varnames: collections.OrderedDict[str, jpt.variables.Variable]

_targets

leaves: dict[int, Leaf]

innernodes: dict[int, DecisionNode]

priors: jpt.variables.VariableMap

min_samples_leaf = 1

min_eval_samples = 0

_keep_samples = False

min_impurity_improvement = 0

minimal_distances: jpt.variables.VariableMap

_numsamples = 0

root = None

max_leaves = None

max_depth

_reset() → None

Delete all parameters of this model (not the hyperparameters).

property allnodes: MutableMapping[int, Node]

property variables: tuple[jpt.variables.Variable, Ellipsis]

property targets: tuple[jpt.variables.Variable, Ellipsis]

property features: tuple[jpt.variables.Variable, Ellipsis]

property numeric_variables: tuple[jpt.variables.Variable, Ellipsis]

property symbolic_variables: tuple[jpt.variables.Variable, Ellipsis]

property integer_variables: tuple[jpt.variables.Variable, Ellipsis]

property numeric_targets: tuple[jpt.variables.Variable, Ellipsis]

property symbolic_targets: tuple[jpt.variables.Variable, Ellipsis]

property integer_targets: tuple[jpt.variables.Variable, Ellipsis]

property numeric_features: tuple[jpt.variables.Variable, Ellipsis]

property symbolic_features: tuple[jpt.variables.Variable, Ellipsis]

property integer_features: tuple[jpt.variables.Variable, Ellipsis]

to_json() → dict[str, Any]

Convert the tree to a JSON-serializable dictionary.

static from_json(data: dict[str, Any], variables: Iterable[jpt.variables.Variable] | None = None) → JPT

Construct a tree from a json dict.

Parameters:

data – The JSON dictionary holding the serialized JPT data.
variables – (optional) An iterable holding the already de-serialized variables the JPT shall be constructed with.

__getstate__()

__setstate__(state)

__eq__(o) → bool

encode(samples: numpy.ndarray) → numpy.ndarray

Get the leaf index that describes the partition of each sample. Only works for fully initialized samples, i.e. a matrix of arbitrarily many rows but #variables many columns.

Parameters:: samples – the samples to evaluate
Returns:: A 1D numpy array of integers containing the leaf index of every sample.
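Conceptually, encoding a sample means walking from the root to the leaf whose partition contains it. The sketch below does this for a toy tree with numeric threshold splits. The nested-dict layout, the `column`/`threshold` keys and the "left if below the threshold" convention are assumptions made for illustration; only the child order (0 for left, 1 for right) is taken from `set_child`.

```python
def encode(root, samples):
    """Return the index of the leaf each fully specified sample falls into.

    Toy layout (hypothetical): a decision node is
    {"column": int, "threshold": float, "children": [left, right]},
    a leaf is {"idx": int}.
    """
    indices = []
    for row in samples:
        node = root
        while "children" in node:
            # child 0 (left) for values below the threshold, child 1 otherwise
            go_right = row[node["column"]] >= node["threshold"]
            node = node["children"][int(go_right)]
        indices.append(node["idx"])
    return indices


tree = {
    "column": 0, "threshold": 2.5,
    "children": [
        {"idx": 1},
        {"column": 1, "threshold": 0.0, "children": [{"idx": 3}, {"idx": 4}]},
    ],
}
leaf_ids = encode(tree, [[1.0, 5.0], [3.0, -1.0], [3.0, 2.0]])
```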

pdf(values: jpt.variables.VariableAssignment) → float

Get the likelihood of one world.

Parameters:: values – A VariableMap mapping some variables to one value.
Returns:: The likelihood as float

For each candidate leaf l calculate the number of samples in which query is true:

\[P(\text{query} \mid \text{evidence}) = \frac{p_q}{p_e} \tag{1}\]

\[p_q = \frac{c}{N} \tag{2}\]

\[c = \frac{\prod{F}}{x^{n-1}} \tag{3}\]

where \(Q\) is the set of variables in query, \(P_{l}\) is the set of variables that occur in \(l\), \(F = \{v \mid v \in Q \wedge v \notin P_{l}\}\) is the set of variables in the query that do not occur in \(l\)’s path, \(x = |S_{l}|\) is the number of samples in \(l\), \(n = |F|\) is the number of free variables and \(N\) is the number of samples represented by the entire tree.

Parameters:

query (dict of {jpt.variables.Variable : jpt.learning.distributions.Distribution.value}) – the event to query for, i.e. the query part of the conditional P(query|evidence) or the prior P(query)
evidence (dict of {jpt.variables.Variable : jpt.learning.distributions.Distribution.value}) – the event conditioned on, i.e. the evidence part of the conditional P(query|evidence)
fail_on_unsatisfiability – whether an error is raised in case of unsatisfiable evidence or not.
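The ratio P(query|evidence) = p_q / p_e can be illustrated on a toy model that treats the tree as a prior-weighted mixture of fully factorized leaves. This is a sketch of the ratio only, not of the sample-counting scheme above, and not the library's implementation: `leaf_prob`, `conditional`, the data layout and the use of `ValueError` in place of the library's unsatisfiability error are all assumptions.

```python
from math import prod


def leaf_prob(distributions, assignment):
    """Probability of `assignment` ({variable: set of values}) in one leaf."""
    return prod(
        sum(p for value, p in distributions[var].items() if value in allowed)
        for var, allowed in assignment.items()
    )


def conditional(leaves, query, evidence, fail_on_unsatisfiability=True):
    """P(query | evidence) for a list of (prior, distributions) leaves."""
    # query and evidence combined; shared variables are intersected
    joint = {**evidence, **{
        var: allowed & evidence.get(var, allowed)
        for var, allowed in query.items()
    }}
    p_e = sum(prior * leaf_prob(dists, evidence) for prior, dists in leaves)
    if p_e == 0:
        if fail_on_unsatisfiability:
            raise ValueError("evidence has probability 0")  # stand-in error
        return None
    p_q = sum(prior * leaf_prob(dists, joint) for prior, dists in leaves)
    return p_q / p_e


leaves = [
    (0.5, {"weather": {"sun": 0.75, "rain": 0.25},
           "mood": {"good": 0.75, "bad": 0.25}}),
    (0.5, {"weather": {"sun": 0.25, "rain": 0.75},
           "mood": {"good": 0.25, "bad": 0.75}}),
]
result = conditional(leaves, {"mood": {"good"}}, {"weather": {"sun"}})
```

With `fail_on_unsatisfiability=False`, impossible evidence leads to `None` instead of an exception in this sketch.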

posterior(variables: list[jpt.variables.Variable | str] = None, evidence: dict[jpt.variables.Variable | str, Any] | jpt.variables.VariableAssignment = None, fail_on_unsatisfiability: bool = True, report_inconsistencies: bool = False) → jpt.variables.VariableMap | None

Compute the posterior distribution of every variable in variables. The result contains independent distributions. Be aware that they might not actually be independent.

Parameters:

variables – The query variables of the posterior to be computed
evidence – The evidence given for the posterior to be computed
fail_on_unsatisfiability – Whether or not an Unsatisfiability error is raised if the likelihood of the evidence is 0.
report_inconsistencies – In case of an Unsatisfiability error, the exception raised will contain information about the variable assignments that caused the inconsistency.

Returns:

jpt.trees.PosteriorResult containing distributions, candidates and weights

Compute the expected value of all variables. If no variables are passed, it defaults to all variables not passed as evidence.

Parameters:

variables – The variables to compute the expectation distributions on
evidence – The raw evidence applied to the tree
fail_on_unsatisfiability – Whether or not an Unsatisfiability error is raised if the likelihood of the evidence is 0.

Returns:

VariableMap

mpe(evidence: Dict[jpt.variables.Variable | str, Any] | jpt.variables.VariableAssignment = None, fail_on_unsatisfiability: bool = True) → Tuple[list[jpt.variables.LabelAssignment], float] | None

Calculate the most probable explanation of all variables of the tree given the evidence.

Parameters:

evidence – The evidence that is applied to the tree
fail_on_unsatisfiability – Whether or not an Unsatisfiability error is raised if the likelihood of the evidence is 0.

Returns:

List of LabelAssignments that describes all maxima of the tree given the evidence. Additionally, a float describing the likelihood of all solutions is returned.
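To see why the result is a list of maxima together with a single likelihood, consider the following toy sketch. It assumes that the leaves cover disjoint regions, so the global maximum is the best prior-weighted leaf maximum, and that each leaf is fully factorized, so its maximum is the product of the per-variable maxima. `leaf_mpe`, `tree_mpe` and the data layout are hypothetical; evidence handling is omitted.

```python
from itertools import product


def leaf_mpe(distributions):
    """Maxima and maximum probability of one fully factorized leaf."""
    best_values = []
    likelihood = 1.0
    for var, dist in distributions.items():
        top = max(dist.values())
        likelihood *= top
        # keep every value that attains the maximum (ties give several maxima)
        best_values.append([(var, v) for v, p in dist.items() if p == top])
    states = [dict(combo) for combo in product(*best_values)]
    return states, likelihood


def tree_mpe(leaves):
    """Maxima over all (prior, distributions) leaves and their likelihood."""
    scored = [(prior * lik, states)
              for prior, dists in leaves
              for states, lik in [leaf_mpe(dists)]]
    best = max(score for score, _ in scored)
    maxima = [s for score, states in scored if score == best for s in states]
    return maxima, best


leaves = [
    (0.5, {"x": {"a": 0.5, "b": 0.5}, "y": {"u": 1.0}}),
    (0.5, {"x": {"c": 1.0}, "y": {"u": 0.25, "v": 0.75}}),
]
maxima, likelihood = tree_mpe(leaves)
```

The first leaf alone has two maxima (`x = a` and `x = b`), which shows how ties produce more than one explanation.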

kmpe(evidence: dict[jpt.variables.Variable | str, Any] | jpt.variables.VariableAssignment = None, fail_on_unsatisfiability: bool = True, k: int = 0) → Iterator[Tuple[jpt.variables.LabelAssignment, float]] | None

Perform a k-MPE inference on this JPT under the given evidence.

k-MPE yields the k most probable explanation states in decreasing order.

Parameters:

evidence – The evidence to apply
fail_on_unsatisfiability – Whether or not to raise an Unsatisfiability error on impossible evidence.
k – the number of solutions to return

Returns:

An iterator with states ordered by likelihood.
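The contract "k states in decreasing order of likelihood" can be illustrated with a brute-force sketch over a single factorized distribution. It enumerates every joint state, which is only feasible for tiny symbolic domains and is not the algorithm used by the library; `k_mpe_bruteforce` and the data layout are hypothetical.

```python
from heapq import nlargest
from itertools import product
from math import prod


def k_mpe_bruteforce(distributions, k):
    """Yield the k most probable joint states, most likely first."""
    variables = list(distributions)
    states = (
        (dict(zip(variables, values)),
         prod(distributions[var][val] for var, val in zip(variables, values)))
        for values in product(*(distributions[var] for var in variables))
    )
    # nlargest returns its result sorted from largest to smallest
    yield from nlargest(k, states, key=lambda item: item[1])


dist = {"x": {"a": 0.75, "b": 0.25}, "y": {"u": 0.875, "v": 0.125}}
top_two = list(k_mpe_bruteforce(dist, k=2))
```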

_preprocess_query(query: dict | jpt.variables.VariableMap, remove_none: bool = True, skip_unknown_variables: bool = False, allow_singular_values: bool = False, space: Literal['labels', 'values'] = 'labels') → jpt.variables.LabelAssignment

Transform a query entered by a user into an internal representation that can be further processed.

Parameters:

query – the raw query
remove_none – Whether to remove None entries or not
skip_unknown_variables – skip preprocessing for variables that do not exist in the tree (may happen in multiple reverse tree inference). If False, an exception is raised; default: False
allow_singular_values – Allow singular values, such that they are transformed to the domain specification of numeric variables but not transformed to intervals via the PPF.

Returns:

the preprocessed VariableMap

_check_variable_assignment(assignment: jpt.variables.VariableAssignment | None)

Check the variable assignment for compatibility with the variables of this JPT.

apply(query: jpt.variables.VariableAssignment | dict[str, int | jpt.base.intervals.Interval | float | str]) → Iterator[Leaf]

Iterator that yields leaves that are consistent with a query.

A leaf is consistent with a query if either of the following propositions holds for all constraints expressed by its path to the root node:

the variable is not constrained by the query

the variable is constrained by the query and the query is consistent with the path

Parameters:: query – the preprocessed query, either an instance of a subclass of VariableAssignment or a dict mapping variables to their respective labels.
Returns:
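A minimal sketch of such an iterator, under the reading that a leaf is yielded when every variable constrained by both its path and the query has a non-empty overlap. Paths and queries are plain dicts here, with symbolic restrictions as sets and numeric restrictions as closed `(lower, upper)` tuples; these structures and the names `consistent` and `apply_query` are hypothetical stand-ins for the library's VariableMap and interval types.

```python
def consistent(path, query):
    """True if no restriction on the path contradicts the query."""
    for var, restriction in path.items():
        if var not in query:
            continue  # the query does not constrain this variable
        wanted = query[var]
        if isinstance(restriction, tuple):
            # numeric case: the closed intervals must overlap
            lower = max(restriction[0], wanted[0])
            upper = min(restriction[1], wanted[1])
            if lower > upper:
                return False
        elif not restriction & wanted:
            # symbolic case: the value sets must share an element
            return False
    return True


def apply_query(leaves, query):
    """Yield the leaves whose path is consistent with the query."""
    for leaf in leaves:
        if consistent(leaf["path"], query):
            yield leaf


inf = float("inf")
leaves = [
    {"idx": 1, "path": {"x": (-inf, 2.5)}},
    {"idx": 3, "path": {"x": (2.5, inf), "c": {"red"}}},
    {"idx": 4, "path": {"x": (2.5, inf), "c": {"green", "blue"}}},
]
hits = [leaf["idx"] for leaf in apply_query(leaves, {"x": (3.0, 4.0), "c": {"red", "blue"}})]
```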

__str__() → str

__repr__() → str

to_string() → str

fancy_tree() → str

pfmt() → str

Returns:: a pretty-format string representation of this JPT.

_pfmt(node: Node, indent: int) → str

Parameters:

node – The starting node
indent – the indentation of each new level

Returns:

a pretty-format string representation of this JPT from node downward.

learn(data: pandas.DataFrame | numpy.ndarray, keep_samples: bool = False, close_convex_gaps: bool = False, verbose: bool = False, prune_or_split: Callable[[JPT, Any, numpy.ndarray, numpy.ndarray], bool] | None = None, multicore: int | None = None, split_validation_mask: numpy.ndarray | None = None, split_validation_mode: str = 'both') → JPT

Fit the jpt to data.

Parameters:

data ([[str or float or bool]]; (according to self.variables)) – The training examples (assumed in row-shape)
keep_samples – If true, stores the indices of the original data samples in the leaf nodes. For debugging purposes only. Default is false.
close_convex_gaps –
prune_or_split – A callable (jpt, partition, indices, data) -> bool that is invoked before each split. Returns True to prune (make the node a leaf) or False to allow splitting. indices and data are numpy arrays.
multicore – The number of cores to use for learning. If None, all available cores are used.
verbose –
split_validation_mask – A boolean or uint8 array of length len(data). True/1 marks training samples whose feature values serve as candidate split points; False/0 marks evaluation samples whose feature values are excluded from candidates. Target values of all samples always contribute to the impurity score (unless split_validation_mode restricts this). None disables split validation (default).
split_validation_mode – Controls which targets contribute to the impurity score: 'both' (default) uses all targets, 'training' uses only training targets, 'evaluation' uses only evaluation targets.

Returns:

the fitted model
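The prune_or_split contract, a callable (jpt, partition, indices, data) -> bool that returns True to prune, can be met by an ordinary function. The sketch below prunes whenever a partition holds fewer than a given number of rows. The function name and the extra `min_rows` parameter are hypothetical, and the commented call assumes a JPT instance named `model` and a data frame named `df`.

```python
def prune_small_partitions(jpt, partition, indices, data, min_rows=50):
    """Return True (make the node a leaf) if the partition is small.

    The first four parameters follow the (jpt, partition, indices, data)
    contract; `min_rows` is a hypothetical knob with a default value so the
    function can still be called with four arguments.
    """
    return len(indices) < min_rows


# Hypothetical usage, assuming `model` is a JPT and `df` the training data:
# model.learn(df, prune_or_split=prune_small_partitions)
```

A different threshold can be bound up front with `functools.partial(prune_small_partitions, min_rows=200)`.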

fit

static sample(sample, ft)

likelihood(data: pandas.DataFrame | numpy.ndarray, dirac_scaling: float = 2.0, min_distances: Dict = None, preprocess: bool = True, multicore: int | None = None, verbose: bool = False, single_likelihoods: bool = False, variables: Iterable[jpt.variables.Variable] = None) → numpy.ndarray

Get the probabilities of a list of worlds. The worlds must be fully assigned with scalar values (no intervals or sets).

Parameters:

variables – Which variables to consider for the likelihood computation
data – An array containing the worlds. The shape is (x, len(variables)).
dirac_scaling – the minimal distance between the samples within a dimension is multiplied by this factor if a Dirac impulse is used to model the variable.
min_distances – A dict mapping the variables to the minimal distances between the observations. This can be useful to use the same likelihood parameters for different test sets for example in cross validation processes.
verbose – print status information to the console
multicore – how many cores should be used (defaults to all)
preprocess – whether to apply the preprocessing to the data passed.
single_likelihoods – will not only return the overall likelihoods but also the likelihoods per variable

Returns:

A np.ndarray with shape (x, ) containing the probabilities.

parallel_likelihood(data: numpy.ndarray | pandas.DataFrame, dirac_scaling: float = 2.0, min_distances: Dict = None, single_likelihoods: bool = False) → numpy.ndarray

Get the probabilities of a list of worlds. The worlds must be fully assigned with scalar values (no intervals or sets).

Parameters:

data – An array containing the worlds. The shape is (x, len(variables)).
dirac_scaling – the minimal distance between the samples within a dimension is multiplied by this factor if a Dirac impulse is used to model the variable.
min_distances – A dict mapping the variables to the minimal distances between the observations. This can be useful to use the same likelihood parameters for different test sets for example in cross validation processes.
single_likelihoods – will not only return the overall likelihoods but also the likelihoods per variable

Returns:

An np.array with shape (x, ) containing the probabilities.

reverse(query: Dict, confidence: float = 0.05) → List[tuple]

Determines the leaf nodes that match query best and returns them along with their respective confidence.

Parameters:

query – a mapping from featurenames to either numeric value intervals or an iterable of categorical values
confidence – the confidence level for this MPE inference

Returns:

a tuple of probabilities and jpt.trees.Leaf objects that match requirement (representing path to root)

plot(title: str = 'unnamed', filename: str | None = None, directory: str = None, plotvars: Iterable[jpt.variables.Variable] = None, view: bool = True, max_symb_values: int = 10, nodefill: str = None, leaffill: str = None, alphabet: bool = False, verbose: bool = False, engine=None) → str

Generates an SVG representation of the generated regression tree.

Parameters:

title – title of the plot
filename – the name of the JPT (will also be used as filename; extension will be added automatically)
directory – the location to save the SVG file to
plotvars – the variables to be plotted in the graph
view – whether the generated SVG file will be opened automatically
max_symb_values – limit the maximum number of symbolic values that are plotted to this number
nodefill – the color of the inner nodes in the plot; accepted formats: RGB, RGBA, HSV, HSVA or color name
leaffill – the color of the leaf nodes in the plot; accepted formats: RGB, RGBA, HSV, HSVA or color name
alphabet – whether to plot symbolic variables in alphabetic order, if False, they are sorted by probability (descending); default is False
verbose –
engine – the rendering engine for the distribution plots in the leaves; either ‘matplotlib’ or ‘plotly’

Returns:

(str) the path under which the rendered image has been saved.

pickle(fpath: str) → None

Pickles the fitted regression tree to a file at the given location fpath.

Parameters:: fpath – the location for the pickled file

static calcnorm(sigma: float, mu: float, intervals)

Computes the CDF for a multivariate normal distribution.

Parameters:

sigma – the standard deviation
mu – the expected value
intervals (list of matcalo.utils.utils.Interval) – the boundaries of the integral

Returns:
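Since the signature takes a scalar sigma and mu, the following sketch illustrates the univariate case only: the probability mass of a normal distribution on an interval is the difference of the CDF at its two boundaries. How `calcnorm` combines several intervals is not documented here, so returning one mass per interval is an assumption, as are the name `interval_mass` and the representation of intervals as `(lower, upper)` tuples.

```python
from statistics import NormalDist


def interval_mass(sigma, mu, intervals):
    """Probability mass of N(mu, sigma^2) on each (lower, upper) interval."""
    dist = NormalDist(mu=mu, sigma=sigma)
    return [dist.cdf(upper) - dist.cdf(lower) for lower, upper in intervals]


masses = interval_mass(sigma=1.0, mu=0.0, intervals=[(-1.0, 1.0), (-1.96, 1.96)])
```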

copy() → JPT

Returns:: a new copy of this JPT in which all references to the original tree are cut.

conditional_jpt(evidence: jpt.variables.VariableAssignment | None = None, fail_on_unsatisfiability: bool = True) → JPT | None

Apply evidence on a JPT and get a new JPT that represents P(x|evidence).

Parameters:

evidence – A VariableAssignment mapping the observed variables to their observed values
fail_on_unsatisfiability – whether an error is raised in case of unsatisfiable evidence or not

multiply_by_leaf_prior(prior: dict[int, float]) → JPT

Multiply every leaf’s prior by the given priors. This serves to handle the factor message from factor nodes. Be wary, since this method overwrites the JPT in place.

Parameters:: prior – The priors, a Dict mapping from leaf indices to float
Returns:: self

normalize() → JPT

Normalize the tree such that the sum of all leaf priors is 1.

Returns:: self
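The interplay of multiply_by_leaf_prior and normalize can be pictured on a plain dict of leaf priors: scale each prior by its factor, then divide by the new total. Unlike the methods above, this sketch returns a new dict rather than modifying anything in place; `multiply_and_normalize` is a hypothetical helper and it assumes that a factor is given for every leaf.

```python
def multiply_and_normalize(leaf_priors, factors):
    """Scale each leaf prior by its factor, then renormalize to sum to 1.

    Both arguments map leaf indices to floats.
    """
    scaled = {idx: prior * factors[idx] for idx, prior in leaf_priors.items()}
    total = sum(scaled.values())
    if total == 0:
        raise ValueError("all leaf priors vanished; nothing to normalize")
    return {idx: value / total for idx, value in scaled.items()}


priors = multiply_and_normalize(
    {1: 0.5, 3: 0.25, 4: 0.25},
    {1: 0.5, 3: 0.5, 4: 1.0},
)
```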

save(file: str | IO, protocol: Literal['pickle', 'json'] = 'pickle') → None

Write this JPT persistently to disk.

Parameters:

file – either a string or file-like object.
protocol –

dump

dumps(protocol: Literal['pickle', 'json'] = 'pickle') → bytes

static load(file: str | IO, protocol: Literal['pickle', 'json'] = 'pickle') → JPT

Load a JPT from disk.

Parameters:

file – either a string or file-like object.
protocol –

Returns:

the JPT described in file

static loads(data: typing_extensions.Buffer, protocol: Literal['pickle', 'json'] = 'pickle') → JPT

depth() → int

Returns:: the maximal depth of a leaf in the tree.

total_samples() → int

Returns:: the total number of samples represented by this tree.

number_of_parameters() → int

Returns:: The number of relevant parameters in the entire tree

bind(*arg, **kwargs) → jpt.variables.LabelAssignment

Returns a LabelAssignment object with the assignments passed.

This method accepts one optional positional argument, which – if passed – must be a dictionary of the desired variable assignments.

Keyword arguments may specify additional variable, value pairs.

If a positional argument is passed, the following options may be passed in addition as keyword arguments:

Parameters:

allow_singular_values – Allow singular values, such that they are transformed to the domain specification of numeric variables but not transformed to intervals via the PPF.
space – Literal[‘values’, ‘labels’] Whether the variables shall be assigned to terms in value or label space of the JPT.

moment(order: int = 1, center: jpt.variables.VariableAssignment | None = None, evidence: jpt.variables.VariableAssignment | None = None, fail_on_unsatisfiability: bool = True) → jpt.variables.VariableMap | None

Calculate the moment of the given order of each numeric/integer random variable given the evidence.

Parameters:

order – The order of the moment
center – A VariableAssignment mapping each numeric/integer variable to some constant. If a variable has a constant, it will be interpreted as ‘c’ for the central moment. If it is not set, 0 will be used by default.
evidence – The evidence given for the posterior to be computed
fail_on_unsatisfiability – Whether or not an Unsatisfiability error is raised if the likelihood of the evidence is 0.
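For a single variable X, the moment of a given order about a center c is E[(X - c)^order]; with c = 0 this is the raw moment, with c equal to the mean it is the central moment. The sketch below computes it for a discrete distribution given as a dict; the function and the data layout are hypothetical and ignore evidence.

```python
def moment(distribution, order=1, center=0.0):
    """E[(X - center)^order] for a discrete distribution {value: probability}."""
    return sum(p * (x - center) ** order for x, p in distribution.items())


die = {face: 1 / 6 for face in range(1, 7)}
mean = moment(die)                            # first raw moment
variance = moment(die, order=2, center=mean)  # second central moment
```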

get_hyperparameters_dict() → dict[str, Any]

Get all hyperparameters as a dict that can be used for MLFlow model tracking.

prune(similarity_threshold: float, approximate: float | dict[jpt.variables.Variable | str, float] | jpt.variables.VariableMap | None = None) → JPT

Prune this tree by repeatedly merging leaves with very similar distributions.

Parameters:

similarity_threshold – the average similarity of distributions in [0, 1] that two leaves must exhibit in order to be considered for a merge.
approximate –

Returns: