Frequently Asked Questions
General
- What is the difference between Sum-Product Networks (SPNs) and JPTs?
SPNs define dependency trees where edges between variables indicate a direct probabilistic influence. JPTs define computational trees that are a mixture of local factorised distributions. In an SPN the structural assumptions are fixed before learning; in a JPT the partitioning of the data space is inferred from data, making JPTs non-parametric and free of prior structural assumptions.
- Do I need to specify variable types manually?
No. ``infer_from_dataframe()`` inspects the DataFrame column dtypes and creates the appropriate variable type automatically (``NumericVariable`` for float/int columns, ``SymbolicVariable`` for object/category columns). You only need to construct variables manually if you want fine-grained control over the domain or resolution.
- Can JPTs handle missing values?
Not directly during training. Drop or impute missing values before calling ``fit()``. During inference, simply omit the variable from the ``evidence`` dict; marginalisation is exact and handles unobserved variables correctly.
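Why omitting a variable is exact can be seen on a toy model. The sketch below is not the library's implementation; it assumes a hypothetical two-leaf mixture in which each leaf holds one independent distribution per variable, and all names (``LEAVES``, ``evidence_probability``) are made up for illustration.

```python
from math import prod

# Hypothetical two-leaf model: each leaf has a prior weight and one
# independent discrete distribution per variable (illustrative values).
LEAVES = [
    {
        "prior": 0.6,
        "dists": {
            "weather": {"sunny": 0.8, "rainy": 0.2},
            "traffic": {"light": 0.7, "heavy": 0.3},
        },
    },
    {
        "prior": 0.4,
        "dists": {
            "weather": {"sunny": 0.1, "rainy": 0.9},
            "traffic": {"light": 0.2, "heavy": 0.8},
        },
    },
]


def evidence_probability(leaves, evidence):
    """Return P(evidence); variables absent from `evidence` are summed out."""
    total = 0.0
    for leaf in leaves:
        # An omitted variable contributes a factor of 1, because its
        # distribution inside the leaf sums to one.
        factors = (
            leaf["dists"][var].get(val, 0.0) for var, val in evidence.items()
        )
        total += leaf["prior"] * prod(factors)
    return total


# Fully observed evidence versus evidence with "traffic" left out.
full = evidence_probability(LEAVES, {"weather": "rainy", "traffic": "heavy"})
partial = evidence_probability(LEAVES, {"weather": "rainy"})

# Summing the fully specified query over every traffic value gives the
# same number as simply omitting the variable.
brute_force = sum(
    evidence_probability(LEAVES, {"weather": "rainy", "traffic": t})
    for t in ("light", "heavy")
)
```

Because each leaf factorises, skipping a variable is the same as summing over all of its values, so no imputation is needed at query time.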
Training
- What does ``min_samples_leaf`` control?
It sets the minimum number of training samples required to create a leaf. Values between 0 and 1 are treated as fractions of the training set size. Smaller values allow deeper, more expressive trees; larger values produce simpler, smoother models. Start with a value between ``0.01`` and ``0.05`` and tune using cross-validation or held-out likelihood.
- What is the difference between generative and discriminative mode?
In generative mode (default) the tree is split to maximise information gain over all variables simultaneously. The resulting model represents the full joint distribution \(P(\mathcal{X})\). In discriminative mode (``targets=[...]``) splits are scored only on the target variables, which gives better predictive accuracy for classification and regression at the cost of a less faithful joint model.
- How do I avoid overfitting?
Increase ``min_samples_leaf`` or set ``min_impurity_improvement`` to a small positive value (e.g. ``1e-4``). You can also use ``max_leaves`` to hard-cap the number of leaves.
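As rough arithmetic for the fractional form of ``min_samples_leaf`` described above, the sketch below converts a setting into an absolute sample count and the largest number of leaves it permits. Both helper functions are hypothetical, and the rounding rule is an assumption; the library may round differently.

```python
def resolve_min_samples_leaf(value, n_samples):
    """Turn a fractional or absolute setting into a sample count.

    Assumption: a fraction is rounded to the nearest integer (at least 1).
    """
    if 0 < value < 1:
        return max(1, round(value * n_samples))
    return int(value)


def max_leaf_count(value, n_samples):
    """Upper bound on leaves: every leaf must hold at least the minimum."""
    return n_samples // resolve_min_samples_leaf(value, n_samples)


n = 50_000
for setting in (0.01, 0.05, 200):
    print(
        f"min_samples_leaf={setting}: "
        f"{resolve_min_samples_leaf(setting, n)} samples per leaf, "
        f"at most {max_leaf_count(setting, n)} leaves"
    )
```

The bound helps when choosing ``max_leaves``: a cap above the bound never takes effect, because ``min_samples_leaf`` stops the tree first.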
Inference
- What does ``model.infer()`` return?
A scalar float: the (conditional) probability of the query given the evidence. For a marginal query (no evidence) this is \(P(Q)\). For a conditional query it is \(P(Q \mid E)\).
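Spelled out, a conditional query is the ratio of two marginal queries. The leaf-sum form on the right assumes the mixture-of-factorised-leaves structure described under General. The symbols \(\ell\) (a leaf), \(P(\ell)\) (its prior weight), \(P_\ell\) (its local distribution) and \(E_i\) (the values the evidence allows for variable \(X_i\)) are notation introduced here for illustration, and the product runs over the variables constrained by the evidence:

\[
P(Q \mid E) = \frac{P(Q, E)}{P(E)}, \qquad
P(E) = \sum_{\ell} P(\ell) \prod_{X_i \in E} P_\ell(X_i \in E_i)
\]

The ratio is only defined when \(P(E) > 0\).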
- What does ``model.posterior()`` return?
A dict mapping each queried variable to a marginal distribution object (``Multinomial`` for symbolic variables, a quantile-based distribution for numeric variables). Each distribution is a per-variable marginal conditioned on the evidence, so the returned objects do not encode dependencies between the queried variables, even though those variables may be correlated under the model.
- What happens when evidence is unsatisfiable?
``infer()`` returns ``0.0``. ``posterior()`` raises a ``ValueError``. Check your evidence ranges before calling ``posterior()`` if you are not sure whether the evidence is reachable.
- How does MPE differ from posterior expectation?
``mpe()`` returns the most likely assignment (mode) of all query variables jointly. The posterior expectation (``expectation()``) returns the mean of each variable’s marginal distribution independently. For multimodal distributions they can differ substantially.
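The gap between the two can be made concrete with a small hand-written joint distribution. This sketch does not call the library; ``JOINT``, ``joint_mode`` and ``marginal_means`` are hypothetical names, and the probabilities are invented to form two clusters.

```python
# Hypothetical joint distribution over two numeric variables (x, y),
# with one cluster near (0, 0) and another near (10, 10).
JOINT = {
    (0, 0): 0.40,
    (1, 0): 0.15,
    (10, 10): 0.35,
    (9, 10): 0.10,
}


def joint_mode(joint):
    """Most probable joint assignment (the MPE of this toy table)."""
    return max(joint, key=joint.get)


def marginal_means(joint):
    """Mean of each variable, computed from its own marginal."""
    n_vars = len(next(iter(joint)))
    return tuple(
        sum(p * assignment[i] for assignment, p in joint.items())
        for i in range(n_vars)
    )


mode = joint_mode(JOINT)
means = marginal_means(JOINT)
print("joint mode:", mode)
print("marginal means:", means)
```

Here the mode sits in one cluster while the means fall between the clusters, at a point the table assigns no probability to. This is the situation in which the two answers should not be used interchangeably.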
Performance
- My model is slow to query. What should I do?
Increase ``min_samples_leaf`` to limit the number of leaves. For batch queries, consider looping over the evidence rows with a pre-built ``varnames`` lookup dict to avoid repeated string lookups. The ``bind()`` method also pre-computes an evidence-conditioned subtree that can be reused for multiple downstream queries.
- Can I train on very large datasets?
Training is O(n log n) per variable per split level. For datasets above a few million rows consider sub-sampling for tree construction while keeping the full data for leaf distribution fitting, or use ``min_samples_leaf`` with a higher fraction to limit tree depth.
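One way to draw the sub-sample is a seeded uniform sample of row indices, so the same rows are selected on every run. This is a sketch using only the standard library; ``subsample_indices`` is a hypothetical helper, not part of the library, and it covers only the row selection step.

```python
import random


def subsample_indices(n_rows, max_rows, seed=0):
    """Return sorted row indices of a uniform sample without replacement."""
    if n_rows <= max_rows:
        # Small enough already: keep every row.
        return list(range(n_rows))
    rng = random.Random(seed)  # local generator, leaves global state alone
    return sorted(rng.sample(range(n_rows), max_rows))


# Example: keep 200,000 of 5,000,000 rows for tree construction.
idx = subsample_indices(5_000_000, 200_000)
```

With a pandas DataFrame, ``df.iloc[idx]`` would select the sampled rows. Sorting the indices preserves the original row order, which matters if the data has a temporal or grouped layout.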