Frequently Asked Questions ========================== General ------- **What is the difference between Sum-Product Networks (SPNs) and JPTs?** SPNs define *dependency* trees where edges between variables indicate a direct probabilistic influence. JPTs define *computational* trees that are a mixture of local factorised distributions. In an SPN the structural assumptions are fixed before learning; in a JPT the partitioning of the data space is inferred from data, making JPTs non-parametric and free of prior structural assumptions. **Do I need to specify variable types manually?** No. :py:func:`~jpt.variables.infer_from_dataframe` inspects the DataFrame column dtypes and creates the appropriate variable type automatically (``NumericVariable`` for float/int columns, ``SymbolicVariable`` for object/category columns). You only need to construct variables manually if you want fine-grained control over the domain or resolution. **Can JPTs handle missing values?** Not directly during training. Drop or impute missing values before calling ``fit()``. During inference, simply omit the variable from the ``evidence`` dict — marginalisation is exact and handles unobserved variables correctly. Training -------- **What does ``min_samples_leaf`` control?** It sets the minimum number of training samples required to create a leaf. Values between 0 and 1 are treated as fractions of the training set size. Smaller values allow deeper, more expressive trees; larger values produce simpler, smoother models. Start with ``0.01``–``0.05`` and tune using cross-validation or held-out likelihood. **What is the difference between generative and discriminative mode?** In *generative* mode (default) the tree is split to maximise information gain over all variables simultaneously. The resulting model represents the full joint distribution :math:`P(\mathcal{X})`. In *discriminative* mode (``targets=[...]``) splits are scored only on the target variables, which gives better predictive accuracy for classification and regression at the cost of a less faithful joint model. **How do I avoid overfitting?** Increase ``min_samples_leaf`` or set ``min_impurity_improvement`` to a small positive value (e.g. ``1e-4``). You can also use ``max_leaves`` to hard-cap the number of leaves. Inference --------- **What does ``model.infer()`` return?** A scalar float: the (conditional) probability of the query given the evidence. For a marginal query (no evidence) this is :math:`P(Q)`. For a conditional query it is :math:`P(Q \mid E)`. **What does ``model.posterior()`` return?** A dict mapping each queried variable to a marginal distribution object (:py:class:`~jpt.distributions.univariate.Multinomial` for symbolic variables, a quantile-based distribution for numeric variables). The distributions are independent conditional on the evidence, although the variables may be correlated. **What happens when evidence is unsatisfiable?** ``infer()`` returns ``0.0``. ``posterior()`` raises a ``ValueError``. Check your evidence ranges before calling ``posterior()`` if you are not sure whether the evidence is reachable. **How does MPE differ from posterior expectation?** :py:meth:`~jpt.trees.JPT.mpe` returns the *most likely assignment* (mode) of all query variables jointly. The posterior expectation (:py:meth:`~jpt.trees.JPT.expectation`) returns the *mean* of each variable's marginal distribution independently. For multimodal distributions they can differ substantially. Performance ----------- **My model is slow to query. What should I do?** Use ``min_samples_leaf`` to limit the number of leaves. For batch queries consider wrapping evidence rows in a loop over a pre-built ``varnames`` lookup dict to avoid repeated string lookups. The ``bind()`` method also pre-computes an evidence-conditioned subtree that can be reused for multiple downstream queries. **Can I train on very large datasets?** Training is O(n log n) per variable per split level. For datasets above a few million rows consider sub-sampling for tree construction while keeping the full data for leaf distribution fitting, or use ``min_samples_leaf`` with a higher fraction to limit tree depth.