How to Predict Continuous Values with JPTs
==========================================

Joint probability trees (JPTs) can be used for regression by querying the
posterior distribution of a continuous target variable given observed
feature values. Unlike point-estimate regressors, a JPT returns a full
probability distribution over the target, so uncertainty estimates such as
quantiles or intervals can be derived from the same query.
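
To make the contrast concrete, the following sketch uses only the Python
standard library and a small made-up sample (not JPT output) to show what
a distribution offers beyond a single predicted value. The numbers and
variable names are illustrative assumptions.

.. code-block:: python

    import statistics

    # Hypothetical target values observed in one region of feature space
    # (illustrative numbers, not produced by a JPT).
    samples = [1.8, 2.1, 2.4, 2.2, 3.9, 2.0, 2.6, 2.3, 2.5, 4.2]

    # A point-estimate regressor reports only a single number.
    point_estimate = statistics.mean(samples)

    # A distribution additionally yields spread and quantiles.
    spread = statistics.stdev(samples)
    deciles = statistics.quantiles(samples, n=10)  # 9 cut points
    lower, upper = deciles[0], deciles[-1]         # 10% and 90% quantiles

    print(f"mean={point_estimate:.2f} stdev={spread:.2f} "
          f"80% range=[{lower:.2f}, {upper:.2f}]")

The mean alone hides the two high values in the sample; the range between
the outer deciles makes that skew visible.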

Problem Setup
-------------

Prepare a DataFrame with feature columns and one or more numeric target
columns:

.. code-block:: python

    import pandas as pd
    import sklearn.datasets
    from jpt.variables import infer_from_dataframe
    from jpt.trees import JPT

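    # Note: despite the variable name, this loads the California housing
    # dataset. The target is the median house value per district, in
    # units of 100,000 USD.
    boston = sklearn.datasets.fetch_california_housing()
    df = pd.DataFrame(boston.data, columns=boston.feature_names)
    df['MedHouseVal'] = boston.target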

    variables = infer_from_dataframe(df)
    varnames = {v.name: v for v in variables}

Training a Discriminative JPT
------------------------------

Pass a ``targets`` list so that splits are chosen to reduce impurity in
the target variable rather than in all variables jointly:

.. code-block:: python

    model = JPT(
        variables,
        targets=[varnames['MedHouseVal']],
        min_samples_leaf=0.05
    )
    model.fit(df)
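
The effect of ``min_samples_leaf=0.05`` on tree size can be estimated with
a little arithmetic. The sketch below assumes that a fractional value is
interpreted as a share of the training rows; the exact rounding JPT applies
is not covered here, and ``leaf_budget`` is a hypothetical helper, not part
of the library.

.. code-block:: python

    def leaf_budget(n_rows, min_samples_leaf):
        """Return (approx. minimum rows per leaf, upper bound on leaves).

        Assumes a fractional ``min_samples_leaf`` is a share of ``n_rows``.
        """
        min_rows = max(1, round(min_samples_leaf * n_rows))
        max_leaves = n_rows // min_rows
        return min_rows, max_leaves

    # The California housing dataset has 20640 rows.
    min_rows, max_leaves = leaf_budget(20640, 0.05)
    print(f"at least ~{min_rows} rows per leaf, at most {max_leaves} leaves")

Under that assumption, a 5% threshold caps the tree at about 20 leaves, so
predictions are coarse. Lowering the value allows finer partitions at the
cost of fewer samples per leaf distribution.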

Point Predictions via Expectation
----------------------------------

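:py:meth:`~jpt.trees.JPT.expectation` returns the conditional mean of
the target given feature evidence. Evidence maps variable names to
values; the two-element lists below are intended as intervals, for
example a ``MedInc`` between 5.0 and 6.0: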

.. code-block:: python

    evidence = {
        'MedInc':   [5.0, 6.0],
        'HouseAge': [20.0, 30.0],
    }

    result = model.expectation(
        variables=[varnames['MedHouseVal']],
        evidence=evidence
    )
    print(f"E[MedHouseVal | evidence] = {result[varnames['MedHouseVal']]:.3f}")
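
The quantity being estimated can be illustrated without the library. The
sketch below computes an empirical conditional mean by averaging the target
over rows that satisfy the evidence, assuming each evidence entry denotes a
closed interval. The rows are made up, and a JPT answers the query from its
leaf distributions rather than by filtering raw data.

.. code-block:: python

    from statistics import mean

    # Hypothetical rows: (MedInc, HouseAge, MedHouseVal); values are made up.
    rows = [
        (5.2, 25.0, 2.4),
        (5.8, 22.0, 2.9),
        (3.1, 28.0, 1.2),
        (5.5, 41.0, 3.3),
        (6.4, 24.0, 3.8),
        (5.9, 29.0, 2.8),
    ]

    evidence = {'MedInc': (5.0, 6.0), 'HouseAge': (20.0, 30.0)}

    def in_interval(value, bounds):
        """True if value lies in the closed interval given by bounds."""
        low, high = bounds
        return low <= value <= high

    # Keep the target of every row that satisfies all evidence intervals.
    matching = [
        target for med_inc, house_age, target in rows
        if in_interval(med_inc, evidence['MedInc'])
        and in_interval(house_age, evidence['HouseAge'])
    ]

    conditional_mean = mean(matching)
    print(f"{len(matching)} matching rows, mean target {conditional_mean:.3f}")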

Full Posterior Distribution
----------------------------

:py:meth:`~jpt.trees.JPT.posterior` returns the conditional
distribution as a quantile-based PDF object.  Use it when you need
more than a point estimate:

.. code-block:: python

    import matplotlib.pyplot as plt
    import numpy as np

    post = model.posterior(
        variables=[varnames['MedHouseVal']],
        evidence=evidence
    )
    dist = post[varnames['MedHouseVal']]

    xs = np.linspace(dist.ppf(.01), dist.ppf(.99), 300)
    plt.plot(xs, [dist.pdf(x) for x in xs])
    plt.xlabel('MedHouseVal')
    plt.ylabel('Density')
    plt.title('Posterior distribution')
    plt.show()
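
A quantile function is also enough to derive a central interval around the
median. The sketch below uses ``statistics.NormalDist`` purely as a
stand-in so that it runs without a trained model; a JPT posterior is not
Gaussian in general. ``central_interval`` is a hypothetical helper. With
the ``dist`` object from the plotting example, ``dist.ppf`` could be passed
instead, assuming it can be called with a probability as it is there.

.. code-block:: python

    from statistics import NormalDist

    # Stand-in for a posterior over MedHouseVal (illustrative parameters).
    dist = NormalDist(mu=2.5, sigma=0.6)

    def central_interval(quantile_fn, mass=0.9):
        """Return the interval that leaves (1 - mass) / 2 in each tail."""
        tail = (1.0 - mass) / 2.0
        return quantile_fn(tail), quantile_fn(1.0 - tail)

    low, high = central_interval(dist.inv_cdf, mass=0.9)
    median = dist.inv_cdf(0.5)
    print(f"median={median:.3f}, 90% interval=[{low:.3f}, {high:.3f}]")

Reporting the interval alongside the mean shows how much the evidence
actually narrows the target.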

Evaluating RMSE
---------------

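Split the data, refit the model on the training portion only, then
iterate over the held-out test set and compare the predicted mean to the
ground-truth target value. The root mean squared error (RMSE) is the
square root of the average squared difference: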

.. code-block:: python

    import sklearn.model_selection
    import math

    train_df, test_df = sklearn.model_selection.train_test_split(
        df, test_size=0.2, random_state=0
    )
    # Refit on the training split so the test rows stay unseen.
    model.fit(train_df)

    squared_errors = []
    for _, row in test_df.iterrows():
        evidence = {col: float(row[col]) for col in boston.feature_names}
        result = model.expectation(
            [varnames['MedHouseVal']],
            evidence=evidence
        )
        pred = result[varnames['MedHouseVal']]
        squared_errors.append((pred - row['MedHouseVal']) ** 2)

    rmse = math.sqrt(sum(squared_errors) / len(squared_errors))
    print(f'RMSE: {rmse:.4f}')
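
An RMSE value is easier to interpret next to a trivial baseline. The sketch
below wraps the metric in a reusable function and compares a set of
predictions against always predicting the mean target. The helper name
``rmse`` and all numbers are illustrative assumptions, not JPT results.

.. code-block:: python

    import math

    def rmse(predictions, targets):
        """Root mean squared error of two equally long sequences."""
        if len(predictions) != len(targets):
            raise ValueError("predictions and targets differ in length")
        if not predictions:
            raise ValueError("need at least one prediction")
        total = sum((p - t) ** 2 for p, t in zip(predictions, targets))
        return math.sqrt(total / len(predictions))

    # Made-up ground truth and model predictions.
    targets = [2.0, 3.0, 1.0, 4.0]
    predictions = [2.5, 2.5, 1.5, 3.5]

    # Baseline: always predict the mean of the targets.
    mean_target = sum(targets) / len(targets)
    baseline = [mean_target] * len(targets)

    model_rmse = rmse(predictions, targets)
    baseline_rmse = rmse(baseline, targets)
    print(f"model RMSE: {model_rmse:.4f}, baseline RMSE: {baseline_rmse:.4f}")

In a real evaluation the baseline mean should be computed on the training
split; a model whose RMSE does not fall clearly below that baseline adds
little over a constant prediction.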

.. seealso::

    :doc:`notebooks/tutorial_regression` — a worked regression
    analysis with visualisations.

    :doc:`notebooks/tutorial_reasoning` — full walk-through of all
    query types including ``posterior`` and ``expectation``.