How to Predict Continuous Values with JPTs
JPTs can be used for regression by querying the posterior distribution of a continuous target variable given observed feature values. Unlike point-estimate regressors, a JPT returns a full probability distribution over the target, which allows uncertainty quantification out of the box.
Problem Setup
Prepare a DataFrame with feature columns and one or more numeric target columns:
import pandas as pd
import sklearn.datasets

from jpt.trees import JPT
from jpt.variables import infer_from_dataframe

# Despite the variable name, this loads the California housing dataset.
boston = sklearn.datasets.fetch_california_housing()
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df['MedHouseVal'] = boston.target  # regression target

# Infer one JPT variable per column and index them by name for later lookups.
variables = infer_from_dataframe(df)
varnames = {v.name: v for v in variables}
Training a Discriminative JPT
Pass a targets list to concentrate splits on the target variable:
model = JPT(
    variables,
    targets=[varnames['MedHouseVal']],
    min_samples_leaf=0.05,
)
model.fit(df)
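To get a feel for what min_samples_leaf=0.05 implies, the sketch below assumes that a fractional value is read as a share of the training rows. This is an assumption about how JPT treats floats and is worth checking against the API reference. Under that assumption it derives the minimum leaf size and an upper bound on the number of leaves. The helper name leaf_budget is hypothetical, and the rounding is illustrative only.

```python
import math
from typing import Tuple

def leaf_budget(n_samples: int, min_samples_leaf: float) -> Tuple[int, int]:
    """Return (minimum rows per leaf, upper bound on the number of leaves).

    Assumption: a value in (0, 1) is a fraction of the training rows and
    a value >= 1 is an absolute row count. The rounding used here (ceil)
    is illustrative and may differ from the library's.
    """
    if 0 < min_samples_leaf < 1:
        min_rows = max(1, math.ceil(min_samples_leaf * n_samples))
    else:
        min_rows = int(min_samples_leaf)
    # Every leaf needs at least min_rows rows, so the tree cannot have
    # more than n_samples // min_rows leaves.
    return min_rows, n_samples // min_rows

# The California housing dataset has 20640 rows.
min_rows, max_leaves = leaf_budget(n_samples=20640, min_samples_leaf=0.05)
print(min_rows, max_leaves)  # 1032 20
```

Under this reading, a larger fraction yields a coarser tree with fewer, better-populated leaves, while a smaller one allows finer partitions at the risk of overfitting.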
Point Predictions via Expectation
expectation() returns the conditional mean of the target given feature evidence:
evidence = {
    'MedInc': [5.0, 6.0],
    'HouseAge': [20.0, 30.0],
}
result = model.expectation(
    variables=[varnames['MedHouseVal']],
    evidence=evidence,
)
print(f"E[MedHouseVal | evidence] = {result[varnames['MedHouseVal']]:.3f}")
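As a sanity check on what a conditional mean under interval evidence represents, the sketch below computes its empirical counterpart directly from data: the average target over the rows that fall inside every evidence interval. It assumes that two-element lists in the evidence denote closed intervals [lower, upper]. The helper empirical_conditional_mean and the toy DataFrame are hypothetical and not part of the jpt API.

```python
import pandas as pd

def empirical_conditional_mean(df, target, evidence):
    """Mean of `target` over the rows that satisfy every interval in `evidence`.

    `evidence` maps a column name to a (lower, upper) pair, treated here as
    a closed interval. Returns NaN when no row matches.
    """
    mask = pd.Series(True, index=df.index)
    for column, (lower, upper) in evidence.items():
        mask &= df[column].between(lower, upper)  # inclusive on both ends
    return df.loc[mask, target].mean()

# Toy data, made up for illustration only.
toy = pd.DataFrame({
    'MedInc':      [4.0, 5.2, 5.8, 5.5, 7.0],
    'HouseAge':    [25.0, 22.0, 28.0, 40.0, 21.0],
    'MedHouseVal': [1.5, 2.0, 3.0, 4.0, 5.0],
})
toy_evidence = {'MedInc': [5.0, 6.0], 'HouseAge': [20.0, 30.0]}
toy_mean = empirical_conditional_mean(toy, 'MedHouseVal', toy_evidence)
print(toy_mean)  # 2.5 (rows 1 and 2 match both intervals)
```

A fitted model generalises across the rows in its leaves, so its expectation will usually differ somewhat from this raw average. A large gap on real data is a hint to revisit the tree's hyperparameters.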
Full Posterior Distribution
posterior() returns the conditional distribution of each queried variable as a quantile-based distribution object. Use it when you need more than a point estimate, for example to plot the density or to read off quantiles:
import matplotlib.pyplot as plt
import numpy as np

post = model.posterior(
    variables=[varnames['MedHouseVal']],
    evidence=evidence,
)
dist = post[varnames['MedHouseVal']]

# Evaluate the density between the 1st and 99th percentiles.
xs = np.linspace(dist.ppf(.01), dist.ppf(.99), 300)
plt.plot(xs, [dist.pdf(x) for x in xs])
plt.xlabel('MedHouseVal')
plt.ylabel('Density')
plt.title('Posterior distribution')
plt.show()
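Because the posterior exposes a quantile function, a central interval follows from two quantile lookups. The sketch below is written against any callable that maps a probability to a quantile, which is how dist.ppf is used in the plotting code above. central_interval is a hypothetical helper, and a uniform stand-in replaces a real posterior so that the block runs on its own.

```python
from typing import Callable, Tuple

def central_interval(ppf: Callable[[float], float],
                     level: float = 0.9) -> Tuple[float, float]:
    """Return the central interval that contains `level` of the probability mass."""
    if not 0.0 < level < 1.0:
        raise ValueError('level must lie strictly between 0 and 1')
    tail = (1.0 - level) / 2.0  # mass left out on each side
    return ppf(tail), ppf(1.0 - tail)

# Stand-in: quantile function of a uniform distribution on [0, 10].
def uniform_ppf(q: float) -> float:
    return 10.0 * q

low, high = central_interval(uniform_ppf, level=0.9)
print(f'90% interval: [{low:.2f}, {high:.2f}]')  # 90% interval: [0.50, 9.50]
```

With a fitted model, passing dist.ppf in place of uniform_ppf would give the corresponding interval for MedHouseVal, assuming dist.ppf can be called with a probability as in the plotting code.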
Evaluating RMSE
Iterate over a held-out test set and compare the predicted mean to the ground-truth target value:
import math

import sklearn.model_selection

train_df, test_df = sklearn.model_selection.train_test_split(
    df, test_size=0.2, random_state=0
)
model.fit(train_df)  # refit on the training split only

squared_errors = []
for _, row in test_df.iterrows():
    # Use every feature column of the test row as point evidence.
    evidence = {col: float(row[col]) for col in boston.feature_names}
    result = model.expectation(
        [varnames['MedHouseVal']],
        evidence=evidence,
    )
    pred = result[varnames['MedHouseVal']]
    squared_errors.append((pred - row['MedHouseVal']) ** 2)

rmse = math.sqrt(sum(squared_errors) / len(squared_errors))
print(f'RMSE: {rmse:.4f}')
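An RMSE is easier to interpret next to a trivial baseline. The following sketch uses only NumPy to compute the RMSE for arbitrary predictions and for a baseline that always predicts the training mean. The function names rmse_score and mean_baseline_rmse and the toy numbers are hypothetical.

```python
import numpy as np

def rmse_score(y_true, y_pred) -> float:
    """Root mean squared error between two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

def mean_baseline_rmse(y_train, y_test) -> float:
    """RMSE of a predictor that always returns the mean of the training targets."""
    y_test = np.asarray(y_test, dtype=float)
    constant = np.full_like(y_test, np.mean(y_train))
    return rmse_score(y_test, constant)

y_train = [1.0, 2.0, 3.0]  # toy training targets, mean 2.0
y_test = [2.0, 4.0]        # toy test targets
y_pred = [2.5, 3.5]        # toy model predictions

model_rmse = rmse_score(y_test, y_pred)
baseline_rmse = mean_baseline_rmse(y_train, y_test)
print(f'model: {model_rmse:.4f}, baseline: {baseline_rmse:.4f}')
```

In the evaluation above, y_train and y_test would be train_df['MedHouseVal'] and test_df['MedHouseVal']. A model RMSE close to the baseline suggests that the evidence adds little beyond the marginal mean.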
See also
Regression: a worked regression analysis with visualisations.
Reasoning about Joint Probability Distributions: a full walk-through of all query types, including posterior and expectation.