
Basic Workflow Guide

This guide walks through the full ClearTrace model lifecycle: training a model, generating predictions and explanations, exploring counterfactual scenarios, segmenting your data, and producing narrative summaries.

Setup

import pandas as pd
import numpy as np
from outerproduct import ClearTrace

ct = ClearTrace()

The client automatically reads your API key from the OUTERPRODUCT_API_KEY environment variable. See the Quickstart for installation and authentication details.
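If you prefer to configure the key in code rather than in your shell, you can set the environment variable before constructing the client. This is a minimal sketch; the key value is a placeholder, and only the variable name OUTERPRODUCT_API_KEY comes from the docs above:

```python
import os

# Set the key before constructing the client; ClearTrace() reads this
# variable at initialization. Replace the placeholder with your real key.
os.environ["OUTERPRODUCT_API_KEY"] = "your-api-key"
```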


1. Training a Model

From a DataFrame

The most common way to train is by passing a pandas DataFrame with your features and target column:

df = pd.read_csv("data.csv")

result = ct.fit(df, target="price")

print(result.status)        # "completed"
print(ct.model_id)          # auto-assigned model ID
print(ct.feature_names)     # columns used as features

ClearTrace automatically identifies the feature columns (everything except the target) and extracts their names from the DataFrame.

Selecting specific features

If your DataFrame contains columns you don't want to use as features, specify them explicitly:

result = ct.fit(
    df,
    target="price",
    feature_fields=["sqft", "bedrooms", "bathrooms", "lot_size"],
)

From a NumPy array

You can also train from a NumPy array by passing the target values separately:

X = df[["sqft", "bedrooms", "bathrooms"]].values
y = df["price"].values

result = ct.fit(X, target=y)

Tuning training configuration

Pass additional keyword arguments to control training behavior:

result = ct.fit(
    df,
    target="price",
    mode="fast",
    base_model_type="xgboost",
    n_hyperopt_steps=6,
)

Non-blocking training

By default, fit() blocks until training completes. Set wait=False to submit the job and return immediately:

job = ct.fit(df, target="price", wait=False)
print(job.status)  # "pending"

# Check status later
print(ct.status)

2. Uploading Data

For large datasets, you can upload a file first and then train against it server-side. This avoids sending data inline with the training request.

Upload a file from disk

upload = ct.upload_file("data.csv")
print(upload.model_id)  # auto-assigned model ID

# Train using the uploaded data
result = ct.fit(target="price")

upload_file() supports .csv, .parquet (requires the parquet extra), and .pkl formats. The file format is detected automatically from the extension.
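The detection logic itself is internal to the SDK, but conceptually it maps the filename suffix to a format tag. A minimal sketch of that idea (the helper name and format tags here are illustrative, not part of the SDK):

```python
from pathlib import Path

# Illustrative only: map a filename extension to a format tag, roughly
# mirroring the automatic detection described above.
SUPPORTED = {".csv": "csv", ".parquet": "parquet", ".pkl": "pickle"}

def detect_format(path: str) -> str:
    suffix = Path(path).suffix.lower()
    try:
        return SUPPORTED[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file format: {suffix!r}")
```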

Manual upload flow

For more control, you can create the upload URL and send the data separately:

upload = ct.create_upload("csv")

with open("data.csv", "rb") as f:
    ct.upload_fileobj(upload, f)

3. Model Distillation

If you have an existing black-box model served behind a prediction endpoint, ClearTrace can distill it into an interpretable surrogate model:

result = ct.fit_distill(
    df.drop(columns=["price"]),
    predict_url="https://your-model.example.com/predict",
    predict_headers={"Authorization": "Bearer token"},
    labels=df["price"],
)

ClearTrace sends your feature data to the prediction endpoint, captures the responses, and trains a surrogate model that mirrors the black-box model's behavior — while remaining fully explainable.


4. Loading an Existing Model

To work with a model you've already trained, load it by ID:

ct.load("your-model-id")

All subsequent operations (predict, explain, etc.) will use this model. The load() method returns the client instance, so you can chain calls:

predictions = ct.load("your-model-id").predict(X)

5. Predictions

Generate batch predictions from your trained model:

predictions = ct.predict(X)  # returns np.ndarray

predict() accepts a DataFrame, NumPy array, or nested list.
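All three input forms describe the same samples. A quick illustration using plain pandas/NumPy to show the equivalence (the data here is made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"sqft": [1200, 1500], "bedrooms": [2, 3]})

# The same two samples in each accepted form:
as_dataframe = df
as_array = df.to_numpy()
as_nested_list = df.to_numpy().tolist()

# All three carry identical values, so predictions should match.
assert np.array_equal(as_array, np.asarray(as_nested_list))
```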


6. Explanations

ClearTrace provides AGOP-based (Averaged Gradient Outer Product) explanations that attribute each prediction to the input features.

Basic explanations

result = ct.explain(X)

result.predictions      # predicted values for each sample
result.local_agop       # per-sample AGOP scores
result.local_drivers    # per-sample feature attributions
result.feature_names    # feature names

Predict and explain in one call

When you need both predictions and richer explanations, use predict_and_explain():

result = ct.predict_and_explain(X)

result.predictions        # predicted values
result.local_drivers      # per-sample feature attributions
result.persona            # persona descriptions (if enabled)
result.persona_cluster_id # cluster assignment per sample
result.local_rules        # local rule explanations (if enabled)

Personas

Enable persona grouping to understand which behavioral archetype each sample belongs to:

result = ct.predict_and_explain(X, with_persona=True)

for i, p in enumerate(result.persona):
    print(f"Sample {i}: {p}")

Local rules

Enable local-rule extraction to get human-readable decision rules for each prediction:

result = ct.predict_and_explain(
    X,
    rule_kwargs={"selector": "lift_threshold", "lift_threshold": 0.9},
)

for rule in result.local_rules:
    print(rule)

Pass an empty dict (rule_kwargs={}) to enable rule extraction with default settings.

Global feature importance

Get a model-wide view of which features matter most:

result = ct.interpret()

for name, importance in zip(result.feature_names, result.global_drivers):
    print(f"{name}: {importance:.4f}")
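To rank features by importance rather than print them in column order, sort the name/score pairs first. A sketch with dummy stand-ins for result.feature_names and result.global_drivers (interpret() returns the real values):

```python
# Dummy stand-ins for the values returned by interpret():
feature_names = ["sqft", "bedrooms", "bathrooms", "lot_size"]
global_drivers = [0.41, 0.12, 0.09, 0.38]

# Pair names with scores and sort by importance, highest first.
ranked = sorted(zip(feature_names, global_drivers),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print(f"{name}: {importance:.4f}")
```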

7. Counterfactual Scenarios

Scenario analysis answers the question: "What would need to change for this sample to get a different prediction?"

Basic scenario

queries = pd.DataFrame([
    {"age": 25, "income": 50000, "credit_score": 620},
    {"age": 40, "income": 85000, "credit_score": 710},
])

result = ct.scenario(queries, desired_class=1)

for item in result:
    print(f"Query {item.query_index}:")
    print(f"  Baseline prediction: {item.baseline_prediction:.2f}")
    print(f"  Already at desired class: {item.already_at_desired_class}")
    print(f"  Number of scenarios found: {item.scenario_count}")

    for candidate in item:
        print(f"  Changes needed ({candidate.n_changes} features):")
        for feature, change in candidate.changes.items():
            print(f"    {feature}: {change.from_} -> {change.to}")

Adding constraints

Constrain which features can change and how:

result = ct.scenario(
    queries,
    desired_class=1,
    constraints={
        "age": {"immutable": True},                          # cannot change
        "income": {"monotonic": "increase"},                 # can only go up
        "credit_score": {"value_range": [300, 850]},         # must stay in range
        "employment": {"allowed_values": ["full-time", "part-time"]},
    },
)

Control the counterfactual search with these parameters:

result = ct.scenario(
    queries,
    desired_class=1,
    n_walks=1000,       # number of random walks (default: 500)
    max_steps=50,       # max steps per walk (default: 30)
    epsilon=0.1,        # step size (default: 0.2)
    random_state=42,    # seed for reproducibility
)

8. Segmentation

Segmentation groups your data into clusters that exhibit distinct explanation patterns, then optionally generates natural-language persona descriptions for each cluster.

result = ct.segment(
    data=X,
    target_values=y,
    min_clusters=3,
    max_clusters=8,
)

print(f"Optimal clusters: {result.n_clusters}")
print(f"Quality score: {result.quality:.3f}")

# Per-sample cluster assignments
print(result.cluster_ids)

# Persona descriptions for each cluster
for persona in result.personas:
    print(f"\nCluster {persona.cluster_id}: {persona.persona_name}")
    print(f"  {persona.persona_description}")

Like fit(), segmentation runs asynchronously. Pass wait=False to return immediately:

job = ct.segment(data=X, target_values=y, wait=False)

# Retrieve results later
result = ct.get_segments()
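The per-sample cluster_ids line up with the rows of your input, so you can slice out a single cluster with a boolean mask. A sketch with dummy stand-ins for the feature matrix and result.cluster_ids:

```python
import numpy as np
import pandas as pd

# Dummy stand-ins for your feature matrix and result.cluster_ids:
X = pd.DataFrame({"income": [50_000, 85_000, 62_000, 91_000]})
cluster_ids = np.array([0, 1, 0, 1])

# Rows assigned to cluster 0:
cluster_0 = X[cluster_ids == 0]
print(cluster_0)
```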

9. Narrative Generation

Generate an LLM-powered natural-language summary of model behavior on a dataset:

narrative = ct.narrative(
    df,
    kpi_name="Revenue",
    context="Q1 2026 customer cohort analysis",
)

print(narrative.narrative)

The narrative is generated asynchronously. Use wait=False and get_narrative() for non-blocking usage:

job = ct.narrative(df, kpi_name="Revenue", wait=False)

# Retrieve later
result = ct.get_narrative()
print(result.narrative)

10. Error Handling

The client raises specific exception types that let you handle errors precisely:

from outerproduct import (
    ClearTrace,
    AuthenticationError,
    NotFoundError,
    ValidationError,
    ServerError,
    PollTimeoutError,
)

ct = ClearTrace()

try:
    ct.fit(df, target="price")
except AuthenticationError:
    print("Invalid or missing API key")
except ValidationError as e:
    print(f"Invalid request: {e}")
except PollTimeoutError as e:
    print(f"Training timed out after {e.elapsed:.0f}s for model {e.model_id}")
except ServerError:
    print("Server error — try again later")

All exceptions inherit from OuterProductError, so you can catch that as a fallback:

from outerproduct import OuterProductError

try:
    ct.fit(df, target="price")
except OuterProductError as e:
    print(f"Something went wrong: {e}")

Async Usage

Every method shown above is also available on the async client. Replace ClearTrace with AsyncClearTrace and await each call:

from outerproduct import AsyncClearTrace

ct = AsyncClearTrace()

result = await ct.fit(df, target="price")
predictions = await ct.predict(X)
explanations = await ct.explain(X)

Putting it all together

Here's a complete workflow from training through analysis:

import pandas as pd
from outerproduct import ClearTrace

ct = ClearTrace()

# Load and train
df = pd.read_csv("loans.csv")
ct.fit(df, target="approved")

# Feature data for inference
X = df.drop(columns=["approved"])

# Global importance
importance = ct.interpret()
for name, score in zip(importance.feature_names, importance.global_drivers):
    print(f"{name}: {score:.4f}")

# Per-sample explanations with personas
result = ct.predict_and_explain(X, with_persona=True)

# Counterfactual: what would it take to get approved?
denied = X[result.predictions < 0.5]
scenarios = ct.scenario(denied, desired_class=1, constraints={
    "age": {"immutable": True},
})

for item in scenarios:
    if item.scenario_count > 0:
        top = item[0]
        print(f"Sample {item.query_index}: change {top.n_changes} features")

# Segment the data
segments = ct.segment(data=X, target_values=df["approved"])
for persona in segments.personas:
    print(f"{persona.persona_name}: {persona.persona_description}")

# Generate narrative summary
summary = ct.narrative(df, kpi_name="Approval Rate")
print(summary.narrative)