Explainable AI (XAI)

Making AI decisions interpretable and explainable.

Definition

Explainable AI (XAI) is the set of methods and practices that make machine learning model behavior understandable to humans — identifying which inputs or features drove a decision, surfacing the internal logic of a model, or producing natural language justifications that non-technical stakeholders can act on. The goal is not just to describe what a model did, but to provide explanations that are faithful to its actual reasoning and useful to the person receiving them: an auditor checking for bias, a doctor validating a diagnosis, or a user disputing a credit decision.

Explanations can be global (describing overall model behavior — which features are generally most important) or local (explaining a specific prediction — why was this particular loan denied?). They can be post-hoc (applied after training to an existing model, such as SHAP or LIME) or intrinsic (models that are interpretable by design, like linear models, decision trees, or rule-based systems). The post-hoc vs. intrinsic distinction matters: post-hoc explanations are flexible and can be applied to any model, but may not fully capture the model's true mechanism; intrinsic models are more faithful but often less expressive.
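
To make the global/local distinction concrete, the sketch below computes a global explanation with scikit-learn's permutation importance (the data and model are synthetic illustrations); a local explanation of a single prediction appears in the SHAP example at the end of this page.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Illustrative model and synthetic data
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Global view: how much does shuffling each feature degrade the score overall?
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
for i, importance in enumerate(result.importances_mean):
    print(f"feature_{i}: {importance:.3f}")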

XAI is a critical enabler of AI safety auditing and of investigations into bias in AI. It is legally required or strongly recommended in regulated domains: under the GDPR, individuals affected by solely automated decisions are entitled to meaningful information about the logic involved, and the EU AI Act imposes transparency obligations on high-risk AI systems. As LLMs and agents become more capable, explaining emergent behavior and multi-step reasoning remains an active research frontier.

How it works

Feature attribution methods

Feature attribution methods assign importance scores to each input feature for a given prediction. SHAP (SHapley Additive exPlanations) uses cooperative game theory to fairly distribute the contribution of each feature, guaranteeing consistency and local accuracy. LIME (Local Interpretable Model-agnostic Explanations) fits a simple interpretable model around the neighborhood of a single prediction. Both work with any model type but may differ from each other and from the true model mechanism.
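
A minimal LIME sketch for a tabular classifier, assuming the lime package is installed; the synthetic data, model, and class names are illustrative:

from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative classifier and synthetic data
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]
model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)

# Fit a simple surrogate model around one instance and read off its weights
explainer = LimeTabularExplainer(
    X,
    feature_names=feature_names,
    class_names=["denied", "approved"],  # hypothetical labels for the example
    mode="classification",
)
explanation = explainer.explain_instance(X[0], model.predict_proba, num_features=5)
for feature, weight in explanation.as_list():
    print(f"{feature}: {weight:+.4f}")

The reported weights belong to the local surrogate, not the underlying model, so they describe behavior only in the neighborhood of that one instance.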

Visual and attention-based explanations

For vision models, saliency maps and GradCAM highlight image regions that most influenced a prediction. For language models, attention weights show which tokens the model attended to; though attention is not always a reliable explanation of causal reasoning, it is widely used for debugging.
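
As an illustration of the attention-inspection workflow (not GradCAM), here is a sketch using the Hugging Face transformers library; the model name and input sentence are arbitrary choices for the example, and, per the caveat above, the output is a debugging aid rather than a causal explanation:

import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative model choice for the sketch
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_attentions=True)

inputs = tokenizer("The loan was denied due to low income.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions holds one (batch, heads, seq, seq) tensor per layer;
# average the last layer's heads to see where each token's attention concentrates
attention = outputs.attentions[-1].mean(dim=1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, weights in zip(tokens, attention):
    print(f"{token:>12} -> attends most to {tokens[weights.argmax().item()]}")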

Inherently interpretable models

Linear models, logistic regression, decision trees, and rule lists are inherently interpretable: their logic can be read directly from model parameters. When interpretability is paramount — for example, in clinical risk scoring or regulated credit models — these simpler models may be preferred even at the cost of some predictive accuracy.
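
For example, a shallow decision tree's complete logic can be printed as human-readable rules with scikit-learn (synthetic data for illustration):

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

# A shallow tree: the printed rules are the model's entire decision logic
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=feature_names))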

When to use / When NOT to use

Use when | Avoid when
Regulated domain requires explanations for automated decisions (credit, hiring, healthcare) | The model is low-stakes with no individual impact and no regulatory requirement
Debugging model failures or unexpected behavior | Explanation overhead would introduce unacceptable production latency
Users need to understand or contest a model's decision | The model is an intrinsically interpretable baseline that needs no separate explanation
Auditing for bias or safety issues before or after deployment | The explanation method you have access to is known to be unreliable for your model type

Comparisons

Method | Type | Model-agnostic | Scope | Fidelity
SHAP | Post-hoc | Yes | Local + global | High (mathematically grounded)
LIME | Post-hoc | Yes | Local | Moderate (local approximation)
GradCAM | Post-hoc | No (gradient-based) | Local (vision) | Moderate
Attention | Post-hoc | No (transformer-specific) | Local | Variable
Decision tree | Intrinsic | N/A | Global | Perfect (the model is the explanation)

Pros and cons

Pros | Cons
Supports compliance with GDPR and EU AI Act right-to-explanation | Post-hoc explanations may not reflect the true model mechanism
Enables bias detection by attributing decisions to protected proxies | Explanations can be manipulated to appear fair even when the model is not
Builds user trust and supports contestability | Inherently interpretable models often sacrifice predictive power
Facilitates model debugging and iterative improvement | Explaining emergent behavior in LLMs and deep nets remains an open problem

Code examples

SHAP feature attribution for a tabular classifier (Python)

import shap
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification

# Train a simple classifier
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

model = GradientBoostingClassifier(n_estimators=50, random_state=42)
model.fit(X, y)

# Compute SHAP values for the first 10 samples
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:10])

# Show feature importances for the first prediction
print("SHAP values for prediction 0 (positive class):")
for name, value in sorted(
    zip(feature_names, shap_values[0]),
    key=lambda x: abs(x[1]),
    reverse=True,
):
    print(f"  {name}: {value:+.4f}")
