Tessl Tile for pypi/shap@0.48.0

or run

npx @tessl/cli init

Version

Tile

Overview

Evals

Files

docs

explainers.md index.md utilities.md visualization.md

utilities.mddocs/

0
# Data Utilities and Helpers
1

2
SHAP provides comprehensive utilities including built-in datasets, masking strategies, helper functions, and model wrappers to support explainability workflows across different data types and use cases.
3

4
## Capabilities
5

6
### Built-in Datasets
7

8
Ready-to-use datasets for testing, benchmarking, and educational purposes, covering various domains and data types.
9

10
```python { .api }
11
# Real-world datasets
12
def adult(display=False, n_points=None) -> tuple[pd.DataFrame, np.ndarray]:
13
    """
14
    Census income prediction dataset (>50K income classification).
15
    
16
    Parameters:
17
    - display: Return human-readable labels instead of encoded values (bool)
18
    - n_points: Sample n data points (int, optional)
19
    
20
    Returns:
21
    (features, targets) - DataFrame with 14 features, binary target array
22
    """
23

24
def california(n_points=None) -> tuple[pd.DataFrame, np.ndarray]:
25
    """
26
    California housing regression dataset.
27
    
28
    Median house values for California districts with geographic and
29
    demographic features.
30
    
31
    Returns:
32
    (features, targets) - DataFrame with 8 features, continuous target array
33
    """
34

35
def imagenet50(resolution=224, n_points=None) -> tuple[np.ndarray, np.ndarray]:
36
    """
37
    50 representative ImageNet images for background distributions.
38
    
39
    Parameters:
40
    - resolution: Image resolution (currently only 224 supported)
41
    - n_points: Sample n images (optional)
42
    
43
    Returns:
44
    (images, labels) - Image array (N, H, W, C), label array
45
    """
46

47
def imdb(n_points=None) -> tuple[list[str], np.ndarray]:
48
    """
49
    Movie review sentiment classification dataset.
50
    
51
    Returns:
52
    (reviews, sentiments) - List of review text strings, binary sentiment array
53
    """
54

55
def diabetes(n_points=None) -> tuple[pd.DataFrame, np.ndarray]:
56
    """
57
    Diabetes progression prediction dataset.
58
    
59
    Physiological measurements predicting diabetes progression after one year.
60
    
61
    Returns:
62
    (features, targets) - DataFrame with 10 features, continuous target array
63
    """
64

65
def iris(display=False, n_points=None) -> tuple[pd.DataFrame, np.ndarray]:
66
    """
67
    Classic iris flower classification dataset.
68
    
69
    Parameters:
70
    - display: Return species names instead of encoded labels (bool)
71
    
72
    Returns:
73
    (features, targets) - DataFrame with 4 features, class labels
74
    """
75

76
def linnerud(n_points=None) -> tuple[pd.DataFrame, pd.DataFrame]:
77
    """
78
    Multi-target physiological/exercise dataset.
79
    
80
    Exercise measurements predicting physiological parameters.
81
    
82
    Returns:
83
    (exercise_features, physiological_targets) - Both as DataFrames
84
    """
85

86
def nhanesi(display=False, n_points=None) -> tuple[pd.DataFrame, np.ndarray]:
87
    """
88
    NHANES I survival analysis dataset.
89
    
90
    National Health and Nutrition Examination Survey data with survival times
91
    as labels, used for survival analysis and mortality prediction tasks.
92
    
93
    Parameters:
94
    - display: Return features with modified display format (bool)
95
    - n_points: Number of data points to sample (int, optional)
96
    
97
    Returns:
98
    (features, survival_times) - DataFrame with health measurements, survival time array
99
    """
100

101
def communitiesandcrime(n_points=None) -> tuple[pd.DataFrame, np.ndarray]:
102
    """
103
    Communities and Crime regression dataset from UCI ML Repository.
104
    
105
    Community demographic and social features for predicting total number
106
    of violent crimes per 100K population.
107
    
108
    Parameters:
109
    - n_points: Number of data points to sample (int, optional)
110
    
111
    Returns:
112
    (features, crime_rates) - DataFrame with community features, crime rate targets
113
    """
114

115
# Sparse and ranking datasets  
116
def a1a(n_points=None) -> tuple[scipy.sparse.csr_matrix, np.ndarray]:
117
    """
118
    Sparse binary classification dataset in SVM light format.
119
    
120
    High-dimensional sparse feature matrix for binary classification,
121
    commonly used for testing sparse algorithms.
122
    
123
    Parameters:
124
    - n_points: Number of data points to sample (int, optional)
125
    
126
    Returns:
127
    (sparse_features, binary_targets) - CSR sparse matrix and binary labels
128
    """
129

130
def rank() -> tuple[scipy.sparse.csr_matrix, np.ndarray, scipy.sparse.csr_matrix, 
131
                    np.ndarray, np.ndarray, np.ndarray]:
132
    """
133
    Learning-to-rank datasets from LightGBM repository.
134
    
135
    Ranking datasets with query-document pairs and relevance judgments,
136
    used for learning-to-rank model evaluation.
137
    
138
    Returns:
139
    (train_X, train_y, test_X, test_y, train_queries, test_queries) - 
140
    Training/test sparse matrices, relevance labels, and query group IDs
141
    """
142

143
# Synthetic datasets
144
def corrgroups60(n_points=1000) -> tuple[pd.DataFrame, np.ndarray]:
145
    """
146
    Synthetic dataset with 60 features organized in correlated groups.
147
    
148
    Generated dataset with known correlation structure between distinct
149
    feature groups, useful for testing correlation-aware algorithms.
150
    
151
    Parameters:
152
    - n_points: Number of data points to generate (int, default: 1000)
153
    
154
    Returns:
155
    (features, targets) - DataFrame with correlated features, linear targets
156
    """
157

158
def independentlinear60(n_points=1000) -> tuple[pd.DataFrame, np.ndarray]:
159
    """
160
    Synthetic dataset with 60 independent linear features.
161
    
162
    Generated dataset with independent Gaussian features and linear
163
    target relationships, used for benchmarking linear methods.
164
    
165
    Parameters:
166
    - n_points: Number of data points to generate (int, default: 1000)
167
    
168
    Returns:
169
    (features, targets) - DataFrame with independent features, linear targets
170
    """
171
```
172

173
**Usage Example:**
174

175
```python
176
import shap
177

178
# Load real-world dataset
179
X, y = shap.datasets.adult()
180
print(f"Adult dataset: {X.shape[0]} samples, {X.shape[1]} features")
181

182
# Load image dataset for computer vision
183
images, labels = shap.datasets.imagenet50(n_points=10)
184
print(f"ImageNet sample: {images.shape}")
185

186
# Load text dataset for NLP
187
reviews, sentiments = shap.datasets.imdb(n_points=100)
188
print(f"IMDB sample: {len(reviews)} reviews")
189
```
190

191
### Masking Strategies
192

193
Sophisticated masking approaches for different data types, handling feature dependencies and realistic perturbations.
194

195
```python { .api }
196
class Masker:
197
    """Abstract base class for all maskers."""
198
    def __call__(self, mask, *args):
199
        """Apply masking with binary mask array."""
200
    
201
    @property
202
    def shape(self):
203
        """Expected input dimensions."""
204
    
205
    @property  
206
    def supports_delta_masking(self):
207
        """Whether masker supports efficient delta masking."""
208

209
# Tabular data maskers
210
class Independent:
211
    """
212
    Independent feature masking with background data integration.
213
    
214
    Replaces masked features with values sampled independently
215
    from background distribution.
216
    """
217
    def __init__(self, data, max_samples=100):
218
        """
219
        Parameters:
220
        - data: Background dataset for sampling replacement values
221
        - max_samples: Maximum background samples to use
222
        """
223

224
class Partition:
225
    """
226
    Hierarchical feature masking respecting feature correlations.
227
    
228
    Groups correlated features and masks them together to maintain
229
    realistic feature relationships.
230
    """
231
    def __init__(self, data, max_samples=100, clustering="correlation"):
232
        """
233
        Parameters:
234
        - data: Background dataset for correlation analysis
235
        - clustering: Clustering method ("correlation", "tree", custom)
236
        """
237

238
class Impute:
239
    """
240
    Missing value imputation for masking.
241
    
242
    Uses feature correlations to impute realistic values for
243
    masked features instead of random sampling.
244
    """
245
    def __init__(self, data, method="linear"):
246
        """
247
        Parameters:
248
        - data: Training data for imputation model
249
        - method: Imputation method ("linear", "tree", "knn")
250
        """
251

252
# Specialized maskers  
253
class Text:
254
    """
255
    Text tokenization and masking for NLP models.
256
    
257
    Handles tokenization, special tokens, and text-specific
258
    masking strategies for language models.
259
    """
260
    def __init__(self, tokenizer=None, mask_token=None, 
261
                 collapse_mask_token="auto", output_type="string"):
262
        """
263
        Parameters:
264
        - tokenizer: Custom tokenizer (optional, uses default splitting)
265
        - mask_token: Token to use for masking (e.g., "[MASK]")
266
        - collapse_mask_token: How to handle consecutive masked tokens
267
        - output_type: Output format ("string", "token_ids", "tokens")
268
        """
269

270
class Image:
271
    """
272
    Image region masking with realistic perturbations.
273
    
274
    Supports various masking strategies including blur, inpainting,
275
    and noise for computer vision models.
276
    """
277
    def __init__(self, mask_value, shape=None):
278
        """
279
        Parameters:
280
        - mask_value: Value/strategy for masked regions (scalar, "blur", "inpaint", "noise")
281
        - shape: Expected image shape (optional, inferred from data)
282
        """
283

284
class Fixed:
285
    """
286
    Fixed background values for masking.
287
    
288
    Simple masking strategy using predetermined values
289
    for all masked features.
290
    """
291
    def __init__(self, mask_value):
292
        """
293
        Parameters:
294
        - mask_value: Fixed value(s) to use for masking
295
        """
296

297
# Composite maskers
298
class Composite:
299
    """
300
    Combine multiple maskers for different feature groups.
301
    
302
    Allows different masking strategies for different parts
303
    of the input (e.g., tabular + text + image).
304
    """
305
    def __init__(self, **maskers):
306
        """
307
        Parameters:
308
        - **maskers: Named maskers for different feature groups
309
        """
310

311
class FixedComposite:
312
    """Fixed composite masking with predetermined feature groups."""
313
    def __init__(self, **maskers):
314
        """Initialize with fixed feature-to-masker mapping."""
315

316
class OutputComposite:
317
    """Output-specific masking for multi-output models."""
318
    def __init__(self, **maskers):
319
        """Initialize with output-specific masking strategies."""
320
```
321

322
### Model Wrappers
323

324
Standardized model interfaces for consistent explainer usage across different frameworks.
325

326
```python { .api }
327
class Model:
328
    """
329
    Universal model wrapper with automatic tensor conversion.
330
    
331
    Standardizes model interfaces and handles tensor conversions
332
    between NumPy arrays and framework-specific tensors.
333
    """
334
    def __init__(self, model=None):
335
        """
336
        Parameters:
337
        - model: Model object to wrap (optional, can be set later)
338
        """
339
    
340
    def __call__(self, *args):
341
        """
342
        Call wrapped model with automatic tensor conversion.
343
        
344
        Converts NumPy inputs to appropriate framework tensors,
345
        calls model, and converts outputs back to NumPy arrays.
346
        """
347
    
348
    def save(self, out_file):
349
        """Serialize model to file."""
350
    
351
    @staticmethod
352
    def load(in_file, instantiate=True):
353
        """Load model from file."""
354

355
class TeacherForcing:
356
    """
357
    Model wrapper for teacher forcing in sequence models.
358
    
359
    Handles sequence generation with known target sequences
360
    during training/explanation phases.
361
    """
362
    def __init__(self, model, similarity_model=None, masker=None):
363
        """Initialize teacher forcing wrapper for sequence models."""
364

365
class TextGeneration:
366
    """
367
    Wrapper for text generation models.
368
    
369
    Standardizes interface for autoregressive text models
370
    with generation parameters and stopping criteria.
371
    """
372
    def __init__(self, model, masker=None, similarity_model=None):
373
        """Initialize text generation model wrapper."""
374

375
class TopKLM:
376
    """
377
    Top-K language model wrapper.
378
    
379
    Restricts language model outputs to top-K most likely tokens
380
    for more stable explanations.
381
    """
382
    def __init__(self, model, similarity_model=None, masker=None):
383
        """Initialize top-K language model wrapper."""
384

385
class TransformersPipeline:
386
    """
387
    HuggingFace transformers pipeline wrapper.
388
    
389
    Integrates with HuggingFace pipelines for standardized
390
    transformer model interfaces.
391
    """
392
    def __init__(self, pipeline):
393
        """
394
        Parameters:
395
        - pipeline: HuggingFace pipeline object
396
        """
397
```
398

399
### Utility Functions
400

401
Helper functions for data manipulation, sampling, and analysis workflows.
402

403
```python { .api }
404
# Sampling and data manipulation
405
def sample(X, nsamples=100, random_state=0):
406
    """
407
    Sample data points without replacement.
408
    
409
    Parameters:
410
    - X: Input data (array, DataFrame, sparse matrix)
411
    - nsamples: Number of samples to draw
412
    - random_state: Random seed for reproducibility
413
    
414
    Returns:
415
    Sampled data in same format as input
416
    """
417

418
def approximate_interactions(index, shap_values, X, feature_names=None) -> np.ndarray:
419
    """
420
    Find features with high interactions with target feature.
421
    
422
    Parameters:
423
    - index: Target feature index or name
424
    - shap_values: SHAP values array or Explanation object
425
    - X: Input feature data
426
    - feature_names: List of feature names (optional)
427
    
428
    Returns:
429
    Array of interaction strength scores for each feature
430
    """
431

432
# Clustering functions
433
def hclust(data, metric="sqeuclidean"):
434
    """
435
    Hierarchical clustering of features.
436
    
437
    Parameters:
438
    - data: Feature data for clustering
439
    - metric: Distance metric for clustering
440
    
441
    Returns:
442
    Clustering linkage matrix
443
    """
444

445
def hclust_ordering(X, metric="sqeuclidean"):
446
    """
447
    Optimal leaf ordering for hierarchical clustering dendrograms.
448
    
449
    Minimizes distances between adjacent leaves in dendrogram.
450
    """
451

452
def delta_minimization_order():
453
    """Compute ordering that minimizes partition tree delta."""
454

455
def partition_tree():  
456
    """Create hierarchical partition tree for feature grouping."""
457

458
def partition_tree_shuffle():
459
    """Shuffle partition tree leaves while preserving structure."""
460

461
# Mathematical utilities
462
def shapley_coefficients(n) -> np.ndarray:
463
    """
464
    Compute Shapley coefficients for n players.
465
    
466
    Parameters:
467
    - n: Number of features/players
468
    
469
    Returns:
470
    Array of Shapley coefficients
471
    """
472

473
# Utility classes
474
class OpChain:
475
    """
476
    Chainable operations for delayed execution.
477
    
478
    Enables method chaining on Explanation objects with
479
    lazy evaluation for performance optimization.
480
    """
481
    def __init__(self, op, *args, **kwargs):
482
        """Initialize operation chain."""
483
    
484
    def __call__(self, obj):
485
        """Apply operation chain to object."""
486

487
class MaskedModel:
488
    """
489
    Wrapper for masked model evaluation.
490
    
491
    Handles feature masking during model evaluation with
492
    efficient batching and caching.
493
    """
494
    def __init__(self, model, masker, *args, **kwargs):
495
        """
496
        Parameters:
497
        - model: Model function to wrap
498
        - masker: Masker object for feature perturbation
499
        """
500
    
501
    def __call__(self, masks, *args, **kwargs):
502
        """Evaluate model with masked inputs."""
503

504
def make_masks():
505
    """Generate binary masks for features."""
506

507
# Display and progress utilities
508
def show_progress():
509
    """Display progress bars for long computations."""
510

511
# Import and error handling
512
def assert_import(package_name):
513
    """Assert that required package is available."""
514

515
def record_import_error(package_name, msg, e):
516
    """Record import errors for debugging."""
517

518
def safe_isinstance(obj, class_path_str) -> bool:
519
    """Safe type checking without importing classes."""
520

521
# String formatting utilities  
522
def format_value(s, format_str):
523
    """Format values for display in plots and outputs."""
524

525
def ordinal_str(n):
526
    """Convert numbers to ordinal strings (1st, 2nd, 3rd, etc.)."""
527

528
def convert_name():
529
    """Convert feature names between different formats."""
530

531
def potential_interactions(shap_values_column, shap_values_matrix):
532
    """
533
    Order features by interaction strength with target feature.
534
    
535
    Bins SHAP values for a feature along that feature's value to identify
536
    potential interactions. For exact Shapley interaction values, use
537
    interaction_contribs in XGBoost.
538
    
539
    Parameters:
540
    - shap_values_column: SHAP values for target feature
541
    - shap_values_matrix: SHAP values matrix for all features
542
    
543
    Returns:
544
    Feature ordering by interaction strength
545
    """
546

547
def make_masks(cluster_matrix):
548
    """
549
    Build sparse CSR mask matrix from hierarchical clustering.
550
    
551
    Optimized function for creating binary masks from clustering results,
552
    particularly useful for large image datasets and tree structures.
553
    
554
    Parameters:
555
    - cluster_matrix: Hierarchical clustering matrix
556
    
557
    Returns:
558
    scipy.sparse.csr_matrix: Binary mask matrix for feature groups
559
    """
560

561
def suppress_stderr():
562
    """Context manager to suppress stderr output during operations."""
563
```
564

565
### Action Optimization
566

567
Framework for constrained optimization and action recommendation.
568

569
```python { .api }
570
class Action:
571
    """
572
    Abstract action class with cost parameter.
573
    
574
    Base class for defining actions in optimization problems
575
    with associated costs and execution logic.
576
    """
577
    def __init__(self, cost):
578
        """
579
        Parameters:
580
        - cost: Cost of executing this action (numeric)
581
        """
582
    
583
    def __call__(self, *args):
584
        """Execute the action - must be implemented by subclasses."""
585
    
586
    def __lt__(self, other_action):
587
        """Compare actions by cost for priority queue ordering."""
588

589
class ActionOptimizer:
590
    """
591
    Optimize action sequences to satisfy model constraints.
592
    
593
    Uses priority queue search to find minimum-cost action sequences
594
    that satisfy specified model constraints.
595
    
596
    Warning:
597
        ActionOptimizer is in alpha state and subject to API changes.
598
    """
599
    def __init__(self, model, actions):
600
        """
601
        Parameters:
602
        - model: Function returning True when constraints are satisfied
603
        - actions: List of Action objects or lists of mutually exclusive actions
604
        """
605
    
606
    def __call__(self, *args, max_evals=10000):
607
        """
608
        Find optimal action sequence.
609
        
610
        Parameters:
611
        - max_evals: Maximum evaluations before raising ConvergenceError
612
        
613
        Returns:
614
        List of actions that satisfy constraints with minimum cost
615
        """
616
```
617

618
### Link Functions
619

620
Output transformation functions for different model types and scales.
621

622
```python { .api }
623
def identity(x):
624
    """
625
    Identity link function (no transformation).
626
    
627
    Returns input unchanged. Used for regression models
628
    and when no output transformation is needed.
629
    
630
    Parameters:
631
    - x: Input values
632
    
633
    Returns:
634
    Unchanged input values
635
    """
636

637
identity.inverse = lambda x: x  # Inverse transformation
638

639
def logit(x):
640
    """
641
    Logit link function for probability to log-odds conversion.
642
    
643
    Transforms probabilities [0,1] to log-odds (-∞,∞).
644
    Useful for binary classification models.
645
    
646
    Parameters:
647
    - x: Probability values in [0,1]
648
    
649
    Returns:
650
    Log-odds values log(x/(1-x))
651
    """
652

653
logit.inverse = lambda x: 1 / (1 + np.exp(-x))  # Sigmoid inverse
654
```
655

656
## Usage Patterns
657

658
### Dataset Loading and Preprocessing
659

660
```python
661
import shap
662

663
# Load dataset with optional sampling
664
X, y = shap.datasets.adult(n_points=1000)
665

666
# Use for model training
667
from sklearn.ensemble import RandomForestClassifier
668
model = RandomForestClassifier()
669
model.fit(X, y)
670

671
# Background data for explanations
672
X_background = shap.utils.sample(X, 100)
673
```
674

675
### Masking Strategy Selection
676

677
```python
678
# Tabular data with correlations
679
masker = shap.maskers.Partition(X_background, clustering="correlation")
680

681
# Text data
682
masker = shap.maskers.Text(mask_token="[MASK]", output_type="string")
683

684
# Image data
685
masker = shap.maskers.Image(mask_value="blur")
686

687
# Composite data (tabular + text)
688
masker = shap.maskers.Composite(
689
    tabular=shap.maskers.Independent(X_tabular),
690
    text=shap.maskers.Text()
691
)
692
```
693

694
### Model Wrapping and Standardization
695

696
```python
697
# Wrap PyTorch model for consistent interface
698
wrapped_model = shap.models.Model(pytorch_model)
699

700
# Use with any explainer
701
explainer = shap.KernelExplainer(wrapped_model, X_background)
702
shap_values = explainer(X_test)
703
```
704

705
### Error Handling
706

707
Common utility errors and solutions:
708

709
- **DataError**: Invalid data format or empty dataset
710
- **DimensionError**: Incompatible data dimensions between components
711
- **ImportError**: Missing optional dependencies for specific maskers/models
712
- **ValueError**: Invalid parameters for utility functions
713
- **ConvergenceError**: Action optimization failed to find solution (ActionOptimizer)

Version

Tile

Files

utilities.md.css-3qkkll{font-size:var(--chakra-font-sizes-sm);font-weight:var(--chakra-font-weights-normal);color:var(--chakra-colors-gray-300);}docs/

utilities.mddocs/