Tessl Tile for pypi/skl2onnx@1.19.0

# Helper Utilities

Investigation and integration utilities for debugging conversions, comparing outputs between scikit-learn and ONNX models, analyzing pipeline structures, and integrating custom ONNX graphs. These utilities support development, testing, and troubleshooting of ONNX conversions.

## Capabilities

### Investigation and Debugging

Tools for analyzing conversion processes, collecting intermediate results, and debugging conversion issues.

```python { .api }
def collect_intermediate_steps(model, X=None, target_opset=None):
    """
    Collect intermediate outputs during conversion process for debugging.

    Provides detailed information about shape inference, operator creation,
    and conversion steps to help diagnose conversion issues.

    Parameters:
    - model: scikit-learn model to analyze
    - X: array-like, sample input data for type inference (optional)
    - target_opset: int, target ONNX opset version (optional)

    Returns:
    - dict: Detailed conversion information including:
      - 'shapes': Shape inference results for each step
      - 'operators': Generated ONNX operators
      - 'variables': Variable names and types
      - 'topology': Model topology structure
    """

def compare_objects(sklearn_output, onnx_output, decimal=5):
    """
    Compare outputs between scikit-learn and ONNX models.

    Validates conversion accuracy by comparing predictions from original
    sklearn model with converted ONNX model outputs.

    Parameters:
    - sklearn_output: array-like, output from sklearn model
    - onnx_output: array-like, output from ONNX model
    - decimal: int, number of decimal places for comparison (default 5)

    Returns:
    - bool: True if outputs match within specified precision

    Raises:
    - AssertionError: If outputs don't match within tolerance
    - ValueError: If output shapes or types are incompatible
    """

def enumerate_pipeline_models(model):
    """
    Enumerate all models within a pipeline or ensemble.

    Recursively discovers all sub-models in complex pipelines,
    feature unions, and ensemble models for analysis or debugging.

    Parameters:
    - model: scikit-learn model, pipeline, or ensemble

    Returns:
    - list: List of tuples (model_name, model_instance, path)
      where path indicates the location within the pipeline structure
    """
```
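
To make the `decimal` parameter concrete, the following is a minimal standard-library sketch of a decimal-place comparison. It assumes the tolerance convention used by NumPy's `assert_almost_equal` (absolute difference below `1.5 * 10**-decimal`); whether `compare_objects` applies exactly this rule is an assumption, and `outputs_match` is a hypothetical name, not part of skl2onnx.

```python
from typing import Sequence


def outputs_match(expected: Sequence[float], actual: Sequence[float], decimal: int = 5) -> bool:
    """Return True when every pair differs by less than 1.5 * 10**-decimal.

    Hypothetical helper for illustration only; not the skl2onnx implementation.
    """
    if len(expected) != len(actual):
        raise ValueError(f"length mismatch: {len(expected)} != {len(actual)}")
    tolerance = 1.5 * 10 ** (-decimal)
    return all(abs(e - a) < tolerance for e, a in zip(expected, actual))


# Made-up values that differ by about 5e-6, the kind of gap float32 rounding can cause.
sklearn_like = [0.123456, 0.876544]
onnx_like = [0.123451, 0.876549]

print(outputs_match(sklearn_like, onnx_like, decimal=5))  # tolerance 1.5e-5
print(outputs_match(sklearn_like, onnx_like, decimal=6))  # tolerance 1.5e-6
```

Lowering `decimal` loosens the check, which is often needed when the ONNX model computes in float32 while scikit-learn computes in float64.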

### Integration Utilities

Functions for integrating custom ONNX graphs and extending existing models.

```python { .api }
def add_onnx_graph(onx, to_add, inputs, outputs):
    """
    Add a custom ONNX graph to an existing ONNX model.

    Enables integration of custom operators or preprocessing/postprocessing
    steps by merging ONNX graphs while maintaining proper variable connections.

    Parameters:
    - onx: ModelProto, existing ONNX model
    - to_add: GraphProto or ModelProto, graph/model to add
    - inputs: list, input variable names for connection
    - outputs: list, output variable names for connection

    Returns:
    - ModelProto: Modified ONNX model with integrated graph

    Raises:
    - ValueError: If input/output connections are invalid
    - TypeError: If graph types are incompatible
    """
```
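
What "proper variable connections" means can be sketched without any ONNX dependency: every node input must be a graph input, an initializer, or the output of an earlier node (ONNX graphs are expected to be topologically sorted). The checker below is a hypothetical illustration of that rule using plain tuples; `find_unconnected_inputs` is an assumed name, and this is not how `add_onnx_graph` is implemented.

```python
from typing import Iterable, List, Sequence, Tuple

# Simplified node description: (op_type, input names, output names).
NodeSpec = Tuple[str, Sequence[str], Sequence[str]]


def find_unconnected_inputs(
    nodes: Iterable[NodeSpec],
    graph_inputs: Iterable[str],
    initializers: Iterable[str] = (),
) -> List[Tuple[str, str]]:
    """Return (op_type, input_name) pairs whose input has no earlier producer."""
    available = set(graph_inputs) | set(initializers)
    missing = []
    for op_type, inputs, outputs in nodes:
        for name in inputs:
            # An empty name marks an omitted optional input in ONNX, so skip it.
            if name and name not in available:
                missing.append((op_type, name))
        available.update(outputs)
    return missing


nodes = [
    ("Mul", ["input", "scale_factor"], ["processed"]),
    ("Add", ["processed", "bias"], ["shifted"]),  # "bias" is never defined
]
print(find_unconnected_inputs(nodes, ["input"], ["scale_factor"]))
```

Running a check like this on the merged node list before saving a model catches dangling names early, when they are still easy to trace back to the graph that introduced them.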

### Performance and Benchmarking

Utilities for measuring and comparing performance between sklearn and ONNX models.

```python { .api }
def measure_time(stmt, context, repeat=10, number=50, div_by_number=False):
    """
    Measure execution time for model operations.

    Provides accurate timing measurements for comparing sklearn vs ONNX
    model performance, including statistical analysis of multiple runs.

    Parameters:
    - stmt: str, statement to time (e.g., 'model.predict(X)')
    - context: dict, variable context dictionary for statement execution
    - repeat: int, number of timing runs for statistical analysis (default 10)
    - number: int, number of executions per timing run (default 50)
    - div_by_number: bool, divide timing results by number of executions (default False)

    Returns:
    - dict: Timing results including:
      - 'average': Average execution time
      - 'deviation': Standard deviation
      - 'min_exec': Minimum execution time
      - 'max_exec': Maximum execution time
      - 'repeat': Number of repeat runs
      - 'number': Number of executions per run
    """
```
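
The interplay of `repeat`, `number`, and `div_by_number` is easier to see in a small standard-library sketch built on `timeit`. This is an illustration of the documented parameters and result keys, not the library's implementation; `measure_time_sketch` is a hypothetical name, and the use of the population standard deviation is an assumption.

```python
import statistics
import timeit


def measure_time_sketch(stmt, context, repeat=10, number=50, div_by_number=False):
    """Time `stmt` over `repeat` runs, executing it `number` times per run."""
    timer = timeit.Timer(stmt, globals=context)
    runs = timer.repeat(repeat=repeat, number=number)  # total seconds per run
    if div_by_number:
        runs = [t / number for t in runs]  # seconds per single execution
    return {
        "average": statistics.fmean(runs),
        "deviation": statistics.pstdev(runs),  # assumption: population std dev
        "min_exec": min(runs),
        "max_exec": max(runs),
        "repeat": repeat,
        "number": number,
    }


context = {"values": list(range(1000))}
result = measure_time_sketch("sum(values)", context, repeat=5, number=20, div_by_number=True)
print(f"{result['average']:.2e}s per call (min {result['min_exec']:.2e}s)")
```

With `div_by_number=False` each figure covers a whole run of `number` executions, so two measurements are only comparable when they use the same `number` or both divide by it.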

## Usage Examples

### Debugging Conversion Issues

```python
from skl2onnx.helpers.investigate import collect_intermediate_steps
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Create model
X, y = make_classification(n_samples=100, n_features=10, random_state=42)
model = RandomForestClassifier(n_estimators=5, random_state=42)
model.fit(X, y)

# Collect detailed conversion information
debug_info = collect_intermediate_steps(model, X, target_opset=18)

# Analyze the results
print("Shape inference results:")
for step, shapes in debug_info['shapes'].items():
    print(f"  {step}: {shapes}")

print("\nGenerated operators:")
for i, op in enumerate(debug_info['operators']):
    print(f"  {i}: {op.op_type} ({op.inputs} -> {op.outputs})")

print("\nVariable information:")
for name, var_info in debug_info['variables'].items():
    print(f"  {name}: {var_info}")
```

### Validating Conversion Accuracy

```python
from skl2onnx.helpers.investigate import compare_objects
from skl2onnx import to_onnx
import onnxruntime as rt
import numpy as np

# `model` and `X` come from the previous example.
# Convert with float32 sample data so the ONNX input type matches the data fed below,
# and disable ZipMap so probabilities are returned as an array rather than a list of dicts.
X32 = X.astype(np.float32)
onnx_model = to_onnx(model, X32, options={id(model): {'zipmap': False}})

# Get sklearn predictions
sklearn_pred = model.predict_proba(X)

# Get ONNX predictions (output 0 holds labels, output 1 holds probabilities)
sess = rt.InferenceSession(onnx_model.SerializeToString())
input_name = sess.get_inputs()[0].name
onnx_pred = sess.run(None, {input_name: X32})[1]

# Compare outputs
try:
    match = compare_objects(sklearn_pred, onnx_pred, decimal=4)
    print("Conversion validated: outputs match within tolerance")
except AssertionError as e:
    print(f"Conversion issue detected: {e}")
```

### Analyzing Pipeline Structure

```python
from skl2onnx.helpers.investigate import enumerate_pipeline_models
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest
from sklearn.ensemble import RandomForestClassifier

# Create complex pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('selector', SelectKBest(k=5)),
    ('classifier', RandomForestClassifier(n_estimators=10))
])
pipeline.fit(X, y)

# Enumerate all models in pipeline
models = enumerate_pipeline_models(pipeline)

print("Pipeline structure:")
for name, model_instance, path in models:
    print(f"  {path}: {name} ({type(model_instance).__name__})")
```
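
The recursive discovery described for `enumerate_pipeline_models` can be pictured as a depth-first walk that records a path of child indices. The sketch below uses a hypothetical `Node` dataclass as a stand-in for estimators so that it runs with the standard library alone; it illustrates the traversal idea and is not the skl2onnx implementation.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Node:
    """Hypothetical stand-in for an estimator; `steps` holds named children."""
    kind: str
    steps: List[Tuple[str, "Node"]] = field(default_factory=list)


def enumerate_nodes(
    name: str, node: Node, path: Tuple[int, ...] = (0,)
) -> Iterator[Tuple[str, Node, Tuple[int, ...]]]:
    """Yield (name, node, path) depth-first, each parent before its children."""
    yield name, node, path
    for index, (child_name, child) in enumerate(node.steps):
        yield from enumerate_nodes(child_name, child, path + (index,))


pipeline_tree = Node("Pipeline", [
    ("preprocessing", Node("ColumnTransformer", [
        ("num", Node("StandardScaler")),
        ("cat", Node("OneHotEncoder")),
    ])),
    ("classifier", Node("RandomForestClassifier")),
])

for name, node, path in enumerate_nodes("pipeline", pipeline_tree):
    print(path, name, node.kind)
```

Each path extends its parent's path by one index, so a path such as `(0, 0, 1)` reads as "second child of the first child of the root" and identifies a component unambiguously even when step names repeat.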

### Adding Custom ONNX Operations

```python
from skl2onnx.helpers.integration import add_onnx_graph
from skl2onnx import to_onnx
import onnx
from onnx import helper, TensorProto

# Convert base model
base_model = to_onnx(model, X)

# Create custom preprocessing graph
custom_inputs = [helper.make_tensor_value_info('input', TensorProto.FLOAT, [None, 10])]
custom_outputs = [helper.make_tensor_value_info('processed', TensorProto.FLOAT, [None, 10])]

# Custom operation: multiply by constant
multiply_node = helper.make_node(
    'Mul',
    inputs=['input', 'scale_factor'],
    outputs=['processed'],
    name='custom_scaling'
)

# Create scale factor initializer
scale_factor = helper.make_tensor(
    'scale_factor',
    TensorProto.FLOAT,
    [1],
    [2.0]  # Scale factor value
)

custom_graph = helper.make_graph(
    [multiply_node],
    'custom_preprocessing',
    custom_inputs,
    custom_outputs,
    [scale_factor]
)

# Integrate custom graph with base model
enhanced_model = add_onnx_graph(
    base_model,
    custom_graph,
    inputs=['input'],
    outputs=['processed']
)
```

### Performance Benchmarking

```python
from skl2onnx.tutorial import measure_time
import onnxruntime as rt
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from skl2onnx import to_onnx

# Create and train model
X_test = np.random.randn(1000, 10).astype(np.float32)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_test[:100], np.random.randint(0, 2, 100))

# Convert to ONNX
onnx_model = to_onnx(model, X_test[:1])
sess = rt.InferenceSession(onnx_model.SerializeToString())
input_name = sess.get_inputs()[0].name

# Measure sklearn performance
sklearn_context = {
    'model': model,
    'X_test': X_test
}
sklearn_times = measure_time(
    'model.predict_proba(X_test)',
    context=sklearn_context,
    number=10,
    repeat=5
)

# Measure ONNX performance
onnx_context = {
    'sess': sess,
    'input_name': input_name,
    'X_test': X_test
}
onnx_times = measure_time(
    'sess.run(None, {input_name: X_test})',
    context=onnx_context,
    number=10,
    repeat=5
)

print(f"Sklearn average time: {sklearn_times['average']:.4f}s (±{sklearn_times['deviation']:.4f})")
print(f"ONNX average time: {onnx_times['average']:.4f}s (±{onnx_times['deviation']:.4f})")
print(f"Speedup: {sklearn_times['average'] / onnx_times['average']:.2f}x")
```

### Advanced Pipeline Analysis

```python
# Analyze complex nested pipeline
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from skl2onnx.helpers.investigate import enumerate_pipeline_models

# Create complex pipeline with column transformer
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), [0, 1, 2]),
    ('cat', OneHotEncoder(), [3, 4])
])

complex_pipeline = Pipeline([
    ('preprocessing', preprocessor),
    ('classifier', RandomForestClassifier())
])

# Enumerate all components
all_models = enumerate_pipeline_models(complex_pipeline)

print("Complex pipeline analysis:")
for name, instance, path in all_models:
    print(f"  {path}: {name}")
    if hasattr(instance, 'get_params'):
        # Keep only top-level parameters: nested ones use 'step__param' keys
        key_params = {k: v for k, v in instance.get_params().items()
                      if '__' not in k and not callable(v)}
        print(f"    Key parameters: {key_params}")
```

## Debugging Guidelines

### Common Investigation Patterns

1. **Shape Mismatches**: Use `collect_intermediate_steps` to trace shape inference
2. **Type Errors**: Check data type consistency with `compare_objects`
3. **Pipeline Issues**: Use `enumerate_pipeline_models` to understand structure
4. **Performance Problems**: Use `measure_time` for systematic benchmarking

### Troubleshooting Tips

- **Enable verbose logging** during conversion for detailed information
- **Compare intermediate outputs** at each pipeline stage
- **Validate with simple test cases** before complex scenarios
- **Check ONNX opset compatibility** for target deployment environment

### Integration Best Practices

- **Test custom graphs separately** before integration
- **Validate variable connections** between graph components
- **Consider performance implications** of additional operations
- **Document custom modifications** for maintainability
