# Helper Utilities

Investigation and integration utilities for debugging conversions, comparing outputs between scikit-learn and ONNX models, analyzing pipeline structures, and integrating custom ONNX graphs. These utilities support development, testing, and troubleshooting of ONNX conversions.

## Capabilities

### Investigation and Debugging

Tools for analyzing conversion processes, collecting intermediate results, and debugging conversion issues.

```python { .api }
def collect_intermediate_steps(model, X=None, target_opset=None):
    """
    Collect intermediate outputs during conversion process for debugging.

    Provides detailed information about shape inference, operator creation,
    and conversion steps to help diagnose conversion issues.

    Parameters:
    - model: scikit-learn model to analyze
    - X: array-like, sample input data for type inference (optional)
    - target_opset: int, target ONNX opset version (optional)

    Returns:
    - dict: Detailed conversion information including:
      - 'shapes': Shape inference results for each step
      - 'operators': Generated ONNX operators
      - 'variables': Variable names and types
      - 'topology': Model topology structure
    """

def compare_objects(sklearn_output, onnx_output, decimal=5):
    """
    Compare outputs between scikit-learn and ONNX models.

    Validates conversion accuracy by comparing predictions from original
    sklearn model with converted ONNX model outputs.

    Parameters:
    - sklearn_output: array-like, output from sklearn model
    - onnx_output: array-like, output from ONNX model
    - decimal: int, number of decimal places for comparison (default 5)

    Returns:
    - bool: True if outputs match within specified precision

    Raises:
    - AssertionError: If outputs don't match within tolerance
    - ValueError: If output shapes or types are incompatible
    """

def enumerate_pipeline_models(model):
    """
    Enumerate all models within a pipeline or ensemble.

    Recursively discovers all sub-models in complex pipelines,
    feature unions, and ensemble models for analysis or debugging.

    Parameters:
    - model: scikit-learn model, pipeline, or ensemble

    Returns:
    - list: List of tuples (model_name, model_instance, path)
      where path indicates the location within the pipeline structure
    """
```

### Integration Utilities

Functions for integrating custom ONNX graphs and extending existing models.

```python { .api }
def add_onnx_graph(onx, to_add, inputs, outputs):
    """
    Add a custom ONNX graph to an existing ONNX model.

    Enables integration of custom operators or preprocessing/postprocessing
    steps by merging ONNX graphs while maintaining proper variable connections.

    Parameters:
    - onx: ModelProto, existing ONNX model
    - to_add: GraphProto or ModelProto, graph/model to add
    - inputs: list, input variable names for connection
    - outputs: list, output variable names for connection

    Returns:
    - ModelProto: Modified ONNX model with integrated graph

    Raises:
    - ValueError: If input/output connections are invalid
    - TypeError: If graph types are incompatible
    """
```

### Performance and Benchmarking

Utilities for measuring and comparing performance between sklearn and ONNX models.

```python { .api }
def measure_time(stmt, context, repeat=10, number=50, div_by_number=False):
    """
    Measure execution time for model operations.

    Provides accurate timing measurements for comparing sklearn vs ONNX
    model performance, including statistical analysis of multiple runs.

    Parameters:
    - stmt: str, statement to time (e.g., 'model.predict(X)')
    - context: dict, variable context dictionary for statement execution
    - repeat: int, number of timing runs for statistical analysis (default 10)
    - number: int, number of executions per timing run (default 50)
    - div_by_number: bool, divide timing results by number of executions (default False)

    Returns:
    - dict: Timing results including:
      - 'average': Average execution time
      - 'deviation': Standard deviation
      - 'min_exec': Minimum execution time
      - 'max_exec': Maximum execution time
      - 'repeat': Number of repeat runs
      - 'number': Number of executions per run
    """
```
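
To make the interplay of `repeat`, `number`, and `div_by_number` concrete, here is a minimal standard-library sketch. The function `measure_time_sketch` is a hypothetical stand-in written for illustration, not the library's implementation; it only mirrors the documented parameters and result keys, and it assumes the deviation is a population standard deviation over the runs.

```python
import statistics
import timeit


def measure_time_sketch(stmt, context, repeat=10, number=50, div_by_number=False):
    """Hypothetical stand-in mirroring the documented parameters and result keys."""
    # Each of the `repeat` entries is the total time of `number` executions.
    runs = timeit.repeat(stmt, globals=context, repeat=repeat, number=number)
    if div_by_number:
        # Report the time of a single execution instead of a whole run.
        runs = [t / number for t in runs]
    return {
        "average": statistics.fmean(runs),
        "deviation": statistics.pstdev(runs),  # assumption: population deviation
        "min_exec": min(runs),
        "max_exec": max(runs),
        "repeat": repeat,
        "number": number,
    }


result = measure_time_sketch(
    "sum(values)",
    {"values": list(range(1000))},
    repeat=3,
    number=20,
    div_by_number=True,
)
print(result)
```

Without `div_by_number`, each reported value covers `number` executions, so results taken with different `number` settings are only comparable after dividing.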

## Usage Examples

### Debugging Conversion Issues

```python
from skl2onnx.helpers.investigate import collect_intermediate_steps
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Create model
X, y = make_classification(n_samples=100, n_features=10, random_state=42)
model = RandomForestClassifier(n_estimators=5, random_state=42)
model.fit(X, y)

# Collect detailed conversion information
debug_info = collect_intermediate_steps(model, X, target_opset=18)

# Analyze the results
print("Shape inference results:")
for step, shapes in debug_info['shapes'].items():
    print(f"  {step}: {shapes}")

print("\nGenerated operators:")
for i, op in enumerate(debug_info['operators']):
    print(f"  {i}: {op.op_type} ({op.inputs} -> {op.outputs})")

print("\nVariable information:")
for name, var_info in debug_info['variables'].items():
    print(f"  {name}: {var_info}")
```

### Validating Conversion Accuracy

```python
from skl2onnx.helpers.investigate import compare_objects
from skl2onnx import to_onnx
import onnxruntime as rt
import numpy as np

# Convert the model using float32 inputs so the ONNX input type matches
# the data fed at inference time; disable ZipMap so probabilities are
# returned as an array rather than a list of dictionaries
X32 = X.astype(np.float32)
onnx_model = to_onnx(model, X32, options={id(model): {'zipmap': False}})

# Get sklearn predictions
sklearn_pred = model.predict_proba(X32)

# Get ONNX predictions (output 0 holds labels, output 1 probabilities)
sess = rt.InferenceSession(onnx_model.SerializeToString())
input_name = sess.get_inputs()[0].name
onnx_pred = sess.run(None, {input_name: X32})[1]

# Compare outputs
try:
    match = compare_objects(sklearn_pred, onnx_pred, decimal=4)
    print("Conversion validated: outputs match within tolerance")
except AssertionError as e:
    print(f"Conversion issue detected: {e}")
```
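
The effect of the `decimal` argument can be sketched without any dependencies. The helper `outputs_match` below is hypothetical, and it assumes that `decimal` follows the convention of `numpy.testing.assert_almost_equal`, where two values match when their absolute difference is below `1.5 * 10**(-decimal)`. Whether `compare_objects` applies exactly this rule is an assumption.

```python
def outputs_match(expected, actual, decimal=5):
    """Hypothetical helper: compare nested lists of numbers element by element."""
    expected_is_seq = isinstance(expected, (list, tuple))
    actual_is_seq = isinstance(actual, (list, tuple))
    if expected_is_seq != actual_is_seq:
        # One side is a sequence and the other a scalar: shapes differ.
        return False
    if expected_is_seq:
        return len(expected) == len(actual) and all(
            outputs_match(e, a, decimal) for e, a in zip(expected, actual)
        )
    # Assumed tolerance rule, borrowed from numpy.testing.assert_almost_equal.
    return abs(expected - actual) < 1.5 * 10 ** (-decimal)


# Made-up probabilities that differ by roughly 2e-5 to 3e-5.
sklearn_like = [[0.91234, 0.08766], [0.25001, 0.74999]]
onnx_like = [[0.91231, 0.08769], [0.25003, 0.74997]]

print(outputs_match(sklearn_like, onnx_like, decimal=4))
print(outputs_match(sklearn_like, onnx_like, decimal=5))
```

Under this rule the two made-up outputs agree at `decimal=4` but not at `decimal=5`, which is why float32 ONNX outputs are usually compared at four or five decimals rather than at full double precision.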

### Analyzing Pipeline Structure

```python
from skl2onnx.helpers.investigate import enumerate_pipeline_models
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest
from sklearn.ensemble import RandomForestClassifier

# Create complex pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('selector', SelectKBest(k=5)),
    ('classifier', RandomForestClassifier(n_estimators=10))
])
pipeline.fit(X, y)

# Enumerate all models in pipeline
models = enumerate_pipeline_models(pipeline)

print("Pipeline structure:")
for name, model_instance, path in models:
    print(f"  {path}: {name} ({type(model_instance).__name__})")
```
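
The kind of recursive traversal described above can be sketched as a small generator. `walk_estimators` is a hypothetical function written for illustration, not the library's implementation. It relies on duck typing over the `steps`, `transformer_list`, and `transformers` attributes that scikit-learn's `Pipeline`, `FeatureUnion`, and `ColumnTransformer` expose, and the demo uses `SimpleNamespace` stand-ins so that it runs without scikit-learn installed.

```python
from types import SimpleNamespace


def walk_estimators(estimator, name="root", path=()):
    """Yield (name, estimator, path) for an estimator and everything nested in it."""
    yield name, estimator, path
    # Each container attribute holds tuples that start with (name, estimator).
    for attr in ("steps", "transformer_list", "transformers"):
        children = getattr(estimator, attr, None)
        if not children:
            continue
        for index, entry in enumerate(children):
            child_name, child = entry[0], entry[1]
            # Skip placeholders such as "drop", "passthrough", or None.
            if child is None or isinstance(child, str):
                continue
            yield from walk_estimators(child, child_name, path + (index,))
        break  # an estimator is treated as one kind of container only


# Stand-ins shaped like a Pipeline wrapping a ColumnTransformer.
preprocessor = SimpleNamespace(
    transformers=[
        ("num", SimpleNamespace(), [0, 1, 2]),
        ("cat", SimpleNamespace(), [3, 4]),
    ]
)
pipeline_like = SimpleNamespace(
    steps=[("preprocessing", preprocessor), ("classifier", SimpleNamespace())]
)

for name, _, path in walk_estimators(pipeline_like):
    print(path, name)
```

Here the path is a tuple of positional indices, so `(0, 1)` refers to the second transformer inside the first pipeline step.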

### Adding Custom ONNX Operations

```python
from skl2onnx.helpers.integration import add_onnx_graph
from skl2onnx import to_onnx
import onnx
from onnx import helper, TensorProto

# Convert base model
base_model = to_onnx(model, X)

# Create custom preprocessing graph
custom_inputs = [helper.make_tensor_value_info('input', TensorProto.FLOAT, [None, 10])]
custom_outputs = [helper.make_tensor_value_info('processed', TensorProto.FLOAT, [None, 10])]

# Custom operation: multiply by constant
multiply_node = helper.make_node(
    'Mul',
    inputs=['input', 'scale_factor'],
    outputs=['processed'],
    name='custom_scaling'
)

# Create scale factor initializer
scale_factor = helper.make_tensor(
    'scale_factor',
    TensorProto.FLOAT,
    [1],
    [2.0]  # Scale factor value
)

custom_graph = helper.make_graph(
    [multiply_node],
    'custom_preprocessing',
    custom_inputs,
    custom_outputs,
    [scale_factor]
)

# Integrate custom graph with base model
enhanced_model = add_onnx_graph(
    base_model,
    custom_graph,
    inputs=['input'],
    outputs=['processed']
)
```

### Performance Benchmarking

```python
from skl2onnx.tutorial import measure_time
import onnxruntime as rt
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from skl2onnx import to_onnx

# Create and train model
X_test = np.random.randn(1000, 10).astype(np.float32)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_test[:100], np.random.randint(0, 2, 100))

# Convert to ONNX
onnx_model = to_onnx(model, X_test[:1])
sess = rt.InferenceSession(onnx_model.SerializeToString())
input_name = sess.get_inputs()[0].name

# Measure sklearn performance
sklearn_context = {
    'model': model,
    'X_test': X_test
}
sklearn_times = measure_time(
    'model.predict_proba(X_test)',
    context=sklearn_context,
    number=10,
    repeat=5
)

# Measure ONNX performance
onnx_context = {
    'sess': sess,
    'input_name': input_name,
    'X_test': X_test
}
onnx_times = measure_time(
    'sess.run(None, {input_name: X_test})',
    context=onnx_context,
    number=10,
    repeat=5
)

print(f"Sklearn average time: {sklearn_times['average']:.4f}s (±{sklearn_times['deviation']:.4f})")
print(f"ONNX average time: {onnx_times['average']:.4f}s (±{onnx_times['deviation']:.4f})")
print(f"Speedup: {sklearn_times['average'] / onnx_times['average']:.2f}x")
```

### Advanced Pipeline Analysis

```python
# Analyze complex nested pipeline
from skl2onnx.helpers.investigate import enumerate_pipeline_models
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Create complex pipeline with column transformer
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), [0, 1, 2]),
    ('cat', OneHotEncoder(), [3, 4])
])

complex_pipeline = Pipeline([
    ('preprocessing', preprocessor),
    ('classifier', RandomForestClassifier())
])

# Enumerate all components
all_models = enumerate_pipeline_models(complex_pipeline)

print("Complex pipeline analysis:")
for name, instance, path in all_models:
    print(f"  {path}: {name}")
    if hasattr(instance, 'get_params'):
        # Keep top-level parameters only; nested ones contain '__' in their name
        key_params = {k: v for k, v in instance.get_params().items()
                      if '__' not in k and not callable(v)}
        print(f"    Key parameters: {key_params}")
```

## Debugging Guidelines

### Common Investigation Patterns
1. **Shape Mismatches**: Use `collect_intermediate_steps` to trace shape inference
2. **Type Errors**: Check data type consistency with `compare_objects`
3. **Pipeline Issues**: Use `enumerate_pipeline_models` to understand structure
4. **Performance Problems**: Use `measure_time` for systematic benchmarking

### Troubleshooting Tips
- **Enable verbose logging** during conversion for detailed information
- **Compare intermediate outputs** at each pipeline stage
- **Validate with simple test cases** before complex scenarios
- **Check ONNX opset compatibility** for target deployment environment
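
For the verbose-logging tip, a minimal sketch using Python's standard `logging` module is shown here. It assumes the converter emits its records under a logger named `skl2onnx`. That name is an assumption and should be checked against the installed version.

```python
import logging

# Send log records to the console with timestamps and logger names.
logging.basicConfig(
    level=logging.WARNING,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)

# Assumption: the converter logs under the "skl2onnx" logger name.
converter_logger = logging.getLogger("skl2onnx")
converter_logger.setLevel(logging.DEBUG)

# Conversion calls made after this point would emit their debug records.
converter_logger.debug("verbose conversion logging enabled")
```

Raising the level on one named logger keeps other libraries at `WARNING`, so the conversion messages are not drowned out by unrelated output.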

### Integration Best Practices
- **Test custom graphs separately** before integration
- **Validate variable connections** between graph components
- **Consider performance implications** of additional operations
- **Document custom modifications** for maintainability