# Machine Learning and Organizational Mining

This section covers machine learning features for predictive process analytics, and organizational mining for resource analysis and social network discovery. PM4Py provides tools for feature extraction, predictive modeling, and organizational pattern analysis.

## Capabilities

### Machine Learning - Data Preparation

Prepare event log data for machine learning applications, including train/test splits and prefix extraction.

```python { .api }
def split_train_test(log, train_percentage=0.8, case_id_key='case:concept:name'):
    """
    Split event log into training and test sets for machine learning.

    Parameters:
    - log (Union[EventLog, pd.DataFrame]): Event log data
    - train_percentage (float): Fraction of data used for training (0.0-1.0)
    - case_id_key (str): Case ID attribute name

    Returns:
    Union[Tuple[EventLog, EventLog], Tuple[pd.DataFrame, pd.DataFrame]]: (train_log, test_log)
    """

def get_prefixes_from_log(log, length, case_id_key='case:concept:name'):
    """
    Extract trace prefixes of specified length for predictive modeling.

    Parameters:
    - log (Union[EventLog, pd.DataFrame]): Event log data
    - length (int): Length of prefixes to extract
    - case_id_key (str): Case ID attribute name

    Returns:
    Union[EventLog, pd.DataFrame]: Event log with prefixes
    """
```
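To make the semantics concrete, here is a minimal plain-Python sketch of a case-level split and prefix truncation. It is not PM4Py's implementation. The toy `events` list and the helper names `group_by_case`, `split_cases`, and `take_prefixes` are illustrative assumptions, and the sketch assumes that cases shorter than the prefix length are kept whole.

```python
import random
from collections import defaultdict

# Toy event log: (case_id, activity) pairs, already in chronological order per case.
events = [
    ("c1", "register"), ("c1", "check"), ("c1", "approve"),
    ("c2", "register"), ("c2", "reject"),
    ("c3", "register"), ("c3", "check"), ("c3", "check"), ("c3", "approve"),
    ("c4", "register"), ("c4", "approve"),
    ("c5", "register"), ("c5", "check"), ("c5", "reject"),
]

def group_by_case(events):
    """Collect the activities of each case, preserving event order."""
    traces = defaultdict(list)
    for case_id, activity in events:
        traces[case_id].append(activity)
    return dict(traces)

def split_cases(traces, train_percentage=0.8, seed=42):
    """Assign whole cases (never single events) to the train or test set."""
    case_ids = sorted(traces)
    random.Random(seed).shuffle(case_ids)
    cut = int(len(case_ids) * train_percentage)
    train = {c: traces[c] for c in case_ids[:cut]}
    test = {c: traces[c] for c in case_ids[cut:]}
    return train, test

def take_prefixes(traces, length):
    """Keep at most the first `length` events of every case."""
    return {c: acts[:length] for c, acts in traces.items()}

traces = group_by_case(events)
train, test = split_cases(traces, train_percentage=0.8)
prefixes = take_prefixes(traces, length=2)
```

Splitting by case rather than by event keeps all events of one case on the same side of the split, which avoids leaking information from the test set into training.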

### Machine Learning - Feature Extraction

Extract features from event logs and OCEL for machine learning applications.

```python { .api }
def extract_ocel_features(ocel, **kwargs):
    """
    Extract machine learning features from Object-Centric Event Log.

    Parameters:
    - ocel (OCEL): Object-centric event log
    - **kwargs: Feature extraction parameters including:
        - feature_types: List of feature types to extract
        - aggregation_methods: Methods for aggregating object features
        - temporal_features: Whether to include temporal features

    Returns:
    pd.DataFrame: Feature matrix with one row per object or event
    """

def extract_features_dataframe(log, **kwargs):
    """
    Extract comprehensive features from traditional event log.

    Parameters:
    - log (Union[EventLog, pd.DataFrame]): Event log data
    - **kwargs: Feature extraction parameters including:
        - case_features: Whether to extract case-level features
        - event_features: Whether to extract event-level features
        - temporal_features: Include temporal patterns
        - categorical_encoding: Method for encoding categorical variables

    Returns:
    pd.DataFrame: Feature matrix ready for machine learning
    """

def extract_temporal_features_dataframe(log, **kwargs):
    """
    Extract temporal features from event log for time-aware modeling.

    Parameters:
    - log (Union[EventLog, pd.DataFrame]): Event log data
    - **kwargs: Temporal feature parameters including:
        - time_windows: Time windows for feature aggregation
        - cyclical_features: Include cyclical time features (hour, day, month)
        - duration_features: Include activity and case duration features

    Returns:
    pd.DataFrame: Temporal feature matrix
    """

def extract_outcome_enriched_dataframe(log, **kwargs):
    """
    Extract features enriched with outcome information for supervised learning.

    Parameters:
    - log (Union[EventLog, pd.DataFrame]): Event log data
    - **kwargs: Outcome enrichment parameters including:
        - outcome_definition: How to define positive/negative outcomes
        - prediction_horizon: Time horizon for outcome prediction
        - feature_encoding: Method for encoding features

    Returns:
    pd.DataFrame: Feature matrix with outcome labels
    """

def extract_target_vector(log, **kwargs):
    """
    Extract target vector for supervised machine learning.

    Parameters:
    - log (Union[EventLog, pd.DataFrame]): Event log data
    - **kwargs: Target extraction parameters including:
        - target_type: Type of target ('next_activity', 'remaining_time', 'outcome')
        - prediction_point: Point in trace for prediction
        - encoding_method: Method for encoding targets

    Returns:
    List[Any]: Target vector for machine learning
    """
```
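The `'next_activity'` and `'remaining_time'` target types can be illustrated on a single trace. The sketch below is a plain-Python approximation under stated assumptions, not the library's code: the toy `trace`, the helper names, and the `end_label` placeholder for the final event are hypothetical, and remaining time is expressed in hours.

```python
from datetime import datetime

# Toy case: (activity, timestamp) pairs in chronological order.
trace = [
    ("register", datetime(2024, 1, 1, 9, 0)),
    ("check", datetime(2024, 1, 1, 11, 0)),
    ("approve", datetime(2024, 1, 2, 9, 0)),
]

def next_activity_targets(trace, end_label="END"):
    """For each event, the label is the activity that follows it in the case."""
    activities = [activity for activity, _ in trace]
    return activities[1:] + [end_label]

def remaining_time_targets(trace):
    """For each event, the label is the time in hours until the case ends."""
    case_end = trace[-1][1]
    return [(case_end - ts).total_seconds() / 3600 for _, ts in trace]

next_acts = next_activity_targets(trace)   # classification target
remaining = remaining_time_targets(trace)  # regression target
```

Both vectors have one entry per event, so they line up row by row with an event-level feature matrix.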

### Organizational Mining - Social Network Analysis

Discover and analyze social networks and organizational patterns from event logs.

```python { .api }
def discover_handover_of_work_network(log, beta=0, resource_key='org:resource', timestamp_key='time:timestamp', case_id_key='case:concept:name'):
    """
    Discover handover of work network showing resource collaboration patterns.

    Parameters:
    - log (Union[EventLog, pd.DataFrame]): Event log data
    - beta (float): Beta parameter for network weight calculation
    - resource_key (str): Resource attribute name
    - timestamp_key (str): Timestamp attribute name
    - case_id_key (str): Case ID attribute name

    Returns:
    SNA: Social Network Analysis object with handover relationships
    """

def discover_activity_based_resource_similarity(log, **kwargs):
    """
    Discover resource similarity network based on shared activities.

    Parameters:
    - log (Union[EventLog, pd.DataFrame]): Event log data
    - **kwargs: Similarity calculation parameters including:
        - similarity_metric: Method for calculating similarity
        - min_shared_activities: Minimum shared activities for connection
        - normalization: Normalization method for similarities

    Returns:
    SNA: Social network with resource similarities
    """

def discover_subcontracting_network(log, **kwargs):
    """
    Discover subcontracting relationships between resources.

    Parameters:
    - log (Union[EventLog, pd.DataFrame]): Event log data
    - **kwargs: Subcontracting detection parameters including:
        - time_window: Time window for detecting subcontracting
        - activity_patterns: Patterns indicating subcontracting
        - threshold: Threshold for relationship strength

    Returns:
    SNA: Social network with subcontracting relationships
    """

def discover_working_together_network(log, resource_key='org:resource', timestamp_key='time:timestamp', case_id_key='case:concept:name'):
    """
    Discover working together network showing collaborative relationships.

    Parameters:
    - log (Union[EventLog, pd.DataFrame]): Event log data
    - resource_key (str): Resource attribute name
    - timestamp_key (str): Timestamp attribute name
    - case_id_key (str): Case ID attribute name

    Returns:
    SNA: Social network with collaboration relationships
    """
```
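The ideas behind the handover-of-work and working-together networks can be sketched with simple counting. This is an illustrative simplification, not PM4Py's algorithm: it ignores the `beta` weighting and any normalization, and the toy `cases` dictionary and both helper names are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

# Toy log: each case maps to the chronological list of resources that acted on it.
cases = {
    "c1": ["alice", "bob", "carol"],
    "c2": ["alice", "bob"],
    "c3": ["bob", "carol", "carol"],
}

def handover_of_work(cases):
    """Count direct handovers: resource a is immediately followed by b in a case."""
    counts = Counter()
    for resources in cases.values():
        for a, b in zip(resources, resources[1:]):
            counts[(a, b)] += 1
    return counts

def working_together(cases):
    """Count, per unordered resource pair, the cases both resources took part in."""
    counts = Counter()
    for resources in cases.values():
        for pair in combinations(sorted(set(resources)), 2):
            counts[pair] += 1
    return counts

handovers = handover_of_work(cases)  # directed edges with raw frequencies
together = working_together(cases)   # undirected edges with raw frequencies
```

Note that this sketch also counts a resource handing work to itself (`("carol", "carol")` in case `c3`); whether such self-loops are kept is a design choice.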

### Organizational Mining - Role Discovery

Discover organizational roles and structures from resource behavior patterns.

```python { .api }
def discover_organizational_roles(log, **kwargs):
    """
    Discover organizational roles based on resource activity patterns.

    Parameters:
    - log (Union[EventLog, pd.DataFrame]): Event log data
    - **kwargs: Role discovery parameters including:
        - clustering_method: Method for role clustering
        - min_role_size: Minimum number of resources per role
        - activity_similarity_threshold: Threshold for activity similarity

    Returns:
    List[Role]: List of discovered organizational roles
    """

def discover_network_analysis(log, **kwargs):
    """
    Perform comprehensive network analysis on organizational data.

    Parameters:
    - log (Union[EventLog, pd.DataFrame]): Event log data
    - **kwargs: Network analysis parameters including:
        - network_types: Types of networks to analyze
        - centrality_measures: Centrality measures to compute
        - community_detection: Whether to detect communities

    Returns:
    Dict[str, Any]: Comprehensive network analysis results
    """
```
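As a rough intuition for role discovery, resources that perform the same set of activities can be treated as sharing a role. The sketch below applies that deliberately simplified rule. It is not the clustering used by the library, and the toy `events` list and the `group_roles` helper are hypothetical names chosen for illustration.

```python
from collections import defaultdict

# Toy log: (resource, activity) pairs.
events = [
    ("alice", "register"), ("alice", "check"),
    ("bob", "register"), ("bob", "check"),
    ("carol", "approve"),
    ("dave", "approve"), ("dave", "approve"),
]

def group_roles(events):
    """Group resources whose sets of performed activities are identical."""
    profiles = defaultdict(set)
    for resource, activity in events:
        profiles[resource].add(activity)

    roles = defaultdict(list)
    for resource, activities in profiles.items():
        roles[frozenset(activities)].append(resource)

    # Sort members so the result does not depend on event order.
    return {activities: sorted(members) for activities, members in roles.items()}

roles = group_roles(events)
```

A realistic approach would compare activity frequency profiles with a similarity threshold rather than require identical sets.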

## Usage Examples

### Machine Learning Data Preparation

```python
import pm4py
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Load event log
log = pm4py.read_xes('event_log.xes')

# Split into train/test sets
train_log, test_log = pm4py.split_train_test(log, train_percentage=0.8)
print(f"Training cases: {len(train_log)}")
print(f"Test cases: {len(test_log)}")

# Extract prefixes for predictive modeling
prefix_length = 5
train_prefixes = pm4py.get_prefixes_from_log(train_log, prefix_length)
test_prefixes = pm4py.get_prefixes_from_log(test_log, prefix_length)

print(f"Training prefixes: {len(train_prefixes)} events")
print(f"Test prefixes: {len(test_prefixes)} events")
```

### Feature Extraction for Traditional Logs

```python
import pm4py
import pandas as pd

# train_log, train_prefixes, and test_prefixes come from the data preparation example

# Extract comprehensive features
features_train = pm4py.extract_features_dataframe(
    train_prefixes,
    case_features=True,
    event_features=True,
    temporal_features=True,
    categorical_encoding='onehot'
)

features_test = pm4py.extract_features_dataframe(
    test_prefixes,
    case_features=True,
    event_features=True,
    temporal_features=True,
    categorical_encoding='onehot'
)

print("Extracted Features:")
print(f"  Training features shape: {features_train.shape}")
print(f"  Test features shape: {features_test.shape}")
print(f"  Feature columns: {list(features_train.columns)}")

# Extract temporal features specifically
temporal_features = pm4py.extract_temporal_features_dataframe(
    train_log,
    time_windows=['1h', '1d', '1w'],
    cyclical_features=True,
    duration_features=True
)

print(f"Temporal features shape: {temporal_features.shape}")
```
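Cyclical time features such as hour of day are periodic, so a raw numeric hour places 23:30 and 00:30 far apart. One common encoding, shown here as an assumption about what such a feature could look like rather than as the library's behavior, maps the time of day onto the unit circle with sine and cosine. The `cyclical_hour` helper is a hypothetical name.

```python
import math
from datetime import datetime

def cyclical_hour(ts):
    """Map the time of day onto the unit circle so late night and early morning are close."""
    fraction_of_day = (ts.hour + ts.minute / 60) / 24
    angle = 2 * math.pi * fraction_of_day
    return math.sin(angle), math.cos(angle)

midnight = cyclical_hour(datetime(2024, 1, 1, 0, 0))
six_am = cyclical_hour(datetime(2024, 1, 1, 6, 0))
late = cyclical_hour(datetime(2024, 1, 1, 23, 30))
early = cyclical_hour(datetime(2024, 1, 2, 0, 30))

# Distance between 23:30 and 00:30 in the encoded space is small,
# even though the raw hour values (23.5 and 0.5) differ by 23.
gap = math.dist(late, early)
```

The same pattern applies to day of week or month of year by changing the period in the denominator.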

### Next Activity Prediction

```python
import pandas as pd
import pm4py
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Extract features and targets for next activity prediction
X_train = pm4py.extract_features_dataframe(train_prefixes)
y_train = pm4py.extract_target_vector(
    train_prefixes,
    target_type='next_activity',
    encoding_method='label'
)

X_test = pm4py.extract_features_dataframe(test_prefixes)
y_test = pm4py.extract_target_vector(
    test_prefixes,
    target_type='next_activity',
    encoding_method='label'
)

# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"Next Activity Prediction Accuracy: {accuracy:.3f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Feature importance
feature_importance = pd.DataFrame({
    'feature': X_train.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

print("\nTop 10 Important Features:")
print(feature_importance.head(10))
```
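An accuracy figure is easier to interpret next to a simple baseline. The sketch below builds one in plain Python: for each activity, predict the follower seen most often in training. The toy traces, the `fit_transition_baseline` helper, and the test pairs are all illustrative assumptions, not part of PM4Py.

```python
from collections import Counter, defaultdict

# Toy training traces (activity sequences per case).
train_traces = [
    ["register", "check", "approve"],
    ["register", "check", "reject"],
    ["register", "check", "approve"],
]

def fit_transition_baseline(traces):
    """Map each activity to the activity that most often follows it."""
    followers = defaultdict(Counter)
    for trace in traces:
        for current, nxt in zip(trace, trace[1:]):
            followers[current][nxt] += 1
    return {act: counts.most_common(1)[0][0] for act, counts in followers.items()}

baseline = fit_transition_baseline(train_traces)

# Toy test set: (last activity in prefix, true next activity).
test_pairs = [("register", "check"), ("check", "reject"), ("check", "approve")]
hits = sum(baseline.get(current) == nxt for current, nxt in test_pairs)
baseline_accuracy = hits / len(test_pairs)
```

A trained classifier that does not clearly beat this kind of baseline is probably not extracting much signal from the additional features.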

### Remaining Time Prediction

```python
import pm4py
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, r2_score

# Extract features and targets for remaining time prediction
X_train = pm4py.extract_temporal_features_dataframe(train_prefixes)
y_train = pm4py.extract_target_vector(
    train_prefixes,
    target_type='remaining_time',
    time_unit='hours'
)

X_test = pm4py.extract_temporal_features_dataframe(test_prefixes)
y_test = pm4py.extract_target_vector(
    test_prefixes,
    target_type='remaining_time',
    time_unit='hours'
)

# Train regression model
model = GradientBoostingRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Remaining Time Prediction:")
print(f"  Mean Absolute Error: {mae:.2f} hours")
print(f"  R² Score: {r2:.3f}")
```

### Object-Centric Feature Extraction

```python
import pm4py

# Load OCEL and extract features
ocel = pm4py.read_ocel('ocel_data.csv')

# Extract OCEL-specific features
ocel_features = pm4py.extract_ocel_features(
    ocel,
    feature_types=['object_lifecycle', 'interaction_patterns', 'temporal_patterns'],
    aggregation_methods=['count', 'mean', 'std'],
    temporal_features=True
)

print("OCEL Features:")
print(f"  Feature matrix shape: {ocel_features.shape}")
print(f"  Object types covered: {ocel_features['object_type'].nunique()}")

# Group features by object type
for obj_type in ocel_features['object_type'].unique():
    obj_features = ocel_features[ocel_features['object_type'] == obj_type]
    print(f"  {obj_type}: {len(obj_features)} objects, {obj_features.shape[1]-1} features")
```

### Social Network Analysis

```python
import pm4py

# Discover handover of work network
handover_network = pm4py.discover_handover_of_work_network(log, beta=0.5)
print("Handover Network Statistics:")
print(f"  Nodes (resources): {len(handover_network.nodes)}")
print(f"  Edges (handovers): {len(handover_network.edges)}")

# Visualize handover network
pm4py.view_sna(handover_network)
pm4py.save_vis_sna(handover_network, 'handover_network.png')

# Discover working together network
collaboration_network = pm4py.discover_working_together_network(log)
print("Collaboration Network Statistics:")
print(f"  Nodes: {len(collaboration_network.nodes)}")
print(f"  Edges: {len(collaboration_network.edges)}")

# Activity-based similarity network
similarity_network = pm4py.discover_activity_based_resource_similarity(
    log,
    similarity_metric='jaccard',
    min_shared_activities=3
)
print("Resource Similarity Network:")
print(f"  Nodes: {len(similarity_network.nodes)}")
print(f"  Edges: {len(similarity_network.edges)}")
```
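The Jaccard metric named in the similarity example compares two resources by the overlap of their activity sets: the size of the intersection divided by the size of the union. The following standalone sketch computes it for every resource pair. The toy `activities_by_resource` mapping and the `jaccard` helper are assumptions for illustration, not the library's internals.

```python
from itertools import combinations

# Toy input: the set of activities each resource has performed.
activities_by_resource = {
    "alice": {"register", "check"},
    "bob": {"check", "approve"},
    "carol": {"register", "check"},
}

def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b|, defined as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# One similarity score per unordered resource pair.
similarity = {
    (r1, r2): jaccard(activities_by_resource[r1], activities_by_resource[r2])
    for r1, r2 in combinations(sorted(activities_by_resource), 2)
}
```

Scores range from 0.0 (no shared activities) to 1.0 (identical activity sets), so a threshold on this value decides which pairs become edges in the network.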

### Organizational Role Discovery

```python
import pm4py

# Discover organizational roles
roles = pm4py.discover_organizational_roles(
    log,
    clustering_method='kmeans',
    min_role_size=3,
    activity_similarity_threshold=0.7
)

print(f"Discovered {len(roles)} organizational roles:")
for i, role in enumerate(roles):
    print(f"\nRole {i+1}:")
    print(f"  Resource count: {len(role.resources)}")
    print(f"  Main activities: {role.main_activities}")
    print(f"  Activity coverage: {role.activity_coverage:.2f}")
    # Show at most the first five resources of the role
    print(f"  Resources: {list(role.resources)[:5]}{'...' if len(role.resources) > 5 else ''}")

# Comprehensive network analysis
network_analysis = pm4py.discover_network_analysis(
    log,
    network_types=['handover', 'collaboration', 'similarity'],
    centrality_measures=['betweenness', 'closeness', 'degree'],
    community_detection=True
)

print("\nComprehensive Network Analysis:")
print(f"  Network types analyzed: {len(network_analysis['networks'])}")
print(f"  Communities detected: {network_analysis['communities']['count']}")
print(f"  Key resources (high centrality): {network_analysis['key_resources']}")
```

### Outcome Prediction

```python
import pm4py
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Extract outcome-enriched features
outcome_features = pm4py.extract_outcome_enriched_dataframe(
    log,
    outcome_definition='case_duration > average',
    prediction_horizon='50%',  # Predict at 50% of case completion
    feature_encoding='numerical'
)

print("Outcome Prediction Dataset:")
print(f"  Total instances: {len(outcome_features)}")
print(f"  Positive outcomes: {outcome_features['outcome'].sum()}")
print(f"  Features: {outcome_features.shape[1] - 1}")

# Split features and targets
X = outcome_features.drop(['case_id', 'outcome'], axis=1)
y = outcome_features['outcome']

# Train outcome prediction model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("\nOutcome Prediction Results:")
print(f"  Accuracy: {accuracy:.3f}")
print("  Classification Report:")
print(classification_report(y_test, y_pred))
```
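An outcome definition such as `'case_duration > average'` amounts to a binary label per case. The sketch below shows one way to derive such a label in plain Python, under the assumption that "average" means the arithmetic mean of case durations. The toy `case_bounds` mapping and the `label_slow_cases` helper are hypothetical.

```python
from datetime import datetime
from statistics import mean

# Toy input: (first event timestamp, last event timestamp) per case.
case_bounds = {
    "c1": (datetime(2024, 1, 1, 9), datetime(2024, 1, 1, 17)),  # 8 hours
    "c2": (datetime(2024, 1, 1, 9), datetime(2024, 1, 3, 9)),   # 48 hours
    "c3": (datetime(2024, 1, 2, 9), datetime(2024, 1, 2, 13)),  # 4 hours
}

def label_slow_cases(case_bounds):
    """Label a case 1 if its duration exceeds the mean duration, else 0."""
    durations = {
        case_id: (end - start).total_seconds() / 3600
        for case_id, (start, end) in case_bounds.items()
    }
    threshold = mean(durations.values())
    labels = {case_id: int(hours > threshold) for case_id, hours in durations.items()}
    return labels, threshold

labels, threshold = label_slow_cases(case_bounds)
```

In a real evaluation the threshold should be computed on the training cases only, so that the label definition does not depend on test data.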

### Predictive Process Monitoring Pipeline

```python
import pm4py
from sklearn.ensemble import GradientBoostingRegressor, RandomForestClassifier
from sklearn.metrics import accuracy_score, mean_absolute_error

def predictive_monitoring_pipeline(log, prefix_lengths=(3, 5, 7, 10)):
    """Complete predictive process monitoring pipeline."""

    results = {}

    # Split data
    train_log, test_log = pm4py.split_train_test(log, 0.8)

    for prefix_length in prefix_lengths:
        print(f"\nAnalyzing prefix length: {prefix_length}")

        # Extract prefixes
        train_prefixes = pm4py.get_prefixes_from_log(train_log, prefix_length)
        test_prefixes = pm4py.get_prefixes_from_log(test_log, prefix_length)

        if len(train_prefixes) == 0 or len(test_prefixes) == 0:
            print(f"  Insufficient data for prefix length {prefix_length}")
            continue

        # Extract features
        X_train = pm4py.extract_features_dataframe(train_prefixes)
        X_test = pm4py.extract_features_dataframe(test_prefixes)

        # Next activity prediction
        y_train_activity = pm4py.extract_target_vector(train_prefixes, target_type='next_activity')
        y_test_activity = pm4py.extract_target_vector(test_prefixes, target_type='next_activity')

        model_activity = RandomForestClassifier(n_estimators=50, random_state=42)
        model_activity.fit(X_train, y_train_activity)

        activity_accuracy = accuracy_score(y_test_activity, model_activity.predict(X_test))

        # Remaining time prediction (if applicable)
        try:
            y_train_time = pm4py.extract_target_vector(train_prefixes, target_type='remaining_time')
            y_test_time = pm4py.extract_target_vector(test_prefixes, target_type='remaining_time')

            model_time = GradientBoostingRegressor(n_estimators=50, random_state=42)
            model_time.fit(X_train, y_train_time)

            time_mae = mean_absolute_error(y_test_time, model_time.predict(X_test))
        except Exception:
            # Skip the time model if target extraction or fitting fails
            time_mae = None

        results[prefix_length] = {
            'activity_accuracy': activity_accuracy,
            'time_mae': time_mae,
            'train_samples': len(train_prefixes),
            'test_samples': len(test_prefixes),
            'features': X_train.shape[1]
        }

        print(f"  Next activity accuracy: {activity_accuracy:.3f}")
        # Compare against None so that an MAE of 0.0 is still reported
        if time_mae is not None:
            print(f"  Remaining time MAE: {time_mae:.2f}")

    return results

# Run predictive monitoring analysis
prediction_results = predictive_monitoring_pipeline(log)

# Summarize results
print("\n" + "="*50)
print("PREDICTIVE MONITORING SUMMARY")
print("="*50)
for prefix_len, metrics in prediction_results.items():
    print(f"Prefix Length {prefix_len}:")
    print(f"  Activity Prediction Accuracy: {metrics['activity_accuracy']:.3f}")
    if metrics['time_mae'] is not None:
        print(f"  Time Prediction MAE: {metrics['time_mae']:.2f}")
    print(f"  Training Samples: {metrics['train_samples']}")
    print(f"  Features: {metrics['features']}")
```

### Resource Performance Analysis

```python
import pandas as pd
import pm4py

def analyze_resource_performance(log):
    """Analyze individual resource performance and collaboration patterns."""

    # Get basic resource statistics (expects the log as a pandas DataFrame)
    resources = log['org:resource'].unique() if 'org:resource' in log.columns else []

    print(f"Resource Performance Analysis ({len(resources)} resources)")
    print("-" * 50)

    # Discover networks
    handover_net = pm4py.discover_handover_of_work_network(log)
    collab_net = pm4py.discover_working_together_network(log)

    # Calculate resource metrics
    resource_metrics = []

    for resource in resources:
        # Filter log for this resource
        resource_log = log[log['org:resource'] == resource]

        # Basic metrics
        cases_handled = resource_log['case:concept:name'].nunique()
        events_performed = len(resource_log)
        activities = resource_log['concept:name'].nunique()

        # Network metrics
        handover_connections = len([e for e in handover_net.edges if resource in e])
        collab_connections = len([e for e in collab_net.edges if resource in e])

        resource_metrics.append({
            'resource': resource,
            'cases_handled': cases_handled,
            'events_performed': events_performed,
            'activities': activities,
            'handover_connections': handover_connections,
            'collaboration_connections': collab_connections
        })

    # Convert to DataFrame for analysis
    metrics_df = pd.DataFrame(resource_metrics)

    print("Top 10 Resources by Cases Handled:")
    top_resources = metrics_df.nlargest(10, 'cases_handled')
    for _, row in top_resources.iterrows():
        print(f"  {row['resource']}: {row['cases_handled']} cases, "
              f"{row['activities']} activities, {row['handover_connections']} handovers")

    return metrics_df

# Run resource performance analysis
resource_analysis = analyze_resource_performance(log)
```