# Machine Learning and Organizational Mining

This module covers machine learning features for predictive process analytics and organizational mining for resource analysis and social network discovery. PM4Py provides tools for feature extraction, predictive modeling, and organizational pattern analysis.

## Capabilities

### Machine Learning - Data Preparation

Prepare event log data for machine learning applications, including train/test splits and prefix extraction.

```python { .api }
def split_train_test(log, train_percentage=0.8, case_id_key='case:concept:name'):
    """
    Split event log into training and test sets for machine learning.

    Parameters:
    - log (Union[EventLog, pd.DataFrame]): Event log data
    - train_percentage (float): Percentage of data for training (0.0-1.0)
    - case_id_key (str): Case ID attribute name

    Returns:
    Union[Tuple[EventLog, EventLog], Tuple[pd.DataFrame, pd.DataFrame]]: (train_log, test_log)
    """

def get_prefixes_from_log(log, length, case_id_key='case:concept:name'):
    """
    Extract trace prefixes of specified length for predictive modeling.

    Parameters:
    - log (Union[EventLog, pd.DataFrame]): Event log data
    - length (int): Length of prefixes to extract
    - case_id_key (str): Case ID attribute name

    Returns:
    Union[EventLog, pd.DataFrame]: Event log with prefixes
    """
```
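The two operations above can be sketched without PM4Py. The following is a minimal pure-Python illustration, not the library's implementation: the helper names `group_by_case`, `split_cases`, and `take_prefixes` and the toy event list are assumptions made for this example, and the real functions may select training cases differently (for example, at random).

```python
from collections import defaultdict

# Toy event log: one dict per event, already ordered by time within each case.
events = [
    {"case": "c1", "activity": "register"},
    {"case": "c1", "activity": "check"},
    {"case": "c1", "activity": "approve"},
    {"case": "c2", "activity": "register"},
    {"case": "c2", "activity": "reject"},
    {"case": "c3", "activity": "register"},
    {"case": "c3", "activity": "check"},
    {"case": "c4", "activity": "register"},
    {"case": "c5", "activity": "register"},
]

def group_by_case(events):
    """Map each case ID to its ordered list of events (hypothetical helper)."""
    traces = defaultdict(list)
    for event in events:
        traces[event["case"]].append(event)
    return dict(traces)

def split_cases(events, train_percentage=0.8):
    """Split whole cases, never single events, into train and test sets."""
    case_ids = sorted(group_by_case(events))  # deterministic order for the sketch
    cut = int(len(case_ids) * train_percentage)
    train_ids = set(case_ids[:cut])
    train = [e for e in events if e["case"] in train_ids]
    test = [e for e in events if e["case"] not in train_ids]
    return train, test

def take_prefixes(events, length):
    """Keep only the first `length` events of every case."""
    traces = group_by_case(events)
    return [e for trace in traces.values() for e in trace[:length]]

train, test = split_cases(events, train_percentage=0.8)
prefixes = take_prefixes(train, length=2)
print(len(train), len(test), len(prefixes))
```

Splitting by case rather than by event keeps every trace intact on one side of the split, so no case leaks events into both sets.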

### Machine Learning - Feature Extraction

Extract features from event logs and OCEL for machine learning applications.

```python { .api }
def extract_ocel_features(ocel, **kwargs):
    """
    Extract machine learning features from Object-Centric Event Log.

    Parameters:
    - ocel (OCEL): Object-centric event log
    - **kwargs: Feature extraction parameters including:
        - feature_types: List of feature types to extract
        - aggregation_methods: Methods for aggregating object features
        - temporal_features: Whether to include temporal features

    Returns:
    pd.DataFrame: Feature matrix with one row per object or event
    """

def extract_features_dataframe(log, **kwargs):
    """
    Extract comprehensive features from traditional event log.

    Parameters:
    - log (Union[EventLog, pd.DataFrame]): Event log data
    - **kwargs: Feature extraction parameters including:
        - case_features: Whether to extract case-level features
        - event_features: Whether to extract event-level features
        - temporal_features: Include temporal patterns
        - categorical_encoding: Method for encoding categorical variables

    Returns:
    pd.DataFrame: Feature matrix ready for machine learning
    """

def extract_temporal_features_dataframe(log, **kwargs):
    """
    Extract temporal features from event log for time-aware modeling.

    Parameters:
    - log (Union[EventLog, pd.DataFrame]): Event log data
    - **kwargs: Temporal feature parameters including:
        - time_windows: Time windows for feature aggregation
        - cyclical_features: Include cyclical time features (hour, day, month)
        - duration_features: Include activity and case duration features

    Returns:
    pd.DataFrame: Temporal feature matrix
    """

def extract_outcome_enriched_dataframe(log, **kwargs):
    """
    Extract features enriched with outcome information for supervised learning.

    Parameters:
    - log (Union[EventLog, pd.DataFrame]): Event log data
    - **kwargs: Outcome enrichment parameters including:
        - outcome_definition: How to define positive/negative outcomes
        - prediction_horizon: Time horizon for outcome prediction
        - feature_encoding: Method for encoding features

    Returns:
    pd.DataFrame: Feature matrix with outcome labels
    """

def extract_target_vector(log, **kwargs):
    """
    Extract target vector for supervised machine learning.

    Parameters:
    - log (Union[EventLog, pd.DataFrame]): Event log data
    - **kwargs: Target extraction parameters including:
        - target_type: Type of target ('next_activity', 'remaining_time', 'outcome')
        - prediction_point: Point in trace for prediction
        - encoding_method: Method for encoding targets

    Returns:
    List[Any]: Target vector for machine learning
    """
```
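To illustrate what the `'next_activity'` and `'remaining_time'` target types mean for a given prefix, here is a small pure-Python sketch. It does not use PM4Py; the function names `next_activity_target` and `remaining_time_target`, the toy traces, and the choice of hours as the unit are assumptions for this example only.

```python
from datetime import datetime

# Toy traces: case ID -> ordered list of (activity, timestamp) pairs.
traces = {
    "c1": [("register", datetime(2024, 1, 1, 9, 0)),
           ("check", datetime(2024, 1, 1, 11, 0)),
           ("approve", datetime(2024, 1, 1, 15, 0))],
    "c2": [("register", datetime(2024, 1, 2, 9, 0)),
           ("reject", datetime(2024, 1, 2, 10, 30))],
}

def next_activity_target(trace, prefix_length):
    """Activity that follows the prefix, or None when the trace is too short."""
    if len(trace) <= prefix_length:
        return None
    return trace[prefix_length][0]

def remaining_time_target(trace, prefix_length):
    """Hours between the last prefix event and the final event of the case."""
    if prefix_length < 1 or len(trace) < prefix_length:
        return None
    last_prefix_time = trace[prefix_length - 1][1]
    case_end_time = trace[-1][1]
    return (case_end_time - last_prefix_time).total_seconds() / 3600

prefix_length = 1
next_activities = {cid: next_activity_target(t, prefix_length) for cid, t in traces.items()}
remaining_hours = {cid: remaining_time_target(t, prefix_length) for cid, t in traces.items()}
print(next_activities)
print(remaining_hours)
```

Both targets are computed from the full trace but paired with features taken only from the prefix, which is what keeps the prediction task honest.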

### Organizational Mining - Social Network Analysis

Discover and analyze social networks and organizational patterns from event logs.

```python { .api }
def discover_handover_of_work_network(log, beta=0, resource_key='org:resource', timestamp_key='time:timestamp', case_id_key='case:concept:name'):
    """
    Discover handover of work network showing resource collaboration patterns.

    Parameters:
    - log (Union[EventLog, pd.DataFrame]): Event log data
    - beta (float): Beta parameter for network weight calculation
    - resource_key (str): Resource attribute name
    - timestamp_key (str): Timestamp attribute name
    - case_id_key (str): Case ID attribute name

    Returns:
    SNA: Social Network Analysis object with handover relationships
    """

def discover_activity_based_resource_similarity(log, **kwargs):
    """
    Discover resource similarity network based on shared activities.

    Parameters:
    - log (Union[EventLog, pd.DataFrame]): Event log data
    - **kwargs: Similarity calculation parameters including:
        - similarity_metric: Method for calculating similarity
        - min_shared_activities: Minimum shared activities for connection
        - normalization: Normalization method for similarities

    Returns:
    SNA: Social network with resource similarities
    """

def discover_subcontracting_network(log, **kwargs):
    """
    Discover subcontracting relationships between resources.

    Parameters:
    - log (Union[EventLog, pd.DataFrame]): Event log data
    - **kwargs: Subcontracting detection parameters including:
        - time_window: Time window for detecting subcontracting
        - activity_patterns: Patterns indicating subcontracting
        - threshold: Threshold for relationship strength

    Returns:
    SNA: Social network with subcontracting relationships
    """

def discover_working_together_network(log, resource_key='org:resource', timestamp_key='time:timestamp', case_id_key='case:concept:name'):
    """
    Discover working together network showing collaborative relationships.

    Parameters:
    - log (Union[EventLog, pd.DataFrame]): Event log data
    - resource_key (str): Resource attribute name
    - timestamp_key (str): Timestamp attribute name
    - case_id_key (str): Case ID attribute name

    Returns:
    SNA: Social network with collaboration relationships
    """
```
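The ideas behind the handover-of-work and working-together networks can be shown with plain counting. The sketch below is an illustration under simplifying assumptions, not PM4Py's algorithm: it counts only direct handovers, ignores the `beta` weighting, skips self-handovers, and uses hypothetical helper names (`handover_counts`, `working_together_counts`) and toy data.

```python
from collections import Counter
from itertools import combinations

# Toy traces: case ID -> resources in the order they performed events.
traces = {
    "c1": ["alice", "bob", "carol"],
    "c2": ["alice", "bob"],
    "c3": ["bob", "alice", "bob"],
}

def handover_counts(traces):
    """Count direct handovers (a, b): b performs the event right after a."""
    counts = Counter()
    for resources in traces.values():
        for source, target in zip(resources, resources[1:]):
            if source != target:  # assumption: ignore a resource handing over to itself
                counts[(source, target)] += 1
    return counts

def working_together_counts(traces):
    """Count, per unordered pair, the cases in which both resources appear."""
    counts = Counter()
    for resources in traces.values():
        for pair in combinations(sorted(set(resources)), 2):
            counts[pair] += 1
    return counts

handovers = handover_counts(traces)
together = working_together_counts(traces)
print(dict(handovers))
print(dict(together))
```

Handover counts are directed (alice to bob differs from bob to alice), whereas working-together counts are symmetric, which is why the second function normalizes each pair by sorting.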

### Organizational Mining - Role Discovery

Discover organizational roles and structures from resource behavior patterns.

```python { .api }
def discover_organizational_roles(log, **kwargs):
    """
    Discover organizational roles based on resource activity patterns.

    Parameters:
    - log (Union[EventLog, pd.DataFrame]): Event log data
    - **kwargs: Role discovery parameters including:
        - clustering_method: Method for role clustering
        - min_role_size: Minimum number of resources per role
        - activity_similarity_threshold: Threshold for activity similarity

    Returns:
    List[Role]: List of discovered organizational roles
    """

def discover_network_analysis(log, **kwargs):
    """
    Perform comprehensive network analysis on organizational data.

    Parameters:
    - log (Union[EventLog, pd.DataFrame]): Event log data
    - **kwargs: Network analysis parameters including:
        - network_types: Types of networks to analyze
        - centrality_measures: Centrality measures to compute
        - community_detection: Whether to detect communities

    Returns:
    Dict[str, Any]: Comprehensive network analysis results
    """
```
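One simple way to think about grouping resources into roles by activity similarity is a greedy pass over resource activity profiles. The sketch below is a deliberately simplified illustration and should not be read as the algorithm PM4Py uses: the Jaccard measure, the greedy assignment, the helper names `jaccard` and `group_into_roles`, and the toy profiles are all assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (defined as 1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def group_into_roles(profiles, threshold=0.5):
    """Greedily assign each resource to the first role with a similar activity set."""
    roles = []  # each role: {"activities": set, "resources": list}
    for resource in sorted(profiles):
        activities = profiles[resource]
        for role in roles:
            if jaccard(role["activities"], activities) >= threshold:
                role["resources"].append(resource)
                role["activities"] |= activities
                break
        else:
            # No existing role is similar enough: start a new one.
            roles.append({"activities": set(activities), "resources": [resource]})
    return roles

# Toy activity profiles: resource -> set of activities it has performed.
profiles = {
    "alice": {"register", "check"},
    "bob": {"register", "check", "notify"},
    "carol": {"approve", "reject"},
    "dave": {"approve"},
}

roles = group_into_roles(profiles, threshold=0.5)
for role in roles:
    print(role["resources"], sorted(role["activities"]))
```

Raising the threshold produces more, narrower roles; lowering it merges resources with only loosely overlapping work.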

## Usage Examples

### Machine Learning Data Preparation

```python
import pm4py
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Load event log
log = pm4py.read_xes('event_log.xes')

# Split into train/test sets
train_log, test_log = pm4py.split_train_test(log, train_percentage=0.8)

# Count distinct cases (assumes the log is a pandas DataFrame;
# len() on a DataFrame would count events, not cases)
print(f"Training cases: {train_log['case:concept:name'].nunique()}")
print(f"Test cases: {test_log['case:concept:name'].nunique()}")

# Extract prefixes for predictive modeling
prefix_length = 5
train_prefixes = pm4py.get_prefixes_from_log(train_log, prefix_length)
test_prefixes = pm4py.get_prefixes_from_log(test_log, prefix_length)

print(f"Training prefixes: {len(train_prefixes)} events")
print(f"Test prefixes: {len(test_prefixes)} events")
```

### Feature Extraction for Traditional Logs

```python
import pm4py
import pandas as pd

# Extract comprehensive features
features_train = pm4py.extract_features_dataframe(
    train_prefixes,
    case_features=True,
    event_features=True,
    temporal_features=True,
    categorical_encoding='onehot'
)

features_test = pm4py.extract_features_dataframe(
    test_prefixes,
    case_features=True,
    event_features=True,
    temporal_features=True,
    categorical_encoding='onehot'
)

print("Extracted Features:")
print(f" Training features shape: {features_train.shape}")
print(f" Test features shape: {features_test.shape}")
print(f" Feature columns: {list(features_train.columns)}")

# Extract temporal features specifically
temporal_features = pm4py.extract_temporal_features_dataframe(
    train_log,
    time_windows=['1h', '1d', '1w'],
    cyclical_features=True,
    duration_features=True
)

print(f"Temporal features shape: {temporal_features.shape}")
```
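The notion of cyclical time features mentioned above can be illustrated independently of PM4Py. A common encoding maps a periodic value such as the hour of day onto the unit circle, so that 23:00 and 00:00 end up close together. The sketch below shows that general technique; the helper names and the chosen features are assumptions for illustration and say nothing about how the library encodes time internally.

```python
import math
from datetime import datetime

def cyclical_encode(value, period):
    """Map a periodic value onto the unit circle as a (sin, cos) pair."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

def time_features(timestamp):
    """Encode hour of day and day of week so that period boundaries stay adjacent."""
    hour_sin, hour_cos = cyclical_encode(timestamp.hour, 24)
    dow_sin, dow_cos = cyclical_encode(timestamp.weekday(), 7)
    return {"hour_sin": hour_sin, "hour_cos": hour_cos,
            "dow_sin": dow_sin, "dow_cos": dow_cos}

def hour_distance(a, b):
    """Euclidean distance between two encoded hours."""
    return math.hypot(a["hour_sin"] - b["hour_sin"], a["hour_cos"] - b["hour_cos"])

late_evening = time_features(datetime(2024, 1, 1, 23, 0))
midnight = time_features(datetime(2024, 1, 2, 0, 0))
noon = time_features(datetime(2024, 1, 2, 12, 0))

# 23:00 and 00:00 are neighbours on the circle; 00:00 and 12:00 are opposite points.
print(hour_distance(late_evening, midnight) < hour_distance(midnight, noon))
```

Using the raw hour (0 to 23) as a single number would instead place 23:00 and 00:00 at opposite ends of the scale, which distance-based and linear models tend to misread.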

### Next Activity Prediction

```python
import pandas as pd
import pm4py
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Extract features and targets for next activity prediction
X_train = pm4py.extract_features_dataframe(train_prefixes)
y_train = pm4py.extract_target_vector(
    train_prefixes,
    target_type='next_activity',
    encoding_method='label'
)

X_test = pm4py.extract_features_dataframe(test_prefixes)
y_test = pm4py.extract_target_vector(
    test_prefixes,
    target_type='next_activity',
    encoding_method='label'
)

# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"Next Activity Prediction Accuracy: {accuracy:.3f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Feature importance
feature_importance = pd.DataFrame({
    'feature': X_train.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

print("\nTop 10 Important Features:")
print(feature_importance.head(10))
```

### Remaining Time Prediction

```python
import pm4py
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, r2_score

# Extract features and targets for remaining time prediction
X_train = pm4py.extract_temporal_features_dataframe(train_prefixes)
y_train = pm4py.extract_target_vector(
    train_prefixes,
    target_type='remaining_time',
    time_unit='hours'
)

X_test = pm4py.extract_temporal_features_dataframe(test_prefixes)
y_test = pm4py.extract_target_vector(
    test_prefixes,
    target_type='remaining_time',
    time_unit='hours'
)

# Train regression model
model = GradientBoostingRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Remaining Time Prediction:")
print(f" Mean Absolute Error: {mae:.2f} hours")
print(f" R² Score: {r2:.3f}")
```
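For readers unfamiliar with the two metrics reported above, the following sketch computes them by hand on made-up numbers. The function names `mae_by_hand` and `r2_by_hand` and the sample values are assumptions for illustration; in practice the scikit-learn functions used in the example are the better choice.

```python
def mae_by_hand(y_true, y_pred):
    """Mean absolute error: average size of the prediction errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_by_hand(y_true, y_pred):
    """Coefficient of determination: 1 - residual variation / total variation."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up remaining times in hours (true values vs. model predictions).
y_true = [2.0, 4.0, 6.0, 8.0]
y_pred = [3.0, 4.0, 5.0, 8.0]

print(f"MAE: {mae_by_hand(y_true, y_pred):.2f} hours")
print(f"R2: {r2_by_hand(y_true, y_pred):.3f}")
```

MAE is expressed in the same unit as the target, so it reads directly as "hours off on average", while R² is unitless and compares the model against always predicting the mean.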

### Object-Centric Feature Extraction

```python
import pm4py

# Load OCEL and extract features
ocel = pm4py.read_ocel('ocel_data.csv')

# Extract OCEL-specific features
ocel_features = pm4py.extract_ocel_features(
    ocel,
    feature_types=['object_lifecycle', 'interaction_patterns', 'temporal_patterns'],
    aggregation_methods=['count', 'mean', 'std'],
    temporal_features=True
)

print("OCEL Features:")
print(f" Feature matrix shape: {ocel_features.shape}")
print(f" Object types covered: {ocel_features['object_type'].nunique()}")

# Group features by object type
for obj_type in ocel_features['object_type'].unique():
    obj_features = ocel_features[ocel_features['object_type'] == obj_type]
    print(f" {obj_type}: {len(obj_features)} objects, {obj_features.shape[1] - 1} features")
```

### Social Network Analysis

```python
import pm4py

# Discover handover of work network
handover_network = pm4py.discover_handover_of_work_network(log, beta=0.5)
print("Handover Network Statistics:")
print(f" Nodes (resources): {len(handover_network.nodes)}")
print(f" Edges (handovers): {len(handover_network.edges)}")

# Visualize handover network
pm4py.view_sna(handover_network)
pm4py.save_vis_sna(handover_network, 'handover_network.png')

# Discover working together network
collaboration_network = pm4py.discover_working_together_network(log)
print("Collaboration Network Statistics:")
print(f" Nodes: {len(collaboration_network.nodes)}")
print(f" Edges: {len(collaboration_network.edges)}")

# Activity-based similarity network
similarity_network = pm4py.discover_activity_based_resource_similarity(
    log,
    similarity_metric='jaccard',
    min_shared_activities=3
)
print("Resource Similarity Network:")
print(f" Nodes: {len(similarity_network.nodes)}")
print(f" Edges: {len(similarity_network.edges)}")
```

### Organizational Role Discovery

```python
import pm4py

# Discover organizational roles
roles = pm4py.discover_organizational_roles(
    log,
    clustering_method='kmeans',
    min_role_size=3,
    activity_similarity_threshold=0.7
)

print(f"Discovered {len(roles)} organizational roles:")
for i, role in enumerate(roles):
    print(f"\nRole {i+1}:")
    print(f" Resources: {len(role.resources)}")
    print(f" Main activities: {role.main_activities}")
    print(f" Activity coverage: {role.activity_coverage:.2f}")
    print(f" Resources: {list(role.resources)[:5]}{'...' if len(role.resources) > 5 else ''}")

# Comprehensive network analysis
network_analysis = pm4py.discover_network_analysis(
    log,
    network_types=['handover', 'collaboration', 'similarity'],
    centrality_measures=['betweenness', 'closeness', 'degree'],
    community_detection=True
)

print("\nComprehensive Network Analysis:")
print(f" Network types analyzed: {len(network_analysis['networks'])}")
print(f" Communities detected: {network_analysis['communities']['count']}")
print(f" Key resources (high centrality): {network_analysis['key_resources']}")
```

### Outcome Prediction

```python
import pm4py
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Extract outcome-enriched features
outcome_features = pm4py.extract_outcome_enriched_dataframe(
    log,
    outcome_definition='case_duration > average',
    prediction_horizon='50%',  # Predict at 50% of case completion
    feature_encoding='numerical'
)

print("Outcome Prediction Dataset:")
print(f" Total instances: {len(outcome_features)}")
print(f" Positive outcomes: {outcome_features['outcome'].sum()}")
print(f" Features: {outcome_features.shape[1] - 1}")

# Split features and targets
X = outcome_features.drop(['case_id', 'outcome'], axis=1)
y = outcome_features['outcome']

# Train outcome prediction model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("\nOutcome Prediction Results:")
print(f" Accuracy: {accuracy:.3f}")
print(" Classification Report:")
print(classification_report(y_test, y_pred))
```

### Predictive Process Monitoring Pipeline

```python
import pm4py
from sklearn.ensemble import GradientBoostingRegressor, RandomForestClassifier
from sklearn.metrics import accuracy_score, mean_absolute_error
import pandas as pd

def predictive_monitoring_pipeline(log, prefix_lengths=(3, 5, 7, 10)):
    """Complete predictive process monitoring pipeline."""

    results = {}

    # Split data
    train_log, test_log = pm4py.split_train_test(log, 0.8)

    for prefix_length in prefix_lengths:
        print(f"\nAnalyzing prefix length: {prefix_length}")

        # Extract prefixes
        train_prefixes = pm4py.get_prefixes_from_log(train_log, prefix_length)
        test_prefixes = pm4py.get_prefixes_from_log(test_log, prefix_length)

        if len(train_prefixes) == 0 or len(test_prefixes) == 0:
            print(f" Insufficient data for prefix length {prefix_length}")
            continue

        # Extract features
        X_train = pm4py.extract_features_dataframe(train_prefixes)
        X_test = pm4py.extract_features_dataframe(test_prefixes)

        # Next activity prediction
        y_train_activity = pm4py.extract_target_vector(train_prefixes, target_type='next_activity')
        y_test_activity = pm4py.extract_target_vector(test_prefixes, target_type='next_activity')

        model_activity = RandomForestClassifier(n_estimators=50, random_state=42)
        model_activity.fit(X_train, y_train_activity)

        activity_accuracy = accuracy_score(y_test_activity, model_activity.predict(X_test))

        # Remaining time prediction (if applicable)
        try:
            y_train_time = pm4py.extract_target_vector(train_prefixes, target_type='remaining_time')
            y_test_time = pm4py.extract_target_vector(test_prefixes, target_type='remaining_time')

            model_time = GradientBoostingRegressor(n_estimators=50, random_state=42)
            model_time.fit(X_train, y_train_time)

            time_mae = mean_absolute_error(y_test_time, model_time.predict(X_test))
        except Exception:
            # Remaining-time targets could not be built or fitted for this prefix length
            time_mae = None

        results[prefix_length] = {
            'activity_accuracy': activity_accuracy,
            'time_mae': time_mae,
            'train_samples': len(train_prefixes),
            'test_samples': len(test_prefixes),
            'features': X_train.shape[1]
        }

        print(f" Next activity accuracy: {activity_accuracy:.3f}")
        if time_mae is not None:  # an MAE of 0.0 is still a valid result
            print(f" Remaining time MAE: {time_mae:.2f}")

    return results

# Run predictive monitoring analysis
prediction_results = predictive_monitoring_pipeline(log)

# Summarize results
print("\n" + "="*50)
print("PREDICTIVE MONITORING SUMMARY")
print("="*50)
for prefix_len, metrics in prediction_results.items():
    print(f"Prefix Length {prefix_len}:")
    print(f" Activity Prediction Accuracy: {metrics['activity_accuracy']:.3f}")
    if metrics['time_mae'] is not None:
        print(f" Time Prediction MAE: {metrics['time_mae']:.2f}")
    print(f" Training Samples: {metrics['train_samples']}")
    print(f" Features: {metrics['features']}")
```

### Resource Performance Analysis

```python
import pandas as pd
import pm4py

def analyze_resource_performance(log):
    """Analyze individual resource performance and collaboration patterns."""

    # Get basic resource statistics
    resources = log['org:resource'].unique() if 'org:resource' in log.columns else []

    print(f"Resource Performance Analysis ({len(resources)} resources)")
    print("-" * 50)

    # Discover networks
    handover_net = pm4py.discover_handover_of_work_network(log)
    collab_net = pm4py.discover_working_together_network(log)

    # Calculate resource metrics
    resource_metrics = []

    for resource in resources:
        # Filter log for this resource
        resource_log = log[log['org:resource'] == resource]

        # Basic metrics
        cases_handled = resource_log['case:concept:name'].nunique()
        events_performed = len(resource_log)
        activities = resource_log['concept:name'].nunique()

        # Network metrics
        handover_connections = len([e for e in handover_net.edges if resource in e])
        collab_connections = len([e for e in collab_net.edges if resource in e])

        resource_metrics.append({
            'resource': resource,
            'cases_handled': cases_handled,
            'events_performed': events_performed,
            'activities': activities,
            'handover_connections': handover_connections,
            'collaboration_connections': collab_connections
        })

    # Convert to DataFrame for analysis
    metrics_df = pd.DataFrame(resource_metrics)

    print("Top 10 Resources by Cases Handled:")
    top_resources = metrics_df.nlargest(10, 'cases_handled')
    for _, row in top_resources.iterrows():
        print(f" {row['resource']}: {row['cases_handled']} cases, "
              f"{row['activities']} activities, {row['handover_connections']} handovers")

    return metrics_df

# Run resource performance analysis
resource_analysis = analyze_resource_performance(log)
```