
# GenAI and LLM Integration

MLflow's GenAI capabilities support large language models, prompt engineering, evaluation, and LLM application development. The module includes tools for prompt management, LLM evaluation, automated scoring, and interactive labeling workflows for GenAI applications.

## Capabilities

### Model Evaluation and Testing

Evaluation framework designed for LLM and GenAI applications, with built-in metrics and custom evaluators.

```python { .api }
def evaluate(model=None, data=None, model_type="text", evaluators=None, targets=None, evaluator_config=None, custom_metrics=None, extra_metrics=None, baseline_model=None, inference_params=None, model_config=None):
    """
    Evaluate GenAI models with specialized LLM metrics.

    Parameters:
    - model: Model, callable, or URI - LLM model to evaluate
    - data: DataFrame, Dataset, or URI - Evaluation dataset with inputs
    - model_type: str - Type of model ("text", "chat", "question-answering")
    - evaluators: list, optional - List of evaluator names or objects
    - targets: str or array, optional - Ground truth targets for evaluation
    - evaluator_config: dict, optional - Configuration for evaluators
    - custom_metrics: list, optional - Custom metric functions
    - extra_metrics: list, optional - Additional built-in metrics
    - baseline_model: Model or URI, optional - Baseline model for comparison
    - inference_params: dict, optional - Model inference parameters
    - model_config: dict, optional - Model configuration parameters

    Returns:
    EvaluationResult object with LLM-specific metrics and artifacts
    """

def to_predict_fn(model_uri, inference_params=None):
    """
    Convert MLflow model to prediction function for evaluation.

    Parameters:
    - model_uri: str - URI pointing to MLflow model
    - inference_params: dict, optional - Parameters for model inference

    Returns:
    Callable prediction function compatible with evaluation
    """
```
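
To make the idea of a "prediction function compatible with evaluation" more concrete, the following is a minimal, library-independent sketch. It does not call MLflow; `make_predict_fn` and `echo_model` are hypothetical names introduced only for illustration. It shows one plausible shape for such a function: inference parameters are bound once, and the model is then mapped over a batch of inputs.

```python
from typing import Any, Callable, Dict, List, Optional


def make_predict_fn(
    generate: Callable[..., str],
    inference_params: Optional[Dict[str, Any]] = None,
) -> Callable[[List[str]], List[str]]:
    """Wrap a text-generation callable so it can be applied to a batch of inputs.

    `generate` is a hypothetical stand-in for a loaded model: it receives one
    input string plus keyword inference parameters and returns one string.
    """
    params = dict(inference_params or {})  # copy so later mutation has no effect

    def predict(inputs: List[str]) -> List[str]:
        return [generate(text, **params) for text in inputs]

    return predict


def echo_model(text: str, max_tokens: int = 5) -> str:
    """Toy model: returns the first `max_tokens` whitespace-separated words."""
    return " ".join(text.split()[:max_tokens])


predict = make_predict_fn(echo_model, inference_params={"max_tokens": 3})
outputs = predict(["What is machine learning?", "Explain deep learning"])
print(outputs)
```

Binding the parameters in a closure keeps the evaluation loop unaware of model-specific settings, which is presumably the motivation for a helper of this kind.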

### Prompt Management

Comprehensive prompt engineering and versioning system for managing prompts across LLM applications.

```python { .api }
def register_prompt(name, prompt, model_config=None, description=None, tags=None):
    """
    Register a prompt template in MLflow.

    Parameters:
    - name: str - Unique prompt name (format: "name/version")
    - prompt: str or PromptTemplate - Prompt content or template
    - model_config: dict, optional - Associated model configuration
    - description: str, optional - Prompt description
    - tags: dict, optional - Prompt tags for organization

    Returns:
    Prompt object representing registered prompt
    """

def load_prompt(name):
    """
    Load registered prompt by name.

    Parameters:
    - name: str - Prompt name with optional version ("name" or "name/version")

    Returns:
    Prompt object with template and configuration
    """

def search_prompts(name_like=None, tags=None, max_results=None):
    """
    Search registered prompts by criteria.

    Parameters:
    - name_like: str, optional - Pattern to match prompt names
    - tags: dict, optional - Tags to filter prompts
    - max_results: int, optional - Maximum number of results

    Returns:
    List of Prompt objects matching criteria
    """

def set_prompt_alias(name, alias, version):
    """
    Set alias for prompt version.

    Parameters:
    - name: str - Prompt name
    - alias: str - Alias name (e.g., "champion", "latest")
    - version: str or int - Prompt version number
    """

def delete_prompt_alias(name, alias):
    """
    Delete prompt alias.

    Parameters:
    - name: str - Prompt name
    - alias: str - Alias to delete
    """
```
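
The relationship between names, versions, and aliases can be illustrated with a small in-memory sketch. This is not how MLflow stores prompts; `InMemoryPromptRegistry` and its methods are hypothetical, and the assumption that versions are 1-based integers is made only for the example. It shows how a reference such as "name", "name/version", or a name plus alias could resolve to a template.

```python
from typing import Dict, List, Optional, Tuple


class InMemoryPromptRegistry:
    """Toy registry: integer versions per name, plus aliases pointing at versions."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[str]] = {}
        self._aliases: Dict[Tuple[str, str], int] = {}

    def register(self, name: str, template: str) -> int:
        """Store a new version of `name` and return its 1-based version number."""
        versions = self._versions.setdefault(name, [])
        versions.append(template)
        return len(versions)

    def set_alias(self, name: str, alias: str, version: int) -> None:
        """Point `alias` at an existing version of `name`."""
        if not 1 <= version <= len(self._versions.get(name, [])):
            raise KeyError(f"{name} has no version {version}")
        self._aliases[(name, alias)] = version

    def delete_alias(self, name: str, alias: str) -> None:
        """Remove an alias; missing aliases are ignored."""
        self._aliases.pop((name, alias), None)

    def load(self, ref: str, alias: Optional[str] = None) -> str:
        """Resolve "name", "name/version", or a name plus alias to a template."""
        name, _, version_text = ref.partition("/")
        versions = self._versions[name]
        if alias is not None:
            return versions[self._aliases[(name, alias)] - 1]
        if version_text:
            version = int(version_text)
            if not 1 <= version <= len(versions):
                raise KeyError(f"{name} has no version {version}")
            return versions[version - 1]
        return versions[-1]  # no version given: use the latest


registry = InMemoryPromptRegistry()
registry.register("text_classification", "Classify: {text}")
registry.register("text_classification", "Classify carefully: {text}")
registry.set_alias("text_classification", "champion", 1)

print(registry.load("text_classification"))
print(registry.load("text_classification/1"))
print(registry.load("text_classification", alias="champion"))
```

An alias is just a movable pointer to a version, so promoting a new "champion" does not require callers to change the reference they load.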

### Prompt Optimization

Automated prompt optimization and improvement using various optimization strategies.

```python { .api }
def optimize_prompt(task, num_candidates=20, max_iterations=10, model=None, prompt_template=None, model_config=None, evaluator_config=None):
    """
    Automatically optimize prompts for better performance.

    Parameters:
    - task: str - Description of the task for prompt optimization
    - num_candidates: int - Number of prompt candidates to generate
    - max_iterations: int - Maximum optimization iterations
    - model: Model or URI, optional - Model for prompt testing
    - prompt_template: str, optional - Base prompt template
    - model_config: dict, optional - Model configuration
    - evaluator_config: dict, optional - Evaluation configuration

    Returns:
    OptimizationResult with best prompt and performance metrics
    """
```
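
The source does not say which optimization strategy is used. As one possible reading of `num_candidates` and `max_iterations`, the sketch below implements a simple greedy search: each iteration generates a batch of candidate prompts from the current best and keeps the top scorer if it improves. The functions `optimize`, `add_hint`, and `coverage` are hypothetical and stand in for LLM-based generation and scoring.

```python
import random
from typing import Callable, Tuple


def optimize(
    base_prompt: str,
    mutate: Callable[[str, random.Random], str],
    score: Callable[[str], float],
    num_candidates: int = 20,
    max_iterations: int = 10,
    seed: int = 0,
) -> Tuple[str, float]:
    """Greedy search: each round mutates the current best prompt and keeps improvements."""
    rng = random.Random(seed)
    best_prompt, best_score = base_prompt, score(base_prompt)
    for _ in range(max_iterations):
        candidates = [mutate(best_prompt, rng) for _ in range(num_candidates)]
        top = max(candidates, key=score)
        top_score = score(top)
        if top_score > best_score:
            best_prompt, best_score = top, top_score
    return best_prompt, best_score


# Hypothetical toy problem: reward prompts that mention more of the wanted hints.
HINTS = ["Be persuasive.", "Be informative.", "Highlight key features."]


def add_hint(prompt: str, rng: random.Random) -> str:
    """Candidate generator: append one randomly chosen hint."""
    return f"{prompt} {rng.choice(HINTS)}"


def coverage(prompt: str) -> float:
    """Score: fraction of hints that appear in the prompt."""
    return sum(hint in prompt for hint in HINTS) / len(HINTS)


best_prompt, best_score = optimize(
    "Write a product description.", add_hint, coverage,
    num_candidates=10, max_iterations=5,
)
print(best_score)
print(best_prompt)
```

In a real setting the scoring function would be an evaluation over a dataset, so the cost grows with `num_candidates * max_iterations`.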

### Custom Scorers and Metrics

Framework for creating custom scoring functions and metrics for LLM evaluation.

```python { .api }
def scorer(name=None, version=None, greater_is_better=True, long_name=None, model_type=None):
    """
    Decorator for creating custom LLM scorer functions.

    Parameters:
    - name: str, optional - Scorer name (inferred if not provided)
    - version: str, optional - Scorer version
    - greater_is_better: bool - Whether higher scores are better
    - long_name: str, optional - Human-readable scorer name
    - model_type: str, optional - Compatible model types

    Returns:
    Scorer object wrapping the function
    """

class Scorer:
    def __init__(self, eval_fn, name=None, version=None, greater_is_better=True, long_name=None, model_type=None):
        """
        Create custom LLM scorer.

        Parameters:
        - eval_fn: callable - Function that computes score
        - name: str, optional - Scorer name
        - version: str, optional - Scorer version
        - greater_is_better: bool - Whether higher scores are better
        - long_name: str, optional - Human-readable name
        - model_type: str, optional - Compatible model types
        """

    def score(self, predictions, targets=None, **kwargs):
        """
        Compute scores for predictions.

        Parameters:
        - predictions: list - Model predictions to score
        - targets: list, optional - Ground truth targets
        - kwargs: Additional scoring arguments

        Returns:
        Scores or metrics dictionary
        """
```
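
The decorator-plus-class pattern described above can be sketched without MLflow. `SimpleScorer` and `simple_scorer` are hypothetical stand-ins, not the library's classes; the sketch only illustrates how a decorator can wrap a plain function in a scorer object and infer the scorer name from the function name when none is given.

```python
from typing import Callable, List, Optional


class SimpleScorer:
    """Hypothetical stand-in for a scorer object; not MLflow's class."""

    def __init__(
        self,
        eval_fn: Callable[..., List[float]],
        name: Optional[str] = None,
        greater_is_better: bool = True,
    ) -> None:
        self.eval_fn = eval_fn
        self.name = name or eval_fn.__name__  # infer the name when omitted
        self.greater_is_better = greater_is_better

    def score(
        self,
        predictions: List[str],
        targets: Optional[List[str]] = None,
        **kwargs,
    ) -> List[float]:
        """Delegate to the wrapped function."""
        return self.eval_fn(predictions, targets=targets, **kwargs)


def simple_scorer(name: Optional[str] = None, greater_is_better: bool = True):
    """Decorator factory that wraps a function in a SimpleScorer."""
    def wrap(fn: Callable[..., List[float]]) -> SimpleScorer:
        return SimpleScorer(fn, name=name, greater_is_better=greater_is_better)
    return wrap


@simple_scorer()
def exact_match(predictions, targets=None, **kwargs):
    """1.0 where the prediction equals its target, ignoring case and outer spaces."""
    targets = targets or []
    return [
        float(p.strip().lower() == t.strip().lower())
        for p, t in zip(predictions, targets)
    ]


print(exact_match.name)
print(exact_match.score(["Paris", "berlin "], targets=["paris", "Rome"]))
```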

### Scheduled Scoring

Configuration and management of automated scoring pipelines for continuous evaluation.

```python { .api }
class ScorerScheduleConfig:
    def __init__(self, schedule_type, frequency, start_time=None, end_time=None, timezone=None):
        """
        Configuration for scheduled scoring jobs.

        Parameters:
        - schedule_type: str - Type of schedule ("cron", "interval")
        - frequency: str or int - Schedule frequency specification
        - start_time: str, optional - Start time for scheduled jobs
        - end_time: str, optional - End time for scheduled jobs
        - timezone: str, optional - Timezone for schedule
        """
```
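
The source does not define the units of `frequency` or how the start and end times bound a schedule. The sketch below assumes an "interval" schedule whose frequency is a number of seconds, and computes the resulting run times with the standard library. `interval_run_times` is a hypothetical helper, not part of MLflow.

```python
from datetime import datetime, timedelta
from typing import List, Optional


def interval_run_times(
    frequency_seconds: int,
    start_time: datetime,
    end_time: Optional[datetime] = None,
    limit: int = 5,
) -> List[datetime]:
    """Return up to `limit` run times spaced `frequency_seconds` apart.

    Runs begin at `start_time`; if `end_time` is given, later runs are dropped.
    """
    if frequency_seconds <= 0:
        raise ValueError("frequency_seconds must be positive")
    runs: List[datetime] = []
    current = start_time
    while len(runs) < limit and (end_time is None or current <= end_time):
        runs.append(current)
        current += timedelta(seconds=frequency_seconds)
    return runs


runs = interval_run_times(
    frequency_seconds=3600,
    start_time=datetime(2024, 1, 1, 0, 0),
    end_time=datetime(2024, 1, 1, 2, 30),
)
for run in runs:
    print(run.isoformat())
```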

### Dataset Management

Specialized dataset operations for LLM training and evaluation datasets.

```python { .api }
def create_dataset(name, data_source=None, description=None, tags=None):
    """
    Create GenAI dataset for LLM evaluation.

    Parameters:
    - name: str - Dataset name
    - data_source: str or DataFrame, optional - Data source location or content
    - description: str, optional - Dataset description
    - tags: dict, optional - Dataset tags

    Returns:
    Dataset object for GenAI applications
    """

def get_dataset(name, version=None):
    """
    Retrieve GenAI dataset by name.

    Parameters:
    - name: str - Dataset name
    - version: str or int, optional - Dataset version

    Returns:
    Dataset object with LLM evaluation data
    """

def delete_dataset(name, version=None):
    """
    Delete GenAI dataset.

    Parameters:
    - name: str - Dataset name to delete
    - version: str or int, optional - Specific version to delete
    """
```

### Interactive Labeling and Review

Tools for human-in-the-loop evaluation and data labeling for LLM applications.

```python { .api }
def create_labeling_session(name, dataset=None, instructions=None, labelers=None, config=None):
    """
    Create interactive labeling session for LLM data.

    Parameters:
    - name: str - Session name
    - dataset: Dataset or str, optional - Dataset to label
    - instructions: str, optional - Labeling instructions
    - labelers: list, optional - List of labeler identifiers
    - config: dict, optional - Labeling session configuration

    Returns:
    LabelingSession object
    """

def get_labeling_session(session_id):
    """
    Retrieve labeling session by ID.

    Parameters:
    - session_id: str - Labeling session identifier

    Returns:
    LabelingSession object
    """

def get_labeling_sessions(experiment_id=None, status=None):
    """
    List labeling sessions with optional filtering.

    Parameters:
    - experiment_id: str, optional - Filter by experiment
    - status: str, optional - Filter by session status

    Returns:
    List of LabelingSession objects
    """

def delete_labeling_session(session_id):
    """
    Delete labeling session.

    Parameters:
    - session_id: str - Session ID to delete
    """

class LabelingSession:
    def __init__(self, name, dataset=None, instructions=None, config=None):
        """
        Interactive labeling session for GenAI data.

        Parameters:
        - name: str - Session name
        - dataset: Dataset, optional - Dataset to label
        - instructions: str, optional - Labeling instructions
        - config: dict, optional - Session configuration
        """

    def add_labels(self, labels):
        """Add labels to session."""

    def get_labels(self):
        """Get current session labels."""

    def export_labels(self, format="json"):
        """Export labels in specified format."""

class Agent:
    def __init__(self, name, model=None, tools=None, instructions=None):
        """
        GenAI agent for automated evaluation and labeling.

        Parameters:
        - name: str - Agent name
        - model: Model or str, optional - LLM model for agent
        - tools: list, optional - Available tools for agent
        - instructions: str, optional - Agent instructions
        """

def get_review_app(session_id):
    """
    Get review application for labeling session.

    Parameters:
    - session_id: str - Labeling session ID

    Returns:
    ReviewApp object for interactive review
    """

class ReviewApp:
    def __init__(self, session):
        """
        Web application for reviewing and labeling LLM outputs.

        Parameters:
        - session: LabelingSession - Associated labeling session
        """

    def launch(self, port=8080, host="localhost"):
        """Launch review application."""

    def stop(self):
        """Stop review application."""
```

### Built-in Evaluators and Judges

Pre-built evaluators and judge models for common LLM evaluation tasks.

```python { .api }
# Built-in judge models for evaluation
judges = {
    "gpt4_as_judge": "GPT-4 based evaluation judge",
    "claude_as_judge": "Claude based evaluation judge",
    "llama_as_judge": "Llama based evaluation judge"
}

# Built-in scorer functions
scorers = {
    "answer_relevance": "Evaluate answer relevance to question",
    "answer_correctness": "Evaluate factual correctness of answers",
    "answer_similarity": "Semantic similarity between answers",
    "faithfulness": "Evaluate faithfulness to source context",
    "context_precision": "Precision of retrieved context",
    "context_recall": "Recall of retrieved context",
    "toxicity": "Detect toxic or harmful content",
    "readability": "Evaluate text readability and clarity"
}

# Dataset utilities
datasets = {
    "common_datasets": "Access to common LLM evaluation datasets",
    "benchmarks": "Standard LLM benchmarks and test sets"
}
```
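
For intuition about the `context_precision` and `context_recall` entries, here is a simplified, set-based sketch over document identifiers. It is an assumption-laden illustration of the general precision and recall definitions, not the built-in scorers, which may rely on a judge model rather than exact identifier matches. `context_precision_recall` is a hypothetical helper.

```python
from typing import Iterable, Tuple


def context_precision_recall(
    retrieved: Iterable[str], relevant: Iterable[str]
) -> Tuple[float, float]:
    """Set-based precision and recall over document identifiers."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    # Precision: share of retrieved items that are relevant.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    # Recall: share of relevant items that were retrieved.
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


precision, recall = context_precision_recall(
    retrieved=["doc1", "doc2", "doc3", "doc4"],
    relevant=["doc2", "doc4", "doc7"],
)
print(f"precision={precision:.2f} recall={recall:.2f}")
```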

## Usage Examples

### Basic LLM Evaluation

```python
import mlflow
import mlflow.genai
import pandas as pd

# Prepare evaluation dataset
eval_data = pd.DataFrame({
    "inputs": [
        "What is machine learning?",
        "Explain deep learning",
        "How does AI work?"
    ],
    "targets": [
        "Machine learning is a subset of AI that learns from data",
        "Deep learning uses neural networks with multiple layers",
        "AI works by processing data to make predictions or decisions"
    ]
})

# Evaluate LLM model
with mlflow.start_run():
    results = mlflow.genai.evaluate(
        model="openai:/gpt-4",  # Model URI
        data=eval_data,
        model_type="text",
        evaluators=["default", "answer_relevance", "toxicity"],
        targets="targets"
    )

    # Log evaluation results
    mlflow.log_metrics(results.metrics)

    print("Evaluation Results:")
    for metric_name, score in results.metrics.items():
        print(f"{metric_name}: {score:.3f}")
```

### Custom Scorer Creation

```python
import mlflow.genai
from mlflow.genai import scorer
import re

# Create custom scorer using decorator
@scorer(name="question_detection", greater_is_better=True)
def detect_questions(predictions, targets=None, **kwargs):
    """Custom scorer to detect if text contains questions."""
    scores = []
    for pred in predictions:
        # Count question marks and question words
        question_marks = pred.count('?')
        question_words = len(re.findall(r'\b(what|how|why|when|where|who)\b', pred.lower()))
        score = min(1.0, (question_marks + question_words * 0.5) / 2)
        scores.append(score)
    return scores

# Create scorer using class
class SentimentScorer(mlflow.genai.Scorer):
    def __init__(self):
        super().__init__(
            eval_fn=self._score_sentiment,
            name="sentiment_positivity",
            greater_is_better=True
        )

    def _score_sentiment(self, predictions, **kwargs):
        """Score text sentiment positivity."""
        # Simplified sentiment scoring
        positive_words = ["good", "great", "excellent", "amazing", "wonderful"]
        negative_words = ["bad", "terrible", "awful", "horrible", "worst"]

        scores = []
        for pred in predictions:
            pred_lower = pred.lower()
            pos_count = sum(word in pred_lower for word in positive_words)
            neg_count = sum(word in pred_lower for word in negative_words)

            if pos_count + neg_count == 0:
                score = 0.5  # Neutral
            else:
                score = pos_count / (pos_count + neg_count)

            scores.append(score)
        return scores

# Use custom scorers in evaluation
# (eval_data is the DataFrame defined in the Basic LLM Evaluation example)
sentiment_scorer = SentimentScorer()

results = mlflow.genai.evaluate(
    model="openai:/gpt-3.5-turbo",
    data=eval_data,
    custom_metrics=[detect_questions, sentiment_scorer],
    model_type="text"
)

print("Custom metric results:")
print(f"Question detection: {results.metrics['question_detection']:.3f}")
print(f"Sentiment positivity: {results.metrics['sentiment_positivity']:.3f}")
```
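
The sentiment scorer above uses substring checks (`word in pred_lower`), so a word such as "badge" would count as a match for "bad". The following standalone sketch shows a whole-word variant that runs without MLflow. Note one deliberate difference: it counts every token occurrence, whereas the example above counts each listed word at most once. `sentiment_positivity` here is a hypothetical plain function, not a registered scorer.

```python
import re
from typing import List

POSITIVE = {"good", "great", "excellent", "amazing", "wonderful"}
NEGATIVE = {"bad", "terrible", "awful", "horrible", "worst"}


def sentiment_positivity(predictions: List[str]) -> List[float]:
    """Share of sentiment-bearing words that are positive; 0.5 when none are found."""
    scores = []
    for pred in predictions:
        tokens = re.findall(r"[a-z']+", pred.lower())  # whole words only
        pos = sum(token in POSITIVE for token in tokens)
        neg = sum(token in NEGATIVE for token in tokens)
        scores.append(0.5 if pos + neg == 0 else pos / (pos + neg))
    return scores


scores = sentiment_positivity([
    "A great phone with an excellent camera",
    "The badge reader works",  # "badge" contains "bad" but is not a whole-word match
    "Good screen, terrible battery",
])
print(scores)
```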

### Prompt Management Workflow

```python
import mlflow.genai

# Register prompt templates
classification_prompt = """
You are an expert classifier. Given the following text, classify it into one of these categories: {categories}

Text: {text}

Classification:
"""

mlflow.genai.register_prompt(
    name="text_classification/v1",
    prompt=classification_prompt,
    description="Multi-class text classification prompt",
    tags={"task": "classification", "version": "1.0"}
)

# Register improved version
improved_prompt = """
You are an expert text classifier with high accuracy. Analyze the following text carefully and classify it into exactly one of these categories: {categories}

Text to classify: "{text}"

Think step by step:
1. What are the key themes in this text?
2. Which category best matches these themes?
3. Why is this the best classification?

Final classification:
"""

mlflow.genai.register_prompt(
    name="text_classification/v2",
    prompt=improved_prompt,
    description="Improved classification prompt with reasoning",
    tags={"task": "classification", "version": "2.0", "reasoning": "true"}
)

# Set alias for best performing version
mlflow.genai.set_prompt_alias(
    name="text_classification",
    alias="champion",
    version="2"
)

# Load and use prompt
prompt = mlflow.genai.load_prompt("text_classification@champion")
formatted_prompt = prompt.format(
    # Pass categories as a string so they render as plain text, not a list repr
    categories="positive, negative, neutral",
    text="I love this product!"
)

print("Formatted prompt:")
print(formatted_prompt)

# Search for prompts
classification_prompts = mlflow.genai.search_prompts(
    name_like="classification*",
    tags={"task": "classification"}
)

print(f"\nFound {len(classification_prompts)} classification prompts")
for p in classification_prompts:
    print(f"- {p.name}: {p.description}")
```

### Prompt Optimization

```python
import mlflow.genai

# Define optimization task
task_description = """
Create a prompt that helps an AI assistant generate engaging
product descriptions for e-commerce items. The descriptions
should be persuasive, informative, and highlight key features.
"""

# Base prompt template
base_prompt = """
Write a product description for: {product_name}

Features: {features}
Price: {price}

Description:
"""

# Optimize prompt automatically
with mlflow.start_run():
    optimization_result = mlflow.genai.optimize_prompt(
        task=task_description,
        prompt_template=base_prompt,
        num_candidates=10,
        max_iterations=5,
        model="openai:/gpt-4",
        evaluator_config={
            "metrics": ["engagement", "clarity", "persuasiveness"]
        }
    )

    # Log optimization results
    mlflow.log_metric("optimization_score", optimization_result.best_score)
    mlflow.log_param("iterations_completed", optimization_result.iterations)

    # Register optimized prompt
    mlflow.genai.register_prompt(
        name="product_description/optimized",
        prompt=optimization_result.best_prompt,
        description="Auto-optimized product description prompt",
        tags={"optimized": "true", "score": str(optimization_result.best_score)}
    )

    print(f"Optimization completed with score: {optimization_result.best_score:.3f}")
    print(f"Best prompt:\n{optimization_result.best_prompt}")
```

### Interactive Labeling Session

```python
import mlflow.genai
import pandas as pd

# Create dataset for labeling
unlabeled_data = pd.DataFrame({
    "text": [
        "The movie was absolutely fantastic!",
        "I didn't like the service at all.",
        "The product works as expected.",
        "This is the worst experience ever.",
        "Pretty good, would recommend."
    ]
})

# Create labeling session
session = mlflow.genai.create_labeling_session(
    name="sentiment_labeling_v1",
    dataset=unlabeled_data,
    instructions="""
    Label each text with sentiment:
    - positive: Text expresses positive sentiment
    - negative: Text expresses negative sentiment
    - neutral: Text expresses neutral sentiment

    Consider the overall emotional tone and opinion expressed.
    """,
    config={
        "labels": ["positive", "negative", "neutral"],
        "allow_multiple": False,
        "require_confidence": True
    }
)

print(f"Created labeling session: {session.session_id}")

# Simulate adding labels (normally done through UI)
labels = [
    {"text_id": 0, "label": "positive", "confidence": 0.95},
    {"text_id": 1, "label": "negative", "confidence": 0.90},
    {"text_id": 2, "label": "neutral", "confidence": 0.80},
    {"text_id": 3, "label": "negative", "confidence": 0.98},
    {"text_id": 4, "label": "positive", "confidence": 0.85}
]

session.add_labels(labels)

# Export labeled data
labeled_dataset = session.export_labels(format="json")
print(f"Exported {len(labeled_dataset)} labeled examples")

# Create review app for quality control
review_app = mlflow.genai.get_review_app(session.session_id)
# review_app.launch(port=8080)  # Launches web interface
```
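
Once labels are exported as a list of dictionaries like the one above, a quick quality check is to count labels above a confidence threshold. The sketch below uses only the standard library and the same record shape (`text_id`, `label`, `confidence`); `summarize_labels` is a hypothetical helper, not part of the labeling API.

```python
from collections import Counter
from typing import Any, Dict, List


def summarize_labels(
    labels: List[Dict[str, Any]], min_confidence: float = 0.0
) -> Dict[str, int]:
    """Count labels whose confidence is at least `min_confidence`."""
    kept = [
        item["label"]
        for item in labels
        if item.get("confidence", 0.0) >= min_confidence
    ]
    return dict(Counter(kept))


labels = [
    {"text_id": 0, "label": "positive", "confidence": 0.95},
    {"text_id": 1, "label": "negative", "confidence": 0.90},
    {"text_id": 2, "label": "neutral", "confidence": 0.80},
    {"text_id": 3, "label": "negative", "confidence": 0.98},
    {"text_id": 4, "label": "positive", "confidence": 0.85},
]

print(summarize_labels(labels))
print(summarize_labels(labels, min_confidence=0.9))
```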

### GenAI Agent Implementation

```python
import mlflow.genai

# Create GenAI agent for automated evaluation
evaluation_agent = mlflow.genai.Agent(
    name="evaluation_agent",
    model="openai:/gpt-4",
    tools=["web_search", "calculator", "code_execution"],
    instructions="""
    You are an expert evaluator for AI-generated content.
    Analyze responses for accuracy, relevance, and quality.
    Use available tools to fact-check when needed.
    Provide detailed feedback and numerical scores.
    """
)

# Agent evaluates model outputs
test_outputs = [
    "Paris is the capital of France and has a population of about 2.1 million.",
    "The square root of 144 is 12.",
    "Python is a programming language created in 1991 by Guido van Rossum."
]

evaluation_results = []
for output in test_outputs:
    # Agent evaluates each output
    result = evaluation_agent.evaluate(
        text=output,
        criteria=["factual_accuracy", "completeness", "clarity"]
    )
    evaluation_results.append(result)

# Create automated labeling agent
labeling_agent = mlflow.genai.Agent(
    name="auto_labeler",
    model="anthropic:/claude-3",
    instructions="""
    You are an expert data labeler. Label text data according to
    the provided schema and guidelines. Be consistent and accurate.
    """
)

# Use agent for automated labeling
# (unlabeled_data is the DataFrame defined in the Interactive Labeling Session example)
auto_labels = labeling_agent.label_batch(
    texts=unlabeled_data["text"].tolist(),
    schema={"sentiment": ["positive", "negative", "neutral"]},
    guidelines="Focus on overall emotional tone and opinion"
)

print("Automated labeling results:")
for text, label in zip(unlabeled_data["text"], auto_labels):
    print(f"'{text}' -> {label}")
```

### Comprehensive LLM Evaluation Pipeline

```python
import mlflow
import mlflow.genai
import pandas as pd

def create_llm_evaluation_pipeline():
    """Comprehensive LLM evaluation workflow."""

    # Set up experiment
    mlflow.set_experiment("llm_evaluation_pipeline")

    with mlflow.start_run():
        # 1. Prepare evaluation dataset
        eval_data = pd.DataFrame({
            "questions": [
                "What is artificial intelligence?",
                "How do neural networks work?",
                "What are the benefits of machine learning?",
                "Explain natural language processing",
                "What is deep learning?"
            ],
            "ground_truth": [
                "AI is the simulation of human intelligence in machines",
                "Neural networks are computing systems inspired by biological neural networks",
                "ML provides automation, insights, and improved decision-making",
                "NLP enables computers to understand and process human language",
                "Deep learning is a subset of ML using artificial neural networks"
            ]
        })

        # 2. Create custom evaluators
        @mlflow.genai.scorer(name="technical_accuracy")
        def technical_accuracy(predictions, targets, **kwargs):
            # Simplified technical accuracy scoring
            scores = []
            for pred, target in zip(predictions, targets):
                # Word-level overlap (Jaccard) between prediction and target
                pred_words = set(pred.lower().split())
                target_words = set(target.lower().split())
                union = pred_words | target_words
                # Guard against division by zero when both texts are empty
                overlap = len(pred_words & target_words) / len(union) if union else 0.0
                scores.append(overlap)
            return scores

        # 3. Evaluate multiple models
        models_to_evaluate = [
            "openai:/gpt-3.5-turbo",
            "openai:/gpt-4",
            "anthropic:/claude-3"
        ]

        comparison_results = {}

        for model_name in models_to_evaluate:
            print(f"\nEvaluating {model_name}...")

            # Evaluate model
            results = mlflow.genai.evaluate(
                model=model_name,
                data=eval_data,
                targets="ground_truth",
                model_type="text",
                evaluators=["default", "answer_relevance", "faithfulness"],
                custom_metrics=[technical_accuracy],
                evaluator_config={
                    "answer_relevance": {"threshold": 0.7},
                    "faithfulness": {"threshold": 0.8}
                }
            )

            comparison_results[model_name] = results.metrics

            # Log individual model results
            for metric, value in results.metrics.items():
                mlflow.log_metric(f"{model_name}_{metric}", value)

        # 4. Create comparison report
        print("\n=== Model Comparison Results ===")
        for metric in ["answer_relevance", "faithfulness", "technical_accuracy"]:
            print(f"\n{metric}:")
            for model, metrics in comparison_results.items():
                print(f"  {model}: {metrics.get(metric, 0):.3f}")

        # 5. Select the best performing model by answer relevance
        best_model = max(
            comparison_results.items(),
            key=lambda x: x[1].get("answer_relevance", 0)
        )[0]

        mlflow.log_param("best_model", best_model)
        mlflow.log_metric("best_answer_relevance",
                          comparison_results[best_model]["answer_relevance"])

        # 6. Save evaluation artifacts
        comparison_df = pd.DataFrame(comparison_results).T
        comparison_df.to_csv("model_comparison.csv")
        mlflow.log_artifact("model_comparison.csv")

        print(f"\nBest performing model: {best_model}")

    return comparison_results

# Run evaluation pipeline
results = create_llm_evaluation_pipeline()
```
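
The `technical_accuracy` scorer in the pipeline is a word-level Jaccard similarity: the size of the intersection of the two word sets divided by the size of their union. The standalone sketch below isolates that computation so it can be checked by hand. `jaccard_overlap` is a hypothetical helper name, and returning 0.0 for two empty strings is a convention chosen here.

```python
def jaccard_overlap(prediction: str, target: str) -> float:
    """Jaccard similarity of the lowercase word sets of two strings."""
    pred_words = set(prediction.lower().split())
    target_words = set(target.lower().split())
    union = pred_words | target_words
    if not union:
        return 0.0  # both strings empty: define the overlap as zero
    return len(pred_words & target_words) / len(union)


# Four shared words out of six distinct words overall.
similarity = jaccard_overlap(
    "deep learning uses neural networks",
    "neural networks power deep learning",
)
print(f"{similarity:.3f}")
print(jaccard_overlap("", ""))
```

Because the score depends only on exact word matches, it rewards shared vocabulary rather than correctness, which is why the pipeline labels it a simplified metric.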

## Types

```python { .api }
from typing import Dict, List, Any, Optional, Union, Callable
from mlflow.entities import Dataset
import pandas as pd

# Core evaluation types
class EvaluationResult:
    metrics: Dict[str, float]
    artifacts: Dict[str, str]
    tables: Dict[str, pd.DataFrame]

def to_predict_fn(
    model_uri: str,
    inference_params: Optional[Dict[str, Any]] = None
) -> Callable[[pd.DataFrame], List[str]]: ...

# Prompt management types
class Prompt:
    name: str
    version: str
    template: str
    model_config: Optional[Dict[str, Any]]
    description: Optional[str]
    tags: Dict[str, str]

    def format(self, **kwargs) -> str: ...

class PromptTemplate:
    template: str
    input_variables: List[str]

    def format(self, **kwargs) -> str: ...

# Scorer types
class Scorer:
    name: str
    version: Optional[str]
    greater_is_better: bool
    long_name: Optional[str]
    model_type: Optional[str]

    def score(self, predictions: List[str], targets: Optional[List[str]] = None, **kwargs) -> List[float]: ...

def scorer(
    name: Optional[str] = None,
    version: Optional[str] = None,
    greater_is_better: bool = True,
    long_name: Optional[str] = None,
    model_type: Optional[str] = None
) -> Callable: ...

# Optimization types
class OptimizationResult:
    best_prompt: str
    best_score: float
    iterations: int
    candidate_prompts: List[str]
    scores: List[float]

# Scheduling types
class ScorerScheduleConfig:
    schedule_type: str
    frequency: Union[str, int]
    start_time: Optional[str]
    end_time: Optional[str]
    timezone: Optional[str]

# Labeling types
class LabelingSession:
    session_id: str
    name: str
    dataset: Optional[Dataset]
    instructions: Optional[str]
    config: Dict[str, Any]
    status: str

    def add_labels(self, labels: List[Dict[str, Any]]) -> None: ...
    def get_labels(self) -> List[Dict[str, Any]]: ...
    def export_labels(self, format: str = "json") -> Union[List[Dict], pd.DataFrame]: ...

class Agent:
    name: str
    model: Optional[str]
    tools: List[str]
    instructions: Optional[str]

    def evaluate(self, text: str, criteria: List[str]) -> Dict[str, Any]: ...
    def label_batch(self, texts: List[str], schema: Dict[str, Any], guidelines: str) -> List[Dict[str, Any]]: ...

class ReviewApp:
    session: LabelingSession

    def launch(self, port: int = 8080, host: str = "localhost") -> None: ...
    def stop(self) -> None: ...

# Dataset types
class GenAIDataset(Dataset):
    name: str
    version: Optional[str]
    description: Optional[str]
    tags: Dict[str, str]

# Built-in resources
judges: Dict[str, str]
scorers: Dict[str, str]
datasets: Dict[str, str]
```