Tessl Tile for pypi/mlflow@3.3.0

# GenAI and LLM Integration

MLflow's GenAI module supports building applications on large language models (LLMs): prompt management and optimization, LLM evaluation with built-in and custom scorers, scheduled scoring, evaluation datasets, and interactive labeling workflows.

## Capabilities

### Model Evaluation and Testing

An evaluation framework for LLM and GenAI applications, with built-in metrics and support for custom evaluators.

```python { .api }
def evaluate(model=None, data=None, model_type="text", evaluators=None, targets=None, evaluator_config=None, custom_metrics=None, extra_metrics=None, baseline_model=None, inference_params=None, model_config=None):
    """
    Evaluate GenAI models with specialized LLM metrics.

    Parameters:
    - model: Model, callable, or URI - LLM model to evaluate
    - data: DataFrame, Dataset, or URI - Evaluation dataset with inputs
    - model_type: str - Type of model ("text", "chat", "question-answering")
    - evaluators: list, optional - List of evaluator names or objects
    - targets: str or array, optional - Ground truth targets for evaluation
    - evaluator_config: dict, optional - Configuration for evaluators
    - custom_metrics: list, optional - Custom metric functions
    - extra_metrics: list, optional - Additional built-in metrics
    - baseline_model: Model or URI, optional - Baseline model for comparison
    - inference_params: dict, optional - Model inference parameters
    - model_config: dict, optional - Model configuration parameters

    Returns:
    EvaluationResult object with LLM-specific metrics and artifacts
    """

def to_predict_fn(model_uri, inference_params=None):
    """
    Convert MLflow model to prediction function for evaluation.

    Parameters:
    - model_uri: str - URI pointing to MLflow model
    - inference_params: dict, optional - Parameters for model inference

    Returns:
    Callable prediction function compatible with evaluation
    """
```
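
To make the contract of `to_predict_fn` more concrete, here is a minimal sketch of the kind of callable it is described as returning: a function that takes a batch of inputs and returns one prediction per input. It does not call MLflow. `make_predict_fn` and `echo_model` are hypothetical names, and the list-in, list-out shape is an assumption made for illustration.

```python
from typing import Any, Callable, Dict, List, Optional


def make_predict_fn(
    generate: Callable[..., str],
    inference_params: Optional[Dict[str, Any]] = None,
) -> Callable[[List[str]], List[str]]:
    """Wrap a single-input text generator into a batch prediction function.

    `generate` is a hypothetical stand-in for a model call, not an MLflow API.
    """
    params = dict(inference_params or {})

    def predict(inputs: List[str]) -> List[str]:
        # One prediction per input, reusing the same inference parameters.
        return [generate(text, **params) for text in inputs]

    return predict


def echo_model(text: str, prefix: str = "") -> str:
    """Toy model used only to exercise the wrapper."""
    return f"{prefix}{text.upper()}"


predict_fn = make_predict_fn(echo_model, inference_params={"prefix": "> "})
predictions = predict_fn(["what is mlflow?", "explain tracing"])
print(predictions)
```

Binding the inference parameters once, when the function is created, keeps the evaluation loop free of model-specific arguments.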

### Prompt Management

A prompt engineering and versioning system for managing prompts across LLM applications.

```python { .api }
def register_prompt(name, prompt, model_config=None, description=None, tags=None):
    """
    Register a prompt template in MLflow.

    Parameters:
    - name: str - Unique prompt name (format: "name/version")
    - prompt: str or PromptTemplate - Prompt content or template
    - model_config: dict, optional - Associated model configuration
    - description: str, optional - Prompt description
    - tags: dict, optional - Prompt tags for organization

    Returns:
    Prompt object representing registered prompt
    """

def load_prompt(name):
    """
    Load registered prompt by name.

    Parameters:
    - name: str - Prompt name with optional version ("name" or "name/version")

    Returns:
    Prompt object with template and configuration
    """

def search_prompts(name_like=None, tags=None, max_results=None):
    """
    Search registered prompts by criteria.

    Parameters:
    - name_like: str, optional - Pattern to match prompt names
    - tags: dict, optional - Tags to filter prompts
    - max_results: int, optional - Maximum number of results

    Returns:
    List of Prompt objects matching criteria
    """

def set_prompt_alias(name, alias, version):
    """
    Set alias for prompt version.

    Parameters:
    - name: str - Prompt name
    - alias: str - Alias name (e.g., "champion", "latest")
    - version: str or int - Prompt version number
    """

def delete_prompt_alias(name, alias):
    """
    Delete prompt alias.

    Parameters:
    - name: str - Prompt name
    - alias: str - Alias to delete
    """
```
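
The prompt functions above accept references of the form `"name"` or `"name/version"`, and the usage examples in this document also load a prompt as `"name@alias"`. The sketch below shows one way such references could be split into their parts. `PromptRef` and `parse_prompt_ref` are hypothetical helpers written for illustration; MLflow's own parsing may differ.

```python
from typing import NamedTuple, Optional


class PromptRef(NamedTuple):
    name: str
    version: Optional[str] = None
    alias: Optional[str] = None


def parse_prompt_ref(ref: str) -> PromptRef:
    """Split "name", "name/version", or "name@alias" into its parts."""
    if "@" in ref:
        name, alias = ref.split("@", 1)
        parsed = PromptRef(name=name, alias=alias)
    elif "/" in ref:
        name, version = ref.rsplit("/", 1)
        parsed = PromptRef(name=name, version=version)
    else:
        parsed = PromptRef(name=ref)
    if not parsed.name:
        raise ValueError(f"prompt reference has no name: {ref!r}")
    return parsed


print(parse_prompt_ref("text_classification/v2"))
print(parse_prompt_ref("text_classification@champion"))
print(parse_prompt_ref("text_classification"))
```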

### Prompt Optimization

Automated prompt optimization that generates candidate prompts over a bounded number of iterations and returns the best-scoring one.

```python { .api }
def optimize_prompt(task, num_candidates=20, max_iterations=10, model=None, prompt_template=None, model_config=None, evaluator_config=None):
    """
    Automatically optimize prompts for better performance.

    Parameters:
    - task: str - Description of the task for prompt optimization
    - num_candidates: int - Number of prompt candidates to generate
    - max_iterations: int - Maximum optimization iterations
    - model: Model or URI, optional - Model for prompt testing
    - prompt_template: str, optional - Base prompt template
    - model_config: dict, optional - Model configuration
    - evaluator_config: dict, optional - Evaluation configuration

    Returns:
    OptimizationResult with best prompt and performance metrics
    """
```
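
The roles of `num_candidates` and `max_iterations` are easier to see in a toy search loop. The sketch below is not how MLflow optimizes prompts; it is a greedy loop under stated assumptions, where `toy_optimize`, `propose`, `score`, and `HINTS` are all hypothetical and the scoring rule is deliberately trivial.

```python
from typing import Callable, List, NamedTuple


class ToyOptimizationResult(NamedTuple):
    best_prompt: str
    best_score: float
    iterations: int


def toy_optimize(
    base_prompt: str,
    propose: Callable[[str, int], List[str]],
    score: Callable[[str], float],
    num_candidates: int = 20,
    max_iterations: int = 10,
) -> ToyOptimizationResult:
    """Greedy search: keep the best candidate until nothing improves."""
    best_prompt, best_score = base_prompt, score(base_prompt)
    iterations = 0
    for iterations in range(1, max_iterations + 1):
        candidates = propose(best_prompt, num_candidates)
        top = max(candidates, key=score, default=best_prompt)
        if score(top) <= best_score:
            break  # no candidate improved on the current best
        best_prompt, best_score = top, score(top)
    return ToyOptimizationResult(best_prompt, best_score, iterations)


HINTS = ["Be persuasive.", "List key features.", "Keep it concise."]


def propose(prompt: str, n: int) -> List[str]:
    # Append each hint the prompt does not contain yet (at most n candidates).
    return [f"{prompt} {hint}" for hint in HINTS if hint not in prompt][:n]


def score(prompt: str) -> float:
    # Fraction of hints already present in the prompt.
    return sum(hint in prompt for hint in HINTS) / len(HINTS)


result = toy_optimize("Write a product description.", propose, score,
                      num_candidates=3, max_iterations=5)
print(result.best_prompt)
print(result.best_score, result.iterations)
```

The loop stops early when no candidate beats the current best, so `max_iterations` acts as an upper bound rather than a fixed count.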

### Custom Scorers and Metrics

A framework for creating custom scoring functions and metrics for LLM evaluation.

```python { .api }
def scorer(name=None, version=None, greater_is_better=True, long_name=None, model_type=None):
    """
    Decorator for creating custom LLM scorer functions.

    Parameters:
    - name: str, optional - Scorer name (inferred if not provided)
    - version: str, optional - Scorer version
    - greater_is_better: bool - Whether higher scores are better
    - long_name: str, optional - Human-readable scorer name
    - model_type: str, optional - Compatible model types

    Returns:
    Scorer object wrapping the function
    """

class Scorer:
    def __init__(self, eval_fn, name=None, version=None, greater_is_better=True, long_name=None, model_type=None):
        """
        Create custom LLM scorer.

        Parameters:
        - eval_fn: callable - Function that computes score
        - name: str, optional - Scorer name
        - version: str, optional - Scorer version
        - greater_is_better: bool - Whether higher scores are better
        - long_name: str, optional - Human-readable name
        - model_type: str, optional - Compatible model types
        """

    def score(self, predictions, targets=None, **kwargs):
        """
        Compute scores for predictions.

        Parameters:
        - predictions: list - Model predictions to score
        - targets: list, optional - Ground truth targets
        - kwargs: Additional scoring arguments

        Returns:
        Scores or metrics dictionary
        """
```
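
The relationship between the decorator and the class can be sketched without MLflow: the decorator simply wraps a function in a scorer object and infers the name when none is given. `ToyScorer` and `toy_scorer` below are stand-ins written for illustration, not MLflow's `Scorer` or `scorer`.

```python
from typing import Callable, List, Optional


class ToyScorer:
    """Minimal stand-in for a scorer object (not MLflow's Scorer class)."""

    def __init__(self, eval_fn: Callable, name: Optional[str] = None,
                 greater_is_better: bool = True):
        self.eval_fn = eval_fn
        # Infer the scorer name from the function when none is provided.
        self.name = name or eval_fn.__name__
        self.greater_is_better = greater_is_better

    def score(self, predictions: List[str], targets=None, **kwargs) -> List[float]:
        return self.eval_fn(predictions, targets=targets, **kwargs)


def toy_scorer(name: Optional[str] = None, greater_is_better: bool = True):
    """Decorator factory that wraps a function in a ToyScorer."""
    def wrap(fn: Callable) -> ToyScorer:
        return ToyScorer(fn, name=name, greater_is_better=greater_is_better)
    return wrap


@toy_scorer(greater_is_better=False)
def length_penalty(predictions, targets=None, **kwargs):
    # Longer answers get a higher (worse) penalty: one point per 10 characters.
    return [len(p) / 10 for p in predictions]


scores = length_penalty.score(["short", "a much longer answer"])
print(length_penalty.name, scores)
```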

### Scheduled Scoring

Configuration for automated scoring jobs that run on a schedule for continuous evaluation.

```python { .api }
class ScorerScheduleConfig:
    def __init__(self, schedule_type, frequency, start_time=None, end_time=None, timezone=None):
        """
        Configuration for scheduled scoring jobs.

        Parameters:
        - schedule_type: str - Type of schedule ("cron", "interval")
        - frequency: str or int - Schedule frequency specification
        - start_time: str, optional - Start time for scheduled jobs
        - end_time: str, optional - End time for scheduled jobs
        - timezone: str, optional - Timezone for schedule
        """
```
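
A schedule configuration like this is usually validated before it is used. The sketch below shows one possible validation under explicit assumptions: that `"interval"` schedules take a positive integer number of seconds and `"cron"` schedules take a cron expression string. Neither assumption comes from MLflow, and `ToyScheduleConfig` is a hypothetical stand-in.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Union


@dataclass(frozen=True)
class ToyScheduleConfig:
    """Stand-in for a schedule config (not MLflow's ScorerScheduleConfig)."""

    schedule_type: str
    frequency: Union[str, int]

    def __post_init__(self) -> None:
        if self.schedule_type not in ("cron", "interval"):
            raise ValueError(f"unknown schedule_type: {self.schedule_type!r}")
        if self.schedule_type == "interval":
            # Assumption: interval frequency is a positive number of seconds.
            if not isinstance(self.frequency, int) or self.frequency <= 0:
                raise ValueError("interval schedules need a positive integer frequency")
        elif not isinstance(self.frequency, str):
            # Assumption: cron frequency is a cron expression string.
            raise ValueError("cron schedules need a string expression")

    def next_interval_run(self, last_run: datetime) -> datetime:
        """Return the next run time for an interval schedule."""
        if self.schedule_type != "interval":
            raise ValueError("next_interval_run only applies to interval schedules")
        return last_run + timedelta(seconds=self.frequency)


hourly = ToyScheduleConfig("interval", 3600)
next_run = hourly.next_interval_run(datetime(2024, 1, 1, 0, 0))
print(next_run.isoformat())
```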

### Dataset Management

Operations for creating, retrieving, and deleting datasets used for LLM training and evaluation.

```python { .api }
def create_dataset(name, data_source=None, description=None, tags=None):
    """
    Create GenAI dataset for LLM evaluation.

    Parameters:
    - name: str - Dataset name
    - data_source: str or DataFrame, optional - Data source location or content
    - description: str, optional - Dataset description
    - tags: dict, optional - Dataset tags

    Returns:
    Dataset object for GenAI applications
    """

def get_dataset(name, version=None):
    """
    Retrieve GenAI dataset by name.

    Parameters:
    - name: str - Dataset name
    - version: str or int, optional - Dataset version

    Returns:
    Dataset object with LLM evaluation data
    """

def delete_dataset(name, version=None):
    """
    Delete GenAI dataset.

    Parameters:
    - name: str - Dataset name to delete
    - version: str or int, optional - Specific version to delete
    """
```

### Interactive Labeling and Review

Tools for human-in-the-loop evaluation and data labeling for LLM applications.

```python { .api }
def create_labeling_session(name, dataset=None, instructions=None, labelers=None, config=None):
    """
    Create interactive labeling session for LLM data.

    Parameters:
    - name: str - Session name
    - dataset: Dataset or str, optional - Dataset to label
    - instructions: str, optional - Labeling instructions
    - labelers: list, optional - List of labeler identifiers
    - config: dict, optional - Labeling session configuration

    Returns:
    LabelingSession object
    """

def get_labeling_session(session_id):
    """
    Retrieve labeling session by ID.

    Parameters:
    - session_id: str - Labeling session identifier

    Returns:
    LabelingSession object
    """

def get_labeling_sessions(experiment_id=None, status=None):
    """
    List labeling sessions with optional filtering.

    Parameters:
    - experiment_id: str, optional - Filter by experiment
    - status: str, optional - Filter by session status

    Returns:
    List of LabelingSession objects
    """

def delete_labeling_session(session_id):
    """
    Delete labeling session.

    Parameters:
    - session_id: str - Session ID to delete
    """

class LabelingSession:
    def __init__(self, name, dataset=None, instructions=None, config=None):
        """
        Interactive labeling session for GenAI data.

        Parameters:
        - name: str - Session name
        - dataset: Dataset, optional - Dataset to label
        - instructions: str, optional - Labeling instructions
        - config: dict, optional - Session configuration
        """

    def add_labels(self, labels):
        """Add labels to session."""

    def get_labels(self):
        """Get current session labels."""

    def export_labels(self, format="json"):
        """Export labels in specified format."""

class Agent:
    def __init__(self, name, model=None, tools=None, instructions=None):
        """
        GenAI agent for automated evaluation and labeling.

        Parameters:
        - name: str - Agent name
        - model: Model or str, optional - LLM model for agent
        - tools: list, optional - Available tools for agent
        - instructions: str, optional - Agent instructions
        """

def get_review_app(session_id):
    """
    Get review application for labeling session.

    Parameters:
    - session_id: str - Labeling session ID

    Returns:
    ReviewApp object for interactive review
    """

class ReviewApp:
    def __init__(self, session):
        """
        Web application for reviewing and labeling LLM outputs.

        Parameters:
        - session: LabelingSession - Associated labeling session
        """

    def launch(self, port=8080, host="localhost"):
        """Launch review application."""

    def stop(self):
        """Stop review application."""
```

### Built-in Evaluators and Judges

Pre-built evaluators and judge models for common LLM evaluation tasks.

```python { .api }
# Built-in judge models for evaluation
judges = {
    "gpt4_as_judge": "GPT-4 based evaluation judge",
    "claude_as_judge": "Claude based evaluation judge",
    "llama_as_judge": "Llama based evaluation judge"
}

# Built-in scorer functions
scorers = {
    "answer_relevance": "Evaluate answer relevance to question",
    "answer_correctness": "Evaluate factual correctness of answers",
    "answer_similarity": "Semantic similarity between answers",
    "faithfulness": "Evaluate faithfulness to source context",
    "context_precision": "Precision of retrieved context",
    "context_recall": "Recall of retrieved context",
    "toxicity": "Detect toxic or harmful content",
    "readability": "Evaluate text readability and clarity"
}

# Dataset utilities
datasets = {
    "common_datasets": "Access to common LLM evaluation datasets",
    "benchmarks": "Standard LLM benchmarks and test sets"
}
```
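
For `context_precision` and `context_recall`, the classical set-based definitions give a useful mental model. The sketch below computes them over document identifiers. This is an assumption about what the names mean, not the implementation behind the built-in scorers, which may rely on an LLM judge instead; the function names and sample identifiers are hypothetical.

```python
from typing import Iterable


def context_precision(retrieved: Iterable[str], relevant: Iterable[str]) -> float:
    """Fraction of retrieved items that are relevant."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    if not retrieved_set:
        return 0.0
    return len(retrieved_set & relevant_set) / len(retrieved_set)


def context_recall(retrieved: Iterable[str], relevant: Iterable[str]) -> float:
    """Fraction of relevant items that were retrieved."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    if not relevant_set:
        return 0.0
    return len(retrieved_set & relevant_set) / len(relevant_set)


retrieved_docs = ["doc1", "doc2", "doc3", "doc4"]
relevant_docs = ["doc2", "doc4", "doc7"]

precision = context_precision(retrieved_docs, relevant_docs)  # 2 of 4 retrieved are relevant
recall = context_recall(retrieved_docs, relevant_docs)        # 2 of 3 relevant were retrieved
print(f"precision={precision:.3f} recall={recall:.3f}")
```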

## Usage Examples

### Basic LLM Evaluation

```python
import mlflow
import mlflow.genai
import pandas as pd

# Prepare evaluation dataset
eval_data = pd.DataFrame({
    "inputs": [
        "What is machine learning?",
        "Explain deep learning",
        "How does AI work?"
    ],
    "targets": [
        "Machine learning is a subset of AI that learns from data",
        "Deep learning uses neural networks with multiple layers",
        "AI works by processing data to make predictions or decisions"
    ]
})

# Evaluate LLM model
with mlflow.start_run():
    results = mlflow.genai.evaluate(
        model="openai:/gpt-4",  # Model URI
        data=eval_data,
        model_type="text",
        evaluators=["default", "answer_relevance", "toxicity"],
        targets="targets"
    )

    # Log evaluation results
    mlflow.log_metrics(results.metrics)

    print("Evaluation Results:")
    for metric_name, score in results.metrics.items():
        print(f"{metric_name}: {score:.3f}")
```

### Custom Scorer Creation

```python
import re

import mlflow.genai
from mlflow.genai import scorer

# Create custom scorer using decorator
@scorer(name="question_detection", greater_is_better=True)
def detect_questions(predictions, targets=None, **kwargs):
    """Custom scorer to detect if text contains questions."""
    scores = []
    for pred in predictions:
        # Count question marks and question words
        question_marks = pred.count('?')
        question_words = len(re.findall(r'\b(what|how|why|when|where|who)\b', pred.lower()))
        score = min(1.0, (question_marks + question_words * 0.5) / 2)
        scores.append(score)
    return scores

# Create scorer using class
class SentimentScorer(mlflow.genai.Scorer):
    def __init__(self):
        super().__init__(
            eval_fn=self._score_sentiment,
            name="sentiment_positivity",
            greater_is_better=True
        )

    def _score_sentiment(self, predictions, **kwargs):
        """Score text sentiment positivity."""
        # Simplified sentiment scoring
        positive_words = ["good", "great", "excellent", "amazing", "wonderful"]
        negative_words = ["bad", "terrible", "awful", "horrible", "worst"]

        scores = []
        for pred in predictions:
            pred_lower = pred.lower()
            pos_count = sum(word in pred_lower for word in positive_words)
            neg_count = sum(word in pred_lower for word in negative_words)

            if pos_count + neg_count == 0:
                score = 0.5  # Neutral
            else:
                score = pos_count / (pos_count + neg_count)

            scores.append(score)
        return scores

# Use custom scorers in evaluation
sentiment_scorer = SentimentScorer()

# eval_data is the DataFrame defined in "Basic LLM Evaluation" above
results = mlflow.genai.evaluate(
    model="openai:/gpt-3.5-turbo",
    data=eval_data,
    custom_metrics=[detect_questions, sentiment_scorer],
    model_type="text"
)

print("Custom metric results:")
print(f"Question detection: {results.metrics['question_detection']:.3f}")
print(f"Sentiment positivity: {results.metrics['sentiment_positivity']:.3f}")
```
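
To see how the `question_detection` formula behaves, the sketch below applies the same arithmetic to individual strings outside MLflow. `question_score` is a hypothetical standalone function that mirrors the body of `detect_questions`; each question mark counts 1, each question word counts 0.5, and the total is halved and capped at 1.0.

```python
import re


def question_score(text: str) -> float:
    """Same arithmetic as detect_questions, applied to a single string."""
    question_marks = text.count("?")
    question_words = len(re.findall(r"\b(what|how|why|when|where|who)\b", text.lower()))
    return min(1.0, (question_marks + question_words * 0.5) / 2)


samples = [
    "It works.",                    # no marks, no question words
    "What is this?",                # 1 mark + 1 word  -> (1 + 0.5) / 2
    "How and why does this work?",  # 1 mark + 2 words -> (1 + 1.0) / 2
    "Who? What? Where?",            # 3 marks + 3 words, capped at 1.0
]
sample_scores = [question_score(s) for s in samples]
print(sample_scores)
```

Because of the cap, any text with two or more question marks already reaches the maximum score, so the metric separates "no question" from "some question" better than it ranks heavily interrogative texts.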

### Prompt Management Workflow

```python
import mlflow.genai

# Register prompt templates
classification_prompt = """
You are an expert classifier. Given the following text, classify it into one of these categories: {categories}

Text: {text}

Classification:
"""

mlflow.genai.register_prompt(
    name="text_classification/v1",
    prompt=classification_prompt,
    description="Multi-class text classification prompt",
    tags={"task": "classification", "version": "1.0"}
)

# Register improved version
improved_prompt = """
You are an expert text classifier with high accuracy. Analyze the following text carefully and classify it into exactly one of these categories: {categories}

Text to classify: "{text}"

Think step by step:
1. What are the key themes in this text?
2. Which category best matches these themes?
3. Why is this the best classification?

Final classification:
"""

mlflow.genai.register_prompt(
    name="text_classification/v2",
    prompt=improved_prompt,
    description="Improved classification prompt with reasoning",
    tags={"task": "classification", "version": "2.0", "reasoning": "true"}
)

# Set alias for best performing version
mlflow.genai.set_prompt_alias(
    name="text_classification",
    alias="champion",
    version="2"
)

# Load and use prompt
prompt = mlflow.genai.load_prompt("text_classification@champion")
formatted_prompt = prompt.format(
    # Join the list so the prompt shows plain text rather than a Python list repr
    categories=", ".join(["positive", "negative", "neutral"]),
    text="I love this product!"
)

print("Formatted prompt:")
print(formatted_prompt)

# Search for prompts
classification_prompts = mlflow.genai.search_prompts(
    name_like="classification*",
    tags={"task": "classification"}
)

print(f"\nFound {len(classification_prompts)} classification prompts")
for p in classification_prompts:
    print(f"- {p.name}: {p.description}")
```
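
How a template renders depends on what is passed for each placeholder. The sketch below uses plain `str.format` and `string.Formatter` from the standard library, on the assumption that `Prompt.format` fills single-brace placeholders the same way; that equivalence is not confirmed here. It shows how to list a template's placeholders and why a list value is usually joined into text first.

```python
import string

template = (
    "Classify the text into one of these categories: {categories}\n"
    "Text: {text}\n"
    "Classification:"
)

# Discover which placeholders the template expects.
placeholders = [field for _, field, _, _ in string.Formatter().parse(template) if field]

categories = ["positive", "negative", "neutral"]

# Passing the list directly embeds its Python repr, brackets and quotes included.
raw = template.format(categories=categories, text="I love this product!")

# Joining first yields plain, human-readable text.
joined = template.format(categories=", ".join(categories), text="I love this product!")

print(placeholders)
print(raw.splitlines()[0])
print(joined.splitlines()[0])
```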

### Prompt Optimization

```python
import mlflow.genai

# Define optimization task
task_description = """
Create a prompt that helps an AI assistant generate engaging
product descriptions for e-commerce items. The descriptions
should be persuasive, informative, and highlight key features.
"""

# Base prompt template
base_prompt = """
Write a product description for: {product_name}

Features: {features}
Price: {price}

Description:
"""

# Optimize prompt automatically
with mlflow.start_run():
    optimization_result = mlflow.genai.optimize_prompt(
        task=task_description,
        prompt_template=base_prompt,
        num_candidates=10,
        max_iterations=5,
        model="openai:/gpt-4",
        evaluator_config={
            "metrics": ["engagement", "clarity", "persuasiveness"]
        }
    )

    # Log optimization results
    mlflow.log_metric("optimization_score", optimization_result.best_score)
    mlflow.log_param("iterations_completed", optimization_result.iterations)

    # Register optimized prompt
    mlflow.genai.register_prompt(
        name="product_description/optimized",
        prompt=optimization_result.best_prompt,
        description="Auto-optimized product description prompt",
        tags={"optimized": "true", "score": str(optimization_result.best_score)}
    )

    print(f"Optimization completed with score: {optimization_result.best_score:.3f}")
    print(f"Best prompt:\n{optimization_result.best_prompt}")
```

### Interactive Labeling Session

```python
import mlflow.genai
import pandas as pd

# Create dataset for labeling
unlabeled_data = pd.DataFrame({
    "text": [
        "The movie was absolutely fantastic!",
        "I didn't like the service at all.",
        "The product works as expected.",
        "This is the worst experience ever.",
        "Pretty good, would recommend."
    ]
})

# Create labeling session
session = mlflow.genai.create_labeling_session(
    name="sentiment_labeling_v1",
    dataset=unlabeled_data,
    instructions="""
    Label each text with sentiment:
    - positive: Text expresses positive sentiment
    - negative: Text expresses negative sentiment
    - neutral: Text expresses neutral sentiment

    Consider the overall emotional tone and opinion expressed.
    """,
    config={
        "labels": ["positive", "negative", "neutral"],
        "allow_multiple": False,
        "require_confidence": True
    }
)

print(f"Created labeling session: {session.session_id}")

# Simulate adding labels (normally done through UI)
labels = [
    {"text_id": 0, "label": "positive", "confidence": 0.95},
    {"text_id": 1, "label": "negative", "confidence": 0.90},
    {"text_id": 2, "label": "neutral", "confidence": 0.80},
    {"text_id": 3, "label": "negative", "confidence": 0.98},
    {"text_id": 4, "label": "positive", "confidence": 0.85}
]

session.add_labels(labels)

# Export labeled data
labeled_dataset = session.export_labels(format="json")
print(f"Exported {len(labeled_dataset)} labeled examples")

# Create review app for quality control
review_app = mlflow.genai.get_review_app(session.session_id)
# review_app.launch(port=8080)  # Launches web interface
```
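
Once labels have been collected, a common next step is to join them back to the source texts and keep only confident ones. The sketch below does this with plain Python on the same sample data. It assumes `text_id` is the positional index of the text, which the example above implies but does not state; `join_labels` is a hypothetical helper.

```python
from collections import Counter
from typing import Any, Dict, List

texts = [
    "The movie was absolutely fantastic!",
    "I didn't like the service at all.",
    "The product works as expected.",
    "This is the worst experience ever.",
    "Pretty good, would recommend.",
]

labels = [
    {"text_id": 0, "label": "positive", "confidence": 0.95},
    {"text_id": 1, "label": "negative", "confidence": 0.90},
    {"text_id": 2, "label": "neutral", "confidence": 0.80},
    {"text_id": 3, "label": "negative", "confidence": 0.98},
    {"text_id": 4, "label": "positive", "confidence": 0.85},
]


def join_labels(texts: List[str], labels: List[Dict[str, Any]],
                min_confidence: float = 0.0) -> List[Dict[str, Any]]:
    """Attach each label to its text, dropping labels below min_confidence."""
    rows = []
    for entry in labels:
        if entry["confidence"] < min_confidence:
            continue
        # Assumption: text_id is the position of the text in the list.
        rows.append({
            "text": texts[entry["text_id"]],
            "label": entry["label"],
            "confidence": entry["confidence"],
        })
    return rows


high_confidence = join_labels(texts, labels, min_confidence=0.9)
label_counts = Counter(row["label"] for row in high_confidence)
print(len(high_confidence), dict(label_counts))
```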

### GenAI Agent Implementation

```python
import mlflow.genai

# Create GenAI agent for automated evaluation
evaluation_agent = mlflow.genai.Agent(
    name="evaluation_agent",
    model="openai:/gpt-4",
    tools=["web_search", "calculator", "code_execution"],
    instructions="""
    You are an expert evaluator for AI-generated content.
    Analyze responses for accuracy, relevance, and quality.
    Use available tools to fact-check when needed.
    Provide detailed feedback and numerical scores.
    """
)

# Agent evaluates model outputs
test_outputs = [
    "Paris is the capital of France and has a population of about 2.1 million.",
    "The square root of 144 is 12.",
    "Python is a programming language created in 1991 by Guido van Rossum."
]

evaluation_results = []
for output in test_outputs:
    # Agent evaluates each output
    result = evaluation_agent.evaluate(
        text=output,
        criteria=["factual_accuracy", "completeness", "clarity"]
    )
    evaluation_results.append(result)

# Create automated labeling agent
labeling_agent = mlflow.genai.Agent(
    name="auto_labeler",
    model="anthropic:/claude-3",
    instructions="""
    You are an expert data labeler. Label text data according to
    the provided schema and guidelines. Be consistent and accurate.
    """
)

# Use agent for automated labeling
# unlabeled_data is the DataFrame defined in "Interactive Labeling Session" above
auto_labels = labeling_agent.label_batch(
    texts=unlabeled_data["text"].tolist(),
    schema={"sentiment": ["positive", "negative", "neutral"]},
    guidelines="Focus on overall emotional tone and opinion"
)

print("Automated labeling results:")
for text, label in zip(unlabeled_data["text"], auto_labels):
    print(f"'{text}' -> {label}")
```

### Comprehensive LLM Evaluation Pipeline

```python
import mlflow
import mlflow.genai
import pandas as pd

def create_llm_evaluation_pipeline():
    """Comprehensive LLM evaluation workflow."""

    # Set up experiment
    mlflow.set_experiment("llm_evaluation_pipeline")

    with mlflow.start_run():
        # 1. Prepare evaluation dataset
        eval_data = pd.DataFrame({
            "questions": [
                "What is artificial intelligence?",
                "How do neural networks work?",
                "What are the benefits of machine learning?",
                "Explain natural language processing",
                "What is deep learning?"
            ],
            "ground_truth": [
                "AI is the simulation of human intelligence in machines",
                "Neural networks are computing systems inspired by biological neural networks",
                "ML provides automation, insights, and improved decision-making",
                "NLP enables computers to understand and process human language",
                "Deep learning is a subset of ML using artificial neural networks"
            ]
        })

        # 2. Create custom evaluators
        @mlflow.genai.scorer(name="technical_accuracy")
        def technical_accuracy(predictions, targets, **kwargs):
            # Simplified technical accuracy: Jaccard overlap of the word sets
            scores = []
            for pred, target in zip(predictions, targets):
                pred_words = set(pred.lower().split())
                target_words = set(target.lower().split())
                union = pred_words | target_words
                # Guard against division by zero when both texts are empty
                overlap = len(pred_words & target_words) / len(union) if union else 0.0
                scores.append(overlap)
            return scores

        # 3. Evaluate multiple models
        models_to_evaluate = [
            "openai:/gpt-3.5-turbo",
            "openai:/gpt-4",
            "anthropic:/claude-3"
        ]

        comparison_results = {}

        for model_name in models_to_evaluate:
            print(f"\nEvaluating {model_name}...")

            # Evaluate model
            results = mlflow.genai.evaluate(
                model=model_name,
                data=eval_data,
                targets="ground_truth",
                model_type="text",
                evaluators=["default", "answer_relevance", "faithfulness"],
                custom_metrics=[technical_accuracy],
                evaluator_config={
                    "answer_relevance": {"threshold": 0.7},
                    "faithfulness": {"threshold": 0.8}
                }
            )

            comparison_results[model_name] = results.metrics

            # Log individual model results
            for metric, value in results.metrics.items():
                mlflow.log_metric(f"{model_name}_{metric}", value)

        # 4. Create comparison report
        print("\n=== Model Comparison Results ===")
        for metric in ["answer_relevance", "faithfulness", "technical_accuracy"]:
            print(f"\n{metric}:")
            for model, metrics in comparison_results.items():
                print(f"  {model}: {metrics.get(metric, 0):.3f}")

        # 5. Select the best performing model by answer relevance
        best_model = max(
            comparison_results.items(),
            key=lambda x: x[1].get("answer_relevance", 0)
        )[0]

        mlflow.log_param("best_model", best_model)
        mlflow.log_metric("best_answer_relevance",
                          comparison_results[best_model].get("answer_relevance", 0))

        # 6. Save evaluation artifacts
        comparison_df = pd.DataFrame(comparison_results).T
        comparison_df.to_csv("model_comparison.csv")
        mlflow.log_artifact("model_comparison.csv")

        print(f"\nBest performing model: {best_model}")

        return comparison_results

# Run evaluation pipeline
results = create_llm_evaluation_pipeline()
```
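
Two pieces of the pipeline are plain Python and can be checked in isolation: the word-overlap score used by `technical_accuracy` (a Jaccard index over word sets) and the selection of the best model with `max`. The sketch below reproduces both. `jaccard_overlap` is a hypothetical name, and the metric values in `toy_results` are made-up numbers for illustration only, not measured results.

```python
from typing import Dict


def jaccard_overlap(prediction: str, target: str) -> float:
    """Jaccard index of the lowercase word sets: |A & B| / |A | B|."""
    pred_words = set(prediction.lower().split())
    target_words = set(target.lower().split())
    union = pred_words | target_words
    if not union:
        return 0.0  # both texts empty
    return len(pred_words & target_words) / len(union)


# 5 shared words out of 6 distinct words overall.
overlap = jaccard_overlap(
    "Deep learning uses neural networks",
    "Deep learning uses layered neural networks",
)

# Made-up metric values, used only to show the selection step.
toy_results: Dict[str, Dict[str, float]] = {
    "model_a": {"answer_relevance": 0.71, "faithfulness": 0.90},
    "model_b": {"answer_relevance": 0.84, "faithfulness": 0.78},
    "model_c": {"faithfulness": 0.88},  # missing metric falls back to 0
}

best_model = max(
    toy_results.items(),
    key=lambda item: item[1].get("answer_relevance", 0),
)[0]

print(f"overlap={overlap:.3f} best_model={best_model}")
```

Using `.get(metric, 0)` in the selection key means a model with a missing metric can never win, which is usually the safer default than raising midway through a comparison.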

## Types

```python { .api }
from typing import Dict, List, Any, Optional, Union, Callable
from mlflow.entities import Dataset
import pandas as pd

# Core evaluation types
class EvaluationResult:
    metrics: Dict[str, float]
    artifacts: Dict[str, str]
    tables: Dict[str, pd.DataFrame]

def to_predict_fn(
    model_uri: str,
    inference_params: Optional[Dict[str, Any]] = None
) -> Callable[[pd.DataFrame], List[str]]: ...

# Prompt management types
class Prompt:
    name: str
    version: str
    template: str
    model_config: Optional[Dict[str, Any]]
    description: Optional[str]
    tags: Dict[str, str]

    def format(self, **kwargs) -> str: ...

class PromptTemplate:
    template: str
    input_variables: List[str]

    def format(self, **kwargs) -> str: ...

# Scorer types
class Scorer:
    name: str
    version: Optional[str]
    greater_is_better: bool
    long_name: Optional[str]
    model_type: Optional[str]

    def score(self, predictions: List[str], targets: Optional[List[str]] = None, **kwargs) -> List[float]: ...

def scorer(
    name: Optional[str] = None,
    version: Optional[str] = None,
    greater_is_better: bool = True,
    long_name: Optional[str] = None,
    model_type: Optional[str] = None
) -> Callable: ...

# Optimization types
class OptimizationResult:
    best_prompt: str
    best_score: float
    iterations: int
    candidate_prompts: List[str]
    scores: List[float]

# Scheduling types
class ScorerScheduleConfig:
    schedule_type: str
    frequency: Union[str, int]
    start_time: Optional[str]
    end_time: Optional[str]
    timezone: Optional[str]

# Labeling types
class LabelingSession:
    session_id: str
    name: str
    dataset: Optional[Dataset]
    instructions: Optional[str]
    config: Dict[str, Any]
    status: str

    def add_labels(self, labels: List[Dict[str, Any]]) -> None: ...
    def get_labels(self) -> List[Dict[str, Any]]: ...
    def export_labels(self, format: str = "json") -> Union[List[Dict], pd.DataFrame]: ...

class Agent:
    name: str
    model: Optional[str]
    tools: List[str]
    instructions: Optional[str]

    def evaluate(self, text: str, criteria: List[str]) -> Dict[str, Any]: ...
    def label_batch(self, texts: List[str], schema: Dict[str, Any], guidelines: str) -> List[Dict[str, Any]]: ...

class ReviewApp:
    session: LabelingSession

    def launch(self, port: int = 8080, host: str = "localhost") -> None: ...
    def stop(self) -> None: ...

# Dataset types
class GenAIDataset(Dataset):
    name: str
    version: Optional[str]
    description: Optional[str]
    tags: Dict[str, str]

# Built-in resources
judges: Dict[str, str]
scorers: Dict[str, str]
datasets: Dict[str, str]
```
