# GenAI and LLM Integration

MLflow's GenAI capabilities provide comprehensive support for large language models, prompt engineering, evaluation, and LLM application development. The system includes specialized tools for prompt management, LLM evaluation, automated scoring, and interactive labeling workflows for GenAI applications.

## Capabilities

### Model Evaluation and Testing

Comprehensive evaluation framework specifically designed for LLM and GenAI applications with built-in metrics and custom evaluators.

```python { .api }
def evaluate(model=None, data=None, model_type="text", evaluators=None, targets=None, evaluator_config=None, custom_metrics=None, extra_metrics=None, baseline_model=None, inference_params=None, model_config=None):
    """
    Evaluate GenAI models with specialized LLM metrics.

    Parameters:
    - model: Model, callable, or URI - LLM model to evaluate
    - data: DataFrame, Dataset, or URI - Evaluation dataset with inputs
    - model_type: str - Type of model ("text", "chat", "question-answering")
    - evaluators: list, optional - List of evaluator names or objects
    - targets: str or array, optional - Ground truth targets for evaluation
    - evaluator_config: dict, optional - Configuration for evaluators
    - custom_metrics: list, optional - Custom metric functions
    - extra_metrics: list, optional - Additional built-in metrics
    - baseline_model: Model or URI, optional - Baseline model for comparison
    - inference_params: dict, optional - Model inference parameters
    - model_config: dict, optional - Model configuration parameters

    Returns:
    EvaluationResult object with LLM-specific metrics and artifacts
    """

def to_predict_fn(model_uri, inference_params=None):
    """
    Convert MLflow model to prediction function for evaluation.

    Parameters:
    - model_uri: str - URI pointing to MLflow model
    - inference_params: dict, optional - Parameters for model inference

    Returns:
    Callable prediction function compatible with evaluation
    """
```

### Prompt Management

Comprehensive prompt engineering and versioning system for managing prompts across LLM applications.

```python { .api }
def register_prompt(name, prompt, model_config=None, description=None, tags=None):
    """
    Register a prompt template in MLflow.

    Parameters:
    - name: str - Unique prompt name (format: "name/version")
    - prompt: str or PromptTemplate - Prompt content or template
    - model_config: dict, optional - Associated model configuration
    - description: str, optional - Prompt description
    - tags: dict, optional - Prompt tags for organization

    Returns:
    Prompt object representing registered prompt
    """

def load_prompt(name):
    """
    Load registered prompt by name.

    Parameters:
    - name: str - Prompt name with optional version ("name" or "name/version")

    Returns:
    Prompt object with template and configuration
    """

def search_prompts(name_like=None, tags=None, max_results=None):
    """
    Search registered prompts by criteria.

    Parameters:
    - name_like: str, optional - Pattern to match prompt names
    - tags: dict, optional - Tags to filter prompts
    - max_results: int, optional - Maximum number of results

    Returns:
    List of Prompt objects matching criteria
    """

def set_prompt_alias(name, alias, version):
    """
    Set alias for prompt version.

    Parameters:
    - name: str - Prompt name
    - alias: str - Alias name (e.g., "champion", "latest")
    - version: str or int - Prompt version number
    """

def delete_prompt_alias(name, alias):
    """
    Delete prompt alias.

    Parameters:
    - name: str - Prompt name
    - alias: str - Alias to delete
    """
```

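The signatures above name prompts as `"name"` or `"name/version"`, and the workflow example later in this document also loads one as `"text_classification@champion"`. As a rough illustration of how those three reference styles could be told apart, here is a minimal sketch. `PromptRef` and `parse_prompt_ref` are hypothetical names introduced for this example only; they are not part of MLflow, and MLflow's own resolution logic may differ.

```python
from typing import NamedTuple, Optional


class PromptRef(NamedTuple):
    """Parsed pieces of a prompt reference (hypothetical helper type)."""
    name: str
    version: Optional[str] = None
    alias: Optional[str] = None


def parse_prompt_ref(ref: str) -> PromptRef:
    """Split 'name', 'name/version', or 'name@alias' into its parts.

    Assumption: '@' marks an alias and the last '/' marks a version.
    """
    if "@" in ref:
        name, alias = ref.split("@", 1)
        return PromptRef(name=name, alias=alias)
    if "/" in ref:
        name, version = ref.rsplit("/", 1)
        return PromptRef(name=name, version=version)
    return PromptRef(name=ref)


if __name__ == "__main__":
    for ref in ("text_classification",
                "text_classification/v2",
                "text_classification@champion"):
        print(ref, "->", parse_prompt_ref(ref))
```

Keeping the version and the alias in separate fields makes it explicit that an alias is a movable pointer to a version rather than a version itself.
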
### Prompt Optimization

Automated prompt optimization and improvement using various optimization strategies.

```python { .api }
def optimize_prompt(task, num_candidates=20, max_iterations=10, model=None, prompt_template=None, model_config=None, evaluator_config=None):
    """
    Automatically optimize prompts for better performance.

    Parameters:
    - task: str - Description of the task for prompt optimization
    - num_candidates: int - Number of prompt candidates to generate
    - max_iterations: int - Maximum optimization iterations
    - model: Model or URI, optional - Model for prompt testing
    - prompt_template: str, optional - Base prompt template
    - model_config: dict, optional - Model configuration
    - evaluator_config: dict, optional - Evaluation configuration

    Returns:
    OptimizationResult with best prompt and performance metrics
    """
```

### Custom Scorers and Metrics

Framework for creating custom scoring functions and metrics for LLM evaluation.

```python { .api }
def scorer(name=None, version=None, greater_is_better=True, long_name=None, model_type=None):
    """
    Decorator for creating custom LLM scorer functions.

    Parameters:
    - name: str, optional - Scorer name (inferred if not provided)
    - version: str, optional - Scorer version
    - greater_is_better: bool - Whether higher scores are better
    - long_name: str, optional - Human-readable scorer name
    - model_type: str, optional - Compatible model types

    Returns:
    Scorer object wrapping the function
    """

class Scorer:
    def __init__(self, eval_fn, name=None, version=None, greater_is_better=True, long_name=None, model_type=None):
        """
        Create custom LLM scorer.

        Parameters:
        - eval_fn: callable - Function that computes score
        - name: str, optional - Scorer name
        - version: str, optional - Scorer version
        - greater_is_better: bool - Whether higher scores are better
        - long_name: str, optional - Human-readable name
        - model_type: str, optional - Compatible model types
        """

    def score(self, predictions, targets=None, **kwargs):
        """
        Compute scores for predictions.

        Parameters:
        - predictions: list - Model predictions to score
        - targets: list, optional - Ground truth targets
        - kwargs: Additional scoring arguments

        Returns:
        Scores or metrics dictionary
        """
```

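To make the relationship between the `scorer` decorator and the `Scorer` class concrete, the following self-contained sketch shows one way a decorator can wrap a plain function in a scorer-like object and infer its name. `SimpleScorer` and `simple_scorer` are stand-ins written for this example; they only mirror the parameters documented above and should not be read as MLflow's implementation.

```python
from typing import Callable, List, Optional


class SimpleScorer:
    """Stand-in for a scorer object; not MLflow's Scorer class."""

    def __init__(self, eval_fn: Callable, name: Optional[str] = None,
                 greater_is_better: bool = True):
        self.eval_fn = eval_fn
        # Fall back to the wrapped function's name when none is given.
        self.name = name or eval_fn.__name__
        self.greater_is_better = greater_is_better

    def score(self, predictions: List[str], targets=None, **kwargs):
        """Delegate scoring to the wrapped function."""
        return self.eval_fn(predictions, targets=targets, **kwargs)


def simple_scorer(name: Optional[str] = None, greater_is_better: bool = True):
    """Decorator factory that turns a function into a SimpleScorer."""
    def wrap(fn: Callable) -> SimpleScorer:
        return SimpleScorer(fn, name=name, greater_is_better=greater_is_better)
    return wrap


@simple_scorer(greater_is_better=False)
def response_length(predictions, targets=None, **kwargs):
    """Score each prediction by its word count (shorter is better)."""
    return [len(p.split()) for p in predictions]


if __name__ == "__main__":
    print(response_length.name)
    print(response_length.score(["a short answer", "ok"]))
```

The decorator form suits stateless functions, while subclassing is the more natural fit when a scorer needs its own configuration or helper methods.
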
### Scheduled Scoring

Configuration and management of automated scoring pipelines for continuous evaluation.

```python { .api }
class ScorerScheduleConfig:
    def __init__(self, schedule_type, frequency, start_time=None, end_time=None, timezone=None):
        """
        Configuration for scheduled scoring jobs.

        Parameters:
        - schedule_type: str - Type of schedule ("cron", "interval")
        - frequency: str or int - Schedule frequency specification
        - start_time: str, optional - Start time for scheduled jobs
        - end_time: str, optional - End time for scheduled jobs
        - timezone: str, optional - Timezone for schedule
        """
```

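The reference does not spell out how `frequency`, `start_time`, and `end_time` interact. One plausible reading of an `"interval"` schedule is sketched below. This is an illustration under stated assumptions, not MLflow's scheduler: `ScheduleSketch` and `next_run` are hypothetical names, `frequency` is assumed to be a number of seconds, and times are `datetime` objects rather than the strings the signature above documents.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Union


@dataclass
class ScheduleSketch:
    """Hypothetical mirror of the documented schedule fields."""
    schedule_type: str
    frequency: Union[str, int]           # assumed: seconds for "interval"
    start_time: Optional[datetime] = None
    end_time: Optional[datetime] = None


def next_run(cfg: ScheduleSketch, now: datetime) -> Optional[datetime]:
    """Return the first run strictly after `now`, or None once past end_time."""
    if cfg.schedule_type != "interval":
        raise NotImplementedError("only 'interval' is sketched here")
    step = timedelta(seconds=int(cfg.frequency))
    start = cfg.start_time or now
    if now < start:
        candidate = start
    else:
        # Number of whole intervals already elapsed since the start.
        elapsed = (now - start) // step
        candidate = start + (elapsed + 1) * step
    if cfg.end_time is not None and candidate > cfg.end_time:
        return None
    return candidate


if __name__ == "__main__":
    cfg = ScheduleSketch("interval", 3600, start_time=datetime(2024, 1, 1))
    print(next_run(cfg, datetime(2024, 1, 1, 0, 30)))
```

A `"cron"` schedule would need a cron-expression parser, which is deliberately left out of this sketch.
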
### Dataset Management

Specialized dataset operations for LLM training and evaluation datasets.

```python { .api }
def create_dataset(name, data_source=None, description=None, tags=None):
    """
    Create GenAI dataset for LLM evaluation.

    Parameters:
    - name: str - Dataset name
    - data_source: str or DataFrame, optional - Data source location or content
    - description: str, optional - Dataset description
    - tags: dict, optional - Dataset tags

    Returns:
    Dataset object for GenAI applications
    """

def get_dataset(name, version=None):
    """
    Retrieve GenAI dataset by name.

    Parameters:
    - name: str - Dataset name
    - version: str or int, optional - Dataset version

    Returns:
    Dataset object with LLM evaluation data
    """

def delete_dataset(name, version=None):
    """
    Delete GenAI dataset.

    Parameters:
    - name: str - Dataset name to delete
    - version: str or int, optional - Specific version to delete
    """
```

### Interactive Labeling and Review

Tools for human-in-the-loop evaluation and data labeling for LLM applications.

```python { .api }
def create_labeling_session(name, dataset=None, instructions=None, labelers=None, config=None):
    """
    Create interactive labeling session for LLM data.

    Parameters:
    - name: str - Session name
    - dataset: Dataset or str, optional - Dataset to label
    - instructions: str, optional - Labeling instructions
    - labelers: list, optional - List of labeler identifiers
    - config: dict, optional - Labeling session configuration

    Returns:
    LabelingSession object
    """

def get_labeling_session(session_id):
    """
    Retrieve labeling session by ID.

    Parameters:
    - session_id: str - Labeling session identifier

    Returns:
    LabelingSession object
    """

def get_labeling_sessions(experiment_id=None, status=None):
    """
    List labeling sessions with optional filtering.

    Parameters:
    - experiment_id: str, optional - Filter by experiment
    - status: str, optional - Filter by session status

    Returns:
    List of LabelingSession objects
    """

def delete_labeling_session(session_id):
    """
    Delete labeling session.

    Parameters:
    - session_id: str - Session ID to delete
    """

class LabelingSession:
    def __init__(self, name, dataset=None, instructions=None, config=None):
        """
        Interactive labeling session for GenAI data.

        Parameters:
        - name: str - Session name
        - dataset: Dataset, optional - Dataset to label
        - instructions: str, optional - Labeling instructions
        - config: dict, optional - Session configuration
        """

    def add_labels(self, labels):
        """Add labels to session."""

    def get_labels(self):
        """Get current session labels."""

    def export_labels(self, format="json"):
        """Export labels in specified format."""

class Agent:
    def __init__(self, name, model=None, tools=None, instructions=None):
        """
        GenAI agent for automated evaluation and labeling.

        Parameters:
        - name: str - Agent name
        - model: Model or str, optional - LLM model for agent
        - tools: list, optional - Available tools for agent
        - instructions: str, optional - Agent instructions
        """

def get_review_app(session_id):
    """
    Get review application for labeling session.

    Parameters:
    - session_id: str - Labeling session ID

    Returns:
    ReviewApp object for interactive review
    """

class ReviewApp:
    def __init__(self, session):
        """
        Web application for reviewing and labeling LLM outputs.

        Parameters:
        - session: LabelingSession - Associated labeling session
        """

    def launch(self, port=8080, host="localhost"):
        """Launch review application."""

    def stop(self):
        """Stop review application."""
```

### Built-in Evaluators and Judges

Pre-built evaluators and judge models for common LLM evaluation tasks.

```python { .api }
# Built-in judge models for evaluation
judges = {
    "gpt4_as_judge": "GPT-4 based evaluation judge",
    "claude_as_judge": "Claude based evaluation judge",
    "llama_as_judge": "Llama based evaluation judge"
}

# Built-in scorer functions
scorers = {
    "answer_relevance": "Evaluate answer relevance to question",
    "answer_correctness": "Evaluate factual correctness of answers",
    "answer_similarity": "Semantic similarity between answers",
    "faithfulness": "Evaluate faithfulness to source context",
    "context_precision": "Precision of retrieved context",
    "context_recall": "Recall of retrieved context",
    "toxicity": "Detect toxic or harmful content",
    "readability": "Evaluate text readability and clarity"
}

# Dataset utilities
datasets = {
    "common_datasets": "Access to common LLM evaluation datasets",
    "benchmarks": "Standard LLM benchmarks and test sets"
}
```

## Usage Examples

### Basic LLM Evaluation

```python
import mlflow
import mlflow.genai
import pandas as pd

# Prepare evaluation dataset
eval_data = pd.DataFrame({
    "inputs": [
        "What is machine learning?",
        "Explain deep learning",
        "How does AI work?"
    ],
    "targets": [
        "Machine learning is a subset of AI that learns from data",
        "Deep learning uses neural networks with multiple layers",
        "AI works by processing data to make predictions or decisions"
    ]
})

# Evaluate LLM model
with mlflow.start_run():
    results = mlflow.genai.evaluate(
        model="openai:/gpt-4",  # Model URI
        data=eval_data,
        model_type="text",
        evaluators=["default", "answer_relevance", "toxicity"],
        targets="targets"
    )

    # Log evaluation results
    mlflow.log_metrics(results.metrics)

    print("Evaluation Results:")
    for metric_name, score in results.metrics.items():
        print(f"{metric_name}: {score:.3f}")
```

### Custom Scorer Creation

```python
import mlflow.genai
from mlflow.genai import scorer
import re

# Create custom scorer using decorator
@scorer(name="question_detection", greater_is_better=True)
def detect_questions(predictions, targets=None, **kwargs):
    """Custom scorer to detect if text contains questions."""
    scores = []
    for pred in predictions:
        # Count question marks and question words
        question_marks = pred.count('?')
        question_words = len(re.findall(r'\b(what|how|why|when|where|who)\b', pred.lower()))
        score = min(1.0, (question_marks + question_words * 0.5) / 2)
        scores.append(score)
    return scores

# Create scorer using class
class SentimentScorer(mlflow.genai.Scorer):
    def __init__(self):
        super().__init__(
            eval_fn=self._score_sentiment,
            name="sentiment_positivity",
            greater_is_better=True
        )

    def _score_sentiment(self, predictions, **kwargs):
        """Score text sentiment positivity."""
        # Simplified sentiment scoring
        positive_words = ["good", "great", "excellent", "amazing", "wonderful"]
        negative_words = ["bad", "terrible", "awful", "horrible", "worst"]

        scores = []
        for pred in predictions:
            pred_lower = pred.lower()
            pos_count = sum(word in pred_lower for word in positive_words)
            neg_count = sum(word in pred_lower for word in negative_words)

            if pos_count + neg_count == 0:
                score = 0.5  # Neutral
            else:
                score = pos_count / (pos_count + neg_count)

            scores.append(score)
        return scores

# Use custom scorers in evaluation
sentiment_scorer = SentimentScorer()

# eval_data is the DataFrame built in the Basic LLM Evaluation example
results = mlflow.genai.evaluate(
    model="openai:/gpt-3.5-turbo",
    data=eval_data,
    custom_metrics=[detect_questions, sentiment_scorer],
    model_type="text"
)

print("Custom metric results:")
print(f"Question detection: {results.metrics['question_detection']:.3f}")
print(f"Sentiment positivity: {results.metrics['sentiment_positivity']:.3f}")
```

### Prompt Management Workflow

```python
import mlflow.genai

# Register prompt templates
classification_prompt = """
You are an expert classifier. Given the following text, classify it into one of these categories: {categories}

Text: {text}

Classification:
"""

mlflow.genai.register_prompt(
    name="text_classification/v1",
    prompt=classification_prompt,
    description="Multi-class text classification prompt",
    tags={"task": "classification", "version": "1.0"}
)

# Register improved version
improved_prompt = """
You are an expert text classifier with high accuracy. Analyze the following text carefully and classify it into exactly one of these categories: {categories}

Text to classify: "{text}"

Think step by step:
1. What are the key themes in this text?
2. Which category best matches these themes?
3. Why is this the best classification?

Final classification:
"""

mlflow.genai.register_prompt(
    name="text_classification/v2",
    prompt=improved_prompt,
    description="Improved classification prompt with reasoning",
    tags={"task": "classification", "version": "2.0", "reasoning": "true"}
)

# Set alias for best performing version
mlflow.genai.set_prompt_alias(
    name="text_classification",
    alias="champion",
    version="2"
)

# Load and use prompt
prompt = mlflow.genai.load_prompt("text_classification@champion")
formatted_prompt = prompt.format(
    # Join the categories so they render as plain text rather than a Python list repr
    categories=", ".join(["positive", "negative", "neutral"]),
    text="I love this product!"
)

print("Formatted prompt:")
print(formatted_prompt)

# Search for prompts
classification_prompts = mlflow.genai.search_prompts(
    name_like="classification*",
    tags={"task": "classification"}
)

print(f"\nFound {len(classification_prompts)} classification prompts")
for p in classification_prompts:
    print(f"- {p.name}: {p.description}")
```

### Prompt Optimization

```python
import mlflow.genai

# Define optimization task
task_description = """
Create a prompt that helps an AI assistant generate engaging
product descriptions for e-commerce items. The descriptions
should be persuasive, informative, and highlight key features.
"""

# Base prompt template
base_prompt = """
Write a product description for: {product_name}

Features: {features}
Price: {price}

Description:
"""

# Optimize prompt automatically
with mlflow.start_run():
    optimization_result = mlflow.genai.optimize_prompt(
        task=task_description,
        prompt_template=base_prompt,
        num_candidates=10,
        max_iterations=5,
        model="openai:/gpt-4",
        evaluator_config={
            "metrics": ["engagement", "clarity", "persuasiveness"]
        }
    )

    # Log optimization results
    mlflow.log_metric("optimization_score", optimization_result.best_score)
    mlflow.log_param("iterations_completed", optimization_result.iterations)

    # Register optimized prompt
    mlflow.genai.register_prompt(
        name="product_description/optimized",
        prompt=optimization_result.best_prompt,
        description="Auto-optimized product description prompt",
        tags={"optimized": "true", "score": str(optimization_result.best_score)}
    )

    print(f"Optimization completed with score: {optimization_result.best_score:.3f}")
    print(f"Best prompt:\n{optimization_result.best_prompt}")
```

### Interactive Labeling Session

```python
import mlflow.genai
import pandas as pd

# Create dataset for labeling
unlabeled_data = pd.DataFrame({
    "text": [
        "The movie was absolutely fantastic!",
        "I didn't like the service at all.",
        "The product works as expected.",
        "This is the worst experience ever.",
        "Pretty good, would recommend."
    ]
})

# Create labeling session
session = mlflow.genai.create_labeling_session(
    name="sentiment_labeling_v1",
    dataset=unlabeled_data,
    instructions="""
    Label each text with sentiment:
    - positive: Text expresses positive sentiment
    - negative: Text expresses negative sentiment
    - neutral: Text expresses neutral sentiment

    Consider the overall emotional tone and opinion expressed.
    """,
    config={
        "labels": ["positive", "negative", "neutral"],
        "allow_multiple": False,
        "require_confidence": True
    }
)

print(f"Created labeling session: {session.session_id}")

# Simulate adding labels (normally done through UI)
labels = [
    {"text_id": 0, "label": "positive", "confidence": 0.95},
    {"text_id": 1, "label": "negative", "confidence": 0.90},
    {"text_id": 2, "label": "neutral", "confidence": 0.80},
    {"text_id": 3, "label": "negative", "confidence": 0.98},
    {"text_id": 4, "label": "positive", "confidence": 0.85}
]

session.add_labels(labels)

# Export labeled data
labeled_dataset = session.export_labels(format="json")
print(f"Exported {len(labeled_dataset)} labeled examples")

# Create review app for quality control
review_app = mlflow.genai.get_review_app(session.session_id)
# review_app.launch(port=8080)  # Launches web interface
```

### GenAI Agent Implementation

```python
import mlflow.genai

# Create GenAI agent for automated evaluation
evaluation_agent = mlflow.genai.Agent(
    name="evaluation_agent",
    model="openai:/gpt-4",
    tools=["web_search", "calculator", "code_execution"],
    instructions="""
    You are an expert evaluator for AI-generated content.
    Analyze responses for accuracy, relevance, and quality.
    Use available tools to fact-check when needed.
    Provide detailed feedback and numerical scores.
    """
)

# Agent evaluates model outputs
test_outputs = [
    "Paris is the capital of France and has a population of about 2.1 million.",
    "The square root of 144 is 12.",
    "Python is a programming language created in 1991 by Guido van Rossum."
]

evaluation_results = []
for output in test_outputs:
    # Agent evaluates each output
    result = evaluation_agent.evaluate(
        text=output,
        criteria=["factual_accuracy", "completeness", "clarity"]
    )
    evaluation_results.append(result)

# Create automated labeling agent
labeling_agent = mlflow.genai.Agent(
    name="auto_labeler",
    model="anthropic:/claude-3",
    instructions="""
    You are an expert data labeler. Label text data according to
    the provided schema and guidelines. Be consistent and accurate.
    """
)

# Use agent for automated labeling
# unlabeled_data is the DataFrame built in the Interactive Labeling Session example
auto_labels = labeling_agent.label_batch(
    texts=unlabeled_data["text"].tolist(),
    schema={"sentiment": ["positive", "negative", "neutral"]},
    guidelines="Focus on overall emotional tone and opinion"
)

print("Automated labeling results:")
for text, label in zip(unlabeled_data["text"], auto_labels):
    print(f"'{text}' -> {label}")
```

### Comprehensive LLM Evaluation Pipeline

```python
import mlflow
import mlflow.genai
import pandas as pd

def create_llm_evaluation_pipeline():
    """Comprehensive LLM evaluation workflow."""

    # Set up experiment
    mlflow.set_experiment("llm_evaluation_pipeline")

    with mlflow.start_run():
        # 1. Prepare evaluation dataset
        eval_data = pd.DataFrame({
            "questions": [
                "What is artificial intelligence?",
                "How do neural networks work?",
                "What are the benefits of machine learning?",
                "Explain natural language processing",
                "What is deep learning?"
            ],
            "ground_truth": [
                "AI is the simulation of human intelligence in machines",
                "Neural networks are computing systems inspired by biological neural networks",
                "ML provides automation, insights, and improved decision-making",
                "NLP enables computers to understand and process human language",
                "Deep learning is a subset of ML using artificial neural networks"
            ]
        })

        # 2. Create custom evaluators
        @mlflow.genai.scorer(name="technical_accuracy")
        def technical_accuracy(predictions, targets, **kwargs):
            # Simplified technical accuracy scoring
            scores = []
            for pred, target in zip(predictions, targets):
                # Word-level overlap (intersection over union) between prediction and target
                pred_words = set(pred.lower().split())
                target_words = set(target.lower().split())
                overlap = len(pred_words & target_words) / len(target_words | pred_words)
                scores.append(overlap)
            return scores

        # 3. Evaluate multiple models
        models_to_evaluate = [
            "openai:/gpt-3.5-turbo",
            "openai:/gpt-4",
            "anthropic:/claude-3"
        ]

        comparison_results = {}

        for model_name in models_to_evaluate:
            print(f"\nEvaluating {model_name}...")

            # Evaluate model
            results = mlflow.genai.evaluate(
                model=model_name,
                data=eval_data,
                targets="ground_truth",
                model_type="text",
                evaluators=["default", "answer_relevance", "faithfulness"],
                custom_metrics=[technical_accuracy],
                evaluator_config={
                    "answer_relevance": {"threshold": 0.7},
                    "faithfulness": {"threshold": 0.8}
                }
            )

            comparison_results[model_name] = results.metrics

            # Log individual model results
            for metric, value in results.metrics.items():
                mlflow.log_metric(f"{model_name}_{metric}", value)

        # 4. Create comparison report
        print("\n=== Model Comparison Results ===")
        for metric in ["answer_relevance", "faithfulness", "technical_accuracy"]:
            print(f"\n{metric}:")
            for model, metrics in comparison_results.items():
                print(f"  {model}: {metrics.get(metric, 0):.3f}")

        # 5. Identify the best performing model by answer relevance
        best_model = max(
            comparison_results.items(),
            key=lambda x: x[1].get("answer_relevance", 0)
        )[0]

        mlflow.log_param("best_model", best_model)
        mlflow.log_metric("best_answer_relevance",
                          comparison_results[best_model]["answer_relevance"])

        # 6. Save evaluation artifacts
        comparison_df = pd.DataFrame(comparison_results).T
        comparison_df.to_csv("model_comparison.csv")
        mlflow.log_artifact("model_comparison.csv")

        print(f"\nBest performing model: {best_model}")

    return comparison_results

# Run evaluation pipeline
results = create_llm_evaluation_pipeline()
```

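The `technical_accuracy` scorer in the pipeline computes a word-level intersection over union (the Jaccard index) between each prediction and its target. The standalone sketch below isolates that calculation so it can be tried without MLflow or a model. The function names `jaccard_overlap` and `score_batch` are illustrative, and the guard for two empty strings is an addition the pipeline code does not have.

```python
from typing import List


def jaccard_overlap(prediction: str, target: str) -> float:
    """Word-level Jaccard index: |A & B| / |A | B| on lowercased tokens."""
    pred_words = set(prediction.lower().split())
    target_words = set(target.lower().split())
    union = pred_words | target_words
    if not union:
        # Both strings are empty; avoid dividing by zero.
        return 0.0
    return len(pred_words & target_words) / len(union)


def score_batch(predictions: List[str], targets: List[str]) -> List[float]:
    """Apply jaccard_overlap pairwise, as a scorer would over a dataset."""
    return [jaccard_overlap(p, t) for p, t in zip(predictions, targets)]


if __name__ == "__main__":
    preds = ["Deep learning uses networks"]
    refs = ["Deep learning uses layers"]
    print(score_batch(preds, refs))
```

Because the measure only counts shared tokens, it rewards matching vocabulary rather than correctness, so it is best treated as a cheap proxy alongside the judge-based metrics.
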
## Types

```python { .api }
from typing import Dict, List, Any, Optional, Union, Callable
from mlflow.entities import Dataset
import pandas as pd

# Core evaluation types
class EvaluationResult:
    metrics: Dict[str, float]
    artifacts: Dict[str, str]
    tables: Dict[str, pd.DataFrame]

def to_predict_fn(
    model_uri: str,
    inference_params: Optional[Dict[str, Any]] = None
) -> Callable[[pd.DataFrame], List[str]]: ...

# Prompt management types
class Prompt:
    name: str
    version: str
    template: str
    model_config: Optional[Dict[str, Any]]
    description: Optional[str]
    tags: Dict[str, str]

    def format(self, **kwargs) -> str: ...

class PromptTemplate:
    template: str
    input_variables: List[str]

    def format(self, **kwargs) -> str: ...

# Scorer types
class Scorer:
    name: str
    version: Optional[str]
    greater_is_better: bool
    long_name: Optional[str]
    model_type: Optional[str]

    def score(self, predictions: List[str], targets: Optional[List[str]] = None, **kwargs) -> List[float]: ...

def scorer(
    name: Optional[str] = None,
    version: Optional[str] = None,
    greater_is_better: bool = True,
    long_name: Optional[str] = None,
    model_type: Optional[str] = None
) -> Callable: ...

# Optimization types
class OptimizationResult:
    best_prompt: str
    best_score: float
    iterations: int
    candidate_prompts: List[str]
    scores: List[float]

# Scheduling types
class ScorerScheduleConfig:
    schedule_type: str
    frequency: Union[str, int]
    start_time: Optional[str]
    end_time: Optional[str]
    timezone: Optional[str]

# Labeling types
class LabelingSession:
    session_id: str
    name: str
    dataset: Optional[Dataset]
    instructions: Optional[str]
    config: Dict[str, Any]
    status: str

    def add_labels(self, labels: List[Dict[str, Any]]) -> None: ...
    def get_labels(self) -> List[Dict[str, Any]]: ...
    def export_labels(self, format: str = "json") -> Union[List[Dict], pd.DataFrame]: ...

class Agent:
    name: str
    model: Optional[str]
    tools: List[str]
    instructions: Optional[str]

    def evaluate(self, text: str, criteria: List[str]) -> Dict[str, Any]: ...
    def label_batch(self, texts: List[str], schema: Dict[str, Any], guidelines: str) -> List[Dict[str, Any]]: ...

class ReviewApp:
    session: LabelingSession

    def launch(self, port: int = 8080, host: str = "localhost") -> None: ...
    def stop(self) -> None: ...

# Dataset types
class GenAIDataset(Dataset):
    name: str
    version: Optional[str]
    description: Optional[str]
    tags: Dict[str, str]

# Built-in resources
judges: Dict[str, str]
scorers: Dict[str, str]
datasets: Dict[str, str]
```