# Experiment Execution

Comprehensive experiment execution system with evaluation capabilities, progress tracking, and OpenTelemetry instrumentation for AI model testing and systematic evaluation workflows.

## Capabilities

### Experiment Execution

Run experiments on datasets with custom tasks and evaluators, including automatic instrumentation and progress tracking.

```typescript { .api }
/**
 * Run an experiment on a dataset with evaluation
 * @param params - Experiment execution parameters
 * @returns Promise resolving to experiment results
 */
function runExperiment(params: {
  client?: PhoenixClient;
  experimentName?: string;
  experimentDescription?: string;
  experimentMetadata?: Record<string, unknown>;
  dataset: DatasetSelector;
  task: ExperimentTask;
  evaluators?: Evaluator[];
  logger?: Logger;
  record?: boolean;
  concurrency?: number;
  dryRun?: number | boolean;
  setGlobalTracerProvider?: boolean;
  repetitions?: number;
  useBatchSpanProcessor?: boolean;
}): Promise<RanExperiment>;

interface ExperimentTask {
  (example: Example): Promise<Record<string, unknown>>;
}

interface Evaluator {
  name: string;
  kind: AnnotatorKind;
  evaluate: (
    example: Example,
    output: Record<string, unknown>
  ) => Promise<EvaluationResult>;
}

/**
 * Helper function to create an evaluator with proper typing
 * @param params - Evaluator configuration
 * @returns Evaluator instance
 */
function asEvaluator(params: {
  name: string;
  kind: AnnotatorKind;
  evaluate: Evaluator["evaluate"];
}): Evaluator;

interface EvaluationResult {
  name: string;
  score?: number;
  label?: string;
  explanation?: string;
  metadata?: Record<string, unknown>;
}

interface RanExperiment extends ExperimentInfo {
  runs: Record<string, ExperimentRun>;
  evaluationRuns?: ExperimentEvaluationRun[];
}

interface ExperimentInfo {
  id: string;
  datasetId: string;
  datasetVersionId: string;
  projectName: string;
  metadata: Record<string, unknown>;
}

interface ExperimentRun {
  id: string;
  startTime: Date;
  endTime: Date;
  experimentId: string;
  datasetExampleId: string;
  output?: string | boolean | number | object | null;
  error: string | null;
  repetition: number;
}

interface ExperimentEvaluationRun {
  id: string;
  runId: string;
  evaluatorName: string;
  result: EvaluationResult;
  startTime: Date;
  endTime: Date;
}
```

**Usage Example:**

```typescript
import { runExperiment, asEvaluator } from "@arizeai/phoenix-client/experiments";

// Define your task function
const myTask: ExperimentTask = async (example) => {
  const { question } = example.input;

  // Call your AI model/API
  const response = await callMyModel(question);

  return {
    answer: response.answer,
    confidence: response.confidence
  };
};

// Define evaluators using the asEvaluator helper
const accuracyEvaluator = asEvaluator({
  name: "accuracy",
  kind: "HEURISTIC",
  evaluate: async (example, output) => {
    const expectedAnswer = example.output?.answer;
    const actualAnswer = output.answer;

    const isCorrect = expectedAnswer === actualAnswer;

    return {
      name: "accuracy",
      score: isCorrect ? 1 : 0,
      label: isCorrect ? "correct" : "incorrect"
    };
  }
});

// Run the experiment
const results = await runExperiment({
  dataset: { datasetName: "qa-eval-set" },
  task: myTask,
  evaluators: [accuracyEvaluator],
  experimentMetadata: {
    model: "gpt-4o",
    temperature: 0.3,
    experiment_type: "accuracy_test"
  },
  concurrency: 5,
  repetitions: 1
});

console.log(`Experiment ${results.id} completed`);
console.log(`Processed ${Object.keys(results.runs).length} examples`);
console.log(`Generated ${results.evaluationRuns?.length ?? 0} evaluations`);
```

### Experiment Information Retrieval

Get experiment metadata without full run details for lightweight operations.

```typescript { .api }
/**
 * Get experiment metadata
 * @param params - Experiment info parameters
 * @returns Promise resolving to experiment information
 */
function getExperimentInfo(params: {
  client?: PhoenixClient;
  experimentId: string;
}): Promise<ExperimentInfo>;

interface ExperimentInfo {
  id: string;
  datasetId: string;
  datasetVersionId: string;
  projectName: string;
  metadata: Record<string, unknown>;
}
```

### Experiment Data Retrieval

Retrieve complete experiment data including all runs and evaluations.

```typescript { .api }
/**
 * Get complete experiment data
 * @param params - Experiment retrieval parameters
 * @returns Promise resolving to full experiment data
 */
function getExperiment(params: {
  client?: PhoenixClient;
  experimentId: string;
}): Promise<RanExperiment>;

interface RanExperiment extends ExperimentInfo {
  runs: Record<string, ExperimentRun>;
  evaluationRuns?: ExperimentEvaluationRun[];
}
```

### Experiment Runs Retrieval

Retrieve all runs recorded for an experiment.

```typescript { .api }
/**
 * Get all runs for an experiment
 * @param params - Experiment runs parameters
 * @returns Promise resolving to experiment runs
 */
function getExperimentRuns(params: {
  client?: PhoenixClient;
  experimentId: string;
}): Promise<{ runs: ExperimentRun[] }>;

type ExperimentRunID = string;
```

**Usage Examples:**

```typescript
import {
  getExperimentInfo,
  getExperiment,
  getExperimentRuns
} from "@arizeai/phoenix-client/experiments";

// Get basic experiment info
const info = await getExperimentInfo({
  experimentId: "exp_123"
});

// Get complete experiment with runs and evaluations
const experiment = await getExperiment({
  experimentId: "exp_123"
});

// Get all runs for the experiment
const { runs } = await getExperimentRuns({
  experimentId: "exp_123"
});
```
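
Because `runs` on a `RanExperiment` is keyed by run ID, summaries are computed over its values. A minimal sketch using only the interfaces documented above (the `experiment` variable comes from the `getExperiment` call in the previous example):

```typescript
// Count failed runs and average the "accuracy" evaluator's scores.
const allRuns = Object.values(experiment.runs);
const failed = allRuns.filter((run) => run.error !== null);
console.log(`${failed.length}/${allRuns.length} runs errored`);

const accuracyScores = (experiment.evaluationRuns ?? [])
  .filter((e) => e.evaluatorName === "accuracy")
  .map((e) => e.result.score)
  .filter((s): s is number => typeof s === "number"); // score is optional
if (accuracyScores.length > 0) {
  const mean =
    accuracyScores.reduce((a, b) => a + b, 0) / accuracyScores.length;
  console.log(`Mean accuracy: ${mean.toFixed(3)}`);
}
```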

### Experiment Configuration

Advanced configuration options for experiment execution behavior.

**Concurrency Control:**

```typescript
// Run up to 10 examples in parallel
await runExperiment({
  dataset: { datasetId: "dataset_123" },
  task: myTask,
  concurrency: 10 // Default: 1
});
```

**Repetitions:**

```typescript
// Run each example 3 times for reliability testing
await runExperiment({
  dataset: { datasetId: "dataset_123" },
  task: myTask,
  repetitions: 3 // Default: 1
});
```
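
**Dry Runs:**

The `dryRun` option appears in the `runExperiment` signature above but is not described in this section; a reasonable reading (an assumption, not confirmed here) is that `true` executes the task without recording results to Phoenix, while a number limits the dry run to that many examples:

```typescript
// Assumption: dry runs execute locally without recording to Phoenix,
// and a numeric value caps how many examples are attempted.
await runExperiment({
  dataset: { datasetId: "dataset_123" },
  task: myTask,
  dryRun: 3 // smoke-test the task on the first few examples
});
```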

**Custom Logging:**

```typescript
import { Logger } from "@arizeai/phoenix-client/types/logger";

const customLogger: Logger = {
  info: (message: string) => console.log(`[INFO] ${message}`),
  error: (message: string) => console.error(`[ERROR] ${message}`),
  warn: (message: string) => console.warn(`[WARN] ${message}`)
};

await runExperiment({
  dataset: { datasetId: "dataset_123" },
  task: myTask,
  logger: customLogger
});
```

### OpenTelemetry Integration

Experiments automatically generate OpenTelemetry traces for observability and debugging.

**Automatic Instrumentation:**

- Each experiment run is traced as a span
- Task execution and evaluator runs are sub-spans
- Error details are captured in span attributes
- Experiment metadata is included in trace context

**Tracer Configuration:**

Tracing behavior is adjusted through the `setGlobalTracerProvider` and `useBatchSpanProcessor` options documented in the `runExperiment` signature above:

```typescript
await runExperiment({
  dataset: { datasetId: "dataset_123" },
  task: myTask,
  // Register the experiment's tracer provider as the global provider
  setGlobalTracerProvider: true,
  // Batch span exports rather than exporting each span individually
  useBatchSpanProcessor: true
});
```

### Evaluator Patterns

Common patterns for implementing evaluators for different use cases. Each evaluator must supply the `kind` field required by the `Evaluator` interface.

**Binary Classification:**

```typescript
const binaryEvaluator: Evaluator = {
  name: "binary_accuracy",
  kind: "HEURISTIC",
  evaluate: async (example, output) => {
    const expected = example.output?.label;
    const predicted = output.prediction;

    return {
      name: "binary_accuracy",
      score: expected === predicted ? 1 : 0,
      label: expected === predicted ? "correct" : "incorrect",
      explanation: `Expected: ${expected}, Got: ${predicted}`
    };
  }
};
```
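
The object-literal form above can equivalently be produced with the `asEvaluator` helper shown earlier, which exists to keep the `evaluate` callback properly typed.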

**Similarity-Based Evaluation:**

```typescript
const similarityEvaluator: Evaluator = {
  name: "semantic_similarity",
  kind: "HEURISTIC",
  evaluate: async (example, output) => {
    const expected = example.output?.text;
    const generated = output.text;

    // Use your similarity calculation
    const similarity = await calculateSimilarity(expected, generated);

    return {
      name: "semantic_similarity",
      score: similarity,
      explanation: `Similarity ${similarity} between expected and generated text`
    };
  }
};
```

**LLM-as-Judge:**

```typescript
const llmJudgeEvaluator: Evaluator = {
  name: "llm_judge",
  kind: "LLM",
  evaluate: async (example, output) => {
    const prompt = `Rate the quality of this response on a scale of 1-5:

Question: ${example.input.question}
Response: ${output.answer}

Provide a numeric score and brief explanation.`;

    const judgeResponse = await callJudgeModel(prompt);

    return {
      name: "llm_judge",
      score: judgeResponse.score,
      explanation: judgeResponse.explanation
    };
  }
};
```

### Error Handling

Experiments include robust error handling with detailed error reporting.

**Task Error Handling:**

```typescript
const robustTask: ExperimentTask = async (example) => {
  try {
    const result = await callAPI(example.input);
    return result;
  } catch (error) {
    // Errors are automatically captured in experiment runs
    const message = error instanceof Error ? error.message : String(error);
    throw new Error(`Task failed: ${message}`);
  }
};
```
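
A run whose task throws still appears in `RanExperiment.runs`, with the thrown message surfaced via the run's `error` field (see the `ExperimentRun` interface above).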

**Evaluator Error Handling:**

```typescript
const safeEvaluator: Evaluator = {
  name: "safe_evaluator",
  kind: "HEURISTIC",
  evaluate: async (example, output) => {
    try {
      const score = await computeScore(example, output);
      return { name: "safe_evaluator", score };
    } catch (error) {
      // Surface the failure in the evaluation result instead of throwing;
      // score is omitted since EvaluationResult.score is an optional number
      const message = error instanceof Error ? error.message : String(error);
      return {
        name: "safe_evaluator",
        explanation: `Evaluation failed: ${message}`
      };
    }
  }
};
```

### Best Practices

- **Deterministic Tasks**: Keep task functions deterministic so results are reproducible across repetitions
- **Error Handling**: Implement proper error handling in both tasks and evaluators
- **Concurrency**: Match the concurrency level to your API rate limits (see the retry sketch below)
- **Metadata**: Include relevant experiment metadata for analysis and comparison
- **Evaluation Strategy**: Choose evaluators appropriate for your specific use case
- **Progress Monitoring**: Use logging to monitor long-running experiments
- **Resource Management**: Watch memory usage with large datasets and high concurrency
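
When an experiment runs at high concurrency against a rate-limited API, wrapping the model call with retries keeps transient failures from being recorded as task errors. A minimal sketch; `withRetries`, `callMyModel`, and the retry parameters are illustrative, not part of the client API:

```typescript
// Hypothetical helper: retry a call with exponential backoff.
// Nothing here comes from @arizeai/phoenix-client; it wraps your own code.
async function withRetries<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 500
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt >= maxAttempts) throw error;
      // Exponential backoff: 500ms, 1000ms, 2000ms, ...
      const delay = baseDelayMs * 2 ** (attempt - 1);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}

const retryingTask: ExperimentTask = async (example) =>
  withRetries(() => callMyModel(example.input.question));
```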