# Experiment Execution

Comprehensive experiment execution system with evaluation capabilities, progress tracking, and OpenTelemetry instrumentation for AI model testing and systematic evaluation workflows.

## Capabilities

### Experiment Execution

Run experiments on datasets with custom tasks and evaluators, including automatic instrumentation and progress tracking.

```typescript { .api }
/**
 * Run an experiment on a dataset with evaluation
 * @param params - Experiment execution parameters
 * @returns Promise resolving to experiment results
 */
function runExperiment(params: {
  client?: PhoenixClient;
  experimentName?: string;
  experimentDescription?: string;
  experimentMetadata?: Record<string, unknown>;
  dataset: DatasetSelector;
  task: ExperimentTask;
  evaluators?: Evaluator[];
  logger?: Logger;
  record?: boolean;
  concurrency?: number;
  dryRun?: number | boolean;
  setGlobalTracerProvider?: boolean;
  repetitions?: number;
  useBatchSpanProcessor?: boolean;
}): Promise<RanExperiment>;

interface ExperimentTask {
  (example: Example): Promise<Record<string, unknown>>;
}

interface Evaluator {
  name: string;
  kind: AnnotatorKind;
  evaluate: (
    example: Example,
    output: Record<string, unknown>
  ) => Promise<EvaluationResult>;
}

/**
 * Helper function to create an evaluator with proper typing
 * @param params - Evaluator configuration
 * @returns Evaluator instance
 */
function asEvaluator(params: {
  name: string;
  kind: AnnotatorKind;
  evaluate: Evaluator["evaluate"];
}): Evaluator;

interface EvaluationResult {
  name: string;
  score?: number;
  label?: string;
  explanation?: string;
  metadata?: Record<string, unknown>;
}

interface RanExperiment extends ExperimentInfo {
  runs: Record<string, ExperimentRun>;
  evaluationRuns?: ExperimentEvaluationRun[];
}

interface ExperimentInfo {
  id: string;
  datasetId: string;
  datasetVersionId: string;
  projectName: string;
  metadata: Record<string, unknown>;
}

interface ExperimentRun {
  id: string;
  startTime: Date;
  endTime: Date;
  experimentId: string;
  datasetExampleId: string;
  output?: string | boolean | number | object | null;
  error: string | null;
  repetition: number;
}

interface ExperimentEvaluationRun {
  id: string;
  runId: string;
  evaluatorName: string;
  result: EvaluationResult;
  startTime: Date;
  endTime: Date;
}
```

**Usage Example:**

```typescript
import { runExperiment, asEvaluator } from "@arizeai/phoenix-client/experiments";

// Define your task function
const myTask: ExperimentTask = async (example) => {
  const { question } = example.input;

  // Call your AI model/API
  const response = await callMyModel(question);

  return {
    answer: response.answer,
    confidence: response.confidence
  };
};

// Define evaluators using the asEvaluator helper
const accuracyEvaluator = asEvaluator({
  name: "accuracy",
  kind: "HEURISTIC",
  evaluate: async (example, output) => {
    const expectedAnswer = example.output?.answer;
    const actualAnswer = output.answer;

    const isCorrect = expectedAnswer === actualAnswer;

    return {
      name: "accuracy",
      score: isCorrect ? 1 : 0,
      label: isCorrect ? "correct" : "incorrect"
    };
  }
});

// Run the experiment
const results = await runExperiment({
  dataset: { datasetName: "qa-eval-set" },
  task: myTask,
  evaluators: [accuracyEvaluator],
  experimentMetadata: {
    model: "gpt-4o",
    temperature: 0.3,
    experiment_type: "accuracy_test"
  },
  concurrency: 5,
  repetitions: 1
});

console.log(`Experiment ${results.id} completed`);
console.log(`Processed ${Object.keys(results.runs).length} examples`);
console.log(`Generated ${results.evaluationRuns?.length ?? 0} evaluations`);
```

### Experiment Information Retrieval

Get experiment metadata without full run details for lightweight operations.

```typescript { .api }
/**
 * Get experiment metadata
 * @param params - Experiment info parameters
 * @returns Promise resolving to experiment information
 */
function getExperimentInfo(params: {
  client?: PhoenixClient;
  experimentId: string;
}): Promise<ExperimentInfo>;

interface ExperimentInfo {
  id: string;
  datasetId: string;
  datasetVersionId: string;
  projectName: string;
  metadata: Record<string, unknown>;
}
```

177

178

### Experiment Data Retrieval

179

180

Retrieve complete experiment data including all runs and evaluations.

181

182

```typescript { .api }

183

/**

184

* Get complete experiment data

185

* @param params - Experiment retrieval parameters

186

* @returns Promise resolving to full experiment data

187

*/

188

function getExperiment(params: {

189

client?: PhoenixClient;

190

experimentId: string;

191

}): Promise<RanExperiment>;

192

193

interface RanExperiment extends ExperimentInfo {

194

runs: Record<string, ExperimentRun>;

195

evaluationRuns?: ExperimentEvaluationRun[];

196

}

197

```

### Experiment Runs Retrieval

Retrieve all runs recorded for an experiment.

```typescript { .api }
/**
 * Get the runs for an experiment
 * @param params - Experiment runs parameters
 * @returns Promise resolving to experiment runs
 */
function getExperimentRuns(params: {
  client?: PhoenixClient;
  experimentId: string;
}): Promise<{ runs: ExperimentRun[] }>;

type ExperimentRunID = string;
```

**Usage Examples:**

```typescript
import { getExperimentInfo, getExperiment, getExperimentRuns } from "@arizeai/phoenix-client/experiments";

// Get basic experiment info
const info = await getExperimentInfo({
  experimentId: "exp_123"
});

// Get complete experiment with runs
const experiment = await getExperiment({
  experimentId: "exp_123"
});

// Get all runs for an experiment
const { runs } = await getExperimentRuns({
  experimentId: "exp_123"
});
```

### Experiment Configuration

Advanced configuration options for experiment execution behavior.

**Concurrency Control:**

```typescript
// Run up to 10 examples in parallel
await runExperiment({
  dataset: { datasetId: "dataset_123" },
  task: myTask,
  concurrency: 10 // Default: 1
});
```

**Repetitions:**

```typescript
// Run each example 3 times for reliability testing
await runExperiment({
  dataset: { datasetId: "dataset_123" },
  task: myTask,
  repetitions: 3 // Default: 1
});
```
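**Dry Run:**

The `runExperiment` signature also accepts a `dryRun` option (`number | boolean`) that is not illustrated elsewhere in this document. As a hedged sketch of its likely semantics — by analogy with the Python Phoenix client, where a dry run processes a small subset of examples without recording anything to Phoenix — the example selection might behave like this hypothetical helper (`examplesForDryRun` is illustrative only, not part of the phoenix-client API):

```typescript
// Hypothetical helper, not part of @arizeai/phoenix-client: models how a
// `dryRun` value could select which examples to process.
// - false / undefined: process every example (and record results)
// - true: process a single example, recording nothing
// - number N: process the first N examples, recording nothing
function examplesForDryRun<T>(examples: T[], dryRun?: number | boolean): T[] {
  if (dryRun === undefined || dryRun === false) return examples;
  const n = dryRun === true ? 1 : dryRun;
  return examples.slice(0, Math.max(0, Math.floor(n)));
}

console.log(examplesForDryRun(["a", "b", "c"], 2)); // first two examples
```

Under those assumed semantics, `runExperiment({ ..., dryRun: 5 })` would exercise the task and evaluators on five examples without persisting anything — useful for validating a task before a full run.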

**Custom Logging:**

```typescript
import { Logger } from "@arizeai/phoenix-client/types/logger";

const customLogger: Logger = {
  info: (message: string) => console.log(`[INFO] ${message}`),
  error: (message: string) => console.error(`[ERROR] ${message}`),
  warn: (message: string) => console.warn(`[WARN] ${message}`)
};

await runExperiment({
  dataset: { datasetId: "dataset_123" },
  task: myTask,
  logger: customLogger
});
```

### OpenTelemetry Integration

Experiments automatically generate OpenTelemetry traces for observability and debugging.

**Automatic Instrumentation:**

- Each experiment run is traced as a span
- Task execution and evaluator runs are sub-spans
- Error details are captured in span attributes
- Experiment metadata is included in trace context

**Custom Instrumentation Provider:**

```typescript
import { NodeTracerProvider } from "@opentelemetry/sdk-trace-node";

const provider = new NodeTracerProvider({
  // Custom tracer configuration
});

// Register your provider globally, then keep runExperiment from
// installing its own global tracer provider
provider.register();

await runExperiment({
  dataset: { datasetId: "dataset_123" },
  task: myTask,
  setGlobalTracerProvider: false
});
```

### Evaluator Patterns

Common patterns for implementing evaluators for different use cases.

**Binary Classification:**

```typescript
const binaryEvaluator: Evaluator = {
  name: "binary_accuracy",
  kind: "HEURISTIC",
  evaluate: async (example, output) => {
    const expected = example.output?.label;
    const predicted = output.prediction;

    return {
      name: "binary_accuracy",
      score: expected === predicted ? 1 : 0,
      label: expected === predicted ? "correct" : "incorrect",
      explanation: `Expected: ${expected}, Got: ${predicted}`
    };
  }
};
```

**Similarity-Based Evaluation:**

```typescript
const similarityEvaluator: Evaluator = {
  name: "semantic_similarity",
  kind: "HEURISTIC",
  evaluate: async (example, output) => {
    const expected = example.output?.text;
    const generated = output.text;

    // Use your similarity calculation
    const similarity = await calculateSimilarity(expected, generated);

    return {
      name: "semantic_similarity",
      score: similarity,
      explanation: `Similarity score between expected and generated text`
    };
  }
};
```

**LLM-as-Judge:**

```typescript
const llmJudgeEvaluator: Evaluator = {
  name: "llm_judge",
  kind: "LLM",
  evaluate: async (example, output) => {
    const prompt = `Rate the quality of this response on a scale of 1-5:

Question: ${example.input.question}
Response: ${output.answer}

Provide a numeric score and brief explanation.`;

    const judgeResponse = await callJudgeModel(prompt);

    return {
      name: "llm_judge",
      score: judgeResponse.score,
      explanation: judgeResponse.explanation
    };
  }
};
```

### Error Handling

Experiments include robust error handling with detailed error reporting.

**Task Error Handling:**

```typescript
const robustTask: ExperimentTask = async (example) => {
  try {
    const result = await callAPI(example.input);
    return result;
  } catch (error) {
    // Errors are automatically captured in experiment runs
    throw new Error(
      `Task failed: ${error instanceof Error ? error.message : String(error)}`
    );
  }
};
```

**Evaluator Error Handling:**

```typescript
const safeEvaluator: Evaluator = {
  name: "safe_evaluator",
  kind: "HEURISTIC",
  evaluate: async (example, output) => {
    try {
      const score = await computeScore(example, output);
      return { name: "safe_evaluator", score };
    } catch (error) {
      // Return error information in the evaluation result
      // (score is optional, so it can be omitted on failure)
      return {
        name: "safe_evaluator",
        label: "error",
        explanation: `Evaluation failed: ${error instanceof Error ? error.message : String(error)}`
      };
    }
  }
};
```

### Best Practices

- **Deterministic Tasks**: Ensure task functions are deterministic for reproducible results
- **Error Handling**: Implement proper error handling in both tasks and evaluators
- **Concurrency**: Use appropriate concurrency levels based on API rate limits
- **Metadata**: Include relevant experiment metadata for analysis and comparison
- **Evaluation Strategy**: Choose evaluators appropriate for your specific use case
- **Progress Monitoring**: Use logging to monitor long-running experiments
- **Resource Management**: Consider memory usage with large datasets and high concurrency
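To make the concurrency guidance above concrete, here is a self-contained sketch (not phoenix-client code) of how a `concurrency` limit bounds parallel task execution: a fixed number of worker lanes pull examples from a shared queue, so at most `concurrency` tasks are in flight at once.

```typescript
// Illustrative only: a bounded-concurrency runner in the spirit of
// runExperiment's `concurrency` option.
async function runWithConcurrency<T, R>(
  items: T[],
  worker: (item: T) => Promise<R>,
  concurrency: number
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each lane repeatedly claims the next unprocessed item until none remain.
  async function lane(): Promise<void> {
    while (next < items.length) {
      const index = next++; // claimed synchronously, so no double-processing
      results[index] = await worker(items[index]);
    }
  }

  const lanes = Math.min(Math.max(1, concurrency), items.length);
  await Promise.all(Array.from({ length: lanes }, () => lane()));
  return results;
}

const doubled = await runWithConcurrency([1, 2, 3, 4], async (n) => n * 2, 2);
console.log(doubled); // [ 2, 4, 6, 8 ]
```

In practice, derive the concurrency level from your provider's rate limit (for example, requests-per-minute divided by average task latency) rather than maximizing parallelism.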