# Experiment Execution

Comprehensive experiment execution system with evaluation capabilities, progress tracking, and OpenTelemetry instrumentation for AI model testing and systematic evaluation workflows.

## Capabilities

### Experiment Execution

Run experiments on datasets with custom tasks and evaluators, including automatic instrumentation and progress tracking.

```typescript { .api }
/**
 * Run an experiment on a dataset with evaluation
 * @param params - Experiment execution parameters
 * @returns Promise resolving to experiment results
 */
function runExperiment(params: {
  client?: PhoenixClient;
  experimentName?: string;
  experimentDescription?: string;
  experimentMetadata?: Record<string, unknown>;
  dataset: DatasetSelector;
  task: ExperimentTask;
  evaluators?: Evaluator[];
  logger?: Logger;
  record?: boolean;
  concurrency?: number;
  dryRun?: number | boolean;
  setGlobalTracerProvider?: boolean;
  repetitions?: number;
  useBatchSpanProcessor?: boolean;
}): Promise<RanExperiment>;

interface ExperimentTask {
  (example: Example): Promise<Record<string, unknown>>;
}

interface Evaluator {
  name: string;
  kind: AnnotatorKind;
  evaluate: (
    example: Example,
    output: Record<string, unknown>
  ) => Promise<EvaluationResult>;
}

/**
 * Helper function to create an evaluator with proper typing
 * @param params - Evaluator configuration
 * @returns Evaluator instance
 */
function asEvaluator(params: {
  name: string;
  kind: AnnotatorKind;
  evaluate: Evaluator["evaluate"];
}): Evaluator;

interface EvaluationResult {
  name: string;
  score?: number;
  label?: string;
  explanation?: string;
  metadata?: Record<string, unknown>;
}

interface RanExperiment extends ExperimentInfo {
  runs: Record<string, ExperimentRun>;
  evaluationRuns?: ExperimentEvaluationRun[];
}

interface ExperimentInfo {
  id: string;
  datasetId: string;
  datasetVersionId: string;
  projectName: string;
  metadata: Record<string, unknown>;
}

interface ExperimentRun {
  id: string;
  startTime: Date;
  endTime: Date;
  experimentId: string;
  datasetExampleId: string;
  output?: string | boolean | number | object | null;
  error: string | null;
  repetition: number;
}

interface ExperimentEvaluationRun {
  id: string;
  runId: string;
  evaluatorName: string;
  result: EvaluationResult;
  startTime: Date;
  endTime: Date;
}
```

**Usage Example:**

```typescript
import { runExperiment, asEvaluator } from "@arizeai/phoenix-client/experiments";

// Define your task function
const myTask: ExperimentTask = async (example) => {
  const { question } = example.input;

  // Call your AI model/API
  const response = await callMyModel(question);

  return {
    answer: response.answer,
    confidence: response.confidence
  };
};

// Define evaluators using the asEvaluator helper
const accuracyEvaluator = asEvaluator({
  name: "accuracy",
  kind: "HEURISTIC",
  evaluate: async (example, output) => {
    const expectedAnswer = example.output?.answer;
    const actualAnswer = output.answer;

    const isCorrect = expectedAnswer === actualAnswer;

    return {
      name: "accuracy",
      score: isCorrect ? 1 : 0,
      label: isCorrect ? "correct" : "incorrect"
    };
  }
});

// Run the experiment
const results = await runExperiment({
  dataset: { datasetName: "qa-eval-set" },
  task: myTask,
  evaluators: [accuracyEvaluator],
  experimentMetadata: {
    model: "gpt-4o",
    temperature: 0.3,
    experiment_type: "accuracy_test"
  },
  concurrency: 5,
  repetitions: 1
});

console.log(`Experiment ${results.id} completed`);
console.log(`Processed ${Object.keys(results.runs).length} examples`);
console.log(`Generated ${results.evaluationRuns?.length ?? 0} evaluations`);
```

### Experiment Information Retrieval

Get experiment metadata without full run details for lightweight operations.

```typescript { .api }
/**
 * Get experiment metadata
 * @param params - Experiment info parameters
 * @returns Promise resolving to experiment information
 */
function getExperimentInfo(params: {
  client?: PhoenixClient;
  experimentId: string;
}): Promise<ExperimentInfo>;

interface ExperimentInfo {
  id: string;
  datasetId: string;
  datasetVersionId: string;
  projectName: string;
  metadata: Record<string, unknown>;
}
```

177

178

### Experiment Data Retrieval

179

180

Retrieve complete experiment data including all runs and evaluations.

181

182

```typescript { .api }

183

/**

184

* Get complete experiment data

185

* @param params - Experiment retrieval parameters

186

* @returns Promise resolving to full experiment data

187

*/

188

function getExperiment(params: {

189

client?: PhoenixClient;

190

experimentId: string;

191

}): Promise<RanExperiment>;

192

193

interface RanExperiment extends ExperimentInfo {

194

runs: Record<string, ExperimentRun>;

195

evaluationRuns?: ExperimentEvaluationRun[];

196

}

197

```

### Experiment Runs Retrieval

Retrieve all runs recorded for an experiment.

```typescript { .api }
/**
 * Get the runs for an experiment
 * @param params - Experiment runs parameters
 * @returns Promise resolving to experiment runs
 */
function getExperimentRuns(params: {
  client?: PhoenixClient;
  experimentId: string;
}): Promise<{ runs: ExperimentRun[] }>;

type ExperimentRunID = string;
```

**Usage Examples:**

```typescript
import { getExperimentInfo, getExperiment, getExperimentRuns } from "@arizeai/phoenix-client/experiments";

// Get basic experiment info
const info = await getExperimentInfo({
  experimentId: "exp_123"
});

// Get complete experiment with runs
const experiment = await getExperiment({
  experimentId: "exp_123"
});

// Get all runs for an experiment
const { runs } = await getExperimentRuns({
  experimentId: "exp_123"
});
```

### Experiment Configuration

Advanced configuration options for experiment execution behavior.

**Concurrency Control:**

```typescript
// Run up to 10 examples in parallel
await runExperiment({
  dataset: { datasetId: "dataset_123" },
  task: myTask,
  concurrency: 10 // Default: 1
});
```

**Repetitions:**

```typescript
// Run each example 3 times for reliability testing
await runExperiment({
  dataset: { datasetId: "dataset_123" },
  task: myTask,
  repetitions: 3 // Default: 1
});
```
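**Dry Run:**

The `runExperiment` signature also accepts a `dryRun` option (`number | boolean`) that is not illustrated elsewhere in this document. As a hedged sketch of its likely semantics — by analogy with the Python Phoenix client, where a dry run processes a small subset of examples without recording anything to Phoenix — the example selection might behave like this hypothetical helper (`examplesForDryRun` is illustrative only, not part of the phoenix-client API):

```typescript
// Hypothetical helper, not part of @arizeai/phoenix-client: models how a
// `dryRun` value could select which examples to process.
// - false / undefined: process every example (and record results)
// - true: process a single example, recording nothing
// - number N: process the first N examples, recording nothing
function examplesForDryRun<T>(examples: T[], dryRun?: number | boolean): T[] {
  if (dryRun === undefined || dryRun === false) return examples;
  const n = dryRun === true ? 1 : dryRun;
  return examples.slice(0, Math.max(0, Math.floor(n)));
}

console.log(examplesForDryRun(["a", "b", "c"], 2)); // first two examples
```

Under those assumed semantics, `runExperiment({ ..., dryRun: 5 })` would exercise the task and evaluators on five examples without persisting anything — useful for validating a task before a full run.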

**Custom Logging:**

```typescript
import { Logger } from "@arizeai/phoenix-client/types/logger";

const customLogger: Logger = {
  info: (message: string) => console.log(`[INFO] ${message}`),
  error: (message: string) => console.error(`[ERROR] ${message}`),
  warn: (message: string) => console.warn(`[WARN] ${message}`)
};

await runExperiment({
  dataset: { datasetId: "dataset_123" },
  task: myTask,
  logger: customLogger
});
```

### OpenTelemetry Integration

Experiments automatically generate OpenTelemetry traces for observability and debugging.

**Automatic Instrumentation:**

- Each experiment run is traced as a span
- Task execution and evaluator runs are sub-spans
- Error details are captured in span attributes
- Experiment metadata is included in trace context

**Custom Instrumentation Provider:**

```typescript
import { NodeTracerProvider } from "@opentelemetry/sdk-trace-node";

const provider = new NodeTracerProvider({
  // Custom tracer configuration
});

// Register your provider globally, then keep runExperiment from
// installing its own global tracer provider
provider.register();

await runExperiment({
  dataset: { datasetId: "dataset_123" },
  task: myTask,
  setGlobalTracerProvider: false
});
```

### Evaluator Patterns

Common patterns for implementing evaluators for different use cases.

**Binary Classification:**

```typescript
const binaryEvaluator: Evaluator = {
  name: "binary_accuracy",
  kind: "HEURISTIC",
  evaluate: async (example, output) => {
    const expected = example.output?.label;
    const predicted = output.prediction;

    return {
      name: "binary_accuracy",
      score: expected === predicted ? 1 : 0,
      label: expected === predicted ? "correct" : "incorrect",
      explanation: `Expected: ${expected}, Got: ${predicted}`
    };
  }
};
```

**Similarity-Based Evaluation:**

```typescript
const similarityEvaluator: Evaluator = {
  name: "semantic_similarity",
  kind: "HEURISTIC",
  evaluate: async (example, output) => {
    const expected = example.output?.text;
    const generated = output.text;

    // Use your similarity calculation
    const similarity = await calculateSimilarity(expected, generated);

    return {
      name: "semantic_similarity",
      score: similarity,
      explanation: `Similarity score between expected and generated text`
    };
  }
};
```

**LLM-as-Judge:**

```typescript
const llmJudgeEvaluator: Evaluator = {
  name: "llm_judge",
  kind: "LLM",
  evaluate: async (example, output) => {
    const prompt = `Rate the quality of this response on a scale of 1-5:

Question: ${example.input.question}
Response: ${output.answer}

Provide a numeric score and brief explanation.`;

    const judgeResponse = await callJudgeModel(prompt);

    return {
      name: "llm_judge",
      score: judgeResponse.score,
      explanation: judgeResponse.explanation
    };
  }
};
```

### Error Handling

Experiments include robust error handling with detailed error reporting.

**Task Error Handling:**

```typescript
const robustTask: ExperimentTask = async (example) => {
  try {
    const result = await callAPI(example.input);
    return result;
  } catch (error) {
    // Errors are automatically captured in experiment runs
    throw new Error(
      `Task failed: ${error instanceof Error ? error.message : String(error)}`
    );
  }
};
```

**Evaluator Error Handling:**

```typescript
const safeEvaluator: Evaluator = {
  name: "safe_evaluator",
  kind: "HEURISTIC",
  evaluate: async (example, output) => {
    try {
      const score = await computeScore(example, output);
      return { name: "safe_evaluator", score };
    } catch (error) {
      // Return error information in the evaluation result
      // (score is optional, so it can be omitted on failure)
      return {
        name: "safe_evaluator",
        label: "error",
        explanation: `Evaluation failed: ${error instanceof Error ? error.message : String(error)}`
      };
    }
  }
};
```

### Best Practices

- **Deterministic Tasks**: Ensure task functions are deterministic for reproducible results
- **Error Handling**: Implement proper error handling in both tasks and evaluators
- **Concurrency**: Use appropriate concurrency levels based on API rate limits
- **Metadata**: Include relevant experiment metadata for analysis and comparison
- **Evaluation Strategy**: Choose evaluators appropriate for your specific use case
- **Progress Monitoring**: Use logging to monitor long-running experiments
- **Resource Management**: Consider memory usage with large datasets and high concurrency
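To make the concurrency guidance above concrete, here is a self-contained sketch (not phoenix-client code) of how a `concurrency` limit bounds parallel task execution: a fixed number of worker lanes pull examples from a shared queue, so at most `concurrency` tasks are in flight at once.

```typescript
// Illustrative only: a bounded-concurrency runner in the spirit of
// runExperiment's `concurrency` option.
async function runWithConcurrency<T, R>(
  items: T[],
  worker: (item: T) => Promise<R>,
  concurrency: number
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each lane repeatedly claims the next unprocessed item until none remain.
  async function lane(): Promise<void> {
    while (next < items.length) {
      const index = next++; // claimed synchronously, so no double-processing
      results[index] = await worker(items[index]);
    }
  }

  const lanes = Math.min(Math.max(1, concurrency), items.length);
  await Promise.all(Array.from({ length: lanes }, () => lane()));
  return results;
}

const doubled = await runWithConcurrency([1, 2, 3, 4], async (n) => n * 2, 2);
console.log(doubled); // [ 2, 4, 6, 8 ]
```

In practice, derive the concurrency level from your provider's rate limit (for example, requests-per-minute divided by average task latency) rather than maximizing parallelism.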