
# Batches and Evaluations

Comprehensive API reference for batch processing and evaluation management in the OpenAI Node.js library. Use batches to process multiple API requests asynchronously, and use evaluations to systematically test model performance against defined criteria.

## Overview

### Batches

Batches allow you to send multiple API requests in a single operation, processed asynchronously by OpenAI. This is ideal for high-volume, non-time-sensitive workloads where cost efficiency is important.

### Evaluations (Evals)

Evaluations provide a framework to systematically assess model outputs against defined testing criteria. Run evaluations on different models, configurations, and data sources to compare performance.

## Batches API

### Create a Batch

Submits a new batch job for processing. The batch is created from a JSONL file containing API requests.

```typescript { .api }
create(params: BatchCreateParams): Promise<Batch>
```

**Parameters:**

```typescript { .api }
interface BatchCreateParams {
  /**
   * The time frame within which the batch should be processed.
   * Currently only `24h` is supported.
   */
  completion_window: '24h';

  /**
   * The endpoint to be used for all requests in the batch.
   * Supported: `/v1/responses`, `/v1/chat/completions`, `/v1/embeddings`,
   * `/v1/completions`, `/v1/moderations`.
   * Note: `/v1/embeddings` batches are limited to 50,000 embedding inputs.
   */
  endpoint:
    | '/v1/responses'
    | '/v1/chat/completions'
    | '/v1/embeddings'
    | '/v1/completions'
    | '/v1/moderations';

  /**
   * The ID of an uploaded file containing requests for the batch.
   * Must be a JSONL file uploaded with purpose `batch`.
   * Max 50,000 requests, 200 MB file size.
   */
  input_file_id: string;

  /**
   * Optional metadata (16 key-value pairs max).
   * Keys: max 64 chars; values: max 512 chars.
   */
  metadata?: Metadata | null;

  /**
   * Optional expiration policy for output/error files.
   */
  output_expires_after?: {
    /**
     * Anchor timestamp: `created_at` (file creation time).
     */
    anchor: 'created_at';

    /**
     * Seconds after anchor: 3600 (1 hour) to 2592000 (30 days).
     */
    seconds: number;
  };
}
```

**Example:**

```typescript
import { OpenAI, toFile } from 'openai';

const client = new OpenAI();

// 1. Create a JSONL file with batch requests (one JSON object per line)
const batchRequests = [
  {
    custom_id: 'request-1',
    method: 'POST',
    url: '/v1/chat/completions',
    body: {
      model: 'gpt-4o',
      messages: [{ role: 'user', content: 'Translate "hello" to French' }],
      max_tokens: 100,
    },
  },
  {
    custom_id: 'request-2',
    method: 'POST',
    url: '/v1/chat/completions',
    body: {
      model: 'gpt-4o',
      messages: [{ role: 'user', content: 'Translate "goodbye" to Spanish' }],
      max_tokens: 100,
    },
  },
];

// 2. Upload the file (toFile attaches the filename the multipart upload needs)
const file = await client.files.create({
  file: await toFile(
    Buffer.from(batchRequests.map(r => JSON.stringify(r)).join('\n')),
    'batch-requests.jsonl',
  ),
  purpose: 'batch',
});

// 3. Create the batch
const batch = await client.batches.create({
  input_file_id: file.id,
  endpoint: '/v1/chat/completions',
  completion_window: '24h',
});

console.log(`Batch ${batch.id} submitted`);
```

### Retrieve a Batch

Retrieves details about a specific batch job.

```typescript { .api }
retrieve(batchID: string): Promise<Batch>
```

**Example:**

```typescript
const batch = await client.batches.retrieve('batch_abc123');
console.log(`Batch status: ${batch.status}`);
console.log(`Completed: ${batch.request_counts.completed}`);
console.log(`Failed: ${batch.request_counts.failed}`);
```

### List Batches

Retrieves a paginated list of batch jobs for your organization.

```typescript { .api }
list(params?: BatchListParams): Promise<BatchesPage>
```

**Parameters:**

```typescript { .api }
interface BatchListParams extends CursorPageParams {
  // Pagination parameters inherited from CursorPageParams
}
```

**Example:**

```typescript
// List all batches
for await (const batch of client.batches.list()) {
  console.log(`${batch.id}: ${batch.status}`);
}

// List with pagination
const page = await client.batches.list();
if (page.hasNextPage()) {
  const nextPage = await page.getNextPage();
}
```

### Cancel a Batch

Cancels a batch that is in progress. The batch transitions to `cancelling` status for up to 10 minutes, then becomes `cancelled` with any partial results available.

```typescript { .api }
cancel(batchID: string): Promise<Batch>
```

**Example:**

```typescript
const cancelled = await client.batches.cancel('batch_abc123');
console.log(`Batch status: ${cancelled.status}`); // 'cancelling'
```

### Batch Types

```typescript { .api }
interface Batch {
  /**
   * Unique batch identifier.
   */
  id: string;

  /**
   * The completion window (currently always `24h`).
   */
  completion_window: string;

  /**
   * Unix timestamp (seconds) when batch was created.
   */
  created_at: number;

  /**
   * The API endpoint used by this batch.
   */
  endpoint: string;

  /**
   * The ID of the input file containing requests.
   */
  input_file_id: string;

  /**
   * Object type identifier (always `batch`).
   */
  object: 'batch';

  /**
   * Current batch status: validating, failed, in_progress, finalizing,
   * completed, expired, cancelling, or cancelled.
   */
  status:
    | 'validating'
    | 'failed'
    | 'in_progress'
    | 'finalizing'
    | 'completed'
    | 'expired'
    | 'cancelling'
    | 'cancelled';

  /**
   * Unix timestamp (seconds) when batch was cancelled (if applicable).
   */
  cancelled_at?: number;

  /**
   * Unix timestamp (seconds) when batch started cancelling.
   */
  cancelling_at?: number;

  /**
   * Unix timestamp (seconds) when batch completed.
   */
  completed_at?: number;

  /**
   * ID of file containing errors (if any).
   */
  error_file_id?: string;

  /**
   * List of batch-level errors.
   */
  errors?: {
    data?: BatchError[];
    object?: string;
  };

  /**
   * Unix timestamp (seconds) when batch expired.
   */
  expired_at?: number;

  /**
   * Unix timestamp (seconds) when batch will expire.
   */
  expires_at?: number;

  /**
   * Unix timestamp (seconds) when batch failed.
   */
  failed_at?: number;

  /**
   * Unix timestamp (seconds) when batch started finalizing.
   */
  finalizing_at?: number;

  /**
   * Unix timestamp (seconds) when batch started processing.
   */
  in_progress_at?: number;

  /**
   * Optional metadata (16 key-value pairs max).
   */
  metadata?: Metadata | null;

  /**
   * Model ID used (e.g., `gpt-5-2025-08-07`).
   */
  model?: string;

  /**
   * ID of file containing successful outputs.
   */
  output_file_id?: string;

  /**
   * Request count statistics.
   */
  request_counts?: BatchRequestCounts;

  /**
   * Token usage details (batches created after Sept 7, 2025).
   */
  usage?: BatchUsage;
}
```
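
Monitoring code usually keys on `status` and `request_counts`. As a minimal sketch over the `Batch` shape above (the helper names are illustrative, not part of the SDK):

```typescript
// Statuses after which a batch will not change again
// (derived from the status union documented above).
const TERMINAL_STATUSES = new Set(['completed', 'failed', 'expired', 'cancelled']);

function isTerminal(batch: Batch): boolean {
  return TERMINAL_STATUSES.has(batch.status);
}

function describeProgress(batch: Batch): string {
  const counts = batch.request_counts;
  if (!counts || counts.total === 0) return 'no request counts yet';
  const done = counts.completed + counts.failed;
  return `${done}/${counts.total} requests processed (${counts.failed} failed)`;
}
```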

```typescript { .api }
interface BatchError {
  /**
   * Error code identifying the error type.
   */
  code?: string;

  /**
   * Line number in input file where error occurred.
   */
  line?: number | null;

  /**
   * Human-readable error message.
   */
  message?: string;

  /**
   * Parameter name that caused the error.
   */
  param?: string | null;
}
```
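
Request-level failures land in the batch's error file as JSONL, one record per failed request. A minimal sketch of scanning it; `client.files.content` returns a fetch `Response`, and the per-line record shape (`custom_id`, `error.code`, `error.message`) follows the Batch API's documented output format:

```typescript
import { OpenAI } from 'openai';

const client = new OpenAI();

async function printBatchErrors(errorFileId: string): Promise<void> {
  // files.content returns a Response; the body is line-delimited JSON
  const response = await client.files.content(errorFileId);
  const text = await response.text();

  for (const line of text.split('\n').filter(Boolean)) {
    // Assumed record shape: { custom_id, error: { code, message }, ... }
    const record = JSON.parse(line);
    console.log(`${record.custom_id}: ${record.error?.code} - ${record.error?.message}`);
  }
}
```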

```typescript { .api }
interface BatchRequestCounts {
  /**
   * Number of requests completed successfully.
   */
  completed: number;

  /**
   * Number of requests that failed.
   */
  failed: number;

  /**
   * Total number of requests in the batch.
   */
  total: number;
}
```

```typescript { .api }
interface BatchUsage {
  /**
   * Number of input tokens.
   */
  input_tokens: number;

  /**
   * Detailed breakdown of input tokens.
   */
  input_tokens_details: {
    /**
     * Tokens retrieved from cache.
     */
    cached_tokens: number;
  };

  /**
   * Number of output tokens.
   */
  output_tokens: number;

  /**
   * Detailed breakdown of output tokens.
   */
  output_tokens_details: {
    /**
     * Reasoning tokens used (for reasoning models).
     */
    reasoning_tokens: number;
  };

  /**
   * Total tokens used.
   */
  total_tokens: number;
}
```
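
Cached input tokens are typically billed at a discount, so `usage` shows how much a batch benefited from prompt caching. An illustrative helper over the `BatchUsage` shape above (not part of the SDK):

```typescript
// Parameter shape follows the BatchUsage interface documented above.
function summarizeUsage(usage: {
  input_tokens: number;
  output_tokens: number;
  total_tokens: number;
  input_tokens_details: { cached_tokens: number };
}): string {
  const cached = usage.input_tokens_details.cached_tokens;
  const cacheRate = usage.input_tokens > 0 ? (100 * cached) / usage.input_tokens : 0;
  return (
    `${usage.total_tokens} tokens total ` +
    `(${usage.input_tokens} in / ${usage.output_tokens} out, ` +
    `${cacheRate.toFixed(1)}% of input from cache)`
  );
}

// Usage: if (batch.usage) console.log(summarizeUsage(batch.usage));
```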

## Evaluations API

### Create an Evaluation

Defines the structure of an evaluation with testing criteria and data source configuration. After creation, run the evaluation on different models and parameters.

```typescript { .api }
create(params: EvalCreateParams): Promise<EvalCreateResponse>
```

**Parameters:**

```typescript { .api }
interface EvalCreateParams {
  /**
   * Data source configuration determining the schema of data used in runs.
   * Can be custom, logs, or stored_completions.
   */
  data_source_config:
    | EvalCreateParams.Custom
    | EvalCreateParams.Logs
    | EvalCreateParams.StoredCompletions;

  /**
   * List of graders (testing criteria) for all eval runs.
   * Can reference variables using {{item.variable_name}} or {{sample.output_text}}.
   */
  testing_criteria: Array<
    | EvalCreateParams.LabelModel
    | StringCheckGrader
    | EvalCreateParams.TextSimilarity
    | EvalCreateParams.Python
    | EvalCreateParams.ScoreModel
  >;

  /**
   * Optional metadata (16 key-value pairs max).
   */
  metadata?: Metadata | null;

  /**
   * Optional evaluation name.
   */
  name?: string;
}

namespace EvalCreateParams {
  interface Custom {
    /**
     * JSON schema for each row in the data source.
     */
    item_schema: Record<string, unknown>;

    /**
     * Data source type (always `custom`).
     */
    type: 'custom';

    /**
     * Whether the eval expects you to populate the sample namespace.
     */
    include_sample_schema?: boolean;
  }

  interface Logs {
    /**
     * Data source type (always `logs`).
     */
    type: 'logs';

    /**
     * Metadata filters for the logs query.
     */
    metadata?: Record<string, unknown>;
  }

  interface StoredCompletions {
    /**
     * Data source type (always `stored_completions`).
     * @deprecated Use Logs instead.
     */
    type: 'stored_completions';

    /**
     * Metadata filters for stored completions.
     */
    metadata?: Record<string, unknown>;
  }

  interface LabelModel {
    /**
     * List of messages forming the prompt (may include {{item.variable}}).
     */
    input: Array<{
      content: string;
      role: string;
    }>;

    /**
     * Labels to classify each item.
     */
    labels: string[];

    /**
     * Model to use (must support structured outputs).
     */
    model: string;

    /**
     * Grader name.
     */
    name: string;

    /**
     * Labels indicating a passing result.
     */
    passing_labels: string[];

    /**
     * Type (always `label_model`).
     */
    type: 'label_model';
  }

  interface TextSimilarity extends GraderModelsAPI.TextSimilarityGrader {
    /**
     * Threshold for passing score.
     */
    pass_threshold: number;
  }

  interface Python extends GraderModelsAPI.PythonGrader {
    /**
     * Optional threshold for passing score.
     */
    pass_threshold?: number;
  }

  interface ScoreModel extends GraderModelsAPI.ScoreModelGrader {
    /**
     * Optional threshold for passing score.
     */
    pass_threshold?: number;
  }
}
```

**Example:**

```typescript
// Create evaluation for a customer support chatbot
const evalResponse = await client.evals.create({
  name: 'Customer Support Quality',
  data_source_config: {
    type: 'custom',
    item_schema: {
      type: 'object',
      properties: {
        customer_question: { type: 'string' },
        expected_keyword: { type: 'string' },
      },
      required: ['customer_question'],
    },
    include_sample_schema: true,
  },
  testing_criteria: [
    {
      type: 'string_check',
      name: 'Contains Required Keyword',
      input: '{{sample.output_text}}',
      operation: 'ilike', // case-insensitive containment check
      reference: '{{item.expected_keyword}}',
    },
    {
      type: 'label_model',
      name: 'Tone Assessment',
      model: 'gpt-4o',
      labels: ['professional', 'friendly', 'hostile'],
      passing_labels: ['professional', 'friendly'],
      input: [
        {
          role: 'system',
          content: 'Assess the tone of the response.',
        },
        {
          role: 'user',
          content: '{{sample.output_text}}',
        },
      ],
    },
  ],
});

console.log(`Created evaluation: ${evalResponse.id}`);
```

### Retrieve an Evaluation

Gets details about a specific evaluation.

```typescript { .api }
retrieve(evalID: string): Promise<EvalRetrieveResponse>
```

**Example:**

```typescript
// `eval` is a reserved word in strict mode, so use a different variable name
const evaluation = await client.evals.retrieve('eval_abc123');
console.log(`Evaluation: ${evaluation.name}`);
console.log(`Testing criteria count: ${evaluation.testing_criteria.length}`);
```

### Update an Evaluation

Updates evaluation properties (name, metadata).

```typescript { .api }
update(evalID: string, params: EvalUpdateParams): Promise<EvalUpdateResponse>
```

**Parameters:**

```typescript { .api }
interface EvalUpdateParams {
  /**
   * Optional metadata (16 key-value pairs max).
   */
  metadata?: Metadata | null;

  /**
   * Rename the evaluation.
   */
  name?: string;
}
```

**Example:**

```typescript
const updated = await client.evals.update('eval_abc123', {
  name: 'Customer Support Quality - v2',
  metadata: { version: '2', status: 'production' },
});
```

### List Evaluations

Lists evaluations for your project.

```typescript { .api }
list(params?: EvalListParams): Promise<EvalListResponsesPage>
```

**Parameters:**

```typescript { .api }
interface EvalListParams extends CursorPageParams {
  /**
   * Sort order: `asc` or `desc` (default: `asc`).
   */
  order?: 'asc' | 'desc';

  /**
   * Sort by: `created_at` or `updated_at` (default: `created_at`).
   */
  order_by?: 'created_at' | 'updated_at';
}
```

**Example:**

```typescript
// List all evaluations, newest first
for await (const evaluation of client.evals.list({ order_by: 'created_at', order: 'desc' })) {
  console.log(`${evaluation.name} (${evaluation.id})`);
}
```

### Delete an Evaluation

Deletes an evaluation and all associated runs.

```typescript { .api }
delete(evalID: string): Promise<EvalDeleteResponse>
```

**Example:**

```typescript
const result = await client.evals.delete('eval_abc123');
console.log(`Deleted: ${result.deleted}`);
```

### Evaluation Types

```typescript { .api }
interface EvalCreateResponse {
  /**
   * Unique evaluation identifier.
   */
  id: string;

  /**
   * Unix timestamp (seconds) when evaluation was created.
   */
  created_at: number;

  /**
   * Data source configuration for runs.
   */
  data_source_config:
    | EvalCustomDataSourceConfig
    | EvalCreateResponse.Logs
    | EvalStoredCompletionsDataSourceConfig;

  /**
   * Optional metadata.
   */
  metadata: Metadata | null;

  /**
   * Evaluation name.
   */
  name: string;

  /**
   * Object type (always `eval`).
   */
  object: 'eval';

  /**
   * List of testing criteria (graders).
   */
  testing_criteria: Array<
    | LabelModelGrader
    | StringCheckGrader
    | EvalCreateResponse.EvalGraderTextSimilarity
    | EvalCreateResponse.EvalGraderPython
    | EvalCreateResponse.EvalGraderScoreModel
  >;
}
```

```typescript { .api }
interface EvalCustomDataSourceConfig {
  /**
   * JSON schema for run data source items.
   */
  schema: Record<string, unknown>;

  /**
   * Data source type (always `custom`).
   */
  type: 'custom';
}
```

```typescript { .api }
interface EvalStoredCompletionsDataSourceConfig {
  /**
   * JSON schema for run data source items.
   */
  schema: Record<string, unknown>;

  /**
   * Data source type (always `stored_completions`).
   * @deprecated Use LogsDataSourceConfig instead.
   */
  type: 'stored_completions';

  /**
   * Optional metadata.
   */
  metadata?: Metadata | null;
}
```

```typescript { .api }
interface EvalDeleteResponse {
  deleted: boolean;
  eval_id: string;
  object: string;
}
```

## Evaluation Runs API

### Create a Run

Starts an evaluation run for a given evaluation. Validates the data source against the evaluation schema.

```typescript { .api }
create(evalID: string, params: RunCreateParams): Promise<RunCreateResponse>
```

**Parameters:**

```typescript { .api }
interface RunCreateParams {
  /**
   * Run data source: JSONL, completions, or responses.
   */
  data_source:
    | CreateEvalJSONLRunDataSource
    | CreateEvalCompletionsRunDataSource
    | RunCreateParams.CreateEvalResponsesRunDataSource;

  /**
   * Optional metadata (16 key-value pairs max).
   */
  metadata?: Metadata | null;

  /**
   * Optional run name.
   */
  name?: string;
}
```

**Data Sources:**

```typescript { .api }
interface CreateEvalJSONLRunDataSource {
  /**
   * JSONL source (file content or file ID).
   */
  source:
    | { type: 'file_content'; content: Array<{ item: Record<string, unknown>; sample?: Record<string, unknown> }> }
    | { type: 'file_id'; id: string };

  /**
   * Data source type (always `jsonl`).
   */
  type: 'jsonl';
}

interface CreateEvalCompletionsRunDataSource {
  /**
   * Source configuration.
   */
  source:
    | { type: 'file_content'; content: Array<{ item: Record<string, unknown> }> }
    | { type: 'file_id'; id: string }
    | {
        type: 'stored_completions';
        created_after?: number | null;
        created_before?: number | null;
        limit?: number | null;
        metadata?: Metadata | null;
        model?: string | null;
      };

  /**
   * Data source type (always `completions`).
   */
  type: 'completions';

  /**
   * Input messages (template or item reference).
   */
  input_messages?: { type: 'template'; template: unknown[] } | { type: 'item_reference'; item_reference: string };

  /**
   * Model to use for sampling.
   */
  model?: string;

  /**
   * Sampling parameters (temperature, max_tokens, etc.).
   */
  sampling_params?: Record<string, unknown>;
}
```

**Example:**

```typescript
// Create run with JSONL data source
const run = await client.evals.runs.create('eval_abc123', {
  name: 'Production Test Run',
  data_source: {
    type: 'jsonl',
    source: {
      type: 'file_content',
      content: [
        {
          item: {
            customer_question: 'How do I reset my password?',
            expected_keyword: 'password',
          },
          sample: {
            output_text: 'To reset your password, go to the login page and click "Forgot Password".',
          },
        },
        {
          item: {
            customer_question: 'What are your business hours?',
            expected_keyword: 'hours',
          },
          sample: {
            output_text: 'We are open 9 AM to 5 PM EST, Monday through Friday.',
          },
        },
      ],
    },
  },
});

console.log(`Run ${run.id} started with status: ${run.status}`);
```

### Retrieve a Run

Gets details about a specific evaluation run.

```typescript { .api }
retrieve(runID: string, params: RunRetrieveParams): Promise<RunRetrieveResponse>
```

**Example:**

```typescript
const run = await client.evals.runs.retrieve('run_xyz789', {
  eval_id: 'eval_abc123',
});

console.log(`Status: ${run.status}`);
console.log(`Passed: ${run.result_counts.passed}`);
console.log(`Failed: ${run.result_counts.failed}`);
console.log(`Report: ${run.report_url}`);
```

### List Runs

Lists evaluation runs for a given evaluation.

```typescript { .api }
list(evalID: string, params?: RunListParams): Promise<RunListResponsesPage>
```

**Parameters:**

```typescript { .api }
interface RunListParams extends CursorPageParams {
  /**
   * Sort order: `asc` or `desc` (default: `asc`).
   */
  order?: 'asc' | 'desc';

  /**
   * Filter by status: queued, in_progress, completed, canceled, failed.
   */
  status?: 'queued' | 'in_progress' | 'completed' | 'canceled' | 'failed';
}
```

**Example:**

```typescript
// List completed runs
const runs = await client.evals.runs.list('eval_abc123', {
  status: 'completed',
  order: 'desc',
});

for await (const run of runs) {
  console.log(`${run.name}: ${run.status}`);
}
```

### Delete a Run

Deletes an evaluation run.

```typescript { .api }
delete(runID: string, params: RunDeleteParams): Promise<RunDeleteResponse>
```

**Example:**

```typescript
await client.evals.runs.delete('run_xyz789', { eval_id: 'eval_abc123' });
```

### Cancel a Run

Cancels an ongoing evaluation run.

```typescript { .api }
cancel(runID: string, params: RunCancelParams): Promise<RunCancelResponse>
```

**Example:**

```typescript
const cancelled = await client.evals.runs.cancel('run_xyz789', {
  eval_id: 'eval_abc123',
});

console.log(`Cancelled: ${cancelled.id}`);
```

### Run Types

```typescript { .api }
interface RunCreateResponse {
  /**
   * Unique run identifier.
   */
  id: string;

  /**
   * Unix timestamp (seconds) when run was created.
   */
  created_at: number;

  /**
   * Run data source configuration.
   */
  data_source:
    | CreateEvalJSONLRunDataSource
    | CreateEvalCompletionsRunDataSource
    | RunCreateResponse.Responses;

  /**
   * Error information (if applicable).
   */
  error: EvalAPIError;

  /**
   * Associated evaluation ID.
   */
  eval_id: string;

  /**
   * Optional metadata.
   */
  metadata: Metadata | null;

  /**
   * Model being evaluated.
   */
  model: string;

  /**
   * Run name.
   */
  name: string;

  /**
   * Object type (always `eval.run`).
   */
  object: 'eval.run';

  /**
   * Per-model token usage statistics.
   */
  per_model_usage: Array<{
    cached_tokens: number;
    completion_tokens: number;
    invocation_count: number;
    model_name: string;
    prompt_tokens: number;
    total_tokens: number;
  }>;

  /**
   * Results per testing criteria.
   */
  per_testing_criteria_results: Array<{
    failed: number;
    passed: number;
    testing_criteria: string;
  }>;

  /**
   * URL to rendered report on dashboard.
   */
  report_url: string;

  /**
   * Result counts summarizing outcomes.
   */
  result_counts: {
    errored: number;
    failed: number;
    passed: number;
    total: number;
  };

  /**
   * Run status.
   */
  status: string;
}
```
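
`per_testing_criteria_results` shows which grader is responsible for failures. A small illustrative helper over that shape (the names here are ours, not the SDK's):

```typescript
type CriteriaResult = { failed: number; passed: number; testing_criteria: string };

// Print the pass rate for each grader in a run.
function reportCriteria(results: CriteriaResult[]): void {
  for (const r of results) {
    const total = r.passed + r.failed;
    const rate = total > 0 ? ((r.passed / total) * 100).toFixed(1) : 'n/a';
    console.log(`${r.testing_criteria}: ${r.passed}/${total} passed (${rate}%)`);
  }
}

// Usage: reportCriteria(run.per_testing_criteria_results);
```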

```typescript { .api }
interface EvalAPIError {
  /**
   * Error code.
   */
  code: string;

  /**
   * Error message.
   */
  message: string;
}
```

## Output Items API

### Retrieve an Output Item

Gets a specific output item from an evaluation run.

```typescript { .api }
retrieve(outputItemID: string, params: OutputItemRetrieveParams): Promise<OutputItemRetrieveResponse>
```

**Parameters:**

```typescript { .api }
interface OutputItemRetrieveParams {
  /**
   * The evaluation ID.
   */
  eval_id: string;

  /**
   * The run ID.
   */
  run_id: string;
}
```

**Example:**

```typescript
const item = await client.evals.runs.outputItems.retrieve('item_123', {
  eval_id: 'eval_abc123',
  run_id: 'run_xyz789',
});

console.log(`Item status: ${item.status}`);
console.log(`Results:`);
item.results.forEach(r => {
  console.log(`  ${r.name}: ${r.passed ? 'PASSED' : 'FAILED'} (${r.score})`);
});
```

### List Output Items

Lists output items for an evaluation run.

```typescript { .api }
list(runID: string, params: OutputItemListParams): Promise<OutputItemListResponsesPage>
```

**Parameters:**

```typescript { .api }
interface OutputItemListParams extends CursorPageParams {
  /**
   * The evaluation ID.
   */
  eval_id: string;

  /**
   * Sort order: `asc` or `desc` (default: `asc`).
   */
  order?: 'asc' | 'desc';

  /**
   * Filter by status: `fail` or `pass`.
   */
  status?: 'fail' | 'pass';
}
```

**Example:**

```typescript
// List failed output items
const failed = await client.evals.runs.outputItems.list('run_xyz789', {
  eval_id: 'eval_abc123',
  status: 'fail',
});

for await (const item of failed) {
  console.log(`Failed item: ${item.id}`);
  item.results.forEach(r => {
    if (!r.passed) {
      console.log(`  ${r.name}: score ${r.score}`);
    }
  });
}
```

### Output Item Types

```typescript { .api }
interface OutputItemRetrieveResponse {
  /**
   * Unique output item identifier.
   */
  id: string;

  /**
   * Unix timestamp (seconds) when created.
   */
  created_at: number;

  /**
   * Input data source item details.
   */
  datasource_item: Record<string, unknown>;

  /**
   * Data source item identifier.
   */
  datasource_item_id: number;

  /**
   * Evaluation ID.
   */
  eval_id: string;

  /**
   * Object type (always `eval.run.output_item`).
   */
  object: 'eval.run.output_item';

  /**
   * List of grader results.
   */
  results: Array<{
    /**
     * Grader name.
     */
    name: string;

    /**
     * Whether grader passed.
     */
    passed: boolean;

    /**
     * Numeric score from grader.
     */
    score: number;

    /**
     * Optional sample data from grader.
     */
    sample?: Record<string, unknown> | null;

    /**
     * Grader type identifier.
     */
    type?: string;

    [k: string]: unknown;
  }>;

  /**
   * Associated run ID.
   */
  run_id: string;

  /**
   * Sample with input and output.
   */
  sample: {
    /**
     * Error information (if applicable).
     */
    error: EvalAPIError;

    /**
     * Finish reason (e.g., "stop", "max_tokens").
     */
    finish_reason: string;

    /**
     * Input messages.
     */
    input: Array<{
      content: string;
      role: string;
    }>;

    /**
     * Maximum tokens for completion.
     */
    max_completion_tokens: number;

    /**
     * Model used.
     */
    model: string;

    /**
     * Output messages.
     */
    output: Array<{
      content?: string;
      role?: string;
    }>;

    /**
     * Seed used.
     */
    seed: number;

    /**
     * Temperature used.
     */
    temperature: number;

    /**
     * Top-p (nucleus sampling) value.
     */
    top_p: number;

    /**
     * Token usage.
     */
    usage: {
      cached_tokens: number;
      completion_tokens: number;
      prompt_tokens: number;
      total_tokens: number;
    };
  };

  /**
   * Status of the output item.
   */
  status: string;
}
```
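
When a grader fails, `sample` carries what is needed to reproduce the model call: the input messages, the output, and the sampling settings. A sketch of a debug printer over the `OutputItemRetrieveResponse` shape above (the helper name is illustrative):

```typescript
function debugOutputItem(item: OutputItemRetrieveResponse): void {
  console.log(`Item ${item.id} (${item.status})`);
  console.log(`  model: ${item.sample.model}, temperature: ${item.sample.temperature}`);
  console.log(`  finish_reason: ${item.sample.finish_reason}`);
  // Replay the conversation that produced the sampled output
  for (const message of item.sample.input) {
    console.log(`  [${message.role}] ${message.content}`);
  }
  for (const message of item.sample.output) {
    console.log(`  => ${message.content ?? ''}`);
  }
  console.log(`  tokens: ${item.sample.usage.total_tokens}`);
}
```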

## Complete Workflow Examples

### Batch Processing Workflow

```typescript
import { OpenAI, toFile } from 'openai';

const client = new OpenAI();

async function processBatch() {
  // 1. Prepare batch requests
  const requests = [
    {
      custom_id: 'translation-1',
      method: 'POST',
      url: '/v1/chat/completions',
      body: {
        model: 'gpt-4o',
        messages: [{ role: 'user', content: 'Translate "hello" to French' }],
      },
    },
    {
      custom_id: 'summary-1',
      method: 'POST',
      url: '/v1/chat/completions',
      body: {
        model: 'gpt-4o',
        messages: [{ role: 'user', content: 'Summarize: OpenAI creates AI models.' }],
      },
    },
  ];

  // 2. Upload requests file
  const file = await client.files.create({
    file: await toFile(
      Buffer.from(requests.map(r => JSON.stringify(r)).join('\n')),
      'requests.jsonl',
    ),
    purpose: 'batch',
  });

  // 3. Create batch
  const batch = await client.batches.create({
    input_file_id: file.id,
    endpoint: '/v1/chat/completions',
    completion_window: '24h',
  });

  console.log(`Batch ${batch.id} created with status: ${batch.status}`);

  // 4. Poll until the batch reaches a terminal state (or use webhooks)
  let completed = batch;
  while (!['completed', 'failed', 'expired', 'cancelled'].includes(completed.status)) {
    await new Promise(resolve => setTimeout(resolve, 30000)); // Wait 30s
    completed = await client.batches.retrieve(batch.id);
    console.log(`Status: ${completed.status}`);
  }

  // 5. Retrieve results (files.content returns a fetch Response)
  if (completed.output_file_id) {
    const results = await client.files.content(completed.output_file_id);
    console.log('Results:', await results.text());
  }

  // 6. Check for errors
  if (completed.error_file_id) {
    const errors = await client.files.content(completed.error_file_id);
    console.log('Errors:', await errors.text());
  }
}

processBatch().catch(console.error);
```

### Evaluation Workflow

```typescript
import { OpenAI } from 'openai';

const client = new OpenAI();

async function runEvaluation() {
  // 1. Create evaluation (`eval` is a reserved word, so name the variable differently)
  const evaluation = await client.evals.create({
    name: 'Support Response Quality',
    data_source_config: {
      type: 'custom',
      item_schema: {
        type: 'object',
        properties: {
          question: { type: 'string' },
          expected_answer: { type: 'string' },
        },
      },
      include_sample_schema: true,
    },
    testing_criteria: [
      {
        type: 'string_check',
        name: 'Contains Key Term',
        input: '{{sample.output_text}}',
        operation: 'ilike', // case-insensitive containment check
        reference: 'resolved',
      },
      {
        type: 'label_model',
        name: 'Tone Check',
        model: 'gpt-4o',
        labels: ['professional', 'casual', 'rude'],
        passing_labels: ['professional', 'casual'],
        input: [
          {
            role: 'system',
            content: 'Rate the tone of this response.',
          },
          {
            role: 'user',
            content: '{{sample.output_text}}',
          },
        ],
      },
    ],
  });

  console.log(`Created evaluation ${evaluation.id}`);

  // 2. Create run with test data
  const run = await client.evals.runs.create(evaluation.id, {
    name: 'First Run',
    data_source: {
      type: 'jsonl',
      source: {
        type: 'file_content',
        content: [
          {
            item: {
              question: 'How do I upgrade my account?',
              expected_answer: 'Go to settings',
            },
            sample: {
              output_text: 'Your issue has been resolved. Go to settings to upgrade.',
            },
          },
        ],
      },
    },
  });

  console.log(`Run ${run.id} started`);

  // 3. Poll for completion
  let completed = false;
  while (!completed) {
    const updated = await client.evals.runs.retrieve(run.id, { eval_id: evaluation.id });
    if (['completed', 'failed', 'canceled'].includes(updated.status)) {
      completed = true;
      console.log(`Run completed with status: ${updated.status}`);
      console.log(`Results: ${updated.result_counts.passed} passed, ${updated.result_counts.failed} failed`);
      console.log(`Report: ${updated.report_url}`);
    } else {
      await new Promise(resolve => setTimeout(resolve, 5000));
    }
  }

  // 4. Analyze output items
  for await (const item of client.evals.runs.outputItems.list(run.id, { eval_id: evaluation.id, status: 'fail' })) {
    console.log(`Failed item ${item.id}:`);
    item.results.forEach(r => {
      console.log(`  ${r.name}: score ${r.score}`);
    });
  }
}

runEvaluation().catch(console.error);
```

### Data Source Examples

#### JSONL File Format

```json
{"item": {"question": "What is 2+2?", "expected": "4"}, "sample": {"output": "2+2 equals 4"}}
{"item": {"question": "What is the capital of France?", "expected": "Paris"}, "sample": {"output": "The capital of France is Paris"}}
```
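
Rather than hand-writing these lines, serialize records with `JSON.stringify` and upload the result, then reference it with a `file_id` source. A sketch; the `evals` file purpose is an assumption about the Files API, so check the files-uploads docs for the accepted values:

```typescript
import { OpenAI, toFile } from 'openai';

const client = new OpenAI();

const rows = [
  { item: { question: 'What is 2+2?', expected: '4' }, sample: { output: '2+2 equals 4' } },
  { item: { question: 'What is the capital of France?', expected: 'Paris' }, sample: { output: 'The capital of France is Paris' } },
];

// One JSON object per line, newline-delimited
const jsonl = rows.map(row => JSON.stringify(row)).join('\n');

const file = await client.files.create({
  file: await toFile(Buffer.from(jsonl), 'eval-data.jsonl'),
  purpose: 'evals', // assumption: the purpose value for eval data files
});

// Then reference it in a run:
// data_source: { type: 'jsonl', source: { type: 'file_id', id: file.id } }
```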

#### Stored Completions Data Source

```typescript
const run = await client.evals.runs.create(evaluation.id, {
  name: 'Test with Stored Completions',
  data_source: {
    type: 'completions',
    source: {
      type: 'stored_completions',
      created_after: 1700000000, // Unix timestamp
      model: 'gpt-4o',
      metadata: { usecase: 'support', version: 'v2' },
    },
    model: 'gpt-4o-mini',
    input_messages: {
      type: 'template',
      template: [
        { role: 'system', content: 'You are a helpful assistant.' },
        { role: 'user', content: '{{item.prompt}}' },
      ],
    },
  },
});
```

## Best Practices

### Batches

- Use batches for non-time-critical workloads (high-volume, cost-optimized)
- Validate JSONL format before uploading (ensure proper line-delimited JSON; see the sketch after this list)
- Monitor batch status via polling or webhooks
- Implement error handling for failed requests in output files
- Use metadata to track batch purpose and versions
- Archive output files for compliance and analysis
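
As referenced above, a minimal pre-upload check confirms that every line parses as JSON and carries the fields batch requests require (`custom_id`, `method`, `url`, `body`); the helper name is illustrative:

```typescript
function validateBatchJsonl(jsonl: string): string[] {
  const problems: string[] = [];
  jsonl.split('\n').forEach((line, index) => {
    if (!line.trim()) return; // ignore blank lines
    try {
      const request = JSON.parse(line);
      for (const field of ['custom_id', 'method', 'url', 'body']) {
        if (!(field in request)) {
          problems.push(`line ${index + 1}: missing "${field}"`);
        }
      }
    } catch {
      problems.push(`line ${index + 1}: not valid JSON`);
    }
  });
  return problems;
}
```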

### Evaluations

- Start with a small, well-curated dataset to validate criteria
- Use multiple grader types (label models, string checks, similarity) for comprehensive assessment
- Reference data using `{{item.field}}` for inputs and `{{sample.output_text}}` for model outputs
- Set appropriate pass thresholds that reflect your quality requirements
- Use the stored completions data source for historical model outputs
- Access detailed reports via the `report_url` for visualizations
- Track evaluation results across model versions for performance comparison (see the sketch after this list)
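
As noted in the last point, pass rates across runs of the same evaluation can be compared straight from `result_counts`. A hedged sketch, assuming listed runs carry the same fields as `RunCreateResponse` above:

```typescript
import { OpenAI } from 'openai';

const client = new OpenAI();

async function compareRuns(evalId: string): Promise<void> {
  for await (const run of client.evals.runs.list(evalId, { status: 'completed' })) {
    const { passed, total } = run.result_counts;
    const rate = total > 0 ? ((passed / total) * 100).toFixed(1) : 'n/a';
    console.log(`${run.name} [${run.model}]: ${rate}% passed (${passed}/${total})`);
  }
}
```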
