# Batches and Evaluations

Comprehensive API reference for batch processing and evaluation management in the OpenAI Node.js library. Use batches to process multiple API requests asynchronously, and use evaluations to systematically test model performance against defined criteria.

## Overview

### Batches

Batches allow you to send multiple API requests in a single operation, processed asynchronously by OpenAI. This is ideal for high-volume, non-time-sensitive workloads where cost efficiency is important.

### Evaluations (Evals)

Evaluations provide a framework to systematically assess model outputs against defined testing criteria. Run evaluations on different models, configurations, and data sources to compare performance.

## Batches API

### Create a Batch

Submits a new batch job for processing. The batch is created from a JSONL file containing API requests.

```typescript { .api }
create(params: BatchCreateParams): Promise<Batch>
```

**Parameters:**

```typescript { .api }
interface BatchCreateParams {
  /**
   * The time frame within which the batch should be processed.
   * Currently only `24h` is supported.
   */
  completion_window: '24h';

  /**
   * The endpoint to be used for all requests in the batch.
   * Supported: `/v1/responses`, `/v1/chat/completions`, `/v1/embeddings`,
   * `/v1/completions`, `/v1/moderations`.
   * Note: `/v1/embeddings` batches are limited to 50,000 embedding inputs.
   */
  endpoint:
    | '/v1/responses'
    | '/v1/chat/completions'
    | '/v1/embeddings'
    | '/v1/completions'
    | '/v1/moderations';

  /**
   * The ID of an uploaded file containing requests for the batch.
   * Must be a JSONL file uploaded with purpose `batch`.
   * Max 50,000 requests, 200 MB file size.
   */
  input_file_id: string;

  /**
   * Optional metadata (16 key-value pairs max).
   * Keys: max 64 chars; Values: max 512 chars.
   */
  metadata?: Metadata | null;

  /**
   * Optional expiration policy for output/error files.
   */
  output_expires_after?: {
    /**
     * Anchor timestamp: `created_at` (file creation time).
     */
    anchor: 'created_at';
    /**
     * Seconds after anchor: 3600 (1 hour) to 2592000 (30 days).
     */
    seconds: number;
  };
}
```

**Example:**

```typescript
import OpenAI, { toFile } from 'openai';

const client = new OpenAI();

// 1. Create a JSONL file with batch requests
const batchRequests = [
  {
    custom_id: 'request-1',
    method: 'POST',
    url: '/v1/chat/completions',
    body: {
      model: 'gpt-4o',
      messages: [{ role: 'user', content: 'Translate "hello" to French' }],
      max_tokens: 100,
    },
  },
  {
    custom_id: 'request-2',
    method: 'POST',
    url: '/v1/chat/completions',
    body: {
      model: 'gpt-4o',
      messages: [{ role: 'user', content: 'Translate "goodbye" to Spanish' }],
      max_tokens: 100,
    },
  },
];

// 2. Upload the file (toFile attaches the filename the Files API expects)
const file = await client.files.create({
  file: await toFile(
    Buffer.from(batchRequests.map(r => JSON.stringify(r)).join('\n')),
    'batch_requests.jsonl',
  ),
  purpose: 'batch',
});

// 3. Create the batch
const batch = await client.batches.create({
  input_file_id: file.id,
  endpoint: '/v1/chat/completions',
  completion_window: '24h',
});

console.log(`Batch ${batch.id} submitted`);
```
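
The upload step above boils down to serializing request objects as line-delimited JSON. A small pre-serialization helper (a sketch — `toBatchJSONL` and `BatchRequestLine` are illustrative names, not SDK exports) makes the documented limits explicit:

```typescript
interface BatchRequestLine {
  custom_id: string;
  method: 'POST';
  url: string;
  body: Record<string, unknown>;
}

// Hypothetical helper: serialize requests to line-delimited JSON,
// enforcing the documented 50,000-request limit and unique custom_ids.
function toBatchJSONL(requests: BatchRequestLine[]): string {
  if (requests.length > 50_000) {
    throw new Error('A batch may contain at most 50,000 requests');
  }
  const ids = new Set(requests.map(r => r.custom_id));
  if (ids.size !== requests.length) {
    throw new Error('custom_id values must be unique within a batch');
  }
  return requests.map(r => JSON.stringify(r)).join('\n');
}
```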

### Retrieve a Batch

Retrieves details about a specific batch job.

```typescript { .api }
retrieve(batchID: string): Promise<Batch>
```

**Example:**

```typescript
const batch = await client.batches.retrieve('batch_abc123');
console.log(`Batch status: ${batch.status}`);
console.log(`Completed: ${batch.request_counts?.completed}`);
console.log(`Failed: ${batch.request_counts?.failed}`);
```

### List Batches

Retrieves a paginated list of batch jobs for your organization.

```typescript { .api }
list(params?: BatchListParams): Promise<BatchesPage>
```

**Parameters:**

```typescript { .api }
interface BatchListParams extends CursorPageParams {
  // Pagination parameters inherited from CursorPageParams
}
```

**Example:**

```typescript
// List all batches
for await (const batch of client.batches.list()) {
  console.log(`${batch.id}: ${batch.status}`);
}

// List with pagination
const page = await client.batches.list();
if (page.hasNextPage()) {
  const nextPage = await page.getNextPage();
}
```

### Cancel a Batch

Cancels a batch that is in progress. The batch transitions to `cancelling` status for up to 10 minutes, then becomes `cancelled`, with any partial results available.

```typescript { .api }
cancel(batchID: string): Promise<Batch>
```

**Example:**

```typescript
const cancelled = await client.batches.cancel('batch_abc123');
console.log(`Batch status: ${cancelled.status}`); // 'cancelling'
```
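
Callers polling after a cancel (or after any status check) need to know which statuses are terminal. A minimal predicate, assuming the status values listed in the `Batch` type (`isTerminalBatchStatus` is a hypothetical helper, not an SDK export):

```typescript
type BatchStatus =
  | 'validating' | 'failed' | 'in_progress' | 'finalizing'
  | 'completed' | 'expired' | 'cancelling' | 'cancelled';

// Statuses after which a batch will no longer change.
const TERMINAL_STATUSES = new Set<BatchStatus>([
  'completed', 'failed', 'expired', 'cancelled',
]);

function isTerminalBatchStatus(status: BatchStatus): boolean {
  return TERMINAL_STATUSES.has(status);
}
```

Note that `cancelling` is not terminal: a polling loop should keep waiting until the batch settles into `cancelled`.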

### Batch Types

```typescript { .api }
interface Batch {
  /**
   * Unique batch identifier.
   */
  id: string;

  /**
   * The completion window (currently always `24h`).
   */
  completion_window: string;

  /**
   * Unix timestamp (seconds) when batch was created.
   */
  created_at: number;

  /**
   * The API endpoint used by this batch.
   */
  endpoint: string;

  /**
   * The ID of the input file containing requests.
   */
  input_file_id: string;

  /**
   * Object type identifier (always `batch`).
   */
  object: 'batch';

  /**
   * Current batch status: validating, failed, in_progress, finalizing,
   * completed, expired, cancelling, or cancelled.
   */
  status:
    | 'validating'
    | 'failed'
    | 'in_progress'
    | 'finalizing'
    | 'completed'
    | 'expired'
    | 'cancelling'
    | 'cancelled';

  /**
   * Unix timestamp (seconds) when batch was cancelled (if applicable).
   */
  cancelled_at?: number;

  /**
   * Unix timestamp (seconds) when batch started cancelling.
   */
  cancelling_at?: number;

  /**
   * Unix timestamp (seconds) when batch completed.
   */
  completed_at?: number;

  /**
   * ID of file containing errors (if any).
   */
  error_file_id?: string;

  /**
   * List of batch-level errors.
   */
  errors?: {
    data?: BatchError[];
    object?: string;
  };

  /**
   * Unix timestamp (seconds) when batch expired.
   */
  expired_at?: number;

  /**
   * Unix timestamp (seconds) when batch will expire.
   */
  expires_at?: number;

  /**
   * Unix timestamp (seconds) when batch failed.
   */
  failed_at?: number;

  /**
   * Unix timestamp (seconds) when batch started finalizing.
   */
  finalizing_at?: number;

  /**
   * Unix timestamp (seconds) when batch started processing.
   */
  in_progress_at?: number;

  /**
   * Optional metadata (16 key-value pairs max).
   */
  metadata?: Metadata | null;

  /**
   * Model ID used (e.g., `gpt-5-2025-08-07`).
   */
  model?: string;

  /**
   * ID of file containing successful outputs.
   */
  output_file_id?: string;

  /**
   * Request count statistics.
   */
  request_counts?: BatchRequestCounts;

  /**
   * Token usage details (batches created after Sept 7, 2025).
   */
  usage?: BatchUsage;
}
```

```typescript { .api }
interface BatchError {
  /**
   * Error code identifying the error type.
   */
  code?: string;

  /**
   * Line number in input file where error occurred.
   */
  line?: number | null;

  /**
   * Human-readable error message.
   */
  message?: string;

  /**
   * Parameter name that caused the error.
   */
  param?: string | null;
}
```

```typescript { .api }
interface BatchRequestCounts {
  /**
   * Number of requests completed successfully.
   */
  completed: number;

  /**
   * Number of requests that failed.
   */
  failed: number;

  /**
   * Total number of requests in the batch.
   */
  total: number;
}
```

```typescript { .api }
interface BatchUsage {
  /**
   * Number of input tokens.
   */
  input_tokens: number;

  /**
   * Detailed breakdown of input tokens.
   */
  input_tokens_details: {
    /**
     * Tokens retrieved from cache.
     */
    cached_tokens: number;
  };

  /**
   * Number of output tokens.
   */
  output_tokens: number;

  /**
   * Detailed breakdown of output tokens.
   */
  output_tokens_details: {
    /**
     * Reasoning tokens used (for reasoning models).
     */
    reasoning_tokens: number;
  };

  /**
   * Total tokens used.
   */
  total_tokens: number;
}
```
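
Derived metrics such as the cache hit rate fall out of `BatchUsage` arithmetic directly. A quick sketch (the helper names are illustrative, not SDK APIs):

```typescript
interface BatchUsageLike {
  input_tokens: number;
  input_tokens_details: { cached_tokens: number };
  output_tokens: number;
  output_tokens_details: { reasoning_tokens: number };
  total_tokens: number;
}

// Fraction of input tokens served from the prompt cache.
function cacheHitRate(usage: BatchUsageLike): number {
  if (usage.input_tokens === 0) return 0;
  return usage.input_tokens_details.cached_tokens / usage.input_tokens;
}

// Sanity check: total should equal input + output.
function usageIsConsistent(usage: BatchUsageLike): boolean {
  return usage.total_tokens === usage.input_tokens + usage.output_tokens;
}
```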

## Evaluations API

### Create an Evaluation

Defines the structure of an evaluation with testing criteria and data source configuration. After creation, run the evaluation on different models and parameters.

```typescript { .api }
create(params: EvalCreateParams): Promise<EvalCreateResponse>
```

**Parameters:**

```typescript { .api }
interface EvalCreateParams {
  /**
   * Data source configuration determining the schema of data used in runs.
   * Can be custom, logs, or stored_completions.
   */
  data_source_config:
    | EvalCreateParams.Custom
    | EvalCreateParams.Logs
    | EvalCreateParams.StoredCompletions;

  /**
   * List of graders (testing criteria) for all eval runs.
   * Can reference variables using {{item.variable_name}} or {{sample.output_text}}.
   */
  testing_criteria: Array<
    | EvalCreateParams.LabelModel
    | StringCheckGrader
    | EvalCreateParams.TextSimilarity
    | EvalCreateParams.Python
    | EvalCreateParams.ScoreModel
  >;

  /**
   * Optional metadata (16 key-value pairs max).
   */
  metadata?: Metadata | null;

  /**
   * Optional evaluation name.
   */
  name?: string;
}

namespace EvalCreateParams {
  interface Custom {
    /**
     * JSON schema for each row in the data source.
     */
    item_schema: Record<string, unknown>;
    /**
     * Data source type (always `custom`).
     */
    type: 'custom';
    /**
     * Whether the eval expects you to populate the sample namespace.
     */
    include_sample_schema?: boolean;
  }

  interface Logs {
    /**
     * Data source type (always `logs`).
     */
    type: 'logs';
    /**
     * Metadata filters for the logs query.
     */
    metadata?: Record<string, unknown>;
  }

  interface StoredCompletions {
    /**
     * Data source type (always `stored_completions`).
     * @deprecated Use Logs instead.
     */
    type: 'stored_completions';
    /**
     * Metadata filters for stored completions.
     */
    metadata?: Record<string, unknown>;
  }

  interface LabelModel {
    /**
     * List of messages forming the prompt (may include {{item.variable}}).
     */
    input: Array<{
      content: string;
      role: string;
    }>;
    /**
     * Labels to classify each item.
     */
    labels: string[];
    /**
     * Model to use (must support structured outputs).
     */
    model: string;
    /**
     * Grader name.
     */
    name: string;
    /**
     * Labels indicating a passing result.
     */
    passing_labels: string[];
    /**
     * Type (always `label_model`).
     */
    type: 'label_model';
  }

  interface TextSimilarity extends GraderModelsAPI.TextSimilarityGrader {
    /**
     * Threshold for passing score.
     */
    pass_threshold: number;
  }

  interface Python extends GraderModelsAPI.PythonGrader {
    /**
     * Optional threshold for passing score.
     */
    pass_threshold?: number;
  }

  interface ScoreModel extends GraderModelsAPI.ScoreModelGrader {
    /**
     * Optional threshold for passing score.
     */
    pass_threshold?: number;
  }
}
```

**Example:**

```typescript
// Create an evaluation for a customer support chatbot
const evalResponse = await client.evals.create({
  name: 'Customer Support Quality',
  data_source_config: {
    type: 'custom',
    item_schema: {
      type: 'object',
      properties: {
        customer_question: { type: 'string' },
        expected_keywords: { type: 'array', items: { type: 'string' } },
      },
      required: ['customer_question'],
    },
    include_sample_schema: true,
  },
  testing_criteria: [
    {
      // String-check graders compare `input` against `reference`;
      // `ilike` is a case-insensitive "contains" check.
      type: 'string_check',
      name: 'Contains Required Keywords',
      input: '{{sample.output_text}}',
      operation: 'ilike',
      reference: '{{item.expected_keywords}}',
    },
    {
      type: 'label_model',
      name: 'Tone Assessment',
      model: 'gpt-4o',
      labels: ['professional', 'friendly', 'hostile'],
      passing_labels: ['professional', 'friendly'],
      input: [
        {
          role: 'system',
          content: 'Assess the tone of the response.',
        },
        {
          role: 'user',
          content: '{{sample.output_text}}',
        },
      ],
    },
  ],
});

console.log(`Created evaluation: ${evalResponse.id}`);
```

### Retrieve an Evaluation

Gets details about a specific evaluation.

```typescript { .api }
retrieve(evalID: string): Promise<EvalRetrieveResponse>
```

**Example:**

```typescript
// `eval` is a reserved identifier, so bind the result to another name
const evaluation = await client.evals.retrieve('eval_abc123');
console.log(`Evaluation: ${evaluation.name}`);
console.log(`Testing criteria count: ${evaluation.testing_criteria.length}`);
```

### Update an Evaluation

Updates evaluation properties (name, metadata).

```typescript { .api }
update(evalID: string, params: EvalUpdateParams): Promise<EvalUpdateResponse>
```

**Parameters:**

```typescript { .api }
interface EvalUpdateParams {
  /**
   * Optional metadata (16 key-value pairs max).
   */
  metadata?: Metadata | null;

  /**
   * Rename the evaluation.
   */
  name?: string;
}
```

**Example:**

```typescript
const updated = await client.evals.update('eval_abc123', {
  name: 'Customer Support Quality - v2',
  metadata: { version: '2', status: 'production' },
});
```

### List Evaluations

Lists evaluations for your project.

```typescript { .api }
list(params?: EvalListParams): Promise<EvalListResponsesPage>
```

**Parameters:**

```typescript { .api }
interface EvalListParams extends CursorPageParams {
  /**
   * Sort order: `asc` or `desc` (default: `asc`).
   */
  order?: 'asc' | 'desc';

  /**
   * Sort by: `created_at` or `updated_at` (default: `created_at`).
   */
  order_by?: 'created_at' | 'updated_at';
}
```

**Example:**

```typescript
// List all evaluations sorted by creation date
for await (const evaluation of client.evals.list({ order_by: 'created_at', order: 'desc' })) {
  console.log(`${evaluation.name} (${evaluation.id})`);
}
```

### Delete an Evaluation

Deletes an evaluation and all associated runs.

```typescript { .api }
delete(evalID: string): Promise<EvalDeleteResponse>
```

**Example:**

```typescript
const result = await client.evals.delete('eval_abc123');
console.log(`Deleted: ${result.deleted}`);
```

### Evaluation Types

```typescript { .api }
interface EvalCreateResponse {
  /**
   * Unique evaluation identifier.
   */
  id: string;

  /**
   * Unix timestamp (seconds) when evaluation was created.
   */
  created_at: number;

  /**
   * Data source configuration for runs.
   */
  data_source_config:
    | EvalCustomDataSourceConfig
    | EvalCreateResponse.Logs
    | EvalStoredCompletionsDataSourceConfig;

  /**
   * Optional metadata.
   */
  metadata: Metadata | null;

  /**
   * Evaluation name.
   */
  name: string;

  /**
   * Object type (always `eval`).
   */
  object: 'eval';

  /**
   * List of testing criteria (graders).
   */
  testing_criteria: Array<
    | LabelModelGrader
    | StringCheckGrader
    | EvalCreateResponse.EvalGraderTextSimilarity
    | EvalCreateResponse.EvalGraderPython
    | EvalCreateResponse.EvalGraderScoreModel
  >;
}
```

```typescript { .api }
interface EvalCustomDataSourceConfig {
  /**
   * JSON schema for run data source items.
   */
  schema: Record<string, unknown>;

  /**
   * Data source type (always `custom`).
   */
  type: 'custom';
}
```

```typescript { .api }
interface EvalStoredCompletionsDataSourceConfig {
  /**
   * JSON schema for run data source items.
   */
  schema: Record<string, unknown>;

  /**
   * Data source type (always `stored_completions`).
   * @deprecated Use LogsDataSourceConfig instead.
   */
  type: 'stored_completions';

  /**
   * Optional metadata.
   */
  metadata?: Metadata | null;
}
```

```typescript { .api }
interface EvalDeleteResponse {
  deleted: boolean;
  eval_id: string;
  object: string;
}
```

## Evaluation Runs API

### Create a Run

Starts an evaluation run for a given evaluation. The data source is validated against the evaluation's schema.

```typescript { .api }
create(evalID: string, params: RunCreateParams): Promise<RunCreateResponse>
```

**Parameters:**

```typescript { .api }
interface RunCreateParams {
  /**
   * Run data source: JSONL, completions, or responses.
   */
  data_source:
    | CreateEvalJSONLRunDataSource
    | CreateEvalCompletionsRunDataSource
    | RunCreateParams.CreateEvalResponsesRunDataSource;

  /**
   * Optional metadata (16 key-value pairs max).
   */
  metadata?: Metadata | null;

  /**
   * Optional run name.
   */
  name?: string;
}
```

**Data Sources:**

```typescript { .api }
interface CreateEvalJSONLRunDataSource {
  /**
   * JSONL source (file content or file ID).
   */
  source:
    | { type: 'file_content'; content: Array<{ item: Record<string, unknown>; sample?: Record<string, unknown> }> }
    | { type: 'file_id'; id: string };

  /**
   * Data source type (always `jsonl`).
   */
  type: 'jsonl';
}

interface CreateEvalCompletionsRunDataSource {
  /**
   * Source configuration.
   */
  source:
    | { type: 'file_content'; content: Array<{ item: Record<string, unknown> }> }
    | { type: 'file_id'; id: string }
    | {
        type: 'stored_completions';
        created_after?: number | null;
        created_before?: number | null;
        limit?: number | null;
        metadata?: Metadata | null;
        model?: string | null;
      };

  /**
   * Data source type (always `completions`).
   */
  type: 'completions';

  /**
   * Input messages (template or item reference).
   */
  input_messages?: { type: 'template'; template: unknown[] } | { type: 'item_reference'; item_reference: string };

  /**
   * Model to use for sampling.
   */
  model?: string;

  /**
   * Sampling parameters (temperature, max_tokens, etc.).
   */
  sampling_params?: Record<string, unknown>;
}
```

**Example:**

```typescript
// Create a run with a JSONL data source
const run = await client.evals.runs.create('eval_abc123', {
  name: 'Production Test Run',
  data_source: {
    type: 'jsonl',
    source: {
      type: 'file_content',
      content: [
        {
          item: {
            customer_question: 'How do I reset my password?',
            expected_keywords: ['password', 'reset', 'account'],
          },
          sample: {
            output_text: 'To reset your password, go to the login page and click "Forgot Password".',
          },
        },
        {
          item: {
            customer_question: 'What are your business hours?',
            expected_keywords: ['hours', 'open', 'close'],
          },
          sample: {
            output_text: 'We are open 9 AM to 5 PM EST, Monday through Friday.',
          },
        },
      ],
    },
  },
});

console.log(`Run ${run.id} started with status: ${run.status}`);
```

### Retrieve a Run

Gets details about a specific evaluation run.

```typescript { .api }
retrieve(runID: string, params: RunRetrieveParams): Promise<RunRetrieveResponse>
```

**Example:**

```typescript
const run = await client.evals.runs.retrieve('run_xyz789', {
  eval_id: 'eval_abc123',
});

console.log(`Status: ${run.status}`);
console.log(`Passed: ${run.result_counts.passed}`);
console.log(`Failed: ${run.result_counts.failed}`);
console.log(`Report: ${run.report_url}`);
```

### List Runs

Lists evaluation runs for a given evaluation.

```typescript { .api }
list(evalID: string, params?: RunListParams): Promise<RunListResponsesPage>
```

**Parameters:**

```typescript { .api }
interface RunListParams extends CursorPageParams {
  /**
   * Sort order: `asc` or `desc` (default: `asc`).
   */
  order?: 'asc' | 'desc';

  /**
   * Filter by status: queued, in_progress, completed, canceled, failed.
   */
  status?: 'queued' | 'in_progress' | 'completed' | 'canceled' | 'failed';
}
```

**Example:**

```typescript
// List completed runs
const runs = await client.evals.runs.list('eval_abc123', {
  status: 'completed',
  order: 'desc',
});

for await (const run of runs) {
  console.log(`${run.name}: ${run.status}`);
}
```

### Delete a Run

Deletes an evaluation run.

```typescript { .api }
delete(runID: string, params: RunDeleteParams): Promise<RunDeleteResponse>
```

**Example:**

```typescript
await client.evals.runs.delete('run_xyz789', { eval_id: 'eval_abc123' });
```

### Cancel a Run

Cancels an ongoing evaluation run.

```typescript { .api }
cancel(runID: string, params: RunCancelParams): Promise<RunCancelResponse>
```

**Example:**

```typescript
const cancelled = await client.evals.runs.cancel('run_xyz789', {
  eval_id: 'eval_abc123',
});

console.log(`Cancelled: ${cancelled.id}`);
```

### Run Types

```typescript { .api }
interface RunCreateResponse {
  /**
   * Unique run identifier.
   */
  id: string;

  /**
   * Unix timestamp (seconds) when run was created.
   */
  created_at: number;

  /**
   * Run data source configuration.
   */
  data_source:
    | CreateEvalJSONLRunDataSource
    | CreateEvalCompletionsRunDataSource
    | RunCreateResponse.Responses;

  /**
   * Error information (if applicable).
   */
  error: EvalAPIError;

  /**
   * Associated evaluation ID.
   */
  eval_id: string;

  /**
   * Optional metadata.
   */
  metadata: Metadata | null;

  /**
   * Model being evaluated.
   */
  model: string;

  /**
   * Run name.
   */
  name: string;

  /**
   * Object type (always `eval.run`).
   */
  object: 'eval.run';

  /**
   * Per-model token usage statistics.
   */
  per_model_usage: Array<{
    cached_tokens: number;
    completion_tokens: number;
    invocation_count: number;
    model_name: string;
    prompt_tokens: number;
    total_tokens: number;
  }>;

  /**
   * Results per testing criteria.
   */
  per_testing_criteria_results: Array<{
    failed: number;
    passed: number;
    testing_criteria: string;
  }>;

  /**
   * URL to rendered report on dashboard.
   */
  report_url: string;

  /**
   * Result counts summarizing outcomes.
   */
  result_counts: {
    errored: number;
    failed: number;
    passed: number;
    total: number;
  };

  /**
   * Run status.
   */
  status: string;
}
```
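
The `per_testing_criteria_results` array reports raw pass/fail counts per grader; turning them into pass rates is a one-line reduction. A sketch (`passRates` is a hypothetical helper, not part of the SDK):

```typescript
interface CriteriaResult {
  failed: number;
  passed: number;
  testing_criteria: string;
}

// Hypothetical helper: map each grader name to its pass rate in [0, 1].
function passRates(results: CriteriaResult[]): Record<string, number> {
  const rates: Record<string, number> = {};
  for (const r of results) {
    const total = r.passed + r.failed;
    rates[r.testing_criteria] = total === 0 ? 0 : r.passed / total;
  }
  return rates;
}
```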

```typescript { .api }
interface EvalAPIError {
  /**
   * Error code.
   */
  code: string;

  /**
   * Error message.
   */
  message: string;
}
```

## Output Items API

### Retrieve an Output Item

Gets a specific output item from an evaluation run.

```typescript { .api }
retrieve(outputItemID: string, params: OutputItemRetrieveParams): Promise<OutputItemRetrieveResponse>
```

**Parameters:**

```typescript { .api }
interface OutputItemRetrieveParams {
  /**
   * The evaluation ID.
   */
  eval_id: string;

  /**
   * The run ID.
   */
  run_id: string;
}
```

**Example:**

```typescript
const item = await client.evals.runs.outputItems.retrieve('item_123', {
  eval_id: 'eval_abc123',
  run_id: 'run_xyz789',
});

console.log(`Item status: ${item.status}`);
console.log(`Results:`);
item.results.forEach(r => {
  console.log(`  ${r.name}: ${r.passed ? 'PASSED' : 'FAILED'} (${r.score})`);
});
```

### List Output Items

Lists output items for an evaluation run.

```typescript { .api }
list(runID: string, params: OutputItemListParams): Promise<OutputItemListResponsesPage>
```

**Parameters:**

```typescript { .api }
interface OutputItemListParams extends CursorPageParams {
  /**
   * The evaluation ID.
   */
  eval_id: string;

  /**
   * Sort order: `asc` or `desc` (default: `asc`).
   */
  order?: 'asc' | 'desc';

  /**
   * Filter by status: `fail` or `pass`.
   */
  status?: 'fail' | 'pass';
}
```

**Example:**

```typescript
// List failed output items
const failed = await client.evals.runs.outputItems.list('run_xyz789', {
  eval_id: 'eval_abc123',
  status: 'fail',
});

for await (const item of failed) {
  console.log(`Failed item: ${item.id}`);
  item.results.forEach(r => {
    if (!r.passed) {
      console.log(`  ${r.name}: score ${r.score}`);
    }
  });
}
```

### Output Item Types

```typescript { .api }
interface OutputItemRetrieveResponse {
  /**
   * Unique output item identifier.
   */
  id: string;

  /**
   * Unix timestamp (seconds) when created.
   */
  created_at: number;

  /**
   * Input data source item details.
   */
  datasource_item: Record<string, unknown>;

  /**
   * Data source item identifier.
   */
  datasource_item_id: number;

  /**
   * Evaluation ID.
   */
  eval_id: string;

  /**
   * Object type (always `eval.run.output_item`).
   */
  object: 'eval.run.output_item';

  /**
   * List of grader results.
   */
  results: Array<{
    /**
     * Grader name.
     */
    name: string;

    /**
     * Whether grader passed.
     */
    passed: boolean;

    /**
     * Numeric score from grader.
     */
    score: number;

    /**
     * Optional sample data from grader.
     */
    sample?: Record<string, unknown> | null;

    /**
     * Grader type identifier.
     */
    type?: string;

    [k: string]: unknown;
  }>;

  /**
   * Associated run ID.
   */
  run_id: string;

  /**
   * Sample with input and output.
   */
  sample: {
    /**
     * Error information (if applicable).
     */
    error: EvalAPIError;

    /**
     * Finish reason (e.g., "stop", "max_tokens").
     */
    finish_reason: string;

    /**
     * Input messages.
     */
    input: Array<{
      content: string;
      role: string;
    }>;

    /**
     * Maximum tokens for completion.
     */
    max_completion_tokens: number;

    /**
     * Model used.
     */
    model: string;

    /**
     * Output messages.
     */
    output: Array<{
      content?: string;
      role?: string;
    }>;

    /**
     * Seed used.
     */
    seed: number;

    /**
     * Temperature used.
     */
    temperature: number;

    /**
     * Top-p (nucleus sampling) value.
     */
    top_p: number;

    /**
     * Token usage.
     */
    usage: {
      cached_tokens: number;
      completion_tokens: number;
      prompt_tokens: number;
      total_tokens: number;
    };
  };

  /**
   * Status of the output item.
   */
  status: string;
}
```

## Complete Workflow Examples

### Batch Processing Workflow

```typescript
import OpenAI, { toFile } from 'openai';

const client = new OpenAI();

async function processBatch() {
  // 1. Prepare batch requests
  const requests = [
    {
      custom_id: 'translation-1',
      method: 'POST',
      url: '/v1/chat/completions',
      body: {
        model: 'gpt-4o',
        messages: [{ role: 'user', content: 'Translate "hello" to French' }],
      },
    },
    {
      custom_id: 'summary-1',
      method: 'POST',
      url: '/v1/chat/completions',
      body: {
        model: 'gpt-4o',
        messages: [{ role: 'user', content: 'Summarize: OpenAI creates AI models.' }],
      },
    },
  ];

  // 2. Upload requests file
  const file = await client.files.create({
    file: await toFile(
      Buffer.from(requests.map(r => JSON.stringify(r)).join('\n')),
      'batch_requests.jsonl',
    ),
    purpose: 'batch',
  });

  // 3. Create batch
  const batch = await client.batches.create({
    input_file_id: file.id,
    endpoint: '/v1/chat/completions',
    completion_window: '24h',
  });

  console.log(`Batch ${batch.id} created with status: ${batch.status}`);

  // 4. Poll until a terminal status (or use webhooks)
  let completed = batch;
  while (!['completed', 'failed', 'expired', 'cancelled'].includes(completed.status)) {
    await new Promise(resolve => setTimeout(resolve, 30000)); // Wait 30s
    completed = await client.batches.retrieve(batch.id);
    console.log(`Status: ${completed.status}`);
  }

  // 5. Retrieve results (files.content resolves to a Response)
  if (completed.output_file_id) {
    const output = await client.files.content(completed.output_file_id);
    console.log('Results:', await output.text());
  }

  // 6. Check for errors
  if (completed.error_file_id) {
    const errors = await client.files.content(completed.error_file_id);
    console.log('Errors:', await errors.text());
  }
}

processBatch().catch(console.error);
```
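
Each line of the output file downloaded in step 5 is a JSON object carrying the request's `custom_id` and either a `response` or an `error`, per the Batch API file format. A sketch for rejoining results with their inputs (the helper and the narrowed types are illustrative):

```typescript
interface BatchOutputLine {
  custom_id: string;
  response?: { status_code: number; body: unknown } | null;
  error?: { code: string; message: string } | null;
}

// Parse the downloaded output file text into a map keyed by custom_id.
function indexBatchOutput(fileText: string): Map<string, BatchOutputLine> {
  const byId = new Map<string, BatchOutputLine>();
  for (const line of fileText.split('\n')) {
    if (!line.trim()) continue; // skip trailing blank lines
    const parsed = JSON.parse(line) as BatchOutputLine;
    byId.set(parsed.custom_id, parsed);
  }
  return byId;
}
```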

### Evaluation Workflow

```typescript
import { OpenAI } from 'openai';

const client = new OpenAI();

async function runEvaluation() {
  // 1. Create evaluation (`eval` is a reserved identifier, so use another name)
  const evaluation = await client.evals.create({
    name: 'Support Response Quality',
    data_source_config: {
      type: 'custom',
      item_schema: {
        type: 'object',
        properties: {
          question: { type: 'string' },
          expected_answer: { type: 'string' },
        },
      },
      include_sample_schema: true,
    },
    testing_criteria: [
      {
        // `ilike` is a case-insensitive "contains" comparison
        type: 'string_check',
        name: 'Contains Key Term',
        input: '{{sample.output_text}}',
        operation: 'ilike',
        reference: 'resolved',
      },
      {
        type: 'label_model',
        name: 'Tone Check',
        model: 'gpt-4o',
        labels: ['professional', 'casual', 'rude'],
        passing_labels: ['professional', 'casual'],
        input: [
          {
            role: 'system',
            content: 'Rate the tone of this response.',
          },
          {
            role: 'user',
            content: '{{sample.output_text}}',
          },
        ],
      },
    ],
  });

  console.log(`Created evaluation ${evaluation.id}`);

  // 2. Create run with test data
  const run = await client.evals.runs.create(evaluation.id, {
    name: 'First Run',
    data_source: {
      type: 'jsonl',
      source: {
        type: 'file_content',
        content: [
          {
            item: {
              question: 'How do I upgrade my account?',
              expected_answer: 'Go to settings',
            },
            sample: {
              output_text: 'Your issue has been resolved. Go to settings to upgrade.',
            },
          },
        ],
      },
    },
  });

  console.log(`Run ${run.id} started`);

  // 3. Poll for completion
  let completed = false;
  while (!completed) {
    const updated = await client.evals.runs.retrieve(run.id, { eval_id: evaluation.id });
    if (['completed', 'failed'].includes(updated.status)) {
      completed = true;
      console.log(`Run completed with status: ${updated.status}`);
      console.log(`Results: ${updated.result_counts.passed} passed, ${updated.result_counts.failed} failed`);
      console.log(`Report: ${updated.report_url}`);
    } else {
      await new Promise(resolve => setTimeout(resolve, 5000));
    }
  }

  // 4. Analyze output items
  for await (const item of client.evals.runs.outputItems.list(run.id, { eval_id: evaluation.id, status: 'fail' })) {
    console.log(`Failed item ${item.id}:`);
    item.results.forEach(r => {
      console.log(`  ${r.name}: score ${r.score}`);
    });
  }
}

runEvaluation().catch(console.error);
```

### Data Source Examples

#### JSONL File Format

```json
{"item": {"question": "What is 2+2?", "expected": "4"}, "sample": {"output": "2+2 equals 4"}}
{"item": {"question": "What is the capital of France?", "expected": "Paris"}, "sample": {"output": "The capital of France is Paris"}}
```
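
Rows in this format can be read from disk and passed directly as a `file_content` source. A conversion sketch (assuming every non-empty line parses as a row; `jsonlToContent` is an illustrative name, not an SDK export):

```typescript
interface EvalRow {
  item: Record<string, unknown>;
  sample?: Record<string, unknown>;
}

// Convert JSONL text into the `content` array expected by a
// { type: 'file_content' } run data source.
function jsonlToContent(text: string): EvalRow[] {
  return text
    .split('\n')
    .filter(line => line.trim().length > 0)
    .map(line => JSON.parse(line) as EvalRow);
}
```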

#### Stored Completions Data Source

```typescript
const run = await client.evals.runs.create('eval_abc123', {
  name: 'Test with Stored Completions',
  data_source: {
    type: 'completions',
    source: {
      type: 'stored_completions',
      created_after: 1700000000, // Unix timestamp
      model: 'gpt-4o',
      metadata: { usecase: 'support', version: 'v2' },
    },
    model: 'gpt-4o-mini',
    input_messages: {
      type: 'template',
      template: [
        { role: 'system', content: 'You are a helpful assistant.' },
        { role: 'user', content: '{{item.prompt}}' },
      ],
    },
  },
});
```

## Best Practices

### Batches

- Use batches for non-time-critical workloads (high-volume, cost-optimized)
- Validate JSONL format before uploading (ensure proper line-delimited JSON)
- Monitor batch status via polling or webhooks
- Implement error handling for failed requests in output files
- Use metadata to track batch purpose and versions
- Archive output files for compliance and analysis
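
The "validate JSONL format" point above can be automated as a pre-flight check. A sketch (field requirements follow the batch input format; `validateBatchJSONL` is a hypothetical helper):

```typescript
// Hypothetical pre-flight check for a batch input file: every line must be
// valid JSON with custom_id, method, url, and body fields.
function validateBatchJSONL(text: string): string[] {
  const problems: string[] = [];
  const lines = text.split('\n').filter(l => l.trim().length > 0);
  lines.forEach((line, i) => {
    try {
      const req = JSON.parse(line);
      for (const field of ['custom_id', 'method', 'url', 'body']) {
        if (!(field in req)) problems.push(`line ${i + 1}: missing ${field}`);
      }
    } catch {
      problems.push(`line ${i + 1}: not valid JSON`);
    }
  });
  return problems;
}
```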

### Evaluations

- Start with a small, well-curated dataset to validate criteria
- Use multiple grader types (label models, string checks, similarity) for comprehensive assessment
- Reference data using `{{item.field}}` for inputs and `{{sample.output_text}}` for model outputs
- Set appropriate pass thresholds that reflect your quality requirements
- Use the stored completions data source for historical model outputs
- Access detailed reports via the `report_url` for visualizations
- Track evaluation results across model versions for performance comparison
1556