# Batches and Evaluations

Comprehensive API reference for batch processing and evaluation management in the OpenAI Node.js library. Use batches to process multiple API requests asynchronously, and use evaluations to systematically test model performance against defined criteria.

## Overview

### Batches

Batches allow you to send multiple API requests in a single operation, processed asynchronously by OpenAI. This is ideal for high-volume, non-time-sensitive workloads where cost efficiency is important.

### Evaluations (Evals)

Evaluations provide a framework to systematically assess model outputs against defined testing criteria. Run evaluations on different models, configurations, and data sources to compare performance.

## Batches API

### Create a Batch

Submits a new batch job for processing. The batch is created from a JSONL file containing API requests.

```typescript { .api }
create(params: BatchCreateParams): Promise<Batch>
```

**Parameters:**

```typescript { .api }
interface BatchCreateParams {
  /**
   * The time frame within which the batch should be processed.
   * Currently only `24h` is supported.
   */
  completion_window: '24h';

  /**
   * The endpoint to be used for all requests in the batch.
   * Supported: `/v1/responses`, `/v1/chat/completions`, `/v1/embeddings`,
   * `/v1/completions`, `/v1/moderations`.
   * Note: `/v1/embeddings` batches are limited to 50,000 embedding inputs.
   */
  endpoint:
    | '/v1/responses'
    | '/v1/chat/completions'
    | '/v1/embeddings'
    | '/v1/completions'
    | '/v1/moderations';

  /**
   * The ID of an uploaded file containing requests for the batch.
   * Must be a JSONL file uploaded with purpose `batch`.
   * Max 50,000 requests, 200 MB file size.
   */
  input_file_id: string;

  /**
   * Optional metadata (16 key-value pairs max).
   * Keys: max 64 chars; Values: max 512 chars.
   */
  metadata?: Metadata | null;

  /**
   * Optional expiration policy for output/error files.
   */
  output_expires_after?: {
    /**
     * Anchor timestamp: `created_at` (file creation time).
     */
    anchor: 'created_at';
    /**
     * Seconds after anchor: 3600 (1 hour) to 2592000 (30 days).
     */
    seconds: number;
  };
}
```

**Example:**

```typescript
import OpenAI, { toFile } from 'openai';

const client = new OpenAI();

// 1. Create a JSONL file with batch requests
const batchRequests = [
  {
    custom_id: 'request-1',
    method: 'POST',
    url: '/v1/chat/completions',
    body: {
      model: 'gpt-4o',
      messages: [{ role: 'user', content: 'Translate "hello" to French' }],
      max_tokens: 100,
    },
  },
  {
    custom_id: 'request-2',
    method: 'POST',
    url: '/v1/chat/completions',
    body: {
      model: 'gpt-4o',
      messages: [{ role: 'user', content: 'Translate "goodbye" to Spanish' }],
      max_tokens: 100,
    },
  },
];

// 2. Upload the file (toFile attaches the filename the Files API expects)
const file = await client.files.create({
  file: await toFile(
    Buffer.from(batchRequests.map(r => JSON.stringify(r)).join('\n')),
    'batch_requests.jsonl',
  ),
  purpose: 'batch',
});

// 3. Create the batch
const batch = await client.batches.create({
  input_file_id: file.id,
  endpoint: '/v1/chat/completions',
  completion_window: '24h',
});

console.log(`Batch ${batch.id} submitted`);
```
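
The upload step above boils down to serializing request objects as line-delimited JSON. A small pre-serialization helper (a sketch — `toBatchJSONL` and `BatchRequestLine` are illustrative names, not SDK exports) makes the documented limits explicit:

```typescript
interface BatchRequestLine {
  custom_id: string;
  method: 'POST';
  url: string;
  body: Record<string, unknown>;
}

// Hypothetical helper: serialize requests to line-delimited JSON,
// enforcing the documented 50,000-request limit and unique custom_ids.
function toBatchJSONL(requests: BatchRequestLine[]): string {
  if (requests.length > 50_000) {
    throw new Error('A batch may contain at most 50,000 requests');
  }
  const ids = new Set(requests.map(r => r.custom_id));
  if (ids.size !== requests.length) {
    throw new Error('custom_id values must be unique within a batch');
  }
  return requests.map(r => JSON.stringify(r)).join('\n');
}
```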

### Retrieve a Batch

Retrieves details about a specific batch job.

```typescript { .api }
retrieve(batchID: string): Promise<Batch>
```

**Example:**

```typescript
const batch = await client.batches.retrieve('batch_abc123');
console.log(`Batch status: ${batch.status}`);
console.log(`Completed: ${batch.request_counts?.completed}`);
console.log(`Failed: ${batch.request_counts?.failed}`);
```

### List Batches

Retrieves a paginated list of batch jobs for your organization.

```typescript { .api }
list(params?: BatchListParams): Promise<BatchesPage>
```

**Parameters:**

```typescript { .api }
interface BatchListParams extends CursorPageParams {
  // Pagination parameters inherited from CursorPageParams
}
```

**Example:**

```typescript
// List all batches
for await (const batch of client.batches.list()) {
  console.log(`${batch.id}: ${batch.status}`);
}

// List with pagination
const page = await client.batches.list();
if (page.hasNextPage()) {
  const nextPage = await page.getNextPage();
}
```

### Cancel a Batch

Cancels a batch that is in progress. The batch transitions to `cancelling` status for up to 10 minutes, then becomes `cancelled`, with any partial results available.

```typescript { .api }
cancel(batchID: string): Promise<Batch>
```

**Example:**

```typescript
const cancelled = await client.batches.cancel('batch_abc123');
console.log(`Batch status: ${cancelled.status}`); // 'cancelling'
```
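
Callers polling after a cancel (or after any status check) need to know which statuses are terminal. A minimal predicate, assuming the status values listed in the `Batch` type (`isTerminalBatchStatus` is a hypothetical helper, not an SDK export):

```typescript
type BatchStatus =
  | 'validating' | 'failed' | 'in_progress' | 'finalizing'
  | 'completed' | 'expired' | 'cancelling' | 'cancelled';

// Statuses after which a batch will no longer change.
const TERMINAL_STATUSES = new Set<BatchStatus>([
  'completed', 'failed', 'expired', 'cancelled',
]);

function isTerminalBatchStatus(status: BatchStatus): boolean {
  return TERMINAL_STATUSES.has(status);
}
```

Note that `cancelling` is not terminal: a polling loop should keep waiting until the batch settles into `cancelled`.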

### Batch Types

```typescript { .api }
interface Batch {
  /**
   * Unique batch identifier.
   */
  id: string;

  /**
   * The completion window (currently always `24h`).
   */
  completion_window: string;

  /**
   * Unix timestamp (seconds) when batch was created.
   */
  created_at: number;

  /**
   * The API endpoint used by this batch.
   */
  endpoint: string;

  /**
   * The ID of the input file containing requests.
   */
  input_file_id: string;

  /**
   * Object type identifier (always `batch`).
   */
  object: 'batch';

  /**
   * Current batch status: validating, failed, in_progress, finalizing,
   * completed, expired, cancelling, or cancelled.
   */
  status:
    | 'validating'
    | 'failed'
    | 'in_progress'
    | 'finalizing'
    | 'completed'
    | 'expired'
    | 'cancelling'
    | 'cancelled';

  /**
   * Unix timestamp (seconds) when batch was cancelled (if applicable).
   */
  cancelled_at?: number;

  /**
   * Unix timestamp (seconds) when batch started cancelling.
   */
  cancelling_at?: number;

  /**
   * Unix timestamp (seconds) when batch completed.
   */
  completed_at?: number;

  /**
   * ID of file containing errors (if any).
   */
  error_file_id?: string;

  /**
   * List of batch-level errors.
   */
  errors?: {
    data?: BatchError[];
    object?: string;
  };

  /**
   * Unix timestamp (seconds) when batch expired.
   */
  expired_at?: number;

  /**
   * Unix timestamp (seconds) when batch will expire.
   */
  expires_at?: number;

  /**
   * Unix timestamp (seconds) when batch failed.
   */
  failed_at?: number;

  /**
   * Unix timestamp (seconds) when batch started finalizing.
   */
  finalizing_at?: number;

  /**
   * Unix timestamp (seconds) when batch started processing.
   */
  in_progress_at?: number;

  /**
   * Optional metadata (16 key-value pairs max).
   */
  metadata?: Metadata | null;

  /**
   * Model ID used (e.g., `gpt-5-2025-08-07`).
   */
  model?: string;

  /**
   * ID of file containing successful outputs.
   */
  output_file_id?: string;

  /**
   * Request count statistics.
   */
  request_counts?: BatchRequestCounts;

  /**
   * Token usage details (batches created after Sept 7, 2025).
   */
  usage?: BatchUsage;
}
```

```typescript { .api }
interface BatchError {
  /**
   * Error code identifying the error type.
   */
  code?: string;

  /**
   * Line number in input file where error occurred.
   */
  line?: number | null;

  /**
   * Human-readable error message.
   */
  message?: string;

  /**
   * Parameter name that caused the error.
   */
  param?: string | null;
}
```

```typescript { .api }
interface BatchRequestCounts {
  /**
   * Number of requests completed successfully.
   */
  completed: number;

  /**
   * Number of requests that failed.
   */
  failed: number;

  /**
   * Total number of requests in the batch.
   */
  total: number;
}
```

```typescript { .api }
interface BatchUsage {
  /**
   * Number of input tokens.
   */
  input_tokens: number;

  /**
   * Detailed breakdown of input tokens.
   */
  input_tokens_details: {
    /**
     * Tokens retrieved from cache.
     */
    cached_tokens: number;
  };

  /**
   * Number of output tokens.
   */
  output_tokens: number;

  /**
   * Detailed breakdown of output tokens.
   */
  output_tokens_details: {
    /**
     * Reasoning tokens used (for reasoning models).
     */
    reasoning_tokens: number;
  };

  /**
   * Total tokens used.
   */
  total_tokens: number;
}
```
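
Derived metrics such as the cache hit rate fall out of `BatchUsage` arithmetic directly. A quick sketch (the helper names are illustrative, not SDK APIs):

```typescript
interface BatchUsageLike {
  input_tokens: number;
  input_tokens_details: { cached_tokens: number };
  output_tokens: number;
  output_tokens_details: { reasoning_tokens: number };
  total_tokens: number;
}

// Fraction of input tokens served from the prompt cache.
function cacheHitRate(usage: BatchUsageLike): number {
  if (usage.input_tokens === 0) return 0;
  return usage.input_tokens_details.cached_tokens / usage.input_tokens;
}

// Sanity check: total should equal input + output.
function usageIsConsistent(usage: BatchUsageLike): boolean {
  return usage.total_tokens === usage.input_tokens + usage.output_tokens;
}
```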

## Evaluations API

### Create an Evaluation

Defines the structure of an evaluation with testing criteria and data source configuration. After creation, run the evaluation on different models and parameters.

```typescript { .api }
create(params: EvalCreateParams): Promise<EvalCreateResponse>
```

**Parameters:**

```typescript { .api }
interface EvalCreateParams {
  /**
   * Data source configuration determining the schema of data used in runs.
   * Can be custom, logs, or stored_completions.
   */
  data_source_config:
    | EvalCreateParams.Custom
    | EvalCreateParams.Logs
    | EvalCreateParams.StoredCompletions;

  /**
   * List of graders (testing criteria) for all eval runs.
   * Can reference variables using {{item.variable_name}} or {{sample.output_text}}.
   */
  testing_criteria: Array<
    | EvalCreateParams.LabelModel
    | StringCheckGrader
    | EvalCreateParams.TextSimilarity
    | EvalCreateParams.Python
    | EvalCreateParams.ScoreModel
  >;

  /**
   * Optional metadata (16 key-value pairs max).
   */
  metadata?: Metadata | null;

  /**
   * Optional evaluation name.
   */
  name?: string;
}

namespace EvalCreateParams {
  interface Custom {
    /**
     * JSON schema for each row in the data source.
     */
    item_schema: Record<string, unknown>;
    /**
     * Data source type (always `custom`).
     */
    type: 'custom';
    /**
     * Whether the eval expects you to populate the sample namespace.
     */
    include_sample_schema?: boolean;
  }

  interface Logs {
    /**
     * Data source type (always `logs`).
     */
    type: 'logs';
    /**
     * Metadata filters for the logs query.
     */
    metadata?: Record<string, unknown>;
  }

  interface StoredCompletions {
    /**
     * Data source type (always `stored_completions`).
     * @deprecated Use Logs instead.
     */
    type: 'stored_completions';
    /**
     * Metadata filters for stored completions.
     */
    metadata?: Record<string, unknown>;
  }

  interface LabelModel {
    /**
     * List of messages forming the prompt (may include {{item.variable}}).
     */
    input: Array<{
      content: string;
      role: string;
    }>;
    /**
     * Labels to classify each item.
     */
    labels: string[];
    /**
     * Model to use (must support structured outputs).
     */
    model: string;
    /**
     * Grader name.
     */
    name: string;
    /**
     * Labels indicating a passing result.
     */
    passing_labels: string[];
    /**
     * Type (always `label_model`).
     */
    type: 'label_model';
  }

  interface TextSimilarity extends GraderModelsAPI.TextSimilarityGrader {
    /**
     * Threshold for passing score.
     */
    pass_threshold: number;
  }

  interface Python extends GraderModelsAPI.PythonGrader {
    /**
     * Optional threshold for passing score.
     */
    pass_threshold?: number;
  }

  interface ScoreModel extends GraderModelsAPI.ScoreModelGrader {
    /**
     * Optional threshold for passing score.
     */
    pass_threshold?: number;
  }
}
```

**Example:**

```typescript
// Create an evaluation for a customer support chatbot
const evalResponse = await client.evals.create({
  name: 'Customer Support Quality',
  data_source_config: {
    type: 'custom',
    item_schema: {
      type: 'object',
      properties: {
        customer_question: { type: 'string' },
        expected_keywords: { type: 'array', items: { type: 'string' } },
      },
      required: ['customer_question'],
    },
    include_sample_schema: true,
  },
  testing_criteria: [
    {
      // String-check graders compare `input` against `reference`;
      // `ilike` is a case-insensitive "contains" check.
      type: 'string_check',
      name: 'Contains Required Keywords',
      input: '{{sample.output_text}}',
      operation: 'ilike',
      reference: '{{item.expected_keywords}}',
    },
    {
      type: 'label_model',
      name: 'Tone Assessment',
      model: 'gpt-4o',
      labels: ['professional', 'friendly', 'hostile'],
      passing_labels: ['professional', 'friendly'],
      input: [
        {
          role: 'system',
          content: 'Assess the tone of the response.',
        },
        {
          role: 'user',
          content: '{{sample.output_text}}',
        },
      ],
    },
  ],
});

console.log(`Created evaluation: ${evalResponse.id}`);
```

### Retrieve an Evaluation

Gets details about a specific evaluation.

```typescript { .api }
retrieve(evalID: string): Promise<EvalRetrieveResponse>
```

**Example:**

```typescript
// `eval` is a reserved identifier, so bind the result to another name
const evaluation = await client.evals.retrieve('eval_abc123');
console.log(`Evaluation: ${evaluation.name}`);
console.log(`Testing criteria count: ${evaluation.testing_criteria.length}`);
```

### Update an Evaluation

Updates evaluation properties (name, metadata).

```typescript { .api }
update(evalID: string, params: EvalUpdateParams): Promise<EvalUpdateResponse>
```

**Parameters:**

```typescript { .api }
interface EvalUpdateParams {
  /**
   * Optional metadata (16 key-value pairs max).
   */
  metadata?: Metadata | null;

  /**
   * Rename the evaluation.
   */
  name?: string;
}
```

**Example:**

```typescript
const updated = await client.evals.update('eval_abc123', {
  name: 'Customer Support Quality - v2',
  metadata: { version: '2', status: 'production' },
});
```

### List Evaluations

Lists evaluations for your project.

```typescript { .api }
list(params?: EvalListParams): Promise<EvalListResponsesPage>
```

**Parameters:**

```typescript { .api }
interface EvalListParams extends CursorPageParams {
  /**
   * Sort order: `asc` or `desc` (default: `asc`).
   */
  order?: 'asc' | 'desc';

  /**
   * Sort by: `created_at` or `updated_at` (default: `created_at`).
   */
  order_by?: 'created_at' | 'updated_at';
}
```

**Example:**

```typescript
// List all evaluations sorted by creation date
for await (const evaluation of client.evals.list({ order_by: 'created_at', order: 'desc' })) {
  console.log(`${evaluation.name} (${evaluation.id})`);
}
```

### Delete an Evaluation

Deletes an evaluation and all associated runs.

```typescript { .api }
delete(evalID: string): Promise<EvalDeleteResponse>
```

**Example:**

```typescript
const result = await client.evals.delete('eval_abc123');
console.log(`Deleted: ${result.deleted}`);
```

### Evaluation Types

```typescript { .api }
interface EvalCreateResponse {
  /**
   * Unique evaluation identifier.
   */
  id: string;

  /**
   * Unix timestamp (seconds) when evaluation was created.
   */
  created_at: number;

  /**
   * Data source configuration for runs.
   */
  data_source_config:
    | EvalCustomDataSourceConfig
    | EvalCreateResponse.Logs
    | EvalStoredCompletionsDataSourceConfig;

  /**
   * Optional metadata.
   */
  metadata: Metadata | null;

  /**
   * Evaluation name.
   */
  name: string;

  /**
   * Object type (always `eval`).
   */
  object: 'eval';

  /**
   * List of testing criteria (graders).
   */
  testing_criteria: Array<
    | LabelModelGrader
    | StringCheckGrader
    | EvalCreateResponse.EvalGraderTextSimilarity
    | EvalCreateResponse.EvalGraderPython
    | EvalCreateResponse.EvalGraderScoreModel
  >;
}
```

```typescript { .api }
interface EvalCustomDataSourceConfig {
  /**
   * JSON schema for run data source items.
   */
  schema: Record<string, unknown>;

  /**
   * Data source type (always `custom`).
   */
  type: 'custom';
}
```

```typescript { .api }
interface EvalStoredCompletionsDataSourceConfig {
  /**
   * JSON schema for run data source items.
   */
  schema: Record<string, unknown>;

  /**
   * Data source type (always `stored_completions`).
   * @deprecated Use LogsDataSourceConfig instead.
   */
  type: 'stored_completions';

  /**
   * Optional metadata.
   */
  metadata?: Metadata | null;
}
```

```typescript { .api }
interface EvalDeleteResponse {
  deleted: boolean;
  eval_id: string;
  object: string;
}
```

## Evaluation Runs API

### Create a Run

Starts an evaluation run for a given evaluation. The data source is validated against the evaluation's schema.

```typescript { .api }
create(evalID: string, params: RunCreateParams): Promise<RunCreateResponse>
```

**Parameters:**

```typescript { .api }
interface RunCreateParams {
  /**
   * Run data source: JSONL, completions, or responses.
   */
  data_source:
    | CreateEvalJSONLRunDataSource
    | CreateEvalCompletionsRunDataSource
    | RunCreateParams.CreateEvalResponsesRunDataSource;

  /**
   * Optional metadata (16 key-value pairs max).
   */
  metadata?: Metadata | null;

  /**
   * Optional run name.
   */
  name?: string;
}
```

**Data Sources:**

```typescript { .api }
interface CreateEvalJSONLRunDataSource {
  /**
   * JSONL source (file content or file ID).
   */
  source:
    | { type: 'file_content'; content: Array<{ item: Record<string, unknown>; sample?: Record<string, unknown> }> }
    | { type: 'file_id'; id: string };

  /**
   * Data source type (always `jsonl`).
   */
  type: 'jsonl';
}

interface CreateEvalCompletionsRunDataSource {
  /**
   * Source configuration.
   */
  source:
    | { type: 'file_content'; content: Array<{ item: Record<string, unknown> }> }
    | { type: 'file_id'; id: string }
    | {
        type: 'stored_completions';
        created_after?: number | null;
        created_before?: number | null;
        limit?: number | null;
        metadata?: Metadata | null;
        model?: string | null;
      };

  /**
   * Data source type (always `completions`).
   */
  type: 'completions';

  /**
   * Input messages (template or item reference).
   */
  input_messages?: { type: 'template'; template: unknown[] } | { type: 'item_reference'; item_reference: string };

  /**
   * Model to use for sampling.
   */
  model?: string;

  /**
   * Sampling parameters (temperature, max_tokens, etc.).
   */
  sampling_params?: Record<string, unknown>;
}
```

**Example:**

```typescript
// Create a run with a JSONL data source
const run = await client.evals.runs.create('eval_abc123', {
  name: 'Production Test Run',
  data_source: {
    type: 'jsonl',
    source: {
      type: 'file_content',
      content: [
        {
          item: {
            customer_question: 'How do I reset my password?',
            expected_keywords: ['password', 'reset', 'account'],
          },
          sample: {
            output_text: 'To reset your password, go to the login page and click "Forgot Password".',
          },
        },
        {
          item: {
            customer_question: 'What are your business hours?',
            expected_keywords: ['hours', 'open', 'close'],
          },
          sample: {
            output_text: 'We are open 9 AM to 5 PM EST, Monday through Friday.',
          },
        },
      ],
    },
  },
});

console.log(`Run ${run.id} started with status: ${run.status}`);
```

### Retrieve a Run

Gets details about a specific evaluation run.

```typescript { .api }
retrieve(runID: string, params: RunRetrieveParams): Promise<RunRetrieveResponse>
```

**Example:**

```typescript
const run = await client.evals.runs.retrieve('run_xyz789', {
  eval_id: 'eval_abc123',
});

console.log(`Status: ${run.status}`);
console.log(`Passed: ${run.result_counts.passed}`);
console.log(`Failed: ${run.result_counts.failed}`);
console.log(`Report: ${run.report_url}`);
```

### List Runs

Lists evaluation runs for a given evaluation.

```typescript { .api }
list(evalID: string, params?: RunListParams): Promise<RunListResponsesPage>
```

**Parameters:**

```typescript { .api }
interface RunListParams extends CursorPageParams {
  /**
   * Sort order: `asc` or `desc` (default: `asc`).
   */
  order?: 'asc' | 'desc';

  /**
   * Filter by status: queued, in_progress, completed, canceled, failed.
   */
  status?: 'queued' | 'in_progress' | 'completed' | 'canceled' | 'failed';
}
```

**Example:**

```typescript
// List completed runs
const runs = await client.evals.runs.list('eval_abc123', {
  status: 'completed',
  order: 'desc',
});

for await (const run of runs) {
  console.log(`${run.name}: ${run.status}`);
}
```

### Delete a Run

Deletes an evaluation run.

```typescript { .api }
delete(runID: string, params: RunDeleteParams): Promise<RunDeleteResponse>
```

**Example:**

```typescript
await client.evals.runs.delete('run_xyz789', { eval_id: 'eval_abc123' });
```

### Cancel a Run

Cancels an ongoing evaluation run.

```typescript { .api }
cancel(runID: string, params: RunCancelParams): Promise<RunCancelResponse>
```

**Example:**

```typescript
const cancelled = await client.evals.runs.cancel('run_xyz789', {
  eval_id: 'eval_abc123',
});

console.log(`Cancelled: ${cancelled.id}`);
```

### Run Types

```typescript { .api }
interface RunCreateResponse {
  /**
   * Unique run identifier.
   */
  id: string;

  /**
   * Unix timestamp (seconds) when run was created.
   */
  created_at: number;

  /**
   * Run data source configuration.
   */
  data_source:
    | CreateEvalJSONLRunDataSource
    | CreateEvalCompletionsRunDataSource
    | RunCreateResponse.Responses;

  /**
   * Error information (if applicable).
   */
  error: EvalAPIError;

  /**
   * Associated evaluation ID.
   */
  eval_id: string;

  /**
   * Optional metadata.
   */
  metadata: Metadata | null;

  /**
   * Model being evaluated.
   */
  model: string;

  /**
   * Run name.
   */
  name: string;

  /**
   * Object type (always `eval.run`).
   */
  object: 'eval.run';

  /**
   * Per-model token usage statistics.
   */
  per_model_usage: Array<{
    cached_tokens: number;
    completion_tokens: number;
    invocation_count: number;
    model_name: string;
    prompt_tokens: number;
    total_tokens: number;
  }>;

  /**
   * Results per testing criteria.
   */
  per_testing_criteria_results: Array<{
    failed: number;
    passed: number;
    testing_criteria: string;
  }>;

  /**
   * URL to rendered report on dashboard.
   */
  report_url: string;

  /**
   * Result counts summarizing outcomes.
   */
  result_counts: {
    errored: number;
    failed: number;
    passed: number;
    total: number;
  };

  /**
   * Run status.
   */
  status: string;
}
```
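
The `per_testing_criteria_results` array reports raw pass/fail counts per grader; turning them into pass rates is a one-line reduction. A sketch (`passRates` is a hypothetical helper, not part of the SDK):

```typescript
interface CriteriaResult {
  failed: number;
  passed: number;
  testing_criteria: string;
}

// Hypothetical helper: map each grader name to its pass rate in [0, 1].
function passRates(results: CriteriaResult[]): Record<string, number> {
  const rates: Record<string, number> = {};
  for (const r of results) {
    const total = r.passed + r.failed;
    rates[r.testing_criteria] = total === 0 ? 0 : r.passed / total;
  }
  return rates;
}
```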

```typescript { .api }
interface EvalAPIError {
  /**
   * Error code.
   */
  code: string;

  /**
   * Error message.
   */
  message: string;
}
```

## Output Items API

### Retrieve an Output Item

Gets a specific output item from an evaluation run.

```typescript { .api }
retrieve(outputItemID: string, params: OutputItemRetrieveParams): Promise<OutputItemRetrieveResponse>
```

**Parameters:**

```typescript { .api }
interface OutputItemRetrieveParams {
  /**
   * The evaluation ID.
   */
  eval_id: string;

  /**
   * The run ID.
   */
  run_id: string;
}
```

**Example:**

```typescript
const item = await client.evals.runs.outputItems.retrieve('item_123', {
  eval_id: 'eval_abc123',
  run_id: 'run_xyz789',
});

console.log(`Item status: ${item.status}`);
console.log(`Results:`);
item.results.forEach(r => {
  console.log(`  ${r.name}: ${r.passed ? 'PASSED' : 'FAILED'} (${r.score})`);
});
```

### List Output Items

Lists output items for an evaluation run.

```typescript { .api }
list(runID: string, params: OutputItemListParams): Promise<OutputItemListResponsesPage>
```

**Parameters:**

```typescript { .api }
interface OutputItemListParams extends CursorPageParams {
  /**
   * The evaluation ID.
   */
  eval_id: string;

  /**
   * Sort order: `asc` or `desc` (default: `asc`).
   */
  order?: 'asc' | 'desc';

  /**
   * Filter by status: `fail` or `pass`.
   */
  status?: 'fail' | 'pass';
}
```

**Example:**

```typescript
// List failed output items
const failed = await client.evals.runs.outputItems.list('run_xyz789', {
  eval_id: 'eval_abc123',
  status: 'fail',
});

for await (const item of failed) {
  console.log(`Failed item: ${item.id}`);
  item.results.forEach(r => {
    if (!r.passed) {
      console.log(`  ${r.name}: score ${r.score}`);
    }
  });
}
```

### Output Item Types

```typescript { .api }
interface OutputItemRetrieveResponse {
  /**
   * Unique output item identifier.
   */
  id: string;

  /**
   * Unix timestamp (seconds) when created.
   */
  created_at: number;

  /**
   * Input data source item details.
   */
  datasource_item: Record<string, unknown>;

  /**
   * Data source item identifier.
   */
  datasource_item_id: number;

  /**
   * Evaluation ID.
   */
  eval_id: string;

  /**
   * Object type (always `eval.run.output_item`).
   */
  object: 'eval.run.output_item';

  /**
   * List of grader results.
   */
  results: Array<{
    /**
     * Grader name.
     */
    name: string;

    /**
     * Whether grader passed.
     */
    passed: boolean;

    /**
     * Numeric score from grader.
     */
    score: number;

    /**
     * Optional sample data from grader.
     */
    sample?: Record<string, unknown> | null;

    /**
     * Grader type identifier.
     */
    type?: string;

    [k: string]: unknown;
  }>;

  /**
   * Associated run ID.
   */
  run_id: string;

  /**
   * Sample with input and output.
   */
  sample: {
    /**
     * Error information (if applicable).
     */
    error: EvalAPIError;

    /**
     * Finish reason (e.g., "stop", "max_tokens").
     */
    finish_reason: string;

    /**
     * Input messages.
     */
    input: Array<{
      content: string;
      role: string;
    }>;

    /**
     * Maximum tokens for completion.
     */
    max_completion_tokens: number;

    /**
     * Model used.
     */
    model: string;

    /**
     * Output messages.
     */
    output: Array<{
      content?: string;
      role?: string;
    }>;

    /**
     * Seed used.
     */
    seed: number;

    /**
     * Temperature used.
     */
    temperature: number;

    /**
     * Top-p (nucleus sampling) value.
     */
    top_p: number;

    /**
     * Token usage.
     */
    usage: {
      cached_tokens: number;
      completion_tokens: number;
      prompt_tokens: number;
      total_tokens: number;
    };
  };

  /**
   * Status of the output item.
   */
  status: string;
}
```

## Complete Workflow Examples

### Batch Processing Workflow

```typescript
import OpenAI, { toFile } from 'openai';

const client = new OpenAI();

async function processBatch() {
  // 1. Prepare batch requests
  const requests = [
    {
      custom_id: 'translation-1',
      method: 'POST',
      url: '/v1/chat/completions',
      body: {
        model: 'gpt-4o',
        messages: [{ role: 'user', content: 'Translate "hello" to French' }],
      },
    },
    {
      custom_id: 'summary-1',
      method: 'POST',
      url: '/v1/chat/completions',
      body: {
        model: 'gpt-4o',
        messages: [{ role: 'user', content: 'Summarize: OpenAI creates AI models.' }],
      },
    },
  ];

  // 2. Upload requests file
  const file = await client.files.create({
    file: await toFile(
      Buffer.from(requests.map(r => JSON.stringify(r)).join('\n')),
      'batch_requests.jsonl',
    ),
    purpose: 'batch',
  });

  // 3. Create batch
  const batch = await client.batches.create({
    input_file_id: file.id,
    endpoint: '/v1/chat/completions',
    completion_window: '24h',
  });

  console.log(`Batch ${batch.id} created with status: ${batch.status}`);

  // 4. Poll until a terminal status (or use webhooks)
  let completed = batch;
  while (!['completed', 'failed', 'expired', 'cancelled'].includes(completed.status)) {
    await new Promise(resolve => setTimeout(resolve, 30000)); // Wait 30s
    completed = await client.batches.retrieve(batch.id);
    console.log(`Status: ${completed.status}`);
  }

  // 5. Retrieve results (files.content resolves to a Response)
  if (completed.output_file_id) {
    const output = await client.files.content(completed.output_file_id);
    console.log('Results:', await output.text());
  }

  // 6. Check for errors
  if (completed.error_file_id) {
    const errors = await client.files.content(completed.error_file_id);
    console.log('Errors:', await errors.text());
  }
}

processBatch().catch(console.error);
```
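
Each line of the output file downloaded in step 5 is a JSON object carrying the request's `custom_id` and either a `response` or an `error`, per the Batch API file format. A sketch for rejoining results with their inputs (the helper and the narrowed types are illustrative):

```typescript
interface BatchOutputLine {
  custom_id: string;
  response?: { status_code: number; body: unknown } | null;
  error?: { code: string; message: string } | null;
}

// Parse the downloaded output file text into a map keyed by custom_id.
function indexBatchOutput(fileText: string): Map<string, BatchOutputLine> {
  const byId = new Map<string, BatchOutputLine>();
  for (const line of fileText.split('\n')) {
    if (!line.trim()) continue; // skip trailing blank lines
    const parsed = JSON.parse(line) as BatchOutputLine;
    byId.set(parsed.custom_id, parsed);
  }
  return byId;
}
```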

### Evaluation Workflow

```typescript
import { OpenAI } from 'openai';

const client = new OpenAI();

async function runEvaluation() {
  // 1. Create evaluation (`eval` is a reserved identifier, so use another name)
  const evaluation = await client.evals.create({
    name: 'Support Response Quality',
    data_source_config: {
      type: 'custom',
      item_schema: {
        type: 'object',
        properties: {
          question: { type: 'string' },
          expected_answer: { type: 'string' },
        },
      },
      include_sample_schema: true,
    },
    testing_criteria: [
      {
        // `ilike` is a case-insensitive "contains" comparison
        type: 'string_check',
        name: 'Contains Key Term',
        input: '{{sample.output_text}}',
        operation: 'ilike',
        reference: 'resolved',
      },
      {
        type: 'label_model',
        name: 'Tone Check',
        model: 'gpt-4o',
        labels: ['professional', 'casual', 'rude'],
        passing_labels: ['professional', 'casual'],
        input: [
          {
            role: 'system',
            content: 'Rate the tone of this response.',
          },
          {
            role: 'user',
            content: '{{sample.output_text}}',
          },
        ],
      },
    ],
  });

  console.log(`Created evaluation ${evaluation.id}`);

  // 2. Create run with test data
  const run = await client.evals.runs.create(evaluation.id, {
    name: 'First Run',
    data_source: {
      type: 'jsonl',
      source: {
        type: 'file_content',
        content: [
          {
            item: {
              question: 'How do I upgrade my account?',
              expected_answer: 'Go to settings',
            },
            sample: {
              output_text: 'Your issue has been resolved. Go to settings to upgrade.',
            },
          },
        ],
      },
    },
  });

  console.log(`Run ${run.id} started`);

  // 3. Poll for completion
  let completed = false;
  while (!completed) {
    const updated = await client.evals.runs.retrieve(run.id, { eval_id: evaluation.id });
    if (['completed', 'failed'].includes(updated.status)) {
      completed = true;
      console.log(`Run completed with status: ${updated.status}`);
      console.log(`Results: ${updated.result_counts.passed} passed, ${updated.result_counts.failed} failed`);
      console.log(`Report: ${updated.report_url}`);
    } else {
      await new Promise(resolve => setTimeout(resolve, 5000));
    }
  }

  // 4. Analyze output items
  for await (const item of client.evals.runs.outputItems.list(run.id, { eval_id: evaluation.id, status: 'fail' })) {
    console.log(`Failed item ${item.id}:`);
    item.results.forEach(r => {
      console.log(`  ${r.name}: score ${r.score}`);
    });
  }
}

runEvaluation().catch(console.error);
```

### Data Source Examples

#### JSONL File Format

```json
{"item": {"question": "What is 2+2?", "expected": "4"}, "sample": {"output": "2+2 equals 4"}}
{"item": {"question": "What is the capital of France?", "expected": "Paris"}, "sample": {"output": "The capital of France is Paris"}}
```
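
Rows in this format can be read from disk and passed directly as a `file_content` source. A conversion sketch (assuming every non-empty line parses as a row; `jsonlToContent` is an illustrative name, not an SDK export):

```typescript
interface EvalRow {
  item: Record<string, unknown>;
  sample?: Record<string, unknown>;
}

// Convert JSONL text into the `content` array expected by a
// { type: 'file_content' } run data source.
function jsonlToContent(text: string): EvalRow[] {
  return text
    .split('\n')
    .filter(line => line.trim().length > 0)
    .map(line => JSON.parse(line) as EvalRow);
}
```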

#### Stored Completions Data Source

```typescript
const run = await client.evals.runs.create('eval_abc123', {
  name: 'Test with Stored Completions',
  data_source: {
    type: 'completions',
    source: {
      type: 'stored_completions',
      created_after: 1700000000, // Unix timestamp
      model: 'gpt-4o',
      metadata: { usecase: 'support', version: 'v2' },
    },
    model: 'gpt-4o-mini',
    input_messages: {
      type: 'template',
      template: [
        { role: 'system', content: 'You are a helpful assistant.' },
        { role: 'user', content: '{{item.prompt}}' },
      ],
    },
  },
});
```

## Best Practices

### Batches

- Use batches for non-time-critical workloads (high-volume, cost-optimized)
- Validate JSONL format before uploading (ensure proper line-delimited JSON)
- Monitor batch status via polling or webhooks
- Implement error handling for failed requests in output files
- Use metadata to track batch purpose and versions
- Archive output files for compliance and analysis
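
The "validate JSONL format" point above can be automated as a pre-flight check. A sketch (field requirements follow the batch input format; `validateBatchJSONL` is a hypothetical helper):

```typescript
// Hypothetical pre-flight check for a batch input file: every line must be
// valid JSON with custom_id, method, url, and body fields.
function validateBatchJSONL(text: string): string[] {
  const problems: string[] = [];
  const lines = text.split('\n').filter(l => l.trim().length > 0);
  lines.forEach((line, i) => {
    try {
      const req = JSON.parse(line);
      for (const field of ['custom_id', 'method', 'url', 'body']) {
        if (!(field in req)) problems.push(`line ${i + 1}: missing ${field}`);
      }
    } catch {
      problems.push(`line ${i + 1}: not valid JSON`);
    }
  });
  return problems;
}
```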

### Evaluations

- Start with a small, well-curated dataset to validate criteria
- Use multiple grader types (label models, string checks, similarity) for comprehensive assessment
- Reference data using `{{item.field}}` for inputs and `{{sample.output_text}}` for model outputs
- Set appropriate pass thresholds that reflect your quality requirements
- Use the stored completions data source for historical model outputs
- Access detailed reports via the `report_url` for visualizations
- Track evaluation results across model versions for performance comparison
1556