# pypi-openai

Official Python library for the OpenAI API, providing chat completions, embeddings, audio, images, and more.

Author: tessl

How to use:

    npx @tessl/cli registry install tessl/pypi-openai@1.106.0

docs/embeddings.md:

# Embeddings

Convert text into high-dimensional vector representations for semantic similarity, search, clustering, and other NLP tasks using OpenAI's embedding models.

## Capabilities

### Basic Text Embeddings

Generate vector embeddings from text input for semantic analysis and similarity comparisons.

```python { .api }
def create(
    self,
    *,
    input: Union[str, SequenceNotStr[str], Iterable[int], Iterable[Iterable[int]]],
    model: Union[str, EmbeddingModel],
    dimensions: int | NotGiven = NOT_GIVEN,
    encoding_format: Literal["float", "base64"] | NotGiven = NOT_GIVEN,
    user: str | NotGiven = NOT_GIVEN,
    extra_headers: Headers | None = None,
    extra_query: Query | None = None,
    extra_body: Body | None = None,
    timeout: float | httpx.Timeout | None | NotGiven = NOT_GIVEN
) -> CreateEmbeddingResponse: ...
```

Usage examples:

```python
from openai import OpenAI

client = OpenAI()

# Single text embedding
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="The quick brown fox jumps over the lazy dog"
)

embedding = response.data[0].embedding
print(f"Embedding dimension: {len(embedding)}")
print(f"First few values: {embedding[:5]}")

# Multiple texts at once
texts = [
    "Machine learning is a subset of artificial intelligence",
    "Deep learning uses neural networks with multiple layers",
    "Natural language processing enables computers to understand text",
    "Computer vision allows machines to interpret visual information"
]

response = client.embeddings.create(
    model="text-embedding-3-small",
    input=texts
)

print(f"Generated {len(response.data)} embeddings")
for i, embedding_data in enumerate(response.data):
    print(f"Text {i+1}: {len(embedding_data.embedding)} dimensions")
```

### Advanced Embedding Models

Use different embedding models optimized for various use cases and performance requirements.

Usage examples:

```python
# High-performance embedding model
response = client.embeddings.create(
    model="text-embedding-3-large",
    input="This is a sample text for embedding"
)

large_embedding = response.data[0].embedding
print(f"Large model embedding size: {len(large_embedding)}")

# Ada model for compatibility
response = client.embeddings.create(
    model="text-embedding-ada-002",
    input="Legacy embedding model example"
)

ada_embedding = response.data[0].embedding
print(f"Ada model embedding size: {len(ada_embedding)}")

# Performance comparison
import time

text = "Compare embedding generation speed across models"

# Small model
start = time.time()
response_small = client.embeddings.create(
    model="text-embedding-3-small",
    input=text
)
small_time = time.time() - start

# Large model
start = time.time()
response_large = client.embeddings.create(
    model="text-embedding-3-large",
    input=text
)
large_time = time.time() - start

print(f"Small model time: {small_time:.3f}s")
print(f"Large model time: {large_time:.3f}s")
```

### Custom Embedding Dimensions

Specify custom dimensions for optimized storage and performance in specific applications.

Usage examples:

```python
# Reduced dimensions for storage efficiency
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Text with custom dimensions",
    dimensions=512  # Reduced from default 1536
)

custom_embedding = response.data[0].embedding
print(f"Custom dimension size: {len(custom_embedding)}")

# Different dimension sizes for comparison
texts = ["Sample text for dimension comparison"]

dimensions_to_test = [256, 512, 1024, 1536]
embeddings = {}

for dim in dimensions_to_test:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts[0],
        dimensions=dim
    )
    embeddings[dim] = response.data[0].embedding
    print(f"Dimensions {dim}: actual size {len(response.data[0].embedding)}")

# Note: Smaller dimensions lose some information but are more efficient
```
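
If you shorten embeddings client-side instead of passing `dimensions`, the truncated vector is no longer unit-length, so re-normalize it before computing cosine or dot-product similarity. A minimal sketch, assuming `truncate_and_normalize` is your own helper rather than part of the SDK:

```python
import numpy as np

def truncate_and_normalize(embedding, dim):
    """Truncate a full-length embedding and rescale it to unit length.

    Hypothetical helper: the API's `dimensions` parameter handles this
    server-side; this is only needed for client-side truncation.
    """
    vec = np.asarray(embedding[:dim])
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

full = client.embeddings.create(
    model="text-embedding-3-small",
    input="Sample text for dimension comparison"
).data[0].embedding

reduced = truncate_and_normalize(full, 512)
print(f"Reduced length: {len(reduced)}, norm: {np.linalg.norm(reduced):.3f}")
```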

### Batch Processing and Encoding Formats

Process large amounts of text efficiently with batch operations and different encoding formats.

Usage examples:

```python
# Large batch processing
documents = [
    "Document 1: Introduction to machine learning concepts",
    "Document 2: Advanced neural network architectures",
    "Document 3: Natural language processing applications",
    "Document 4: Computer vision and image recognition",
    "Document 5: Reinforcement learning algorithms"
] * 100  # 500 documents total

# Process in batches to handle API limits
batch_size = 100
all_embeddings = []

for i in range(0, len(documents), batch_size):
    batch = documents[i:i + batch_size]

    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=batch
    )

    batch_embeddings = [item.embedding for item in response.data]
    all_embeddings.extend(batch_embeddings)

    print(f"Processed batch {i//batch_size + 1}: {len(batch)} documents")

print(f"Total embeddings generated: {len(all_embeddings)}")

# Base64 encoding for storage efficiency
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Text for base64 encoding",
    encoding_format="base64"
)

base64_embedding = response.data[0].embedding
print(f"Base64 encoded embedding type: {type(base64_embedding)}")

# Decode base64 embedding when needed
import base64
import struct

def decode_base64_embedding(base64_str):
    """Convert a base64-encoded embedding back to a list of floats."""
    binary_data = base64.b64decode(base64_str)
    # Embeddings are packed as little-endian 32-bit floats
    float_array = struct.unpack(f'<{len(binary_data)//4}f', binary_data)
    return list(float_array)

# decoded_embedding = decode_base64_embedding(base64_embedding)
```
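
When looping over many batches, transient rate-limit errors are common. A minimal backoff sketch, assuming the v1 SDK's `openai.RateLimitError` and a retry policy you would tune yourself:

```python
import time

import openai

def embed_with_retry(texts, model="text-embedding-3-small", max_retries=5):
    """Retry an embeddings request with exponential backoff on rate limits."""
    for attempt in range(max_retries):
        try:
            return client.embeddings.create(model=model, input=texts)
        except openai.RateLimitError:
            wait = 2 ** attempt  # 1s, 2s, 4s, ...
            print(f"Rate limited; retrying in {wait}s")
            time.sleep(wait)
    raise RuntimeError("Embeddings request failed after retries")
```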

### Semantic Similarity and Search

Use embeddings for semantic similarity calculations and vector-based search applications.

Usage examples:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Generate embeddings for similarity comparison
queries = [
    "What is artificial intelligence?",
    "How do neural networks work?",
    "Explain machine learning algorithms"
]

documents = [
    "Artificial intelligence is the simulation of human intelligence by machines",
    "Neural networks are computing systems inspired by biological neural networks",
    "Machine learning algorithms enable computers to learn from data",
    "Deep learning uses multi-layered neural networks for complex pattern recognition",
    "Natural language processing helps computers understand human language"
]

# Get query embeddings
query_response = client.embeddings.create(
    model="text-embedding-3-small",
    input=queries
)
query_embeddings = np.array([item.embedding for item in query_response.data])

# Get document embeddings
doc_response = client.embeddings.create(
    model="text-embedding-3-small",
    input=documents
)
doc_embeddings = np.array([item.embedding for item in doc_response.data])

# Calculate similarities
similarities = cosine_similarity(query_embeddings, doc_embeddings)

# Find the best match for each query
for i, query in enumerate(queries):
    best_doc_idx = np.argmax(similarities[i])
    similarity_score = similarities[i][best_doc_idx]

    print(f"Query: {query}")
    print(f"Best match: {documents[best_doc_idx]}")
    print(f"Similarity: {similarity_score:.3f}\n")

# Vector search function
def vector_search(query, documents, top_k=3):
    """Perform vector-based semantic search."""

    # Get query embedding
    query_response = client.embeddings.create(
        model="text-embedding-3-small",
        input=query
    )
    query_embedding = np.array(query_response.data[0].embedding).reshape(1, -1)

    # Get document embeddings (in practice, precompute and cache these)
    doc_response = client.embeddings.create(
        model="text-embedding-3-small",
        input=documents
    )
    doc_embeddings = np.array([item.embedding for item in doc_response.data])

    # Calculate similarities
    similarities = cosine_similarity(query_embedding, doc_embeddings)[0]

    # Get top k results
    top_indices = np.argsort(similarities)[::-1][:top_k]

    results = []
    for idx in top_indices:
        results.append({
            'document': documents[idx],
            'similarity': similarities[idx],
            'index': idx
        })

    return results

# Example search
search_results = vector_search(
    "How do computers learn?",
    documents,
    top_k=3
)

print("Search results:")
for i, result in enumerate(search_results):
    print(f"{i+1}. {result['document']} (similarity: {result['similarity']:.3f})")
```

### Token-Based Input

Use tokenized input for precise control over embedding generation and handling of long texts.

Usage examples:

```python
import tiktoken

# Get tokenizer for embedding model
encoding = tiktoken.encoding_for_model("text-embedding-3-small")

# Tokenize input text
text = "This is a sample text that will be tokenized for embedding generation."
tokens = encoding.encode(text)

print(f"Original text: {text}")
print(f"Tokens: {tokens}")
print(f"Token count: {len(tokens)}")

# Generate embedding from tokens
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=tokens  # Pass tokens directly
)

token_embedding = response.data[0].embedding
print(f"Embedding from tokens: {len(token_embedding)} dimensions")

# Handle long texts by truncation
max_tokens = 8191  # Input limit for text-embedding-3-small

long_text = "Very long document content..." * 1000
long_tokens = encoding.encode(long_text)

if len(long_tokens) > max_tokens:
    truncated_tokens = long_tokens[:max_tokens]
    print(f"Truncated from {len(long_tokens)} to {len(truncated_tokens)} tokens")

    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=truncated_tokens
    )

    truncated_embedding = response.data[0].embedding
    print("Embedding generated from truncated tokens")

# Multiple token sequences
token_sequences = [
    encoding.encode("First document for embedding"),
    encoding.encode("Second document for embedding"),
    encoding.encode("Third document for embedding")
]

response = client.embeddings.create(
    model="text-embedding-3-small",
    input=token_sequences
)

print(f"Generated embeddings for {len(response.data)} token sequences")
```
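
Truncation simply discards everything past the limit. An alternative sketch, assuming a chunking strategy you define yourself rather than a library feature, is to split long token sequences into model-sized chunks and embed each chunk separately:

```python
def chunk_tokens(tokens, chunk_size=8191):
    """Split a token sequence into chunks that fit the model's input limit."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

chunks = chunk_tokens(long_tokens)
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=chunks  # a list of token lists yields one embedding per chunk
)
print(f"Embedded {len(response.data)} chunks")
```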

## Types

### Core Response Types

```python { .api }
class CreateEmbeddingResponse(BaseModel):
    data: List[Embedding]
    model: str
    object: Literal["list"]
    usage: EmbeddingUsage

class Embedding(BaseModel):
    embedding: List[float]
    index: int
    object: Literal["embedding"]

class EmbeddingUsage(BaseModel):
    prompt_tokens: int
    total_tokens: int
```
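
A quick sketch of reading these fields from a live response:

```python
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Inspect the response structure"
)

print(response.model)                # resolved model name
print(response.usage.prompt_tokens)  # tokens consumed by the input
print(response.usage.total_tokens)
print(response.data[0].index)        # position within the input batch
```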

### Parameter Types

```python { .api }
EmbeddingCreateParams = TypedDict('EmbeddingCreateParams', {
    'input': Required[Union[str, List[str], List[int], List[List[int]]]],
    'model': Required[Union[str, EmbeddingModel]],
    'dimensions': NotRequired[int],
    'encoding_format': NotRequired[Literal["float", "base64"]],
    'user': NotRequired[str],
}, total=False)

# Input can take several forms
EmbeddingInput = Union[
    str,              # Single text string
    List[str],        # Multiple text strings
    List[int],        # Token IDs for a single text
    List[List[int]]   # Token IDs for multiple texts
]
```

### Model Types

```python { .api }
EmbeddingModel = Literal[
    "text-embedding-3-small",
    "text-embedding-3-large",
    "text-embedding-ada-002"
]

# Model specifications (informational summary, not an SDK type)
EMBEDDING_MODEL_SPECS = {
    "text-embedding-3-small": {
        "max_input": 8191,           # tokens
        "dimensions": 1536,          # default
        "custom_dimensions": True,   # supports dimension reduction
        "performance": "high",
    },
    "text-embedding-3-large": {
        "max_input": 8191,           # tokens
        "dimensions": 3072,          # default
        "custom_dimensions": True,   # supports dimension reduction
        "performance": "highest",
    },
    "text-embedding-ada-002": {
        "max_input": 8191,           # tokens
        "dimensions": 1536,          # fixed
        "custom_dimensions": False,  # no dimension reduction
        "performance": "good",
    },
}
```

### Configuration Types

```python { .api }
EncodingFormat = Literal["float", "base64"]

# Dimension limits by model
ModelDimensions: Dict[str, Dict[str, int]] = {
    "text-embedding-3-small": {"min": 1, "max": 1536, "default": 1536},
    "text-embedding-3-large": {"min": 1, "max": 3072, "default": 3072},
    "text-embedding-ada-002": {"min": 1536, "max": 1536, "default": 1536}
}
```
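
A small sketch that uses the table above to validate a requested size before calling the API; `clamp_dimensions` is a hypothetical helper, not part of the SDK:

```python
def clamp_dimensions(model: str, requested: int) -> int:
    """Clamp a requested embedding size to the model's supported range."""
    limits = ModelDimensions[model]
    return max(limits["min"], min(requested, limits["max"]))

dim = clamp_dimensions("text-embedding-3-small", 2000)  # -> 1536
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Clamped dimensions example",
    dimensions=dim
)
```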

## Best Practices

### Performance Optimization

- Use `text-embedding-3-small` for most applications (good performance, lower cost)
- Use `text-embedding-3-large` when maximum accuracy is needed
- Reduce dimensions for storage efficiency when full precision isn't required
- Batch multiple texts together for better throughput
- Cache embeddings for frequently accessed content (see the sketch after this list)
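
A minimal in-memory caching sketch; `cached_embedding` is a hypothetical helper, and a production system would likely use a persistent store instead:

```python
import hashlib

_embedding_cache = {}

def cached_embedding(text, model="text-embedding-3-small"):
    """Return a cached embedding, calling the API only on a cache miss."""
    key = hashlib.sha256(f"{model}:{text}".encode()).hexdigest()
    if key not in _embedding_cache:
        response = client.embeddings.create(model=model, input=text)
        _embedding_cache[key] = response.data[0].embedding
    return _embedding_cache[key]

vec = cached_embedding("Frequently requested text")  # second call skips the API
```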

### Input Preparation

- Normalize text (lowercase, remove extra whitespace) for consistent results (see the sketch after this list)
- Handle long texts by truncating to model limits (8191 tokens)
- Use meaningful text chunks rather than very short fragments
- Preprocess documents to remove irrelevant content (headers, footers, etc.)
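
A normalization sketch; `normalize_text` is a hypothetical preprocessing helper, not part of the SDK:

```python
import re

def normalize_text(text):
    """Lowercase and collapse whitespace before embedding."""
    return re.sub(r"\s+", " ", text).strip().lower()

clean = normalize_text("  The   Quick\nBrown Fox  ")
print(clean)  # "the quick brown fox"
```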

### Storage and Retrieval

- Use base64 encoding for space-efficient storage
- Consider approximate nearest neighbor search libraries (FAISS, Annoy) for large datasets (see the sketch after this list)
- Store metadata alongside embeddings for result filtering
- Implement caching strategies for frequently accessed embeddings
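
A minimal FAISS sketch, assuming `pip install faiss-cpu` and the `all_embeddings` list built earlier; `IndexFlatIP` is an exact index, and FAISS also offers approximate variants (e.g. `IndexIVFFlat`) for very large datasets:

```python
import faiss
import numpy as np

# Build an index over the document embeddings computed earlier
doc_matrix = np.array(all_embeddings, dtype="float32")
faiss.normalize_L2(doc_matrix)  # unit length: inner product == cosine similarity

index = faiss.IndexFlatIP(doc_matrix.shape[1])
index.add(doc_matrix)

# Embed a query and search
query = client.embeddings.create(
    model="text-embedding-3-small",
    input="machine learning introduction"
).data[0].embedding
query_vec = np.array([query], dtype="float32")
faiss.normalize_L2(query_vec)

scores, indices = index.search(query_vec, 5)  # top-5 nearest documents
print(indices[0], scores[0])
```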

### Similarity Calculations

- Use cosine similarity for most semantic similarity tasks
- Consider Euclidean distance for specific mathematical applications
- Normalize vectors when using dot product similarity (see the sketch after this list)
- Establish similarity thresholds based on your specific use case
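
A short sketch of the cosine/dot-product relationship, assuming `a` and `b` are two embedding vectors from the earlier examples:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two raw embedding vectors."""
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def normalize(v):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    v = np.asarray(v)
    return v / np.linalg.norm(v)

print(cosine(a, b))
print(float(normalize(a) @ normalize(b)))  # matches cosine(a, b)
```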