# pypi-openai

Official Python library for the OpenAI API, providing chat completions, embeddings, audio, images, and more.

Author: tessl

How to use:

    npx @tessl/cli registry install tessl/pypi-openai@1.106.0

docs/embeddings.md:

# Embeddings

Convert text into high-dimensional vector representations for semantic similarity, search, clustering, and other NLP tasks using OpenAI's embedding models.

## Capabilities

### Basic Text Embeddings

Generate vector embeddings from text input for semantic analysis and similarity comparisons.

```python { .api }
def create(
    self,
    *,
    input: Union[str, SequenceNotStr[str], Iterable[int], Iterable[Iterable[int]]],
    model: Union[str, EmbeddingModel],
    dimensions: int | NotGiven = NOT_GIVEN,
    encoding_format: Literal["float", "base64"] | NotGiven = NOT_GIVEN,
    user: str | NotGiven = NOT_GIVEN,
    extra_headers: Headers | None = None,
    extra_query: Query | None = None,
    extra_body: Body | None = None,
    timeout: float | httpx.Timeout | None | NotGiven = NOT_GIVEN
) -> CreateEmbeddingResponse: ...
```

Usage examples:

```python
from openai import OpenAI

client = OpenAI()

# Single text embedding
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="The quick brown fox jumps over the lazy dog"
)

embedding = response.data[0].embedding
print(f"Embedding dimension: {len(embedding)}")
print(f"First few values: {embedding[:5]}")

# Multiple texts at once
texts = [
    "Machine learning is a subset of artificial intelligence",
    "Deep learning uses neural networks with multiple layers",
    "Natural language processing enables computers to understand text",
    "Computer vision allows machines to interpret visual information"
]

response = client.embeddings.create(
    model="text-embedding-3-small",
    input=texts
)

print(f"Generated {len(response.data)} embeddings")
for i, embedding_data in enumerate(response.data):
    print(f"Text {i+1}: {len(embedding_data.embedding)} dimensions")
```

### Advanced Embedding Models

Use different embedding models optimized for various use cases and performance requirements.

Usage examples:

```python
# High-performance embedding model
response = client.embeddings.create(
    model="text-embedding-3-large",
    input="This is a sample text for embedding"
)

large_embedding = response.data[0].embedding
print(f"Large model embedding size: {len(large_embedding)}")

# Ada model for compatibility
response = client.embeddings.create(
    model="text-embedding-ada-002",
    input="Legacy embedding model example"
)

ada_embedding = response.data[0].embedding
print(f"Ada model embedding size: {len(ada_embedding)}")

# Performance comparison
import time

text = "Compare embedding generation speed across models"

# Small model
start = time.time()
response_small = client.embeddings.create(
    model="text-embedding-3-small",
    input=text
)
small_time = time.time() - start

# Large model
start = time.time()
response_large = client.embeddings.create(
    model="text-embedding-3-large",
    input=text
)
large_time = time.time() - start

print(f"Small model time: {small_time:.3f}s")
print(f"Large model time: {large_time:.3f}s")
```

### Custom Embedding Dimensions

Specify custom dimensions for optimized storage and performance in specific applications.

Usage examples:

```python
# Reduced dimensions for storage efficiency
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Text with custom dimensions",
    dimensions=512  # Reduced from default 1536
)

custom_embedding = response.data[0].embedding
print(f"Custom dimension size: {len(custom_embedding)}")

# Different dimension sizes for comparison
texts = ["Sample text for dimension comparison"]

dimensions_to_test = [256, 512, 1024, 1536]
embeddings = {}

for dim in dimensions_to_test:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts[0],
        dimensions=dim
    )
    embeddings[dim] = response.data[0].embedding
    print(f"Dimensions {dim}: actual size {len(response.data[0].embedding)}")

# Note: Smaller dimensions lose some information but are more efficient
```
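
If you shorten embeddings client-side instead of passing `dimensions`, the truncated vector is no longer unit-length, so re-normalize it before computing cosine or dot-product similarity. A minimal sketch, assuming `truncate_and_normalize` is your own helper rather than part of the SDK:

```python
import numpy as np

def truncate_and_normalize(embedding, dim):
    """Truncate a full-length embedding and rescale it to unit length.

    Hypothetical helper: the API's `dimensions` parameter handles this
    server-side; this is only needed for client-side truncation.
    """
    vec = np.asarray(embedding[:dim])
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

full = client.embeddings.create(
    model="text-embedding-3-small",
    input="Sample text for dimension comparison"
).data[0].embedding

reduced = truncate_and_normalize(full, 512)
print(f"Reduced length: {len(reduced)}, norm: {np.linalg.norm(reduced):.3f}")
```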

### Batch Processing and Encoding Formats

Process large amounts of text efficiently with batch operations and different encoding formats.

Usage examples:

```python
# Large batch processing
documents = [
    "Document 1: Introduction to machine learning concepts",
    "Document 2: Advanced neural network architectures",
    "Document 3: Natural language processing applications",
    "Document 4: Computer vision and image recognition",
    "Document 5: Reinforcement learning algorithms"
] * 100  # 500 documents total

# Process in batches to handle API limits
batch_size = 100
all_embeddings = []

for i in range(0, len(documents), batch_size):
    batch = documents[i:i + batch_size]

    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=batch
    )

    batch_embeddings = [item.embedding for item in response.data]
    all_embeddings.extend(batch_embeddings)

    print(f"Processed batch {i//batch_size + 1}: {len(batch)} documents")

print(f"Total embeddings generated: {len(all_embeddings)}")

# Base64 encoding for storage efficiency
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Text for base64 encoding",
    encoding_format="base64"
)

base64_embedding = response.data[0].embedding
print(f"Base64 encoded embedding type: {type(base64_embedding)}")

# Decode base64 embedding when needed
import base64
import struct

def decode_base64_embedding(base64_str):
    """Convert a base64-encoded embedding back to a list of floats."""
    binary_data = base64.b64decode(base64_str)
    # Embeddings are packed as little-endian 32-bit floats
    float_array = struct.unpack(f'<{len(binary_data)//4}f', binary_data)
    return list(float_array)

# decoded_embedding = decode_base64_embedding(base64_embedding)
```
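
When looping over many batches, transient rate-limit errors are common. A minimal backoff sketch, assuming the v1 SDK's `openai.RateLimitError` and a retry policy you would tune yourself:

```python
import time

import openai

def embed_with_retry(texts, model="text-embedding-3-small", max_retries=5):
    """Retry an embeddings request with exponential backoff on rate limits."""
    for attempt in range(max_retries):
        try:
            return client.embeddings.create(model=model, input=texts)
        except openai.RateLimitError:
            wait = 2 ** attempt  # 1s, 2s, 4s, ...
            print(f"Rate limited; retrying in {wait}s")
            time.sleep(wait)
    raise RuntimeError("Embeddings request failed after retries")
```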

### Semantic Similarity and Search

Use embeddings for semantic similarity calculations and vector-based search applications.

Usage examples:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Generate embeddings for similarity comparison
queries = [
    "What is artificial intelligence?",
    "How do neural networks work?",
    "Explain machine learning algorithms"
]

documents = [
    "Artificial intelligence is the simulation of human intelligence by machines",
    "Neural networks are computing systems inspired by biological neural networks",
    "Machine learning algorithms enable computers to learn from data",
    "Deep learning uses multi-layered neural networks for complex pattern recognition",
    "Natural language processing helps computers understand human language"
]

# Get query embeddings
query_response = client.embeddings.create(
    model="text-embedding-3-small",
    input=queries
)
query_embeddings = np.array([item.embedding for item in query_response.data])

# Get document embeddings
doc_response = client.embeddings.create(
    model="text-embedding-3-small",
    input=documents
)
doc_embeddings = np.array([item.embedding for item in doc_response.data])

# Calculate similarities
similarities = cosine_similarity(query_embeddings, doc_embeddings)

# Find the best match for each query
for i, query in enumerate(queries):
    best_doc_idx = np.argmax(similarities[i])
    similarity_score = similarities[i][best_doc_idx]

    print(f"Query: {query}")
    print(f"Best match: {documents[best_doc_idx]}")
    print(f"Similarity: {similarity_score:.3f}\n")

# Vector search function
def vector_search(query, documents, top_k=3):
    """Perform vector-based semantic search."""

    # Get query embedding
    query_response = client.embeddings.create(
        model="text-embedding-3-small",
        input=query
    )
    query_embedding = np.array(query_response.data[0].embedding).reshape(1, -1)

    # Get document embeddings (in practice, precompute and cache these)
    doc_response = client.embeddings.create(
        model="text-embedding-3-small",
        input=documents
    )
    doc_embeddings = np.array([item.embedding for item in doc_response.data])

    # Calculate similarities
    similarities = cosine_similarity(query_embedding, doc_embeddings)[0]

    # Get top k results
    top_indices = np.argsort(similarities)[::-1][:top_k]

    results = []
    for idx in top_indices:
        results.append({
            'document': documents[idx],
            'similarity': similarities[idx],
            'index': idx
        })

    return results

# Example search
search_results = vector_search(
    "How do computers learn?",
    documents,
    top_k=3
)

print("Search results:")
for i, result in enumerate(search_results):
    print(f"{i+1}. {result['document']} (similarity: {result['similarity']:.3f})")
```

### Token-Based Input

Use tokenized input for precise control over embedding generation and handling of long texts.

Usage examples:

```python
import tiktoken

# Get tokenizer for embedding model
encoding = tiktoken.encoding_for_model("text-embedding-3-small")

# Tokenize input text
text = "This is a sample text that will be tokenized for embedding generation."
tokens = encoding.encode(text)

print(f"Original text: {text}")
print(f"Tokens: {tokens}")
print(f"Token count: {len(tokens)}")

# Generate embedding from tokens
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=tokens  # Pass tokens directly
)

token_embedding = response.data[0].embedding
print(f"Embedding from tokens: {len(token_embedding)} dimensions")

# Handle long texts by truncation
max_tokens = 8191  # Input limit for text-embedding-3-small

long_text = "Very long document content..." * 1000
long_tokens = encoding.encode(long_text)

if len(long_tokens) > max_tokens:
    truncated_tokens = long_tokens[:max_tokens]
    print(f"Truncated from {len(long_tokens)} to {len(truncated_tokens)} tokens")

    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=truncated_tokens
    )

    truncated_embedding = response.data[0].embedding
    print("Embedding generated from truncated tokens")

# Multiple token sequences
token_sequences = [
    encoding.encode("First document for embedding"),
    encoding.encode("Second document for embedding"),
    encoding.encode("Third document for embedding")
]

response = client.embeddings.create(
    model="text-embedding-3-small",
    input=token_sequences
)

print(f"Generated embeddings for {len(response.data)} token sequences")
```
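
Truncation simply discards everything past the limit. An alternative sketch, assuming a chunking strategy you define yourself rather than a library feature, is to split long token sequences into model-sized chunks and embed each chunk separately:

```python
def chunk_tokens(tokens, chunk_size=8191):
    """Split a token sequence into chunks that fit the model's input limit."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

chunks = chunk_tokens(long_tokens)
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=chunks  # a list of token lists yields one embedding per chunk
)
print(f"Embedded {len(response.data)} chunks")
```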

## Types

### Core Response Types

```python { .api }
class CreateEmbeddingResponse(BaseModel):
    data: List[Embedding]
    model: str
    object: Literal["list"]
    usage: EmbeddingUsage

class Embedding(BaseModel):
    embedding: List[float]
    index: int
    object: Literal["embedding"]

class EmbeddingUsage(BaseModel):
    prompt_tokens: int
    total_tokens: int
```
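
A quick sketch of reading these fields from a live response:

```python
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Inspect the response structure"
)

print(response.model)                # resolved model name
print(response.usage.prompt_tokens)  # tokens consumed by the input
print(response.usage.total_tokens)
print(response.data[0].index)        # position within the input batch
```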

### Parameter Types

```python { .api }
EmbeddingCreateParams = TypedDict('EmbeddingCreateParams', {
    'input': Required[Union[str, List[str], List[int], List[List[int]]]],
    'model': Required[Union[str, EmbeddingModel]],
    'dimensions': NotRequired[int],
    'encoding_format': NotRequired[Literal["float", "base64"]],
    'user': NotRequired[str],
}, total=False)

# Input can take several forms
EmbeddingInput = Union[
    str,              # Single text string
    List[str],        # Multiple text strings
    List[int],        # Token IDs for a single text
    List[List[int]]   # Token IDs for multiple texts
]
```

### Model Types

```python { .api }
EmbeddingModel = Literal[
    "text-embedding-3-small",
    "text-embedding-3-large",
    "text-embedding-ada-002"
]

# Model specifications (informational summary, not an SDK type)
EMBEDDING_MODEL_SPECS = {
    "text-embedding-3-small": {
        "max_input": 8191,           # tokens
        "dimensions": 1536,          # default
        "custom_dimensions": True,   # supports dimension reduction
        "performance": "high",
    },
    "text-embedding-3-large": {
        "max_input": 8191,           # tokens
        "dimensions": 3072,          # default
        "custom_dimensions": True,   # supports dimension reduction
        "performance": "highest",
    },
    "text-embedding-ada-002": {
        "max_input": 8191,           # tokens
        "dimensions": 1536,          # fixed
        "custom_dimensions": False,  # no dimension reduction
        "performance": "good",
    },
}
```

### Configuration Types

```python { .api }
EncodingFormat = Literal["float", "base64"]

# Dimension limits by model
ModelDimensions: Dict[str, Dict[str, int]] = {
    "text-embedding-3-small": {"min": 1, "max": 1536, "default": 1536},
    "text-embedding-3-large": {"min": 1, "max": 3072, "default": 3072},
    "text-embedding-ada-002": {"min": 1536, "max": 1536, "default": 1536}
}
```
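
A small sketch that uses the table above to validate a requested size before calling the API; `clamp_dimensions` is a hypothetical helper, not part of the SDK:

```python
def clamp_dimensions(model: str, requested: int) -> int:
    """Clamp a requested embedding size to the model's supported range."""
    limits = ModelDimensions[model]
    return max(limits["min"], min(requested, limits["max"]))

dim = clamp_dimensions("text-embedding-3-small", 2000)  # -> 1536
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Clamped dimensions example",
    dimensions=dim
)
```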

## Best Practices

### Performance Optimization

- Use `text-embedding-3-small` for most applications (good performance, lower cost)
- Use `text-embedding-3-large` when maximum accuracy is needed
- Reduce dimensions for storage efficiency when full precision isn't required
- Batch multiple texts together for better throughput
- Cache embeddings for frequently accessed content (see the sketch after this list)
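
A minimal in-memory caching sketch; `cached_embedding` is a hypothetical helper, and a production system would likely use a persistent store instead:

```python
import hashlib

_embedding_cache = {}

def cached_embedding(text, model="text-embedding-3-small"):
    """Return a cached embedding, calling the API only on a cache miss."""
    key = hashlib.sha256(f"{model}:{text}".encode()).hexdigest()
    if key not in _embedding_cache:
        response = client.embeddings.create(model=model, input=text)
        _embedding_cache[key] = response.data[0].embedding
    return _embedding_cache[key]

vec = cached_embedding("Frequently requested text")  # second call skips the API
```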

### Input Preparation

- Normalize text (lowercase, remove extra whitespace) for consistent results (see the sketch after this list)
- Handle long texts by truncating to model limits (8191 tokens)
- Use meaningful text chunks rather than very short fragments
- Preprocess documents to remove irrelevant content (headers, footers, etc.)
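
A normalization sketch; `normalize_text` is a hypothetical preprocessing helper, not part of the SDK:

```python
import re

def normalize_text(text):
    """Lowercase and collapse whitespace before embedding."""
    return re.sub(r"\s+", " ", text).strip().lower()

clean = normalize_text("  The   Quick\nBrown Fox  ")
print(clean)  # "the quick brown fox"
```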

### Storage and Retrieval

- Use base64 encoding for space-efficient storage
- Consider approximate nearest neighbor search libraries (FAISS, Annoy) for large datasets (see the sketch after this list)
- Store metadata alongside embeddings for result filtering
- Implement caching strategies for frequently accessed embeddings
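
A minimal FAISS sketch, assuming `pip install faiss-cpu` and the `all_embeddings` list built earlier; `IndexFlatIP` is an exact index, and FAISS also offers approximate variants (e.g. `IndexIVFFlat`) for very large datasets:

```python
import faiss
import numpy as np

# Build an index over the document embeddings computed earlier
doc_matrix = np.array(all_embeddings, dtype="float32")
faiss.normalize_L2(doc_matrix)  # unit length: inner product == cosine similarity

index = faiss.IndexFlatIP(doc_matrix.shape[1])
index.add(doc_matrix)

# Embed a query and search
query = client.embeddings.create(
    model="text-embedding-3-small",
    input="machine learning introduction"
).data[0].embedding
query_vec = np.array([query], dtype="float32")
faiss.normalize_L2(query_vec)

scores, indices = index.search(query_vec, 5)  # top-5 nearest documents
print(indices[0], scores[0])
```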

### Similarity Calculations

- Use cosine similarity for most semantic similarity tasks
- Consider Euclidean distance for specific mathematical applications
- Normalize vectors when using dot product similarity (see the sketch after this list)
- Establish similarity thresholds based on your specific use case
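
A short sketch of the cosine/dot-product relationship, assuming `a` and `b` are two embedding vectors from the earlier examples:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two raw embedding vectors."""
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def normalize(v):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    v = np.asarray(v)
    return v / np.linalg.norm(v)

print(cosine(a, b))
print(float(normalize(a) @ normalize(b)))  # matches cosine(a, b)
```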