- Spec files
pypi-openai
Describes: pkg:pypi/openai@1.106.x
- Description
- Official Python library for the OpenAI API providing chat completions, embeddings, audio, images, and more
- Author
- tessl
- Last updated
embeddings.md docs/
1# Embeddings23Convert text into high-dimensional vector representations for semantic similarity, search, clustering, and other NLP tasks using OpenAI's embedding models.45## Capabilities67### Basic Text Embeddings89Generate vector embeddings from text input for semantic analysis and similarity comparisons.1011```python { .api }12def create(13self,14*,15input: Union[str, SequenceNotStr[str], Iterable[int], Iterable[Iterable[int]]],16model: Union[str, EmbeddingModel],17dimensions: int | NotGiven = NOT_GIVEN,18encoding_format: Literal["float", "base64"] | NotGiven = NOT_GIVEN,19user: str | NotGiven = NOT_GIVEN,20extra_headers: Headers | None = None,21extra_query: Query | None = None,22extra_body: Body | None = None,23timeout: float | httpx.Timeout | None | NotGiven = NOT_GIVEN24) -> CreateEmbeddingResponse: ...25```2627Usage examples:2829```python30from openai import OpenAI3132client = OpenAI()3334# Single text embedding35response = client.embeddings.create(36model="text-embedding-3-small",37input="The quick brown fox jumps over the lazy dog"38)3940embedding = response.data[0].embedding41print(f"Embedding dimension: {len(embedding)}")42print(f"First few values: {embedding[:5]}")4344# Multiple texts at once45texts = [46"Machine learning is a subset of artificial intelligence",47"Deep learning uses neural networks with multiple layers",48"Natural language processing enables computers to understand text",49"Computer vision allows machines to interpret visual information"50]5152response = client.embeddings.create(53model="text-embedding-3-small",54input=texts55)5657print(f"Generated {len(response.data)} embeddings")58for i, embedding_data in enumerate(response.data):59print(f"Text {i+1}: {len(embedding_data.embedding)} dimensions")60```6162### Advanced Embedding Models6364Use different embedding models optimized for various use cases and performance requirements.6566Usage examples:6768```python69# High-performance embedding model70response = client.embeddings.create(71model="text-embedding-3-large",72input="This is a sample text for embedding"73)7475large_embedding = response.data[0].embedding76print(f"Large model embedding size: {len(large_embedding)}")7778# Ada model for compatibility79response = client.embeddings.create(80model="text-embedding-ada-002",81input="Legacy embedding model example"82)8384ada_embedding = response.data[0].embedding85print(f"Ada model embedding size: {len(ada_embedding)}")8687# Performance comparison88import time8990text = "Compare embedding generation speed across models"9192# Small model93start = time.time()94response_small = client.embeddings.create(95model="text-embedding-3-small",96input=text97)98small_time = time.time() - start99100# Large model101start = time.time()102response_large = client.embeddings.create(103model="text-embedding-3-large",104input=text105)106large_time = time.time() - start107108print(f"Small model time: {small_time:.3f}s")109print(f"Large model time: {large_time:.3f}s")110```111112### Custom Embedding Dimensions113114Specify custom dimensions for optimized storage and performance in specific applications.115116Usage examples:117118```python119# Reduced dimensions for storage efficiency120response = client.embeddings.create(121model="text-embedding-3-small",122input="Text with custom dimensions",123dimensions=512 # Reduced from default 1536124)125126custom_embedding = response.data[0].embedding127print(f"Custom dimension size: {len(custom_embedding)}")128129# Different dimension sizes for comparison130texts = ["Sample text for dimension comparison"]131132dimensions_to_test = [256, 512, 1024, 1536]133embeddings = {}134135for dim in dimensions_to_test:136response = client.embeddings.create(137model="text-embedding-3-small",138input=texts[0],139dimensions=dim140)141embeddings[dim] = response.data[0].embedding142print(f"Dimensions {dim}: actual size {len(response.data[0].embedding)}")143144# Note: Smaller dimensions lose some information but are more efficient145```146147### Batch Processing and Encoding Formats148149Process large amounts of text efficiently with batch operations and different encoding formats.150151Usage examples:152153```python154# Large batch processing155documents = [156"Document 1: Introduction to machine learning concepts",157"Document 2: Advanced neural network architectures",158"Document 3: Natural language processing applications",159"Document 4: Computer vision and image recognition",160"Document 5: Reinforcement learning algorithms"161] * 100 # 500 documents total162163# Process in batches to handle API limits164batch_size = 100165all_embeddings = []166167for i in range(0, len(documents), batch_size):168batch = documents[i:i + batch_size]169170response = client.embeddings.create(171model="text-embedding-3-small",172input=batch173)174175batch_embeddings = [item.embedding for item in response.data]176all_embeddings.extend(batch_embeddings)177178print(f"Processed batch {i//batch_size + 1}: {len(batch)} documents")179180print(f"Total embeddings generated: {len(all_embeddings)}")181182# Base64 encoding for storage efficiency183response = client.embeddings.create(184model="text-embedding-3-small",185input="Text for base64 encoding",186encoding_format="base64"187)188189base64_embedding = response.data[0].embedding190print(f"Base64 encoded embedding type: {type(base64_embedding)}")191192# Decode base64 embedding when needed193import base64194import struct195196def decode_base64_embedding(base64_str):197"""Convert base64 encoded embedding back to float array"""198binary_data = base64.b64decode(base64_str)199float_array = struct.unpack(f'{len(binary_data)//4}f', binary_data)200return list(float_array)201202# decoded_embedding = decode_base64_embedding(base64_embedding)203```204205### Semantic Similarity and Search206207Use embeddings for semantic similarity calculations and vector-based search applications.208209Usage examples:210211```python212import numpy as np213from sklearn.metrics.pairwise import cosine_similarity214215# Generate embeddings for similarity comparison216queries = [217"What is artificial intelligence?",218"How do neural networks work?",219"Explain machine learning algorithms"220]221222documents = [223"Artificial intelligence is the simulation of human intelligence by machines",224"Neural networks are computing systems inspired by biological neural networks",225"Machine learning algorithms enable computers to learn from data",226"Deep learning uses multi-layered neural networks for complex pattern recognition",227"Natural language processing helps computers understand human language"228]229230# Get query embeddings231query_response = client.embeddings.create(232model="text-embedding-3-small",233input=queries234)235query_embeddings = np.array([item.embedding for item in query_response.data])236237# Get document embeddings238doc_response = client.embeddings.create(239model="text-embedding-3-small",240input=documents241)242doc_embeddings = np.array([item.embedding for item in doc_response.data])243244# Calculate similarities245similarities = cosine_similarity(query_embeddings, doc_embeddings)246247# Find best matches for each query248for i, query in enumerate(queries):249best_doc_idx = np.argmax(similarities[i])250similarity_score = similarities[i][best_doc_idx]251252print(f"Query: {query}")253print(f"Best match: {documents[best_doc_idx]}")254print(f"Similarity: {similarity_score:.3f}\n")255256# Vector search function257def vector_search(query, documents, top_k=3):258"""Perform vector-based semantic search"""259260# Get query embedding261query_response = client.embeddings.create(262model="text-embedding-3-small",263input=query264)265query_embedding = np.array(query_response.data[0].embedding).reshape(1, -1)266267# Get document embeddings268doc_response = client.embeddings.create(269model="text-embedding-3-small",270input=documents271)272doc_embeddings = np.array([item.embedding for item in doc_response.data])273274# Calculate similarities275similarities = cosine_similarity(query_embedding, doc_embeddings)[0]276277# Get top k results278top_indices = np.argsort(similarities)[::-1][:top_k]279280results = []281for idx in top_indices:282results.append({283'document': documents[idx],284'similarity': similarities[idx],285'index': idx286})287288return results289290# Example search291search_results = vector_search(292"How do computers learn?",293documents,294top_k=3295)296297print("Search results:")298for i, result in enumerate(search_results):299print(f"{i+1}. {result['document']} (similarity: {result['similarity']:.3f})")300```301302### Token-Based Input303304Use tokenized input for precise control over embedding generation and handling of long texts.305306Usage examples:307308```python309import tiktoken310311# Get tokenizer for embedding model312encoding = tiktoken.encoding_for_model("text-embedding-3-small")313314# Tokenize input text315text = "This is a sample text that will be tokenized for embedding generation."316tokens = encoding.encode(text)317318print(f"Original text: {text}")319print(f"Tokens: {tokens}")320print(f"Token count: {len(tokens)}")321322# Generate embedding from tokens323response = client.embeddings.create(324model="text-embedding-3-small",325input=tokens # Pass tokens directly326)327328token_embedding = response.data[0].embedding329print(f"Embedding from tokens: {len(token_embedding)} dimensions")330331# Handle long texts by truncation332max_tokens = 8192 # Model limit for text-embedding-3-small333334long_text = "Very long document content..." * 1000335long_tokens = encoding.encode(long_text)336337if len(long_tokens) > max_tokens:338truncated_tokens = long_tokens[:max_tokens]339print(f"Truncated from {len(long_tokens)} to {len(truncated_tokens)} tokens")340341response = client.embeddings.create(342model="text-embedding-3-small",343input=truncated_tokens344)345346truncated_embedding = response.data[0].embedding347print(f"Embedding generated from truncated tokens")348349# Multiple token sequences350token_sequences = [351encoding.encode("First document for embedding"),352encoding.encode("Second document for embedding"),353encoding.encode("Third document for embedding")354]355356response = client.embeddings.create(357model="text-embedding-3-small",358input=token_sequences359)360361print(f"Generated embeddings for {len(response.data)} token sequences")362```363364## Types365366### Core Response Types367368```python { .api }369class CreateEmbeddingResponse(BaseModel):370data: List[Embedding]371model: str372object: Literal["list"]373usage: EmbeddingUsage374375class Embedding(BaseModel):376embedding: List[float]377index: int378object: Literal["embedding"]379380class EmbeddingUsage(BaseModel):381prompt_tokens: int382total_tokens: int383```384385### Parameter Types386387```python { .api }388EmbeddingCreateParams = TypedDict('EmbeddingCreateParams', {389'input': Required[Union[str, List[str], List[int], List[List[int]]]],390'model': Required[Union[str, EmbeddingModel]],391'dimensions': NotRequired[int],392'encoding_format': NotRequired[Literal["float", "base64"]],393'user': NotRequired[str],394}, total=False)395396# Input can be various formats397EmbeddingInput = Union[398str, # Single text string399List[str], # Multiple text strings400List[int], # Token IDs for single text401List[List[int]] # Token IDs for multiple texts402]403```404405### Model Types406407```python { .api }408EmbeddingModel = Literal[409"text-embedding-3-small",410"text-embedding-3-large",411"text-embedding-ada-002"412]413414# Model specifications415class EmbeddingModelSpecs:416text_embedding_3_small: {417'max_input': 8192, # tokens418'dimensions': 1536, # default419'custom_dimensions': True, # supports dimension reduction420'performance': 'high'421}422423text_embedding_3_large: {424'max_input': 8192, # tokens425'dimensions': 3072, # default426'custom_dimensions': True, # supports dimension reduction427'performance': 'highest'428}429430text_embedding_ada_002: {431'max_input': 8192, # tokens432'dimensions': 1536, # fixed433'custom_dimensions': False, # no dimension reduction434'performance': 'good'435}436```437438### Configuration Types439440```python { .api }441EncodingFormat = Literal["float", "base64"]442443# Dimension limits by model444ModelDimensions = Dict[str, Dict[str, int]] = {445"text-embedding-3-small": {"min": 1, "max": 1536, "default": 1536},446"text-embedding-3-large": {"min": 1, "max": 3072, "default": 3072},447"text-embedding-ada-002": {"min": 1536, "max": 1536, "default": 1536}448}449```450451## Best Practices452453### Performance Optimization454455- Use `text-embedding-3-small` for most applications (good performance, lower cost)456- Use `text-embedding-3-large` when maximum accuracy is needed457- Reduce dimensions for storage efficiency when full precision isn't required458- Batch multiple texts together for better throughput459- Cache embeddings for frequently accessed content460461### Input Preparation462463- Normalize text (lowercase, remove extra whitespace) for consistent results464- Handle long texts by truncating to model limits (8192 tokens)465- Use meaningful text chunks rather than very short fragments466- Preprocess documents to remove irrelevant content (headers, footers, etc.)467468### Storage and Retrieval469470- Use base64 encoding for space-efficient storage471- Consider approximate nearest neighbor search libraries (FAISS, Annoy) for large datasets472- Store metadata alongside embeddings for result filtering473- Implement caching strategies for frequently accessed embeddings474475### Similarity Calculations476477- Use cosine similarity for most semantic similarity tasks478- Consider Euclidean distance for specific mathematical applications479- Normalize vectors when using dot product similarity480- Establish similarity thresholds based on your specific use case