Tessl Tile for pypi/tavily-python@0.7.0

# Hybrid RAG

Combine Tavily's web search with local vector database search for RAG applications. The `TavilyHybridClient` merges web search results with your existing document collection, so answers can draw on both fresh external content and relevant local context.

## Capabilities

### Hybrid RAG Client

The `TavilyHybridClient` combines Tavily API search with local vector database queries. It supports embedding generation, result ranking, and optional storage of web results in your database.

```python { .api }
class TavilyHybridClient:
    def __init__(
        self,
        api_key: Union[str, None],
        db_provider: Literal['mongodb'],
        collection,
        index: str,
        embeddings_field: str = 'embeddings',
        content_field: str = 'content',
        embedding_function: Optional[callable] = None,
        ranking_function: Optional[callable] = None
    ):
        """
        Initialize hybrid RAG client combining Tavily API with local database.

        Parameters:
        - api_key: Tavily API key (or None to use TAVILY_API_KEY env var)
        - db_provider: Database provider ("mongodb" only currently supported)
        - collection: MongoDB collection object for local search
        - index: Name of the vector search index in the collection
        - embeddings_field: Field name containing embeddings (default: 'embeddings')
        - content_field: Field name containing text content (default: 'content')
        - embedding_function: Custom embedding function (defaults to Cohere)
        - ranking_function: Custom ranking function (defaults to Cohere rerank)
        """
```

### Hybrid Search

Search the local database and the web in a single call. Results from both sources are ranked together, and web results can optionally be saved to the local database.

```python { .api }
def search(
    self,
    query: str,
    max_results: int = 10,
    max_local: int = None,
    max_foreign: int = None,
    save_foreign: bool = False,
    **kwargs
) -> list:
    """
    Perform hybrid search combining local database and Tavily API results.

    Parameters:
    - query: Search query string
    - max_results: Maximum number of final ranked results to return
    - max_local: Maximum results from local database (defaults to max_results)
    - max_foreign: Maximum results from Tavily API (defaults to max_results)
    - save_foreign: Whether (and how) to save Tavily results to local database;
      accepts a bool or a callable
      - True: Save results as-is with content and embeddings
      - callable: Transform function to process results before saving
      - False: Don't save results
    - **kwargs: Additional parameters passed to Tavily search

    Returns:
    List of ranked search results containing:
    - content: Content text
    - score: Relevance score
    - origin: Source ("local" or "foreign")
    """
```
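
The documented return shape (a list of dicts with `content`, `score`, and `origin`) lends itself to simple post-processing. The following is a minimal sketch that groups results by origin. It uses hand-written sample data instead of a live call, and `split_by_origin` is a hypothetical helper, not part of tavily-python.

```python
from collections import defaultdict


def split_by_origin(results):
    """Group hybrid search results by their 'origin' field, preserving rank order."""
    grouped = defaultdict(list)
    for result in results:
        grouped[result["origin"]].append(result)
    return dict(grouped)


# Hand-written sample shaped like the documented return value (not real output)
sample_results = [
    {"content": "Internal design notes", "score": 0.91, "origin": "local"},
    {"content": "Recent news article", "score": 0.84, "origin": "foreign"},
    {"content": "Archived report", "score": 0.62, "origin": "local"},
]

grouped = split_by_origin(sample_results)
print({origin: len(items) for origin, items in grouped.items()})
```

Grouping after ranking keeps each group in relevance order, which is convenient when local and web passages are cited separately in a prompt.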

## Setup and Configuration

### MongoDB Setup

Configure a MongoDB collection with a vector search index:

```python
import pymongo
from tavily import TavilyHybridClient

# Connect to MongoDB
# Note: vector search indexes require MongoDB Atlas (or an Atlas local deployment);
# replace the URI below with your own connection string.
client = pymongo.MongoClient("mongodb://localhost:27017/")
db = client["my_rag_database"]
collection = db["documents"]

# Create vector search index (run once)
collection.create_search_index({
    "name": "vector_index",
    "type": "vectorSearch",  # Required for vector indexes (PyMongo 4.7+)
    "definition": {
        "fields": [
            {
                "type": "vector",
                "path": "embeddings",   # Must match embeddings_field
                "numDimensions": 1024,  # Adjust based on your embedding model
                "similarity": "cosine"
            }
        ]
    }
})

# Initialize hybrid client
hybrid_client = TavilyHybridClient(
    api_key="tvly-YOUR_API_KEY",
    db_provider="mongodb",
    collection=collection,
    index="vector_index"
)
```
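
Local search can only return documents that already carry both a text field and an embedding field matching the client's `content_field` and `embeddings_field` settings. The sketch below shows one way such documents might be prepared before insertion. `build_documents` and `toy_embed` are hypothetical names, and `toy_embed` is a placeholder that does not produce meaningful embeddings.

```python
def build_documents(texts, embed, content_field="content", embeddings_field="embeddings"):
    """Pair each text with its embedding under the configured field names."""
    vectors = embed(texts, "search_document")
    if len(vectors) != len(texts):
        raise ValueError("embedding function returned a different number of vectors")
    return [
        {content_field: text, embeddings_field: list(vector)}
        for text, vector in zip(texts, vectors)
    ]


def toy_embed(texts, input_type):
    # Placeholder only: NOT a real embedding model. Swap in your own function.
    return [[float(len(text)), float(text.count(" "))] for text in texts]


docs = build_documents(["first local note", "second note"], toy_embed)
print(docs[0])

# With a real collection, the documents could then be stored with:
# collection.insert_many(docs)
```

The embedding function used for seeding should be the same one the client uses at query time, otherwise local vectors and query vectors will not be comparable.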

### Custom Embedding Functions

Use custom embedding and ranking functions instead of the default Cohere integration:

```python
from sentence_transformers import SentenceTransformer

# Custom embedding function using sentence-transformers
# Note: all-MiniLM-L6-v2 produces 384-dimensional vectors, so the vector
# index's numDimensions must be 384 when this model is used.
model = SentenceTransformer('all-MiniLM-L6-v2')

def custom_embed_function(texts, input_type):
    """
    Custom embedding function compatible with TavilyHybridClient.
    
    Args:
        texts: List of text strings to embed
        input_type: 'search_query' or 'search_document'
            (unused here; this model embeds both the same way)
    
    Returns:
        List of embedding vectors
    """
    embeddings = model.encode(texts)
    return embeddings.tolist()

# Custom ranking function
def custom_ranking_function(query, documents, top_n):
    """
    Custom ranking function for result reordering.
    
    Args:
        query: Search query string
        documents: List of document dicts with 'content' field
        top_n: Number of top results to return
    
    Returns:
        List of reranked documents with 'score' field added
    """
    # Simple keyword-based scoring (replace with your ranking logic)
    query_words = set(query.lower().split())
    
    scored_docs = []
    for doc in documents:
        content_words = set(doc['content'].lower().split())
        overlap = len(query_words.intersection(content_words))
        doc_with_score = doc.copy()
        # Fraction of query words that appear in the document
        doc_with_score['score'] = overlap / len(query_words) if query_words else 0
        scored_docs.append(doc_with_score)
    
    # Sort by score and return top N
    scored_docs.sort(key=lambda x: x['score'], reverse=True)
    return scored_docs[:top_n]

# Initialize with custom functions
hybrid_client = TavilyHybridClient(
    api_key="tvly-YOUR_API_KEY",
    db_provider="mongodb",
    collection=collection,
    index="vector_index",
    embedding_function=custom_embed_function,
    ranking_function=custom_ranking_function
)
```
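
When swapping embedding models, the vector length has to agree with the `numDimensions` declared in the search index. For example, `all-MiniLM-L6-v2` yields 384-dimensional vectors, while the index in the MongoDB setup example declares 1024. A small pre-flight check can catch that mismatch early. This is a sketch only: `check_embedding_dimensions` is a hypothetical helper, and `fake_embed_384` stands in for a real model so the example runs without downloads.

```python
def check_embedding_dimensions(embed, expected_dims):
    """Embed a probe string and verify the vector length matches the index."""
    vectors = embed(["dimension probe"], "search_query")
    actual_dims = len(vectors[0])
    if actual_dims != expected_dims:
        raise ValueError(
            f"Embedding function returns {actual_dims} dimensions, "
            f"but the index expects {expected_dims}"
        )
    return actual_dims


def fake_embed_384(texts, input_type):
    # Stand-in for a real model; returns zero vectors of length 384
    return [[0.0] * 384 for _ in texts]


# Matches: passes silently and returns the dimension count
print(check_embedding_dimensions(fake_embed_384, 384))

# Mismatch: the index was declared with 1024 dimensions
try:
    check_embedding_dimensions(fake_embed_384, 1024)
except ValueError as error:
    print(f"Configuration problem: {error}")
```

Running a check like this once at startup is cheaper than discovering the mismatch through empty or failing local searches.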

## Usage Patterns

### Basic Hybrid Search

Combine local knowledge with web search:

```python
# Initialize hybrid client
hybrid_client = TavilyHybridClient(
    api_key="tvly-YOUR_API_KEY",
    db_provider="mongodb",
    collection=collection,
    index="vector_index"
)

# Perform hybrid search
results = hybrid_client.search(
    query="latest developments in quantum computing",
    max_results=10,
    max_local=5,    # Get up to 5 local results
    max_foreign=5   # Get up to 5 web results
)

# Process combined results
for result in results:
    print(f"Source: {result['origin']}")
    print(f"Score: {result['score']:.3f}")
    print(f"Content: {result['content'][:200]}...")
    print("---")
```

### Save Web Results to Database

Automatically expand your local knowledge base with relevant web content:

```python
# Search and save web results to local database
results = hybrid_client.search(
    query="machine learning best practices",
    max_results=8,
    save_foreign=True,  # Save web results to database
    search_depth="advanced",
    topic="general"
)

print(f"Found {len(results)} total results")
local_count = len([r for r in results if r['origin'] == 'local'])
foreign_count = len([r for r in results if r['origin'] == 'foreign'])
print(f"Local: {local_count}, Web: {foreign_count}")
```

### Custom Result Processing

Transform web results before saving them to the database:

```python
from datetime import datetime, timezone

def process_web_result(result):
    """
    Custom function to process web results before saving to database.
    
    Args:
        result: Web search result dict with 'content', 'embeddings', etc.
    
    Returns:
        Dict to save to database, or None to skip saving
    """
    # Skip very short content
    if len(result['content']) < 100:
        return None
    
    # Add metadata
    processed = {
        'content': result['content'],
        'embeddings': result['embeddings'],
        'source_url': result.get('url', ''),
        'added_date': datetime.now(timezone.utc),
        'content_type': 'web_search',
        'content_length': len(result['content'])
    }
    
    return processed

# Use custom processing
results = hybrid_client.search(
    query="renewable energy technologies",
    save_foreign=process_web_result,  # Use custom processing function
    max_results=10
)
```

## Advanced Use Cases

### Domain-Specific RAG

Create specialized RAG systems for specific domains:

```python
# Medical RAG with domain filtering
medical_results = hybrid_client.search(
    query="treatment options for type 2 diabetes",
    max_results=12,
    max_local=8,        # Prioritize local medical knowledge
    max_foreign=4,      # Limited web results
    include_domains=[   # Focus on medical sources
        "pubmed.ncbi.nlm.nih.gov",
        "mayoclinic.org",
        "nejm.org",
        "bmj.com"
    ],
    save_foreign=True,
    search_depth="advanced"
)

# Legal RAG with case law focus
# process_legal_content is a user-defined callable (not shown here) with the
# same contract as any save_foreign transform: result dict in, dict or None out.
legal_results = hybrid_client.search(
    query="precedent for intellectual property disputes",
    max_results=10,
    include_domains=[
        "law.cornell.edu",
        "justia.com",
        "findlaw.com"
    ],
    save_foreign=process_legal_content,  # Custom legal content processor
    topic="general"
)
```

### Temporal Knowledge Updates

Keep your knowledge base current with fresh web content:

```python
def update_knowledge_base():
    """Periodically update knowledge base with fresh web content."""
    
    # Define topics of interest
    topics = [
        "artificial intelligence developments",
        "climate change research",
        "medical breakthroughs",
        "technology innovations"
    ]
    
    for topic in topics:
        print(f"Updating knowledge for: {topic}")
        
        # Search with time constraints to get recent content
        results = hybrid_client.search(
            query=topic,
            max_results=5,
            max_foreign=5,      # Up to 5 web results
            max_local=0,        # Skip local results for updates
            time_range="week",  # Recent content only
            save_foreign=True,  # Save to expand knowledge base
            search_depth="advanced"
        )
        
        # Counts web results in the ranked output, not rows written to the database
        web_count = len([r for r in results if r['origin'] == 'foreign'])
        print(f"Retrieved {web_count} web results")

# Run periodically (e.g., daily via cron job)
update_knowledge_base()
```

### Multi-Modal Knowledge Integration

Combine different types of content in your RAG system:

```python
from datetime import datetime, timezone

def enhanced_document_processor(result):
    """Process web results with enhanced metadata extraction."""
    
    content = result['content']
    
    # Basic content analysis
    word_count = len(content.split())
    has_code = '```' in content or 'def ' in content or 'class ' in content
    has_data = any(word in content.lower() for word in ['data', 'statistics', 'metrics'])
    
    processed = {
        'content': content,
        'embeddings': result['embeddings'],
        'metadata': {
            'source_url': result.get('url', ''),
            'word_count': word_count,
            'content_type': 'code' if has_code else 'data' if has_data else 'text',
            'processed_date': datetime.now(timezone.utc),
            'embedding_model': 'cohere-embed-english-v3.0'
        }
    }
    
    return processed

# Search with enhanced processing
results = hybrid_client.search(
    query="Python web scraping techniques",
    save_foreign=enhanced_document_processor,
    max_results=10,
    include_raw_content="markdown"  # Preserve formatting for code examples
)
```

## Error Handling and Validation

Handle errors in hybrid RAG operations:

```python
from tavily import TavilyHybridClient, InvalidAPIKeyError
import pymongo.errors

try:
    # Initialize client
    hybrid_client = TavilyHybridClient(
        api_key="tvly-YOUR_API_KEY",
        db_provider="mongodb",
        collection=collection,
        index="vector_index"
    )
    
    # Perform search with error handling
    results = hybrid_client.search(
        query="example query",
        max_results=10,
        save_foreign=True
    )
    
except ValueError as e:
    # Handle database configuration errors
    print(f"Database configuration error: {e}")
    
except InvalidAPIKeyError:
    # Handle Tavily API key errors
    print("Invalid Tavily API key")
    
except pymongo.errors.PyMongoError as e:
    # Handle MongoDB errors
    print(f"Database error: {e}")
    
except Exception as e:
    # Handle unexpected errors
    print(f"Unexpected error: {e}")
```
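
Some failures, such as network timeouts, are transient and may succeed on a second attempt, while others, such as an invalid API key, will not. The sketch below shows a generic retry wrapper with exponential backoff. `with_retries` is a hypothetical helper, not part of tavily-python, and `flaky_search` simulates a search call so the example runs offline.

```python
import time


def with_retries(operation, attempts=3, base_delay=1.0,
                 retry_on=(Exception,), sleep=time.sleep):
    """Call operation(), retrying on the given exceptions with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # Out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))


# Simulated search that fails twice before succeeding
calls = {"count": 0}

def flaky_search():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated transient failure")
    return [{"content": "ok", "score": 1.0, "origin": "foreign"}]


results = with_retries(
    flaky_search,
    attempts=3,
    base_delay=0.01,
    retry_on=(ConnectionError,),  # Retry only errors that are plausibly transient
)
print(f"Succeeded after {calls['count']} attempts with {len(results)} result(s)")
```

In practice the operation could be something like `lambda: hybrid_client.search(query="example query", max_results=10)`. Keeping `retry_on` narrow avoids repeating calls that fail for permanent reasons.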

## Performance Optimization

Optimize hybrid RAG performance:

```python
# Balanced search configuration
optimized_results = hybrid_client.search(
    query="query",
    max_results=10,     # Reasonable result count
    max_local=7,        # Favor local results (faster)
    max_foreign=5,      # Limit web requests
    timeout=30,         # Reasonable timeout
    search_depth="basic" # Faster web search
)

# Batch processing for multiple queries
queries = ["query1", "query2", "query3"]
all_results = []

for query in queries:
    try:
        results = hybrid_client.search(
            query=query,
            max_results=5,  # Smaller batches
            save_foreign=False  # Skip saving for batch processing
        )
        all_results.extend(results)
    except Exception as e:
        print(f"Failed to process query '{query}': {e}")
        continue

print(f"Processed {len(all_results)} total results from {len(queries)} queries")
```
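
Because each search is dominated by network and database I/O, a batch like the one above might also be run concurrently with a thread pool. The sketch below assumes the search callable is safe to invoke from multiple threads, which this document does not confirm for `TavilyHybridClient`. `run_batch` is a hypothetical helper and `stub_search` replaces the real client so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor


def run_batch(search, queries, max_workers=4):
    """Run search(query) for each query concurrently; collect results and errors."""
    results_by_query = {}
    errors = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything first, then gather in the original query order
        futures = {query: pool.submit(search, query) for query in queries}
        for query, future in futures.items():
            try:
                results_by_query[query] = future.result()
            except Exception as exc:
                errors[query] = exc
    return results_by_query, errors


def stub_search(query):
    # Stand-in for a real search call such as hybrid_client.search(query=query)
    if query == "bad":
        raise RuntimeError("simulated failure")
    return [{"content": f"result for {query}", "score": 1.0, "origin": "local"}]


ok, failed = run_batch(stub_search, ["query1", "bad", "query3"])
print(f"Succeeded: {sorted(ok)}, failed: {sorted(failed)}")
```

A modest `max_workers` value keeps request volume within API rate limits. Duplicate queries collapse into one entry here because results are keyed by query string.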
