# Hybrid RAG

Combine Tavily web search with vector search over a local database for retrieval-augmented generation (RAG). The `TavilyHybridClient` merges web search results with your existing document collection, giving an application both fresh external content and relevant local context.

## Capabilities

### Hybrid RAG Client

The `TavilyHybridClient` combines Tavily API search with local vector database queries. It handles embedding generation and result ranking, and can optionally store web results in your database.

```python { .api }
class TavilyHybridClient:
    def __init__(
        self,
        api_key: Union[str, None],
        db_provider: Literal['mongodb'],
        collection,
        index: str,
        embeddings_field: str = 'embeddings',
        content_field: str = 'content',
        embedding_function: Optional[callable] = None,
        ranking_function: Optional[callable] = None
    ):
        """
        Initialize hybrid RAG client combining Tavily API with local database.

        Parameters:
        - api_key: Tavily API key (or None to use TAVILY_API_KEY env var)
        - db_provider: Database provider ("mongodb" only currently supported)
        - collection: MongoDB collection object for local search
        - index: Name of the vector search index in the collection
        - embeddings_field: Field name containing embeddings (default: 'embeddings')
        - content_field: Field name containing text content (default: 'content')
        - embedding_function: Custom embedding function (defaults to Cohere)
        - ranking_function: Custom ranking function (defaults to Cohere rerank)
        """
```

### Hybrid Search

Search the local database and the web in one call, rerank the merged results, and optionally store the web results.

```python { .api }
def search(
    self,
    query: str,
    max_results: int = 10,
    max_local: Optional[int] = None,
    max_foreign: Optional[int] = None,
    save_foreign: Union[bool, callable] = False,
    **kwargs
) -> list:
    """
    Perform hybrid search combining local database and Tavily API results.

    Parameters:
    - query: Search query string
    - max_results: Maximum number of final ranked results to return
    - max_local: Maximum results from local database (defaults to max_results)
    - max_foreign: Maximum results from Tavily API (defaults to max_results)
    - save_foreign: Whether to save Tavily results to local database
        - True: Save results as-is with content and embeddings
        - callable: Transform function to process results before saving
        - False: Don't save results
    - **kwargs: Additional parameters passed to Tavily search

    Returns:
    List of ranked search results containing:
    - content: Content text
    - score: Relevance score
    - origin: Source ("local" or "foreign")
    """
```

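The merge-and-rerank flow described above can be illustrated without a database or network access. The following is a minimal sketch of the idea, not the library's implementation: `merge_and_rank` and `overlap_rank` are hypothetical helpers, and the toy ranker stands in for the default Cohere rerank.

```python
from typing import Callable, Dict, List

Result = Dict[str, object]


def merge_and_rank(
    query: str,
    local: List[Result],
    foreign: List[Result],
    rank: Callable[[str, List[Result], int], List[Result]],
    max_results: int = 10,
) -> List[Result]:
    """Label each result with its origin, then rerank the combined pool."""
    pool = [{**doc, "origin": "local"} for doc in local]
    pool += [{**doc, "origin": "foreign"} for doc in foreign]
    return rank(query, pool, max_results)


def overlap_rank(query: str, documents: List[Result], top_n: int) -> List[Result]:
    """Toy ranker: fraction of query words that appear in the content."""
    words = set(query.lower().split())
    scored = []
    for doc in documents:
        content_words = set(str(doc["content"]).lower().split())
        score = len(words & content_words) / len(words) if words else 0.0
        scored.append({**doc, "score": score})
    scored.sort(key=lambda d: d["score"], reverse=True)
    return scored[:top_n]


merged = merge_and_rank(
    "vector search",
    local=[{"content": "notes on vector search indexes"}],
    foreign=[{"content": "a recipe for soup"}, {"content": "search engines"}],
    rank=overlap_rank,
    max_results=2,
)
for item in merged:
    print(item["origin"], item["score"], item["content"])
```

Each returned item carries the same three keys that `search` documents (`content`, `score`, `origin`), so downstream code written against this sketch should carry over to real results.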
## Setup and Configuration

### MongoDB Setup

Create a vector search index on the collection, then pass the collection and the index name to the client:

```python
import pymongo
from pymongo.operations import SearchIndexModel
from tavily import TavilyHybridClient

# Connect to MongoDB. Vector search indexes need a deployment that supports
# Atlas Vector Search; adjust the URI for your environment.
client = pymongo.MongoClient("mongodb://localhost:27017/")
db = client["my_rag_database"]
collection = db["documents"]

# Create vector search index (run once).
# type="vectorSearch" requires PyMongo 4.7 or later.
collection.create_search_index(
    SearchIndexModel(
        name="vector_index",
        type="vectorSearch",
        definition={
            "fields": [
                {
                    "type": "vector",
                    "path": "embeddings",
                    "numDimensions": 1024,  # Adjust based on your embedding model
                    "similarity": "cosine"
                }
            ]
        },
    )
)

# Initialize hybrid client
hybrid_client = TavilyHybridClient(
    api_key="tvly-YOUR_API_KEY",
    db_provider="mongodb",
    collection=collection,
    index="vector_index"
)
```

### Custom Embedding Functions

Supply your own embedding and ranking functions instead of the default Cohere integration:

```python
from sentence_transformers import SentenceTransformer

# Custom embedding function using sentence-transformers.
# all-MiniLM-L6-v2 returns 384-dimensional vectors; numDimensions in the
# vector search index must match.
model = SentenceTransformer('all-MiniLM-L6-v2')

def custom_embed_function(texts, input_type):
    """
    Custom embedding function compatible with TavilyHybridClient.

    Args:
        texts: List of text strings to embed
        input_type: 'search_query' or 'search_document'

    Returns:
        List of embedding vectors
    """
    embeddings = model.encode(texts)
    return embeddings.tolist()

# Custom ranking function
def custom_ranking_function(query, documents, top_n):
    """
    Custom ranking function for result reordering.

    Args:
        query: Search query string
        documents: List of document dicts with 'content' field
        top_n: Number of top results to return

    Returns:
        List of reranked documents with 'score' field added
    """
    # Simple keyword-based scoring (replace with your ranking logic)
    query_words = set(query.lower().split())

    scored_docs = []
    for doc in documents:
        content_words = set(doc['content'].lower().split())
        overlap = len(query_words.intersection(content_words))
        doc_with_score = doc.copy()
        doc_with_score['score'] = overlap / len(query_words) if query_words else 0
        scored_docs.append(doc_with_score)

    # Sort by score and return top N
    scored_docs.sort(key=lambda x: x['score'], reverse=True)
    return scored_docs[:top_n]

# Initialize with custom functions
hybrid_client = TavilyHybridClient(
    api_key="tvly-YOUR_API_KEY",
    db_provider="mongodb",
    collection=collection,
    index="vector_index",
    embedding_function=custom_embed_function,
    ranking_function=custom_ranking_function
)
```

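A mismatch between the vector length an embedding function returns and the `numDimensions` declared in the index is easy to miss until queries fail or return nothing. One way to catch it early is to probe the function once before building the client. This is a sketch only: `check_embedding_dimensions` and `fake_embed` are hypothetical names, and `fake_embed` stands in for a real model so the example runs offline.

```python
from typing import Callable, List, Sequence


def check_embedding_dimensions(
    embed: Callable[[List[str], str], Sequence[Sequence[float]]],
    expected_dimensions: int,
) -> int:
    """Embed a probe string and compare its length with the index setting."""
    vector = embed(["dimension probe"], "search_query")[0]
    actual = len(vector)
    if actual != expected_dimensions:
        raise ValueError(
            f"Embedding function returns {actual}-dimensional vectors, "
            f"but the index expects {expected_dimensions}."
        )
    return actual


# Stand-in embedding function (assumption) that mimics a 384-dimensional model.
def fake_embed(texts: List[str], input_type: str) -> List[List[float]]:
    return [[0.0] * 384 for _ in texts]


print(check_embedding_dimensions(fake_embed, 384))
```

In a real setup the same call would take `custom_embed_function` and the `numDimensions` value used when the index was created.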
## Usage Patterns

### Basic Hybrid Search

Combine local knowledge with web search:

```python
# Initialize hybrid client
hybrid_client = TavilyHybridClient(
    api_key="tvly-YOUR_API_KEY",
    db_provider="mongodb",
    collection=collection,
    index="vector_index"
)

# Perform hybrid search
results = hybrid_client.search(
    query="latest developments in quantum computing",
    max_results=10,
    max_local=5,    # Get up to 5 local results
    max_foreign=5   # Get up to 5 web results
)

# Process combined results
for result in results:
    print(f"Source: {result['origin']}")
    print(f"Score: {result['score']:.3f}")
    print(f"Content: {result['content'][:200]}...")
    print("---")
```

### Save Web Results to Database

Automatically expand your local knowledge base with relevant web content:

```python
# Search and save web results to local database
results = hybrid_client.search(
    query="machine learning best practices",
    max_results=8,
    save_foreign=True,  # Save web results to database
    search_depth="advanced",
    topic="general"
)

print(f"Found {len(results)} total results")
local_count = len([r for r in results if r['origin'] == 'local'])
foreign_count = len([r for r in results if r['origin'] == 'foreign'])
print(f"Local: {local_count}, Web: {foreign_count}")
```

### Custom Result Processing

Transform web results before saving them to the database:

```python
from datetime import datetime, timezone

def process_web_result(result):
    """
    Custom function to process web results before saving to database.

    Args:
        result: Web search result dict with 'content', 'embeddings', etc.

    Returns:
        Dict to save to database, or None to skip saving
    """
    # Skip very short content
    if len(result['content']) < 100:
        return None

    # Add metadata
    return {
        'content': result['content'],
        'embeddings': result['embeddings'],
        'source_url': result.get('url', ''),
        'added_date': datetime.now(timezone.utc),
        'content_type': 'web_search',
        'content_length': len(result['content'])
    }

# Use custom processing
results = hybrid_client.search(
    query="renewable energy technologies",
    save_foreign=process_web_result,  # Use custom processing function
    max_results=10
)
```

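A transform can be exercised against hand-written result dicts before it is passed to `search`, which makes the skip-or-save behaviour easy to verify. The sketch below assumes that a saved document should use the same field names the client was configured with (`content_field`, `embeddings_field`) so that later local searches can find it. `make_save_transform` is a hypothetical helper, not part of the library.

```python
from typing import Callable, Dict, Optional


def make_save_transform(
    min_chars: int,
    content_field: str = "content",
    embeddings_field: str = "embeddings",
) -> Callable[[Dict], Optional[Dict]]:
    """Build a save_foreign-style transform that skips short content."""

    def transform(result: Dict) -> Optional[Dict]:
        content = result.get("content", "")
        if len(content) < min_chars:
            return None  # Returning None means "do not save this result"
        return {
            content_field: content,
            embeddings_field: result["embeddings"],
            "source_url": result.get("url", ""),
        }

    return transform


transform = make_save_transform(min_chars=20)

# Hand-written sample results (assumption: real results carry these keys).
short = transform({"content": "too short", "embeddings": [0.1]})
kept = transform(
    {"content": "x" * 30, "embeddings": [0.1, 0.2], "url": "https://example.com"}
)
print(short, kept["source_url"])
```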
## Advanced Use Cases

### Domain-Specific RAG

Restrict web results to trusted domains to build a RAG system for a specific field:

```python
# Medical RAG with domain filtering
medical_results = hybrid_client.search(
    query="treatment options for type 2 diabetes",
    max_results=12,
    max_local=8,        # Prioritize local medical knowledge
    max_foreign=4,      # Limited web results
    include_domains=[   # Focus on medical sources
        "pubmed.ncbi.nlm.nih.gov",
        "mayoclinic.org",
        "nejm.org",
        "bmj.com"
    ],
    save_foreign=True,
    search_depth="advanced"
)

# Legal RAG with case law focus.
# process_legal_content is a user-defined transform with the same contract as
# process_web_result; it is not defined in this document.
legal_results = hybrid_client.search(
    query="precedent for intellectual property disputes",
    max_results=10,
    include_domains=[
        "law.cornell.edu",
        "justia.com",
        "findlaw.com"
    ],
    save_foreign=process_legal_content,  # Custom legal content processor
    topic="general"
)
```

### Temporal Knowledge Updates

Keep your knowledge base current with fresh web content:

```python
def update_knowledge_base():
    """Periodically update knowledge base with fresh web content."""

    # Define topics of interest
    topics = [
        "artificial intelligence developments",
        "climate change research",
        "medical breakthroughs",
        "technology innovations"
    ]

    for topic in topics:
        print(f"Updating knowledge for: {topic}")

        # Search with time constraints to get recent content
        results = hybrid_client.search(
            query=topic,
            max_results=5,
            max_foreign=5,       # Only get web results
            max_local=0,         # Skip local results for updates
            time_range="week",   # Recent content only
            save_foreign=True,   # Save to expand knowledge base
            search_depth="advanced"
        )

        print(f"Added {len([r for r in results if r['origin'] == 'foreign'])} new documents")

# Run periodically (e.g., daily via cron job)
update_knowledge_base()
```

### Multi-Modal Knowledge Integration

Tag saved web content by type (code, data, or plain text) so that different kinds of content can be told apart in your RAG system:

```python
from datetime import datetime, timezone

def enhanced_document_processor(result):
    """Process web results with enhanced metadata extraction."""

    content = result['content']

    # Basic content analysis
    word_count = len(content.split())
    has_code = '```' in content or 'def ' in content or 'class ' in content
    has_data = any(word in content.lower() for word in ['data', 'statistics', 'metrics'])

    processed = {
        'content': content,
        'embeddings': result['embeddings'],
        'metadata': {
            'source_url': result.get('url', ''),
            'word_count': word_count,
            'content_type': 'code' if has_code else 'data' if has_data else 'text',
            'processed_date': datetime.now(timezone.utc),
            # Descriptive label only; set it to the model you actually use
            'embedding_model': 'cohere-embed-english-v3.0'
        }
    }

    return processed

# Search with enhanced processing
results = hybrid_client.search(
    query="Python web scraping techniques",
    save_foreign=enhanced_document_processor,
    max_results=10,
    include_raw_content="markdown"  # Preserve formatting for code examples
)
```

## Error Handling and Validation

Handle errors in hybrid RAG operations:

```python
from tavily import TavilyHybridClient, InvalidAPIKeyError
import pymongo.errors

try:
    # Initialize client
    hybrid_client = TavilyHybridClient(
        api_key="tvly-YOUR_API_KEY",
        db_provider="mongodb",
        collection=collection,
        index="vector_index"
    )

    # Perform search with error handling
    results = hybrid_client.search(
        query="example query",
        max_results=10,
        save_foreign=True
    )

except ValueError as e:
    # Handle database configuration errors
    print(f"Database configuration error: {e}")

except InvalidAPIKeyError:
    # Handle Tavily API key errors
    print("Invalid Tavily API key")

except pymongo.errors.PyMongoError as e:
    # Handle MongoDB errors
    print(f"Database error: {e}")

except Exception as e:
    # Handle unexpected errors
    print(f"Unexpected error: {e}")
```

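Some failures, such as a dropped connection, are transient and may succeed on a second attempt, while a bad API key never will. The sketch below shows one possible retry wrapper with exponential backoff. `with_retries` is a hypothetical helper, and which exception types are worth retrying is an assumption to adapt to your deployment.

```python
import time
from typing import Callable, List, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retries(
    operation: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying listed exceptions with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # Out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


# Simulated search that fails twice before succeeding.
calls = {"count": 0}


def flaky_search() -> List[str]:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return ["result"]


delays: List[float] = []  # Record delays instead of actually sleeping
outcome = with_retries(flaky_search, retry_on=(ConnectionError,), sleep=delays.append)
print(outcome, calls["count"], delays)
```

With a real client, `operation` could be a lambda that calls `hybrid_client.search(...)`, and `retry_on` could list connection-level errors only, leaving `ValueError` and `InvalidAPIKeyError` to fail immediately.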
## Performance Optimization

Tune result counts, search depth, and timeouts to reduce latency:

```python
# Balanced search configuration
optimized_results = hybrid_client.search(
    query="query",
    max_results=10,       # Reasonable result count
    max_local=7,          # Favor local results (faster)
    max_foreign=5,        # Limit web requests
    timeout=30,           # Reasonable timeout
    search_depth="basic"  # Faster web search
)

# Batch processing for multiple queries
queries = ["query1", "query2", "query3"]
all_results = []

for query in queries:
    try:
        results = hybrid_client.search(
            query=query,
            max_results=5,      # Smaller batches
            save_foreign=False  # Skip saving for batch processing
        )
        all_results.extend(results)
    except Exception as e:
        # Log the failure and move on to the next query
        print(f"Failed to process query '{query}': {e}")

print(f"Processed {len(all_results)} total results from {len(queries)} queries")
```
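
When several related queries are batched, the combined list can contain the same passage more than once. A small post-processing step can collapse those duplicates and keep the highest-scoring copy. This is a sketch under the assumption that results are dicts with `content` and `score` keys, as described for `search`; `dedupe_by_content` is a hypothetical helper.

```python
from typing import Dict, Iterable, List


def dedupe_by_content(results: Iterable[Dict]) -> List[Dict]:
    """Keep the best-scoring result per normalized content string."""
    best: Dict[str, Dict] = {}
    for result in results:
        # Normalize case and whitespace so trivial variations collapse
        key = " ".join(str(result["content"]).lower().split())
        if key not in best or result["score"] > best[key]["score"]:
            best[key] = result
    return sorted(best.values(), key=lambda r: r["score"], reverse=True)


sample = [
    {"content": "Vector search basics", "score": 0.4, "origin": "local"},
    {"content": "vector  search basics", "score": 0.9, "origin": "foreign"},
    {"content": "Reranking explained", "score": 0.7, "origin": "foreign"},
]
unique = dedupe_by_content(sample)
print([(r["origin"], r["score"]) for r in unique])
```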