# Content Operations

Extract content from individual URLs or crawl entire websites with intelligent navigation, content filtering, and structured data extraction capabilities.

## Capabilities

### Content Extraction

Extract structured content from one or more URLs with options for different output formats and extraction depth levels.

```python { .api }
def extract(
    urls: Union[List[str], str],
    include_images: bool = None,
    extract_depth: Literal["basic", "advanced"] = None,
    format: Literal["markdown", "text"] = None,
    timeout: int = 60,
    include_favicon: bool = None,
    **kwargs
) -> dict:
    """
    Extract content from a single URL or a list of URLs.

    Parameters:
    - urls: Single URL string or list of URL strings to extract content from
    - include_images: Include image URLs in extracted content
    - extract_depth: Extraction thoroughness ("basic" for main content, "advanced" for comprehensive)
    - format: Output format ("markdown" for structured text, "text" for plain text)
    - timeout: Request timeout in seconds (max 120)
    - include_favicon: Include website favicon URLs
    - **kwargs: Additional extraction parameters

    Returns:
    Dict containing:
    - results: List of extraction result objects with:
      - url: Source URL
      - content: Extracted content
      - title: Page title
      - score: Content quality score
    - failed_results: List of URLs that failed extraction, with error details
    """
```

**Usage Examples:**

```python
# Extract from a single URL
result = client.extract("https://example.com/article")
print(result['results'][0]['content'])

# Extract from multiple URLs
urls = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3"
]
results = client.extract(
    urls=urls,
    format="markdown",
    extract_depth="advanced",
    include_images=True
)

# Process results and handle failures
for result in results['results']:
    print(f"URL: {result['url']}")
    print(f"Title: {result['title']}")
    print(f"Content: {result['content'][:200]}...")

for failed in results['failed_results']:
    print(f"Failed to extract: {failed['url']} - {failed['error']}")
```
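
The `failed_results` list makes it easy to re-submit URLs that failed transiently. A minimal retry helper might look like this (a sketch: the retry policy and the `extract_with_retry` name are illustrative, not part of the client API; `client` is any object exposing the `extract` method shown above):

```python
def extract_with_retry(client, urls, max_retries=2, **options):
    """Extract `urls`, re-submitting failed URLs up to `max_retries` times."""
    merged = {"results": [], "failed_results": []}
    pending = list(urls)
    for attempt in range(max_retries + 1):
        if not pending:
            break
        response = client.extract(pending, **options)
        merged["results"].extend(response.get("results", []))
        # Only the URLs that failed this round are retried next round
        pending = [f["url"] for f in response.get("failed_results", [])]
    merged["failed_results"] = [
        {"url": u, "error": "retries exhausted"} for u in pending
    ]
    return merged
```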

### Website Crawling

Intelligently crawl websites with custom navigation instructions, content filtering, and structured data extraction.

```python { .api }
def crawl(
    url: str,
    max_depth: int = None,
    max_breadth: int = None,
    limit: int = None,
    instructions: str = None,
    select_paths: Sequence[str] = None,
    select_domains: Sequence[str] = None,
    exclude_paths: Sequence[str] = None,
    exclude_domains: Sequence[str] = None,
    allow_external: bool = None,
    include_images: bool = None,
    extract_depth: Literal["basic", "advanced"] = None,
    format: Literal["markdown", "text"] = None,
    timeout: int = 60,
    include_favicon: bool = None,
    **kwargs
) -> dict:
    """
    Crawl a website with intelligent navigation and content extraction.

    Parameters:
    - url: Starting URL for crawling
    - max_depth: Maximum depth to crawl from the starting URL
    - max_breadth: Maximum number of pages to crawl per depth level
    - limit: Total maximum number of pages to crawl
    - instructions: Natural language instructions for crawling behavior
    - select_paths: List of path patterns to include (supports wildcards)
    - select_domains: List of domains to crawl
    - exclude_paths: List of path patterns to exclude
    - exclude_domains: List of domains to avoid
    - allow_external: Allow crawling domains external to the starting domain
    - include_images: Include image URLs in crawled content
    - extract_depth: Content extraction thoroughness
    - format: Output format for extracted content
    - timeout: Request timeout in seconds (max 120)
    - include_favicon: Include website favicon URLs
    - **kwargs: Additional crawl parameters

    Returns:
    Dict containing crawl results with pages and extracted content
    """
```

**Usage Examples:**

```python
# Basic website crawl
crawl_result = client.crawl(
    url="https://docs.python.org",
    max_depth=2,
    limit=20
)

# Advanced crawl with filtering
crawl_result = client.crawl(
    url="https://example.com",
    max_depth=3,
    max_breadth=10,
    instructions="Focus on documentation and tutorial pages",
    select_paths=["/docs/*", "/tutorials/*"],
    exclude_paths=["/admin/*", "/private/*"],
    format="markdown",
    extract_depth="advanced"
)

# Cross-domain crawl
crawl_result = client.crawl(
    url="https://company.com",
    allow_external=True,
    select_domains=["company.com", "docs.company.com"],
    limit=50
)
```
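
Crawl output can be persisted for later processing. The sketch below assumes the crawl response exposes a `results` list of objects with `url` and `content` fields, mirroring `extract()`; verify the actual response shape before relying on this:

```python
import hashlib
import pathlib

def save_pages(crawl_result, out_dir="crawl_output"):
    """Write each crawled page's content to a file named by a hash of its URL."""
    out = pathlib.Path(out_dir)
    out.mkdir(exist_ok=True)
    for page in crawl_result.get("results", []):
        # Hash the URL so arbitrary paths become safe, unique filenames
        name = hashlib.sha256(page["url"].encode()).hexdigest()[:16] + ".md"
        (out / name).write_text(page.get("content", ""), encoding="utf-8")
    return sorted(p.name for p in out.iterdir())
```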

### Advanced Crawling Patterns

**Targeted Content Crawling:**

```python
# Crawl specific content types
blog_crawl = client.crawl(
    url="https://techblog.com",
    instructions="Only crawl blog posts and articles, skip navigation pages",
    select_paths=["/blog/*", "/articles/*", "/posts/*"],
    exclude_paths=["/tags/*", "/categories/*", "/authors/*"],
    max_depth=2,
    format="markdown"
)

# E-commerce product crawl
product_crawl = client.crawl(
    url="https://store.com",
    instructions="Focus on product pages with descriptions and specifications",
    select_paths=["/products/*", "/items/*"],
    exclude_paths=["/cart/*", "/checkout/*", "/account/*"],
    include_images=True,
    limit=100
)
```

**Research and Documentation Crawling:**

```python
# Academic paper crawl
research_crawl = client.crawl(
    url="https://university.edu/research",
    instructions="Crawl research papers and publications, skip administrative pages",
    select_paths=["/papers/*", "/publications/*", "/research/*"],
    extract_depth="advanced",
    max_depth=3
)

# API documentation crawl
docs_crawl = client.crawl(
    url="https://api.example.com/docs",
    instructions="Focus on API reference and tutorial content",
    format="markdown",
    max_depth=4,
    limit=200
)
```

## Crawling Instructions

The `instructions` parameter accepts natural language descriptions that guide the crawling behavior:

**Effective Instruction Examples:**

```python
# Content-focused instructions
instructions = "Focus on main content pages, skip navigation, sidebar, and footer links"

# Topic-specific instructions
instructions = "Only crawl pages related to machine learning and AI, ignore general company pages"

# Quality-focused instructions
instructions = "Prioritize pages with substantial text content, skip image galleries and empty pages"

# Structure-focused instructions
instructions = "Follow documentation hierarchy, crawl systematically through sections and subsections"
```

## Path and Domain Filtering

**Path Pattern Examples:**

```python
# Include patterns
select_paths = [
    "/docs/*",           # All documentation
    "/api/*/reference",  # API reference pages
    "/blog/2024/*",      # 2024 blog posts
    "*/tutorial*"        # Any tutorial pages
]

# Exclude patterns
exclude_paths = [
    "/admin/*",    # Admin pages
    "/private/*",  # Private content
    "*/download*", # Download pages
    "*.pdf",       # PDF files
    "*.jpg"        # Image files
]
```
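
The wildcard patterns above resemble shell-style globs. As a rough local approximation (an assumption; the service's exact matching rules may differ), Python's `fnmatch` can be used to pre-test which URL paths a pattern set would keep:

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

def path_allowed(url, select_paths=None, exclude_paths=None):
    """Rough local check of a URL path against include/exclude glob patterns.

    Exclusions win over inclusions; with no select_paths, everything
    not excluded is allowed.
    """
    path = urlparse(url).path
    if exclude_paths and any(fnmatch(path, p) for p in exclude_paths):
        return False
    if select_paths:
        return any(fnmatch(path, p) for p in select_paths)
    return True
```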

**Domain Management:**

```python
# Multi-domain crawling
result = client.crawl(
    url="https://main-site.com",
    allow_external=True,
    select_domains=[
        "main-site.com",
        "docs.main-site.com",
        "blog.main-site.com",
        "support.main-site.com"
    ],
    exclude_domains=[
        "ads.main-site.com",
        "tracking.main-site.com"
    ]
)
```

## Performance and Limits

**Optimization Strategies:**

```python
# Balanced crawl for large sites
balanced_crawl = client.crawl(
    url="https://large-site.com",
    max_depth=2,     # Limit depth to avoid going too deep
    max_breadth=15,  # Limit breadth to focus on important pages
    limit=100,       # Overall page limit
    timeout=90       # Longer timeout for complex sites
)

# Fast shallow crawl
quick_crawl = client.crawl(
    url="https://site.com",
    max_depth=1,  # Only immediate links
    limit=20,     # Small page count
    timeout=30    # Quick timeout
)
```
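
For very large URL lists, splitting `extract` calls into batches keeps individual requests small and isolates failures. A sketch (the batch size of 20 is an illustrative choice, not a documented limit; `client` is any object exposing the `extract` method shown earlier):

```python
def extract_in_batches(client, urls, batch_size=20, **options):
    """Run extract over `urls` in fixed-size batches and merge the responses."""
    all_results, all_failed = [], []
    for i in range(0, len(urls), batch_size):
        resp = client.extract(urls[i:i + batch_size], **options)
        all_results.extend(resp.get("results", []))
        all_failed.extend(resp.get("failed_results", []))
    return {"results": all_results, "failed_results": all_failed}
```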

## Error Handling

Content operations include robust error handling for failed extractions and crawling issues:

```python
from tavily import TavilyClient, TimeoutError, BadRequestError

try:
    result = client.crawl("https://example.com", limit=50)

    # Process successful results
    for page in result.get('results', []):
        print(f"Crawled: {page['url']}")

    # Handle any failed pages
    for failure in result.get('failed_results', []):
        print(f"Failed: {failure['url']} - {failure.get('error', 'Unknown error')}")

except TimeoutError:
    print("Crawling operation timed out")
except BadRequestError as e:
    print(f"Invalid crawl parameters: {e}")
```
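
When timeouts are transient, retrying with exponential backoff is a common pattern. A sketch (the three-attempt policy and delays are illustrative assumptions; for self-containment it catches the built-in `TimeoutError`, so substitute the SDK's exception class in practice):

```python
import time

def crawl_with_backoff(client, url, attempts=3, base_delay=1.0, **options):
    """Retry a crawl on timeouts, doubling the delay between attempts."""
    for attempt in range(attempts):
        try:
            return client.crawl(url, **options)
        except TimeoutError:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the error to the caller
            time.sleep(base_delay * (2 ** attempt))
```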