# Website Mapping

Discover and map website structures without extracting full content. Mapping is useful for understanding site architecture and finding relevant pages before detailed crawling or extraction operations.

## Capabilities

### Website Structure Mapping

Map website structure and discover pages without extracting full content, providing an efficient way to understand site architecture and identify relevant content areas.
```python { .api }
def map(
    url: str,
    max_depth: int = None,
    max_breadth: int = None,
    limit: int = None,
    instructions: str = None,
    select_paths: Sequence[str] = None,
    select_domains: Sequence[str] = None,
    exclude_paths: Sequence[str] = None,
    exclude_domains: Sequence[str] = None,
    allow_external: bool = None,
    include_images: bool = None,
    timeout: int = 60,
    **kwargs
) -> dict:
    """
    Map website structure and discover pages without full content extraction.

    Parameters:
    - url: Starting URL for mapping
    - max_depth: Maximum depth to explore from the starting URL
    - max_breadth: Maximum number of pages to discover per depth level
    - limit: Total maximum number of pages to map
    - instructions: Natural language instructions guiding mapping behavior
    - select_paths: List of path patterns to include in mapping
    - select_domains: List of domains to explore
    - exclude_paths: List of path patterns to exclude from mapping
    - exclude_domains: List of domains to avoid
    - allow_external: Allow mapping domains external to the starting domain
    - include_images: Include image URLs in mapping results
    - timeout: Request timeout in seconds (max 120)
    - **kwargs: Additional mapping parameters

    Returns:
    Dict containing the website structure map with discovered pages and hierarchy
    """
```
**Usage Examples:**

```python
# Basic website mapping
site_map = client.map(
    url="https://docs.python.org",
    max_depth=3,
    limit=100
)

# Focused documentation mapping
docs_map = client.map(
    url="https://api.example.com",
    instructions="Map API documentation structure, focus on reference sections",
    select_paths=["/docs/*", "/reference/*", "/api/*"],
    max_depth=4,
    limit=200
)

# Multi-domain site mapping
company_map = client.map(
    url="https://company.com",
    allow_external=True,
    select_domains=[
        "company.com",
        "docs.company.com",
        "support.company.com"
    ],
    exclude_paths=["/admin/*", "/private/*"],
    max_depth=2
)
```
## Mapping Use Cases

### Pre-Crawl Site Analysis

Use mapping to understand site structure before performing expensive crawling operations:

```python
# Map first to understand structure
site_structure = client.map(
    url="https://large-company.com",
    max_depth=2,
    limit=50
)

# Analyze the structure
print("Discovered pages:")
for page in site_structure.get('results', []):
    print(f"- {page['url']} (depth: {page.get('depth', 0)})")

# Then crawl specific areas based on mapping results
focused_crawl = client.crawl(
    url="https://large-company.com/products",
    select_paths=["/products/*", "/solutions/*"],
    max_depth=3,
    format="markdown"
)
```
### Content Discovery

Identify content-rich areas of websites before extraction:

```python
# Map to find content sections
content_map = client.map(
    url="https://news-site.com",
    instructions="Find main content sections like articles, reports, and analysis",
    exclude_paths=["/ads/*", "/widgets/*", "/social/*"],
    max_depth=2
)

# Extract content from discovered high-value pages
high_value_pages = [
    page['url'] for page in content_map.get('results', [])
    if 'article' in page['url'] or 'report' in page['url']
]

content_results = client.extract(
    urls=high_value_pages[:10],  # Extract from the top 10 pages
    format="markdown",
    extract_depth="advanced"
)
```
### Site Architecture Analysis

Understand website organization and navigation patterns:

```python
# Comprehensive site mapping
architecture_map = client.map(
    url="https://enterprise-site.com",
    instructions="Map the complete site structure to understand organization",
    max_depth=3,
    max_breadth=20,
    limit=500
)

# Group pages by their depth in the site hierarchy
pages_by_depth = {}
for page in architecture_map.get('results', []):
    depth = page.get('depth', 0)
    pages_by_depth.setdefault(depth, []).append(page['url'])

print("Site structure by depth:")
for depth in sorted(pages_by_depth):
    urls = pages_by_depth[depth]
    print(f"Depth {depth}: {len(urls)} pages")
    for url in urls[:5]:  # Show first 5 URLs per depth
        print(f"  - {url}")
```
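The same depth grouping can be turned into a nested path tree, which makes the site's hierarchy easier to scan at a glance. A minimal standard-library sketch; `build_path_tree` is an illustrative helper, and it assumes (as in the snippet above) that each result carries a `url`:

```python
from urllib.parse import urlparse

def build_path_tree(urls):
    """Nest discovered URLs into a dict tree keyed by path segment."""
    tree = {}
    for url in urls:
        node = tree
        # Walk each non-empty path segment, creating child dicts as needed
        for segment in urlparse(url).path.strip('/').split('/'):
            if segment:
                node = node.setdefault(segment, {})
    return tree

tree = build_path_tree([
    "https://enterprise-site.com/products/widgets",
    "https://enterprise-site.com/products/gadgets",
    "https://enterprise-site.com/about",
])
# 'products' ends up with two children; 'about' is a leaf
```

With real mapping output you would feed it `[page['url'] for page in architecture_map.get('results', [])]`.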
## Advanced Mapping Patterns

### Selective Domain Exploration

Map specific parts of multi-domain organizations:

```python
# Map an organization's web presence
org_map = client.map(
    url="https://university.edu",
    allow_external=True,
    select_domains=[
        "university.edu",           # Main site
        "research.university.edu",  # Research portal
        "library.university.edu",   # Library system
        "news.university.edu"       # News site
    ],
    exclude_domains=[
        "admin.university.edu",     # Admin systems
        "student.university.edu"    # Student portals
    ],
    instructions="Map public-facing educational content and research information",
    max_depth=2
)
```
### Topic-Focused Mapping

Discover content related to specific topics or themes:

```python
# Map AI/ML content across a tech site
ai_content_map = client.map(
    url="https://tech-company.com",
    instructions="Find pages related to artificial intelligence, machine learning, and data science",
    select_paths=[
        "/ai/*",
        "/machine-learning/*",
        "/data-science/*",
        "/blog/*ai*",
        "/research/*ml*"
    ],
    max_depth=3,
    limit=150
)

# Map specific product documentation
product_docs_map = client.map(
    url="https://company.com/products/api-gateway",
    instructions="Map all documentation related to the API Gateway product",
    select_paths=[
        "/products/api-gateway/*",
        "/docs/api-gateway/*",
        "/guides/api-gateway/*"
    ],
    max_depth=4
)
```
### Quality-Based Filtering

Map only high-quality content pages:

```python
# Map substantial content pages
quality_map = client.map(
    url="https://content-site.com",
    instructions="Focus on pages with substantial text content, skip navigation and utility pages",
    exclude_paths=[
        "/search*",      # Search pages
        "/tag/*",        # Tag pages
        "/category/*",   # Category pages
        "/author/*",     # Author pages
        "*/print*",      # Print versions
        "*/amp*"         # AMP versions
    ],
    max_depth=2,
    limit=200
)
```
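Because pattern-based exclusions and natural-language instructions are best-effort on the server side, it can be worth re-applying the same filters locally before further processing. A hedged sketch using `fnmatch` against the URL path; `filter_results` is an illustrative helper, and the result shape (dicts with a `url` key) follows the examples above:

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

EXCLUDE_PATTERNS = ["/search*", "/tag/*", "/category/*", "/author/*", "*/print*", "*/amp*"]

def filter_results(results, exclude_patterns=EXCLUDE_PATTERNS):
    """Drop results whose URL path matches any exclude pattern."""
    kept = []
    for page in results:
        path = urlparse(page['url']).path
        if not any(fnmatch(path, pattern) for pattern in exclude_patterns):
            kept.append(page)
    return kept

sample = [
    {'url': 'https://content-site.com/guides/setup'},
    {'url': 'https://content-site.com/tag/python'},          # excluded by /tag/*
    {'url': 'https://content-site.com/guides/setup/print'},  # excluded by */print*
]
kept = filter_results(sample)
```

With real output, `filter_results(quality_map.get('results', []))` acts as a local safety net over the server-side exclusions.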
## Mapping Results Analysis

Process and analyze mapping results effectively:

```python
# Comprehensive mapping analysis
site_map = client.map(
    url="https://target-site.com",
    max_depth=3,
    limit=300
)

# Analyze results
results = site_map.get('results', [])

# Group by URL patterns
url_patterns = {}
for page in results:
    url = page['url']
    path_parts = url.split('/')[3:]  # Skip protocol and domain
    if path_parts:
        pattern = '/' + path_parts[0] + '/*'
        url_patterns.setdefault(pattern, []).append(url)

print("Content organization:")
for pattern, urls in url_patterns.items():
    print(f"{pattern}: {len(urls)} pages")

# Find potential high-value targets for extraction
content_candidates = [
    page['url'] for page in results
    if any(keyword in page['url'].lower()
           for keyword in ['article', 'post', 'guide', 'tutorial', 'doc'])
]

print(f"\nFound {len(content_candidates)} potential content pages for extraction")
```
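Before handing candidate URLs to `extract`, it often pays to normalize and deduplicate them, since mapping can surface the same page under several URLs (fragments, trailing slashes, tracking parameters). A minimal sketch; `normalize_url` and `dedupe_urls` are illustrative helpers, and dropping query strings is aggressive — keep them if they distinguish pages on your site:

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url):
    """Canonicalize a URL for dedup: lowercase host, strip query/fragment/trailing slash."""
    scheme, netloc, path, _query, _fragment = urlsplit(url)
    path = path.rstrip('/') or '/'
    return urlunsplit((scheme, netloc.lower(), path, '', ''))

def dedupe_urls(urls):
    """Return normalized URLs, first occurrence wins, order preserved."""
    seen = set()
    unique = []
    for url in urls:
        norm = normalize_url(url)
        if norm not in seen:
            seen.add(norm)
            unique.append(norm)
    return unique

unique = dedupe_urls([
    "https://target-site.com/guide/",
    "https://target-site.com/guide#intro",
    "https://Target-Site.com/guide?ref=nav",
])
# all three collapse to a single canonical URL
```

Applied to the analysis above, `dedupe_urls(content_candidates)` shrinks the list passed to `client.extract`.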
## Performance Considerations

Mapping is more efficient than crawling for site discovery:

```python
# Efficient large site exploration
efficient_map = client.map(
    url="https://large-site.com",
    max_depth=2,     # Shallow but broad exploration
    max_breadth=25,  # More pages per level
    limit=200,       # Reasonable total limit
    timeout=60       # Standard timeout
)

# Quick site overview
quick_overview = client.map(
    url="https://new-site.com",
    max_depth=1,  # Just immediate links
    limit=50,     # Small set for overview
    timeout=30    # Fast exploration
)
```
## Error Handling

Handle mapping errors and partial results:

```python
from tavily import TavilyClient, TimeoutError, BadRequestError

client = TavilyClient(api_key="your-api-key")

try:
    site_map = client.map("https://example.com", limit=100)

    # Process successful mapping
    discovered_pages = site_map.get('results', [])
    print(f"Successfully mapped {len(discovered_pages)} pages")

    # Handle any failed discoveries
    failed_mappings = site_map.get('failed_results', [])
    if failed_mappings:
        print(f"Failed to map {len(failed_mappings)} pages")

except TimeoutError:
    print("Mapping operation timed out - partial results may be available")
except BadRequestError as e:
    print(f"Invalid mapping parameters: {e}")
```
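One recovery strategy after a timeout is to retry with a progressively smaller `limit`, trading completeness for a result. A sketch under stated assumptions: `map_with_fallback` and the stand-in `flaky_map` are illustrative helpers, not part of the client; with the real client you would pass `client.map` and the `TimeoutError` imported above:

```python
def map_with_fallback(map_fn, url, limits=(300, 100, 50), timeout_exc=Exception):
    """Retry a mapping call with progressively smaller limits after timeouts."""
    last_err = None
    for limit in limits:
        try:
            return map_fn(url, limit=limit)
        except timeout_exc as err:
            last_err = err  # scope down and try again
    raise last_err

# Demo with a stand-in mapper that "times out" above limit 100
class MapTimeout(Exception):
    pass

attempted = []

def flaky_map(url, limit):
    attempted.append(limit)
    if limit > 100:
        raise MapTimeout("mapping took too long")
    return {"results": [], "limit": limit}

result = map_with_fallback(flaky_map, "https://example.com", timeout_exc=MapTimeout)
```

Against the real API this would read `map_with_fallback(client.map, "https://example.com", timeout_exc=TimeoutError)`.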