
# Website Mapping

Discover and map website structures without extracting full content. Mapping is useful for understanding site architecture and finding relevant pages before detailed crawling or extraction operations.

## Capabilities

### Website Structure Mapping

Map website structure and discover pages without extracting full content, providing an efficient way to understand site architecture and identify relevant content areas.

```python { .api }
from typing import Sequence

def map(
    url: str,
    max_depth: int = None,
    max_breadth: int = None,
    limit: int = None,
    instructions: str = None,
    select_paths: Sequence[str] = None,
    select_domains: Sequence[str] = None,
    exclude_paths: Sequence[str] = None,
    exclude_domains: Sequence[str] = None,
    allow_external: bool = None,
    include_images: bool = None,
    timeout: int = 60,
    **kwargs
) -> dict:
    """
    Map website structure and discover pages without full content extraction.

    Parameters:
    - url: Starting URL for mapping
    - max_depth: Maximum depth to explore from the starting URL
    - max_breadth: Maximum number of pages to discover per depth level
    - limit: Total maximum number of pages to map
    - instructions: Natural language instructions guiding mapping behavior
    - select_paths: List of path patterns to include in mapping
    - select_domains: List of domains to explore
    - exclude_paths: List of path patterns to exclude from mapping
    - exclude_domains: List of domains to avoid
    - allow_external: Allow mapping domains outside the starting domain
    - include_images: Include image URLs in mapping results
    - timeout: Request timeout in seconds (max 120)
    - **kwargs: Additional mapping parameters

    Returns:
    Dict containing the website structure map with discovered pages and hierarchy
    """
```
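The exact schema of the returned dict is not reproduced here; the analysis snippets on this page assume roughly the shape sketched below. Field names beyond `results`, `url`, `depth`, and `failed_results` (all used later on this page) should be checked against the API reference.

```python
# Hypothetical response shape, inferred from the examples on this page
site_map = {
    "results": [
        {"url": "https://docs.python.org/3/tutorial/", "depth": 1},
        {"url": "https://docs.python.org/3/library/", "depth": 1},
    ],
    "failed_results": [],  # pages that could not be mapped
}

# Access results defensively, as the later examples do
for page in site_map.get("results", []):
    print(f"- {page['url']} (depth: {page.get('depth', 0)})")
```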

**Usage Examples:**

```python
# Basic website mapping
site_map = client.map(
    url="https://docs.python.org",
    max_depth=3,
    limit=100
)

# Focused documentation mapping
docs_map = client.map(
    url="https://api.example.com",
    instructions="Map API documentation structure, focus on reference sections",
    select_paths=["/docs/*", "/reference/*", "/api/*"],
    max_depth=4,
    limit=200
)

# Multi-domain site mapping
company_map = client.map(
    url="https://company.com",
    allow_external=True,
    select_domains=[
        "company.com",
        "docs.company.com",
        "support.company.com"
    ],
    exclude_paths=["/admin/*", "/private/*"],
    max_depth=2
)
```

## Mapping Use Cases

### Pre-Crawl Site Analysis

Use mapping to understand site structure before performing expensive crawling operations:

```python
# Map first to understand structure
site_structure = client.map(
    url="https://large-company.com",
    max_depth=2,
    limit=50
)

# Analyze the structure
print("Discovered pages:")
for page in site_structure.get('results', []):
    print(f"- {page['url']} (depth: {page.get('depth', 0)})")

# Then crawl specific areas based on mapping results
focused_crawl = client.crawl(
    url="https://large-company.com/products",
    select_paths=["/products/*", "/solutions/*"],
    max_depth=3,
    format="markdown"
)
```

### Content Discovery

Identify content-rich areas of websites before extraction:

```python
# Map to find content sections
content_map = client.map(
    url="https://news-site.com",
    instructions="Find main content sections like articles, reports, and analysis",
    exclude_paths=["/ads/*", "/widgets/*", "/social/*"],
    max_depth=2
)

# Extract content from discovered high-value pages
high_value_pages = [
    page['url'] for page in content_map.get('results', [])
    if 'article' in page['url'] or 'report' in page['url']
]

content_results = client.extract(
    urls=high_value_pages[:10],  # Extract from top 10 pages
    format="markdown",
    extract_depth="advanced"
)
```

### Site Architecture Analysis

Understand website organization and navigation patterns:

```python
# Comprehensive site mapping
architecture_map = client.map(
    url="https://enterprise-site.com",
    instructions="Map the complete site structure to understand organization",
    max_depth=3,
    max_breadth=20,
    limit=500
)

# Analyze navigation patterns
pages_by_depth = {}
for page in architecture_map.get('results', []):
    depth = page.get('depth', 0)
    if depth not in pages_by_depth:
        pages_by_depth[depth] = []
    pages_by_depth[depth].append(page['url'])

print("Site structure by depth:")
for depth, urls in pages_by_depth.items():
    print(f"Depth {depth}: {len(urls)} pages")
    for url in urls[:5]:  # Show first 5 URLs per depth
        print(f"  - {url}")
```
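The manual depth-grouping above can be written more compactly with `collections.defaultdict`, which removes the explicit initialization check. A sketch over the same assumed result shape (`url` and `depth` keys):

```python
from collections import defaultdict

# Sample results in the shape the mapping examples assume
results = [
    {"url": "https://enterprise-site.com/", "depth": 0},
    {"url": "https://enterprise-site.com/products", "depth": 1},
    {"url": "https://enterprise-site.com/about", "depth": 1},
]

# defaultdict(list) creates the list on first access,
# so no "if depth not in pages_by_depth" guard is needed
pages_by_depth = defaultdict(list)
for page in results:
    pages_by_depth[page.get("depth", 0)].append(page["url"])

for depth in sorted(pages_by_depth):
    print(f"Depth {depth}: {len(pages_by_depth[depth])} pages")
```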

## Advanced Mapping Patterns

### Selective Domain Exploration

Map specific parts of multi-domain organizations:

```python
# Map organization's web presence
org_map = client.map(
    url="https://university.edu",
    allow_external=True,
    select_domains=[
        "university.edu",           # Main site
        "research.university.edu",  # Research portal
        "library.university.edu",   # Library system
        "news.university.edu"       # News site
    ],
    exclude_domains=[
        "admin.university.edu",     # Admin systems
        "student.university.edu"    # Student portals
    ],
    instructions="Map public-facing educational content and research information",
    max_depth=2
)
```

190

191

### Topic-Focused Mapping

192

193

Discover content related to specific topics or themes:

194

195

```python
# Map AI/ML content across a tech site
ai_content_map = client.map(
    url="https://tech-company.com",
    instructions="Find pages related to artificial intelligence, machine learning, and data science",
    select_paths=[
        "/ai/*",
        "/machine-learning/*",
        "/data-science/*",
        "/blog/*ai*",
        "/research/*ml*"
    ],
    max_depth=3,
    limit=150
)

# Map specific product documentation
product_docs_map = client.map(
    url="https://company.com/products/api-gateway",
    instructions="Map all documentation related to the API Gateway product",
    select_paths=[
        "/products/api-gateway/*",
        "/docs/api-gateway/*",
        "/guides/api-gateway/*"
    ],
    max_depth=4
)
```

### Quality-Based Filtering

Map only high-quality content pages:

```python
# Map substantial content pages
quality_map = client.map(
    url="https://content-site.com",
    instructions="Focus on pages with substantial text content, skip navigation and utility pages",
    exclude_paths=[
        "/search*",     # Search pages
        "/tag/*",       # Tag pages
        "/category/*",  # Category pages
        "/author/*",    # Author pages
        "*/print*",     # Print versions
        "*/amp*"        # AMP versions
    ],
    max_depth=2,
    limit=200
)
```

## Mapping Results Analysis

Process and analyze mapping results effectively:

```python
# Comprehensive mapping analysis
site_map = client.map(
    url="https://target-site.com",
    max_depth=3,
    limit=300
)

# Analyze results
results = site_map.get('results', [])

# Group by URL patterns
url_patterns = {}
for page in results:
    url = page['url']
    path_parts = url.split('/')[3:]  # Skip protocol and domain
    if path_parts:
        pattern = '/' + path_parts[0] + '/*'
        if pattern not in url_patterns:
            url_patterns[pattern] = []
        url_patterns[pattern].append(url)

print("Content organization:")
for pattern, urls in url_patterns.items():
    print(f"{pattern}: {len(urls)} pages")

# Find potential high-value targets for extraction
content_candidates = [
    page['url'] for page in results
    if any(keyword in page['url'].lower()
           for keyword in ['article', 'post', 'guide', 'tutorial', 'doc'])
]

print(f"\nFound {len(content_candidates)} potential content pages for extraction")
```

## Performance Considerations

Mapping is more efficient than crawling for site discovery:

```python
# Efficient large-site exploration
efficient_map = client.map(
    url="https://large-site.com",
    max_depth=2,     # Shallow but broad exploration
    max_breadth=25,  # More pages per level
    limit=200,       # Reasonable total limit
    timeout=60       # Standard timeout
)

# Quick site overview
quick_overview = client.map(
    url="https://new-site.com",
    max_depth=1,  # Just immediate links
    limit=50,     # Small set for an overview
    timeout=30    # Fast exploration
)
```

308

309

## Error Handling

310

311

Handle mapping errors and partial results:

312

313

```python
from tavily import TavilyClient, TimeoutError, BadRequestError

try:
    site_map = client.map("https://example.com", limit=100)

    # Process successful mapping
    discovered_pages = site_map.get('results', [])
    print(f"Successfully mapped {len(discovered_pages)} pages")

    # Handle any failed discoveries
    failed_mappings = site_map.get('failed_results', [])
    if failed_mappings:
        print(f"Failed to map {len(failed_mappings)} pages")

except TimeoutError:
    print("Mapping operation timed out - partial results may be available")
except BadRequestError as e:
    print(f"Invalid mapping parameters: {e}")
```