
# Content Operations

Extract content from individual URLs, or crawl entire websites with intelligent navigation, content filtering, and structured data extraction.

## Capabilities

### Content Extraction

Extract structured content from one or more URLs with options for different output formats and extraction depth levels.

```python { .api }
def extract(
    urls: Union[List[str], str],
    include_images: bool = None,
    extract_depth: Literal["basic", "advanced"] = None,
    format: Literal["markdown", "text"] = None,
    timeout: int = 60,
    include_favicon: bool = None,
    **kwargs
) -> dict:
    """
    Extract content from a single URL or a list of URLs.

    Parameters:
    - urls: Single URL string or list of URL strings to extract content from
    - include_images: Include image URLs in extracted content
    - extract_depth: Extraction thoroughness ("basic" for main content, "advanced" for comprehensive)
    - format: Output format ("markdown" for structured text, "text" for plain text)
    - timeout: Request timeout in seconds (max 120)
    - include_favicon: Include website favicon URLs
    - **kwargs: Additional extraction parameters

    Returns:
    Dict containing:
    - results: List of extraction result objects with:
      - url: Source URL
      - content: Extracted content
      - title: Page title
      - score: Content quality score
    - failed_results: List of URLs that failed extraction, with error details
    """
```

**Usage Examples:**

```python
# Extract from single URL
result = client.extract("https://example.com/article")
print(result['results'][0]['content'])

# Extract from multiple URLs
urls = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3"
]
results = client.extract(
    urls=urls,
    format="markdown",
    extract_depth="advanced",
    include_images=True
)

# Process results and handle failures
for result in results['results']:
    print(f"URL: {result['url']}")
    print(f"Title: {result['title']}")
    print(f"Content: {result['content'][:200]}...")

for failed in results['failed_results']:
    print(f"Failed to extract: {failed['url']} - {failed['error']}")
```
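The `failed_results` list makes it straightforward to retry only the URLs that failed. Below is a minimal retry sketch; `extract_with_retry` and the `fake_extract` stub are illustrative helpers written for this example, not part of the Tavily API, and the retry helper returns plain URLs in its own `failed_results` rather than full error objects:

```python
def extract_with_retry(extract_fn, urls, max_attempts=2):
    """Call an extract-style function, retrying URLs listed in failed_results."""
    collected, pending = [], list(urls)
    for _ in range(max_attempts):
        if not pending:
            break
        response = extract_fn(pending)
        collected.extend(response.get('results', []))
        pending = [f['url'] for f in response.get('failed_results', [])]
    return {'results': collected, 'failed_results': pending}

# Stub standing in for client.extract, for demonstration only
def fake_extract(urls):
    ok = [{'url': u, 'content': 'text'} for u in urls if 'good' in u]
    bad = [{'url': u, 'error': 'timeout'} for u in urls if 'good' not in u]
    return {'results': ok, 'failed_results': bad}

out = extract_with_retry(fake_extract, ["https://good.com", "https://bad.com"])
```

In real code you would pass something like `lambda u: client.extract(u)` instead of the stub.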

### Website Crawling

Intelligently crawl websites with custom navigation instructions, content filtering, and structured data extraction.

```python { .api }
def crawl(
    url: str,
    max_depth: int = None,
    max_breadth: int = None,
    limit: int = None,
    instructions: str = None,
    select_paths: Sequence[str] = None,
    select_domains: Sequence[str] = None,
    exclude_paths: Sequence[str] = None,
    exclude_domains: Sequence[str] = None,
    allow_external: bool = None,
    include_images: bool = None,
    extract_depth: Literal["basic", "advanced"] = None,
    format: Literal["markdown", "text"] = None,
    timeout: int = 60,
    include_favicon: bool = None,
    **kwargs
) -> dict:
    """
    Crawl a website with intelligent navigation and content extraction.

    Parameters:
    - url: Starting URL for crawling
    - max_depth: Maximum depth to crawl from the starting URL
    - max_breadth: Maximum number of pages to crawl per depth level
    - limit: Total maximum number of pages to crawl
    - instructions: Natural language instructions for crawling behavior
    - select_paths: List of path patterns to include (supports wildcards)
    - select_domains: List of domains to crawl
    - exclude_paths: List of path patterns to exclude
    - exclude_domains: List of domains to avoid
    - allow_external: Allow crawling domains external to the starting domain
    - include_images: Include image URLs in crawled content
    - extract_depth: Content extraction thoroughness
    - format: Output format for extracted content
    - timeout: Request timeout in seconds (max 120)
    - include_favicon: Include website favicon URLs

    Returns:
    Dict containing crawling results with pages and extracted content
    """
```

**Usage Examples:**

```python
# Basic website crawl
crawl_result = client.crawl(
    url="https://docs.python.org",
    max_depth=2,
    limit=20
)

# Advanced crawl with filtering
crawl_result = client.crawl(
    url="https://example.com",
    max_depth=3,
    max_breadth=10,
    instructions="Focus on documentation and tutorial pages",
    select_paths=["/docs/*", "/tutorials/*"],
    exclude_paths=["/admin/*", "/private/*"],
    format="markdown",
    extract_depth="advanced"
)

# Cross-domain crawl
crawl_result = client.crawl(
    url="https://company.com",
    allow_external=True,
    select_domains=["company.com", "docs.company.com"],
    limit=50
)
```

### Advanced Crawling Patterns

**Targeted Content Crawling:**

```python
# Crawl specific content types
blog_crawl = client.crawl(
    url="https://techblog.com",
    instructions="Only crawl blog posts and articles, skip navigation pages",
    select_paths=["/blog/*", "/articles/*", "/posts/*"],
    exclude_paths=["/tags/*", "/categories/*", "/authors/*"],
    max_depth=2,
    format="markdown"
)

# E-commerce product crawl
product_crawl = client.crawl(
    url="https://store.com",
    instructions="Focus on product pages with descriptions and specifications",
    select_paths=["/products/*", "/items/*"],
    exclude_paths=["/cart/*", "/checkout/*", "/account/*"],
    include_images=True,
    limit=100
)
```

**Research and Documentation Crawling:**

```python
# Academic paper crawl
research_crawl = client.crawl(
    url="https://university.edu/research",
    instructions="Crawl research papers and publications, skip administrative pages",
    select_paths=["/papers/*", "/publications/*", "/research/*"],
    extract_depth="advanced",
    max_depth=3
)

# API documentation crawl
docs_crawl = client.crawl(
    url="https://api.example.com/docs",
    instructions="Focus on API reference and tutorial content",
    format="markdown",
    max_depth=4,
    limit=200
)
```

## Crawling Instructions

The `instructions` parameter accepts natural language descriptions that guide the crawling behavior:

**Effective Instruction Examples:**

```python
# Content-focused instructions
instructions = "Focus on main content pages, skip navigation, sidebar, and footer links"

# Topic-specific instructions
instructions = "Only crawl pages related to machine learning and AI, ignore general company pages"

# Quality-focused instructions
instructions = "Prioritize pages with substantial text content, skip image galleries and empty pages"

# Structure-focused instructions
instructions = "Follow documentation hierarchy, crawl systematically through sections and subsections"
```

## Path and Domain Filtering

**Path Pattern Examples:**

```python
# Include patterns
select_paths = [
    "/docs/*",           # All documentation
    "/api/*/reference",  # API reference pages
    "/blog/2024/*",      # 2024 blog posts
    "*/tutorial*"        # Any tutorial pages
]

# Exclude patterns
exclude_paths = [
    "/admin/*",     # Admin pages
    "/private/*",   # Private content
    "*/download*",  # Download pages
    "*.pdf",        # PDF files
    "*.jpg"         # Image files
]
```
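These wildcard patterns behave much like shell globs. As a mental model only — the service's actual matching rules may differ — Python's standard `fnmatch` module can reproduce the include/exclude logic; the `path_allowed` helper below is hypothetical, not part of the SDK:

```python
from fnmatch import fnmatch

# Hypothetical helper illustrating glob-style path filtering:
# a path is crawled if it matches an include pattern and no exclude pattern.
def path_allowed(path, select_paths, exclude_paths):
    if any(fnmatch(path, pat) for pat in exclude_paths):
        return False
    return any(fnmatch(path, pat) for pat in select_paths)

print(path_allowed("/docs/intro", ["/docs/*"], ["/admin/*"]))   # True
print(path_allowed("/admin/users", ["/docs/*"], ["/admin/*"]))  # False
```

Note that exclusions win over inclusions here, which matches the intuition behind listing `/admin/*` in `exclude_paths`.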

**Domain Management:**

```python
# Multi-domain crawling
result = client.crawl(
    url="https://main-site.com",
    allow_external=True,
    select_domains=[
        "main-site.com",
        "docs.main-site.com",
        "blog.main-site.com",
        "support.main-site.com"
    ],
    exclude_domains=[
        "ads.main-site.com",
        "tracking.main-site.com"
    ]
)
```

## Performance and Limits

**Optimization Strategies:**

```python
# Balanced crawl for large sites
balanced_crawl = client.crawl(
    url="https://large-site.com",
    max_depth=2,     # Limit depth to avoid going too deep
    max_breadth=15,  # Limit breadth to focus on important pages
    limit=100,       # Overall page limit
    timeout=90       # Longer timeout for complex sites
)

# Fast shallow crawl
quick_crawl = client.crawl(
    url="https://site.com",
    max_depth=1,  # Only immediate links
    limit=20,     # Small page count
    timeout=30    # Quick timeout
)
```
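Given the documented parameter meanings (`max_breadth` pages per depth level, `limit` as the overall cap), you can sketch a rough upper bound on pages visited. This is a mental model only; `page_ceiling` is a hypothetical helper, not an SDK function, and the service's actual scheduling may visit fewer pages:

```python
def page_ceiling(max_depth, max_breadth, limit):
    # Start page, plus at most max_breadth pages at each of
    # max_depth levels, capped by the overall page limit.
    return min(1 + max_breadth * max_depth, limit)

print(page_ceiling(max_depth=2, max_breadth=15, limit=100))  # 31
print(page_ceiling(max_depth=1, max_breadth=50, limit=20))   # 20
```

Under this model, a crawl with `max_depth=2` and `max_breadth=15` tops out around 31 pages, well under a `limit` of 100 — so tightening depth and breadth is often the more effective lever.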

## Error Handling

Content operations include robust error handling for failed extractions and crawling issues:

```python
from tavily import TavilyClient, TimeoutError, BadRequestError

try:
    result = client.crawl("https://example.com", limit=50)

    # Process successful results
    for page in result.get('results', []):
        print(f"Crawled: {page['url']}")

    # Handle any failed pages
    for failure in result.get('failed_results', []):
        print(f"Failed: {failure['url']} - {failure.get('error', 'Unknown error')}")

except TimeoutError:
    print("Crawling operation timed out")
except BadRequestError as e:
    print(f"Invalid crawl parameters: {e}")
```