# Scraping Operations

Essential web scraping functionality for extracting content from single URLs, searching the web, and mapping website structures. These operations provide immediate results with comprehensive format and processing options.

## Capabilities

### Single URL Scraping

Extract content from a single webpage with extensive formatting and processing options including markdown conversion, HTML extraction, screenshots, and metadata collection.

```python { .api }
def scrape(
    url: str,
    *,
    formats: Optional[List[str]] = None,
    headers: Optional[Dict[str, str]] = None,
    include_tags: Optional[List[str]] = None,
    exclude_tags: Optional[List[str]] = None,
    only_main_content: Optional[bool] = None,
    timeout: Optional[int] = None,
    wait_for: Optional[int] = None,
    mobile: Optional[bool] = None,
    parsers: Optional[List[str]] = None,
    actions: Optional[List[dict]] = None,
    location: Optional[dict] = None,
    skip_tls_verification: Optional[bool] = None,
    remove_base64_images: Optional[bool] = None,
    fast_mode: Optional[bool] = None,
    use_mock: Optional[str] = None,
    block_ads: Optional[bool] = None,
    proxy: Optional[str] = None,
    max_age: Optional[int] = None,
    store_in_cache: Optional[bool] = None,
    integration: Optional[str] = None
) -> Document:
    """
    Scrape content from a single URL.

    Parameters:
    - url: str, target URL to scrape
    - formats: List[str], output formats ("markdown", "html", "rawHtml", "screenshot", "links")
    - headers: Dict[str, str], custom HTTP headers
    - include_tags: List[str], HTML tags to include
    - exclude_tags: List[str], HTML tags to exclude
    - only_main_content: bool, extract only main content
    - timeout: int, request timeout in milliseconds
    - wait_for: int, wait time before scraping in milliseconds
    - mobile: bool, use mobile user agent
    - parsers: List[str], content parsers to use
    - actions: List[dict], browser actions to perform
    - location: dict, geographic location settings
    - skip_tls_verification: bool, skip SSL certificate verification
    - remove_base64_images: bool, remove base64 encoded images
    - fast_mode: bool, use faster scraping mode
    - use_mock: str, use mock response for testing
    - block_ads: bool, block advertisements
    - proxy: str, proxy server to use
    - max_age: int, maximum cache age in seconds
    - store_in_cache: bool, store result in cache
    - integration: str, integration identifier

    Returns:
    - Document: scraped content and metadata
    """
```
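
A minimal call sketch against the keyword-only signature above; the `Firecrawl` client construction mirrors the usage examples later in this document, while the attributes read off the returned `Document` (`markdown`, `metadata`) are not specified in this section and are assumptions.

```python
from firecrawl import Firecrawl

app = Firecrawl(api_key="your-api-key")

# Keyword-only parameters as declared in the signature above
doc = app.scrape(
    "https://example.com",
    formats=["markdown", "links"],
    only_main_content=True,
    timeout=30000,   # request timeout in milliseconds
    wait_for=2000,   # wait 2 seconds before scraping
    block_ads=True,
)

# Attribute access on Document is assumed, not defined in this section
print(doc.metadata)
print(doc.markdown)
```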

### Web Search

Search the web with content extraction, returning relevant results with extracted content formatted for LLM consumption.

```python { .api }
def search(
    query: str,
    *,
    sources: Optional[List[str]] = None,
    categories: Optional[List[str]] = None,
    limit: Optional[int] = None,
    tbs: Optional[str] = None,
    location: Optional[str] = None,
    ignore_invalid_urls: Optional[bool] = None,
    timeout: Optional[int] = None,
    scrape_options: Optional[dict] = None,
    integration: Optional[str] = None
) -> SearchData:
    """
    Search the web and extract content from results.

    Parameters:
    - query: str, search query
    - sources: List[str], search sources to use
    - categories: List[str], content categories to filter
    - limit: int, maximum number of results
    - tbs: str, time-based search parameters
    - location: str, geographic location for search
    - ignore_invalid_urls: bool, skip invalid URLs in results
    - timeout: int, request timeout in milliseconds
    - scrape_options: dict, options for scraping search results
    - integration: str, integration identifier

    Returns:
    - SearchData: search results with extracted content
    """
```
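
A hedged sketch of a keyword-argument search call matching the signature above; the `tbs` value and the key names inside `scrape_options` are assumptions, and iteration over the result follows the usage examples below.

```python
from firecrawl import Firecrawl

app = Firecrawl(api_key="your-api-key")

# Limit results and request markdown for each scraped hit
results = app.search(
    "latest AI developments",
    limit=5,
    tbs="qdr:w",                               # assumed: restrict results to the past week
    scrape_options={"formats": ["markdown"]},  # assumed key name inside the dict
)

# Result iteration follows the usage examples below
for doc in results.data:
    print(doc.metadata.get("title"))
```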

### Website Mapping

Generate a structural map of a website showing available pages and their relationships, useful for understanding site architecture before crawling.

```python { .api }
def map(
    url: str,
    *,
    search: Optional[str] = None,
    include_subdomains: Optional[bool] = None,
    limit: Optional[int] = None,
    sitemap: str = "include",
    timeout: Optional[int] = None,
    integration: Optional[str] = None,
    location: Optional[dict] = None
) -> MapData:
    """
    Generate a map of website structure.

    Parameters:
    - url: str, target website URL
    - search: Optional[str], search term to filter URLs
    - include_subdomains: Optional[bool], include subdomain URLs
    - limit: Optional[int], maximum number of URLs to return
    - sitemap: str, sitemap handling ("include", "exclude", "only")
    - timeout: Optional[int], request timeout in milliseconds
    - integration: Optional[str], integration identifier
    - location: Optional[dict], geographic location settings

    Returns:
    - MapData: website structure map with URLs and metadata
    """
```
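
A short sketch of a map call using the parameters above; `sitemap="only"` is one of the documented values, while the shape of the returned `MapData` (iterating `.data` for page entries) is assumed from the usage examples below.

```python
from firecrawl import Firecrawl

app = Firecrawl(api_key="your-api-key")

# Discover URLs from the sitemap only, capped at 100 entries
site_map = app.map(
    "https://example.com",
    sitemap="only",
    include_subdomains=False,
    limit=100,
)

# MapData shape assumed from the usage examples below
for page in site_map.data:
    print(page.url)
```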

## Usage Examples

### Basic Scraping

```python
from firecrawl import Firecrawl, ScrapeOptions

app = Firecrawl(api_key="your-api-key")

# Simple scraping
result = app.scrape("https://example.com")
print(result.data.content)

# Scraping with options
options = ScrapeOptions(
    formats=["markdown", "html"],
    include_tags=["article", "main"],
    wait_for=2000,
    screenshot=True
)
result = app.scrape("https://example.com", options)
```

### Web Search

```python
from firecrawl import Firecrawl, SearchOptions

app = Firecrawl(api_key="your-api-key")

# Basic search
results = app.search("latest AI developments")
for doc in results.data:
    print(f"Title: {doc.metadata.get('title')}")
    print(f"Content: {doc.content[:200]}...")

# Search with options
options = SearchOptions(
    limit=10,
    search_type="news",
    language="en",
    country="US"
)
results = app.search("AI breakthrough", options)
```

### Website Mapping

```python
from firecrawl import Firecrawl, MapOptions

app = Firecrawl(api_key="your-api-key")

# Generate site map
options = MapOptions(max_depth=3)
site_map = app.map("https://example.com", options)

for page in site_map.data:
    print(f"URL: {page.url}")
    print(f"Status: {page.status}")
```

## Types

```python { .api }
class ScrapeOptions:
    """Configuration options for scraping operations"""
    formats: Optional[List[str]]  # Output formats: ["markdown", "html", "rawHtml", "screenshot", "links"]
    include_tags: Optional[List[str]]  # HTML tags to include
    exclude_tags: Optional[List[str]]  # HTML tags to exclude
    wait_for: Optional[int]  # Wait time in milliseconds
    screenshot: Optional[bool]  # Capture screenshot
    full_page_screenshot: Optional[bool]  # Full page screenshot
    mobile: Optional[bool]  # Use mobile user agent

class ScrapeResponse:
    """Response from scrape operation"""
    success: bool
    data: Document

class SearchOptions:
    """Configuration options for search operations"""
    limit: Optional[int]  # Maximum number of results (default: 5)
    search_type: Optional[str]  # Search type: "web", "news", "academic"
    language: Optional[str]  # Language code (e.g., "en")
    country: Optional[str]  # Country code (e.g., "US")

class SearchResponse:
    """Response from search operation"""
    success: bool
    data: List[Document]

class MapOptions:
    """Configuration options for mapping operations"""
    max_depth: Optional[int]  # Maximum crawl depth
    limit: Optional[int]  # Maximum pages to map
    ignore_sitemap: Optional[bool]  # Ignore sitemap.xml

class MapResponse:
    """Response from map operation"""
    success: bool
    data: List[dict]  # List of page information
```

## Async Usage

All scraping operations have async equivalents:

```python
import asyncio
from firecrawl import AsyncFirecrawl

async def scrape_async():
    app = AsyncFirecrawl(api_key="your-api-key")

    # Async scraping
    result = await app.scrape("https://example.com")

    # Async search
    search_results = await app.search("query")

    # Async mapping
    site_map = await app.map("https://example.com")

asyncio.run(scrape_async())
```