# Scraping Operations

Essential web scraping functionality for extracting content from single URLs, searching the web, and mapping website structures. These operations provide immediate results with comprehensive format and processing options.

## Capabilities

### Single URL Scraping

Extract content from a single webpage with extensive formatting and processing options including markdown conversion, HTML extraction, screenshots, and metadata collection.

```python { .api }
def scrape(
    url: str,
    *,
    formats: Optional[List[str]] = None,
    headers: Optional[Dict[str, str]] = None,
    include_tags: Optional[List[str]] = None,
    exclude_tags: Optional[List[str]] = None,
    only_main_content: Optional[bool] = None,
    timeout: Optional[int] = None,
    wait_for: Optional[int] = None,
    mobile: Optional[bool] = None,
    parsers: Optional[List[str]] = None,
    actions: Optional[List[dict]] = None,
    location: Optional[dict] = None,
    skip_tls_verification: Optional[bool] = None,
    remove_base64_images: Optional[bool] = None,
    fast_mode: Optional[bool] = None,
    use_mock: Optional[str] = None,
    block_ads: Optional[bool] = None,
    proxy: Optional[str] = None,
    max_age: Optional[int] = None,
    store_in_cache: Optional[bool] = None,
    integration: Optional[str] = None
) -> Document:
    """
    Scrape content from a single URL.

    Parameters:
    - url: str, target URL to scrape
    - formats: List[str], output formats ("markdown", "html", "rawHtml", "screenshot", "links")
    - headers: Dict[str, str], custom HTTP headers
    - include_tags: List[str], HTML tags to include
    - exclude_tags: List[str], HTML tags to exclude
    - only_main_content: bool, extract only main content
    - timeout: int, request timeout in milliseconds
    - wait_for: int, wait time before scraping in milliseconds
    - mobile: bool, use mobile user agent
    - parsers: List[str], content parsers to use
    - actions: List[dict], browser actions to perform
    - location: dict, geographic location settings
    - skip_tls_verification: bool, skip SSL certificate verification
    - remove_base64_images: bool, remove base64-encoded images
    - fast_mode: bool, use faster scraping mode
    - use_mock: str, use mock response for testing
    - block_ads: bool, block advertisements
    - proxy: str, proxy server to use
    - max_age: int, maximum cache age in milliseconds
    - store_in_cache: bool, store result in cache
    - integration: str, integration identifier

    Returns:
    - Document: scraped content and metadata
    """
```
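
For pages that need interaction before capture, `actions` drives the browser step by step and `location` sets the request's geography. A minimal sketch, assuming the `type`-keyed action dicts and `country`/`languages` location keys from Firecrawl's scrape API; the `#load-more` selector is hypothetical:

```python
from firecrawl import Firecrawl

app = Firecrawl(api_key="your-api-key")

# Actions run in order before the page content is captured.
result = app.scrape(
    "https://example.com",
    formats=["markdown", "screenshot"],
    actions=[
        {"type": "wait", "milliseconds": 2000},
        {"type": "click", "selector": "#load-more"},  # hypothetical selector
        {"type": "scroll", "direction": "down"},
    ],
    location={"country": "US", "languages": ["en"]},
)
```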

### Web Search

Search the web with optional content extraction, returning relevant results formatted for LLM consumption.

```python { .api }
def search(
    query: str,
    *,
    sources: Optional[List[str]] = None,
    categories: Optional[List[str]] = None,
    limit: Optional[int] = None,
    tbs: Optional[str] = None,
    location: Optional[str] = None,
    ignore_invalid_urls: Optional[bool] = None,
    timeout: Optional[int] = None,
    scrape_options: Optional[dict] = None,
    integration: Optional[str] = None
) -> SearchData:
    """
    Search the web and extract content from results.

    Parameters:
    - query: str, search query
    - sources: List[str], search sources to use
    - categories: List[str], content categories to filter
    - limit: int, maximum number of results
    - tbs: str, time-based search parameters
    - location: str, geographic location for search
    - ignore_invalid_urls: bool, skip invalid URLs in results
    - timeout: int, request timeout in milliseconds
    - scrape_options: dict, options for scraping search results
    - integration: str, integration identifier

    Returns:
    - SearchData: search results with extracted content
    """
```
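
When the full page content of each hit is needed, results can be scraped in the same call via `scrape_options`. A sketch, assuming the dict accepts the same keys as `scrape`:

```python
from firecrawl import Firecrawl

app = Firecrawl(api_key="your-api-key")

# Each result comes back with markdown extracted from the page
results = app.search(
    "firecrawl tutorials",
    limit=3,
    scrape_options={"formats": ["markdown"], "only_main_content": True},
)
```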

### Website Mapping

Generate a structural map of a website showing available pages and their relationships, useful for understanding site architecture before crawling.

```python { .api }
def map(
    url: str,
    *,
    search: Optional[str] = None,
    include_subdomains: Optional[bool] = None,
    limit: Optional[int] = None,
    sitemap: str = "include",
    timeout: Optional[int] = None,
    integration: Optional[str] = None,
    location: Optional[dict] = None
) -> MapData:
    """
    Generate a map of website structure.

    Parameters:
    - url: str, target website URL
    - search: Optional[str], search term to filter URLs
    - include_subdomains: Optional[bool], include subdomain URLs
    - limit: Optional[int], maximum number of URLs to return
    - sitemap: str, sitemap handling ("include", "exclude", "only")
    - timeout: Optional[int], request timeout in milliseconds
    - integration: Optional[str], integration identifier
    - location: Optional[dict], geographic location settings

    Returns:
    - MapData: website structure map with URLs and metadata
    """
```
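
The `sitemap` parameter chooses how discovery treats the site's sitemap: blend it with crawled links ("include", the default), ignore it ("exclude"), or rely on it exclusively ("only"). A sketch combining it with the `search` filter:

```python
from firecrawl import Firecrawl

app = Firecrawl(api_key="your-api-key")

# Restrict discovery to sitemap entries whose URLs match "docs"
docs_map = app.map(
    "https://example.com",
    sitemap="only",
    search="docs",
    limit=50,
)
```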

## Usage Examples

### Basic Scraping

```python
from firecrawl import Firecrawl

app = Firecrawl(api_key="your-api-key")

# Simple scrape: returns a Document
result = app.scrape("https://example.com")
print(result.markdown)

# Scrape with options; screenshots are requested as a format
result = app.scrape(
    "https://example.com",
    formats=["markdown", "html", "screenshot"],
    include_tags=["article", "main"],
    wait_for=2000,
)
```
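
Repeat scrapes of the same URL can be served from Firecrawl's cache. A sketch, assuming `max_age` is expressed in milliseconds like the other timing parameters:

```python
# Accept a cached copy up to one hour old; store_in_cache=False
# would skip writing this result back to the cache.
result = app.scrape(
    "https://example.com",
    max_age=3600000,
    store_in_cache=True,
)
```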

### Web Search

```python
from firecrawl import Firecrawl

app = Firecrawl(api_key="your-api-key")

# Basic search; results are grouped by source (web, news, ...)
results = app.search("latest AI developments")
for hit in results.web:
    print(f"Title: {hit.title}")
    print(f"URL: {hit.url}")

# Search with options
results = app.search(
    "AI breakthrough",
    limit=10,
    sources=["news"],
    tbs="qdr:w",  # time-based filter: past week
)
```

### Website Mapping

```python
from firecrawl import Firecrawl

app = Firecrawl(api_key="your-api-key")

# Generate a site map; use limit to bound the number of URLs returned
site_map = app.map("https://example.com", limit=100)

for link in site_map.links:
    print(f"URL: {link.url}")
    print(f"Title: {link.title}")
```

## Types

```python { .api }
class Document:
    """Scraped content and metadata returned by scrape().
    Fields are populated according to the requested formats;
    this is the commonly used subset."""
    markdown: Optional[str]      # Markdown conversion of the page
    html: Optional[str]          # Processed HTML
    raw_html: Optional[str]      # Unprocessed HTML
    links: Optional[List[str]]   # Links found on the page
    screenshot: Optional[str]    # Screenshot URL or base64 data
    metadata: Optional[dict]     # Title, description, status code, etc.

class SearchResult:
    """A single search hit; also carries Document fields
    (markdown, html, ...) when scrape_options is set."""
    url: str
    title: Optional[str]
    description: Optional[str]

class SearchData:
    """Search results returned by search(), grouped by source"""
    web: Optional[List[SearchResult]]
    news: Optional[List[SearchResult]]
    images: Optional[List[SearchResult]]

class LinkResult:
    """A single URL discovered by map()"""
    url: str
    title: Optional[str]
    description: Optional[str]

class MapData:
    """Website structure map returned by map()"""
    links: List[LinkResult]
```

## Async Usage

All scraping operations have async equivalents:

```python
import asyncio
from firecrawl import AsyncFirecrawl

async def scrape_async():
    app = AsyncFirecrawl(api_key="your-api-key")

    # Async scraping
    result = await app.scrape("https://example.com")

    # Async search
    search_results = await app.search("query")

    # Async mapping
    site_map = await app.map("https://example.com")

asyncio.run(scrape_async())
```
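
Because the async client returns awaitables, independent scrapes can run concurrently with `asyncio.gather`. A sketch using an illustrative `scrape_many` helper:

```python
import asyncio
from firecrawl import AsyncFirecrawl

async def scrape_many(urls):
    app = AsyncFirecrawl(api_key="your-api-key")
    # Launch every scrape at once; return_exceptions=True keeps one
    # failing URL from cancelling the rest of the batch.
    return await asyncio.gather(
        *(app.scrape(url) for url in urls),
        return_exceptions=True,
    )

docs = asyncio.run(scrape_many(["https://example.com", "https://example.org"]))
```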