# CLI Tools

Command-line interface for project scaffolding and development workflow automation. The Crawlee CLI provides tools for quickly creating new projects with best practices and templates.

## Command Line Interface

### Main CLI Application

The main CLI entry point provides access to all Crawlee command-line tools.

```bash
crawlee --help       # Show help information
crawlee --version    # Display version information
crawlee create       # Create a new project
```

## Available Commands

### Project Creation

Create new Crawlee projects from predefined templates that bundle best practices and common patterns.

```bash
# Create a new project interactively
crawlee create

# Create a project with a specific name
crawlee create my-crawler

# Create a project with a specific template
crawlee create my-crawler --template basic

# Create a project in a specific directory
crawlee create my-crawler --output-dir ./projects
```

### Version Information

Display version information for the installed Crawlee package.

```bash
# Show version
crawlee --version

# Alternative version command
crawlee version
```

## Project Templates

### Available Templates

The CLI provides several project templates optimized for different use cases:

- **basic**: Simple HTTP crawler template
- **playwright**: Browser automation template using Playwright
- **beautifulsoup**: HTML parsing template using BeautifulSoup
- **adaptive**: Intelligent crawler that adapts between HTTP and browser modes
- **advanced**: Full-featured template with all Crawlee capabilities
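
Each template centers on one crawler class. Judging by the template examples later in this section, the correspondence can be sketched as follows (class names taken from those examples — an illustration, not the CLI's internal data):

```python
# Illustrative mapping of template name to the Crawlee crawler class each
# generated project is built around (derived from the examples in this document).
TEMPLATE_CRAWLERS: dict[str, str] = {
    'basic': 'HttpCrawler',
    'playwright': 'PlaywrightCrawler',
    'beautifulsoup': 'BeautifulSoupCrawler',
    'adaptive': 'AdaptivePlaywrightCrawler',
    'advanced': 'AdaptivePlaywrightCrawler',  # plus sessions, proxies, statistics
}

def crawler_class_for(template: str) -> str:
    """Return the crawler class name a given template is built around."""
    return TEMPLATE_CRAWLERS[template]
```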

### Template Structure

Generated projects include:

```
my-crawler/
├── src/
│   └── main.py            # Main crawler implementation
├── requirements.txt       # Python dependencies
├── pyproject.toml         # Project configuration
├── README.md              # Project documentation
├── .gitignore             # Git ignore rules
└── storage/               # Default storage directory
    ├── datasets/          # Scraped data storage
    ├── key_value_stores/  # Key-value storage
    └── request_queues/    # Request queue storage
```

## Usage Examples

### Interactive Project Creation

```bash
$ crawlee create

? Project name: my-web-scraper
? Select template:
  > basic
    playwright
    beautifulsoup
    adaptive
    advanced
? Description: A web scraper for extracting product data
? Author name: John Doe
? Author email: john@example.com
? Use session management? (y/N): y
? Use proxy rotation? (y/N): n
? Include example handlers? (Y/n): y

✅ Project created successfully!

📁 Project location: ./my-web-scraper
📋 Next steps:
   1. cd my-web-scraper
   2. pip install -r requirements.txt
   3. python src/main.py

🚀 Happy crawling!
```

### Basic Template Example

```python
# Generated src/main.py for the basic template
import asyncio

from crawlee.crawlers import HttpCrawler, HttpCrawlingContext


async def main() -> None:
    crawler = HttpCrawler(
        # Limit the crawl to a maximum number of requests.
        # Remove or increase it for crawling all links.
        max_requests_per_crawl=10,
    )

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Extract data from the page.
        data = {
            'url': context.request.url,
            'title': 'TODO: Extract title from response',  # Add your extraction logic
            'content_length': len(context.body),
        }

        # Push the extracted data to the default dataset.
        await context.push_data(data)

        # Find and enqueue links from the current page.
        await context.enqueue_links()

    # Run the crawler with the initial list of URLs.
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```

### Playwright Template Example

```python
# Generated src/main.py for the playwright template
import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler(
        # Limit the crawl to a maximum number of requests.
        # Remove or increase it for crawling all links.
        max_requests_per_crawl=10,
        # Headless mode (set to False to see the browser window).
        headless=True,
    )

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Wait for the page to fully load.
        await context.page.wait_for_load_state('networkidle')

        # Extract data from the page using Playwright selectors.
        title = await context.page.title()

        data = {
            'url': context.request.url,
            'title': title,
        }

        # Push the extracted data to the default dataset.
        await context.push_data(data)

        # Find and enqueue links from the current page.
        await context.enqueue_links()

    # Run the crawler with the initial list of URLs.
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```

### BeautifulSoup Template Example

```python
# Generated src/main.py for the beautifulsoup template
import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler(
        # Limit the crawl to a maximum number of requests.
        # Remove or increase it for crawling all links.
        max_requests_per_crawl=10,
    )

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Extract data from the page using BeautifulSoup.
        title_element = context.soup.find('title')
        title = title_element.get_text().strip() if title_element else 'No title'

        data = {
            'url': context.request.url,
            'title': title,
        }

        # Push the extracted data to the default dataset.
        await context.push_data(data)

        # Find and enqueue links from the current page.
        await context.enqueue_links()

    # Run the crawler with the initial list of URLs.
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```

### Advanced Template Features

The advanced template includes:

- Multiple crawler types with routing
- Session management
- Proxy configuration
- Error handling
- Statistics monitoring
- Data export functionality
- Configuration management
```python
# Advanced template excerpt
import asyncio

from crawlee.crawlers import AdaptivePlaywrightCrawler, AdaptivePlaywrightCrawlingContext
from crawlee.sessions import SessionPool


async def main() -> None:
    # Configure the session pool.
    session_pool = SessionPool(max_pool_size=100)

    # Configure proxy rotation (optional).
    # from crawlee.proxy_configuration import ProxyConfiguration
    # proxy_config = ProxyConfiguration(
    #     proxy_urls=['http://proxy1:8080', 'http://proxy2:8080'],
    # )

    crawler = AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser(
        max_requests_per_crawl=50,
        session_pool=session_pool,
        # proxy_configuration=proxy_config,
        playwright_crawler_specific_kwargs={'headless': True},
    )

    # Handler for requests labeled 'product'.
    @crawler.router.handler('product')
    async def product_handler(context: AdaptivePlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing product: {context.request.url}')

        # Extract product data. Add your product extraction logic here.
        data = {
            'url': context.request.url,
            'type': 'product',
            'title': 'TODO: Extract product title',
            'price': 'TODO: Extract product price',
        }

        await context.push_data(data)

    # Default handler for all other pages.
    @crawler.router.default_handler
    async def default_handler(context: AdaptivePlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')

        # Enqueue product links and route them to the product handler.
        await context.enqueue_links(
            selector='a[href*="/product"]',
            label='product',
        )

        data = {
            'url': context.request.url,
            'type': 'page',
        }

        await context.push_data(data)

    # Handler for requests that still fail after all retries.
    @crawler.failed_request_handler
    async def failed_handler(context, error: Exception) -> None:
        context.log.error(f'Error processing {context.request.url}: {error}')

        # Record the error alongside the scraped data.
        await context.push_data({
            'url': context.request.url,
            'error': str(error),
            'type': 'error',
        })

    # Run the crawler; statistics are collected automatically and returned.
    final_stats = await crawler.run(['https://example-store.com'])

    # Print the final statistics.
    print(f'Crawl finished: {final_stats.requests_finished} succeeded, '
          f'{final_stats.requests_failed} failed')


if __name__ == '__main__':
    asyncio.run(main())
```
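
The excerpt prints raw counters; a success-rate percentage can be derived from the same two numbers with plain arithmetic, independent of Crawlee's statistics objects:

```python
def success_rate(requests_finished: int, requests_failed: int) -> float:
    """Percentage of requests that finished successfully."""
    total = requests_finished + requests_failed
    if total == 0:
        return 0.0  # Avoid division by zero for an empty crawl.
    return 100.0 * requests_finished / total

# e.g. print(f'Success rate: {success_rate(48, 2):.1f}%')  # → Success rate: 96.0%
```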

## Configuration Files

### Project Configuration (pyproject.toml)

```toml
# Generated pyproject.toml
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
name = "my-web-scraper"
version = "0.1.0"
description = "A web scraper for extracting product data"
authors = [
    {name = "John Doe", email = "john@example.com"},
]
readme = "README.md"
license = {file = "LICENSE"}
requires-python = ">=3.9"
dependencies = [
    "crawlee[all]>=0.6.0",
]

[project.optional-dependencies]
dev = [
    "pytest>=7.0",
    "pytest-asyncio>=0.21.0",
    "black>=23.0",
    "ruff>=0.1.0",
]

[tool.black]
line-length = 100

[tool.ruff]
line-length = 100

[tool.ruff.lint]
select = ["E", "F", "I"]

[tool.pytest.ini_options]
asyncio_mode = "auto"
```

### Requirements File

```txt
# Generated requirements.txt
crawlee[all]>=0.6.0

# Development dependencies (optional)
# pytest>=7.0
# pytest-asyncio>=0.21.0
# black>=23.0
# ruff>=0.1.0
```

### README Template

````markdown
# My Web Scraper

A web scraper for extracting product data built with Crawlee.

## Installation

1. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

2. Install Playwright browsers (if using Playwright):

   ```bash
   playwright install
   ```

## Usage

Run the scraper:

```bash
python src/main.py
```

## Configuration

- Modify `src/main.py` to customize scraping logic
- Adjust `max_requests_per_crawl` to control crawl size
- Update target URLs in the `crawler.run()` call

## Data Output

Scraped data is saved to:

- `./storage/datasets/` - Structured data in JSON format
- `./storage/key_value_stores/` - Key-value pairs and files
- `./storage/request_queues/` - Request queue state

## Development

Run tests:

```bash
pytest
```

Format code:

```bash
black src/
ruff check src/
```

## License

This project is licensed under the MIT License.
````

## Advanced CLI Usage

### Custom Templates

Create custom templates by extending the CLI. Note that `crawlee._cli` is a private module (the leading underscore signals this), so these helpers may change between releases:

```python
# custom_template.py
from crawlee._cli import create_project_template

def create_custom_template(project_name: str, output_dir: str):
    """Create a project with a custom template."""
    template_data = {
        'project_name': project_name,
        'crawler_type': 'custom',
        'features': ['sessions', 'statistics', 'error_handling'],
    }

    return create_project_template(
        template_name='custom',
        project_name=project_name,
        output_dir=output_dir,
        template_data=template_data,
    )
```

### Programmatic Usage

Use the CLI functionality programmatically (again via the private `crawlee._cli` module):

```python
import asyncio

from crawlee._cli import CLICommands

async def create_project_programmatically():
    """Create a project using the CLI programmatically."""
    cli = CLICommands()

    result = await cli.create_project(
        project_name='automated-scraper',
        template='playwright',
        output_dir='./projects',
        options={
            'author_name': 'Automation Script',
            'author_email': 'automation@example.com',
            'include_sessions': True,
            'include_examples': True,
        },
    )

    if result.success:
        print(f'Project created: {result.project_path}')
    else:
        print(f'Failed to create project: {result.error}')

asyncio.run(create_project_programmatically())
```

The Crawlee CLI provides a quick and efficient way to bootstrap new web scraping projects with industry best practices, letting developers focus on extraction logic rather than project setup and configuration.