# CLI Tools

Command-line interface for project scaffolding and development workflow automation. The Crawlee CLI provides tools for quickly creating new projects with best practices and templates.

## Command Line Interface

### Main CLI Application

The main CLI entry point provides access to all Crawlee command-line tools.

```bash
crawlee --help       # Show help information
crawlee --version    # Display version information
crawlee create       # Create a new project
```

## Available Commands

### Project Creation

Create new Crawlee projects from predefined templates that bundle best practices and common patterns.

```bash
# Create a new project interactively
crawlee create

# Create a project with a specific name
crawlee create my-crawler

# Create a project with a specific template
crawlee create my-crawler --template basic

# Create a project in a specific directory
crawlee create my-crawler --output-dir ./projects
```

### Version Information

Display version information for the installed Crawlee package.

```bash
# Show version
crawlee --version

# Alternative version command
crawlee version
```

## Project Templates

### Available Templates

The CLI provides several project templates optimized for different use cases:

- **basic**: Simple HTTP crawler template
- **playwright**: Browser automation template using Playwright
- **beautifulsoup**: HTML parsing template using BeautifulSoup
- **adaptive**: Intelligent crawler that adapts between HTTP and browser modes
- **advanced**: Full-featured template with all Crawlee capabilities
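
Each template centers on one crawler class. Judging by the template examples later in this section, the correspondence can be sketched as follows (class names taken from those examples — an illustration, not the CLI's internal data):

```python
# Illustrative mapping of template name to the Crawlee crawler class each
# generated project is built around (derived from the examples in this document).
TEMPLATE_CRAWLERS: dict[str, str] = {
    'basic': 'HttpCrawler',
    'playwright': 'PlaywrightCrawler',
    'beautifulsoup': 'BeautifulSoupCrawler',
    'adaptive': 'AdaptivePlaywrightCrawler',
    'advanced': 'AdaptivePlaywrightCrawler',  # plus sessions, proxies, statistics
}

def crawler_class_for(template: str) -> str:
    """Return the crawler class name a given template is built around."""
    return TEMPLATE_CRAWLERS[template]
```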

### Template Structure

Generated projects include:

```
my-crawler/
├── src/
│   └── main.py            # Main crawler implementation
├── requirements.txt       # Python dependencies
├── pyproject.toml         # Project configuration
├── README.md              # Project documentation
├── .gitignore             # Git ignore rules
└── storage/               # Default storage directory
    ├── datasets/          # Scraped data storage
    ├── key_value_stores/  # Key-value storage
    └── request_queues/    # Request queue storage
```

## Usage Examples

### Interactive Project Creation

```bash
$ crawlee create

? Project name: my-web-scraper
? Select template:
  > basic
    playwright
    beautifulsoup
    adaptive
    advanced
? Description: A web scraper for extracting product data
? Author name: John Doe
? Author email: john@example.com
? Use session management? (y/N): y
? Use proxy rotation? (y/N): n
? Include example handlers? (Y/n): y

✅ Project created successfully!

📁 Project location: ./my-web-scraper
📋 Next steps:
   1. cd my-web-scraper
   2. pip install -r requirements.txt
   3. python src/main.py

🚀 Happy crawling!
```

### Basic Template Example

```python
# Generated src/main.py for the basic template
import asyncio

from crawlee.crawlers import HttpCrawler, HttpCrawlingContext


async def main() -> None:
    crawler = HttpCrawler(
        # Limit the crawl to a maximum number of requests.
        # Remove or increase it for crawling all links.
        max_requests_per_crawl=10,
    )

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Extract data from the page.
        data = {
            'url': context.request.url,
            'title': 'TODO: Extract title from response',  # Add your extraction logic
            'content_length': len(context.body),
        }

        # Push the extracted data to the default dataset.
        await context.push_data(data)

        # Find and enqueue links from the current page.
        await context.enqueue_links()

    # Run the crawler with the initial list of URLs.
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```

### Playwright Template Example

```python
# Generated src/main.py for the playwright template
import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler(
        # Limit the crawl to a maximum number of requests.
        # Remove or increase it for crawling all links.
        max_requests_per_crawl=10,
        # Headless mode (set to False to see the browser window).
        headless=True,
    )

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Wait for the page to fully load.
        await context.page.wait_for_load_state('networkidle')

        # Extract data from the page using Playwright selectors.
        title = await context.page.title()

        data = {
            'url': context.request.url,
            'title': title,
        }

        # Push the extracted data to the default dataset.
        await context.push_data(data)

        # Find and enqueue links from the current page.
        await context.enqueue_links()

    # Run the crawler with the initial list of URLs.
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```

### BeautifulSoup Template Example

```python
# Generated src/main.py for the beautifulsoup template
import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler(
        # Limit the crawl to a maximum number of requests.
        # Remove or increase it for crawling all links.
        max_requests_per_crawl=10,
    )

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Extract data from the page using BeautifulSoup.
        title_element = context.soup.find('title')
        title = title_element.get_text().strip() if title_element else 'No title'

        data = {
            'url': context.request.url,
            'title': title,
        }

        # Push the extracted data to the default dataset.
        await context.push_data(data)

        # Find and enqueue links from the current page.
        await context.enqueue_links()

    # Run the crawler with the initial list of URLs.
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```

### Advanced Template Features

The advanced template includes:

- Multiple crawler types with routing
- Session management
- Proxy configuration
- Error handling
- Statistics monitoring
- Data export functionality
- Configuration management
```python
# Advanced template excerpt
import asyncio

from crawlee.crawlers import AdaptivePlaywrightCrawler, AdaptivePlaywrightCrawlingContext
from crawlee.sessions import SessionPool


async def main() -> None:
    # Configure the session pool.
    session_pool = SessionPool(max_pool_size=100)

    # Configure proxy rotation (optional).
    # from crawlee.proxy_configuration import ProxyConfiguration
    # proxy_config = ProxyConfiguration(
    #     proxy_urls=['http://proxy1:8080', 'http://proxy2:8080'],
    # )

    crawler = AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser(
        max_requests_per_crawl=50,
        session_pool=session_pool,
        # proxy_configuration=proxy_config,
        playwright_crawler_specific_kwargs={'headless': True},
    )

    # Handler for requests labeled 'product'.
    @crawler.router.handler('product')
    async def product_handler(context: AdaptivePlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing product: {context.request.url}')

        # Extract product data. Add your product extraction logic here.
        data = {
            'url': context.request.url,
            'type': 'product',
            'title': 'TODO: Extract product title',
            'price': 'TODO: Extract product price',
        }

        await context.push_data(data)

    # Default handler for all other pages.
    @crawler.router.default_handler
    async def default_handler(context: AdaptivePlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')

        # Enqueue product links and route them to the product handler.
        await context.enqueue_links(
            selector='a[href*="/product"]',
            label='product',
        )

        data = {
            'url': context.request.url,
            'type': 'page',
        }

        await context.push_data(data)

    # Handler for requests that still fail after all retries.
    @crawler.failed_request_handler
    async def failed_handler(context, error: Exception) -> None:
        context.log.error(f'Error processing {context.request.url}: {error}')

        # Record the error alongside the scraped data.
        await context.push_data({
            'url': context.request.url,
            'error': str(error),
            'type': 'error',
        })

    # Run the crawler; statistics are collected automatically and returned.
    final_stats = await crawler.run(['https://example-store.com'])

    # Print the final statistics.
    print(f'Crawl finished: {final_stats.requests_finished} succeeded, '
          f'{final_stats.requests_failed} failed')


if __name__ == '__main__':
    asyncio.run(main())
```
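
The excerpt prints raw counters; a success-rate percentage can be derived from the same two numbers with plain arithmetic, independent of Crawlee's statistics objects:

```python
def success_rate(requests_finished: int, requests_failed: int) -> float:
    """Percentage of requests that finished successfully."""
    total = requests_finished + requests_failed
    if total == 0:
        return 0.0  # Avoid division by zero for an empty crawl.
    return 100.0 * requests_finished / total

# e.g. print(f'Success rate: {success_rate(48, 2):.1f}%')  # → Success rate: 96.0%
```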

## Configuration Files

### Project Configuration (pyproject.toml)

```toml
# Generated pyproject.toml
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
name = "my-web-scraper"
version = "0.1.0"
description = "A web scraper for extracting product data"
authors = [
    {name = "John Doe", email = "john@example.com"},
]
readme = "README.md"
license = {file = "LICENSE"}
requires-python = ">=3.9"
dependencies = [
    "crawlee[all]>=0.6.0",
]

[project.optional-dependencies]
dev = [
    "pytest>=7.0",
    "pytest-asyncio>=0.21.0",
    "black>=23.0",
    "ruff>=0.1.0",
]

[tool.black]
line-length = 100

[tool.ruff]
line-length = 100

[tool.ruff.lint]
select = ["E", "F", "I"]

[tool.pytest.ini_options]
asyncio_mode = "auto"
```

### Requirements File

```txt
# Generated requirements.txt
crawlee[all]>=0.6.0

# Development dependencies (optional)
# pytest>=7.0
# pytest-asyncio>=0.21.0
# black>=23.0
# ruff>=0.1.0
```

### README Template

````markdown
# My Web Scraper

A web scraper for extracting product data built with Crawlee.

## Installation

1. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

2. Install Playwright browsers (if using Playwright):

   ```bash
   playwright install
   ```

## Usage

Run the scraper:

```bash
python src/main.py
```

## Configuration

- Modify `src/main.py` to customize scraping logic
- Adjust `max_requests_per_crawl` to control crawl size
- Update target URLs in the `crawler.run()` call

## Data Output

Scraped data is saved to:

- `./storage/datasets/` - Structured data in JSON format
- `./storage/key_value_stores/` - Key-value pairs and files
- `./storage/request_queues/` - Request queue state

## Development

Run tests:

```bash
pytest
```

Format code:

```bash
black src/
ruff check src/
```

## License

This project is licensed under the MIT License.
````

## Advanced CLI Usage

### Custom Templates

Create custom templates by extending the CLI. Note that `crawlee._cli` is a private module (the leading underscore signals this), so these helpers may change between releases:

```python
# custom_template.py
from crawlee._cli import create_project_template

def create_custom_template(project_name: str, output_dir: str):
    """Create a project with a custom template."""
    template_data = {
        'project_name': project_name,
        'crawler_type': 'custom',
        'features': ['sessions', 'statistics', 'error_handling'],
    }

    return create_project_template(
        template_name='custom',
        project_name=project_name,
        output_dir=output_dir,
        template_data=template_data,
    )
```

### Programmatic Usage

Use the CLI functionality programmatically (again via the private `crawlee._cli` module):

```python
import asyncio

from crawlee._cli import CLICommands

async def create_project_programmatically():
    """Create a project using the CLI programmatically."""
    cli = CLICommands()

    result = await cli.create_project(
        project_name='automated-scraper',
        template='playwright',
        output_dir='./projects',
        options={
            'author_name': 'Automation Script',
            'author_email': 'automation@example.com',
            'include_sessions': True,
            'include_examples': True,
        },
    )

    if result.success:
        print(f'Project created: {result.project_path}')
    else:
        print(f'Failed to create project: {result.error}')

asyncio.run(create_project_programmatically())
```

The Crawlee CLI provides a quick and efficient way to bootstrap new web scraping projects with industry best practices, letting developers focus on extraction logic rather than project setup and configuration.