
# CLI Tools

Command-line interface for project scaffolding and development workflow automation. The Crawlee CLI provides tools for quickly creating new projects from templates that encode best practices.

## Command Line Interface

### Main CLI Application

The main CLI entry point provides access to all Crawlee command-line tools.

```python { .api }
# Command-line usage:
# crawlee --help       # Show help information
# crawlee --version    # Display version information
# crawlee create       # Create a new project
```

## Available Commands

### Project Creation

Create new Crawlee projects using predefined templates with best practices and common patterns.

```bash
# Create new project interactively
crawlee create

# Create project with specific name
crawlee create my-crawler

# Create project with template
crawlee create my-crawler --template basic

# Create project in specific directory
crawlee create my-crawler --output-dir ./projects
```

### Version Information

Display version information for the installed Crawlee package.

```bash
# Show version
crawlee --version

# Alternative version command
crawlee version
```
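To check the installed version from Python rather than the CLI, standard package metadata can be used. This is a generic sketch using the standard library's `importlib.metadata`, not a Crawlee-specific API:

```python
# Query the installed Crawlee version via standard package metadata.
from importlib.metadata import PackageNotFoundError, version

try:
    print(version('crawlee'))
except PackageNotFoundError:
    print('crawlee is not installed')
```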

## Project Templates

### Available Templates

The CLI provides several project templates optimized for different use cases:

- **basic**: Simple HTTP crawler template
- **playwright**: Browser automation template using Playwright
- **beautifulsoup**: HTML parsing template using BeautifulSoup
- **adaptive**: Intelligent crawler that adapts between HTTP and browser modes
- **advanced**: Full-featured template with all Crawlee capabilities

### Template Structure

Generated projects include:

```
my-crawler/
├── src/
│   └── main.py              # Main crawler implementation
├── requirements.txt         # Python dependencies
├── pyproject.toml           # Project configuration
├── README.md                # Project documentation
├── .gitignore               # Git ignore rules
└── storage/                 # Default storage directory
    ├── datasets/            # Scraped data storage
    ├── key_value_stores/    # Key-value storage
    └── request_queues/      # Request queue storage
```
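Items written with `push_data()` end up as JSON files under the dataset directory shown above. A minimal sketch for reading them back with the standard library, assuming the default file-system storage layout (one numbered JSON file per item; the exact layout may differ between Crawlee versions):

```python
import json
from pathlib import Path


def load_dataset_items(storage_dir: str = './storage', dataset: str = 'default') -> list:
    """Load scraped items from a local Crawlee dataset directory.

    Assumes the default layout: storage/datasets/<name>/ holding one
    numbered JSON file per item.
    """
    items_dir = Path(storage_dir) / 'datasets' / dataset
    items = []
    for path in sorted(items_dir.glob('*.json')):
        if not path.stem.isdigit():
            continue  # Skip metadata or other non-item files.
        items.append(json.loads(path.read_text(encoding='utf-8')))
    return items
```

After a crawl, `load_dataset_items()` returns the list of dicts that were passed to `push_data()`.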

## Usage Examples

### Interactive Project Creation

```bash
$ crawlee create

? Project name: my-web-scraper
? Select template:
  > basic
    playwright
    beautifulsoup
    adaptive
    advanced
? Description: A web scraper for extracting product data
? Author name: John Doe
? Author email: john@example.com
? Use session management? (y/N): y
? Use proxy rotation? (y/N): n
? Include example handlers? (Y/n): y

✅ Project created successfully!

📁 Project location: ./my-web-scraper
📋 Next steps:
  1. cd my-web-scraper
  2. pip install -r requirements.txt
  3. python src/main.py

🚀 Happy crawling!
```

### Basic Template Example

```python
# Generated src/main.py for the basic template
import asyncio

from crawlee.crawlers import HttpCrawler, HttpCrawlingContext


async def main() -> None:
    crawler = HttpCrawler(
        # Limit the crawl to max requests. Remove or increase it for crawling all links.
        max_requests_per_crawl=10,
    )

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Extract data from the page.
        data = {
            'url': context.request.url,
            'title': 'TODO: Extract title from response',  # Add your extraction logic
            'content_length': len(context.body),
        }

        # Push the extracted data to the default dataset.
        await context.push_data(data)

        # Find and enqueue links from the current page.
        await context.enqueue_links()

    # Run the crawler with the initial list of URLs.
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```

### Playwright Template Example

```python
# Generated src/main.py for the playwright template
import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler(
        # Limit the crawl to max requests. Remove or increase it for crawling all links.
        max_requests_per_crawl=10,
        # Headless mode (set to False to see the browser window)
        headless=True,
    )

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Wait for the page to fully load.
        await context.page.wait_for_load_state('networkidle')

        # Extract data from the page using Playwright selectors.
        title = await context.page.title()

        data = {
            'url': context.request.url,
            'title': title,
        }

        # Push the extracted data to the default dataset.
        await context.push_data(data)

        # Find and enqueue links from the current page.
        await context.enqueue_links()

    # Run the crawler with the initial list of URLs.
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```

### BeautifulSoup Template Example

```python
# Generated src/main.py for the beautifulsoup template
import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler(
        # Limit the crawl to max requests. Remove or increase it for crawling all links.
        max_requests_per_crawl=10,
    )

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Extract data from the page using BeautifulSoup.
        title_element = context.soup.find('title')
        title = title_element.get_text().strip() if title_element else 'No title'

        data = {
            'url': context.request.url,
            'title': title,
        }

        # Push the extracted data to the default dataset.
        await context.push_data(data)

        # Find and enqueue links from the current page.
        await context.enqueue_links()

    # Run the crawler with the initial list of URLs.
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```

### Advanced Template Features

The advanced template includes:

- Multiple crawler types with routing
- Session management
- Proxy configuration
- Error handling
- Statistics monitoring
- Data export functionality
- Configuration management

```python
# Advanced template excerpt
import asyncio

from crawlee.crawlers import AdaptivePlaywrightCrawler, AdaptivePlaywrightCrawlingContext
from crawlee.proxy_configuration import ProxyConfiguration
from crawlee.sessions import SessionPool
from crawlee.statistics import Statistics


async def main() -> None:
    # Configure the session pool.
    session_pool = SessionPool(max_pool_size=100)

    # Configure proxy rotation (optional).
    # proxy_config = ProxyConfiguration(proxy_urls=[
    #     'http://proxy1:8080',
    #     'http://proxy2:8080',
    # ])

    # Configure statistics.
    stats = Statistics()

    crawler = AdaptivePlaywrightCrawler(
        max_requests_per_crawl=50,
        session_pool=session_pool,
        # proxy_configuration=proxy_config,
        statistics=stats,
        headless=True,
    )

    # Handler for requests labeled 'product'.
    @crawler.router.handler('product')
    async def product_handler(context: AdaptivePlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing product: {context.request.url}')

        # Extract product data.
        # Add your product extraction logic here.
        data = {
            'url': context.request.url,
            'type': 'product',
            'title': 'TODO: Extract product title',
            'price': 'TODO: Extract product price',
        }

        await context.push_data(data)

    # Default handler for other pages.
    @crawler.router.default_handler
    async def default_handler(context: AdaptivePlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')

        # Enqueue product links, routing them to the product handler via the label.
        await context.enqueue_links(
            selector='a[href*="/product"]',
            label='product',
        )

        data = {
            'url': context.request.url,
            'type': 'page',
        }

        await context.push_data(data)

    # Error handler, invoked when a request handler raises an error.
    @crawler.error_handler
    async def error_handler(context: AdaptivePlaywrightCrawlingContext, error: Exception) -> None:
        context.log.error(f'Error processing {context.request.url}: {error}')

        # Log error data.
        await context.push_data({
            'url': context.request.url,
            'error': str(error),
            'type': 'error',
        })

    # Run the crawler.
    final_stats = await crawler.run(['https://example-store.com'])

    # Print final statistics.
    print(f'Crawl completed: {final_stats.requests_finished} finished, {final_stats.requests_failed} failed.')


if __name__ == '__main__':
    asyncio.run(main())
```

## Configuration Files

### Project Configuration (pyproject.toml)

```toml
# Generated pyproject.toml
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
name = "my-web-scraper"
version = "0.1.0"
description = "A web scraper for extracting product data"
authors = [
    {name = "John Doe", email = "john@example.com"},
]
readme = "README.md"
license = {file = "LICENSE"}
requires-python = ">=3.9"
dependencies = [
    "crawlee[all]>=0.6.0",
]

[project.optional-dependencies]
dev = [
    "pytest>=7.0",
    "pytest-asyncio>=0.21.0",
    "black>=23.0",
    "ruff>=0.1.0",
]

[tool.black]
line-length = 100

[tool.ruff]
line-length = 100
select = ["E", "F", "I"]

[tool.pytest.ini_options]
asyncio_mode = "auto"
```

### Requirements File

```txt
# Generated requirements.txt
crawlee[all]>=0.6.0

# Development dependencies (optional)
# pytest>=7.0
# pytest-asyncio>=0.21.0
# black>=23.0
# ruff>=0.1.0
```

### README Template

````markdown
# My Web Scraper

A web scraper for extracting product data built with Crawlee.

## Installation

1. Install dependencies:

```bash
pip install -r requirements.txt
```

2. Install Playwright browsers (if using Playwright):

```bash
playwright install
```

## Usage

Run the scraper:

```bash
python src/main.py
```

## Configuration

- Modify `src/main.py` to customize scraping logic
- Adjust `max_requests_per_crawl` to control crawl size
- Update target URLs in the `crawler.run()` call

## Data Output

Scraped data is saved to:

- `./storage/datasets/` - Structured data in JSON format
- `./storage/key_value_stores/` - Key-value pairs and files
- `./storage/request_queues/` - Request queue state

## Development

Run tests:

```bash
pytest
```

Format code:

```bash
black src/
ruff check src/
```

## License

This project is licensed under the MIT License.
````

## Advanced CLI Usage

### Custom Templates

Create custom templates by extending the CLI. Note that `crawlee._cli` is a private module, so the helpers shown here are illustrative and may change between versions:

```python
# custom_template.py
from crawlee._cli import create_project_template


def create_custom_template(project_name: str, output_dir: str):
    """Create a project with a custom template."""
    template_data = {
        'project_name': project_name,
        'crawler_type': 'custom',
        'features': ['sessions', 'statistics', 'error_handling'],
    }

    return create_project_template(
        template_name='custom',
        project_name=project_name,
        output_dir=output_dir,
        template_data=template_data,
    )
```

### Programmatic Usage

CLI functionality can also be invoked from Python. As above, this relies on the private `crawlee._cli` module, so treat the interface as illustrative:

```python
import asyncio

from crawlee._cli import CLICommands


async def create_project_programmatically():
    """Create a project using the CLI programmatically."""
    cli = CLICommands()

    result = await cli.create_project(
        project_name='automated-scraper',
        template='playwright',
        output_dir='./projects',
        options={
            'author_name': 'Automation Script',
            'author_email': 'automation@example.com',
            'include_sessions': True,
            'include_examples': True,
        },
    )

    if result.success:
        print(f'Project created: {result.project_path}')
    else:
        print(f'Failed to create project: {result.error}')


asyncio.run(create_project_programmatically())
```

The Crawlee CLI provides a quick and efficient way to bootstrap new web scraping projects with industry best practices, allowing developers to focus on extraction logic rather than project setup and configuration.