
# Content Extraction


Extract and access Wikipedia page content including summaries, full text, sections, and hierarchical page structure. Content is loaded lazily when properties are first accessed, with support for both WIKI and HTML extraction formats.
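
The lazy loading described above can be pictured with a small sketch. This is not the library's implementation: `LazyPage`, `_fetch`, and `fetch_count` are hypothetical names, and the fetch is simulated rather than a real API call.

```python
from typing import Optional


class LazyPage:
    """Hypothetical stand-in showing fetch-on-first-access; not wikipediaapi code."""

    def __init__(self, title: str) -> None:
        self.title = title
        self._summary: Optional[str] = None  # nothing loaded yet
        self.fetch_count = 0

    def _fetch(self) -> None:
        # A real client would call the Wikipedia API here; this only simulates it.
        self.fetch_count += 1
        self._summary = f"Summary of {self.title}"

    @property
    def summary(self) -> str:
        if self._summary is None:  # first access triggers the load
            self._fetch()
        return self._summary


page = LazyPage("Artificial_intelligence")
print(page.fetch_count)  # creating the object fetched nothing
page.summary
page.summary
print(page.fetch_count)  # the second access reused the cached value
```

The practical consequence is that constructing a page object is cheap, and the cost of a request is paid only when a content property is first read.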

## Capabilities

### Page Content Access

Access various forms of page content from summary to full text with sections.

```python { .api }
class WikipediaPage:
    @property
    def summary(self) -> str:
        """
        Get the page summary (lead section without subsections).

        Returns:
            Summary text as string. Empty string if page doesn't exist.
        """

    @property
    def text(self) -> str:
        """
        Get the complete page text including all sections.

        Returns:
            Full page text with section headers. Combines summary and all sections.
        """

    @property
    def sections(self) -> list[WikipediaPageSection]:
        """
        Get all top-level sections of the page.

        Returns:
            List of WikipediaPageSection objects in document order.
        """
```

#### Usage Examples

```python
import wikipediaapi

wiki = wikipediaapi.Wikipedia('MyApp/1.0', 'en')
page = wiki.page('Artificial_intelligence')

# Get page summary
print("Summary:")
print(page.summary[:200] + "...")

# Get full page text
full_text = page.text
print(f"Full text length: {len(full_text)} characters")

# Access all sections
print("\nTop-level sections:")
for i, section in enumerate(page.sections):
    print(f"{i+1}. {section.title} (level {section.level})")
```

### Page Existence and Metadata

Check if pages exist and access basic metadata and URLs.

```python { .api }
class WikipediaPage:
    def exists(self) -> bool:
        """
        Check if the page exists on Wikipedia.

        Returns:
            True if page exists, False otherwise.
        """

    @property
    def title(self) -> str:
        """
        Get the page title.

        Returns:
            Page title as string.
        """

    @property
    def language(self) -> str:
        """
        Get the page language.

        Returns:
            Language code (e.g., 'en', 'es', 'fr').
        """

    @property
    def variant(self) -> Optional[str]:
        """
        Get the language variant if specified.

        Returns:
            Language variant code or None.
        """

    @property
    def namespace(self) -> int:
        """
        Get the page namespace.

        Returns:
            Namespace integer (0 for main, 14 for categories, etc.).
        """

    @property
    def pageid(self) -> int:
        """
        Get the unique page ID.

        Returns:
            Integer page ID, or -1 if page doesn't exist.
        """

    @property
    def fullurl(self) -> str:
        """
        Get the full URL to the page.

        Returns:
            Complete URL to the Wikipedia page.
        """

    @property
    def canonicalurl(self) -> str:
        """
        Get the canonical URL to the page.

        Returns:
            Canonical URL to the Wikipedia page.
        """

    @property
    def editurl(self) -> str:
        """
        Get the edit URL for the page.

        Returns:
            URL for editing the Wikipedia page.
        """

    @property
    def displaytitle(self) -> str:
        """
        Get the display title (may differ from title for formatting).

        Returns:
            Display title with formatting.
        """
```

#### Usage Examples

```python
# Reuses the `wiki` client created in the first example

# Check if page exists
page = wiki.page('Nonexistent_Page_123456')
if page.exists():
    print(f"Page '{page.title}' exists")
    print(f"Language: {page.language}")
    print(f"Namespace: {page.namespace}")
    print(f"Page ID: {page.pageid}")
    print(f"URL: {page.fullurl}")
else:
    print("Page does not exist")
    print(f"Page ID: {page.pageid}")  # Will be -1 for non-existent pages

# Page metadata
real_page = wiki.page('Python_(programming_language)')
print(f"Title: {real_page.title}")
print(f"Display Title: {real_page.displaytitle}")
print(f"Exists: {real_page.exists()}")
print(f"Language: {real_page.language}")
print(f"Page ID: {real_page.pageid}")
print(f"Full URL: {real_page.fullurl}")
print(f"Canonical URL: {real_page.canonicalurl}")
print(f"Edit URL: {real_page.editurl}")
```
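
When metadata needs to be logged or serialized, it can help to gather the documented fields into a plain dict and skip the URL and ID fields for missing pages. This is a sketch under assumptions: `page_metadata` and `NAMESPACE_LABELS` are hypothetical names, the namespace labels cover only the two values mentioned in the docstring above, and the `SimpleNamespace` object is an offline stand-in for a real page.

```python
from types import SimpleNamespace

# Only the two namespace values named in the docstring above; others fall through.
NAMESPACE_LABELS = {0: "main", 14: "category"}


def page_metadata(page) -> dict:
    """Collect documented metadata fields into a plain dict."""
    info = {"title": page.title, "exists": page.exists()}
    if info["exists"]:
        info["pageid"] = page.pageid
        info["namespace"] = NAMESPACE_LABELS.get(page.namespace, f"other ({page.namespace})")
        info["url"] = page.fullurl
    return info


# Hypothetical offline stand-in; a real WikipediaPage would come from wiki.page(...).
stub = SimpleNamespace(
    title="Example",
    exists=lambda: True,
    pageid=123,
    namespace=14,
    fullurl="https://en.wikipedia.org/wiki/Example",
)
print(page_metadata(stub))
```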

### Section Navigation

Navigate and search through page sections with hierarchical structure support.

```python { .api }
class WikipediaPage:
    def section_by_title(self, title: str) -> Optional[WikipediaPageSection]:
        """
        Get the last section with the specified title.

        Parameters:
        - title: Section title to search for

        Returns:
            WikipediaPageSection object or None if not found.
        """

    def sections_by_title(self, title: str) -> list[WikipediaPageSection]:
        """
        Get all sections with the specified title.

        Parameters:
        - title: Section title to search for

        Returns:
            List of WikipediaPageSection objects. Empty list if none found.
        """
```

#### Usage Examples

```python
# Reuses the `wiki` client created in the first example
page = wiki.page('Machine_learning')

# Find specific section
history_section = page.section_by_title('History')
if history_section:
    print(f"Found section: {history_section.title}")
    print(f"Section text: {history_section.text[:100]}...")

# Find all sections with same title (if duplicated)
overview_sections = page.sections_by_title('Overview')
print(f"Found {len(overview_sections)} sections titled 'Overview'")

# Navigate section hierarchy
for section in page.sections:
    print(f"Section: {section.title}")
    for subsection in section.sections:
        print(f"  Subsection: {subsection.title}")
```
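
The two-level loop above stops at direct subsections. For arbitrarily deep pages, a recursive generator is one way to visit every section. This is a sketch that relies only on the documented `title` and `sections` members; `walk_sections` and `_StubSection` are hypothetical names, and the stub tree stands in for real page data.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class _StubSection:
    """Hypothetical stand-in exposing the documented `title` and `sections` members."""
    title: str
    sections: List["_StubSection"] = field(default_factory=list)


def walk_sections(sections, depth: int = 0) -> Iterator[Tuple[int, str]]:
    """Yield (depth, title) for every section, depth-first in document order."""
    for section in sections:
        yield depth, section.title
        yield from walk_sections(section.sections, depth + 1)


tree = [
    _StubSection("History", [_StubSection("Early work"), _StubSection("Recent work")]),
    _StubSection("Applications"),
]
for depth, title in walk_sections(tree):
    print("  " * depth + title)
```

With a real page, `walk_sections(page.sections)` would be the expected entry point.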

### Section Content Access

Access individual section content and hierarchical structure.

```python { .api }
class WikipediaPageSection:
    @property
    def title(self) -> str:
        """
        Get the section title.

        Returns:
            Section title as string.
        """

    @property
    def text(self) -> str:
        """
        Get the section text content (without subsections).

        Returns:
            Section text as string.
        """

    @property
    def level(self) -> int:
        """
        Get the section heading level.

        Returns:
            Integer level (0=top-level, 1=subsection, etc.).
        """

    @property
    def sections(self) -> list[WikipediaPageSection]:
        """
        Get direct subsections of this section.

        Returns:
            List of WikipediaPageSection objects.
        """

    def section_by_title(self, title: str) -> Optional[WikipediaPageSection]:
        """
        Find subsection by title within this section.

        Parameters:
        - title: Subsection title to search for

        Returns:
            WikipediaPageSection object or None if not found.
        """

    def full_text(self, level: int = 1) -> str:
        """
        Get section text including all subsections with proper formatting.

        Parameters:
        - level: Starting heading level for formatting

        Returns:
            Complete section text with subsections and headers.
        """
```

#### Usage Examples

```python
# Reuses the `wiki` client created in the first example
page = wiki.page('Climate_change')

# Work with sections
for section in page.sections:
    print(f"\n=== {section.title} (Level {section.level}) ===")
    print(f"Text length: {len(section.text)} characters")

    # Show subsections
    if section.sections:
        print(f"Subsections ({len(section.sections)}):")
        for subsection in section.sections:
            print(f"  - {subsection.title}")

    # Get full text with subsections
    if section.title == "Causes":
        full_content = section.full_text()
        print(f"Full section with subsections: {len(full_content)} characters")

# Find nested subsection
effects_section = page.section_by_title('Effects')
if effects_section:
    temperature_subsection = effects_section.section_by_title('Temperature')
    if temperature_subsection:
        print(f"Found nested subsection: {temperature_subsection.title}")
        print(f"Content: {temperature_subsection.text[:150]}...")
```
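
The nested lookup above repeats the same "find, then check for `None`" step at each level. A small helper can follow a whole path of titles in one call. This is a sketch, assuming only the documented `section_by_title` method; `find_by_path` and `_StubSection` are hypothetical names, and the stub resolves titles among direct children only.

```python
from typing import List, Optional


class _StubSection:
    """Hypothetical stand-in with the documented `title` and `section_by_title` members."""

    def __init__(self, title: str, children: Optional[List["_StubSection"]] = None) -> None:
        self.title = title
        self.sections = children or []

    def section_by_title(self, title: str) -> Optional["_StubSection"]:
        # Stub behaviour: return the last direct child with a matching title.
        matches = [s for s in self.sections if s.title == title]
        return matches[-1] if matches else None


def find_by_path(root, titles: List[str]):
    """Follow a list of titles downward; return None as soon as one is missing."""
    current = root
    for title in titles:
        current = current.section_by_title(title)
        if current is None:
            return None
    return current


root = _StubSection("Climate_change", [_StubSection("Effects", [_StubSection("Temperature")])])
found = find_by_path(root, ["Effects", "Temperature"])
print(found.title if found else "not found")
```

Since both the page and section classes document a `section_by_title` method, the same helper should accept a page as `root`.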

## Content Formats

Wikipedia-API supports two extraction formats that affect how content is parsed and presented.

### WIKI Format (Default)

```python
wiki = wikipediaapi.Wikipedia(
    'MyApp/1.0',
    'en',
    extract_format=wikipediaapi.ExtractFormat.WIKI
)
```

- Plain text content
- Section headers as plain text
- Allows proper section recognition and hierarchy parsing
- Suitable for text analysis and content extraction

### HTML Format

```python
wiki = wikipediaapi.Wikipedia(
    'MyApp/1.0',
    'en',
    extract_format=wikipediaapi.ExtractFormat.HTML
)
```

- HTML formatted content with tags
- Section headers as HTML `<h1>`, `<h2>`, etc.
- Preserves formatting, links, and markup
- Suitable for display or HTML processing

#### Format Comparison Example

```python
import wikipediaapi

# WIKI format
wiki_plain = wikipediaapi.Wikipedia('MyApp/1.0', 'en',
                                    extract_format=wikipediaapi.ExtractFormat.WIKI)
page_plain = wiki_plain.page('Python_(programming_language)')

# HTML format
wiki_html = wikipediaapi.Wikipedia('MyApp/1.0', 'en',
                                   extract_format=wikipediaapi.ExtractFormat.HTML)
page_html = wiki_html.page('Python_(programming_language)')

print("WIKI format summary:")
print(page_plain.summary[:100])

print("\nHTML format summary:")
print(page_html.summary[:100])
```
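
If content was fetched in HTML format but plain text is needed later, the tags can be stripped with the standard library instead of fetching the page again. This is a minimal sketch using `html.parser`; `html_to_text` and `_TextExtractor` are hypothetical helper names, and the sample string is made up rather than real API output.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and drops the tags."""

    def __init__(self) -> None:
        super().__init__()
        self.parts = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def html_to_text(html: str) -> str:
    """Return the text content of an HTML fragment with all tags removed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)


sample = "<p><b>Python</b> is a programming language.</p>"  # made-up sample
print(html_to_text(sample))
```

This drops structure along with the tags, so section boundaries are better taken from `page.sections` than recovered from the stripped text.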