# Content Extraction

Extract and access Wikipedia page content, including summaries, full text, sections, and hierarchical page structure. Content is loaded lazily when a property is first accessed, and both WIKI and HTML extraction formats are supported.
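The lazy-loading behaviour described above can be illustrated with a small standalone sketch. `LazyPage`, its `fetch` callable, and `fetch_count` are hypothetical names invented for this illustration; this is not the library's actual implementation, only one common way to defer work until a property is first read.

```python
from functools import cached_property


class LazyPage:
    """Hypothetical stand-in for a page object; not part of Wikipedia-API."""

    def __init__(self, title, fetch):
        self.title = title
        self._fetch = fetch      # callable that performs the expensive request
        self.fetch_count = 0     # how many times the loader actually ran

    @cached_property
    def summary(self):
        # Runs only on first access; the result is then cached on the instance.
        self.fetch_count += 1
        return self._fetch(self.title)


page = LazyPage("Example", lambda title: f"Summary of {title}")
count_before = page.fetch_count   # nothing has been loaded at construction time
first = page.summary              # first access triggers the load
second = page.summary             # second access reuses the cached value
```

Under this assumption, constructing a page object is cheap, and the cost of a request is paid only by code that actually reads the content.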

## Capabilities

### Page Content Access

Access various forms of page content from summary to full text with sections.

```python { .api }
class WikipediaPage:
    @property
    def summary(self) -> str:
        """
        Get the page summary (lead section without subsections).

        Returns:
            Summary text as string. Empty string if page doesn't exist.
        """

    @property
    def text(self) -> str:
        """
        Get the complete page text including all sections.

        Returns:
            Full page text with section headers. Combines summary and all sections.
        """

    @property
    def sections(self) -> list[WikipediaPageSection]:
        """
        Get all top-level sections of the page.

        Returns:
            List of WikipediaPageSection objects in document order.
        """
```

#### Usage Examples

```python
import wikipediaapi

wiki = wikipediaapi.Wikipedia('MyApp/1.0', 'en')
page = wiki.page('Artificial_intelligence')

# Get page summary
print("Summary:")
print(page.summary[:200] + "...")

# Get full page text
full_text = page.text
print(f"Full text length: {len(full_text)} characters")

# Access all sections
print("\nTop-level sections:")
for i, section in enumerate(page.sections):
    print(f"{i+1}. {section.title} (level {section.level})")
```
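Slicing a summary with `[:200]` can cut a word in half. As an optional refinement, the standard library's `textwrap.shorten` truncates at a word boundary instead. The `preview` helper and the sample sentence below are assumptions made for this sketch, not part of Wikipedia-API.

```python
import textwrap


def preview(text: str, width: int = 200) -> str:
    """Collapse whitespace and cut `text` at a word boundary to fit `width`."""
    return textwrap.shorten(text, width=width, placeholder="...")


# Placeholder text standing in for `page.summary`.
sample = "The quick brown fox jumps over the lazy dog near the quiet river bank"
short = preview(sample, width=40)
unchanged = preview("short text")  # text that already fits is returned as-is
```

With a real page, the call would be `preview(page.summary)`.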

### Page Existence and Metadata

Check if pages exist and access basic metadata and URLs.

```python { .api }
class WikipediaPage:
    def exists(self) -> bool:
        """
        Check if the page exists on Wikipedia.

        Returns:
            True if page exists, False otherwise.
        """

    @property
    def title(self) -> str:
        """
        Get the page title.

        Returns:
            Page title as string.
        """

    @property
    def language(self) -> str:
        """
        Get the page language.

        Returns:
            Language code (e.g., 'en', 'es', 'fr').
        """

    @property
    def variant(self) -> Optional[str]:
        """
        Get the language variant if specified.

        Returns:
            Language variant code or None.
        """

    @property
    def namespace(self) -> int:
        """
        Get the page namespace.

        Returns:
            Namespace integer (0 for main, 14 for categories, etc.).
        """

    @property
    def pageid(self) -> int:
        """
        Get the unique page ID.

        Returns:
            Integer page ID, or -1 if page doesn't exist.
        """

    @property
    def fullurl(self) -> str:
        """
        Get the full URL to the page.

        Returns:
            Complete URL to the Wikipedia page.
        """

    @property
    def canonicalurl(self) -> str:
        """
        Get the canonical URL to the page.

        Returns:
            Canonical URL to the Wikipedia page.
        """

    @property
    def editurl(self) -> str:
        """
        Get the edit URL for the page.

        Returns:
            URL for editing the Wikipedia page.
        """

    @property
    def displaytitle(self) -> str:
        """
        Get the display title (may differ from title for formatting).

        Returns:
            Display title with formatting.
        """
```

#### Usage Examples

```python
# Check if page exists
page = wiki.page('Nonexistent_Page_123456')
if page.exists():
    print(f"Page '{page.title}' exists")
    print(f"Language: {page.language}")
    print(f"Namespace: {page.namespace}")
    print(f"Page ID: {page.pageid}")
    print(f"URL: {page.fullurl}")
else:
    print("Page does not exist")
    print(f"Page ID: {page.pageid}")  # Will be -1 for non-existent pages

# Page metadata
real_page = wiki.page('Python_(programming_language)')
print(f"Title: {real_page.title}")
print(f"Display Title: {real_page.displaytitle}")
print(f"Exists: {real_page.exists()}")
print(f"Language: {real_page.language}")
print(f"Page ID: {real_page.pageid}")
print(f"Full URL: {real_page.fullurl}")
print(f"Canonical URL: {real_page.canonicalurl}")
print(f"Edit URL: {real_page.editurl}")
```
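The `namespace` property returns a bare integer, so a lookup table can make printed output easier to read. The sketch below maps a subset of the standard MediaWiki namespace numbers to labels. `NAMESPACE_LABELS` and `namespace_label` are hypothetical names introduced here, not part of Wikipedia-API.

```python
# Subset of the standard MediaWiki namespace numbers.
NAMESPACE_LABELS = {
    0: "Main",
    1: "Talk",
    2: "User",
    4: "Project",
    6: "File",
    10: "Template",
    12: "Help",
    14: "Category",
}


def namespace_label(namespace: int) -> str:
    """Return a readable label for a namespace number, or a fallback string."""
    return NAMESPACE_LABELS.get(namespace, f"Namespace {namespace}")


main_label = namespace_label(0)
category_label = namespace_label(14)
```

With a real page, this would be called as `namespace_label(page.namespace)`.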

### Section Navigation

Navigate and search through page sections with hierarchical structure support.

```python { .api }
class WikipediaPage:
    def section_by_title(self, title: str) -> Optional[WikipediaPageSection]:
        """
        Get the last section with the specified title.

        Parameters:
        - title: Section title to search for

        Returns:
            WikipediaPageSection object or None if not found.
        """

    def sections_by_title(self, title: str) -> list[WikipediaPageSection]:
        """
        Get all sections with the specified title.

        Parameters:
        - title: Section title to search for

        Returns:
            List of WikipediaPageSection objects. Empty list if none found.
        """
```

#### Usage Examples

```python
page = wiki.page('Machine_learning')

# Find specific section
history_section = page.section_by_title('History')
if history_section:
    print(f"Found section: {history_section.title}")
    print(f"Section text: {history_section.text[:100]}...")

# Find all sections with same title (if duplicated)
overview_sections = page.sections_by_title('Overview')
print(f"Found {len(overview_sections)} sections titled 'Overview'")

# Navigate section hierarchy
for section in page.sections:
    print(f"Section: {section.title}")
    for subsection in section.sections:
        print(f"  Subsection: {subsection.title}")
```
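The two-level loop above stops at direct subsections. For arbitrarily deep hierarchies, a recursive generator is one option. The sketch below relies only on the `title` and `sections` attributes documented here. `walk_sections` and `make` are hypothetical helpers, and the small tree is a stand-in built with `SimpleNamespace` so the example runs without network access.

```python
from types import SimpleNamespace


def walk_sections(sections, depth=0):
    """Yield (depth, section) pairs in document order, depth-first."""
    for section in sections:
        yield depth, section
        yield from walk_sections(section.sections, depth + 1)


def make(title, *children):
    # Hypothetical stand-in exposing only `title` and `sections`.
    return SimpleNamespace(title=title, sections=list(children))


tree = [
    make("History", make("Early work"), make("Overview")),
    make("Overview"),
]

outline = [(depth, s.title) for depth, s in walk_sections(tree)]
overviews = [s for _, s in walk_sections(tree) if s.title == "Overview"]
```

With a real page, `walk_sections(page.sections)` should traverse the same kind of structure, assuming sections expose `title` and `sections` as documented.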

### Section Content Access

Access individual section content and hierarchical structure.

```python { .api }
class WikipediaPageSection:
    @property
    def title(self) -> str:
        """
        Get the section title.

        Returns:
            Section title as string.
        """

    @property
    def text(self) -> str:
        """
        Get the section text content (without subsections).

        Returns:
            Section text as string.
        """

    @property
    def level(self) -> int:
        """
        Get the section heading level.

        Returns:
            Integer heading level; nested subsections have higher values than their parent.
        """

    @property
    def sections(self) -> list[WikipediaPageSection]:
        """
        Get direct subsections of this section.

        Returns:
            List of WikipediaPageSection objects.
        """

    def section_by_title(self, title: str) -> Optional[WikipediaPageSection]:
        """
        Find subsection by title within this section.

        Parameters:
        - title: Subsection title to search for

        Returns:
            WikipediaPageSection object or None if not found.
        """

    def full_text(self, level: int = 1) -> str:
        """
        Get section text including all subsections with proper formatting.

        Parameters:
        - level: Starting heading level for formatting

        Returns:
            Complete section text with subsections and headers.
        """
```

#### Usage Examples

```python
page = wiki.page('Climate_change')

# Work with sections
for section in page.sections:
    print(f"\n=== {section.title} (Level {section.level}) ===")
    print(f"Text length: {len(section.text)} characters")

    # Show subsections
    if section.sections:
        print(f"Subsections ({len(section.sections)}):")
        for subsection in section.sections:
            print(f"  - {subsection.title}")

    # Get full text with subsections
    if section.title == "Causes":
        full_content = section.full_text()
        print(f"Full section with subsections: {len(full_content)} characters")

# Find nested subsection
effects_section = page.section_by_title('Effects')
if effects_section:
    temperature_subsection = effects_section.section_by_title('Temperature')
    if temperature_subsection:
        print(f"Found nested subsection: {temperature_subsection.title}")
        print(f"Content: {temperature_subsection.text[:150]}...")
```
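To make the difference between `text` (a section's own content) and `full_text()` (content plus subsections) concrete, here is a rough sketch of recursive rendering with a starting heading level. The `render` helper and the `=`-delimited heading style are assumptions chosen for illustration; the library's exact output format is not reproduced here. The sample tree is a stand-in built with `SimpleNamespace`.

```python
from types import SimpleNamespace


def render(section, level=1):
    """Return the heading, the section's own text, then each subsection, recursively."""
    marker = "=" * level
    parts = [f"{marker} {section.title} {marker}"]  # heading style is an assumption
    if section.text:
        parts.append(section.text)
    for child in section.sections:
        # Each nesting step increases the heading level by one.
        parts.append(render(child, level + 1))
    return "\n".join(parts)


# Hypothetical section tree exposing `title`, `text`, and `sections`.
gases = SimpleNamespace(title="Greenhouse gases", text="Details.", sections=[])
causes = SimpleNamespace(title="Causes", text="Intro.", sections=[gases])

rendered = render(causes, level=2)
```

The `level` argument plays the same role as in the documented `full_text(level=1)` signature: it sets the heading level of the outermost section, and subsections count up from there.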

## Content Formats

Wikipedia-API supports two extraction formats that affect how content is parsed and presented.

### WIKI Format (Default)

```python
wiki = wikipediaapi.Wikipedia(
    'MyApp/1.0',
    'en',
    extract_format=wikipediaapi.ExtractFormat.WIKI
)
```

- Plain text content
- Section headers as plain text
- Allows proper section recognition and hierarchy parsing
- Suitable for text analysis and content extraction
350
### HTML Format
351
352
```python
353
wiki = wikipediaapi.Wikipedia(
354
'MyApp/1.0',
355
'en',
356
extract_format=wikipediaapi.ExtractFormat.HTML
357
)
358
```
359
360
- HTML formatted content with tags
361
- Section headers as HTML `<h1>`, `<h2>`, etc.
362
- Preserves formatting, links, and markup
363
- Suitable for display or HTML processing

#### Format Comparison Example

```python
# WIKI format
wiki_plain = wikipediaapi.Wikipedia(
    'MyApp/1.0',
    'en',
    extract_format=wikipediaapi.ExtractFormat.WIKI
)
page_plain = wiki_plain.page('Python_(programming_language)')

# HTML format
wiki_html = wikipediaapi.Wikipedia(
    'MyApp/1.0',
    'en',
    extract_format=wikipediaapi.ExtractFormat.HTML
)
page_html = wiki_html.page('Python_(programming_language)')

print("WIKI format summary:")
print(page_plain.summary[:100])

print("\nHTML format summary:")
print(page_html.summary[:100])
```
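When content was fetched in HTML format but plain text is needed afterwards, the tags can be stripped with the standard library alone. The sketch below uses `html.parser`. `html_to_text` and `_TextCollector` are hypothetical helpers, and the sample markup is invented for illustration rather than taken from a real page.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html: str) -> str:
    """Drop tags, decode character references, and normalise whitespace."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return " ".join("".join(parser.chunks).split())


sample_html = "<p><b>Python</b> is a <i>programming</i> language.</p>"
plain = html_to_text(sample_html)
```

This discards all structure, including headings, so it suits quick comparisons better than section-aware processing, for which the WIKI format is the more natural choice.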