Tessl Tile for pypi/wikipedia-api@0.8.0

# Content Extraction

Extract and access Wikipedia page content including summaries, full text, sections, and hierarchical page structure. Content is loaded lazily when properties are first accessed, with support for both WIKI and HTML extraction formats.
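The lazy-loading behaviour described above can be illustrated with a generic pattern. This is only a sketch of the idea, not the library's implementation: `LazyPage`, its `fetch` callable, and `fetch_count` are hypothetical names introduced for the illustration.

```python
from functools import cached_property


class LazyPage:
    """Hypothetical stand-in for a page object; not part of Wikipedia-API."""

    def __init__(self, title, fetch):
        self.title = title
        self._fetch = fetch    # callable that performs the expensive lookup
        self.fetch_count = 0   # counts how many times content was fetched

    @cached_property
    def summary(self):
        # Runs on first access only; the result is cached on the instance.
        self.fetch_count += 1
        return self._fetch(self.title)


page = LazyPage("Example", fetch=lambda title: f"Summary of {title}")
assert page.fetch_count == 0   # nothing is loaded at construction time
first = page.summary           # first access triggers the fetch
second = page.summary          # second access is served from the cache
```

The practical consequence is that creating a page object is cheap, while the first read of a content property may be slow because it does the actual work.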

## Capabilities

### Page Content Access

Access various forms of page content from summary to full text with sections.

```python { .api }
class WikipediaPage:
    @property
    def summary(self) -> str:
        """
        Get the page summary (lead section without subsections).

        Returns:
        Summary text as string. Empty string if page doesn't exist.
        """

    @property
    def text(self) -> str:
        """
        Get the complete page text including all sections.

        Returns:
        Full page text with section headers. Combines summary and all sections.
        """

    @property
    def sections(self) -> list[WikipediaPageSection]:
        """
        Get all top-level sections of the page.

        Returns:
        List of WikipediaPageSection objects in document order.
        """
```

#### Usage Examples

```python
import wikipediaapi

wiki = wikipediaapi.Wikipedia('MyApp/1.0', 'en')
page = wiki.page('Artificial_intelligence')

# Get page summary
print("Summary:")
print(page.summary[:200] + "...")

# Get full page text
full_text = page.text
print(f"Full text length: {len(full_text)} characters")

# Access all sections
print("\nTop-level sections:")
for i, section in enumerate(page.sections):
    print(f"{i+1}. {section.title} (level {section.level})")
```
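Slicing a summary at a fixed character count, as in the example above, can cut a word in half. One possible alternative is a small helper built on the standard library's `textwrap.shorten`, which truncates at a word boundary. The helper name `preview` and the sample text are assumptions made for this sketch; in real use the input would be `page.summary`.

```python
import textwrap


def preview(text: str, limit: int = 200) -> str:
    """Collapse whitespace and cut the text at a word boundary within `limit`."""
    return textwrap.shorten(text, width=limit, placeholder="...")


# Made-up sample text standing in for page.summary
sample = "Artificial intelligence is the capability of computational systems to perform tasks."
short = preview(sample, limit=40)
print(short)
```

Note that `textwrap.shorten` also collapses newlines and repeated spaces into single spaces, so it suits one-line previews rather than multi-paragraph output.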

### Page Existence and Metadata

Check if pages exist and access basic metadata and URLs.

```python { .api }
class WikipediaPage:
    def exists(self) -> bool:
        """
        Check if the page exists on Wikipedia.

        Returns:
        True if page exists, False otherwise.
        """

    @property
    def title(self) -> str:
        """
        Get the page title.

        Returns:
        Page title as string.
        """

    @property
    def language(self) -> str:
        """
        Get the page language.

        Returns:
        Language code (e.g., 'en', 'es', 'fr').
        """

    @property
    def variant(self) -> Optional[str]:
        """
        Get the language variant if specified.

        Returns:
        Language variant code or None.
        """

    @property
    def namespace(self) -> int:
        """
        Get the page namespace.

        Returns:
        Namespace integer (0 for main, 14 for categories, etc.).
        """

    @property
    def pageid(self) -> int:
        """
        Get the unique page ID.

        Returns:
        Integer page ID, or -1 if page doesn't exist.
        """

    @property
    def fullurl(self) -> str:
        """
        Get the full URL to the page.

        Returns:
        Complete URL to the Wikipedia page.
        """

    @property
    def canonicalurl(self) -> str:
        """
        Get the canonical URL to the page.

        Returns:
        Canonical URL to the Wikipedia page.
        """

    @property
    def editurl(self) -> str:
        """
        Get the edit URL for the page.

        Returns:
        URL for editing the Wikipedia page.
        """

    @property
    def displaytitle(self) -> str:
        """
        Get the display title (may differ from title for formatting).

        Returns:
        Display title with formatting.
        """
```

#### Usage Examples

```python
import wikipediaapi

wiki = wikipediaapi.Wikipedia('MyApp/1.0', 'en')

# Check if page exists
page = wiki.page('Nonexistent_Page_123456')
if page.exists():
    print(f"Page '{page.title}' exists")
    print(f"Language: {page.language}")
    print(f"Namespace: {page.namespace}")
    print(f"Page ID: {page.pageid}")
    print(f"URL: {page.fullurl}")
else:
    print("Page does not exist")
    print(f"Page ID: {page.pageid}")  # Will be -1 for non-existent pages

# Page metadata
real_page = wiki.page('Python_(programming_language)')
print(f"Title: {real_page.title}")
print(f"Display Title: {real_page.displaytitle}")
print(f"Exists: {real_page.exists()}")
print(f"Language: {real_page.language}")
print(f"Page ID: {real_page.pageid}")
print(f"Full URL: {real_page.fullurl}")
print(f"Canonical URL: {real_page.canonicalurl}")
print(f"Edit URL: {real_page.editurl}")
```
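The existence check above can be folded into a small helper that only reads further metadata when the page exists. This is a sketch under stated assumptions: `describe_page` is a hypothetical helper, and the `SimpleNamespace` objects are stubs with invented values that merely mimic the attributes documented above, so the example runs without network access.

```python
from types import SimpleNamespace


def describe_page(page) -> dict:
    """Collect basic metadata, reading ID and URL only for existing pages."""
    info = {"title": page.title, "exists": page.exists()}
    if info["exists"]:
        info.update(pageid=page.pageid, language=page.language, url=page.fullurl)
    return info


# Stub objects standing in for WikipediaPage (all values are invented)
found = SimpleNamespace(
    title="Sample page",
    exists=lambda: True,
    pageid=12345,
    language="en",
    fullurl="https://en.wikipedia.org/wiki/Sample_page",
)
missing = SimpleNamespace(title="Nonexistent_Page_123456", exists=lambda: False, pageid=-1)

print(describe_page(found))
print(describe_page(missing))
```

With the real library, the same function should accept the object returned by `wiki.page(...)`, since it relies only on `title`, `exists()`, `pageid`, `language`, and `fullurl`.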

### Section Navigation

Navigate and search through page sections with hierarchical structure support.

```python { .api }
class WikipediaPage:
    def section_by_title(self, title: str) -> Optional[WikipediaPageSection]:
        """
        Get the last section with the specified title.

        Parameters:
        - title: Section title to search for

        Returns:
        WikipediaPageSection object or None if not found.
        """

    def sections_by_title(self, title: str) -> list[WikipediaPageSection]:
        """
        Get all sections with the specified title.

        Parameters:
        - title: Section title to search for

        Returns:
        List of WikipediaPageSection objects. Empty list if none found.
        """
```

#### Usage Examples

```python
import wikipediaapi

wiki = wikipediaapi.Wikipedia('MyApp/1.0', 'en')
page = wiki.page('Machine_learning')

# Find specific section
history_section = page.section_by_title('History')
if history_section:
    print(f"Found section: {history_section.title}")
    print(f"Section text: {history_section.text[:100]}...")

# Find all sections with same title (if duplicated)
overview_sections = page.sections_by_title('Overview')
print(f"Found {len(overview_sections)} sections titled 'Overview'")

# Navigate section hierarchy
for section in page.sections:
    print(f"Section: {section.title}")
    for subsection in section.sections:
        print(f"  Subsection: {subsection.title}")
```
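The two-level loop above stops at subsections. For arbitrarily deep pages, a recursive generator is one way to visit every section. The sketch below assumes only the `title` and `sections` attributes documented here; `StubSection` and `walk` are hypothetical names, and the stub tree is invented so the example runs offline.

```python
from dataclasses import dataclass, field


@dataclass
class StubSection:
    """Hypothetical stand-in exposing `title` and `sections` like a real section."""
    title: str
    sections: list = field(default_factory=list)


def walk(sections, depth=0):
    """Yield (depth, title) for every section, depth-first in document order."""
    for section in sections:
        yield depth, section.title
        yield from walk(section.sections, depth + 1)


# Invented section tree for illustration
tree = [
    StubSection("History", [StubSection("Early work"), StubSection("Modern era")]),
    StubSection("Applications"),
]
outline = list(walk(tree))
for depth, title in outline:
    print("  " * depth + title)
```

With a real page, `walk(page.sections)` should produce the same kind of outline, because it depends only on the two attributes named above.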

### Section Content Access

Access individual section content and hierarchical structure.

```python { .api }
class WikipediaPageSection:
    @property
    def title(self) -> str:
        """
        Get the section title.

        Returns:
        Section title as string.
        """

    @property
    def text(self) -> str:
        """
        Get the section text content (without subsections).

        Returns:
        Section text as string.
        """

    @property
    def level(self) -> int:
        """
        Get the section heading level.

        Returns:
        Integer level (0=top-level, 1=subsection, etc.).
        """

    @property
    def sections(self) -> list[WikipediaPageSection]:
        """
        Get direct subsections of this section.

        Returns:
        List of WikipediaPageSection objects.
        """

    def section_by_title(self, title: str) -> Optional[WikipediaPageSection]:
        """
        Find subsection by title within this section.

        Parameters:
        - title: Subsection title to search for

        Returns:
        WikipediaPageSection object or None if not found.
        """

    def full_text(self, level: int = 1) -> str:
        """
        Get section text including all subsections with proper formatting.

        Parameters:
        - level: Starting heading level for formatting

        Returns:
        Complete section text with subsections and headers.
        """
```

#### Usage Examples

```python
import wikipediaapi

wiki = wikipediaapi.Wikipedia('MyApp/1.0', 'en')
page = wiki.page('Climate_change')

# Work with sections
for section in page.sections:
    print(f"\n=== {section.title} (Level {section.level}) ===")
    print(f"Text length: {len(section.text)} characters")

    # Show subsections
    if section.sections:
        print(f"Subsections ({len(section.sections)}):")
        for subsection in section.sections:
            print(f"  - {subsection.title}")

    # Get full text with subsections
    if section.title == "Causes":
        full_content = section.full_text()
        print(f"Full section with subsections: {len(full_content)} characters")

# Find nested subsection
effects_section = page.section_by_title('Effects')
if effects_section:
    temperature_subsection = effects_section.section_by_title('Temperature')
    if temperature_subsection:
        print(f"Found nested subsection: {temperature_subsection.title}")
        print(f"Content: {temperature_subsection.text[:150]}...")
```
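The nested lookup above chains `section_by_title` calls with a `None` check at each step. That pattern can be generalised to a path of titles, as in the following sketch. `find_by_path` and `StubNode` are hypothetical names; the stub only mimics the `section_by_title` method so the example runs without the library.

```python
def find_by_path(page, path):
    """Follow a list of section titles downward; return the last one or None."""
    node = page
    for title in path:
        node = node.section_by_title(title)
        if node is None:
            return None  # stop as soon as one step of the path is missing
    return node


class StubNode:
    """Hypothetical stand-in mimicking `section_by_title` on pages and sections."""

    def __init__(self, title, children=()):
        self.title = title
        self._children = {child.title: child for child in children}

    def section_by_title(self, title):
        return self._children.get(title)


# Invented structure for illustration
page = StubNode("Climate change", [StubNode("Effects", [StubNode("Temperature")])])
hit = find_by_path(page, ["Effects", "Temperature"])
miss = find_by_path(page, ["Effects", "Sea level"])
print(hit.title if hit else None, miss)
```

Returning `None` on the first missing title mirrors the documented return value of `section_by_title`, so callers can keep using a simple truthiness check.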

## Content Formats

Wikipedia-API supports two extraction formats that affect how content is parsed and presented.

### WIKI Format (Default)

```python
wiki = wikipediaapi.Wikipedia(
    'MyApp/1.0',
    'en',
    extract_format=wikipediaapi.ExtractFormat.WIKI
)
```

- Plain text content
- Section headers as plain text
- Allows proper section recognition and hierarchy parsing
- Suitable for text analysis and content extraction

### HTML Format

```python
wiki = wikipediaapi.Wikipedia(
    'MyApp/1.0',
    'en',
    extract_format=wikipediaapi.ExtractFormat.HTML
)
```

- HTML formatted content with tags
- Section headers as HTML `<h1>`, `<h2>`, etc.
- Preserves formatting, links, and markup
- Suitable for display or HTML processing

#### Format Comparison Example

```python
import wikipediaapi

# WIKI format
wiki_plain = wikipediaapi.Wikipedia(
    'MyApp/1.0',
    'en',
    extract_format=wikipediaapi.ExtractFormat.WIKI
)
page_plain = wiki_plain.page('Python_(programming_language)')

# HTML format
wiki_html = wikipediaapi.Wikipedia(
    'MyApp/1.0',
    'en',
    extract_format=wikipediaapi.ExtractFormat.HTML
)
page_html = wiki_html.page('Python_(programming_language)')

print("WIKI format summary:")
print(page_plain.summary[:100])

print("\nHTML format summary:")
print(page_html.summary[:100])
```
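When content has been fetched in HTML format but plain text is needed for one step, the tags can be stripped locally rather than fetching the page again. The sketch below uses only the standard library's `html.parser`. `html_to_text` and `_TextCollector` are hypothetical helpers, and the sample markup is invented rather than real API output.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Accumulate the text nodes of an HTML fragment, ignoring all tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(fragment: str) -> str:
    """Return the visible text of `fragment` with whitespace normalised."""
    collector = _TextCollector()
    collector.feed(fragment)
    collector.close()  # flush any text still buffered by the parser
    return " ".join("".join(collector.parts).split())


# Invented markup standing in for an HTML-format summary
sample_html = "<p><b>Python</b> is a <i>high-level</i> programming language.</p>"
plain = html_to_text(sample_html)
print(plain)
```

This discards section structure along with the tags, so it suits short fragments such as a summary. For section-aware processing, the WIKI format described above is the better fit.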
