# Element Modification

Methods for removing and modifying DOM elements within the parsed document structure. These operations modify the underlying document tree and affect subsequent queries.

## Capabilities

### Element Removal

Remove selected elements from their parent containers in the document tree.

```python { .api }
def drop(self) -> None:
    """
    Drop matched nodes from the parent element.

    Removes the selected element from its parent in the DOM tree.
    Uses appropriate removal method based on document type:
    - HTML: Uses lxml's drop_tree() method
    - XML: Uses parent.remove() method

    Raises:
    - CannotRemoveElementWithoutRoot: Element has no root document
    - CannotDropElementWithoutParent: Element has no parent to remove from
    """

def remove(self) -> None:
    """
    Remove matched nodes from the parent element.

    Deprecated: Use drop() method instead.

    Raises:
    - CannotRemoveElementWithoutRoot: Element has no root document
    - CannotRemoveElementWithoutParent: Element has no parent to remove from

    Warns:
    - DeprecationWarning: Method is deprecated
    """
```

**Usage Example:**

```python
from parsel import Selector

html = """
<article>
    <h1>Article Title</h1>
    <div class="ads">Advertisement content</div>
    <p>First paragraph of content.</p>
    <div class="ads">Another advertisement</div>
    <p>Second paragraph of content.</p>
    <div class="sidebar">Sidebar content</div>
</article>
"""

selector = Selector(text=html)

# Remove all advertisement elements
ads = selector.css('.ads')
ads.drop()

# Verify ads are removed
remaining_content = selector.css('article').get()
print("Ads removed:", "ads" not in remaining_content)

# Remove sidebar
sidebar = selector.css('.sidebar')
sidebar.drop()

# Check final structure - only h1 and p elements remain
final_structure = selector.css('article > *')
elements = [elem.root.tag for elem in final_structure]
# Returns: ['h1', 'p', 'p']
```

### Batch Element Removal

Remove multiple elements using SelectorList operations.

**Usage Example:**

```python
from parsel import Selector

html_with_cleanup = """
<div class="content">
    <h2>Important Heading</h2>
    <script>tracking_code();</script>
    <p>Valuable content paragraph.</p>
    <div class="popup">Popup modal</div>
    <p>Another valuable paragraph.</p>
    <noscript>No JavaScript message</noscript>
    <footer>Footer content</footer>
</div>
"""

selector = Selector(text=html_with_cleanup)

# Remove multiple unwanted element types at once
unwanted = selector.css('script, .popup, noscript')
unwanted.drop()

# Verify cleanup
cleaned_content = selector.css('.content').get()
print("Scripts removed:", "script" not in cleaned_content)
print("Popups removed:", "popup" not in cleaned_content)
print("Noscript removed:", "noscript" not in cleaned_content)

# Extract clean content
clean_paragraphs = selector.css('p::text').getall()
# Returns: ['Valuable content paragraph.', 'Another valuable paragraph.']
```

### Conditional Element Removal

Remove elements based on content or attribute conditions.

**Usage Example:**

```python
from parsel import Selector

html_with_conditions = """
<div class="comments">
    <div class="comment" data-score="5">Great article!</div>
    <div class="comment" data-score="1">Spam content here</div>
    <div class="comment" data-score="4">Very helpful, thanks.</div>
    <div class="comment" data-score="2">Not very useful</div>
    <div class="comment" data-score="5">Excellent explanation!</div>
</div>
"""

selector = Selector(text=html_with_conditions)

# Remove low-quality comments (score <= 2)
low_quality = selector.xpath('//div[@class="comment"][@data-score<=2]')
low_quality.drop()

# Verify only high-quality comments remain
remaining_scores = selector.css('.comment').xpath('./@data-score').getall()
# Returns: ['5', '4', '5'] - only scores > 2

# Remove comments containing specific text.
# contains() is case-sensitive, and the "Spam content here" comment was
# already dropped above, so this query selects nothing here; calling
# drop() on an empty SelectorList does nothing.
spam_comments = selector.xpath('//div[@class="comment"][contains(text(), "spam")]')
spam_comments.drop()
```

### Targeted Content Removal

Remove specific content while preserving structure.

**Usage Example:**

```python
from parsel import Selector

html_with_mixed_content = """
<article>
    <h1>Product Review</h1>
    <div class="meta">
        <span class="author">John Doe</span>
        <span class="date">2024-01-15</span>
        <span class="tracking" data-track="view">TRACK123</span>
    </div>
    <div class="content">
        <p>This product is amazing!</p>
        <div class="affiliate-link">
            <a href="/affiliate?id=123">Buy Now - Special Offer!</a>
        </div>
        <p>I highly recommend it to everyone.</p>
    </div>
</article>
"""

selector = Selector(text=html_with_mixed_content)

# Remove tracking and affiliate elements
tracking_elements = selector.css('[data-track], .affiliate-link')
tracking_elements.drop()

# Extract clean content
article_text = selector.css('.content p::text').getall()
# Returns: ['This product is amazing!', 'I highly recommend it to everyone.']

# Verify meta information is preserved (author, date kept)
meta_info = selector.css('.meta span:not(.tracking)::text').getall()
# Returns: ['John Doe', '2024-01-15']
```

## Exception Handling

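`drop()` and `remove()` raise dedicated exceptions when the selected node cannot be detached from the document, for example when it is a text node or the root element. Catch them wherever a query may match such nodes.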

### Exception Types

```python { .api }
class CannotRemoveElementWithoutRoot(Exception):
    """
    Raised when attempting to remove an element that has no root document.

    Common causes:
    - Trying to remove text nodes or pseudo-elements
    - Selecting attribute values (for example @href), which are plain strings
    """

class CannotRemoveElementWithoutParent(Exception):
    """
    Raised when attempting to remove an element that has no parent.

    Common causes:
    - Trying to remove the root element
    - Working with already-removed elements
    """

class CannotDropElementWithoutParent(CannotRemoveElementWithoutParent):
    """
    Specific exception for drop() operations.
    Inherits from CannotRemoveElementWithoutParent.
    """
```

**Exception Handling Example:**

```python
from parsel import Selector
from parsel.selector import (
    CannotRemoveElementWithoutRoot,
    CannotDropElementWithoutParent
)

html = """
<div>
    <p>Paragraph with <em>emphasis</em> text.</p>
    <ul>
        <li>Item 1</li>
        <li>Item 2</li>
    </ul>
</div>
"""

selector = Selector(text=html)

# Safe element removal with exception handling
def safe_remove_elements(selector, css_query):
    try:
        elements = selector.css(css_query)
        elements.drop()
        return True
    except CannotRemoveElementWithoutRoot:
        print(f"Cannot remove {css_query}: elements have no root")
        return False
    except CannotDropElementWithoutParent:
        print(f"Cannot remove {css_query}: elements have no parent")
        return False

# Remove list items safely
success = safe_remove_elements(selector, 'li')
print(f"List items removed: {success}")

# Try to remove text nodes (will fail gracefully)
text_nodes = selector.xpath('//text()')
try:
    text_nodes.drop()
except CannotRemoveElementWithoutRoot as e:
    print(f"Expected error: {e}")

# Try to remove the root element (will fail).
# Selector(text=...) wraps the fragment in <html><body>, so the outer <div>
# has a parent and could be dropped; the parentless root is <html>.
try:
    root_html = selector.xpath('/html')
    if root_html:
        root_html[0].drop()  # Try to remove root
except CannotDropElementWithoutParent as e:
    print(f"Cannot remove root: {e}")
```

## Document State After Modification

Element removal permanently modifies the document structure:

- **Subsequent queries** reflect the modified document state
- **Removed elements** are no longer returned by new queries against the document
- **Parent-child relationships** are updated automatically
- **Document serialization** excludes removed elements

**State Tracking Example:**

```python
from parsel import Selector

html = """
<nav>
    <ul>
        <li><a href="/home">Home</a></li>
        <li class="active"><a href="/products">Products</a></li>
        <li><a href="/contact">Contact</a></li>
    </ul>
</nav>
"""

selector = Selector(text=html)

# Count elements before removal
initial_count = len(selector.css('li'))
print(f"Initial list items: {initial_count}")  # 3

# Remove active item
active_item = selector.css('li.active')
active_item.drop()

# Count elements after removal
final_count = len(selector.css('li'))
print(f"Remaining list items: {final_count}")  # 2

# Verify active item is gone
active_check = selector.css('li.active')
print(f"Active items found: {len(active_check)}")  # 0

# Get final HTML structure
final_html = selector.css('nav').get()
print("Active class removed:", "active" not in final_html)
```

## Performance and Memory Considerations

- **Memory usage**: Removed elements can be reclaimed once no `Selector` or other Python reference to them remains
- **Query performance**: A smaller tree can make subsequent queries faster
- **Irreversible**: Element removal cannot be undone without re-parsing
- **Document size**: Serialized output is smaller after element removal