tessl/pypi-parsel

Parsel is a library to extract data from HTML and XML using XPath and CSS selectors

Workspace: tessl
Visibility: Public
Created: 3 months ago
Last updated: 3 months ago
Describes: pkg:pypi/parsel@1.10.x

To install, run

npx @tessl/cli install tessl/pypi-parsel@1.10.0

Parsel

Parsel is a Python library for extracting data from HTML, XML, and JSON documents. Its Selector and SelectorList classes expose a single chainable API for XPath expressions, CSS selectors, JMESPath queries (for JSON), and regular expressions.

Package Information

Package Name: parsel
Language: Python
Installation: pip install parsel
Dependencies: lxml, cssselect, jmespath, w3lib, packaging

Core Imports

from parsel import Selector, SelectorList

Additional imports for the CSS translation function and the XPath extension module:

from parsel import css2xpath
from parsel import xpathfuncs

Basic Usage

from parsel import Selector

# Parse HTML document
html = """
<html>
    <body>
        <h1>Hello, Parsel!</h1>
        <ul>
            <li><a href="http://example.com">Link 1</a></li>
            <li><a href="http://scrapy.org">Link 2</a></li>
        </ul>
        <script type="application/json">{"a": ["b", "c"]}</script>
    </body>
</html>
"""

selector = Selector(text=html)

# Extract text using CSS selectors
title = selector.css('h1::text').get()  # 'Hello, Parsel!'

# Extract links by selecting list items with CSS, then attributes with XPath
for li in selector.css('ul > li'):
    href = li.xpath('.//@href').get()
    print(href)

# Extract and parse JSON content
json_data = selector.css('script::text').jmespath("a").getall()  # ['b', 'c']

# Use regular expressions
words = selector.xpath('//h1/text()').re(r'\w+')  # ['Hello', 'Parsel']

Architecture

Parsel's architecture centers around two main classes:

Selector: Wraps input data (HTML/XML/JSON/text) and provides selection methods
SelectorList: List of Selector objects with chainable methods for batch operations

The library supports multiple parsing strategies:

HTML parsing: Using lxml.html.HTMLParser with CSS pseudo-element support
XML parsing: Using SafeXMLParser (extends lxml.etree.XMLParser) with namespace management
JSON parsing: Native Python JSON parsing with JMESPath query support
Text parsing: Plain text content with regex extraction

Capabilities

Document Parsing and Selection

Core functionality for parsing HTML, XML, JSON, and text documents through a unified selector interface that supports multiple query languages.

class Selector:
    def __init__(
        self,
        text: Optional[str] = None,
        type: Optional[str] = None,
        body: bytes = b"",
        encoding: str = "utf-8",
        namespaces: Optional[Mapping[str, str]] = None,
        root: Optional[Any] = None,
        base_url: Optional[str] = None,
        _expr: Optional[str] = None,
        huge_tree: bool = True,
    ) -> None: ...

    def xpath(
        self,
        query: str,
        namespaces: Optional[Mapping[str, str]] = None,
        **kwargs: Any,
    ) -> SelectorList["Selector"]: ...

    def css(self, query: str) -> SelectorList["Selector"]: ...

    def jmespath(self, query: str, **kwargs: Any) -> SelectorList["Selector"]: ...

Document Parsing and Selection

Data Extraction and Content Retrieval

Methods for extracting text content, attributes, and serialized data from selected elements with support for entity replacement and formatting.

def get(self) -> Any: ...
def getall(self) -> List[str]: ...
def re(
    self, regex: Union[str, Pattern[str]], replace_entities: bool = True
) -> List[str]: ...
def re_first(
    self,
    regex: Union[str, Pattern[str]],
    default: Optional[str] = None,
    replace_entities: bool = True,
) -> Optional[str]: ...

@property
def attrib(self) -> Dict[str, str]: ...

Data Extraction

SelectorList Operations

Batch operations on multiple selectors with chainable methods for filtering, extracting, and transforming collections of selected elements.

class SelectorList(List["Selector"]):
    def xpath(
        self,
        xpath: str,
        namespaces: Optional[Mapping[str, str]] = None,
        **kwargs: Any,
    ) -> "SelectorList[Selector]": ...

    def css(self, query: str) -> "SelectorList[Selector]": ...

    def jmespath(self, query: str, **kwargs: Any) -> "SelectorList[Selector]": ...

    def get(self, default: Optional[str] = None) -> Optional[str]: ...
    def getall(self) -> List[str]: ...

SelectorList Operations

XML Namespace Management

Functionality for working with XML namespaces including registration, removal, and namespace-aware queries.

def register_namespace(self, prefix: str, uri: str) -> None: ...
def remove_namespaces(self) -> None: ...

XML Namespace Management

Element Modification

Methods for removing and modifying DOM elements within the parsed document structure.

def drop(self) -> None: ...
def remove(self) -> None: ...  # deprecated

Element Modification

CSS Selector Translation

Utilities for converting CSS selectors to XPath expressions with support for pseudo-elements and custom CSS features.

def css2xpath(query: str) -> str: ...

class GenericTranslator:
    def css_to_xpath(self, css: str, prefix: str = "descendant-or-self::") -> str: ...

class HTMLTranslator:
    def css_to_xpath(self, css: str, prefix: str = "descendant-or-self::") -> str: ...

CSS Translation

XPath Extension Functions

Custom XPath functions for enhanced element selection including CSS class checking and other utility functions.

def set_xpathfunc(fname: str, func: Optional[Callable]) -> None: ...
def has_class(context: Any, *classes: str) -> bool: ...
def setup() -> None: ...

XPath Extensions

Types

# Type aliases
_SelectorType = TypeVar("_SelectorType", bound="Selector")
_ParserType = Union[etree.XMLParser, etree.HTMLParser]
_TostringMethodType = Literal["html", "xml"]

# Exception classes
class CannotRemoveElementWithoutRoot(Exception): ...
class CannotRemoveElementWithoutParent(Exception): ...
class CannotDropElementWithoutParent(CannotRemoveElementWithoutParent): ...

# CSS Translator classes
class XPathExpr:
    textnode: bool
    attribute: Optional[str]
    
    @classmethod
    def from_xpath(
        cls,
        xpath: "XPathExpr", 
        textnode: bool = False, 
        attribute: Optional[str] = None
    ) -> "XPathExpr": ...
