HTML Character Set Detection

Package charset provides common text encodings for HTML documents.

The mapping from encoding labels to encodings is defined at https://encoding.spec.whatwg.org/.

Import

import "golang.org/x/net/html/charset"

Functions

// DetermineEncoding determines the encoding of an HTML document
func DetermineEncoding(content []byte, contentType string) (e encoding.Encoding, name string, certain bool)

// Lookup returns the encoding with the specified label, and its canonical name
func Lookup(label string) (e encoding.Encoding, name string)

// NewReader returns an io.Reader that converts the content of r to UTF-8
func NewReader(r io.Reader, contentType string) (io.Reader, error)

// NewReaderLabel returns a reader that converts from the specified charset to UTF-8
func NewReaderLabel(label string, input io.Reader) (io.Reader, error)

Function Details

DetermineEncoding

DetermineEncoding determines the encoding of an HTML document by examining up to the first 1024 bytes of content and the declared Content-Type.

See http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#determining-the-character-encoding

Lookup

Lookup returns the encoding with the specified label, and its canonical name. It returns nil and the empty string if label is not one of the standard encodings for HTML. Matching is case-insensitive and ignores leading and trailing whitespace. Encoders will use HTML escape sequences for runes that are not supported by the character set.

NewReader

NewReader returns an io.Reader that converts the content of r to UTF-8. It calls DetermineEncoding to find out what r's encoding is.

NewReaderLabel

NewReaderLabel returns a reader that converts from the specified charset to UTF-8. It uses Lookup to find the encoding that corresponds to label, and returns an error if Lookup returns nil. It is suitable for use as encoding/xml.Decoder's CharsetReader function.

Usage Examples

Auto-Detecting Character Encoding

import (
    "fmt"
    "io"

    "golang.org/x/net/html/charset"
)

func parseHTMLWithEncoding(r io.Reader) error {
    // No Content-Type is known here, so pass an empty string and let
    // NewReader detect the encoding from the content itself.
    utf8Reader, err := charset.NewReader(r, "")
    if err != nil {
        return err
    }

    // Now utf8Reader provides UTF-8 encoded content
    content, err := io.ReadAll(utf8Reader)
    if err != nil {
        return err
    }

    fmt.Printf("UTF-8 content: %s\n", content)
    return nil
}

Converting from Specific Encoding

func convertFromEncoding(r io.Reader, label string) (string, error) {
    // Convert from specific encoding to UTF-8
    utf8Reader, err := charset.NewReaderLabel(label, r)
    if err != nil {
        return "", err
    }

    content, err := io.ReadAll(utf8Reader)
    if err != nil {
        return "", err
    }

    return string(content), nil
}

// Usage:
// content, err := convertFromEncoding(file, "iso-8859-1")
// content, err := convertFromEncoding(file, "windows-1252")
// content, err := convertFromEncoding(file, "shift_jis")

Looking Up Encodings

func checkEncoding(label string) {
    enc, name := charset.Lookup(label)
    if enc == nil {
        fmt.Printf("Unknown encoding: %s\n", label)
        return
    }

    fmt.Printf("Label '%s' maps to encoding '%s'\n", label, name)
}

Using with XML Decoder

import (
    "encoding/xml"
    "io"

    "golang.org/x/net/html/charset"
)

// MyStruct is a placeholder for the type the document is decoded into.
type MyStruct struct {
    XMLName xml.Name
}

func parseXMLWithCharset(r io.Reader) error {
    decoder := xml.NewDecoder(r)

    // Set CharsetReader to handle non-UTF-8 encodings
    decoder.CharsetReader = charset.NewReaderLabel

    var data MyStruct
    return decoder.Decode(&data)
}