Transformers

State-of-the-art Machine Learning library for JAX, PyTorch and TensorFlow. Transformers provides a unified API for working with over 350 pre-trained model architectures across natural language processing, computer vision, audio, and multimodal tasks. The library democratizes access to cutting-edge AI models with simple, efficient interfaces for both inference and training.

Package Information

  • Package Name: transformers
  • Language: Python
  • Installation: pip install transformers

Core Imports

import transformers

Common patterns for specific functionality:

# High-level Pipeline API (recommended for most use cases)
from transformers import pipeline

# Auto classes for automatic model/tokenizer selection
from transformers import AutoModel, AutoTokenizer, AutoConfig

# Specific model classes
from transformers import BertModel, BertTokenizer
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Training utilities
from transformers import Trainer, TrainingArguments

# Feature extraction for audio/vision
from transformers import AutoFeatureExtractor, AutoImageProcessor

Basic Usage

Quick Start with Pipelines

from transformers import pipeline

# Text classification
classifier = pipeline("text-classification")
results = classifier("I love using transformers!")

# Question answering
qa_pipeline = pipeline("question-answering")
answer = qa_pipeline(
    question="What is transformers?",
    context="Transformers is a library for natural language processing."
)

# Text generation
generator = pipeline("text-generation", model="gpt2")
output = generator("The future of AI is", max_length=50, num_return_sequences=1)

# Image classification
image_classifier = pipeline("image-classification")
results = image_classifier("path/to/image.jpg")

Working with Models Directly

from transformers import AutoModel, AutoTokenizer

# Load model and tokenizer
model_name = "bert-base-uncased"
model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Encode text
text = "Hello, world!"
inputs = tokenizer(text, return_tensors="pt")

# Forward pass
outputs = model(**inputs)
last_hidden_states = outputs.last_hidden_state

Training a Model

from transformers import Trainer, TrainingArguments
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load model for fine-tuning
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", 
    num_labels=2
)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Configure training
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs",
)

# Initialize trainer (train_dataset and eval_dataset are assumed to be
# pre-tokenized datasets, e.g. built with the datasets library)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

# Start training
trainer.train()

Architecture

The transformers library is built around several key architectural components:

  • Auto Classes: Automatically select the correct model, tokenizer, or configuration based on a model name or path
  • Model Classes: Implement specific architectures (BERT, GPT, T5, etc.) with consistent APIs across frameworks
  • Tokenizers: Convert text to tokens and back, handling different tokenization strategies and vocabularies
  • Pipelines: High-level abstraction providing simple interfaces for common ML tasks
  • Trainer: Comprehensive training framework with built-in optimization, logging, and evaluation
  • Hub Integration: Seamless downloading, caching, and sharing of models via Hugging Face Hub

This design enables transformers to serve as the foundational layer for the AI/ML ecosystem, providing consistent interfaces across 350+ model architectures while maintaining compatibility with PyTorch, TensorFlow, and JAX.

Capabilities

High-Level Pipeline API

Simple, task-oriented interface for common ML operations. Pipelines abstract away model selection, preprocessing, and postprocessing, providing immediate access to state-of-the-art capabilities.

def pipeline(
    task: str = None,
    model: str = None,
    tokenizer: str = None,
    **kwargs
) -> Pipeline
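
A minimal sketch of combining the task, model, and device arguments; the model name is just one commonly used example checkpoint, not a requirement:

from transformers import pipeline

# Explicit model and device selection; device=-1 runs on CPU, 0 on the first GPU
sentiment = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    device=-1,
)
print(sentiment(["Great library!", "This is confusing."]))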

Pipelines

Model Management

Automatic model selection and loading with support for 350+ architectures. Auto classes intelligently choose the correct implementation based on model names or configurations.

class AutoModel:
    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path: str, **kwargs) -> PreTrainedModel

class AutoTokenizer:
    @classmethod 
    def from_pretrained(cls, pretrained_model_name_or_path: str, **kwargs) -> PreTrainedTokenizer

class AutoConfig:
    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path: str, **kwargs) -> PretrainedConfig
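
A short sketch of using AutoConfig to adjust a configuration before building the model, assuming bert-base-uncased as the example checkpoint:

from transformers import AutoConfig, AutoModelForSequenceClassification

# Load the configuration, override a field, then build the model from it
config = AutoConfig.from_pretrained("bert-base-uncased", num_labels=3)
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", config=config
)
print(config.model_type, model.config.num_labels)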

Models

Training and Fine-tuning

Comprehensive training framework with built-in optimization, distributed training support, and extensive customization options.

class Trainer:
    def __init__(
        self,
        model: PreTrainedModel,
        args: TrainingArguments,
        train_dataset = None,
        eval_dataset = None,
        **kwargs
    )
    
    def train(self) -> None
    def evaluate(self) -> Dict[str, float]
    def predict(self, test_dataset) -> PredictionOutput

class TrainingArguments:
    def __init__(
        self,
        output_dir: str,
        num_train_epochs: float = 3.0,
        per_device_train_batch_size: int = 8,
        learning_rate: float = 5e-5,
        **kwargs
    )
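
Beyond train(), evaluate and predict follow the pattern below; this sketch assumes the trainer from the training example above and a hypothetical pre-tokenized test_dataset:

# Evaluate on the eval_dataset passed to the Trainer
metrics = trainer.evaluate()
print(metrics)  # e.g. {"eval_loss": ..., "eval_runtime": ...}

# Run inference over a separate tokenized dataset
predictions = trainer.predict(test_dataset)
print(predictions.predictions.shape, predictions.metrics)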

Training

Text Generation

Advanced text generation capabilities with multiple decoding strategies, fine-grained control over output, and support for conversational AI.

class GenerationMixin:
    def generate(
        self,
        inputs = None,
        max_length: int = None,
        num_beams: int = 1,
        temperature: float = 1.0,
        do_sample: bool = False,
        **kwargs
    ) -> torch.Tensor

class GenerationConfig:
    def __init__(
        self,
        max_length: int = 20,
        num_beams: int = 1,
        temperature: float = 1.0,
        **kwargs
    )
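
A minimal generation sketch using the gpt2 checkpoint; sampling parameters here are illustrative defaults, not recommendations:

from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

inputs = tokenizer("The future of AI is", return_tensors="pt")

# Sampling-based decoding; leave do_sample=False for greedy or beam search
output_ids = model.generate(
    **inputs,
    max_length=50,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))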

Generation

Tokenization

Comprehensive tokenization with support for 100+ different tokenizers, handling subword tokenization, special tokens, and efficient batch processing.

class PreTrainedTokenizer:
    def encode(
        self, 
        text: str,
        add_special_tokens: bool = True,
        **kwargs
    ) -> List[int]
    
    def decode(
        self,
        token_ids: List[int],
        skip_special_tokens: bool = False
    ) -> str
    
    def __call__(
        self,
        text,
        return_tensors: str = None,
        padding: bool = False,
        truncation: bool = False,
        **kwargs
    ) -> BatchEncoding
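
A short sketch of the three entry points above, using the bert-base-uncased tokenizer as an example:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# encode/decode round trip
ids = tokenizer.encode("Hello, world!", add_special_tokens=True)
print(ids)
print(tokenizer.decode(ids, skip_special_tokens=True))

# Batched __call__ with padding, truncation, and tensor output
batch = tokenizer(
    ["Hello, world!", "A slightly longer second sentence."],
    padding=True,
    truncation=True,
    return_tensors="pt",
)
print(batch["input_ids"].shape, batch["attention_mask"].shape)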

Tokenization

Feature Extraction

Audio and image preprocessing capabilities for multimodal models, providing consistent interfaces for different modalities.

class AutoFeatureExtractor:
    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path: str, **kwargs)

class AutoImageProcessor:
    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path: str, **kwargs)
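
A sketch of image preprocessing with an image processor, assuming Pillow is installed and using google/vit-base-patch16-224 as an example checkpoint; the image path is a placeholder:

from PIL import Image
from transformers import AutoImageProcessor

processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")

image = Image.open("path/to/image.jpg")
# Resizes, rescales, and normalizes the image into model-ready tensors
inputs = processor(images=image, return_tensors="pt")
print(inputs["pixel_values"].shape)  # e.g. (1, 3, 224, 224)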

Feature Extraction

Model Optimization

Advanced optimization techniques including quantization, mixed precision training, and hardware acceleration for efficient inference and training.

class BitsAndBytesConfig:
    def __init__(
        self,
        load_in_8bit: bool = False,
        load_in_4bit: bool = False,
        bnb_4bit_compute_dtype = None,
        **kwargs
    )
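
A hedged sketch of 4-bit loading via BitsAndBytesConfig; this requires the bitsandbytes and accelerate packages plus a CUDA-capable GPU, and facebook/opt-350m is used only as an example model:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

# device_map="auto" places layers across available GPUs/CPU
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",
    quantization_config=quant_config,
    device_map="auto",
)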

Optimization

Types

Core type definitions used throughout the library:

class PreTrainedModel:
    """Base class for all model implementations."""
    def forward(self, **kwargs)
    def save_pretrained(self, save_directory: str, **kwargs)
    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path: str, **kwargs)

class PretrainedConfig:
    """Base configuration class for all models."""
    def save_pretrained(self, save_directory: str, **kwargs)
    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path: str, **kwargs)

class BatchEncoding:
    """Container for tokenizer outputs with tensor conversion capabilities."""
    input_ids: List[List[int]]
    attention_mask: List[List[int]]
    def to(self, device: str) -> 'BatchEncoding'

class Pipeline:
    """Base class for all pipeline implementations."""
    def __call__(self, inputs, **kwargs)
    def save_pretrained(self, save_directory: str, **kwargs)

class ModelOutput:
    """Base class for all model outputs."""
    last_hidden_state: torch.Tensor
    hidden_states: Tuple[torch.Tensor]
    attentions: Tuple[torch.Tensor]
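
A brief sketch of how these types interact in practice, assuming a CUDA device is available (falls back to CPU otherwise):

import torch
from transformers import AutoModel, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").to(device)

# BatchEncoding supports .to(device) to move all tensors at once
batch = tokenizer(
    ["Hello, world!", "Transformers types demo"],
    padding=True,
    return_tensors="pt",
).to(device)

with torch.no_grad():
    outputs = model(**batch)  # returns a ModelOutput subclass
print(outputs.last_hidden_state.shape)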

docs/

  • feature-extraction.md
  • generation.md
  • index.md
  • models.md
  • optimization.md
  • pipelines.md
  • tokenization.md
  • training.md

tile.json