tessl/pypi-farm-haystack

LLM framework to build customizable, production-ready LLM applications with pipelines connecting models, vector DBs, and data processors.


Haystack

Name: tessl/pypi-farm-haystack
Author: tessl

Haystack is an end-to-end NLP framework for building applications powered by Large Language Models (LLMs), Transformer models, and vector search. Its modular architecture is based on Pipelines that connect Nodes (preprocessing, retrieval, and language model components) to perform NLP tasks such as retrieval-augmented generation (RAG), question answering, semantic document search, and answer generation.

Package Information

Package Name: farm-haystack
Language: Python
Installation: pip install farm-haystack
Python Support: 3.8+
License: Apache-2.0

Core Imports

import haystack
from haystack import Document, Answer, Label, MultiLabel, Span, EvaluationResult, TableCell, Pipeline, hash128
from haystack.nodes.base import BaseComponent

Common imports for building pipelines:

from haystack.document_stores import InMemoryDocumentStore, ElasticsearchDocumentStore, FAISSDocumentStore
from haystack.nodes import BM25Retriever, EmbeddingRetriever, FARMReader, TransformersReader
from haystack.pipelines import ExtractiveQAPipeline, DocumentSearchPipeline

Basic Usage

from haystack import Document, Pipeline
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import BM25Retriever, FARMReader
from haystack.pipelines import ExtractiveQAPipeline

# Create documents
docs = [
    Document(content="Paris is the capital of France."),
    Document(content="Berlin is the capital of Germany."),
    Document(content="Madrid is the capital of Spain.")
]

# Initialize document store and add documents
document_store = InMemoryDocumentStore()
document_store.write_documents(docs)

# Create retriever and reader components
retriever = BM25Retriever(document_store=document_store)
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2")

# Build pipeline
pipeline = ExtractiveQAPipeline(reader=reader, retriever=retriever)

# Ask a question
result = pipeline.run(query="What is the capital of France?")
print(result["answers"][0].answer)  # "Paris"

Architecture

Haystack follows a modular component-based architecture with three core concepts:

Components (Nodes): Modular processing units that perform specific tasks (retrieval, reading, generation, preprocessing)
Document Stores: Backend storage systems for documents and embeddings (Elasticsearch, FAISS, Pinecone, etc.)
Pipelines: Orchestration layer that connects components in directed graphs to solve complex NLP tasks

The framework supports both Retrieval-Augmented Generation (RAG) workflows and Agent-based interactive systems, and it targets production-grade NLP applications.

Capabilities

Core Schema & Data Structures

Fundamental data classes for documents, answers, labels, and evaluation results that form the foundation of all Haystack operations.

class Document:
    def __init__(self, content: Union[str, DataFrame], content_type: str = "text",
                 meta: Optional[Dict[str, Any]] = None, id: Optional[str] = None): ...

class Answer:
    def __init__(self, answer: str, type: str = "extractive", 
                 score: Optional[float] = None, context: Optional[str] = None): ...

class Label:
    def __init__(self, query: str, answer: Answer, is_correct_answer: bool = True, 
                 is_correct_document: bool = True): ...
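
As a rough illustration of how these records fit together, the sketch below models a document and an answer with plain dataclasses. `SketchDocument` and `SketchAnswer` are hypothetical stand-ins, not the Haystack classes, and deriving the id from a SHA-256 digest of the content is an assumption made for the example.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class SketchDocument:
    """Hypothetical stand-in for a document record (not haystack.Document)."""
    content: str
    content_type: str = "text"
    meta: Dict[str, Any] = field(default_factory=dict)
    id: Optional[str] = None

    def __post_init__(self) -> None:
        # Assumption: derive a stable id from the content when none is given.
        if self.id is None:
            digest = hashlib.sha256(self.content.encode("utf-8")).hexdigest()
            self.id = digest[:32]


@dataclass
class SketchAnswer:
    """Hypothetical stand-in for an answer record (not haystack.Answer)."""
    answer: str
    type: str = "extractive"
    score: Optional[float] = None
    context: Optional[str] = None


doc = SketchDocument(content="Paris is the capital of France.", meta={"lang": "en"})
same = SketchDocument(content="Paris is the capital of France.")
answer = SketchAnswer(answer="Paris", score=0.97, context=doc.content)
```

Two documents with identical content end up with the same id in this sketch, which is one simple way to make repeated writes idempotent.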


Document Stores

Backend storage systems supporting vector and keyword search across multiple databases including Elasticsearch, FAISS, Pinecone, Weaviate, and others.

class BaseDocumentStore:
    def write_documents(self, documents: List[Document]): ...
    def get_all_documents(self) -> List[Document]: ...
    def query(self, query: str, top_k: int = 10) -> List[Document]: ...

class ElasticsearchDocumentStore(BaseDocumentStore): ...
class FAISSDocumentStore(BaseDocumentStore): ...
class PineconeDocumentStore(BaseDocumentStore): ...
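
To show what the three methods above amount to, here is a minimal in-memory sketch using only the standard library. `ToyDocumentStore` is a hypothetical class, documents are plain dicts, and ranking by the number of shared terms is an assumption chosen for brevity rather than how any real backend scores results.

```python
import re
from typing import Dict, List


def _terms(text: str) -> set:
    # Lowercase word tokens with punctuation stripped.
    return set(re.findall(r"\w+", text.lower()))


class ToyDocumentStore:
    """Hypothetical in-memory store; documents are dicts with 'id' and 'content'."""

    def __init__(self) -> None:
        self._docs: Dict[str, dict] = {}

    def write_documents(self, documents: List[dict]) -> None:
        for doc in documents:
            self._docs[doc["id"]] = doc  # a later write with the same id overwrites

    def get_all_documents(self) -> List[dict]:
        return list(self._docs.values())

    def query(self, query: str, top_k: int = 10) -> List[dict]:
        query_terms = _terms(query)
        scored = []
        for doc in self._docs.values():
            overlap = len(query_terms & _terms(doc["content"]))
            if overlap:
                scored.append((overlap, doc))
        # Stable sort: ties keep insertion order.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [doc for _, doc in scored[:top_k]]


store = ToyDocumentStore()
store.write_documents([
    {"id": "1", "content": "Paris is the capital of France."},
    {"id": "2", "content": "Berlin is the capital of Germany."},
    {"id": "3", "content": "Madrid is the capital of Spain."},
])
hits = store.query("capital of France", top_k=2)
```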


Retriever Components

Dense and sparse retrieval components for finding relevant documents using embeddings, BM25, TF-IDF, and specialized retrieval methods.

class BM25Retriever(BaseRetriever):
    def __init__(self, document_store: BaseDocumentStore): ...
    def retrieve(self, query: str, top_k: int = 10) -> List[Document]: ...

class EmbeddingRetriever(BaseRetriever):
    def __init__(self, document_store: BaseDocumentStore, 
                 embedding_model: str = "sentence-transformers/all-MiniLM-L6-v2"): ...
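
For readers unfamiliar with sparse retrieval, the following is a minimal sketch of Okapi BM25 scoring in plain Python. It is not the code behind `BM25Retriever`; the function names, the `k1` and `b` defaults, and the "+1" smoothed IDF variant are assumptions picked for the example.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    return re.findall(r"\w+", text.lower())


def bm25_rank(query: str, docs: List[str],
              k1: float = 1.5, b: float = 0.75) -> List[Tuple[int, float]]:
    """Return (doc_index, score) pairs sorted by descending BM25 score."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for idx, toks in enumerate(tokenized):
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            df = doc_freq[term]
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append((idx, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


corpus = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
    "Madrid is the capital of Spain.",
]
ranking = bm25_rank("capital of France", corpus)
```

Terms that occur in every document ("capital", "of") contribute little, so the rare term "France" decides the ranking here.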


Reader Components

Reading comprehension components for extractive question answering using FARM, Transformers, and specialized table readers.

class FARMReader(BaseReader):
    def __init__(self, model_name_or_path: str = "deepset/roberta-base-squad2"): ...
    def predict(self, query: str, documents: List[Document]) -> List[Answer]: ...

class TransformersReader(BaseReader):
    def __init__(self, model_name_or_path: str = "deepset/roberta-base-squad2"): ...
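
Extractive readers generally score each token as a possible answer start and end, then pick the best-scoring span. The sketch below shows only that selection step under stated assumptions: `best_span` is a hypothetical helper, and the scores are made-up numbers rather than output from any model.

```python
from typing import List, Tuple


def best_span(start_scores: List[float], end_scores: List[float],
              max_answer_len: int = 8) -> Tuple[int, int, float]:
    """Return (start, end, score) maximizing start_scores[s] + end_scores[e],
    subject to s <= e and a span of at most max_answer_len tokens."""
    best = (0, 0, float("-inf"))
    for s, s_score in enumerate(start_scores):
        last = min(s + max_answer_len, len(end_scores))
        for e in range(s, last):
            score = s_score + end_scores[e]
            if score > best[2]:
                best = (s, e, score)
    return best


tokens = ["Paris", "is", "the", "capital", "of", "France"]
# Hypothetical per-token scores; a real reader would get these from a model.
start = [3.0, 0.1, 0.2, 0.5, 0.1, 1.0]
end = [2.5, 0.1, 0.1, 0.3, 0.1, 2.0]

s, e, score = best_span(start, end)
answer_text = " ".join(tokens[s:e + 1])
```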


Generator Components

Language model components for text generation using OpenAI, Transformers, and other LLM providers for generative QA and text synthesis.

class OpenAIAnswerGenerator(BaseGenerator):
    def __init__(self, api_key: str, model: str = "gpt-3.5-turbo"): ...
    def predict(self, query: str, documents: List[Document]) -> List[Answer]: ...

class OpenAIChatGenerator(BaseGenerator):
    def __init__(self, api_key: str, model: str = "gpt-3.5-turbo"): ...
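
Generative QA usually means packing the query and the retrieved documents into a single prompt before calling the language model. A minimal sketch of that assembly step follows; `build_prompt`, its template wording, and the character budget are assumptions for illustration, not the prompt that any Haystack generator sends.

```python
from typing import List


def build_prompt(query: str, documents: List[str], max_chars: int = 1000) -> str:
    """Number the documents, keep as many as fit in max_chars, append the query."""
    context_parts = []
    used = 0
    for i, doc in enumerate(documents, start=1):
        entry = f"[{i}] {doc}"
        if used + len(entry) > max_chars:
            break  # stop at the first document that would exceed the budget
        context_parts.append(entry)
        used += len(entry)
    context = "\n".join(context_parts)
    return (
        "Answer the question using only the context.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


prompt = build_prompt(
    "What is the capital of France?",
    ["Paris is the capital of France.", "Berlin is the capital of Germany."],
    max_chars=40,
)
```

With a 40-character budget only the first document fits, so the second is left out of the prompt.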


Pipeline System

Pre-built and custom pipeline templates for orchestrating component workflows including QA, search, generation, and indexing pipelines.

class Pipeline:
    def __init__(self): ...
    def add_node(self, component: BaseComponent, name: str, inputs: List[str]): ...
    def run(self, **kwargs): ...

class ExtractiveQAPipeline(Pipeline): ...
class GenerativeQAPipeline(Pipeline): ...
class DocumentSearchPipeline(Pipeline): ...
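
The `add_node(component, name, inputs)` shape above describes a directed graph. The following sketch shows one way such a graph could be executed; `ToyPipeline` is hypothetical, nodes are plain callables instead of Haystack components, and the root name `"Query"` plus the rule that inputs must be registered first are assumptions of this example.

```python
from typing import Any, Callable, Dict, List

Payload = Dict[str, Any]


class ToyPipeline:
    """Hypothetical DAG runner; each node maps a payload dict to new keys."""

    ROOT = "Query"

    def __init__(self) -> None:
        self._nodes: Dict[str, Callable[[Payload], Payload]] = {}
        self._inputs: Dict[str, List[str]] = {}

    def add_node(self, component: Callable[[Payload], Payload],
                 name: str, inputs: List[str]) -> None:
        if name in self._nodes or name == self.ROOT:
            raise ValueError(f"duplicate node name: {name}")
        for upstream in inputs:
            if upstream != self.ROOT and upstream not in self._nodes:
                raise ValueError(f"unknown input node: {upstream}")
        self._nodes[name] = component
        self._inputs[name] = list(inputs)

    def run(self, **kwargs: Any) -> Payload:
        # Inputs must exist before a node is added, so insertion order
        # is already a valid topological order.
        outputs: Dict[str, Payload] = {self.ROOT: dict(kwargs)}
        last = self.ROOT
        for name, component in self._nodes.items():
            merged: Payload = {}
            for upstream in self._inputs[name]:
                merged.update(outputs[upstream])
            outputs[name] = {**merged, **component(merged)}
            last = name
        return outputs[last]


CORPUS = ["Paris is the capital of France.", "Berlin is the capital of Germany."]


def retrieve(payload: Payload) -> Payload:
    words = set(payload["query"].lower().rstrip("?").split())
    hits = [d for d in CORPUS if d.lower().rstrip(".").split()[-1] in words]
    return {"documents": hits}


def read(payload: Payload) -> Payload:
    # Toy "reader": take the first word of each retrieved document.
    return {"answers": [d.split()[0] for d in payload["documents"]]}


pipe = ToyPipeline()
pipe.add_node(retrieve, name="Retriever", inputs=["Query"])
pipe.add_node(read, name="Reader", inputs=["Retriever"])
result = pipe.run(query="What is the capital of France?")
```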


Agent System

Interactive LLM agents with tool usage, memory management, and conversational capabilities for complex reasoning tasks.

class Agent:
    def __init__(self, prompt_node: PromptNode, memory: Optional[BaseMemory] = None): ...
    def run(self, query: str) -> AgentStep: ...

class ConversationalAgent(Agent): ...
class Tool:
    def __init__(self, name: str, pipeline_or_node: Union[BaseComponent, Pipeline]): ...
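
The tool-using loop behind an agent can be sketched without any model at all. In the example below, `scripted_model` is a hard-coded stand-in for an LLM call, and the `Tool:` / `Tool Input:` / `Final Answer:` text protocol is an assumption chosen for the sketch; `run_agent` and `calculator` are hypothetical names as well.

```python
import re
from typing import Callable, Dict


def calculator(expression: str) -> str:
    """Toy tool: evaluates '<int> <op> <int>' for +, -, *."""
    left, op, right = expression.split()
    ops = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b}
    return str(ops[op](int(left), int(right)))


def scripted_model(transcript: str) -> str:
    """Stand-in for an LLM: request a tool once, then answer with its output."""
    if "Observation:" not in transcript:
        return "Tool: Calculator\nTool Input: 6 * 7"
    observation = transcript.rsplit("Observation:", 1)[1].strip()
    return f"Final Answer: {observation}"


def run_agent(query: str, tools: Dict[str, Callable[[str], str]],
              model: Callable[[str], str], max_steps: int = 5) -> str:
    transcript = f"Question: {query}"
    for _ in range(max_steps):
        reply = model(transcript)
        final = re.search(r"Final Answer:\s*(.*)", reply)
        if final:
            return final.group(1).strip()
        call = re.search(r"Tool:\s*(\w+)\s*Tool Input:\s*(.*)", reply)
        if call is None:
            raise ValueError("model reply contained neither a tool call nor an answer")
        name, tool_input = call.group(1), call.group(2).strip()
        observation = tools[name](tool_input)
        # Feed the tool result back so the next model call can use it.
        transcript += f"\n{reply}\nObservation: {observation}"
    raise RuntimeError("agent did not finish within max_steps")


final_answer = run_agent("What is 6 times 7?", {"Calculator": calculator}, scripted_model)
```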


File Processing

Document converters and preprocessors for handling PDF, DOCX, HTML, images, and other file formats with text extraction and cleaning.

class BaseConverter:
    def convert(self, file_path: Path, **kwargs) -> List[Document]: ...

class PDFToTextConverter(BaseConverter): ...
class DocxToTextConverter(BaseConverter): ...
class PreProcessor(BaseComponent):
    def process(self, documents: List[Document]) -> List[Document]: ...
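
A common preprocessing step is splitting long text into overlapping chunks so that each piece fits a retriever or reader. The sketch below shows a word-based sliding window in plain Python; `split_by_words` is a hypothetical helper and its parameter names are chosen for this example, so it should not be read as the `PreProcessor` implementation.

```python
from typing import List


def split_by_words(text: str, split_length: int = 200,
                   split_overlap: int = 0) -> List[str]:
    """Split text into chunks of split_length words; consecutive chunks
    share split_overlap words."""
    if split_length <= 0:
        raise ValueError("split_length must be positive")
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must satisfy 0 <= overlap < split_length")

    words = text.split()
    step = split_length - split_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break  # the last window already reached the end of the text
    return chunks


chunks = split_by_words("a b c d e f g", split_length=3, split_overlap=1)
```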


Evaluation & Utilities

Evaluation metrics, model evaluation tools, and utility functions for assessing pipeline performance and data processing.

def eval_pipeline(pipeline: Pipeline, eval_labels: List[Label]) -> EvaluationResult: ...

class EvaluationResult:
    def __init__(self): ...
    def calculate_metrics(self) -> Dict[str, float]: ...
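
Two metrics that QA evaluations commonly report are exact match and token-level F1. The following is a minimal sketch of both; the function names and the simple normalization (lowercasing and stripping punctuation) are assumptions of this example and may differ from what `calculate_metrics` computes.

```python
import re
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    # Lowercase and keep word tokens only.
    return re.findall(r"\w+", text.lower())


def exact_match(prediction: str, gold: str) -> float:
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction: str, gold: str) -> float:
    pred_tokens, gold_tokens = normalize(prediction), normalize(gold)
    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    common = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


em = exact_match("Paris.", "paris")
f1 = token_f1("the city of Paris", "Paris")
```

In the second call one of four predicted tokens matches the single gold token, giving precision 0.25, recall 1.0, and F1 0.4.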
