
tessl/pypi-farm-haystack

LLM framework to build customizable, production-ready LLM applications with pipelines connecting models, vector DBs, and data processors.


Haystack

Haystack is an end-to-end NLP framework for building applications on top of Large Language Models (LLMs), Transformer models, and vector search. Its modular architecture is built around Pipelines that connect Nodes (preprocessing, retrieval, and language model components) to perform tasks such as retrieval-augmented generation (RAG), question answering, semantic document search, and answer generation.

Package Information

  • Package Name: farm-haystack
  • Language: Python
  • Installation: pip install farm-haystack
  • Python Support: 3.8+
  • License: Apache-2.0

Core Imports

import haystack
from haystack import Document, Answer, Label, MultiLabel, Span, EvaluationResult, TableCell, Pipeline, hash128
from haystack.nodes.base import BaseComponent

Common imports for building pipelines:

from haystack.document_stores import InMemoryDocumentStore, ElasticsearchDocumentStore, FAISSDocumentStore
from haystack.nodes import BM25Retriever, EmbeddingRetriever, FARMReader, TransformersReader
from haystack.pipelines import ExtractiveQAPipeline, DocumentSearchPipeline

Basic Usage

from haystack import Document, Pipeline
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import BM25Retriever, FARMReader
from haystack.pipelines import ExtractiveQAPipeline

# Create documents
docs = [
    Document(content="Paris is the capital of France."),
    Document(content="Berlin is the capital of Germany."),
    Document(content="Madrid is the capital of Spain.")
]

# Initialize document store and add documents
# use_bm25=True builds the BM25 index that BM25Retriever needs on an in-memory store
document_store = InMemoryDocumentStore(use_bm25=True)
document_store.write_documents(docs)

# Create retriever and reader components
retriever = BM25Retriever(document_store=document_store)
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2")

# Build pipeline
pipeline = ExtractiveQAPipeline(reader=reader, retriever=retriever)

# Ask a question
result = pipeline.run(query="What is the capital of France?")
print(result["answers"][0].answer)  # "Paris"

Architecture

Haystack follows a modular component-based architecture with three core concepts:

  • Components (Nodes): Modular processing units that perform specific tasks (retrieval, reading, generation, preprocessing)
  • Document Stores: Backend storage systems for documents and embeddings (Elasticsearch, FAISS, Pinecone, etc.)
  • Pipelines: Orchestration layer that connects components in directed graphs to solve complex NLP tasks

The framework supports both Retrieval-Augmented Generation (RAG) workflows and Agent-based interactive systems for production-grade applications.

Capabilities

Core Schema & Data Structures

Fundamental data classes for documents, answers, labels, and evaluation results that form the foundation of all Haystack operations.

class Document:
    def __init__(self, content: Union[str, DataFrame], content_type: str = "text",
                 meta: Optional[Dict[str, Any]] = None, id: Optional[str] = None): ...

class Answer:
    def __init__(self, answer: str, type: str = "extractive", 
                 score: Optional[float] = None, context: Optional[str] = None): ...

class Label:
    def __init__(self, query: str, answer: Answer, is_correct_answer: bool = True, 
                 is_correct_document: bool = True): ...
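
To make the shape of a document record concrete, here is a minimal standard-library sketch. `ToyDocument` is a hypothetical stand-in, not `haystack.Document`. It assumes that an ID is derived from the content when none is supplied, and it uses SHA-256 only because it ships with Python; the hash Haystack itself uses may differ.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class ToyDocument:
    """Hypothetical stand-in for a document record; not haystack.Document."""

    content: str
    content_type: str = "text"
    meta: Dict[str, Any] = field(default_factory=dict)  # fresh dict per instance
    id: Optional[str] = None

    def __post_init__(self) -> None:
        # Assumption: derive the ID from the content when none is supplied.
        if self.id is None:
            self.id = hashlib.sha256(self.content.encode("utf-8")).hexdigest()


a = ToyDocument(content="Paris is the capital of France.")
b = ToyDocument(content="Paris is the capital of France.", meta={"source": "wiki"})
c = ToyDocument(content="Berlin is the capital of Germany.")
```

Under this assumption, `a` and `b` share an ID because their content is identical, which is what lets a store detect duplicates on write.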

Full reference: core-schema.md

Document Stores

Backend storage systems supporting vector and keyword search across multiple databases including Elasticsearch, FAISS, Pinecone, Weaviate, and others.

class BaseDocumentStore:
    def write_documents(self, documents: List[Document]): ...
    def get_all_documents(self) -> List[Document]: ...
    def query(self, query: str, top_k: int = 10) -> List[Document]: ...

class ElasticsearchDocumentStore(BaseDocumentStore): ...
class FAISSDocumentStore(BaseDocumentStore): ...
class PineconeDocumentStore(BaseDocumentStore): ...
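
The vector-search side of a document store can be sketched in a few lines. `ToyVectorStore` and its method names are hypothetical, and the three-dimensional embeddings are made up for illustration; real stores delegate this ranking to the backend (FAISS, Elasticsearch, and so on).

```python
import math
from typing import Dict, List, Tuple


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """Hypothetical in-memory store; not a Haystack class."""

    def __init__(self) -> None:
        self._vectors: Dict[str, List[float]] = {}

    def write(self, doc_id: str, embedding: List[float]) -> None:
        self._vectors[doc_id] = embedding  # writing the same ID overwrites

    def query_by_embedding(
        self, query_emb: List[float], top_k: int = 10
    ) -> List[Tuple[str, float]]:
        # Score every stored vector, then keep the top_k most similar.
        scored = [(doc_id, cosine(query_emb, emb)) for doc_id, emb in self._vectors.items()]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]


store = ToyVectorStore()
store.write("paris", [0.9, 0.1, 0.0])
store.write("berlin", [0.1, 0.9, 0.0])
store.write("madrid", [0.0, 0.1, 0.9])
hits = store.query_by_embedding([1.0, 0.2, 0.0], top_k=2)
```

This brute-force scan is linear in the number of documents; dedicated vector backends exist largely to avoid that cost.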

Full reference: document-stores.md

Retriever Components

Dense and sparse retrieval components for finding relevant documents using embeddings, BM25, TF-IDF, and specialized retrieval methods.

class BM25Retriever(BaseRetriever):
    def __init__(self, document_store: BaseDocumentStore): ...
    def retrieve(self, query: str, top_k: int = 10) -> List[Document]: ...

class EmbeddingRetriever(BaseRetriever):
    def __init__(self, document_store: BaseDocumentStore, 
                 embedding_model: str = "sentence-transformers/all-MiniLM-L6-v2"): ...
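
For intuition about sparse retrieval, the following is a small self-contained BM25 scorer. It is a textbook-style sketch, not the code behind `BM25Retriever`. The function names, the tokenizer, and the `k1` and `b` defaults are assumptions, and the corpus is assumed to be non-empty.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase and split on non-word characters."""
    return re.findall(r"\w+", text.lower())


def bm25_scores(query: str, corpus: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Return one BM25 score per corpus entry (corpus must be non-empty)."""
    docs = [tokenize(d) for d in corpus]
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


corpus = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
    "Madrid is the capital of Spain.",
]
scores = bm25_scores("capital of France", corpus)
best = corpus[scores.index(max(scores))]
```

The rare term "France" dominates the score, while "capital" and "of" contribute little because they appear in every document.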

Full reference: retrievers.md

Reader Components

Reading comprehension components for extractive question answering using FARM, Transformers, and specialized table readers.

class FARMReader(BaseReader):
    def __init__(self, model_name_or_path: str = "deepset/roberta-base-squad2"): ...
    def predict(self, query: str, documents: List[Document]) -> List[Answer]: ...

class TransformersReader(BaseReader):
    def __init__(self, model_name_or_path: str = "deepset/roberta-base-squad2"): ...

Full reference: readers.md

Generator Components

Language model components for text generation using OpenAI, Transformers, and other LLM providers for generative QA and text synthesis.

class OpenAIAnswerGenerator(BaseGenerator):
    def __init__(self, api_key: str, model: str = "gpt-3.5-turbo"): ...
    def predict(self, query: str, documents: List[Document]) -> List[Answer]: ...

class OpenAIChatGenerator(BaseGenerator):
    def __init__(self, api_key: str, model: str = "gpt-3.5-turbo"): ...

Full reference: generators.md

Pipeline System

Pre-built and custom pipeline templates for orchestrating component workflows including QA, search, generation, and indexing pipelines.

class Pipeline:
    def __init__(self): ...
    def add_node(self, component: BaseComponent, name: str, inputs: List[str]): ...
    def run(self, **kwargs): ...

class ExtractiveQAPipeline(Pipeline): ...
class GenerativeQAPipeline(Pipeline): ...
class DocumentSearchPipeline(Pipeline): ...
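
The `add_node` / `run` pattern above can be illustrated with a toy orchestrator. `ToyPipeline` is hypothetical and far simpler than `haystack.Pipeline`. It assumes nodes are plain functions that take and return dictionaries, that the root input is named "Query", and that nodes are added in dependency order.

```python
from typing import Any, Callable, Dict, List

Node = Callable[[Dict[str, Any]], Dict[str, Any]]


class ToyPipeline:
    """Hypothetical mini-orchestrator; not haystack.Pipeline."""

    def __init__(self) -> None:
        self._nodes: Dict[str, Node] = {}
        self._inputs: Dict[str, List[str]] = {}

    def add_node(self, component: Node, name: str, inputs: List[str]) -> None:
        # Every input must be the root or an earlier node, so insertion
        # order is already a valid execution order for the graph.
        for upstream in inputs:
            if upstream != "Query" and upstream not in self._nodes:
                raise ValueError(f"unknown input node: {upstream}")
        self._nodes[name] = component
        self._inputs[name] = inputs

    def run(self, **kwargs: Any) -> Dict[str, Any]:
        outputs: Dict[str, Dict[str, Any]] = {"Query": dict(kwargs)}
        result = outputs["Query"]
        for name, component in self._nodes.items():
            merged: Dict[str, Any] = {}
            for upstream in self._inputs[name]:
                merged.update(outputs[upstream])
            # A node sees its upstream outputs and adds its own keys.
            result = {**merged, **component(merged)}
            outputs[name] = result
        return result


DOCS = ["Paris is the capital of France.", "Berlin is the capital of Germany."]


def retriever(data: Dict[str, Any]) -> Dict[str, Any]:
    keyword = data["query"].rstrip("?").split()[-1].lower()
    return {"documents": [d for d in DOCS if keyword in d.lower()]}


def reader(data: Dict[str, Any]) -> Dict[str, Any]:
    top = data["documents"][0] if data["documents"] else ""
    return {"answers": top.split()[:1]}  # toy "answer": first word of top doc


pipeline = ToyPipeline()
pipeline.add_node(component=retriever, name="Retriever", inputs=["Query"])
pipeline.add_node(component=reader, name="Reader", inputs=["Retriever"])
result = pipeline.run(query="What is the capital of France?")
```

Naming inputs explicitly is what turns a list of components into a directed graph: a node can fan in from several upstream nodes by listing more than one name.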

Full reference: pipelines.md

Agent System

Interactive LLM agents with tool usage, memory management, and conversational capabilities for complex reasoning tasks.

class Agent:
    def __init__(self, prompt_node: PromptNode, memory: Optional[BaseMemory] = None): ...
    def run(self, query: str) -> AgentStep: ...

class ConversationalAgent(Agent): ...
class Tool:
    def __init__(self, name: str, pipeline_or_node: Union[BaseComponent, Pipeline]): ...
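
The interplay of agent, tools, and memory can be sketched without any LLM. Everything below is hypothetical: `ToyAgent` and `ToyTool` are not Haystack classes, and `keyword_router` is a deliberately naive stand-in for the model call that would normally choose a tool.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class ToyTool:
    name: str
    run: Callable[[str], str]


@dataclass
class ToyAgent:
    # choose_tool stands in for the LLM call that picks a tool in a real agent.
    choose_tool: Callable[[str], str]
    tools: Dict[str, ToyTool] = field(default_factory=dict)
    memory: List[Tuple[str, str, str]] = field(default_factory=list)

    def add_tool(self, tool: ToyTool) -> None:
        self.tools[tool.name] = tool

    def run(self, query: str) -> str:
        tool_name = self.choose_tool(query)
        answer = self.tools[tool_name].run(query)
        self.memory.append((query, tool_name, answer))  # keep the history
        return answer


def add_numbers(query: str) -> str:
    return str(sum(int(n) for n in re.findall(r"\d+", query)))


def lookup_capital(query: str) -> str:
    capitals = {"france": "Paris", "germany": "Berlin", "spain": "Madrid"}
    for country, city in capitals.items():
        if country in query.lower():
            return city
    return "unknown"


def keyword_router(query: str) -> str:
    return "calculator" if re.search(r"\d", query) else "search"


agent = ToyAgent(choose_tool=keyword_router)
agent.add_tool(ToyTool(name="calculator", run=add_numbers))
agent.add_tool(ToyTool(name="search", run=lookup_capital))
first = agent.run("What is the capital of Spain?")
second = agent.run("What is 2 plus 3?")
```

A real agent typically loops, feeding each tool result back to the model until it decides it can answer; this sketch performs a single step per query.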

Full reference: agents.md

File Processing

Document converters and preprocessors for handling PDF, DOCX, HTML, images, and other file formats with text extraction and cleaning.

class BaseConverter:
    def convert(self, file_path: Path, **kwargs) -> List[Document]: ...

class PDFToTextConverter(BaseConverter): ...
class DocxToTextConverter(BaseConverter): ...
class PreProcessor(BaseComponent):
    def process(self, documents: List[Document]) -> List[Document]: ...
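
A common preprocessing step is splitting long text into overlapping chunks before indexing. The helper below is a minimal sketch of word-based splitting with overlap. The function name and defaults are assumptions, and the actual `PreProcessor` also handles cleaning and sentence boundaries, which this ignores.

```python
from typing import List


def split_by_words(text: str, split_length: int = 5, split_overlap: int = 2) -> List[str]:
    """Split text into chunks of split_length words, sharing split_overlap words."""
    if split_overlap >= split_length:
        raise ValueError("split_overlap must be smaller than split_length")
    words = text.split()
    step = split_length - split_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        # Stop once a chunk has reached the end of the text.
        if start + split_length >= len(words):
            break
    return chunks


chunks = split_by_words("a b c d e f g h", split_length=5, split_overlap=2)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of indexing some words twice.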

Full reference: file-processing.md

Evaluation & Utilities

Evaluation metrics, model evaluation tools, and utility functions for assessing pipeline performance and data processing.

def eval_pipeline(pipeline: Pipeline, eval_labels: List[Label]) -> EvaluationResult: ...

class EvaluationResult:
    def __init__(self): ...
    def calculate_metrics(self) -> Dict[str, float]: ...
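
Two metrics commonly reported for extractive QA are exact match and token-level F1. The sketch below computes both over (prediction, gold) pairs. The function names and the normalization rule are assumptions for illustration, not the logic inside `EvaluationResult`.

```python
import re
from collections import Counter
from typing import Dict, List, Tuple


def normalize(text: str) -> List[str]:
    """Lowercase and drop punctuation so 'Paris.' matches 'paris'."""
    return re.findall(r"\w+", text.lower())


def exact_match(prediction: str, gold: str) -> float:
    return float(normalize(prediction) == normalize(gold))


def f1(prediction: str, gold: str) -> float:
    pred, ref = normalize(prediction), normalize(gold)
    common = sum((Counter(pred) & Counter(ref)).values())  # shared tokens
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)


def calculate_metrics(pairs: List[Tuple[str, str]]) -> Dict[str, float]:
    """Average both metrics over a non-empty list of (prediction, gold) pairs."""
    n = len(pairs)
    return {
        "exact_match": sum(exact_match(p, g) for p, g in pairs) / n,
        "f1": sum(f1(p, g) for p, g in pairs) / n,
    }


pairs = [("Paris", "paris."), ("the city of Berlin", "Berlin"), ("Rome", "Madrid")]
metrics = calculate_metrics(pairs)
```

F1 gives partial credit to "the city of Berlin" where exact match gives none, which is why the two are usually reported together.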

Full reference: evaluation-utilities.md
