tessl/pypi-source-jina-ai-reader

Airbyte source connector for Jina AI Reader API enabling web content extraction and search through intelligent reading services


Data Streams

The connector exposes two data streams, "reader" and "search", which wrap Jina AI's Reader and Search APIs as structured Airbyte streams for web content extraction and web search.

Stream Overview

Both streams share common characteristics:

  • Output Format: JSON records with consistent schema
  • Authentication: Optional Bearer token via api_key
  • Pagination: No pagination (single request per stream)
  • Record Selection: Data extracted from "data" field in API response
  • Configuration: Defined declaratively in manifest.yaml using Airbyte Low-Code CDK
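The record selection step listed above can be illustrated with a minimal sketch. It assumes the API response is a JSON object whose "data" field holds either a single record (an object) or several records (a list); which shape each endpoint returns is an assumption here, and `select_records` is a hypothetical helper, not part of the connector.

```python
from typing import Any, Dict, List


def select_records(response_body: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the records found under the "data" field of a response body.

    Hypothetical helper: accepts either a single object or a list of objects
    under "data", since the exact shape per endpoint is an assumption.
    """
    data = response_body.get("data")
    if isinstance(data, list):
        # Keep only object-like entries; anything else is not a record.
        return [item for item in data if isinstance(item, dict)]
    if isinstance(data, dict):
        return [data]
    # Missing or unexpected "data" yields no records.
    return []
```

Returning a list in every case keeps downstream handling uniform, whether a response carries one record or many.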

Manifest Configuration

Streams are defined in the manifest.yaml file using Airbyte's declarative framework:

# Stream definitions
streams:
  - "#/definitions/reader_stream"
  - "#/definitions/search_stream"

# Reader stream definition
reader_stream:
  type: DeclarativeStream
  name: "reader"
  retriever:
    type: SimpleRetriever
    requester:
      type: CustomRequester
      class_name: source_jina_ai_reader.components.JinaAiHttpRequester
      url_base: "https://r.jina.ai/{{ config['read_prompt'] }}"
      http_method: "GET"

# Search stream definition  
search_stream:
  type: DeclarativeStream
  name: "search"
  retriever:
    type: SimpleRetriever
    requester:
      type: CustomRequester
      class_name: source_jina_ai_reader.components.JinaAiHttpRequester
      url_base: "https://s.jina.ai/{{ config['search_prompt'] }}"
      http_method: "GET"
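To show how the `url_base` templates above turn into request URLs, here is a rough sketch that substitutes `{{ config['...'] }}` placeholders with values from the connector config. The real Low-Code CDK performs its own template interpolation; the regular expression and the `resolve_url` name below are illustrative assumptions that only cover this one placeholder form.

```python
import re
from typing import Any, Mapping

# Matches placeholders of the form {{ config['some_key'] }} (illustrative only).
_PLACEHOLDER = re.compile(r"\{\{\s*config\['(\w+)'\]\s*\}\}")


def resolve_url(template: str, config: Mapping[str, Any]) -> str:
    """Replace each config placeholder in the template with its config value."""
    return _PLACEHOLDER.sub(lambda match: str(config[match.group(1)]), template)


reader_url = resolve_url(
    "https://r.jina.ai/{{ config['read_prompt'] }}",
    {"read_prompt": "https://news.example.com/article"},
)
```

In this sketch the prompt is appended to the base URL as-is, which is why the search prompt has to be URL-encoded before it is placed in the config.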

Capabilities

Reader Stream

Extracts and processes content from specified URLs using Jina AI's Reader API.

Stream Configuration:

  • Name: "reader"
  • Endpoint: https://r.jina.ai/{read_prompt}
  • Method: GET
  • Purpose: Read and extract content from web pages

# Reader stream access (via Airbyte framework)
# Stream is configured declaratively in manifest.yaml
from typing import Optional, TypedDict

class ReaderStreamConfig(TypedDict):
    """Configuration for the reader stream."""
    read_prompt: str  # URL to read content from
    api_key: Optional[str]  # Optional API key for authentication
    gather_links: bool  # Include links summary in response
    gather_images: bool  # Include images summary in response

Request Headers:

headers = {
    "Accept": "application/json",
    # str(True) is "True" in Python, so lowercase it to send "true" or "false"
    "X-With-Links-Summary": str(gather_links).lower(),
    "X-With-Images-Summary": str(gather_images).lower(),
}
if api_key:  # Authorization is sent only if api_key is provided
    headers["Authorization"] = f"Bearer {api_key}"

Search Stream

Performs web searches and returns structured results using Jina AI's Search API.

Stream Configuration:

  • Name: "search"
  • Endpoint: https://s.jina.ai/{search_prompt}
  • Method: GET
  • Purpose: Search the web and return structured results

# Search stream access (via Airbyte framework)
# Stream is configured declaratively in manifest.yaml
from typing import Optional, TypedDict

class SearchStreamConfig(TypedDict):
    """Configuration for the search stream."""
    search_prompt: str  # URL-encoded search query
    api_key: Optional[str]  # Optional API key for authentication
    gather_links: bool  # Include links summary in response
    gather_images: bool  # Include images summary in response

Request Headers:

headers = {
    "Accept": "application/json",
    # str(True) is "True" in Python, so lowercase it to send "true" or "false"
    "X-With-Links-Summary": str(gather_links).lower(),
    "X-With-Images-Summary": str(gather_images).lower(),
}
if api_key:  # Authorization is sent only if api_key is provided
    headers["Authorization"] = f"Bearer {api_key}"
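Putting the endpoint and header descriptions together, the following sketch prepares (but does not send) a GET request for either stream using only the standard library. `prepare_request`, `BASE_URLS`, and `PROMPT_KEYS` are hypothetical names for illustration; the connector itself issues requests through its `JinaAiHttpRequester` component, and lowercasing the boolean flags is done here to match the "true"/"false" values described above.

```python
from typing import Any, Mapping
from urllib.request import Request

# Endpoints and config keys as described for the two streams.
BASE_URLS = {"reader": "https://r.jina.ai/", "search": "https://s.jina.ai/"}
PROMPT_KEYS = {"reader": "read_prompt", "search": "search_prompt"}


def prepare_request(stream: str, config: Mapping[str, Any]) -> Request:
    """Build an unsent GET request for the given stream name."""
    headers = {
        "Accept": "application/json",
        "X-With-Links-Summary": str(bool(config.get("gather_links", False))).lower(),
        "X-With-Images-Summary": str(bool(config.get("gather_images", False))).lower(),
    }
    api_key = config.get("api_key")
    if api_key:
        # The Authorization header is added only when an API key is configured.
        headers["Authorization"] = f"Bearer {api_key}"
    url = BASE_URLS[stream] + str(config[PROMPT_KEYS[stream]])
    return Request(url, headers=headers, method="GET")


request = prepare_request(
    "search",
    {"search_prompt": "machine%20learning%20tutorials", "gather_images": True},
)
```

The request object could then be passed to `urllib.request.urlopen`; that step is left out so the sketch stays free of network access.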

Data Schema

Both streams return records following the same JSON schema structure:

from typing import Any, Dict, TypedDict

class ContentRecord(TypedDict):
    """
    Data record structure returned by both reader and search streams.
    
    This schema applies to both streams, representing extracted content
    with metadata and optional link/image summaries.
    """
    title: str  # Page or result title
    url: str  # Source URL of the content
    content: str  # Main extracted text content
    description: str  # Brief description or summary
    links: Dict[str, Any]  # Optional links summary object

Schema Details

title (string)

  • Page title for reader stream
  • Search result title for search stream
  • Always present in response

url (string)

  • Original URL for reader stream
  • Result URL for search stream
  • Always present in response

content (string)

  • Extracted text content from the page/result
  • Main content body processed by Jina AI
  • Always present in response

description (string)

  • Brief description or summary of the content
  • Generated by Jina AI's processing
  • Always present in response

links (object)

  • Additional properties with dynamic structure
  • Contains link summaries when gather_links=true
  • Structure varies based on content and API processing
  • May include nested properties like "More information..."
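As a quick way to check records against the field descriptions above, a small validator could look like the sketch below. It treats the four string fields as required and `links` as an optional object, following the schema details; `validate_record` is a hypothetical helper and not something the connector ships.

```python
from typing import Any, List, Mapping

# Fields described above as always present and string-typed.
REQUIRED_STRING_FIELDS = ("title", "url", "content", "description")


def validate_record(record: Mapping[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems: List[str] = []
    for field in REQUIRED_STRING_FIELDS:
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], str):
            problems.append(f"{field} is not a string")
    # "links" is optional, but when present it should be a JSON object.
    if "links" in record and not isinstance(record["links"], dict):
        problems.append("links is not an object")
    return problems
```

Collecting every problem instead of raising on the first one makes it easier to spot several schema mismatches in a single pass.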

Stream Usage Examples

Reader Stream Configuration

{
  "api_key": "jina_your_api_key",
  "read_prompt": "https://news.example.com/article",
  "search_prompt": "placeholder",
  "gather_links": true,
  "gather_images": false
}

Expected Output:

{
  "title": "Breaking News: AI Advances in 2024",
  "url": "https://news.example.com/article", 
  "content": "Artificial intelligence continues to advance rapidly in 2024...",
  "description": "Latest developments in AI technology and their impact on industry",
  "links": {
    "More information...": "https://related-article.com"
  }
}

Search Stream Configuration

{
  "api_key": "jina_your_api_key",
  "read_prompt": "placeholder",
  "search_prompt": "machine%20learning%20tutorials",
  "gather_links": false,
  "gather_images": true
}

Expected Output:

{
  "title": "Complete Guide to Machine Learning",
  "url": "https://ml-tutorials.com/guide",
  "content": "This comprehensive guide covers machine learning fundamentals...",
  "description": "Step-by-step machine learning tutorial for beginners",
  "links": {}
}

Stream Discovery and Catalogs

Both streams are discoverable through Airbyte's standard discovery process:

# Discover available streams
poetry run source-jina-ai-reader discover --config config.json

Discovery Output Structure:

{
  "streams": [
    {
      "name": "reader",
      "json_schema": {
        "type": "object",
        "properties": {
          "title": {"type": "string"},
          "url": {"type": "string"},
          "content": {"type": "string"},
          "description": {"type": "string"},
          "links": {"type": "object", "additionalProperties": true}
        }
      }
    },
    {
      "name": "search", 
      "json_schema": {
        "type": "object",
        "properties": {
          "title": {"type": "string"},
          "url": {"type": "string"},
          "content": {"type": "string"}, 
          "description": {"type": "string"},
          "links": {"type": "object", "additionalProperties": true}
        }
      }
    }
  ]
}
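When the discover command is run from a script, its output usually has to be parsed before the stream list can be used. The sketch below assumes the standard Airbyte protocol behavior of printing one JSON message per line, with the catalog wrapped in a message of type "CATALOG"; the structure shown above would then be the inner "catalog" value. `extract_catalog` and `stream_names` are hypothetical helper names.

```python
import json
from typing import Any, Dict, List, Optional


def extract_catalog(discover_output: str) -> Optional[Dict[str, Any]]:
    """Return the catalog from the first CATALOG message, or None if absent."""
    for line in discover_output.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            message = json.loads(line)
        except json.JSONDecodeError:
            # Skip any non-JSON output such as plain log lines.
            continue
        if isinstance(message, dict) and message.get("type") == "CATALOG":
            return message.get("catalog")
    return None


def stream_names(catalog: Dict[str, Any]) -> List[str]:
    """List the stream names declared in a catalog."""
    return [stream["name"] for stream in catalog.get("streams", [])]
```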

Integration Patterns

Single Stream Usage

Configure to use only one stream by providing appropriate prompts:

# Reader-only configuration
reader_config = {
    "read_prompt": "https://target-website.com",
    "search_prompt": "placeholder",  # Not used
    "gather_links": True
}

# Search-only configuration
search_config = {
    "read_prompt": "placeholder",  # Not used
    "search_prompt": "your%20search%20terms",
    "gather_images": True
}
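In Airbyte, the streams actually read during a sync are generally chosen through the configured catalog passed to the read command, so a placeholder prompt can be paired with a catalog that lists only the stream of interest. The sketch below builds such a catalog from discovery output. `configure_streams` is a hypothetical helper, and the "full_refresh" and "overwrite" modes are assumptions, chosen because the streams are described as single-request with no pagination.

```python
from typing import Any, Dict, Iterable, List


def configure_streams(catalog: Dict[str, Any], selected: Iterable[str]) -> Dict[str, Any]:
    """Build a configured catalog containing only the selected stream names."""
    wanted = set(selected)
    configured: List[Dict[str, Any]] = []
    for stream in catalog.get("streams", []):
        if stream["name"] not in wanted:
            continue
        stream_def = dict(stream)
        # Assumption: default to full refresh when discovery lists no sync modes.
        stream_def.setdefault("supported_sync_modes", ["full_refresh"])
        configured.append(
            {
                "stream": stream_def,
                "sync_mode": "full_refresh",
                "destination_sync_mode": "overwrite",
            }
        )
    return {"streams": configured}
```

The result can be written to a JSON file and supplied as the catalog argument when running the connector's read command.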

Dual Stream Usage

Use both streams for comprehensive content analysis:

config = {
    "read_prompt": "https://company.com/about",
    "search_prompt": "company%20name%20news",
    "gather_links": True,
    "gather_images": True
}

Error Handling and Limitations

  • API Rate Limits: Requests subject to Jina AI's rate limiting
  • Content Processing: Large pages may be truncated or summarized
  • URL Encoding: Search prompts must be properly URL-encoded
  • Authentication: Some features may require a valid API key
  • Response Size: Large responses may impact performance
  • Network Dependencies: Requires internet access to Jina AI APIs
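For the URL-encoding requirement above, one possible approach is to percent-encode the search query with the standard library before placing it in `search_prompt`. `encode_search_prompt` is a hypothetical helper name; passing `safe=""` is a choice made here so that characters such as "/" are encoded as well.

```python
from urllib.parse import quote


def encode_search_prompt(query: str) -> str:
    """Percent-encode a free-text query for use as the search_prompt value."""
    # safe="" encodes every reserved character, including "/" and "&".
    return quote(query, safe="")


search_prompt = encode_search_prompt("machine learning tutorials")
# search_prompt is now "machine%20learning%20tutorials"
```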

Install with Tessl CLI

npx tessl i tessl/pypi-source-jina-ai-reader
