tessl/pypi-source-jina-ai-reader

Airbyte source connector for Jina AI Reader API enabling web content extraction and search through intelligent reading services

Data Streams

Name: tessl/pypi-source-jina-ai-reader
Author: tessl

The connector exposes two data streams, "reader" and "search", which turn Jina AI's Reader and Search APIs into structured Airbyte data streams for web content extraction and web search.

Stream Overview

Both streams share common characteristics:

Output Format: JSON records with consistent schema
Authentication: Optional Bearer token via api_key
Pagination: No pagination (single request per stream)
Record Selection: Data extracted from "data" field in API response
Configuration: Defined declaratively in manifest.yaml using Airbyte Low-Code CDK
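The record-selection rule above can be sketched in plain Python. This is a minimal illustration, not the connector's code: `select_records` is a hypothetical helper, and it assumes the "data" field may hold either a single object or a list of objects.

```python
from typing import Any, Dict, List


def select_records(response_body: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the records found under the "data" field of a decoded response.

    Assumption: "data" is either one JSON object or a list of JSON objects.
    Anything else yields no records.
    """
    data = response_body.get("data")
    if isinstance(data, dict):
        # A single object becomes a one-record list.
        return [data]
    if isinstance(data, list):
        # Keep only object-shaped items; skip anything else.
        return [item for item in data if isinstance(item, dict)]
    return []


records = select_records({"data": {"title": "Example", "url": "https://example.com"}})
print(len(records))
```

Normalizing both shapes to a list keeps downstream handling the same for both streams.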

Manifest Configuration

Streams are defined in the manifest.yaml file using Airbyte's declarative framework:

# Stream definitions
streams:
  - "#/definitions/reader_stream"
  - "#/definitions/search_stream"

# Reader stream definition
reader_stream:
  type: DeclarativeStream
  name: "reader"
  retriever:
    type: SimpleRetriever
    requester:
      type: CustomRequester
      class_name: source_jina_ai_reader.components.JinaAiHttpRequester
      url_base: "https://r.jina.ai/{{ config['read_prompt'] }}"
      http_method: "GET"

# Search stream definition  
search_stream:
  type: DeclarativeStream
  name: "search"
  retriever:
    type: SimpleRetriever
    requester:
      type: CustomRequester
      class_name: source_jina_ai_reader.components.JinaAiHttpRequester
      url_base: "https://s.jina.ai/{{ config['search_prompt'] }}"
      http_method: "GET"
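To make the `{{ config['...'] }}` placeholders in `url_base` concrete, here is a small sketch of how such a template resolves against a config. It is a simplified stand-in written with a regular expression, not Airbyte's actual interpolation engine; `render_url_base` is a hypothetical name.

```python
import re
from typing import Any, Mapping

# Matches placeholders of the form {{ config['some_key'] }}.
_PLACEHOLDER = re.compile(r"\{\{\s*config\['([^']+)'\]\s*\}\}")


def render_url_base(template: str, config: Mapping[str, Any]) -> str:
    """Replace each {{ config['key'] }} placeholder with str(config[key])."""
    return _PLACEHOLDER.sub(lambda match: str(config[match.group(1)]), template)


reader_url = render_url_base(
    "https://r.jina.ai/{{ config['read_prompt'] }}",
    {"read_prompt": "https://news.example.com/article"},
)
print(reader_url)
```

A missing key raises `KeyError` in this sketch, which surfaces a misconfigured prompt early instead of producing a malformed URL.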

Capabilities

Reader Stream

Extracts and processes content from specified URLs using Jina AI's Reader API.

Stream Configuration:

Name: "reader"
Endpoint: https://r.jina.ai/{read_prompt}
Method: GET
Purpose: Read and extract content from web pages

# Reader stream access (via Airbyte framework)
# Stream is configured declaratively in manifest.yaml

from typing import Optional, TypedDict

class ReaderStreamConfig(TypedDict):
    """Configuration for the reader stream."""
    read_prompt: str  # URL to read content from
    api_key: Optional[str]  # Optional API key for authentication
    gather_links: bool  # Include links summary in response
    gather_images: bool  # Include images summary in response

Request Headers:

{
    "Accept": "application/json",
    "X-With-Links-Summary": str(gather_links),  # str() of a Python bool gives "True" or "False"
    "X-With-Images-Summary": str(gather_images),  # str() of a Python bool gives "True" or "False"
    "Authorization": f"Bearer {api_key}"  # Only if api_key provided
}
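The header mapping above reads as pseudocode, since the Authorization entry is conditional. A runnable sketch of the same logic might look like this; `build_request_headers` is a hypothetical helper, and the `False` defaults for missing flags are an assumption rather than documented connector behavior.

```python
from typing import Any, Dict, Mapping


def build_request_headers(config: Mapping[str, Any]) -> Dict[str, str]:
    """Build the request headers described above from a connector config."""
    headers = {
        "Accept": "application/json",
        # str() of a Python bool is "True" or "False" (capitalized).
        "X-With-Links-Summary": str(config.get("gather_links", False)),
        "X-With-Images-Summary": str(config.get("gather_images", False)),
    }
    api_key = config.get("api_key")
    if api_key:
        # The bearer token is attached only when an API key is configured.
        headers["Authorization"] = f"Bearer {api_key}"
    return headers


print(build_request_headers({"gather_links": True}))
```

Leaving the Authorization header out entirely, rather than sending an empty token, matches the "only if api_key provided" note.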

Search Stream

Performs web searches and returns structured results using Jina AI's Search API.

Stream Configuration:

Name: "search"
Endpoint: https://s.jina.ai/{search_prompt}
Method: GET
Purpose: Search the web and return structured results

# Search stream access (via Airbyte framework)
# Stream is configured declaratively in manifest.yaml

from typing import Optional, TypedDict

class SearchStreamConfig(TypedDict):
    """Configuration for the search stream."""
    search_prompt: str  # URL-encoded search query
    api_key: Optional[str]  # Optional API key for authentication
    gather_links: bool  # Include links summary in response
    gather_images: bool  # Include images summary in response

Request Headers:

{
    "Accept": "application/json",
    "X-With-Links-Summary": str(gather_links),  # str() of a Python bool gives "True" or "False"
    "X-With-Images-Summary": str(gather_images),  # str() of a Python bool gives "True" or "False"
    "Authorization": f"Bearer {api_key}"  # Only if api_key provided
}
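Because `search_prompt` is expected to be URL-encoded, a plain-text query needs encoding before it goes into the config. A minimal sketch using the standard library follows; `encode_search_prompt` is a hypothetical helper name.

```python
from urllib.parse import quote


def encode_search_prompt(query: str) -> str:
    """Percent-encode a free-text query for use as search_prompt.

    safe="" also encodes "/" so the query stays a single path segment.
    """
    return quote(query, safe="")


encoded = encode_search_prompt("machine learning tutorials")
print(encoded)  # machine%20learning%20tutorials
print(f"https://s.jina.ai/{encoded}")
```

`quote` is used rather than `quote_plus` because the prompt lands in the URL path, where spaces are written as `%20` rather than `+`.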

Data Schema

Both streams return records following the same JSON schema structure:

from typing import Any, Dict, TypedDict

class ContentRecord(TypedDict):
    """
    Data record structure returned by both reader and search streams.
    
    This schema applies to both streams, representing extracted content
    with metadata and optional link/image summaries.
    """
    title: str  # Page or result title
    url: str  # Source URL of the content
    content: str  # Main extracted text content
    description: str  # Brief description or summary
    links: Dict[str, Any]  # Optional links summary object

Schema Details

title (string)

Page title for reader stream
Search result title for search stream
Always present in response

url (string)

Original URL for reader stream
Result URL for search stream
Always present in response

content (string)

Extracted text content from the page/result
Main content body processed by Jina AI
Always present in response

description (string)

Brief description or summary of the content
Generated by Jina AI's processing
Always present in response

links (object)

Additional properties with dynamic structure
Contains link summaries when gather_links=true
Structure varies based on content and API processing
May include nested properties like "More information..."
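The field rules above can be checked with a few lines of Python. The sketch below is illustrative only: `find_schema_problems` is a hypothetical helper, and treating `links` as optional is an assumption based on its description.

```python
from typing import Any, List, Mapping

# Fields described above as strings that are always present.
STRING_FIELDS = ("title", "url", "content", "description")


def find_schema_problems(record: Mapping[str, Any]) -> List[str]:
    """Return a list of human-readable problems; empty means the record fits."""
    problems = []
    for field in STRING_FIELDS:
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], str):
            problems.append(f"{field} is not a string")
    # links has a dynamic structure, so only its top-level type is checked.
    links = record.get("links")
    if links is not None and not isinstance(links, dict):
        problems.append("links is not an object")
    return problems


print(find_schema_problems({"title": "T", "url": "https://example.com"}))
```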

Stream Usage Examples

Reader Stream Configuration

{
  "api_key": "jina_your_api_key",
  "read_prompt": "https://news.example.com/article",
  "search_prompt": "placeholder",
  "gather_links": true,
  "gather_images": false
}

Expected Output:

{
  "title": "Breaking News: AI Advances in 2024",
  "url": "https://news.example.com/article", 
  "content": "Artificial intelligence continues to advance rapidly in 2024...",
  "description": "Latest developments in AI technology and their impact on industry",
  "links": {
    "More information...": "https://related-article.com"
  }
}

Search Stream Configuration

{
  "api_key": "jina_your_api_key",
  "read_prompt": "placeholder",
  "search_prompt": "machine%20learning%20tutorials",
  "gather_links": false,
  "gather_images": true
}

Expected Output:

{
  "title": "Complete Guide to Machine Learning",
  "url": "https://ml-tutorials.com/guide",
  "content": "This comprehensive guide covers machine learning fundamentals...",
  "description": "Step-by-step machine learning tutorial for beginners",
  "links": {}
}

Stream Discovery and Catalogs

Both streams are discoverable through Airbyte's standard discovery process:

# Discover available streams
poetry run source-jina-ai-reader discover --config config.json

Discovery Output Structure:

{
  "streams": [
    {
      "name": "reader",
      "json_schema": {
        "type": "object",
        "properties": {
          "title": {"type": "string"},
          "url": {"type": "string"},
          "content": {"type": "string"},
          "description": {"type": "string"},
          "links": {"type": "object", "additionalProperties": true}
        }
      }
    },
    {
      "name": "search", 
      "json_schema": {
        "type": "object",
        "properties": {
          "title": {"type": "string"},
          "url": {"type": "string"},
          "content": {"type": "string"}, 
          "description": {"type": "string"},
          "links": {"type": "object", "additionalProperties": true}
        }
      }
    }
  ]
}
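A discovered catalog of this shape can be turned into a configured catalog that selects which streams to sync. The sketch below assumes the catalog has already been parsed into a dict shaped like the structure above (the command's raw output may wrap it in an Airbyte message envelope). The helper names are hypothetical, and the `full_refresh` and `overwrite` modes are assumed choices.

```python
from typing import Any, Dict, List


def stream_names(catalog: Dict[str, Any]) -> List[str]:
    """List the names of all discovered streams."""
    return [stream["name"] for stream in catalog.get("streams", [])]


def configure_streams(catalog: Dict[str, Any], selected: List[str]) -> Dict[str, Any]:
    """Build a configured catalog containing only the selected streams."""
    return {
        "streams": [
            {
                "stream": stream,
                "sync_mode": "full_refresh",  # assumed: no incremental state is described
                "destination_sync_mode": "overwrite",  # assumed destination behavior
            }
            for stream in catalog.get("streams", [])
            if stream["name"] in selected
        ]
    }


catalog = {
    "streams": [
        {"name": "reader", "json_schema": {"type": "object"}},
        {"name": "search", "json_schema": {"type": "object"}},
    ]
}
print(stream_names(catalog))
print(configure_streams(catalog, ["search"]))
```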

Integration Patterns

Single Stream Usage

To use only one stream, give that stream a real prompt and set the other prompt to a placeholder value:

# Reader-only configuration
reader_config = {
    "read_prompt": "https://target-website.com",
    "search_prompt": "placeholder",  # Not used
    "gather_links": True
}

# Search-only configuration
search_config = {
    "read_prompt": "placeholder",  # Not used
    "search_prompt": "your%20search%20terms",
    "gather_images": True
}

Dual Stream Usage

Supply real prompts for both streams to read a page and run a related search with the same configuration:

config = {
    "read_prompt": "https://company.com/about",
    "search_prompt": "company%20name%20news",
    "gather_links": True,
    "gather_images": True
}
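The single-stream and dual-stream configurations differ only in which prompts are real, so one helper can build both. This is a sketch under assumptions: `make_config` is a hypothetical function, and the "placeholder" value for an unused prompt follows the convention used in the examples in this document.

```python
from typing import Any, Dict, Optional
from urllib.parse import quote

PLACEHOLDER = "placeholder"  # value the examples use for an unused prompt


def make_config(
    read_url: Optional[str] = None,
    search_query: Optional[str] = None,
    api_key: Optional[str] = None,
    gather_links: bool = False,
    gather_images: bool = False,
) -> Dict[str, Any]:
    """Build a connector config; a prompt left as None becomes a placeholder."""
    config: Dict[str, Any] = {
        "read_prompt": read_url if read_url is not None else PLACEHOLDER,
        # The search query is percent-encoded; the read URL is passed through.
        "search_prompt": quote(search_query, safe="") if search_query is not None else PLACEHOLDER,
        "gather_links": gather_links,
        "gather_images": gather_images,
    }
    if api_key:
        config["api_key"] = api_key
    return config


print(make_config("https://company.com/about", "company name news", gather_links=True))
```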

Error Handling and Limitations

API Rate Limits: Requests are subject to Jina AI's rate limiting
Content Processing: Large pages may be truncated or summarized
URL Encoding: Search prompts must be properly URL-encoded
Authentication: Some features may require a valid API key
Response Size: Large responses may impact performance
Network Dependencies: Requires internet access to Jina AI APIs
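One common way to cope with rate limiting is to retry with exponential backoff. The sketch below is generic and not taken from the connector, whose own retry behavior is not described here. `RateLimited` and `call_with_backoff` are hypothetical names, and the caller is assumed to raise `RateLimited` when the API signals a rate limit.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical error raised by the caller's fetch function on a rate limit."""


def call_with_backoff(
    fetch: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(), retrying on RateLimited with delays of base_delay * 2**attempt."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * (2 ** attempt))
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter is injectable so the delay schedule can be exercised in tests without waiting.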

Install with Tessl CLI

npx tessl i tessl/pypi-source-jina-ai-reader
