Airbyte source connector for Jina AI Reader API enabling web content extraction and search through intelligent reading services
npx @tessl/cli install tessl/pypi-source-jina-ai-reader@0.1.0

An Airbyte source connector for the Jina AI Reader API that enables intelligent web content extraction, search, and reading through Jina AI's services. The connector provides two main data streams for reading web content from URLs and performing web searches with optional link and image summarization.
pip install source-jina-ai-reader or poetry add source-jina-ai-reader

from source_jina_ai_reader import SourceJinaAiReader

For running the connector:
from source_jina_ai_reader.run import run

For config migration:
from source_jina_ai_reader.config_migration import JinaAiReaderConfigMigration

For custom components:
from source_jina_ai_reader.components import JinaAiHttpRequester

Airbyte CDK imports:
from airbyte_cdk.sources.declarative.types import StreamSlice, StreamState
from airbyte_cdk.sources.message import MessageRepository, InMemoryMessageRepository

from source_jina_ai_reader import SourceJinaAiReader
# Initialize the connector
source = SourceJinaAiReader()
# Use with Airbyte framework
# Configuration is handled through Airbyte's config system

# Install via poetry
poetry install --with dev
# Run connector operations
poetry run source-jina-ai-reader spec
poetry run source-jina-ai-reader check --config config.json
poetry run source-jina-ai-reader discover --config config.json
poetry run source-jina-ai-reader read --config config.json --catalog catalog.json

{
  "api_key": "jina_your_api_key_here",
  "read_prompt": "https://example.com",
  "search_prompt": "AI%20powered%20search",
  "gather_links": true,
  "gather_images": true
}

The connector follows Airbyte's declarative source pattern using YAML configuration.
The connector transforms Jina AI's web content extraction and search APIs into structured Airbyte data streams, handling authentication, request formatting, and data transformation automatically.
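As a rough illustration of the request formatting described above, the sketch below derives one request URL per stream from a config mapping. The base URLs `https://r.jina.ai/` and `https://s.jina.ai/` and the helper name `build_stream_urls` are assumptions made for this example and are not taken from the connector's manifest; the fallback values mirror the defaults listed in the configuration specification.

```python
from typing import Any, Dict, Mapping

# Assumed endpoints for Jina AI's reader and search services (not confirmed by this document).
READER_BASE = "https://r.jina.ai/"
SEARCH_BASE = "https://s.jina.ai/"


def build_stream_urls(config: Mapping[str, Any]) -> Dict[str, str]:
    """Hypothetical helper: map each stream name to the URL it would request."""
    # Fall back to the defaults documented for read_prompt and search_prompt.
    read_prompt = config.get("read_prompt", "https://www.google.com")
    search_prompt = config.get("search_prompt", "Search%20airbyte")
    return {
        "reader": READER_BASE + read_prompt,
        "search": SEARCH_BASE + search_prompt,
    }


urls = build_stream_urls(
    {"read_prompt": "https://example.com", "search_prompt": "AI%20powered%20search"}
)
```

The search prompt is appended as-is, which is why the configuration expects it to be URL-encoded already.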
Main connector class and entry point functions that provide the foundation for Airbyte integration and command-line usage.
class SourceJinaAiReader(YamlDeclarativeSource):
    def __init__(self): ...

def run() -> None: ...

Configuration handling including validation, migration, and URL encoding for search prompts to ensure proper API integration.
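The URL-encoding step can be pictured with a small standalone sketch. It is not the connector's actual migration code: the helper names `is_url_encoded` and `encode_search_prompt` are hypothetical, and the "already encoded" check is a simple heuristic (a string counts as encoded if percent-decoding changes it), which would misjudge a raw query that happens to contain a literal sequence such as `%41`.

```python
from typing import Any, Dict, Mapping
from urllib.parse import quote, unquote


def is_url_encoded(value: str) -> bool:
    """Heuristic: decoding alters an encoded string and leaves a raw one unchanged."""
    return unquote(value) != value


def encode_search_prompt(config: Mapping[str, Any]) -> Dict[str, Any]:
    """Return a copy of config whose search_prompt is percent-encoded (hypothetical helper)."""
    prompt = config.get("search_prompt")
    # Leave the config alone when there is no prompt or it is already encoded.
    if not isinstance(prompt, str) or is_url_encoded(prompt):
        return dict(config)
    # safe="" also encodes "/" so the query stays a single path segment.
    return {**config, "search_prompt": quote(prompt, safe="")}


migrated = encode_search_prompt({"search_prompt": "AI powered search"})
```

Returning a new dictionary rather than mutating the input keeps the sketch compatible with the read-only `Mapping` type used in the signatures.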
class JinaAiReaderConfigMigration:
    @classmethod
    def should_migrate(cls, config: Mapping[str, Any]) -> bool: ...

    @classmethod
    def modify(cls, config: Mapping[str, Any]) -> Mapping[str, Any]: ...

    @classmethod
    def migrate(cls, args: List[str], source: Source) -> None: ...

Custom HTTP requester with Bearer token authentication for secure API access to Jina AI services.
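To make the Bearer token idea concrete, here is a minimal sketch of how request headers could be assembled from the config. It does not reproduce the requester's real implementation: `build_request_headers` is a hypothetical function, and the `X-With-Links-Summary` and `X-With-Images-Summary` header names are assumptions about the Jina AI Reader API rather than something this document states. The API key is treated as optional, as in the configuration specification.

```python
from typing import Any, Dict, Mapping


def build_request_headers(config: Mapping[str, Any]) -> Dict[str, str]:
    """Hypothetical helper: derive HTTP headers from the connector config."""
    headers: Dict[str, str] = {"Accept": "application/json"}

    # Only send the Authorization header when a key is configured.
    api_key = config.get("api_key")
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"

    # Assumed header names for the optional link and image summaries.
    if config.get("gather_links"):
        headers["X-With-Links-Summary"] = "true"
    if config.get("gather_images"):
        headers["X-With-Images-Summary"] = "true"
    return headers


headers = build_request_headers({"api_key": "jina_your_api_key_here", "gather_links": True})
```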
class JinaAiHttpRequester(HttpRequester):
    def get_request_headers(
        self,
        *,
        stream_state: Optional[StreamState] = None,
        stream_slice: Optional[StreamSlice] = None,
        next_page_token: Optional[Mapping[str, Any]] = None,
    ) -> Mapping[str, Any]: ...

Two main data streams providing web content extraction and search functionality through Jina AI's intelligent reading services.
Stream Types:
- reader: Extracts content from specified URLs
- search: Performs web searches with optional content summarization

Data Schema:
- title: string - Content title
- url: string - Source URL
- content: string - Extracted/searched content
- description: string - Content description
- links: object - Optional links summary

# Configuration types
ConfigDict = Mapping[str, Any]
# Airbyte CDK stream types
StreamState = Optional[Mapping[str, Any]] # Current state of data stream for incremental sync
StreamSlice = Optional[Mapping[str, Any]] # Current slice being processed for parallel execution
# Airbyte CDK message types
MessageRepository = InMemoryMessageRepository # Repository for storing connector messages
# Data record structure returned by both streams
class ContentRecord(TypedDict):
    title: str  # Content or search result title
    url: str  # Source URL of the content
    content: str  # Extracted text content
    description: str  # Brief description or summary
    links: Dict[str, Any]  # Optional links summary with dynamic properties

# Configuration specification
class ConfigSpec(TypedDict, total=False):
    api_key: str  # Optional Jina AI API key (marked as secret in manifest)
    read_prompt: str  # URL to read content from (default: "https://www.google.com")
    search_prompt: str  # URL-encoded search query (default: "Search%20airbyte")
    gather_links: bool  # Include links summary section (optional parameter)
    gather_images: bool  # Include images summary section (optional parameter)
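
The record type above can be exercised with a short sketch that coerces a raw API payload into the documented shape. The function `to_content_record` is hypothetical and exists only for illustration; the connector itself is declarative, so its real field mapping lives in the YAML manifest. Missing string fields fall back to empty strings and a missing `links` value to an empty dictionary, which is an assumption of this sketch.

```python
from typing import Any, Dict, Mapping, TypedDict


class ContentRecord(TypedDict):
    title: str
    url: str
    content: str
    description: str
    links: Dict[str, Any]


def to_content_record(raw: Mapping[str, Any]) -> ContentRecord:
    """Hypothetical helper: normalize a raw payload into a ContentRecord."""
    return ContentRecord(
        title=str(raw.get("title", "")),
        url=str(raw.get("url", "")),
        content=str(raw.get("content", "")),
        description=str(raw.get("description", "")),
        # links is optional, so treat None or a missing key as an empty summary.
        links=dict(raw.get("links") or {}),
    )


record = to_content_record({"title": "Example Domain", "url": "https://example.com"})
```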