Airbyte source connector for Jina AI Reader API enabling web content extraction and search through intelligent reading services
The connector provides two data streams for web content extraction and search through Jina AI's intelligent reading services. It exposes "reader" and "search" streams that wrap Jina AI's APIs as structured Airbyte data streams.
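Both streams share common characteristics: each is a DeclarativeStream backed by a SimpleRetriever, uses the same custom requester class, issues GET requests, and returns records with the same schema.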
Streams are defined in the manifest.yaml file using Airbyte's declarative framework:
# Stream definitions
streams:
  - "#/definitions/reader_stream"
  - "#/definitions/search_stream"
# Reader stream definition
reader_stream:
  type: DeclarativeStream
  name: "reader"
  retriever:
    type: SimpleRetriever
    requester:
      type: CustomRequester
      class_name: source_jina_ai_reader.components.JinaAiHttpRequester
      url_base: "https://r.jina.ai/{{ config['read_prompt'] }}"
      http_method: "GET"
# Search stream definition
search_stream:
  type: DeclarativeStream
  name: "search"
  retriever:
    type: SimpleRetriever
    requester:
      type: CustomRequester
      class_name: source_jina_ai_reader.components.JinaAiHttpRequester
      url_base: "https://s.jina.ai/{{ config['search_prompt'] }}"
      http_method: "GET"

Extracts and processes content from specified URLs using Jina AI's Reader API.
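To make the reader request concrete, the following is a minimal sketch of the URL and headers that such a request would carry, based only on the endpoint and header names documented in this section. The helper `build_reader_request` is hypothetical and is not part of the connector; lower-casing the boolean flags is an assumption made so that the values match the documented "true"/"false" strings.

```python
from typing import Dict, Optional, Tuple

# Endpoint taken from the url_base in the manifest above.
READER_BASE = "https://r.jina.ai/"


def build_reader_request(
    read_prompt: str,
    api_key: Optional[str] = None,
    gather_links: bool = False,
    gather_images: bool = False,
) -> Tuple[str, Dict[str, str]]:
    """Return the (url, headers) pair for a reader request (hypothetical helper)."""
    url = READER_BASE + read_prompt
    headers = {
        "Accept": "application/json",
        # Assumption: lower-case so the value is "true"/"false",
        # not Python's "True"/"False".
        "X-With-Links-Summary": str(gather_links).lower(),
        "X-With-Images-Summary": str(gather_images).lower(),
    }
    if api_key:
        # The Authorization header is only added when a key is configured.
        headers["Authorization"] = f"Bearer {api_key}"
    return url, headers


url, headers = build_reader_request(
    "https://news.example.com/article", gather_links=True
)
print(url)
print(headers)
```

Building the headers conditionally, rather than as one dict literal, keeps the `Authorization` entry out of unauthenticated requests.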
Stream Configuration:
https://r.jina.ai/{read_prompt}

# Reader stream access (via Airbyte framework)
# Stream is configured declaratively in manifest.yaml
from typing import Optional, TypedDict

class ReaderStreamConfig(TypedDict):
    """Configuration for the reader stream."""
    read_prompt: str  # URL to read content from
    api_key: Optional[str]  # Optional API key for authentication
    gather_links: bool  # Include links summary in response
    gather_images: bool  # Include images summary in response

Request Headers:
{
    "Accept": "application/json",
    "X-With-Links-Summary": str(gather_links).lower(),  # "true" or "false"
    "X-With-Images-Summary": str(gather_images).lower(),  # "true" or "false"
    "Authorization": f"Bearer {api_key}"  # Only if api_key provided
}

Performs web searches and returns structured results using Jina AI's Search API.
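The search stream expects `search_prompt` to be a URL-encoded query, so a plain-text query has to be encoded before it goes into the configuration. A small sketch using the standard library follows; the helper names `encode_search_prompt` and `build_search_url` are hypothetical, and the resulting URL simply follows the `https://s.jina.ai/{search_prompt}` pattern from the manifest.

```python
from urllib.parse import quote

# Endpoint taken from the url_base of the search stream in the manifest.
SEARCH_BASE = "https://s.jina.ai/"


def encode_search_prompt(query: str) -> str:
    """Percent-encode a plain-text query for use as search_prompt."""
    # safe="" also encodes "/" so the query cannot alter the URL path.
    return quote(query, safe="")


def build_search_url(query: str) -> str:
    """Return the full search URL for a plain-text query."""
    return SEARCH_BASE + encode_search_prompt(query)


print(encode_search_prompt("machine learning tutorials"))
print(build_search_url("machine learning tutorials"))
```

`quote` is used rather than `quote_plus` because the query sits in the URL path, where spaces are encoded as `%20` rather than `+`.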
Stream Configuration:
https://s.jina.ai/{search_prompt}

# Search stream access (via Airbyte framework)
# Stream is configured declaratively in manifest.yaml
from typing import Optional, TypedDict

class SearchStreamConfig(TypedDict):
    """Configuration for the search stream."""
    search_prompt: str  # URL-encoded search query
    api_key: Optional[str]  # Optional API key for authentication
    gather_links: bool  # Include links summary in response
    gather_images: bool  # Include images summary in response

Request Headers:
{
    "Accept": "application/json",
    "X-With-Links-Summary": str(gather_links).lower(),  # "true" or "false"
    "X-With-Images-Summary": str(gather_images).lower(),  # "true" or "false"
    "Authorization": f"Bearer {api_key}"  # Only if api_key provided
}

Both streams return records following the same JSON schema structure:
from typing import Any, Dict, TypedDict

class ContentRecord(TypedDict):
    """
    Data record structure returned by both reader and search streams.
    This schema applies to both streams, representing extracted content
    with metadata and optional link/image summaries.
    """
    title: str  # Page or result title
    url: str  # Source URL of the content
    content: str  # Main extracted text content
    description: str  # Brief description or summary
    links: Dict[str, Any]  # Optional links summary object

title (string)
url (string)
content (string)
description (string)
links (object)
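As an illustration of how the field list above might be used downstream, here is a small sketch that checks a record against those five fields. The function `find_schema_problems` is hypothetical and not part of the connector. It treats every field as optional, on the assumption that the schema marks none of them as required.

```python
from typing import Any, Dict, List

# Field names and types as listed in the record schema above.
EXPECTED_FIELDS = {
    "title": str,
    "url": str,
    "content": str,
    "description": str,
    "links": dict,
}


def find_schema_problems(record: Dict[str, Any]) -> List[str]:
    """Return a description of each field whose type does not match."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in record:
            continue  # assumption: missing fields are allowed
        value = record[name]
        if not isinstance(value, expected_type):
            problems.append(
                f"{name}: expected {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
    return problems


good = {"title": "Guide", "url": "https://ml-tutorials.com/guide", "links": {}}
bad = {"title": 123, "url": "https://ml-tutorials.com/guide"}
print(find_schema_problems(good))
print(find_schema_problems(bad))
```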
Reader example configuration:
{
  "api_key": "jina_your_api_key",
  "read_prompt": "https://news.example.com/article",
  "search_prompt": "placeholder",
  "gather_links": true,
  "gather_images": false
}

Expected Output:
{
  "title": "Breaking News: AI Advances in 2024",
  "url": "https://news.example.com/article",
  "content": "Artificial intelligence continues to advance rapidly in 2024...",
  "description": "Latest developments in AI technology and their impact on industry",
  "links": {
    "More information...": "https://related-article.com"
  }
}

Search example configuration:
{
  "api_key": "jina_your_api_key",
  "read_prompt": "placeholder",
  "search_prompt": "machine%20learning%20tutorials",
  "gather_links": false,
  "gather_images": true
}

Expected Output:
{
  "title": "Complete Guide to Machine Learning",
  "url": "https://ml-tutorials.com/guide",
  "content": "This comprehensive guide covers machine learning fundamentals...",
  "description": "Step-by-step machine learning tutorial for beginners",
  "links": {}
}

Both streams are discoverable through Airbyte's standard discovery process:
# Discover available streams
poetry run source-jina-ai-reader discover --config config.json

Discovery Output Structure:
{
  "streams": [
    {
      "name": "reader",
      "json_schema": {
        "type": "object",
        "properties": {
          "title": {"type": "string"},
          "url": {"type": "string"},
          "content": {"type": "string"},
          "description": {"type": "string"},
          "links": {"type": "object", "additionalProperties": true}
        }
      }
    },
    {
      "name": "search",
      "json_schema": {
        "type": "object",
        "properties": {
          "title": {"type": "string"},
          "url": {"type": "string"},
          "content": {"type": "string"},
          "description": {"type": "string"},
          "links": {"type": "object", "additionalProperties": true}
        }
      }
    }
  ]
}

Configure to use only one stream by providing appropriate prompts:
# Reader-only configuration
config = {
    "read_prompt": "https://target-website.com",
    "search_prompt": "placeholder",  # Not used
    "gather_links": True
}
# Search-only configuration
config = {
    "read_prompt": "placeholder",  # Not used
    "search_prompt": "your%20search%20terms",
    "gather_images": True
}

Use both streams for comprehensive content analysis:
config = {
    "read_prompt": "https://company.com/about",
    "search_prompt": "company%20name%20news",
    "gather_links": True,
    "gather_images": True
}

Install with Tessl CLI
npx tessl i tessl/pypi-source-jina-ai-reader
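The reader-only, search-only, and combined configurations shown earlier differ only in which prompts are filled in. The sketch below generates any of the three from one hypothetical helper, `build_config`, which is not part of the connector. It mirrors the "placeholder" convention used in the examples; whether the connector actually requires a non-empty value for an unused prompt is an assumption.

```python
import json
from typing import Any, Dict, Optional

# Value the examples in this document use for a prompt that is not needed.
PLACEHOLDER = "placeholder"


def build_config(
    read_prompt: Optional[str] = None,
    search_prompt: Optional[str] = None,
    api_key: Optional[str] = None,
    gather_links: bool = False,
    gather_images: bool = False,
) -> Dict[str, Any]:
    """Build a connector config, filling unused prompts with a placeholder."""
    config: Dict[str, Any] = {
        "read_prompt": read_prompt or PLACEHOLDER,
        # search_prompt is expected to be URL-encoded already.
        "search_prompt": search_prompt or PLACEHOLDER,
        "gather_links": gather_links,
        "gather_images": gather_images,
    }
    if api_key:
        config["api_key"] = api_key  # omitted entirely when no key is set
    return config


reader_only = build_config(read_prompt="https://target-website.com", gather_links=True)
print(json.dumps(reader_only, indent=2))
```

The printed JSON can be saved as `config.json` and passed to the `discover` command via `--config`.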