Airbyte source connector for Jina AI Reader API enabling web content extraction and search through intelligent reading services
npx @tessl/cli install tessl/pypi-source-jina-ai-reader@0.1.00
# Jina AI Reader Source Connector

An Airbyte source connector for the Jina AI Reader API that enables intelligent web content extraction, search, and reading through Jina AI's services. The connector provides two main data streams for reading web content from URLs and performing web searches with optional link and image summarization.

## Package Information

- **Package Name**: source-jina-ai-reader
- **Package Type**: pypi
- **Language**: Python
- **Installation**: `pip install source-jina-ai-reader` or `poetry add source-jina-ai-reader`
- **Framework**: Airbyte CDK Declarative (Low-Code)

## Core Imports

```python
from source_jina_ai_reader import SourceJinaAiReader
```

For running the connector:

```python
from source_jina_ai_reader.run import run
```

For config migration:

```python
from source_jina_ai_reader.config_migration import JinaAiReaderConfigMigration
```

For custom components:

```python
from source_jina_ai_reader.components import JinaAiHttpRequester
```

Airbyte CDK imports:

```python
from airbyte_cdk.sources.declarative.types import StreamSlice, StreamState
from airbyte_cdk.sources.message import MessageRepository, InMemoryMessageRepository
```

## Basic Usage

### As an Airbyte Source Connector

```python
from source_jina_ai_reader import SourceJinaAiReader

# Initialize the connector
source = SourceJinaAiReader()

# Use with Airbyte framework
# Configuration is handled through Airbyte's config system
```

### Command Line Usage

```bash
# Install via poetry
poetry install --with dev

# Run connector operations
poetry run source-jina-ai-reader spec
poetry run source-jina-ai-reader check --config config.json
poetry run source-jina-ai-reader discover --config config.json
poetry run source-jina-ai-reader read --config config.json --catalog catalog.json
```

### Configuration Example

```json
{
  "api_key": "jina_your_api_key_here",
  "read_prompt": "https://example.com",
  "search_prompt": "AI%20powered%20search",
  "gather_links": true,
  "gather_images": true
}
```

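The `search_prompt` value in the configuration is URL-encoded. The following is a minimal sketch of building such a config from a plain-text query, assuming that percent-encoding with the standard library's `urllib.parse.quote` matches what the connector expects. `build_config` is a hypothetical helper for illustration and is not part of the package.

```python
import json
from urllib.parse import quote


def build_config(api_key: str, read_url: str, search_query: str) -> dict:
    """Hypothetical helper: assemble a connector config dict."""
    return {
        "api_key": api_key,
        "read_prompt": read_url,
        # Percent-encode the query; safe="" means "/" is encoded as well.
        "search_prompt": quote(search_query, safe=""),
        "gather_links": True,
        "gather_images": True,
    }


config = build_config(
    "jina_your_api_key_here", "https://example.com", "AI powered search"
)
print(json.dumps(config, indent=2))
```

The printed JSON can be saved as `config.json` and passed to the `check`, `discover`, and `read` commands.
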
## Architecture

The connector follows Airbyte's declarative source pattern using YAML configuration:

- **SourceJinaAiReader**: Main connector class inheriting from YamlDeclarativeSource
- **JinaAiHttpRequester**: Custom HTTP requester handling Bearer token authentication
- **JinaAiReaderConfigMigration**: Runtime configuration migration for URL encoding
- **Manifest Configuration**: YAML-based stream definitions with two data streams
- **CLI Interface**: Standard Airbyte operations (spec, check, discover, read)

The connector transforms Jina AI's web content extraction and search APIs into structured Airbyte data streams, handling authentication, request formatting, and data transformation automatically.

## Capabilities

### Core Connector Interface

Main connector class and entry point functions that provide the foundation for Airbyte integration and command-line usage.

```python { .api }
class SourceJinaAiReader(YamlDeclarativeSource):
    def __init__(self): ...

def run() -> None: ...
```

[Core Interface](./core-interface.md)

### Configuration Management

Configuration handling including validation, migration, and URL encoding for search prompts to ensure proper API integration.

```python { .api }
class JinaAiReaderConfigMigration:
    @classmethod
    def should_migrate(cls, config: Mapping[str, Any]) -> bool: ...

    @classmethod
    def modify(cls, config: Mapping[str, Any]) -> Mapping[str, Any]: ...

    @classmethod
    def migrate(cls, args: List[str], source: Source) -> None: ...
```

[Configuration](./configuration.md)

### HTTP Request Handling

Custom HTTP requester with Bearer token authentication for secure API access to Jina AI services.

```python { .api }
class JinaAiHttpRequester(HttpRequester):
    def get_request_headers(
        self,
        *,
        stream_state: Optional[StreamState] = None,
        stream_slice: Optional[StreamSlice] = None,
        next_page_token: Optional[Mapping[str, Any]] = None,
    ) -> Mapping[str, Any]: ...
```

[HTTP Handling](./http-handling.md)

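As a rough sketch of what Bearer token authentication amounts to at the header level, the function below builds an `Authorization` header from a config mapping. It is a standalone illustration that does not use the Airbyte CDK. `build_auth_headers` is a hypothetical name, and omitting the header when no `api_key` is configured is an assumption based on the key being optional.

```python
from typing import Any, Dict, Mapping


def build_auth_headers(config: Mapping[str, Any]) -> Dict[str, str]:
    """Hypothetical helper: return Bearer auth headers for a config."""
    api_key = config.get("api_key")
    if not api_key:
        # Assumption: requests without a key are sent unauthenticated.
        return {}
    return {"Authorization": f"Bearer {api_key}"}


print(build_auth_headers({"api_key": "jina_your_api_key_here"}))
```
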
### Data Streams

Two main data streams providing web content extraction and search functionality through Jina AI's intelligent reading services.

**Stream Types:**
- `reader`: Extracts content from specified URLs
- `search`: Performs web searches with optional content summarization

**Data Schema:**
- `title`: string - Content title
- `url`: string - Source URL
- `content`: string - Extracted/searched content
- `description`: string - Content description
- `links`: object - Optional links summary

[Data Streams](./data-streams.md)

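The schema above can be checked against a record with a few lines of Python. This is a sketch only: treating the four string fields as required and `links` as optional is an assumption drawn from the field descriptions, `missing_fields` is a hypothetical helper, and the sample record is made up for illustration.

```python
from typing import Any, List, Mapping

# Assumption: the four string fields are always present; links is optional.
REQUIRED_STRING_FIELDS = ("title", "url", "content", "description")


def missing_fields(record: Mapping[str, Any]) -> List[str]:
    """Return the schema fields that are absent or not strings."""
    return [
        name
        for name in REQUIRED_STRING_FIELDS
        if not isinstance(record.get(name), str)
    ]


# Made-up record in the shape described by the data schema.
sample = {
    "title": "Example Domain",
    "url": "https://example.com",
    "content": "Example body text.",
    "description": "An illustrative page.",
    "links": {},
}
print(missing_fields(sample))
```
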
## Types

```python { .api }
# Configuration types
ConfigDict = Mapping[str, Any]

# Airbyte CDK stream types
StreamState = Optional[Mapping[str, Any]]  # Current state of data stream for incremental sync
StreamSlice = Optional[Mapping[str, Any]]  # Current slice being processed for parallel execution

# Airbyte CDK message types
MessageRepository = InMemoryMessageRepository  # Repository for storing connector messages

# Data record structure returned by both streams
class ContentRecord(TypedDict):
    title: str  # Content or search result title
    url: str  # Source URL of the content
    content: str  # Extracted text content
    description: str  # Brief description or summary
    links: Dict[str, Any]  # Optional links summary with dynamic properties

# Configuration specification
class ConfigSpec(TypedDict, total=False):
    api_key: str  # Optional Jina AI API key (marked as secret in manifest)
    read_prompt: str  # URL to read content from (default: "https://www.google.com")
    search_prompt: str  # URL-encoded search query (default: "Search%20airbyte")
    gather_links: bool  # Include links summary section (optional parameter)
    gather_images: bool  # Include images summary section (optional parameter)
```