# Data Streams

The connector exposes two data streams, "reader" and "search", that wrap Jina AI's Reader and Search APIs and emit web content extraction and search results as structured Airbyte records.

## Stream Overview

Both streams share common characteristics:
- **Output Format**: JSON records with a consistent schema
- **Authentication**: Optional Bearer token via api_key
- **Pagination**: None (a single request per stream)
- **Record Selection**: Records are extracted from the "data" field of the API response
- **Configuration**: Defined declaratively in manifest.yaml using the Airbyte Low-Code CDK
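
The record selection step listed above can be pictured with a minimal sketch. Only the "data" field name comes from this document; the helper name `select_records`, the "code" field in the sample body, and the handling of a list versus a single object are assumptions made for illustration, not the connector's actual implementation.

```python
from typing import Any, Dict, List


def select_records(response_json: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the records found under the "data" field of a response body."""
    data = response_json.get("data")
    if data is None:
        return []      # no "data" field: nothing to emit
    if isinstance(data, list):
        return data    # assumed shape: a list of result objects
    return [data]      # assumed shape: a single result object


# Hypothetical response body; only the "data" key is taken from the text above.
sample_body = {"code": 200, "data": {"title": "Example", "url": "https://example.com"}}
print(select_records(sample_body))
```

Normalizing both shapes to a list keeps downstream code identical whether the API returns one object or several.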

## Manifest Configuration

Streams are defined in the manifest.yaml file using Airbyte's declarative framework:

```yaml
# Stream definitions
streams:
  - "#/definitions/reader_stream"
  - "#/definitions/search_stream"

# Reader stream definition
reader_stream:
  type: DeclarativeStream
  name: "reader"
  retriever:
    type: SimpleRetriever
    requester:
      type: CustomRequester
      class_name: source_jina_ai_reader.components.JinaAiHttpRequester
      url_base: "https://r.jina.ai/{{ config['read_prompt'] }}"
      http_method: "GET"

# Search stream definition
search_stream:
  type: DeclarativeStream
  name: "search"
  retriever:
    type: SimpleRetriever
    requester:
      type: CustomRequester
      class_name: source_jina_ai_reader.components.JinaAiHttpRequester
      url_base: "https://s.jina.ai/{{ config['search_prompt'] }}"
      http_method: "GET"
```

## Capabilities

### Reader Stream

Extracts and processes content from specified URLs using Jina AI's Reader API.

**Stream Configuration:**
- **Name**: "reader"
- **Endpoint**: `https://r.jina.ai/{read_prompt}`
- **Method**: GET
- **Purpose**: Read and extract content from web pages

```python { .api }
# Reader stream access (via Airbyte framework)
# Stream is configured declaratively in manifest.yaml
from typing import Optional, TypedDict

class ReaderStreamConfig(TypedDict):
    """Configuration for the reader stream."""
    read_prompt: str        # URL to read content from
    api_key: Optional[str]  # Optional API key for authentication
    gather_links: bool      # Include links summary in response
    gather_images: bool     # Include images summary in response
```

**Request Headers:**
```python
{
    "Accept": "application/json",
    "X-With-Links-Summary": str(gather_links).lower(),    # "true" or "false"
    "X-With-Images-Summary": str(gather_images).lower(),  # "true" or "false"
    "Authorization": f"Bearer {api_key}"                  # Only if api_key provided
}
```
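
To show how the endpoint and headers above combine into one request, here is a minimal sketch. The helper name `build_reader_request` is hypothetical and this is not the connector's `JinaAiHttpRequester`; it only mirrors the endpoint pattern and header names listed in this section, and assumes the boolean flags are sent as lowercase "true"/"false" strings.

```python
from typing import Any, Dict, Mapping, Tuple


def build_reader_request(config: Mapping[str, Any]) -> Tuple[str, Dict[str, str]]:
    """Return the (url, headers) pair for a single reader request."""
    url = f"https://r.jina.ai/{config['read_prompt']}"
    headers = {
        "Accept": "application/json",
        "X-With-Links-Summary": "true" if config.get("gather_links") else "false",
        "X-With-Images-Summary": "true" if config.get("gather_images") else "false",
    }
    api_key = config.get("api_key")
    if api_key:
        # Authorization is added only when an API key is configured.
        headers["Authorization"] = f"Bearer {api_key}"
    return url, headers


url, headers = build_reader_request(
    {"read_prompt": "https://news.example.com/article", "gather_links": True}
)
print(url)
print(headers)
```

Leaving out the Authorization header when no key is set matches the "Only if api_key provided" note above and avoids sending an empty Bearer token.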

### Search Stream

Performs web searches and returns structured results using Jina AI's Search API.

**Stream Configuration:**
- **Name**: "search"
- **Endpoint**: `https://s.jina.ai/{search_prompt}`
- **Method**: GET
- **Purpose**: Search the web and return structured results

```python { .api }
# Search stream access (via Airbyte framework)
# Stream is configured declaratively in manifest.yaml
from typing import Optional, TypedDict

class SearchStreamConfig(TypedDict):
    """Configuration for the search stream."""
    search_prompt: str      # URL-encoded search query
    api_key: Optional[str]  # Optional API key for authentication
    gather_links: bool      # Include links summary in response
    gather_images: bool     # Include images summary in response
```

**Request Headers:**
```python
{
    "Accept": "application/json",
    "X-With-Links-Summary": str(gather_links).lower(),    # "true" or "false"
    "X-With-Images-Summary": str(gather_images).lower(),  # "true" or "false"
    "Authorization": f"Bearer {api_key}"                  # Only if api_key provided
}
```
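
Because search_prompt is placed directly in the URL path, a free-text query has to be percent-encoded first. The sketch below uses Python's standard `urllib.parse.quote`; the helper names `encode_search_prompt` and `build_search_url` are hypothetical and only illustrate the endpoint pattern shown above.

```python
from urllib.parse import quote


def encode_search_prompt(query: str) -> str:
    """Percent-encode a free-text query for use as search_prompt."""
    # safe="" also encodes "/", so the query cannot add extra path segments.
    return quote(query, safe="")


def build_search_url(query: str) -> str:
    """Build the search endpoint URL for a free-text query."""
    return f"https://s.jina.ai/{encode_search_prompt(query)}"


print(encode_search_prompt("machine learning tutorials"))  # machine%20learning%20tutorials
print(build_search_url("machine learning tutorials"))
```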

## Data Schema

Both streams return records following the same JSON schema structure:

```python { .api }
from typing import Any, Dict, TypedDict

class ContentRecord(TypedDict):
    """
    Data record structure returned by both reader and search streams.

    This schema applies to both streams, representing extracted content
    with metadata and optional link/image summaries.
    """
    title: str             # Page or result title
    url: str               # Source URL of the content
    content: str           # Main extracted text content
    description: str       # Brief description or summary
    links: Dict[str, Any]  # Optional links summary object
```

### Schema Details

**title (string)**
- Page title for reader stream
- Search result title for search stream
- Always present in response

**url (string)**
- Original URL for reader stream
- Result URL for search stream
- Always present in response

**content (string)**
- Extracted text content from the page/result
- Main content body processed by Jina AI
- Always present in response

**description (string)**
- Brief description or summary of the content
- Generated by Jina AI's processing
- Always present in response

**links (object)**
- Additional properties with dynamic structure
- Contains link summaries when gather_links=true
- Structure varies based on content and API processing
- May include nested properties like "More information..."
161
## Stream Usage Examples
162
163
### Reader Stream Configuration
164
165
```json
166
{
167
"api_key": "jina_your_api_key",
168
"read_prompt": "https://news.example.com/article",
169
"search_prompt": "placeholder",
170
"gather_links": true,
171
"gather_images": false
172
}
173
```
174
175
**Expected Output:**
176
```json
177
{
178
"title": "Breaking News: AI Advances in 2024",
179
"url": "https://news.example.com/article",
180
"content": "Artificial intelligence continues to advance rapidly in 2024...",
181
"description": "Latest developments in AI technology and their impact on industry",
182
"links": {
183
"More information...": "https://related-article.com"
184
}
185
}
186
```

### Search Stream Configuration

```json
{
  "api_key": "jina_your_api_key",
  "read_prompt": "placeholder",
  "search_prompt": "machine%20learning%20tutorials",
  "gather_links": false,
  "gather_images": true
}
```

**Expected Output:**
```json
{
  "title": "Complete Guide to Machine Learning",
  "url": "https://ml-tutorials.com/guide",
  "content": "This comprehensive guide covers machine learning fundamentals...",
  "description": "Step-by-step machine learning tutorial for beginners",
  "links": {}
}
```

## Stream Discovery and Catalogs

Both streams are discoverable through Airbyte's standard discovery process:

```bash
# Discover available streams
poetry run source-jina-ai-reader discover --config config.json
```

**Discovery Output Structure:**
```json
{
  "streams": [
    {
      "name": "reader",
      "json_schema": {
        "type": "object",
        "properties": {
          "title": {"type": "string"},
          "url": {"type": "string"},
          "content": {"type": "string"},
          "description": {"type": "string"},
          "links": {"type": "object", "additionalProperties": true}
        }
      }
    },
    {
      "name": "search",
      "json_schema": {
        "type": "object",
        "properties": {
          "title": {"type": "string"},
          "url": {"type": "string"},
          "content": {"type": "string"},
          "description": {"type": "string"},
          "links": {"type": "object", "additionalProperties": true}
        }
      }
    }
  ]
}
```
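
A catalog with the structure shown above can be summarized in a few lines of Python. This sketch assumes the JSON has exactly the simplified shape shown here; the raw output of the discover command may wrap the catalog in an Airbyte message envelope, which would need unwrapping first. The helper name `stream_fields` is hypothetical.

```python
import json
from typing import Any, Dict, List


def stream_fields(catalog: Dict[str, Any]) -> Dict[str, List[str]]:
    """Map each stream name to the sorted list of its schema property names."""
    return {
        stream["name"]: sorted(stream.get("json_schema", {}).get("properties", {}))
        for stream in catalog.get("streams", [])
    }


# Abridged catalog in the shape shown above.
catalog = json.loads("""
{
  "streams": [
    {"name": "reader", "json_schema": {"type": "object",
      "properties": {"title": {"type": "string"}, "url": {"type": "string"}}}},
    {"name": "search", "json_schema": {"type": "object",
      "properties": {"title": {"type": "string"}, "url": {"type": "string"}}}}
  ]
}
""")
print(stream_fields(catalog))
```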

## Integration Patterns

### Single Stream Usage

To use only one stream, set that stream's prompt and give the other prompt a placeholder value:

```python
# Reader-only configuration
config = {
    "read_prompt": "https://target-website.com",
    "search_prompt": "placeholder",  # Not used
    "gather_links": True
}

# Search-only configuration
config = {
    "read_prompt": "placeholder",  # Not used
    "search_prompt": "your%20search%20terms",
    "gather_images": True
}
```

### Dual Stream Usage

Use both streams for comprehensive content analysis:

```python
config = {
    "read_prompt": "https://company.com/about",
    "search_prompt": "company%20name%20news",
    "gather_links": True,
    "gather_images": True
}
```
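
The single-stream and dual-stream configurations in this section differ only in which prompt holds a real value, so one helper can produce all of them. The sketch below is hypothetical (`make_config` and `PLACEHOLDER` are illustrative names, not part of the connector) and assumes the search query is passed as plain text and percent-encoded by the helper.

```python
from typing import Any, Dict, Optional
from urllib.parse import quote

PLACEHOLDER = "placeholder"  # stand-in for the unused prompt, as in the examples above


def make_config(
    read_url: Optional[str] = None,
    search_query: Optional[str] = None,
    api_key: Optional[str] = None,
    gather_links: bool = False,
    gather_images: bool = False,
) -> Dict[str, Any]:
    """Build a connector config for reader-only, search-only, or dual use."""
    if read_url is None and search_query is None:
        raise ValueError("provide read_url, search_query, or both")
    config: Dict[str, Any] = {
        "read_prompt": read_url if read_url is not None else PLACEHOLDER,
        # Plain-text queries are percent-encoded before use in the URL path.
        "search_prompt": quote(search_query, safe="") if search_query is not None else PLACEHOLDER,
        "gather_links": gather_links,
        "gather_images": gather_images,
    }
    if api_key:
        config["api_key"] = api_key  # optional, so only set when given
    return config


print(make_config(read_url="https://company.com/about", search_query="company name news",
                  gather_links=True, gather_images=True))
```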

## Error Handling and Limitations

- **API Rate Limits**: Requests are subject to Jina AI's rate limits
- **Content Processing**: Large pages may be truncated or summarized
- **URL Encoding**: Search prompts must be URL-encoded (for example, spaces as `%20`)
- **Authentication**: Some features may require a valid API key
- **Response Size**: Large responses may impact performance
- **Network Dependencies**: Requires network access to the Jina AI APIs