Tessl Tile for pypi/source-jina-ai-reader@0.1.0

# Data Streams

The connector exposes two data streams, "reader" and "search", that wrap Jina AI's Reader and Search APIs and emit their results as structured Airbyte records. The reader stream extracts content from a given URL; the search stream runs a web search and returns structured results.

## Stream Overview

Both streams share common characteristics:
- **Output Format**: JSON records with a consistent schema
- **Authentication**: Optional Bearer token supplied via `api_key`
- **Pagination**: None (a single request per stream)
- **Record Selection**: Records are extracted from the `data` field of the API response
- **Configuration**: Defined declaratively in `manifest.yaml` using the Airbyte Low-Code CDK

## Manifest Configuration

Streams are defined in the `manifest.yaml` file using Airbyte's declarative framework:

```yaml
# Stream definitions
streams:
  - "#/definitions/reader_stream"
  - "#/definitions/search_stream"

# Reader stream definition
reader_stream:
  type: DeclarativeStream
  name: "reader"
  retriever:
    type: SimpleRetriever
    requester:
      type: CustomRequester
      class_name: source_jina_ai_reader.components.JinaAiHttpRequester
      url_base: "https://r.jina.ai/{{ config['read_prompt'] }}"
      http_method: "GET"

# Search stream definition
search_stream:
  type: DeclarativeStream
  name: "search"
  retriever:
    type: SimpleRetriever
    requester:
      type: CustomRequester
      class_name: source_jina_ai_reader.components.JinaAiHttpRequester
      url_base: "https://s.jina.ai/{{ config['search_prompt'] }}"
      http_method: "GET"
```
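To make the two `url_base` templates concrete, the sketch below mimics the interpolation in plain Python. It is not the CDK's Jinja engine: `build_stream_urls` is a hypothetical helper, and only the two hosts and the `read_prompt` / `search_prompt` keys are taken from the manifest above.

```python
from typing import Dict

# Base hosts as they appear in the manifest's url_base templates.
READER_BASE = "https://r.jina.ai/"
SEARCH_BASE = "https://s.jina.ai/"


def build_stream_urls(config: Dict[str, str]) -> Dict[str, str]:
    """Return the request URL for each stream (hypothetical helper)."""
    return {
        # The prompt is appended directly to the host, as in the templates.
        "reader": READER_BASE + config["read_prompt"],
        "search": SEARCH_BASE + config["search_prompt"],
    }


urls = build_stream_urls(
    {
        "read_prompt": "https://news.example.com/article",
        "search_prompt": "machine%20learning%20tutorials",
    }
)
print(urls["reader"])  # https://r.jina.ai/https://news.example.com/article
print(urls["search"])  # https://s.jina.ai/machine%20learning%20tutorials
```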

## Capabilities

### Reader Stream

Extracts and processes content from the specified URL using Jina AI's Reader API.

**Stream Configuration:**
- **Name**: "reader"
- **Endpoint**: `https://r.jina.ai/{read_prompt}`
- **Method**: GET
- **Purpose**: Read and extract content from web pages

```python { .api }
# Reader stream access (via Airbyte framework)
# Stream is configured declaratively in manifest.yaml
from typing import Optional, TypedDict


class ReaderStreamConfig(TypedDict):
    """Configuration for the reader stream."""
    read_prompt: str  # URL to read content from
    api_key: Optional[str]  # Optional API key for authentication
    gather_links: bool  # Include links summary in response
    gather_images: bool  # Include images summary in response
```

**Request Headers:**
```python
{
    "Accept": "application/json",
    "X-With-Links-Summary": str(gather_links).lower(),  # "true" or "false"
    "X-With-Images-Summary": str(gather_images).lower(),  # "true" or "false"
    "Authorization": f"Bearer {api_key}"  # Only if api_key provided
}
```
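Because the `Authorization` header is only sent when an `api_key` is configured, the header set may be easier to follow as a small function than as a literal. The following is a minimal sketch of that logic, not the connector's actual `JinaAiHttpRequester` code. `build_request_headers` is a hypothetical name, the `False` defaults for missing flags are an assumption, and the lowercase `"true"` / `"false"` strings follow the comments in the headers listing above.

```python
from typing import Any, Dict, Mapping


def build_request_headers(config: Mapping[str, Any]) -> Dict[str, str]:
    """Assemble the headers described above (hypothetical helper)."""
    headers = {
        "Accept": "application/json",
        # Booleans are sent as lowercase strings: "true" or "false".
        # Assumption: a missing flag is treated as False.
        "X-With-Links-Summary": str(bool(config.get("gather_links", False))).lower(),
        "X-With-Images-Summary": str(bool(config.get("gather_images", False))).lower(),
    }
    api_key = config.get("api_key")
    if api_key:  # Leave Authorization out entirely for anonymous access.
        headers["Authorization"] = f"Bearer {api_key}"
    return headers


anonymous = build_request_headers({"gather_links": True})
authed = build_request_headers({"api_key": "jina_your_api_key", "gather_images": True})
```

Here `anonymous` carries no `Authorization` key at all, while `authed` carries the Bearer token alongside the two summary flags.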

### Search Stream

Performs web searches and returns structured results using Jina AI's Search API.

**Stream Configuration:**
- **Name**: "search"
- **Endpoint**: `https://s.jina.ai/{search_prompt}`
- **Method**: GET
- **Purpose**: Search the web and return structured results

```python { .api }
# Search stream access (via Airbyte framework)
# Stream is configured declaratively in manifest.yaml
from typing import Optional, TypedDict


class SearchStreamConfig(TypedDict):
    """Configuration for the search stream."""
    search_prompt: str  # URL-encoded search query
    api_key: Optional[str]  # Optional API key for authentication
    gather_links: bool  # Include links summary in response
    gather_images: bool  # Include images summary in response
```

**Request Headers:**
```python
{
    "Accept": "application/json",
    "X-With-Links-Summary": str(gather_links).lower(),  # "true" or "false"
    "X-With-Images-Summary": str(gather_images).lower(),  # "true" or "false"
    "Authorization": f"Bearer {api_key}"  # Only if api_key provided
}
```
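Since `search_prompt` is placed directly in the URL path and is expected to be URL-encoded, a free-text query needs percent-encoding before it goes into the configuration. One way to do that with the standard library is sketched below; `encode_search_prompt` is a hypothetical helper name, not part of the connector.

```python
from urllib.parse import quote


def encode_search_prompt(query: str) -> str:
    """Percent-encode a free-text query for use as search_prompt."""
    # safe="" also encodes "/" so the query cannot change the URL path.
    return quote(query, safe="")


print(encode_search_prompt("machine learning tutorials"))
# machine%20learning%20tutorials
```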

## Data Schema

Both streams return records following the same JSON schema structure:

```python { .api }
from typing import Any, Dict, TypedDict


class ContentRecord(TypedDict):
    """
    Data record structure returned by both reader and search streams.

    This schema applies to both streams, representing extracted content
    with metadata and optional link/image summaries.
    """
    title: str  # Page or result title
    url: str  # Source URL of the content
    content: str  # Main extracted text content
    description: str  # Brief description or summary
    links: Dict[str, Any]  # Optional links summary object
```
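The record selector reads from the `data` field of the API response, and each record is expected to carry the string fields listed in the schema. The sketch below illustrates that selection step on an already decoded response body. It is an illustration only: `select_records` and `missing_fields` are hypothetical helpers, and the idea that `data` may hold either a single object or a list of objects is an assumption, not something this document confirms.

```python
from typing import Any, Dict, List

# String fields the schema above lists for every record.
EXPECTED_FIELDS = ("title", "url", "content", "description")


def select_records(response_body: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the records found under the "data" key of a decoded response."""
    data = response_body.get("data")
    if data is None:
        return []
    # Assumption: "data" may hold one object or a list of objects.
    return data if isinstance(data, list) else [data]


def missing_fields(record: Dict[str, Any]) -> List[str]:
    """List the expected fields that a record lacks."""
    return [name for name in EXPECTED_FIELDS if name not in record]


sample = {"data": {"title": "T", "url": "https://example.com", "content": "C"}}
records = select_records(sample)
print(missing_fields(records[0]))  # ['description']
```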

### Schema Details

**title (string)**
- Page title for reader stream
- Search result title for search stream
- Always present in response

**url (string)**
- Original URL for reader stream
- Result URL for search stream
- Always present in response

**content (string)**
- Extracted text content from the page/result
- Main content body processed by Jina AI
- Always present in response

**description (string)**
- Brief description or summary of the content
- Generated by Jina AI's processing
- Always present in response

**links (object)**
- Additional properties with dynamic structure
- Contains link summaries when `gather_links` is `true`
- Structure varies based on content and API processing
- May include nested properties like "More information..."

## Stream Usage Examples

### Reader Stream Configuration

```json
{
  "api_key": "jina_your_api_key",
  "read_prompt": "https://news.example.com/article",
  "search_prompt": "placeholder",
  "gather_links": true,
  "gather_images": false
}
```

**Expected Output:**
```json
{
  "title": "Breaking News: AI Advances in 2024",
  "url": "https://news.example.com/article",
  "content": "Artificial intelligence continues to advance rapidly in 2024...",
  "description": "Latest developments in AI technology and their impact on industry",
  "links": {
    "More information...": "https://related-article.com"
  }
}
```

### Search Stream Configuration

```json
{
  "api_key": "jina_your_api_key",
  "read_prompt": "placeholder",
  "search_prompt": "machine%20learning%20tutorials",
  "gather_links": false,
  "gather_images": true
}
```

**Expected Output:**
```json
{
  "title": "Complete Guide to Machine Learning",
  "url": "https://ml-tutorials.com/guide",
  "content": "This comprehensive guide covers machine learning fundamentals...",
  "description": "Step-by-step machine learning tutorial for beginners",
  "links": {}
}
```

## Stream Discovery and Catalogs

Both streams are discoverable through Airbyte's standard discovery process:

```bash
# Discover available streams
poetry run source-jina-ai-reader discover --config config.json
```

**Discovery Output Structure:**
```json
{
  "streams": [
    {
      "name": "reader",
      "json_schema": {
        "type": "object",
        "properties": {
          "title": {"type": "string"},
          "url": {"type": "string"},
          "content": {"type": "string"},
          "description": {"type": "string"},
          "links": {"type": "object", "additionalProperties": true}
        }
      }
    },
    {
      "name": "search",
      "json_schema": {
        "type": "object",
        "properties": {
          "title": {"type": "string"},
          "url": {"type": "string"},
          "content": {"type": "string"},
          "description": {"type": "string"},
          "links": {"type": "object", "additionalProperties": true}
        }
      }
    }
  ]
}
```
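As a quick way to inspect a catalog of this shape, the sketch below maps each stream name to the property names in its schema. It assumes the input has exactly the structure shown above (a top-level `streams` list); the embedded `catalog_text` is a trimmed stand-in for real output, and `stream_fields` is a hypothetical helper.

```python
import json
from typing import Any, Dict, List

# Trimmed stand-in for the discovery output structure shown above.
catalog_text = """
{
  "streams": [
    {"name": "reader",
     "json_schema": {"type": "object", "properties": {
       "title": {"type": "string"},
       "links": {"type": "object", "additionalProperties": true}}}},
    {"name": "search",
     "json_schema": {"type": "object", "properties": {
       "title": {"type": "string"}}}}
  ]
}
"""


def stream_fields(catalog: Dict[str, Any]) -> Dict[str, List[str]]:
    """Map each stream name to the sorted property names of its JSON schema."""
    return {
        stream["name"]: sorted(stream["json_schema"]["properties"])
        for stream in catalog["streams"]
    }


fields = stream_fields(json.loads(catalog_text))
print(fields)  # {'reader': ['links', 'title'], 'search': ['title']}
```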

## Integration Patterns

### Single Stream Usage

To use only one stream, supply a real prompt for that stream and a placeholder for the other:

```python
# Reader-only configuration
reader_config = {
    "read_prompt": "https://target-website.com",
    "search_prompt": "placeholder",  # Not used
    "gather_links": True,
}

# Search-only configuration
search_config = {
    "read_prompt": "placeholder",  # Not used
    "search_prompt": "your%20search%20terms",
    "gather_images": True,
}
```

### Dual Stream Usage

Use both streams for comprehensive content analysis:

```python
config = {
    "read_prompt": "https://company.com/about",
    "search_prompt": "company%20name%20news",
    "gather_links": True,
    "gather_images": True,
}
```
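The connector's CLI reads its settings from a JSON file passed with `--config`, so a Python dictionary like the ones above has to be serialized first. The sketch below shows one way to do that with the standard library. Writing to a temporary directory is a choice made only to keep the example self-contained; a real run would presumably write `config.json` wherever the connector is invoked.

```python
import json
import tempfile
from pathlib import Path

config = {
    "read_prompt": "https://company.com/about",
    "search_prompt": "company%20name%20news",
    "gather_links": True,
    "gather_images": True,
}

# Temporary location for the example; adjust the path for a real run.
config_path = Path(tempfile.mkdtemp()) / "config.json"
# json.dumps converts Python's True/False into JSON's true/false.
config_path.write_text(json.dumps(config, indent=2))

# Read the file back to confirm it round-trips unchanged.
loaded = json.loads(config_path.read_text())
```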

## Error Handling and Limitations

- **API Rate Limits**: Requests are subject to Jina AI's rate limiting
- **Content Processing**: Large pages may be truncated or summarized
- **URL Encoding**: Search prompts must be properly URL-encoded
- **Authentication**: Some features may require a valid API key
- **Response Size**: Large responses may impact performance
- **Network Dependencies**: Requires internet access to Jina AI's APIs
