
# Data Streams

The connector exposes two data streams, "reader" and "search", which wrap Jina AI's Reader and Search APIs and deliver web content extraction and search results as structured Airbyte data streams.

## Stream Overview

Both streams share common characteristics:

- **Output Format**: JSON records with consistent schema
- **Authentication**: Optional Bearer token via api_key
- **Pagination**: No pagination (single request per stream)
- **Record Selection**: Data extracted from "data" field in API response
- **Configuration**: Defined declaratively in manifest.yaml using Airbyte Low-Code CDK
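
The record selection rule in the list above can be sketched in plain Python. This is only an illustration of pulling records out of the "data" field, not the connector's actual extractor: the helper name `select_records`, the sample response bodies, and the assumption that "data" may hold either a single object or a list of objects are all hypothetical.

```python
from typing import Any, Dict, List


def select_records(response_body: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the records found under the top-level "data" field."""
    data = response_body.get("data")
    # Assumption: a list yields one record per item, a single object yields one record.
    if isinstance(data, list):
        return [item for item in data if isinstance(item, dict)]
    if isinstance(data, dict):
        return [data]
    # Missing or unexpected "data" values produce no records.
    return []


# Hypothetical response bodies, shaped only for illustration.
reader_body = {"data": {"title": "Example", "url": "https://example.com"}}
search_body = {"data": [{"title": "A"}, {"title": "B"}]}

print(select_records(reader_body))
print(len(select_records(search_body)))
```

Returning an empty list for a missing "data" field keeps the caller's loop simple; a stricter variant could raise an error instead.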

## Manifest Configuration

Streams are defined in the manifest.yaml file using Airbyte's declarative framework:

```yaml
# Stream definitions
streams:
  - "#/definitions/reader_stream"
  - "#/definitions/search_stream"

# Reader stream definition
reader_stream:
  type: DeclarativeStream
  name: "reader"
  retriever:
    type: SimpleRetriever
    requester:
      type: CustomRequester
      class_name: source_jina_ai_reader.components.JinaAiHttpRequester
      url_base: "https://r.jina.ai/{{ config['read_prompt'] }}"
      http_method: "GET"

# Search stream definition
search_stream:
  type: DeclarativeStream
  name: "search"
  retriever:
    type: SimpleRetriever
    requester:
      type: CustomRequester
      class_name: source_jina_ai_reader.components.JinaAiHttpRequester
      url_base: "https://s.jina.ai/{{ config['search_prompt'] }}"
      http_method: "GET"
```

## Capabilities

### Reader Stream

Extracts and processes content from specified URLs using Jina AI's Reader API.

**Stream Configuration:**
- **Name**: "reader"
- **Endpoint**: `https://r.jina.ai/{read_prompt}`
- **Method**: GET
- **Purpose**: Read and extract content from web pages

```python { .api }
# Reader stream access (via Airbyte framework)
# Stream is configured declaratively in manifest.yaml
from typing import Optional, TypedDict

class ReaderStreamConfig(TypedDict):
    """Configuration for the reader stream."""
    read_prompt: str  # URL to read content from
    api_key: Optional[str]  # Optional API key for authentication
    gather_links: bool  # Include links summary in response
    gather_images: bool  # Include images summary in response
```

**Request Headers:**
```python
{
    "Accept": "application/json",
    "X-With-Links-Summary": str(gather_links),  # "true" or "false"
    "X-With-Images-Summary": str(gather_images),  # "true" or "false"
    "Authorization": f"Bearer {api_key}"  # Only if api_key provided
}
```
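
As a minimal sketch of how the headers above could be assembled, the helper below builds the dictionary and adds `Authorization` only when a key is present. The function name `build_request_headers` is hypothetical and this is not the connector's `JinaAiHttpRequester` code. It assumes the lowercase "true"/"false" values named in the comments above; note that Python's `str(True)` would yield "True" instead.

```python
from typing import Dict, Optional


def build_request_headers(
    gather_links: bool,
    gather_images: bool,
    api_key: Optional[str] = None,
) -> Dict[str, str]:
    """Assemble request headers in the documented shape (illustrative helper)."""
    headers = {
        "Accept": "application/json",
        # Lowercase on purpose: str(True) would produce "True", not "true".
        "X-With-Links-Summary": "true" if gather_links else "false",
        "X-With-Images-Summary": "true" if gather_images else "false",
    }
    # The Authorization header is added only when a key is configured.
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return headers


print(build_request_headers(True, False))
print(build_request_headers(False, True, api_key="jina_your_api_key"))
```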

### Search Stream

Performs web searches and returns structured results using Jina AI's Search API.

**Stream Configuration:**
- **Name**: "search"
- **Endpoint**: `https://s.jina.ai/{search_prompt}`
- **Method**: GET
- **Purpose**: Search the web and return structured results

```python { .api }
# Search stream access (via Airbyte framework)
# Stream is configured declaratively in manifest.yaml
from typing import Optional, TypedDict

class SearchStreamConfig(TypedDict):
    """Configuration for the search stream."""
    search_prompt: str  # URL-encoded search query
    api_key: Optional[str]  # Optional API key for authentication
    gather_links: bool  # Include links summary in response
    gather_images: bool  # Include images summary in response
```

**Request Headers:**
```python
{
    "Accept": "application/json",
    "X-With-Links-Summary": str(gather_links),  # "true" or "false"
    "X-With-Images-Summary": str(gather_images),  # "true" or "false"
    "Authorization": f"Bearer {api_key}"  # Only if api_key provided
}
```

## Data Schema

Both streams return records following the same JSON schema structure:

```python { .api }
from typing import Any, Dict, TypedDict

class ContentRecord(TypedDict):
    """
    Data record structure returned by both reader and search streams.

    This schema applies to both streams, representing extracted content
    with metadata and optional link/image summaries.
    """
    title: str  # Page or result title
    url: str  # Source URL of the content
    content: str  # Main extracted text content
    description: str  # Brief description or summary
    links: Dict[str, Any]  # Optional links summary object
```

### Schema Details

**title (string)**
- Page title for reader stream
- Search result title for search stream
- Always present in response

**url (string)**
- Original URL for reader stream
- Result URL for search stream
- Always present in response

**content (string)**
- Extracted text content from the page/result
- Main content body processed by Jina AI
- Always present in response

**description (string)**
- Brief description or summary of the content
- Generated by Jina AI's processing
- Always present in response

**links (object)**
- Additional properties with dynamic structure
- Contains link summaries when gather_links=true
- Structure varies based on content and API processing
- May include nested properties like "More information..."
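
The field rules above can be checked with a few lines of Python. The sketch below is a hypothetical helper (`find_schema_problems` is not part of the connector) and assumes the four string fields are required while "links" is optional and, when present, must be an object.

```python
from typing import Any, Dict, List

# Fields described above as "Always present in response".
REQUIRED_STRING_FIELDS = ("title", "url", "content", "description")


def find_schema_problems(record: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the record fits."""
    problems = []
    for field in REQUIRED_STRING_FIELDS:
        if not isinstance(record.get(field), str):
            problems.append(f"{field}: expected a string")
    # Assumption: "links" may be absent, but if present it must be an object.
    if "links" in record and not isinstance(record["links"], dict):
        problems.append("links: expected an object")
    return problems


record = {
    "title": "Complete Guide to Machine Learning",
    "url": "https://ml-tutorials.com/guide",
    "content": "This comprehensive guide covers machine learning fundamentals...",
    "description": "Step-by-step machine learning tutorial for beginners",
    "links": {},
}
print(find_schema_problems(record))
```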

## Stream Usage Examples

### Reader Stream Configuration

```json
{
  "api_key": "jina_your_api_key",
  "read_prompt": "https://news.example.com/article",
  "search_prompt": "placeholder",
  "gather_links": true,
  "gather_images": false
}
```

**Expected Output:**
```json
{
  "title": "Breaking News: AI Advances in 2024",
  "url": "https://news.example.com/article",
  "content": "Artificial intelligence continues to advance rapidly in 2024...",
  "description": "Latest developments in AI technology and their impact on industry",
  "links": {
    "More information...": "https://related-article.com"
  }
}
```

### Search Stream Configuration

```json
{
  "api_key": "jina_your_api_key",
  "read_prompt": "placeholder",
  "search_prompt": "machine%20learning%20tutorials",
  "gather_links": false,
  "gather_images": true
}
```

**Expected Output:**
```json
{
  "title": "Complete Guide to Machine Learning",
  "url": "https://ml-tutorials.com/guide",
  "content": "This comprehensive guide covers machine learning fundamentals...",
  "description": "Step-by-step machine learning tutorial for beginners",
  "links": {}
}
```

## Stream Discovery and Catalogs

Both streams are discoverable through Airbyte's standard discovery process:

```bash
# Discover available streams
poetry run source-jina-ai-reader discover --config config.json
```

**Discovery Output Structure:**
```json
{
  "streams": [
    {
      "name": "reader",
      "json_schema": {
        "type": "object",
        "properties": {
          "title": {"type": "string"},
          "url": {"type": "string"},
          "content": {"type": "string"},
          "description": {"type": "string"},
          "links": {"type": "object", "additionalProperties": true}
        }
      }
    },
    {
      "name": "search",
      "json_schema": {
        "type": "object",
        "properties": {
          "title": {"type": "string"},
          "url": {"type": "string"},
          "content": {"type": "string"},
          "description": {"type": "string"},
          "links": {"type": "object", "additionalProperties": true}
        }
      }
    }
  ]
}
```
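
A short sketch of reading the stream names out of discovery output follows. The helper `stream_names` is hypothetical. It accepts the bare structure shown above and, as an assumption about how the CLI prints its result, also a payload that nests the same structure under a "catalog" key.

```python
import json
from typing import Any, Dict, List


def stream_names(discovery_json: str) -> List[str]:
    """Extract stream names from a discovery payload (illustrative helper)."""
    payload: Dict[str, Any] = json.loads(discovery_json)
    # Assumption: the catalog may be nested under "catalog"; otherwise use the payload itself.
    catalog = payload.get("catalog", payload)
    return [stream["name"] for stream in catalog.get("streams", [])]


sample = json.dumps({"streams": [{"name": "reader"}, {"name": "search"}]})
print(stream_names(sample))
```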

## Integration Patterns

### Single Stream Usage

To use only one stream, set that stream's prompt to a real value and give the other prompt a placeholder:

```python
# Reader-only configuration
reader_config = {
    "read_prompt": "https://target-website.com",
    "search_prompt": "placeholder",  # Not used
    "gather_links": True
}

# Search-only configuration
search_config = {
    "read_prompt": "placeholder",  # Not used
    "search_prompt": "your%20search%20terms",
    "gather_images": True
}
```

### Dual Stream Usage

Use both streams for comprehensive content analysis:

```python
config = {
    "read_prompt": "https://company.com/about",
    "search_prompt": "company%20name%20news",
    "gather_links": True,
    "gather_images": True
}
```

## Error Handling and Limitations

- **API Rate Limits**: Requests are subject to Jina AI's rate limiting
- **Content Processing**: Large pages may be truncated or summarized
- **URL Encoding**: Search prompts must be URL-encoded (for example, `machine%20learning%20tutorials`)
- **Authentication**: Some features may require a valid API key
- **Response Size**: Large responses may impact performance
- **Network Dependencies**: Requires internet access to Jina AI APIs
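
The URL-encoding requirement can be met with Python's standard library. The sketch below is one way to prepare a `search_prompt` value; the helper names are hypothetical, and joining the encoded query directly onto `https://s.jina.ai/` simply mirrors the endpoint pattern documented earlier.

```python
from urllib.parse import quote

SEARCH_URL_BASE = "https://s.jina.ai/"


def encode_search_prompt(query: str) -> str:
    """Percent-encode a free-text query for use as search_prompt."""
    # safe="" also encodes "/" so the query stays a single path segment.
    return quote(query, safe="")


def build_search_url(query: str) -> str:
    """Join the encoded query onto the search endpoint."""
    return SEARCH_URL_BASE + encode_search_prompt(query)


print(encode_search_prompt("machine learning tutorials"))  # machine%20learning%20tutorials
print(build_search_url("company name news"))
```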