# File Operations

File discovery, enumeration, and reading capabilities for Microsoft OneDrive files including support for nested folder structures, glob pattern matching, shared items access, and efficient streaming with metadata extraction.

## Capabilities

### Stream Reader

Primary class for handling file operations across OneDrive drives and shared items with lazy initialization and caching.

```python { .api }
class SourceMicrosoftOneDriveStreamReader(AbstractFileBasedStreamReader):
    ROOT_PATH: List[str] = [".", "/"]

    def __init__(self):
        """Initialize the stream reader with lazy-loaded clients."""

    @property
    def config(self) -> SourceMicrosoftOneDriveSpec:
        """Get the current configuration."""

    @config.setter
    def config(self, value: SourceMicrosoftOneDriveSpec):
        """
        Set configuration with type validation.

        Parameters:
        - value: SourceMicrosoftOneDriveSpec - Must be valid configuration spec
        """

    @property
    def auth_client(self):
        """Lazy initialization of the authentication client."""

    @property
    def one_drive_client(self):
        """Lazy initialization of the Microsoft Graph client."""

    def get_access_token(self):
        """Directly fetch a new access token from the auth_client."""

    @property
    def drives(self):
        """
        Retrieves and caches OneDrive drives, including the user's drive.
        Filters to only personal and business drive types.

        Returns:
        List of OneDrive drive objects accessible to authenticated user
        """
```
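
The lazy-initialization behaviour described for `auth_client` and `one_drive_client` can be illustrated with a minimal, self-contained sketch. `LazyClientHolder`, `make_client`, and `factory_calls` are hypothetical names introduced only for this illustration; the connector's real properties may be implemented differently.

```python
from typing import Any, Callable, Optional


class LazyClientHolder:
    """Hypothetical stand-in for the reader's lazily created clients."""

    def __init__(self, client_factory: Callable[[], Any]) -> None:
        self._client_factory = client_factory
        self._client: Optional[Any] = None  # nothing is built at construction time

    @property
    def client(self) -> Any:
        # Build the client on first access, then reuse the same instance.
        if self._client is None:
            self._client = self._client_factory()
        return self._client


factory_calls = []


def make_client() -> dict:
    """Hypothetical factory; a real one would authenticate against Microsoft Graph."""
    factory_calls.append("created")
    return {"name": "graph-client"}


holder = LazyClientHolder(make_client)
first = holder.client   # the factory runs here, not in __init__
second = holder.client  # reuses the instance built above
```

Deferring construction this way means that creating a reader costs nothing until a client is actually needed, which is presumably why the class exposes its clients as properties.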

### File Discovery

Methods for discovering and filtering files across different OneDrive locations.

```python { .api }
def get_matching_files(
    self,
    globs: List[str],
    prefix: Optional[str],
    logger: logging.Logger
) -> Iterable[RemoteFile]:
    """
    Retrieve all files matching the specified glob patterns in OneDrive.
    Handles the special case where the drive might be empty by catching StopIteration.

    Parameters:
    - globs: List[str] - Glob patterns to match files against
    - prefix: Optional[str] - Optional prefix filter (not used in OneDrive implementation)
    - logger: logging.Logger - Logger for operation tracking

    Returns:
    Iterable[RemoteFile]: Iterator of MicrosoftOneDriveRemoteFile objects

    Raises:
    - AirbyteTracedException: If drive is empty or does not exist

    Implementation:
    Uses a special approach to handle empty drives by checking for StopIteration
    from the files generator and yielding files in two phases.
    """

def get_all_files(self):
    """
    Generator yielding all accessible files based on search scope configuration.
    Handles both accessible drives and shared items based on search_scope setting.

    Yields:
    Tuple[str, str, datetime]: File path, download URL, and last modified time
    """

def get_files_by_drive_name(self, drive_name: str, folder_path: str):
    """
    Yields files from the specified drive and folder path.

    Parameters:
    - drive_name: str - Name of the OneDrive drive to search
    - folder_path: str - Path within the drive to search

    Yields:
    Tuple[str, str, str]: File path, download URL, and last modified datetime string
    """
```
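
The two-phase approach mentioned in the `get_matching_files` docstring (detect an empty source via `StopIteration`, then yield the first file followed by the rest) might look roughly like the sketch below. `matching_files`, `EmptyDriveError`, and `FileInfo` are hypothetical names, and `fnmatchcase` from the standard library is an assumption standing in for whatever glob matcher the connector actually uses.

```python
from fnmatch import fnmatchcase
from itertools import chain
from typing import Iterable, Iterator, List, Tuple

FileInfo = Tuple[str, str, str]  # (path, download_url, last_modified)


class EmptyDriveError(Exception):
    """Hypothetical stand-in for AirbyteTracedException."""


def matching_files(all_files: Iterable[FileInfo], globs: List[str]) -> Iterator[FileInfo]:
    files = iter(all_files)
    # Phase 1: pull the first entry so an empty source is detected explicitly.
    try:
        first = next(files)
    except StopIteration as exc:
        raise EmptyDriveError("Drive is empty or does not exist.") from exc

    # Phase 2: replay the first entry, then stream the rest lazily,
    # keeping only paths that match at least one glob pattern.
    for info in chain([first], files):
        if any(fnmatchcase(info[0], pattern) for pattern in globs):
            yield info


sample = [
    ("Documents/report.pdf", "https://example.invalid/1", "2024-01-01T00:00:00Z"),
    ("Documents/notes.txt", "https://example.invalid/2", "2024-01-02T00:00:00Z"),
]
pdf_paths = [info[0] for info in matching_files(sample, ["*.pdf"])]
```

Because `matching_files` is itself a generator, the empty-drive error surfaces when the caller starts iterating, not when the function is called.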

### File Reading

Methods for opening and reading OneDrive files with proper encoding support.

```python { .api }
def open_file(
    self,
    file: RemoteFile,
    mode: FileReadMode,
    encoding: Optional[str],
    logger: logging.Logger
) -> IOBase:
    """
    Open a OneDrive file for reading using smart-open.

    Parameters:
    - file: RemoteFile - File object with download URL
    - mode: FileReadMode - File reading mode (typically READ)
    - encoding: Optional[str] - Text encoding (e.g., 'utf-8', 'latin-1')
    - logger: logging.Logger - Logger for error tracking

    Returns:
    IOBase: Opened file-like object for reading

    Raises:
    - Exception: If file cannot be opened or accessed
    """
```

### Directory Operations

Methods for recursive directory traversal and file enumeration.

```python { .api }
def list_directories_and_files(self, root_folder, path: Optional[str] = None):
    """
    Enumerates folders and files starting from a root folder recursively.

    Parameters:
    - root_folder: OneDrive folder object to start enumeration from
    - path: Optional[str] - Current path for building full file paths

    Returns:
    List[Tuple[str, str, str]]: List of (file_path, download_url, last_modified)
    """
```
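
As a rough illustration of the recursive enumeration described above, the sketch below walks an in-memory folder tree and builds full paths as it descends. `Item` and `list_files` are hypothetical; the real method operates on Microsoft Graph folder objects, which this example does not attempt to model.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Item:
    """Hypothetical in-memory stand-in for a OneDrive drive item."""

    name: str
    download_url: str = ""
    last_modified: str = ""
    children: Optional[List["Item"]] = None  # None marks a file, a list marks a folder


def list_files(folder: Item, path: Optional[str] = None) -> List[Tuple[str, str, str]]:
    results: List[Tuple[str, str, str]] = []
    for child in folder.children or []:
        child_path = f"{path}/{child.name}" if path else child.name
        if child.children is not None:
            # Folder: recurse, carrying the extended path down.
            results.extend(list_files(child, child_path))
        else:
            # File: record (file_path, download_url, last_modified).
            results.append((child_path, child.download_url, child.last_modified))
    return results


root = Item(
    name="root",
    children=[
        Item("a.txt", "https://example.invalid/a", "2024-01-01T00:00:00Z"),
        Item("sub", children=[Item("b.csv", "https://example.invalid/b", "2024-01-02T00:00:00Z")]),
        Item("empty", children=[]),
    ],
)
found = list_files(root)
```

Here `found` contains entries for `a.txt` and `sub/b.csv`; the empty folder contributes nothing.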

### Shared Items Access

Methods for accessing files shared with the authenticated user.

```python { .api }
def _get_shared_files_from_all_drives(self, parsed_drive_id: str):
    """
    Get files from shared items across all drives.

    Parameters:
    - parsed_drive_id: str - Drive ID to exclude from results to avoid duplicates

    Yields:
    Tuple[str, str, datetime]: File path, download URL, and last modified time
    """

def _get_shared_drive_object(self, drive_id: str, object_id: str, path: str) -> List[Tuple[str, str, datetime]]:
    """
    Retrieves a list of all nested files under the specified shared object.

    Parameters:
    - drive_id: str - The ID of the drive containing the object
    - object_id: str - The ID of the object to start the search from
    - path: str - Base path for building file paths

    Returns:
    List[Tuple[str, str, datetime]]: File information tuples

    Raises:
    - RuntimeError: If an error occurs during the Microsoft Graph API request
    """
```

### Remote File Model

File representation with OneDrive-specific attributes.

```python { .api }
class MicrosoftOneDriveRemoteFile(RemoteFile):
    download_url: str
    """Direct download URL from Microsoft Graph API for file content access."""
```

## Usage Examples

### Basic File Discovery

```python
from source_microsoft_onedrive.stream_reader import SourceMicrosoftOneDriveStreamReader
from source_microsoft_onedrive.spec import SourceMicrosoftOneDriveSpec
import logging

# Configure stream reader
config = SourceMicrosoftOneDriveSpec(**{
    "credentials": {
        "auth_type": "Client",
        "tenant_id": "your-tenant-id",
        "client_id": "your-client-id",
        "client_secret": "your-client-secret",
        "refresh_token": "your-refresh-token"
    },
    "drive_name": "OneDrive",
    "search_scope": "ACCESSIBLE_DRIVES",
    "folder_path": "Documents"
})

reader = SourceMicrosoftOneDriveStreamReader()
reader.config = config

# Get files matching patterns
logger = logging.getLogger(__name__)
files = reader.get_matching_files(["*.pdf", "*.docx"], None, logger)

for file in files:
    print(f"File: {file.uri}, Modified: {file.last_modified}")
```

### Reading File Content

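```python
from airbyte_cdk.sources.file_based.file_based_stream_reader import FileReadMode

# The iterable from the previous example may already be exhausted, so request
# the files again; text files are used here because they are decoded as UTF-8.
files = reader.get_matching_files(["*.txt"], None, logger)

# Open and read each file
for file in files:
    with reader.open_file(file, FileReadMode.READ, "utf-8", logger) as f:
        content = f.read()
        print(f"Content length: {len(content)}")
```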

### Accessing All Files

```python
# Get all files based on search scope
all_files = reader.get_all_files()

for file_path, download_url, last_modified in all_files:
    print(f"Path: {file_path}")
    print(f"URL: {download_url}")
    print(f"Modified: {last_modified}")
    print("---")
```

### Error Handling Example

```python
from airbyte_cdk import AirbyteTracedException

try:
    files = reader.get_matching_files(["*.txt"], None, logger)
    file_list = list(files)  # Convert iterator to list
    print(f"Found {len(file_list)} files")
except AirbyteTracedException as e:
    message = e.message or ""  # message is optional and may be None
    if "empty or does not exist" in message:
        print("Drive is empty or inaccessible")
    else:
        print(f"Error: {message}")
```

### Drive Information

```python
# Access available drives
drives = reader.drives

for drive in drives:
    print(f"Drive: {drive.name}, Type: {drive.drive_type}, ID: {drive.id}")
```

## Search Scope Behavior

- **ACCESSIBLE_DRIVES**: Searches only the drive named by `drive_name`, starting from `folder_path`
- **SHARED_ITEMS**: Searches only shared items across all drives (ignores `folder_path`)
- **ALL**: Searches both the accessible drive (using `folder_path`) and all shared items
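
The three scopes above amount to a small dispatch rule. The following sketch shows one way to express it; `files_for_scope` and its callable parameters are hypothetical and only mirror the behaviour listed, not the connector's actual code.

```python
from typing import Callable, Iterable, Iterator, Tuple

FileInfo = Tuple[str, str, str]  # (path, download_url, last_modified)

VALID_SCOPES = ("ACCESSIBLE_DRIVES", "SHARED_ITEMS", "ALL")


def files_for_scope(
    search_scope: str,
    drive_files: Callable[[], Iterable[FileInfo]],
    shared_files: Callable[[], Iterable[FileInfo]],
) -> Iterator[FileInfo]:
    if search_scope not in VALID_SCOPES:
        raise ValueError(f"Unknown search scope: {search_scope!r}")
    # Files from the configured drive and folder.
    if search_scope in ("ACCESSIBLE_DRIVES", "ALL"):
        yield from drive_files()
    # Files shared with the user; folder_path plays no role here.
    if search_scope in ("SHARED_ITEMS", "ALL"):
        yield from shared_files()


def own_files() -> Iterable[FileInfo]:
    return [("Documents/a.txt", "https://example.invalid/a", "2024-01-01T00:00:00Z")]


def shared() -> Iterable[FileInfo]:
    return [("Shared/b.txt", "https://example.invalid/b", "2024-01-02T00:00:00Z")]


all_paths = [info[0] for info in files_for_scope("ALL", own_files, shared)]
```

Passing the two sources as callables keeps the unused one from being queried at all, so a `SHARED_ITEMS` run never touches the drive listing.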

## File Metadata

Each discovered file includes:
- **File Path**: Relative path from search root
- **Download URL**: Direct Microsoft Graph API download URL
- **Last Modified**: Timestamp of last file modification
- **File Size**: Available through RemoteFile base class

## Error Handling

File operations include comprehensive error handling:
- **Authentication Errors**: Token refresh and permission issues
- **API Rate Limits**: Automatic retry with appropriate backoff
- **File Access Errors**: Graceful handling of missing or inaccessible files
- **Network Issues**: Retry logic for transient network failures
- **Drive Access**: Clear error messages for empty or non-existent drives
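
The retry-with-backoff behaviour mentioned above is not specified in detail here, so the following is only a generic sketch of exponential backoff, not the connector's implementation. `retry_with_backoff`, its defaults, and the choice of retryable exception types are all assumptions made for illustration.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    retry_on: Tuple[Type[Exception], ...] = (ConnectionError, TimeoutError),
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, retrying on the listed errors with doubling delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    attempt = 1
    while True:
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
            attempt += 1


state = {"calls": 0}


def flaky_request() -> str:
    """Hypothetical operation that fails twice before succeeding."""
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


delays = []
result = retry_with_backoff(flaky_request, sleep=delays.append)
```

Injecting `sleep` keeps the example fast to run and makes the delay schedule observable; in real use the default `time.sleep` would apply.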

## Performance Optimizations

- **Lazy Initialization**: Clients are only created when needed
- **Caching**: Drive information is cached using @lru_cache
- **Streaming**: Files are opened as streams to handle large files efficiently
- **Batch Operations**: Bulk file discovery operations where possible
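
To show what caching with `@lru_cache` means in practice, here is a minimal sketch that stacks `functools.lru_cache` under `@property`. `DriveCatalog` and `fetch_count` are hypothetical; the counter stands in for a round trip to Microsoft Graph, and the real `drives` property may differ in detail.

```python
from functools import lru_cache
from typing import List


class DriveCatalog:
    """Hypothetical class; fetch_count stands in for a Microsoft Graph request."""

    def __init__(self) -> None:
        self.fetch_count = 0

    @property
    @lru_cache(maxsize=None)
    def drives(self) -> List[str]:
        # Runs once per instance; later accesses return the cached list.
        self.fetch_count += 1
        return ["personal", "business"]


catalog = DriveCatalog()
first_lookup = catalog.drives
second_lookup = catalog.drives  # served from the cache, no second fetch
```

One trade-off worth knowing: the cache is keyed on the instance, so it holds a reference to each `DriveCatalog` for as long as the cache lives.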