# File Operations

File discovery, enumeration, and reading capabilities for Microsoft OneDrive files, including support for nested folder structures, glob pattern matching, shared items access, and efficient streaming with metadata extraction.

## Capabilities

### Stream Reader

Primary class for handling file operations across OneDrive drives and shared items with lazy initialization and caching.

```python { .api }
class SourceMicrosoftOneDriveStreamReader(AbstractFileBasedStreamReader):
    ROOT_PATH: List[str] = [".", "/"]

    def __init__(self):
        """Initialize the stream reader with lazy-loaded clients."""

    @property
    def config(self) -> SourceMicrosoftOneDriveSpec:
        """Get the current configuration."""

    @config.setter
    def config(self, value: SourceMicrosoftOneDriveSpec):
        """
        Set configuration with type validation.

        Parameters:
        - value: SourceMicrosoftOneDriveSpec - Must be valid configuration spec
        """

    @property
    def auth_client(self):
        """Lazy initialization of the authentication client."""

    @property
    def one_drive_client(self):
        """Lazy initialization of the Microsoft Graph client."""

    def get_access_token(self):
        """Directly fetch a new access token from the auth_client."""

    @property
    def drives(self):
        """
        Retrieves and caches OneDrive drives, including the user's drive.
        Filters to only personal and business drive types.

        Returns:
        List of OneDrive drive objects accessible to authenticated user
        """
```
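
The lazy initialization and caching described above can be illustrated with a minimal, self-contained sketch. `FakeGraphClient` and `LazyReader` are hypothetical stand-ins invented for this example, not the connector's classes; the sketch only assumes that the client is built on first access and that the drive list is computed once and then reused.

```python
from functools import lru_cache
from typing import List, Optional


class FakeGraphClient:
    """Hypothetical stand-in for a Microsoft Graph client."""

    def __init__(self) -> None:
        self.calls = 0  # counts how often the drive listing is requested

    def list_drives(self) -> List[dict]:
        self.calls += 1
        return [
            {"name": "OneDrive", "drive_type": "business"},
            {"name": "Team Site", "drive_type": "documentLibrary"},
        ]


class LazyReader:
    """Illustrative reader; not the connector's actual implementation."""

    def __init__(self) -> None:
        self._client: Optional[FakeGraphClient] = None  # created on first use

    @property
    def client(self) -> FakeGraphClient:
        if self._client is None:
            self._client = FakeGraphClient()
        return self._client

    @property
    @lru_cache(maxsize=None)
    def drives(self) -> List[dict]:
        # Keep only personal and business drives, as the docstring describes.
        allowed = ("personal", "business")
        return [d for d in self.client.list_drives() if d["drive_type"] in allowed]


reader = LazyReader()
first = reader.drives   # triggers client creation and one listing call
second = reader.drives  # served from the cache, no further calls
```

Stacking `@property` on `@lru_cache` caches the result per instance, but the cache also keeps a reference to that instance, which is usually acceptable for a single long-lived reader.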

### File Discovery

Methods for discovering and filtering files across different OneDrive locations.

```python { .api }
def get_matching_files(
    self,
    globs: List[str],
    prefix: Optional[str],
    logger: logging.Logger
) -> Iterable[RemoteFile]:
    """
    Retrieve all files matching the specified glob patterns in OneDrive.
    Handles the special case where the drive might be empty by catching StopIteration.

    Parameters:
    - globs: List[str] - Glob patterns to match files against
    - prefix: Optional[str] - Optional prefix filter (not used in OneDrive implementation)
    - logger: logging.Logger - Logger for operation tracking

    Returns:
    Iterable[RemoteFile]: Iterator of MicrosoftOneDriveRemoteFile objects

    Raises:
    - AirbyteTracedException: If drive is empty or does not exist

    Implementation:
    Uses a special approach to handle empty drives by checking for StopIteration
    from the files generator and yielding files in two phases.
    """

def get_all_files(self):
    """
    Generator yielding all accessible files based on search scope configuration.
    Handles both accessible drives and shared items based on search_scope setting.

    Yields:
    Tuple[str, str, datetime]: File path, download URL, and last modified time
    """

def get_files_by_drive_name(self, drive_name: str, folder_path: str):
    """
    Yields files from the specified drive and folder path.

    Parameters:
    - drive_name: str - Name of the OneDrive drive to search
    - folder_path: str - Path within the drive to search

    Yields:
    Tuple[str, str, str]: File path, download URL, and last modified datetime string
    """
```
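
The two-phase approach mentioned in the `get_matching_files` docstring can be sketched as follows. This is an illustration of the general technique rather than the connector's code: `EmptyDriveError` is a hypothetical stand-in for `AirbyteTracedException`, and `iter_non_empty` is a name invented for this example.

```python
from typing import Iterable, Iterator, Tuple

FileInfo = Tuple[str, str, str]  # (path, download URL, last modified)


class EmptyDriveError(Exception):
    """Hypothetical stand-in for AirbyteTracedException."""


def iter_non_empty(files: Iterator[FileInfo]) -> Iterable[FileInfo]:
    """Yield every file, failing loudly if the source yields nothing."""
    # Phase 1: pull the first item to find out whether anything exists.
    try:
        first = next(files)
    except StopIteration as exc:
        raise EmptyDriveError("Drive is empty or does not exist.") from exc
    yield first
    # Phase 2: stream the remaining items without buffering them.
    yield from files
```

Because this is a generator, the emptiness check runs only when the caller starts iterating, so the error surfaces at consumption time rather than at call time.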

### File Reading

Methods for opening and reading OneDrive files with proper encoding support.

```python { .api }
def open_file(
    self,
    file: RemoteFile,
    mode: FileReadMode,
    encoding: Optional[str],
    logger: logging.Logger
) -> IOBase:
    """
    Open a OneDrive file for reading using smart-open.

    Parameters:
    - file: RemoteFile - File object with download URL
    - mode: FileReadMode - File reading mode (typically READ)
    - encoding: Optional[str] - Text encoding (e.g., 'utf-8', 'latin-1')
    - logger: logging.Logger - Logger for error tracking

    Returns:
    IOBase: Opened file-like object for reading

    Raises:
    - Exception: If file cannot be opened or accessed
    """
```
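
Since `open_file` returns a file-like object, a caller can read it in fixed-size chunks instead of loading the whole file at once. The sketch below shows that pattern with the standard library only; `iter_chunks` is a hypothetical helper, and `io.BytesIO` stands in for a handle opened in a binary mode (whether and how the connector exposes such a mode is an assumption here).

```python
import io
from typing import IO, Iterator


def iter_chunks(stream: IO[bytes], chunk_size: int = 8192) -> Iterator[bytes]:
    """Yield successive chunks until the stream is exhausted."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # an empty read signals end of stream
            break
        yield chunk


# io.BytesIO is an in-memory placeholder for a remote file handle.
fake_file = io.BytesIO(b"x" * 20000)
sizes = [len(chunk) for chunk in iter_chunks(fake_file)]
```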

### Directory Operations

Methods for recursive directory traversal and file enumeration.

```python { .api }
def list_directories_and_files(self, root_folder, path: Optional[str] = None):
    """
    Enumerates folders and files starting from a root folder recursively.

    Parameters:
    - root_folder: OneDrive folder object to start enumeration from
    - path: Optional[str] - Current path for building full file paths

    Returns:
    List[Tuple[str, str, str]]: List of (file_path, download_url, last_modified)
    """
```
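
The recursive enumeration described above can be sketched against an in-memory tree. `FakeItem` and `list_files` are hypothetical names for this example; a real implementation would fetch children through the Microsoft Graph client instead of reading a list attribute.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class FakeItem:
    """Hypothetical in-memory stand-in for a OneDrive drive item."""
    name: str
    is_file: bool = False
    download_url: str = ""
    last_modified: str = ""
    children: List["FakeItem"] = field(default_factory=list)


def list_files(folder: FakeItem, path: Optional[str] = None) -> List[Tuple[str, str, str]]:
    """Collect (file_path, download_url, last_modified) for all nested files."""
    found: List[Tuple[str, str, str]] = []
    for item in folder.children:
        # Build the path relative to the folder where the walk started.
        item_path = f"{path}/{item.name}" if path else item.name
        if item.is_file:
            found.append((item_path, item.download_url, item.last_modified))
        else:
            found.extend(list_files(item, item_path))  # descend into subfolder
    return found


root = FakeItem("root", children=[
    FakeItem("a.csv", True, "https://example.invalid/a", "2024-01-01T00:00:00Z"),
    FakeItem("reports", children=[
        FakeItem("b.csv", True, "https://example.invalid/b", "2024-02-01T00:00:00Z"),
    ]),
])
paths = [p for p, _, _ in list_files(root)]
```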

### Shared Items Access

Methods for accessing files shared with the authenticated user.

```python { .api }
def _get_shared_files_from_all_drives(self, parsed_drive_id: str):
    """
    Get files from shared items across all drives.

    Parameters:
    - parsed_drive_id: str - Drive ID to exclude from results to avoid duplicates

    Yields:
    Tuple[str, str, datetime]: File path, download URL, and last modified time
    """

def _get_shared_drive_object(self, drive_id: str, object_id: str, path: str) -> List[Tuple[str, str, datetime]]:
    """
    Retrieves a list of all nested files under the specified shared object.

    Parameters:
    - drive_id: str - The ID of the drive containing the object
    - object_id: str - The ID of the object to start the search from
    - path: str - Base path for building file paths

    Returns:
    List[Tuple[str, str, datetime]]: File information tuples

    Raises:
    - RuntimeError: If an error occurs during the Microsoft Graph API request
    """
```
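
The duplicate avoidance implied by `parsed_drive_id` can be sketched as a simple filter. The item shape and the function name `shared_files_excluding` are assumptions made for illustration; the sketch only shows the idea of skipping shared items that live on a drive that has already been scanned.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical shape: (drive_id, path, download_url, last_modified)
SharedItem = Tuple[str, str, str, str]


def shared_files_excluding(
    items: Iterable[SharedItem], own_drive_id: str
) -> Iterator[Tuple[str, str, str]]:
    """Yield shared files, skipping those on the drive already scanned."""
    for drive_id, path, url, modified in items:
        if drive_id == own_drive_id:
            continue  # already covered by the regular drive scan
        yield path, url, modified


items = [
    ("drive-1", "a.csv", "https://example.invalid/a", "2024-01-01T00:00:00Z"),
    ("drive-2", "shared/b.csv", "https://example.invalid/b", "2024-02-01T00:00:00Z"),
]
shared = list(shared_files_excluding(items, "drive-1"))
```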

### Remote File Model

File representation with OneDrive-specific attributes.

```python { .api }
class MicrosoftOneDriveRemoteFile(RemoteFile):
    download_url: str
    """Direct download URL from Microsoft Graph API for file content access."""
```

## Usage Examples

### Basic File Discovery

```python
import logging

from source_microsoft_onedrive.spec import SourceMicrosoftOneDriveSpec
from source_microsoft_onedrive.stream_reader import SourceMicrosoftOneDriveStreamReader

# Configure stream reader.
# Note: a complete spec may require additional fields (for example,
# stream definitions); they are omitted here for brevity.
config = SourceMicrosoftOneDriveSpec(**{
    "credentials": {
        "auth_type": "Client",
        "tenant_id": "your-tenant-id",
        "client_id": "your-client-id",
        "client_secret": "your-client-secret",
        "refresh_token": "your-refresh-token"
    },
    "drive_name": "OneDrive",
    "search_scope": "ACCESSIBLE_DRIVES",
    "folder_path": "Documents"
})

reader = SourceMicrosoftOneDriveStreamReader()
reader.config = config

# Get files matching patterns
logger = logging.getLogger(__name__)
files = reader.get_matching_files(["*.pdf", "*.docx"], None, logger)

for file in files:
    print(f"File: {file.uri}, Modified: {file.last_modified}")
```
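
### Reading File Content

```python
from airbyte_cdk.sources.file_based.file_based_stream_reader import FileReadMode

# get_matching_files returns a one-shot iterator, so request the files again
# instead of reusing an iterator that an earlier loop has already consumed.
# Text patterns are used here because the files are decoded as UTF-8.
text_files = reader.get_matching_files(["*.txt"], None, logger)

# Open and read each file
for file in text_files:
    with reader.open_file(file, FileReadMode.READ, "utf-8", logger) as f:
        content = f.read()
        print(f"Content length: {len(content)}")
```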

### Accessing All Files

```python
# Get all files based on search scope
all_files = reader.get_all_files()

for file_path, download_url, last_modified in all_files:
    print(f"Path: {file_path}")
    print(f"URL: {download_url}")
    print(f"Modified: {last_modified}")
    print("---")
```

### Error Handling Example

```python
from airbyte_cdk import AirbyteTracedException

try:
    files = reader.get_matching_files(["*.txt"], None, logger)
    # Consuming the iterator performs the lookup, so errors surface here
    file_list = list(files)
    print(f"Found {len(file_list)} files")
except AirbyteTracedException as e:
    message = e.message or ""  # message is optional and may be None
    if "empty or does not exist" in message:
        print("Drive is empty or inaccessible")
    else:
        print(f"Error: {message}")
```

### Drive Information

```python
# Access available drives
drives = reader.drives

for drive in drives:
    print(f"Drive: {drive.name}, Type: {drive.drive_type}, ID: {drive.id}")
```

## Search Scope Behavior

- **ACCESSIBLE_DRIVES**: Searches only the drive named by `drive_name`, starting at `folder_path`
- **SHARED_ITEMS**: Searches only shared items across all drives (ignores `folder_path`)
- **ALL**: Searches both the accessible drive under `folder_path` and all shared items
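
The three scopes listed above amount to a small dispatch over two file sources. The following sketch assumes the scope arrives as one of the three strings shown; `files_for_scope` and its callable parameters are hypothetical names, not part of the connector's API.

```python
from typing import Callable, Iterable, Iterator, Tuple

FileInfo = Tuple[str, str, str]  # (path, download URL, last modified)
VALID_SCOPES = ("ACCESSIBLE_DRIVES", "SHARED_ITEMS", "ALL")


def files_for_scope(
    scope: str,
    drive_files: Callable[[], Iterable[FileInfo]],
    shared_files: Callable[[], Iterable[FileInfo]],
) -> Iterator[FileInfo]:
    """Yield files from the sources that the given scope selects."""
    if scope not in VALID_SCOPES:
        raise ValueError(f"Unknown search scope: {scope}")
    if scope in ("ACCESSIBLE_DRIVES", "ALL"):
        yield from drive_files()   # named drive, restricted to the folder path
    if scope in ("SHARED_ITEMS", "ALL"):
        yield from shared_files()  # shared items, folder path not applied


def drive_source() -> Iterable[FileInfo]:
    return [("Documents/a.csv", "https://example.invalid/a", "2024-01-01T00:00:00Z")]


def shared_source() -> Iterable[FileInfo]:
    return [("shared/b.csv", "https://example.invalid/b", "2024-02-01T00:00:00Z")]


all_paths = [p for p, _, _ in files_for_scope("ALL", drive_source, shared_source)]
```

Passing the sources as callables keeps the unused source from being queried at all when the scope excludes it.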

## File Metadata

Each discovered file includes:
- **File Path**: Relative path from search root
- **Download URL**: Direct Microsoft Graph API download URL
- **Last Modified**: Timestamp of last file modification
- **File Size**: Available through RemoteFile base class
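
The method signatures in this document describe the last-modified value both as a `datetime` and as a datetime string. A minimal conversion sketch, assuming the string is an ISO 8601 UTC timestamp with whole seconds such as `2024-01-15T10:30:00Z`, is shown below; `parse_last_modified` is a hypothetical helper, and timestamps with fractional seconds would need a different format string.

```python
from datetime import datetime, timezone


def parse_last_modified(value: str) -> datetime:
    """Parse a 'YYYY-MM-DDTHH:MM:SSZ' string into a timezone-aware datetime."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    # The trailing "Z" denotes UTC, so attach that timezone explicitly.
    return parsed.replace(tzinfo=timezone.utc)


modified = parse_last_modified("2024-01-15T10:30:00Z")
```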

## Error Handling

File operations handle the following error categories:
- **Authentication Errors**: Token refresh and permission issues
- **API Rate Limits**: Automatic retry with appropriate backoff
- **File Access Errors**: Graceful handling of missing or inaccessible files
- **Network Issues**: Retry logic for transient network failures
- **Drive Access**: Clear error messages for empty or non-existent drives
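
Retry with backoff, as mentioned for rate limits and transient network failures, generally follows the pattern sketched here. This is a generic illustration, not the connector's retry logic: `TransientError`, `with_retry`, and the delay values are all assumptions chosen for the example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for retryable failures such as throttling."""


def with_retry(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying transient failures with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the failure
            sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so that the delays can be observed or skipped in tests without waiting.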

## Performance Optimizations

- **Lazy Initialization**: Clients are only created when needed
- **Caching**: Drive information is cached using `@lru_cache`
- **Streaming**: Files are opened as streams to handle large files efficiently
- **Batch Operations**: Bulk file discovery operations where possible