# Encoder-Only Embedders

Embedders designed for encoder-only transformer models (BERT-like architectures). These models excel at understanding bidirectional context and are particularly effective for semantic similarity tasks and dense retrieval.

## Capabilities
### FlagModel (Base Encoder Embedder)

Standard embedder for encoder-only models, using CLS-token pooling by default. It supports standard BERT-like architectures and provides a solid foundation for most embedding tasks.
```python { .api }
from typing import List, Optional, Union

class FlagModel(AbsEmbedder):
    def __init__(
        self,
        model_name_or_path: str,
        pooling_method: str = "cls",
        normalize_embeddings: bool = True,
        use_fp16: bool = True,
        query_instruction_for_retrieval: Optional[str] = None,
        query_instruction_format: str = "{}{}",
        devices: Optional[Union[str, List[str]]] = None,
        batch_size: int = 256,
        query_max_length: int = 512,
        passage_max_length: int = 512,
        convert_to_numpy: bool = True,
        trust_remote_code: bool = False,
        cache_dir: Optional[str] = None,
        **kwargs
    ):
        """
        Initialize encoder-only embedder.

        Args:
            model_name_or_path: Path to model or HuggingFace model name
            pooling_method: Pooling strategy ("cls", "mean")
            normalize_embeddings: Whether to normalize output embeddings
            use_fp16: Use half precision for inference
            query_instruction_for_retrieval: Instruction prepended to queries
            query_instruction_format: Format string for instructions
            devices: Device(s) for inference; a list enables multi-GPU encoding
            batch_size: Default batch size for encoding
            query_max_length: Maximum query token length
            passage_max_length: Maximum passage token length
            convert_to_numpy: Convert outputs to numpy arrays
            trust_remote_code: Allow custom model code execution
            cache_dir: Directory for model cache
            **kwargs: Additional model parameters
        """
```

### BGEM3FlagModel (Specialized M3 Embedder)

Advanced embedder designed specifically for BGE-M3 models, with support for dense, sparse (lexical), and ColBERT (multi-vector) representations. A single encoding pass can return any combination of the three, which makes it suitable for hybrid retrieval scenarios.
```python { .api }
from typing import Any, Dict, List, Optional, Union

import torch

class BGEM3FlagModel(AbsEmbedder):
    def __init__(
        self,
        model_name_or_path: str,
        pooling_method: str = "cls",
        normalize_embeddings: bool = True,
        use_fp16: bool = True,
        query_instruction_for_retrieval: Optional[str] = None,
        query_instruction_format: str = "{}{}",
        devices: Optional[Union[str, List[str]]] = None,
        batch_size: int = 256,
        query_max_length: int = 512,
        passage_max_length: int = 512,
        convert_to_numpy: bool = True,
        colbert_dim: int = -1,
        return_dense: bool = True,
        return_sparse: bool = False,
        return_colbert_vecs: bool = False,
        **kwargs
    ):
        """
        Initialize BGE-M3 specialized embedder.

        Args:
            model_name_or_path: Path to BGE-M3 model
            pooling_method: Pooling strategy ("cls", "mean")
            normalize_embeddings: Whether to normalize output embeddings
            use_fp16: Use half precision for inference
            query_instruction_for_retrieval: Instruction prepended to queries
            query_instruction_format: Format string for instructions
            devices: Device(s) for inference; a list enables multi-GPU encoding
            batch_size: Default batch size for encoding
            query_max_length: Maximum query token length
            passage_max_length: Maximum passage token length
            convert_to_numpy: Convert outputs to numpy arrays
            colbert_dim: ColBERT dimension (-1 for auto)
            return_dense: Include dense embeddings in output
            return_sparse: Include sparse (lexical) weights in output
            return_colbert_vecs: Include ColBERT vectors in output
            **kwargs: Additional model parameters
        """

    def compute_score(
        self,
        q_reps: Dict[str, Any],
        p_reps: Dict[str, Any],
        weights: Optional[List[float]] = None
    ) -> float:
        """
        Compute similarity score between query and passage representations.

        Args:
            q_reps: Query representations (dense, sparse, colbert)
            p_reps: Passage representations (dense, sparse, colbert)
            weights: Weights for combining the different representation types

        Returns:
            Combined similarity score
        """

    def compute_lexical_matching_score(
        self,
        lexical_weights_1: Dict[int, float],
        lexical_weights_2: Dict[int, float]
    ) -> float:
        """
        Compute lexical matching score between sparse representations.

        Args:
            lexical_weights_1: First sparse representation weights
            lexical_weights_2: Second sparse representation weights

        Returns:
            Lexical matching score
        """

    def colbert_score(
        self,
        q_reps: torch.Tensor,
        p_reps: torch.Tensor
    ) -> float:
        """
        Compute ColBERT similarity score.

        Args:
            q_reps: Query ColBERT vectors
            p_reps: Passage ColBERT vectors

        Returns:
            ColBERT similarity score
        """

    def convert_id_to_token(
        self,
        lexical_weights: Dict[int, float]
    ) -> List[Dict[str, Any]]:
        """
        Convert token IDs in sparse weights to actual tokens.

        Args:
            lexical_weights: Sparse weights with token IDs

        Returns:
            List of token-weight mappings
        """
```
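
The `weights` argument of `compute_score` blends the dense, lexical, and ColBERT signals into one score. The sketch below illustrates that idea; the weighting scheme and the default values are illustrative assumptions, not the library's exact implementation:

```python
from typing import Dict, List, Optional

def hybrid_score(
    scores: Dict[str, float],
    weights: Optional[List[float]] = None,
) -> float:
    """Weighted sum of dense, lexical, and ColBERT similarity scores (illustrative)."""
    # Order assumed as [w_dense, w_sparse, w_colbert]; defaults are made up for the sketch.
    w_dense, w_sparse, w_colbert = weights or [1.0, 0.3, 1.0]
    return (
        w_dense * scores.get("dense", 0.0)
        + w_sparse * scores.get("sparse", 0.0)
        + w_colbert * scores.get("colbert", 0.0)
    )

print(hybrid_score({"dense": 0.82, "sparse": 0.35, "colbert": 0.78}))
```

A weighted sum keeps the three signals interpretable and lets the weights be tuned per task or dataset.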

## Usage Examples

### Basic Encoder Embedder

```python
from FlagEmbedding import FlagModel

# Initialize with CLS pooling
embedder = FlagModel(
    'BAAI/bge-large-en-v1.5',
    pooling_method="cls",
    use_fp16=True
)

# Encode queries and documents
queries = ["What is deep learning?", "How do transformers work?"]
documents = ["Deep learning is a subset of ML", "Transformers use attention mechanisms"]

query_embeddings = embedder.encode_queries(queries)
doc_embeddings = embedder.encode_corpus(documents)

print(f"Query embeddings shape: {query_embeddings.shape}")
print(f"Document embeddings shape: {doc_embeddings.shape}")
```
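
Since `normalize_embeddings` defaults to `True`, the inner product of query and document embeddings equals their cosine similarity, so relevance can be scored with a single matrix product. A short follow-up reusing `queries`, `documents`, `query_embeddings`, and `doc_embeddings` from the example above:

```python
# Score every query against every document; with normalized embeddings
# the inner product equals cosine similarity.
scores = query_embeddings @ doc_embeddings.T  # shape: (num_queries, num_docs)

for query, row in zip(queries, scores):
    best = row.argmax()
    print(f"{query!r} -> {documents[best]!r} (score={row[best]:.4f})")
```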

### Mean Pooling Strategy

Mean pooling averages all token embeddings instead of taking the [CLS] vector; choose the strategy the model was trained with.

```python
from FlagEmbedding import FlagModel

# Use mean pooling instead of CLS
embedder = FlagModel(
    'BAAI/bge-base-en-v1.5',
    pooling_method="mean",
    normalize_embeddings=True
)

texts = ["Example text for embedding"]
embeddings = embedder.encode(texts)
```
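
For reference, a minimal sketch of the two pooling strategies using `transformers` directly (illustrative only; `FlagModel` performs the equivalent reduction internally):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-base-en-v1.5')
model = AutoModel.from_pretrained('BAAI/bge-base-en-v1.5')

batch = tokenizer(["Example text for embedding"], return_tensors="pt", padding=True)
with torch.no_grad():
    hidden = model(**batch).last_hidden_state        # (batch, seq_len, dim)

# CLS pooling: take the first token's hidden state
cls_emb = hidden[:, 0]

# Mean pooling: average token states, ignoring padding
mask = batch["attention_mask"].unsqueeze(-1).float()  # (batch, seq_len, 1)
mean_emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Normalize so inner products become cosine similarities
cls_emb = torch.nn.functional.normalize(cls_emb, dim=-1)
mean_emb = torch.nn.functional.normalize(mean_emb, dim=-1)
```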

### BGE-M3 Multi-Vector Embeddings

```python
from FlagEmbedding import BGEM3FlagModel

# Initialize BGE-M3 with all representation types
embedder = BGEM3FlagModel(
    'BAAI/bge-m3',
    return_dense=True,
    return_sparse=True,
    return_colbert_vecs=True,
    use_fp16=True
)

# Encode with multiple representation types
query = ["machine learning applications"]
passage = ["ML is used in healthcare, finance, and technology"]

query_output = embedder.encode_queries(query)
passage_output = embedder.encode_corpus(passage)

# Access the different representation types
if isinstance(query_output, dict):
    dense_query = query_output.get('dense_vecs')        # dense vectors
    sparse_query = query_output.get('lexical_weights')  # per-token lexical weights
    colbert_query = query_output.get('colbert_vecs')    # ColBERT token vectors
```
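
The sparse output is keyed by token ID; `convert_id_to_token` (listed in the API above) maps those IDs back to readable tokens, which helps when inspecting lexical weights. A short follow-up reusing `embedder` and `sparse_query` from the example above (the printed output is illustrative):

```python
# Map token IDs back to readable tokens for the first query's sparse weights
readable = embedder.convert_id_to_token(sparse_query[0])
print(readable)  # illustrative: token -> weight pairs for the query terms
```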

### M3 Similarity Scoring

```python
from FlagEmbedding import BGEM3FlagModel

embedder = BGEM3FlagModel(
    'BAAI/bge-m3',
    return_dense=True,
    return_sparse=True,
    return_colbert_vecs=True
)

# Get representations for scoring
query_reps = embedder.encode_queries(["machine learning"])
passage_reps = embedder.encode_corpus(["ML algorithms"])

# Build per-example representation dicts and compute the combined score
q_rep = {key: value[0] for key, value in query_reps.items()}
p_rep = {key: value[0] for key, value in passage_reps.items()}
score = embedder.compute_score(q_rep, p_rep)
print(f"Combined similarity: {score}")

# Compute individual scores if needed
if 'lexical_weights' in query_reps:
    lexical_score = embedder.compute_lexical_matching_score(
        query_reps['lexical_weights'][0],
        passage_reps['lexical_weights'][0]
    )
    print(f"Lexical similarity: {lexical_score}")

if 'colbert_vecs' in query_reps:
    colbert_similarity = embedder.colbert_score(
        query_reps['colbert_vecs'][0],
        passage_reps['colbert_vecs'][0]
    )
    print(f"ColBERT similarity: {colbert_similarity}")
```

### Custom Instructions for Retrieval

```python
from FlagEmbedding import FlagModel

# Add custom instruction for retrieval tasks
embedder = FlagModel(
    'BAAI/bge-large-en-v1.5',
    query_instruction_for_retrieval="Represent this query for retrieving relevant documents: ",
    query_instruction_format="{}{}"
)

# Queries will be prepended with the instruction
queries = ["best practices for machine learning"]
embeddings = embedder.encode_queries(queries)
```
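
With the default `"{}{}"` format, the instruction is concatenated directly in front of each query before encoding. The sketch below mirrors how such a format string is applied; it is an illustration of the formatting step, not the library's internal code:

```python
# Illustrative: how the format string combines the instruction and the query text
instruction = "Represent this query for retrieving relevant documents: "
query_instruction_format = "{}{}"

formatted = query_instruction_format.format(
    instruction, "best practices for machine learning"
)
print(formatted)
# -> Represent this query for retrieving relevant documents: best practices for machine learning
```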

### Multi-GPU Processing

```python
from FlagEmbedding import FlagModel

# Use multiple GPUs for large-scale processing
embedder = FlagModel(
    'BAAI/bge-large-en-v1.5',
    devices=['cuda:0', 'cuda:1', 'cuda:2'],
    batch_size=128
)

# Process a large corpus efficiently
large_corpus = [f"Document {i}" for i in range(50000)]
embeddings = embedder.encode_corpus(large_corpus)
```
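
After encoding a large corpus, the embeddings are typically loaded into a vector index for retrieval. A minimal sketch using `faiss` (an optional third-party dependency, not part of FlagEmbedding), reusing `embedder` and `embeddings` from the example above; exact inner-product search is shown for simplicity:

```python
import faiss
import numpy as np

# Exact inner-product index; with normalized embeddings this ranks by cosine similarity
embeddings = np.asarray(embeddings, dtype=np.float32)
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

# Retrieve the top 5 documents for a query
query_emb = embedder.encode_queries(["example query"]).astype(np.float32)
scores, ids = index.search(query_emb, 5)
print(ids[0], scores[0])
```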

## Supported Models

### BGE Models
- bge-large-en-v1.5, bge-base-en-v1.5, bge-small-en-v1.5
- bge-large-zh-v1.5, bge-base-zh-v1.5, bge-small-zh-v1.5
- bge-large-en, bge-base-en, bge-small-en
- bge-large-zh, bge-base-zh, bge-small-zh
- bge-m3 (requires BGEM3FlagModel)

### E5 Models
- e5-large-v2, e5-base-v2, e5-small-v2
- multilingual-e5-large, multilingual-e5-base, multilingual-e5-small
- e5-large, e5-base, e5-small

### GTE Models
- gte-multilingual-base, gte-large-en-v1.5, gte-base-en-v1.5
- gte-large, gte-base, gte-small
- gte-large-zh, gte-base-zh, gte-small-zh
## Types

```python { .api }
from typing import Any, Dict, List, Literal, Optional, Union

import numpy as np
import torch

# BGE-M3 specific types
M3Output = Dict[str, Union[torch.Tensor, np.ndarray, List[Dict[int, float]]]]
SparseWeights = Dict[int, float]
ColBERTVectors = torch.Tensor
DenseEmbedding = Union[torch.Tensor, np.ndarray]

# Pooling method types
PoolingMethod = Literal["cls", "mean"]
```