# Decoder-Only Embedders

Embedders designed for decoder-only transformer models (LLM-style architectures). These models leverage large language model capabilities for embedding generation, typically pooling the hidden state of the final token into the representation and supporting instruction-based query formatting.
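
To make the "last token" idea concrete, here is a minimal, hedged sketch of last-token pooling written directly against Hugging Face `transformers` (an illustration of the technique, not FlagEmbedding's internal implementation; `gpt2` stands in for any decoder-only model):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Any causal LM works the same way; gpt2 is used only because it is small.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 ships without a pad token

texts = ["What is machine learning?", "A shorter query"]
batch = tokenizer(texts, padding=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # (batch, seq_len, dim)

# Locate the last non-padding token of each sequence and take its hidden state.
last_idx = batch["attention_mask"].sum(dim=1) - 1
embeddings = hidden[torch.arange(hidden.size(0)), last_idx]  # (batch, dim)

# L2-normalize so dot products are cosine similarities,
# mirroring normalize_embeddings=True below.
embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
```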
## Capabilities

### FlagLLMModel (Base LLM Embedder)

Standard embedder for decoder-only models using last-token pooling. Designed for large language models that generate embeddings through their natural language understanding capabilities.
```python { .api }
from typing import List, Optional, Union

class FlagLLMModel(AbsEmbedder):
    def __init__(
        self,
        model_name_or_path: str,
        pooling_method: str = "last_token",
        normalize_embeddings: bool = True,
        use_fp16: bool = True,
        query_instruction_for_retrieval: Optional[str] = None,
        query_instruction_format: str = "Instruct: {}\nQuery: {}",
        devices: Optional[Union[str, List[str]]] = None,
        batch_size: int = 256,
        query_max_length: int = 512,
        passage_max_length: int = 512,
        convert_to_numpy: bool = True,
        **kwargs
    ):
        """
        Initialize decoder-only LLM embedder.

        Args:
            model_name_or_path: Path to model or HuggingFace model name
            pooling_method: Pooling strategy (only "last_token" is supported)
            normalize_embeddings: Whether to L2-normalize output embeddings
            use_fp16: Use half precision for inference
            query_instruction_for_retrieval: Instruction prepended to retrieval queries
            query_instruction_format: Format string with two placeholders (instruction, query)
            devices: Device or list of devices for single- or multi-GPU inference
            batch_size: Default batch size for encoding
            query_max_length: Maximum query token length
            passage_max_length: Maximum passage token length
            convert_to_numpy: Convert outputs to numpy arrays
            **kwargs: Additional model parameters
        """
```
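
When `query_instruction_for_retrieval` is set, each query is rendered through `query_instruction_format` before encoding. A hedged illustration of the expansion the default template implies (plain string formatting, not FlagEmbedding's internal code path):

```python
# First placeholder = instruction, second = query text.
instruction = "Given a question, retrieve relevant documents that answer the question"
query = "How do neural networks learn?"

formatted = "Instruct: {}\nQuery: {}".format(instruction, query)
print(formatted)
# Instruct: Given a question, retrieve relevant documents that answer the question
# Query: How do neural networks learn?
```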
### FlagICLModel (In-Context Learning Embedder)

Specialized embedder for in-context learning with large language models. It leverages few-shot examples placed in the input context to produce high-quality, task-adapted embeddings.
```python { .api }
from typing import List, Optional, Union

class FlagICLModel(AbsEmbedder):
    def __init__(
        self,
        model_name_or_path: str,
        pooling_method: str = "last_token",
        normalize_embeddings: bool = True,
        use_fp16: bool = True,
        query_instruction_for_retrieval: Optional[str] = None,
        query_instruction_format: str = "{}{}",
        devices: Optional[Union[str, List[str]]] = None,
        batch_size: int = 256,
        query_max_length: int = 512,
        passage_max_length: int = 512,
        convert_to_numpy: bool = True,
        **kwargs
    ):
        """
        Initialize in-context learning embedder.

        Args:
            model_name_or_path: Path to ICL-capable model
            pooling_method: Pooling strategy (only "last_token" is supported)
            normalize_embeddings: Whether to L2-normalize output embeddings
            use_fp16: Use half precision for inference
            query_instruction_for_retrieval: Instruction for retrieval queries
            query_instruction_format: Format string with two placeholders (instruction, query)
            devices: Device or list of devices for single- or multi-GPU inference
            batch_size: Default batch size for encoding
            query_max_length: Maximum query token length
            passage_max_length: Maximum passage token length
            convert_to_numpy: Convert outputs to numpy arrays
            **kwargs: Additional model parameters
        """
```
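
The signature mirrors `FlagLLMModel`; the in-context behavior comes from the demonstrations placed in the query text itself. A hedged sketch of assembling such a few-shot prefix by hand (the helper and template wording are illustrative assumptions, not part of the FlagEmbedding API):

```python
from typing import List, Tuple

def build_icl_query(examples: List[Tuple[str, str]], query: str) -> str:
    # Hypothetical helper: prepend (query, response) demonstrations to a new
    # query before encoding; the exact template wording is an assumption.
    demos = "\n".join(
        f"Example query: {q}\nExample response: {r}" for q, r in examples
    )
    return f"{demos}\nQuery: {query}"

examples = [("What is AI?", "AI studies systems that exhibit intelligent behavior.")]
print(build_icl_query(examples, "What is machine learning?"))
```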
## Usage Examples

### Basic LLM Embedder

```python
from FlagEmbedding import FlagLLMModel

# Initialize LLM embedder with last-token pooling
embedder = FlagLLMModel(
    'intfloat/e5-mistral-7b-instruct',
    pooling_method="last_token",
    use_fp16=True
)

# Encode queries and passages
queries = ["What are the applications of machine learning?"]
passages = ["Machine learning is applied in healthcare, finance, and autonomous systems"]

query_embeddings = embedder.encode_queries(queries)
passage_embeddings = embedder.encode_corpus(passages)

print(f"Query embedding shape: {query_embeddings.shape}")
print(f"Passage embedding shape: {passage_embeddings.shape}")
```
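
Because `normalize_embeddings` defaults to `True`, relevance scores fall out of a plain inner product. A short continuation of the example above:

```python
# Embeddings are L2-normalized, so the inner product is cosine similarity.
scores = query_embeddings @ passage_embeddings.T
print(scores)  # one row per query, one column per passage
```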
### Custom Instruction Formatting

```python
from FlagEmbedding import FlagLLMModel

# Use a custom instruction for queries
embedder = FlagLLMModel(
    'intfloat/e5-mistral-7b-instruct',
    query_instruction_for_retrieval="Given a question, retrieve relevant documents that answer the question",
    query_instruction_format="Instruct: {}\nQuery: {}",
    use_fp16=True
)

# Queries will be formatted with the custom instruction
queries = ["How do neural networks learn?"]
embeddings = embedder.encode_queries(queries)
```
### In-Context Learning Embedder

```python
from FlagEmbedding import FlagICLModel

# Initialize ICL embedder for few-shot use
embedder = FlagICLModel(
    'BAAI/bge-en-icl',
    use_fp16=True,
    batch_size=64  # smaller batch for memory efficiency
)

# ICL works best when demonstrations are included in the context
queries = [
    "Example: 'What is AI?' -> AI concepts. Query: 'What is machine learning?'"
]

embeddings = embedder.encode_queries(queries)
```
150
151
### Multi-GPU LLM Processing
152
153
```python
154
from FlagEmbedding import FlagLLMModel
155
156
# Use multiple GPUs for large LLM models
157
embedder = FlagLLMModel(
158
'e5-mistral-7b-instruct',
159
devices=['cuda:0', 'cuda:1'],
160
batch_size=32, # Smaller batch size for large models
161
use_fp16=True
162
)
163
164
# Process documents efficiently across GPUs
165
documents = [f"Document {i} content" for i in range(1000)]
166
embeddings = embedder.encode_corpus(documents)
167
```
### Custom Max Length Settings

```python
from FlagEmbedding import FlagLLMModel

# Configure different max lengths for queries vs. passages
embedder = FlagLLMModel(
    'intfloat/e5-mistral-7b-instruct',
    query_max_length=256,     # shorter for queries
    passage_max_length=1024,  # longer for passages
    use_fp16=True
)

# Long passage encoding (input beyond passage_max_length is truncated)
long_passage = "Very long document content..." * 100
passage_embedding = embedder.encode_corpus([long_passage])
```
186
187
### Retrieval-Specific Instructions
188
189
```python
190
from FlagEmbedding import FlagLLMModel
191
192
# Specialized instructions for different retrieval tasks
193
qa_embedder = FlagLLMModel(
194
'e5-mistral-7b-instruct',
195
query_instruction_for_retrieval="Represent this question for retrieving relevant answers",
196
query_instruction_format="Task: {}\\nInput: {}"
197
)
198
199
semantic_embedder = FlagLLMModel(
200
'e5-mistral-7b-instruct',
201
query_instruction_for_retrieval="Encode this text for semantic similarity search",
202
query_instruction_format="{}: {}"
203
)
204
205
# Different use cases
206
qa_queries = ["What causes climate change?"]
207
semantic_queries = ["renewable energy technologies"]
208
209
qa_embeddings = qa_embedder.encode_queries(qa_queries)
210
semantic_embeddings = semantic_embedder.encode_queries(semantic_queries)
211
```
### Comparing Decoder vs Encoder Models

```python
from FlagEmbedding import FlagLLMModel, FlagModel

# Decoder-only model
llm_embedder = FlagLLMModel('intfloat/e5-mistral-7b-instruct')

# Encoder-only model
encoder_embedder = FlagModel('BAAI/bge-large-en-v1.5')

text = ["Machine learning algorithms"]

# Both produce embeddings, but with different dimensionalities and
# characteristics; vectors from different models are not directly comparable.
llm_emb = llm_embedder.encode(text)
encoder_emb = encoder_embedder.encode(text)

print(f"LLM embedding shape: {llm_emb.shape}")
print(f"Encoder embedding shape: {encoder_emb.shape}")
```
233
234
### Memory-Efficient Processing
235
236
```python
237
from FlagEmbedding import FlagLLMModel
238
239
# Configure for memory-constrained environments
240
embedder = FlagLLMModel(
241
'e5-mistral-7b-instruct',
242
use_fp16=True,
243
batch_size=8, # Very small batch
244
devices=['cuda:0'], # Single GPU
245
convert_to_numpy=True # Free GPU memory faster
246
)
247
248
# Process in smaller chunks
249
large_corpus = [f"Document {i}" for i in range(10000)]
250
chunk_size = 100
251
252
all_embeddings = []
253
for i in range(0, len(large_corpus), chunk_size):
254
chunk = large_corpus[i:i+chunk_size]
255
chunk_embeddings = embedder.encode_corpus(chunk)
256
all_embeddings.append(chunk_embeddings)
257
258
# Combine results
259
import numpy as np
260
final_embeddings = np.vstack(all_embeddings)
261
```
## Supported Models

### E5 LLM Models
- e5-mistral-7b-instruct (instruction-tuned Mistral)

### BGE LLM Models
- bge-en-icl (in-context learning model)
- bge-multilingual-gemma2

### GTE LLM Models
- gte-Qwen2-7B-instruct
- gte-Qwen2-1.5B-instruct
- gte-Qwen1.5-7B-instruct
## Model Selection Guidelines
278
279
### When to Use FlagLLMModel
280
- Working with instruction-tuned language models
281
- Need natural language understanding in embeddings
282
- Have computational resources for larger models
283
- Want to leverage instruction following capabilities
284
285
### When to Use FlagICLModel
286
- Need few-shot learning capabilities
287
- Working with domain-specific tasks
288
- Want to provide examples in context
289
- Need adaptability without fine-tuning
290
291
## Types
292
293
```python { .api }
294
from typing import Optional, List, Union
295
import torch
296
import numpy as np
297
298
# Decoder-specific pooling (only last_token supported)
299
DecoderPoolingMethod = Literal["last_token"]
300
301
# Instruction format templates
302
InstructionTemplate = str # Format string with {} placeholders
303
```
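
A hedged illustration of these aliases in use; `make_embedder` is a hypothetical convenience wrapper shown only to demonstrate the annotations, not part of the FlagEmbedding API:

```python
from typing import Literal

from FlagEmbedding import FlagLLMModel

# Aliases restated from the block above so the example is self-contained.
DecoderPoolingMethod = Literal["last_token"]
InstructionTemplate = str

def make_embedder(
    model_name_or_path: str,
    pooling_method: DecoderPoolingMethod = "last_token",
    instruction_template: InstructionTemplate = "Instruct: {}\nQuery: {}",
) -> FlagLLMModel:
    # Static type checkers will reject any pooling_method other than "last_token".
    return FlagLLMModel(
        model_name_or_path,
        pooling_method=pooling_method,
        query_instruction_format=instruction_template,
    )
```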