# Kernels

DScribe's kernel methods provide similarity measures between atomic structures by comparing their local atomic environments. These kernels are useful in machine learning applications that need a structural similarity measure or a kernel-based model.

## Capabilities

### AverageKernel

The AverageKernel computes the global similarity of two structures as the average of the similarities between all pairs of their local atomic environments.

```python { .api }
class AverageKernel:
    def __init__(self, metric, gamma=None, degree=3, coef0=1,
                 kernel_params=None, normalize_kernel=True):
        """
        Initialize Average Kernel.

        Parameters:
        - metric (str): Pairwise kernel used for local similarities:
            - "linear": Linear kernel (dot product)
            - "polynomial": Polynomial kernel
            - "rbf": Radial basis function (Gaussian) kernel
            - "laplacian": Laplacian kernel
            - "sigmoid": Sigmoid kernel
        - gamma (float): Kernel coefficient for rbf, laplacian, polynomial, and sigmoid kernels
        - degree (int): Degree for polynomial kernel
        - coef0 (float): Independent term for polynomial and sigmoid kernels
        - kernel_params (dict): Additional parameters for specific kernels
        - normalize_kernel (bool): Whether to normalize the global kernel matrix,
          K_ij / sqrt(K_ii * K_jj)
        """

    def create(self, x, y=None):
        """
        Create kernel matrix from local descriptors.

        Parameters:
        - x: Local descriptors for first set of structures (list of arrays)
        - y: Local descriptors for second set of structures (optional, defaults to x)

        Returns:
        numpy.ndarray: Kernel matrix with shape (n_structures_x, n_structures_y)
        """

    def get_global_similarity(self, localkernel):
        """
        Compute global similarity from local similarity matrix.

        Parameters:
        - localkernel: Local kernel matrix between the environments of two structures

        Returns:
        float: Global similarity value (average of local similarities)
        """
```

**Usage Example:**

```python
from ase.build import molecule
from dscribe.descriptors import SOAP
from dscribe.kernels import AverageKernel

# Set up a SOAP descriptor for local environments.
# The species list must cover every element in all structures described below.
soap = SOAP(species=["H", "C", "N", "O"], r_cut=5.0, n_max=8, l_max=6)

# Create local descriptors: one (n_atoms, n_features) array per molecule
molecules = [molecule("H2O"), molecule("H2O2")]
soap_descriptors = [soap.create(mol) for mol in molecules]

# Set up an average kernel with an RBF local similarity metric
kernel = AverageKernel(metric="rbf", gamma=1.0)

# Compute the kernel matrix
K = kernel.create(soap_descriptors)  # Shape: (2, 2)
print(f"Self-similarity: {K[0, 0]}")
print(f"Cross-similarity: {K[0, 1]}")

# Compare against a different set of molecules
other_molecules = [molecule("NH3"), molecule("CH4")]
other_descriptors = [soap.create(mol) for mol in other_molecules]
K_cross = kernel.create(soap_descriptors, other_descriptors)  # Shape: (2, 2)
```

### REMatchKernel

The REMatchKernel (Regularized-Entropy Match Kernel) uses entropy-regularized optimal transport to match the local environments of two structures. The global similarity is the sum of the local similarities weighted by the resulting transport plan, so well-matched environment pairs contribute the most.

```python { .api }
class REMatchKernel:
    def __init__(self, alpha=0.1, threshold=1e-6, metric="linear", gamma=None,
                 degree=3, coef0=1, kernel_params=None, normalize_kernel=True):
        """
        Initialize REMatch Kernel.

        Parameters:
        - alpha (float): Entropic regularization strength. Values close to zero
          approach the best-match solution; large values approach the average kernel
        - threshold (float): Convergence threshold for Sinkhorn algorithm
        - metric (str): Pairwise kernel used for local similarities:
            - "linear": Linear kernel (dot product)
            - "polynomial": Polynomial kernel
            - "rbf": Radial basis function (Gaussian) kernel
            - "laplacian": Laplacian kernel
            - "sigmoid": Sigmoid kernel
        - gamma (float): Kernel coefficient for rbf, laplacian, polynomial, and sigmoid kernels
        - degree (int): Degree for polynomial kernel
        - coef0 (float): Independent term for polynomial and sigmoid kernels
        - kernel_params (dict): Additional parameters for specific kernels
        - normalize_kernel (bool): Whether to normalize the global kernel matrix,
          K_ij / sqrt(K_ii * K_jj)
        """

    def create(self, x, y=None):
        """
        Create REMatch kernel matrix from local descriptors.

        Parameters:
        - x: Local descriptors for first set of structures (list of arrays)
        - y: Local descriptors for second set of structures (optional, defaults to x)

        Returns:
        numpy.ndarray: REMatch kernel matrix with shape (n_structures_x, n_structures_y)
        """

    def get_global_similarity(self, localkernel):
        """
        Compute global similarity using optimal transport matching.

        Parameters:
        - localkernel: Local kernel matrix between the environments of two structures

        Returns:
        float: Global similarity value from optimal transport solution
        """
```

**Usage Example:**

```python
from ase.build import molecule
from dscribe.descriptors import SOAP
from dscribe.kernels import REMatchKernel

# Set up the SOAP descriptor
soap = SOAP(species=["H", "O"], r_cut=5.0, n_max=8, l_max=6)

# Create local descriptors
molecules = [molecule("H2O"), molecule("H2O2")]
soap_descriptors = [soap.create(mol) for mol in molecules]

# Set up a REMatch kernel with custom parameters
rematch = REMatchKernel(
    metric="rbf",
    gamma=1.0,
    alpha=0.1,       # Lower alpha = less entropic regularization (closer to best match)
    threshold=1e-8,  # Tighter Sinkhorn convergence threshold
)

# Compute the REMatch kernel matrix
K_rematch = rematch.create(soap_descriptors)  # Shape: (2, 2)
print(f"REMatch similarity: {K_rematch[0, 1]}")

# Compare different alpha values
rematch_low_reg = REMatchKernel(metric="rbf", gamma=1.0, alpha=0.01)
rematch_high_reg = REMatchKernel(metric="rbf", gamma=1.0, alpha=1.0)

K_low = rematch_low_reg.create(soap_descriptors)
K_high = rematch_high_reg.create(soap_descriptors)
```

## Kernel Theory and Applications

### Local Similarity Foundation

Both kernels build on the similarity between local atomic environments:

1. **Local descriptors** are computed for each atomic environment in each structure
2. A **local kernel matrix** is computed between all environment pairs of two structures using the specified metric
3. A **global similarity** is derived from the local kernel matrix; the two kernels differ only in this aggregation step

### AverageKernel vs REMatchKernel

**AverageKernel**:
- Simple average of all local environment similarities
- Computationally efficient
- Good for structures with similar local environment counts
- Formula: K(A,B) = (1/NM) * Σᵢⱼ Cᵢⱼ(A,B), where N and M are the numbers of environments in A and B
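
As a rough illustration of the formula above, the following NumPy sketch averages an RBF local kernel over all environment pairs and then divides by sqrt(K(A,A) * K(B,B)). The helper names `rbf_local_kernel` and `average_similarity` and the random toy arrays are assumptions made for this example; they are not part of DScribe, and the library's own implementation may differ in detail.

```python
import numpy as np


def rbf_local_kernel(a, b, gamma=1.0):
    """Local kernel C_ij = exp(-gamma * ||a_i - b_j||^2) between two sets of environments."""
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)


def average_similarity(a, b, gamma=1.0, normalize=True):
    """Mean of all local similarities, optionally normalized so that K(A, A) == 1."""
    k_ab = rbf_local_kernel(a, b, gamma).mean()
    if not normalize:
        return k_ab
    k_aa = rbf_local_kernel(a, a, gamma).mean()
    k_bb = rbf_local_kernel(b, b, gamma).mean()
    return k_ab / np.sqrt(k_aa * k_bb)


# Toy "structures": rows are local environments, columns are descriptor features.
rng = np.random.default_rng(0)
struct_a = rng.random((3, 4))  # N = 3 environments
struct_b = rng.random((5, 4))  # M = 5 environments

print(average_similarity(struct_a, struct_a))  # self-similarity
print(average_similarity(struct_a, struct_b))  # cross-similarity
```

Because the average runs over all N × M pairs, the two structures do not need the same number of environments.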

**REMatchKernel**:
- Uses entropy-regularized optimal transport to find the best matching between environments
- More expensive than the average kernel because each structure pair requires an iterative solve
- Better for structures with different sizes or environment distributions
- Uses the Sinkhorn algorithm to solve the regularized transport problem
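
The matching step can be sketched with plain Sinkhorn iterations. This is a minimal illustration, assuming the transport cost is 1 - C_ij, that every environment carries a uniform weight (1/N or 1/M), and that local similarities lie in [0, 1]. The function name `rematch_similarity`, the `max_iter` guard, and the toy kernel are hypothetical; this is not DScribe's implementation.

```python
import numpy as np


def rematch_similarity(local_kernel, alpha=0.1, threshold=1e-6, max_iter=10_000):
    """Entropy-regularized matching of environments via Sinkhorn iterations.

    local_kernel: (N, M) array of local similarities, assumed to lie in [0, 1].
    Returns the global similarity sum_ij P_ij * C_ij and the transport plan P.
    """
    n, m = local_kernel.shape
    gibbs = np.exp(-(1.0 - local_kernel) / alpha)  # cost is 1 - similarity
    row_target = np.full(n, 1.0 / n)  # each environment of A carries weight 1/N
    col_target = np.full(m, 1.0 / m)  # each environment of B carries weight 1/M
    u = row_target.copy()
    v = col_target.copy()
    for _ in range(max_iter):
        u_prev, v_prev = u, v
        v = col_target / (gibbs.T @ u)
        u = row_target / (gibbs @ v)
        # Relative squared change of the scaling vectors between iterations
        change = (np.sum((u - u_prev) ** 2) / np.sum(u ** 2)
                  + np.sum((v - v_prev) ** 2) / np.sum(v ** 2))
        if change < threshold:
            break
    plan = u[:, None] * gibbs * v[None, :]
    return float(np.sum(plan * local_kernel)), plan


# Toy local kernel between two structures with two environments each.
toy_kernel = np.array([[1.0, 0.2], [0.2, 1.0]])

best_match_like, _ = rematch_similarity(toy_kernel, alpha=0.01)
average_like, _ = rematch_similarity(toy_kernel, alpha=100.0)
print(best_match_like, average_like, toy_kernel.mean())
```

With a small `alpha` the plan concentrates on the best-matching pairs; with a large `alpha` it spreads out and the result approaches the plain average of the local kernel.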

### Kernel Metrics

Both kernels accept the same local similarity metrics:

```python
from dscribe.kernels import AverageKernel

# Linear kernel (fastest)
kernel = AverageKernel(metric="linear")

# RBF (Gaussian) kernel - most common
kernel = AverageKernel(metric="rbf", gamma=1.0)

# Polynomial kernel
kernel = AverageKernel(metric="polynomial", degree=3, gamma=1.0, coef0=1.0)

# Laplacian kernel
kernel = AverageKernel(metric="laplacian", gamma=1.0)

# Sigmoid kernel
kernel = AverageKernel(metric="sigmoid", gamma=1.0, coef0=1.0)
```
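
To make the role of `gamma`, `degree`, and `coef0` concrete, the sketch below writes out each metric in NumPy, assuming they follow the usual scikit-learn pairwise kernel definitions. The function `local_kernel` is a hypothetical helper for illustration only, not a DScribe function.

```python
import numpy as np


def local_kernel(x, y, metric, gamma=1.0, degree=3, coef0=1.0):
    """Pairwise local kernel between rows of x (N, F) and rows of y (M, F)."""
    dot = x @ y.T
    diff = x[:, None, :] - y[None, :, :]
    if metric == "linear":
        return dot
    if metric == "polynomial":
        return (gamma * dot + coef0) ** degree
    if metric == "rbf":
        return np.exp(-gamma * (diff ** 2).sum(axis=-1))
    if metric == "laplacian":
        return np.exp(-gamma * np.abs(diff).sum(axis=-1))
    if metric == "sigmoid":
        return np.tanh(gamma * dot + coef0)
    raise ValueError(f"Unknown metric: {metric}")


# Two orthogonal unit descriptors as a toy input.
x = np.array([[1.0, 0.0], [0.0, 1.0]])
for metric in ("linear", "polynomial", "rbf", "laplacian", "sigmoid"):
    print(metric)
    print(local_kernel(x, x, metric))
```

Only the RBF and Laplacian kernels are guaranteed to return values in (0, 1]; the other metrics depend on the scale of the descriptors.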

## Usage with Machine Learning

### Kernel Matrices for Classification

```python
from sklearn.svm import SVC
from dscribe.descriptors import SOAP
from dscribe.kernels import AverageKernel

# Prepare data
soap = SOAP(species=["C", "H", "O"], r_cut=5.0, n_max=8, l_max=6)
structures = [...]       # List of ASE Atoms objects (training set)
labels = [...]           # Target labels, one per training structure
test_structures = [...]  # List of ASE Atoms objects to classify

# Compute local descriptors
local_descriptors = [soap.create(struct) for struct in structures]

# Compute the training kernel matrix
kernel = AverageKernel(metric="rbf", gamma=1.0)
K_train = kernel.create(local_descriptors)  # Shape: (n_train, n_train)

# Use the precomputed kernel in scikit-learn
svm = SVC(kernel="precomputed")
svm.fit(K_train, labels)

# Predict on new data: rows are test structures, columns are training structures
new_descriptors = [soap.create(struct) for struct in test_structures]
K_test = kernel.create(new_descriptors, local_descriptors)  # Shape: (n_test, n_train)
predictions = svm.predict(K_test)
```

### Similarity Analysis

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Compute pairwise similarities
# (reuses `kernel` and `local_descriptors` from the previous example)
similarities = kernel.create(local_descriptors)

# Find the most similar distinct pairs: use only the upper triangle (i < j)
# so that self-similarities and mirrored duplicates are excluded
rows, cols = np.triu_indices_from(similarities, k=1)
top = np.argsort(similarities[rows, cols])[-10:][::-1]
most_similar_pairs = list(zip(rows[top], cols[top]))

# Cluster structures based on kernel similarities
clustering = SpectralClustering(
    n_clusters=3,
    affinity="precomputed",
    random_state=42,
)
cluster_labels = clustering.fit_predict(similarities)
```

## Parameter Selection Guidelines

### AverageKernel Parameters

- **metric="rbf", gamma=1.0**: A reasonable starting point; tune gamma to the scale of your descriptors
- **Higher gamma**: Similarity decays faster with descriptor distance, so the kernel is more sensitive to local differences
- **Lower gamma**: Similarity decays more slowly, so the kernel is more tolerant of local differences
- **normalize_kernel=True**: Usually recommended so that every structure has a self-similarity of 1
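
The effect of gamma can be seen by evaluating the RBF expression exp(-gamma * d²) for one fixed descriptor distance. In this small sketch the helper `rbf_similarity` and the squared distance of 0.5 are arbitrary values chosen for illustration.

```python
import math


def rbf_similarity(squared_distance, gamma):
    """RBF local similarity exp(-gamma * d^2) for a given squared descriptor distance."""
    return math.exp(-gamma * squared_distance)


# Hypothetical squared distance between two local descriptors.
squared_distance = 0.5

for gamma in (0.1, 1.0, 10.0):
    print(f"gamma={gamma:>4}: similarity={rbf_similarity(squared_distance, gamma):.4f}")
```

For the same pair of environments, the similarity drops as gamma grows, which is why gamma should be chosen relative to the typical distances between your descriptors.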

### REMatchKernel Parameters

- **alpha=0.1**: Default value; a moderate amount of regularization
- **Lower alpha (0.01-0.05)**: Less entropic regularization; the transport plan is sharper and approaches a best-match assignment
- **Higher alpha (0.5-1.0)**: More entropic regularization; the transport plan is smoother and the result approaches the average kernel
- **threshold=1e-6**: Sufficient precision for most applications

### Computational Considerations

- **AverageKernel**: Fast; the cost per structure pair scales with the number of environment pairs (N × M)
- **REMatchKernel**: Slower; adds Sinkhorn iterations on top of the same local kernel computation for every structure pair
- **Local descriptor size**: Affects both kernel computation time and memory usage
- **Number of structures**: The kernel matrix size scales as O(n²) for n structures

Choose AverageKernel for large datasets or when computational efficiency is critical. Use REMatchKernel when a more discriminating, matching-based similarity is needed and computational resources are available.
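
To put the quadratic growth of the kernel matrix in numbers, the following sketch estimates the memory of a dense matrix, assuming 8-byte float64 entries. The helper `kernel_matrix_bytes` is hypothetical and ignores any temporary arrays created while the kernel is computed.

```python
def kernel_matrix_bytes(n_structures, bytes_per_value=8):
    """Memory needed for a dense n x n kernel matrix (float64 by default)."""
    return n_structures ** 2 * bytes_per_value


for n in (1_000, 10_000, 100_000):
    gigabytes = kernel_matrix_bytes(n) / 1e9
    print(f"{n:>7} structures -> {gigabytes:.3f} GB")
```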