# Copy-Paste Detection

Copy-paste detection (CPD) for Scala uses Scalameta-based tokenization to find duplicated code in Scala projects. It integrates with PMD's duplicate-code detection framework, which compares the token streams of Scala source files to find repeated sequences.

## Core CPD Components

### ScalaCpdLexer

The primary tokenizer for copy-paste detection. It converts Scala source code into the tokens that CPD uses for duplication analysis.

```java { .api }
public class ScalaCpdLexer implements CpdLexer {
    public ScalaCpdLexer(LanguagePropertyBundle bundle);
    public void tokenize(TextDocument document, TokenFactory tokenEntries) throws IOException;
}
```

**Usage Example**:

```java
// Create the CPD lexer with the Scala module's default language properties
ScalaLanguageModule scala = ScalaLanguageModule.getInstance();
LanguagePropertyBundle bundle = scala.newPropertyBundle();
ScalaCpdLexer lexer = new ScalaCpdLexer(bundle);

// Wrap the Scala source in a read-only TextDocument and tokenize it
try (TextDocument document = TextDocument.readOnlyString(
        sourceCode, FileId.fromPathLikeString("Example.scala"), scala.getDefaultVersion())) {
    // Static helper on CpdLexer (PMD 7) that supplies the TokenFactory
    Tokens tokens = CpdLexer.tokenize(lexer, document);
    System.out.println("Generated " + tokens.size() + " tokens for CPD analysis");
} catch (IOException e) {
    System.err.println("Tokenization failed: " + e.getMessage());
}
```

### ScalaTokenAdapter

An adapter class that bridges Scalameta tokens to PMD's token interface.

```java { .api }
public class ScalaTokenAdapter {
    // Internal adapter implementation
    // Wraps a scala.meta.Token so that PMD's CPD machinery can consume it
}
```

**Internal Usage**:

```java
// Illustrative only: the adapter is used internally by ScalaCpdLexer.
// adapt(...) and CpdToken stand in for the internal conversion step.
scala.meta.Token scalametaToken = /* ... from Scalameta parsing ... */;
CpdToken pmdToken = ScalaTokenAdapter.adapt(scalametaToken, document);
tokenFactory.recordToken(pmdToken);
```

## Tokenization Process

### Token Generation Strategy

The CPD lexer processes Scala source code in four steps:

1. **Scalameta Parsing**: Tokenize the source code with Scalameta's tokenizer
2. **Token Filtering**: Drop comment and whitespace tokens
3. **Token Adaptation**: Convert the remaining Scalameta tokens to PMD's CPD format
4. **Position Mapping**: Preserve accurate source positions for every token

```java
// Simplified sketch of the internal tokenization process
public void tokenize(TextDocument document, TokenFactory tokenEntries) throws IOException {
    try {
        // Tokenize with Scalameta
        Input input = Input.String(document.getText().toString());
        Tokens tokens = input.tokenize().get();

        // Filter and adapt tokens
        for (Token token : tokens) {
            if (shouldIncludeToken(token)) {
                CpdToken cpdToken = adaptToken(token, document);
                tokenEntries.recordToken(cpdToken);
            }
        }
    } catch (Exception e) {
        throw new IOException("Scala tokenization failed", e);
    }
}
```
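
The position-mapping step can be pictured as translating a character offset into a 1-based line and column. The following is a minimal, self-contained sketch of that idea; `LineIndex` and its methods are hypothetical names used for illustration and are not part of PMD or Scalameta, both of which track positions through their own types.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: maps character offsets to 1-based (line, column) positions. */
final class LineIndex {
    private final List<Integer> lineStarts = new ArrayList<>();

    LineIndex(String source) {
        lineStarts.add(0); // the first line starts at offset 0
        for (int i = 0; i < source.length(); i++) {
            if (source.charAt(i) == '\n') {
                lineStarts.add(i + 1); // the next line starts right after the newline
            }
        }
    }

    /** Returns the 1-based line that contains the given offset (binary search). */
    int lineOf(int offset) {
        int lo = 0;
        int hi = lineStarts.size() - 1;
        while (lo < hi) {
            int mid = (lo + hi + 1) >>> 1;
            if (lineStarts.get(mid) <= offset) {
                lo = mid;
            } else {
                hi = mid - 1;
            }
        }
        return lo + 1;
    }

    /** Returns the 1-based column of the given offset within its line. */
    int columnOf(int offset) {
        return offset - lineStarts.get(lineOf(offset) - 1) + 1;
    }
}
```

Precomputing the line starts once keeps each lookup logarithmic in the number of lines, which matters when every token in a file needs a position.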

### Token Filtering Rules

The tokenizer filters the stream so that only semantically meaningful tokens reach the duplication analysis:

```java
protected boolean shouldIncludeToken(Token token) {
    // Keep identifiers, keywords, literals, and operators;
    // drop comments, whitespace, and line-break tokens
    return !(token instanceof Token.Comment
            || token instanceof Token.Space
            || token instanceof Token.Tab
            || token instanceof Token.LF
            || token instanceof Token.CRLF
            || token instanceof Token.FF);
}
```

## Integration with PMD CPD

### Language Module Integration

The language module exposes the lexer to CPD:

```java { .api }
public class ScalaLanguageModule extends SimpleLanguageModuleBase {
    @Override
    public CpdLexer createCpdLexer(LanguagePropertyBundle bundle) {
        return new ScalaCpdLexer(bundle);
    }
}
```

**Usage Example**:

```java
// Get the CPD lexer from the language module
ScalaLanguageModule module = ScalaLanguageModule.getInstance();
LanguagePropertyBundle bundle = module.newPropertyBundle(); // default properties; adjust as needed
CpdLexer lexer = module.createCpdLexer(bundle);

// Use the lexer for CPD analysis
lexer.tokenize(document, tokenEntries);
```

### CPD Configuration

Configure CPD analysis parameters for Scala projects:

```java
// CPD configuration for Scala (setter-style API of PMD 7's CPDConfiguration).
// Whether each ignore option takes effect depends on the language module.
CPDConfiguration config = new CPDConfiguration();
config.setMinimumTileSize(50);                                      // Minimum tokens for a duplication
config.setOnlyRecognizeLanguage(ScalaLanguageModule.getInstance()); // Use the Scala CPD lexer
config.setIgnoreAnnotations(true);                                  // Ignore annotation differences
config.setIgnoreIdentifiers(false);                                 // Consider identifier names
config.setIgnoreLiterals(true);                                     // Ignore literal value differences
```
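
To make the minimum tile size concrete: a duplication is reported only when two token sequences share a run of identical consecutive tokens at least that long. The sketch below illustrates the threshold with a naive quadratic comparison. It is not PMD's matching algorithm, and `TileSketch`, `longestCommonRun`, and `isReported` are hypothetical names.

```java
import java.util.List;

/** Hypothetical illustration of how a minimum tile size gates duplication reports. */
final class TileSketch {

    /** Length of the longest run of identical consecutive tokens shared by a and b. */
    static int longestCommonRun(List<String> a, List<String> b) {
        int best = 0;
        int[] prev = new int[b.size() + 1];
        for (int i = 1; i <= a.size(); i++) {
            int[] cur = new int[b.size() + 1];
            for (int j = 1; j <= b.size(); j++) {
                if (a.get(i - 1).equals(b.get(j - 1))) {
                    cur[j] = prev[j - 1] + 1; // extend the run that ended one token earlier
                    best = Math.max(best, cur[j]);
                }
            }
            prev = cur;
        }
        return best;
    }

    /** A shared run counts as a duplication only if it reaches the threshold. */
    static boolean isReported(List<String> a, List<String> b, int minimumTileSize) {
        return longestCommonRun(a, b) >= minimumTileSize;
    }
}
```

A low threshold flags short, often incidental matches, while a high threshold reports only substantial copies. That trade-off is why the value is usually tuned per project.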

## Duplication Detection Patterns

### Class-Level Duplication

CPD can detect duplicated code at several levels of granularity. At the class level, whole classes can share the same structure:

```scala
// Example: Similar class structures
class UserService {
  def findById(id: Long): Option[User] = {
    val query = "SELECT * FROM users WHERE id = ?"
    executeQuery(query, id).map(parseUser)
  }

  def save(user: User): Boolean = {
    val query = "INSERT INTO users (name, email) VALUES (?, ?)"
    executeUpdate(query, user.name, user.email) > 0
  }
}

class ProductService {
  def findById(id: Long): Option[Product] = {
    val query = "SELECT * FROM products WHERE id = ?"
    executeQuery(query, id).map(parseProduct) // Similar pattern
  }

  def save(product: Product): Boolean = {
    val query = "INSERT INTO products (name, price) VALUES (?, ?)"
    executeUpdate(query, product.name, product.price) > 0 // Similar pattern
  }
}
```

### Method-Level Duplication

```scala
// Example: Similar method implementations
def processUsers(users: List[User]): List[ProcessedUser] = {
  users.map { user =>
    val validated = validateUser(user)
    val normalized = normalizeUser(validated)
    val enriched = enrichUser(normalized)
    ProcessedUser(enriched)
  }
}

def processProducts(products: List[Product]): List[ProcessedProduct] = {
  products.map { product =>
    val validated = validateProduct(product)     // Similar structure
    val normalized = normalizeProduct(validated) // Similar structure
    val enriched = enrichProduct(normalized)     // Similar structure
    ProcessedProduct(enriched)
  }
}
```

### Expression-Level Duplication

```scala
// Example: Similar expression patterns
val userResult = Try {
  val data = fetchUserData(id)
  val parsed = parseUserData(data)
  val validated = validateUserData(parsed)
  validated
}.recover {
  case _: NetworkException => DefaultUser
  case _: ParseException => DefaultUser
}.get

val productResult = Try {
  val data = fetchProductData(id)             // Similar pattern
  val parsed = parseProductData(data)         // Similar pattern
  val validated = validateProductData(parsed) // Similar pattern
  validated
}.recover {
  case _: NetworkException => DefaultProduct // Similar pattern
  case _: ParseException => DefaultProduct   // Similar pattern
}.get
```

## CPD Analysis Results

### Duplication Report Format

CPD generates reports that identify each duplicated code block and the files in which it occurs:

```xml
<!-- Example CPD report for Scala -->
<pmd-cpd>
  <duplication lines="12" tokens="45">
    <file line="15" path="src/main/scala/UserService.scala"/>
    <file line="28" path="src/main/scala/ProductService.scala"/>
    <codefragment><![CDATA[
def findById(id: Long): Option[T] = {
  val query = "SELECT * FROM table WHERE id = ?"
  executeQuery(query, id).map(parseEntity)
}
    ]]></codefragment>
  </duplication>
</pmd-cpd>
```
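
A report in this shape can be post-processed with the JDK's built-in DOM parser alone. The sketch below assumes only the element and attribute names shown in the example (`duplication`, `file`, `lines`, `tokens`, `line`, `path`); `CpdReportReader` and `summarize` are hypothetical names, not part of PMD.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper that summarizes a CPD XML report held in a string. */
final class CpdReportReader {

    /** Returns one summary line per <duplication> element. */
    static List<String> summarize(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse CPD report", e);
        }

        List<String> summaries = new ArrayList<>();
        NodeList duplications = doc.getElementsByTagName("duplication");
        for (int i = 0; i < duplications.getLength(); i++) {
            Element dup = (Element) duplications.item(i);
            StringBuilder sb = new StringBuilder()
                    .append(dup.getAttribute("tokens")).append(" tokens, ")
                    .append(dup.getAttribute("lines")).append(" lines:");
            // Each <file> child marks one occurrence of the duplicated block
            NodeList files = dup.getElementsByTagName("file");
            for (int j = 0; j < files.getLength(); j++) {
                Element file = (Element) files.item(j);
                sb.append(' ').append(file.getAttribute("path"))
                  .append(':').append(file.getAttribute("line"));
            }
            summaries.add(sb.toString());
        }
        return summaries;
    }
}
```

Working on the XML output keeps such tooling independent of PMD's Java API, at the cost of depending on the report schema.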

### Programmatic Analysis

```java
// Analyze CPD results programmatically (PMD 7 report API)
public void analyzeCpdResults(CPDReport report) {
    for (Match duplication : report.getMatches()) {
        System.out.println("Found duplication:");
        System.out.println("  Tokens: " + duplication.getTokenCount());
        System.out.println("  Lines: " + duplication.getLineCount());

        // Each mark is one occurrence of the duplicated block
        for (Mark mark : duplication.getMarkSet()) {
            FileLocation location = mark.getLocation();
            System.out.println("  File: " + location.getFileId().getOriginalPath()
                    + " (line " + location.getStartLine() + ")");
        }

        System.out.println("  Code fragment:");
        System.out.println("  " + report.getSourceCodeSlice(duplication.getFirstMark()));
    }
}
```

## Advanced CPD Features

### Token Normalization

CPD can normalize tokens to detect duplicates that differ only in naming:

```java
// Configuration for token normalization (PMD 7 CPDConfiguration setters).
// Whether each option takes effect depends on the language module.
CPDConfiguration config = new CPDConfiguration();
config.setIgnoreIdentifiers(true);  // user/product → identifier
config.setIgnoreLiterals(true);     // "users"/"products" → string_literal
config.setIgnoreAnnotations(true);  // @Entity/@Component → annotation
```

This allows detection of structurally identical code that uses different names:

```scala
// These would be detected as duplicates with normalization
def saveUser(user: User) = repository.save(user)
def saveProduct(product: Product) = repository.save(product)

// Normalized tokens: def identifier(identifier: identifier) = identifier.identifier(identifier)
```
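
The effect of identifier normalization can be reproduced with a toy tokenizer. The sketch below replaces every non-keyword word with the placeholder `identifier` and compares the results. The regular expression, the keyword list, and the `NormalizationSketch` class are simplifying assumptions for illustration, not the Scalameta tokenizer or PMD's anonymization logic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration of identifier normalization on a toy token stream. */
final class NormalizationSketch {
    // A small, deliberately incomplete keyword list
    private static final Set<String> KEYWORDS = Set.of("def", "val", "var", "class", "object");
    // A word, or any single non-whitespace character
    private static final Pattern TOKEN = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*|\\S");

    /** Splits code into tokens and replaces every non-keyword word with "identifier". */
    static List<String> normalize(String code) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(code);
        while (m.find()) {
            String t = m.group();
            boolean word = Character.isLetter(t.charAt(0)) || t.charAt(0) == '_';
            out.add(word && !KEYWORDS.contains(t) ? "identifier" : t);
        }
        return out;
    }
}
```

With this normalization, `saveUser` and `saveProduct` from the example above produce identical token lists, so a token-based comparison treats them as the same code.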

### Custom Token Filtering

```java
// Assumes shouldIncludeToken is an overridable (protected) hook on ScalaCpdLexer
public class CustomScalaCpdLexer extends ScalaCpdLexer {
    public CustomScalaCpdLexer(LanguagePropertyBundle bundle) {
        super(bundle);
    }

    @Override
    protected boolean shouldIncludeToken(Token token) {
        // Custom filtering logic
        if (token instanceof Token.KwPrivate || token instanceof Token.KwProtected) {
            return false; // Ignore visibility modifiers
        }

        if (token instanceof Token.Ident && isTestMethodName((Token.Ident) token)) {
            return false; // Ignore test method names
        }

        return super.shouldIncludeToken(token);
    }

    private boolean isTestMethodName(Token.Ident token) {
        String name = token.value();
        return name.startsWith("test") || name.contains("should");
    }
}
```

### Integration with Build Tools

#### Maven Integration

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-pmd-plugin</artifactId>
  <configuration>
    <!-- CPD settings; Scala support may require adding the PMD Scala module as a plugin dependency -->
    <language>scala</language>
    <minimumTokens>50</minimumTokens>
  </configuration>
</plugin>
```

#### SBT Integration

```scala
// build.sbt
libraryDependencies += "net.sourceforge.pmd" % "pmd-scala_2.12" % "7.13.0"

// Custom CPD task
lazy val cpd = taskKey[Unit]("Run copy-paste detection")

cpd := {
  val classpath = (Compile / dependencyClasspath).value
  val sourceDir = (Compile / scalaSource).value

  // Run CPD analysis on Scala sources.
  // runCpdAnalysis is a placeholder for project-specific code that invokes CPD.
  runCpdAnalysis(sourceDir, classpath)
}
```

## Performance Considerations

### Tokenization Performance

```java
// Cache tokenization results for large codebases (uses Guava's Cache)
public class OptimizedScalaCpdLexer extends ScalaCpdLexer {
    private final Cache<String, Tokens> tokenCache =
            CacheBuilder.newBuilder()
                    .maximumSize(1000)
                    .expireAfterWrite(10, TimeUnit.MINUTES)
                    .build();

    public OptimizedScalaCpdLexer(LanguagePropertyBundle bundle) {
        super(bundle);
    }

    @Override
    public void tokenize(TextDocument document, TokenFactory tokenEntries) throws IOException {
        String content = document.getText().toString();

        try {
            // Keyed by file content, so identical content is tokenized only once
            Tokens tokens = tokenCache.get(content, () -> {
                Input input = Input.String(content);
                return input.tokenize().get();
            });

            // processTokens is a placeholder for the filter-and-record step
            processTokens(tokens, document, tokenEntries);
        } catch (ExecutionException e) {
            throw new IOException("Tokenization failed", e.getCause());
        }
    }
}
```

### Memory Management

```java
// Line-by-line processing for large files.
// Caveat: tokens that span lines (block comments, multi-line strings) need extra handling.
public void tokenizeLargeFile(TextDocument document, TokenFactory tokenEntries) throws IOException {
    try (Stream<String> lines = document.getText().toString().lines()) {
        lines.forEach(line -> {
            try {
                // tokenizeLine is a placeholder for a per-line tokenization helper
                tokenizeLine(line, tokenEntries);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
    } catch (UncheckedIOException e) {
        throw e.getCause(); // Restore the checked IOException for callers
    }
}
```

## Best Practices

### CPD Configuration Guidelines

1. **Minimum Token Size**: Set a threshold that suits the codebase (50-100 tokens is a common starting range)
2. **Ignore Settings**: Enable identifier, literal, and annotation normalization according to project needs
3. **File Filtering**: Exclude generated code and test fixtures
4. **Report Format**: Choose an output format that fits the consumer (for example text, XML, or CSV)
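
The file-filtering guideline can be applied before sources are handed to CPD. The sketch below drops paths that match exclusion globs using the JDK's `PathMatcher`. `SourceFilter`, `excludeMatching`, and the glob patterns are hypothetical and should be adapted to the project layout.

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper that removes excluded files from a CPD source list. */
final class SourceFilter {

    /** Returns the sources that match none of the given glob patterns. */
    static List<Path> excludeMatching(List<Path> sources, List<String> excludeGlobs) {
        List<PathMatcher> matchers = excludeGlobs.stream()
                .map(glob -> FileSystems.getDefault().getPathMatcher("glob:" + glob))
                .collect(Collectors.toList());
        return sources.stream()
                .filter(path -> matchers.stream().noneMatch(m -> m.matches(path)))
                .collect(Collectors.toList());
    }
}
```

For example, passing `"target/**"` as an exclusion keeps `src/main/scala/UserService.scala` but drops `target/src_managed/Gen.scala`.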

### Integration Strategies

1. **CI/CD Integration**: Run CPD as part of the build pipeline
2. **Quality Gates**: Fail the build when duplication exceeds an agreed threshold
3. **Trend Analysis**: Track duplication metrics over time
4. **Refactoring Guidance**: Use the reported duplicates to prioritize refactoring
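
A quality gate can be as simple as comparing the share of duplicated lines against a limit. The sketch below shows that arithmetic. `DuplicationGate` and the 5% limit in the usage note are hypothetical, and the inputs would come from a CPD report and a line count of the analyzed sources.

```java
/** Hypothetical quality gate based on the percentage of duplicated lines. */
final class DuplicationGate {

    /** Percentage of duplicated lines, in the range 0..100. */
    static double duplicationPercent(int duplicatedLines, int totalLines) {
        if (totalLines <= 0) {
            throw new IllegalArgumentException("totalLines must be positive");
        }
        return 100.0 * duplicatedLines / totalLines;
    }

    /** True when duplication stays at or below the allowed percentage. */
    static boolean passes(int duplicatedLines, int totalLines, double maxPercent) {
        return duplicationPercent(duplicatedLines, totalLines) <= maxPercent;
    }
}
```

With a 5% limit, 30 duplicated lines out of 1000 (3%) pass the gate, while 80 out of 1000 (8%) fail it.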

Together, these components give Scala codebases token-based duplication analysis within PMD, helping teams find and remove duplicated code.