Tessl Tile for maven/net.sourceforge.pmd/pmd-scala_2.12@7.13.0

# Copy-Paste Detection

The Scala module supports copy-paste detection (CPD) by tokenizing Scala sources with Scalameta. The resulting token stream feeds PMD's duplicate code detection framework, which compares token sequences across Scala source files to find duplicated code.

## Core CPD Components

### ScalaCpdLexer

Primary tokenizer for copy-paste detection that converts Scala source code into tokens for duplication analysis.

```java { .api }
public class ScalaCpdLexer implements CpdLexer {
    public ScalaCpdLexer(LanguagePropertyBundle bundle);
    public void tokenize(TextDocument document, TokenFactory tokenEntries) throws IOException;
}
```

**Usage Example**:

```java
// Create the CPD lexer with the Scala module's default language properties
ScalaLanguageModule module = ScalaLanguageModule.getInstance();
LanguagePropertyBundle bundle = module.newPropertyBundle();
ScalaCpdLexer lexer = new ScalaCpdLexer(bundle);

// Wrap the source text in a TextDocument (needs a file id and a language version)
TextDocument document = TextDocument.readOnlyString(
        sourceCode, FileId.fromPathLikeString("Example.scala"), module.getDefaultVersion());

// TokenFactory is an interface; the static CpdLexer.tokenize helper creates
// and closes one that records into the given Tokens (verify against your PMD version)
Tokens tokens = new Tokens();

try {
    CpdLexer.tokenize(lexer, document, tokens);
    System.out.println("Generated " + tokens.getTokens().size() + " tokens for CPD analysis");
} catch (IOException e) {
    System.err.println("Tokenization failed: " + e.getMessage());
}
```

### ScalaTokenAdapter

Adapter class that wraps a Scalameta token so that PMD can read its image and source location.

```java { .api }
public class ScalaTokenAdapter {
    // Internal adapter implementation
    // Exposes a scala.meta.tokens.Token through PMD's token interface
}
```

**Internal Usage**:

```java
// Used internally by ScalaCpdLexer (simplified; how the adapter is obtained is elided)
ScalaTokenAdapter adapted = // ... produced while iterating the Scalameta tokens
tokenFactory.recordToken(adapted.getImage(), adapted.getReportLocation());
```

## Tokenization Process

### Token Generation Strategy

The CPD lexer processes Scala source code through the following steps:

1. **Scalameta Tokenization**: Tokenize the source code with Scalameta's tokenizer
2. **Token Filtering**: Filter out comments and whitespace tokens
3. **Token Adaptation**: Convert Scalameta tokens to PMD CPD format
4. **Position Mapping**: Maintain accurate source position information

```java
// Simplified sketch of the tokenization flow, not the actual PMD source.
// shouldIncludeToken and adaptToken are illustrative helper names.
public void tokenize(TextDocument document, TokenFactory tokenEntries) throws IOException {
    try {
        // Tokenize with Scalameta
        Input input = Input.String(document.getText().toString());
        Tokens tokens = input.tokenize().get();

        // Filter and adapt tokens
        for (Token token : tokens) {
            if (shouldIncludeToken(token)) {
                ScalaTokenAdapter adapted = adaptToken(token, document);
                // Record the token image together with its source location
                tokenEntries.recordToken(adapted.getImage(), adapted.getReportLocation());
            }
        }
    } catch (Exception e) {
        throw new IOException("Scala tokenization failed", e);
    }
}
```

### Token Filtering Rules

The tokenizer applies filtering rules to focus on semantically meaningful tokens:

```java
private boolean shouldIncludeToken(Token token) {
    // Include identifiers, keywords, literals, operators
    // Exclude comments, whitespace, formatting tokens
    return !(token instanceof Token.Comment ||
             token instanceof Token.Space ||
             token instanceof Token.Tab ||
             token instanceof Token.LF ||
             token instanceof Token.CRLF ||
             token instanceof Token.FF);
}
```
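
To see why dropping trivia matters, here is a minimal, self-contained sketch. It assumes a deliberately naive tokenizer (the hypothetical `TriviaFilterSketch`, which is not part of PMD or Scalameta) that splits on whitespace and strips `//` line comments. Two snippets that differ only in layout and comments produce the same token list, which is the property duplicate detection depends on.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; the real lexer delegates to Scalameta. */
final class TriviaFilterSketch {

    /** Returns the tokens of a source string with whitespace and line comments removed. */
    static List<String> significantTokens(String source) {
        List<String> tokens = new ArrayList<>();
        for (String line : source.split("\\R")) {
            // Drop a trailing line comment (naive: ignores "//" inside string literals)
            int comment = line.indexOf("//");
            String code = comment >= 0 ? line.substring(0, comment) : line;
            // Whitespace only separates tokens; it is never recorded
            for (String piece : code.trim().split("\\s+")) {
                if (!piece.isEmpty()) {
                    tokens.add(piece);
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String compact = "val total = price + tax";
        String spread = "val total =   // running sum\n    price + tax";
        System.out.println(significantTokens(compact));
        // Same tokens despite different layout and an extra comment
        System.out.println(significantTokens(compact).equals(significantTokens(spread)));
    }
}
```

A real tokenizer also splits operators and punctuation that are not separated by spaces; the sketch skips that to keep the filtering step visible.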

## Integration with PMD CPD

### Language Module Integration

CPD integration is handled through the language module:

```java { .api }
public class ScalaLanguageModule extends SimpleLanguageModuleBase {
    @Override
    public CpdLexer createCpdLexer(LanguagePropertyBundle bundle) {
        return new ScalaCpdLexer(bundle);
    }
}
```

**Usage Example**:

```java
// Get CPD lexer from language module
ScalaLanguageModule module = ScalaLanguageModule.getInstance();
LanguagePropertyBundle bundle = module.newPropertyBundle(); // default properties
CpdLexer lexer = module.createCpdLexer(bundle);

// Use lexer for CPD analysis (document and tokenEntries are supplied by the CPD run)
lexer.tokenize(document, tokenEntries);
```

### CPD Configuration

Configure CPD analysis parameters for Scala projects:

```java
// CPD configuration for Scala (PMD 7 uses plain setters rather than a builder)
CPDConfiguration config = new CPDConfiguration();
config.setMinimumTileSize(50);        // Minimum tokens for duplication
config.setOnlyRecognizeLanguage(      // Use Scala CPD lexer
        config.getLanguageRegistry().getLanguageById("scala"));
config.setIgnoreAnnotations(true);    // Ignore annotation differences
config.setIgnoreIdentifiers(false);   // Consider identifier names
config.setIgnoreLiterals(true);       // Ignore literal value differences
```
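
The minimum tile size is the shortest run of matching tokens that counts as a duplication. The sketch below illustrates that threshold with a naive dynamic-programming comparison of two token lists. It is an assumption-laden teaching aid: `TileSizeSketch` and its methods are hypothetical names, and PMD's own matcher uses a hash-based algorithm rather than this quadratic loop.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration; not PMD's matching algorithm. */
final class TileSizeSketch {

    /** Length of the longest run of identical consecutive tokens shared by a and b. */
    static int longestCommonRun(List<String> a, List<String> b) {
        int best = 0;
        // prev[j] = length of the common run ending at a[i-2] and b[j-1]
        int[] prev = new int[b.size() + 1];
        for (int i = 1; i <= a.size(); i++) {
            int[] curr = new int[b.size() + 1];
            for (int j = 1; j <= b.size(); j++) {
                if (a.get(i - 1).equals(b.get(j - 1))) {
                    // Extend the run that ended one token earlier in both lists
                    curr[j] = prev[j - 1] + 1;
                    best = Math.max(best, curr[j]);
                }
            }
            prev = curr;
        }
        return best;
    }

    /** A duplication is reported only when the shared run reaches the minimum tile size. */
    static boolean isReported(List<String> a, List<String> b, int minimumTileSize) {
        return longestCommonRun(a, b) >= minimumTileSize;
    }

    public static void main(String[] args) {
        List<String> first = Arrays.asList("val", "q", "=", "lit", ";", "run", "(", "q", ")");
        List<String> second = Arrays.asList("log", "(", ")", ";", "run", "(", "q", ")");
        System.out.println(longestCommonRun(first, second)); // shared run: ; run ( q )
        System.out.println(isReported(first, second, 3));
        System.out.println(isReported(first, second, 50));
    }
}
```

With a threshold of 50, short shared idioms like the one above are ignored, which is why a low threshold tends to produce noisy reports.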

## Duplication Detection Patterns

### Class-Level Duplication

CPD can detect duplicated code patterns at various levels, starting with whole classes that share a structure:

```scala
// Example: Similar class structures
class UserService {
    def findById(id: Long): Option[User] = {
        val query = "SELECT * FROM users WHERE id = ?"
        executeQuery(query, id).map(parseUser)
    }

    def save(user: User): Boolean = {
        val query = "INSERT INTO users (name, email) VALUES (?, ?)"
        executeUpdate(query, user.name, user.email) > 0
    }
}

class ProductService {
    def findById(id: Long): Option[Product] = {
        val query = "SELECT * FROM products WHERE id = ?"
        executeQuery(query, id).map(parseProduct)  // Similar pattern
    }

    def save(product: Product): Boolean = {
        val query = "INSERT INTO products (name, price) VALUES (?, ?)"
        executeUpdate(query, product.name, product.price) > 0  // Similar pattern
    }
}
```

### Method-Level Duplication

```scala
// Example: Similar method implementations
def processUsers(users: List[User]): List[ProcessedUser] = {
    users.map { user =>
        val validated = validateUser(user)
        val normalized = normalizeUser(validated)
        val enriched = enrichUser(normalized)
        ProcessedUser(enriched)
    }
}

def processProducts(products: List[Product]): List[ProcessedProduct] = {
    products.map { product =>
        val validated = validateProduct(product)      // Similar structure
        val normalized = normalizeProduct(validated)  // Similar structure
        val enriched = enrichProduct(normalized)      // Similar structure
        ProcessedProduct(enriched)
    }
}
```

### Expression-Level Duplication

```scala
// Example: Similar expression patterns
val userResult = Try {
    val data = fetchUserData(id)
    val parsed = parseUserData(data)
    val validated = validateUserData(parsed)
    validated
}.recover {
    case _: NetworkException => DefaultUser
    case _: ParseException => DefaultUser
}.get

val productResult = Try {
    val data = fetchProductData(id)             // Similar pattern
    val parsed = parseProductData(data)         // Similar pattern
    val validated = validateProductData(parsed) // Similar pattern
    validated
}.recover {
    case _: NetworkException => DefaultProduct  // Similar pattern
    case _: ParseException => DefaultProduct    // Similar pattern
}.get
```

## CPD Analysis Results

### Duplication Report Format

CPD generates reports identifying duplicated code blocks:

```xml
<!-- Example CPD report for Scala -->
<pmd-cpd>
    <duplication lines="12" tokens="45">
        <file line="15" path="src/main/scala/UserService.scala"/>
        <file line="28" path="src/main/scala/ProductService.scala"/>
        <codefragment><![CDATA[
def findById(id: Long): Option[T] = {
    val query = "SELECT * FROM table WHERE id = ?"
    executeQuery(query, id).map(parseEntity)
}
        ]]></codefragment>
    </duplication>
</pmd-cpd>
```
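
A report in this shape can be post-processed with the JDK's built-in DOM parser. The following is a minimal sketch that assumes only the elements and attributes shown in the example above (`duplication`, `file`, `lines`, `tokens`, `path`, `line`); the class name `CpdReportReaderSketch` is hypothetical, and real reports may carry additional attributes that the sketch ignores.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper that summarizes a CPD XML report. */
final class CpdReportReaderSketch {

    /** Returns one summary line per duplication element. */
    static List<String> summarize(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse CPD report", e);
        }
        List<String> summaries = new ArrayList<>();
        NodeList duplications = doc.getElementsByTagName("duplication");
        for (int i = 0; i < duplications.getLength(); i++) {
            Element duplication = (Element) duplications.item(i);
            StringBuilder line = new StringBuilder()
                    .append(duplication.getAttribute("tokens")).append(" tokens, ")
                    .append(duplication.getAttribute("lines")).append(" lines:");
            // Each file child marks one location of the duplicated block
            NodeList files = duplication.getElementsByTagName("file");
            for (int j = 0; j < files.getLength(); j++) {
                Element file = (Element) files.item(j);
                line.append(' ').append(file.getAttribute("path"))
                    .append(':').append(file.getAttribute("line"));
            }
            summaries.add(line.toString());
        }
        return summaries;
    }

    public static void main(String[] args) {
        String xml = "<pmd-cpd><duplication lines=\"12\" tokens=\"45\">"
                + "<file line=\"15\" path=\"UserService.scala\"/>"
                + "<file line=\"28\" path=\"ProductService.scala\"/>"
                + "</duplication></pmd-cpd>";
        summarize(xml).forEach(System.out::println);
    }
}
```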

### Programmatic Analysis

```java
// Analyze CPD results programmatically (PMD 7 API; verify against your version)
public void analyzeCpdResults(CPDReport report) {
    for (Match duplication : report.getMatches()) {
        System.out.println("Found duplication:");
        System.out.println("  Tokens: " + duplication.getTokenCount());
        System.out.println("  Lines: " + duplication.getLineCount());

        // A Match iterates over its marks, one per duplicated location
        for (Mark mark : duplication) {
            FileLocation location = mark.getLocation();
            System.out.println("  File: " + location.getFileId().getOriginalPath() +
                             " (line " + location.getStartLine() + ")");
        }

        System.out.println("  Code fragment:");
        System.out.println("  " + report.getSourceCodeSlice(duplication.getFirstMark()));
    }
}
```

## Advanced CPD Features

### Token Normalization

CPD can normalize tokens to detect duplications that differ only in naming or literal values. These options take effect only for languages whose CPD lexer implements them, so confirm that your PMD version honors them for Scala before relying on them:

```java
// Configuration for token normalization
CPDConfiguration config = new CPDConfiguration();
config.setIgnoreIdentifiers(true);    // user/product → identifier
config.setIgnoreLiterals(true);       // "users"/"products" → string_literal
config.setIgnoreAnnotations(true);    // @Entity/@Component → annotation
```

This allows detection of functionally identical code with different names:

```scala
// These would be detected as duplicates with normalization
def saveUser(user: User) = repository.save(user)
def saveProduct(product: Product) = repository.save(product)

// Normalized tokens: def identifier(identifier: identifier) = identifier.identifier(identifier)
```
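
The effect of normalization can be reproduced in isolation. The sketch below is a hypothetical illustration (`NormalizationSketch` and its tiny keyword set are assumptions, not PMD code): it replaces every non-keyword identifier and every string literal with a placeholder, after which the two method definitions above compare equal.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration of identifier and literal normalization. */
final class NormalizationSketch {

    // Deliberately small keyword set, enough for the example
    private static final Set<String> KEYWORDS =
            new HashSet<>(Arrays.asList("def", "val", "var", "class"));

    /** Replaces identifiers and string literals with placeholders; expects non-empty tokens. */
    static List<String> normalize(List<String> tokens) {
        List<String> normalized = new ArrayList<>();
        for (String token : tokens) {
            if (token.startsWith("\"")) {
                normalized.add("<literal>");
            } else if (Character.isJavaIdentifierStart(token.charAt(0))
                    && !KEYWORDS.contains(token)) {
                normalized.add("<identifier>");
            } else {
                // Keywords, operators and punctuation are kept as-is
                normalized.add(token);
            }
        }
        return normalized;
    }

    public static void main(String[] args) {
        List<String> saveUser = Arrays.asList("def", "saveUser", "(", "user", ":", "User", ")",
                "=", "repository", ".", "save", "(", "user", ")");
        List<String> saveProduct = Arrays.asList("def", "saveProduct", "(", "product", ":",
                "Product", ")", "=", "repository", ".", "save", "(", "product", ")");
        System.out.println(saveUser.equals(saveProduct));
        System.out.println(normalize(saveUser).equals(normalize(saveProduct)));
    }
}
```

The trade-off is precision: once names are erased, structurally similar but unrelated code can also match, so normalization is usually paired with a higher minimum tile size.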

### Custom Token Filtering

```java
// Illustrative only: assumes the lexer exposes an overridable shouldIncludeToken hook
public class CustomScalaCpdLexer extends ScalaCpdLexer {
    public CustomScalaCpdLexer(LanguagePropertyBundle bundle) {
        super(bundle);
    }

    @Override
    protected boolean shouldIncludeToken(Token token) {
        // Custom filtering logic
        if (token instanceof Token.KwPrivate || token instanceof Token.KwProtected) {
            return false; // Ignore visibility modifiers
        }

        // Cast is required because isTestMethodName takes a Token.Ident
        if (token instanceof Token.Ident && isTestMethodName((Token.Ident) token)) {
            return false; // Ignore test method names
        }

        return super.shouldIncludeToken(token);
    }

    private boolean isTestMethodName(Token.Ident token) {
        String name = token.value();
        return name.startsWith("test") || name.contains("should");
    }
}
```

### Integration with Build Tools

#### Maven Integration

```xml
<!-- CPD ships with the Maven PMD plugin (cpd and cpd-check goals), not SpotBugs -->
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-pmd-plugin</artifactId>
    <configuration>
        <!-- Scala support may also require pmd-scala on the plugin's classpath -->
        <language>scala</language>
        <minimumTokens>50</minimumTokens>
        <format>xml</format>
    </configuration>
</plugin>
```

#### SBT Integration

```scala
// build.sbt
libraryDependencies += "net.sourceforge.pmd" % "pmd-scala_2.12" % "7.13.0"

// Custom CPD task
lazy val cpd = taskKey[Unit]("Run copy-paste detection")

cpd := {
    val classpath = (Compile / dependencyClasspath).value
    val sourceDir = (Compile / scalaSource).value

    // Run CPD analysis on Scala sources
    // (runCpdAnalysis is a placeholder for project-specific code)
    runCpdAnalysis(sourceDir, classpath)
}
```

## Performance Considerations

### Tokenization Performance

```java
// Optimize tokenization for large codebases by caching Scalameta tokens.
// Uses Guava's Cache; processTokens is an illustrative helper that filters
// and records the tokens.
public class OptimizedScalaCpdLexer extends ScalaCpdLexer {
    private final Cache<String, Tokens> tokenCache =
        CacheBuilder.newBuilder()
            .maximumSize(1000)
            .expireAfterWrite(10, TimeUnit.MINUTES)
            .build();

    @Override
    public void tokenize(TextDocument document, TokenFactory tokenEntries) throws IOException {
        // The full file content is the cache key, so identical files are tokenized once
        String content = document.getText().toString();

        try {
            Tokens tokens = tokenCache.get(content, () -> {
                Input input = Input.String(content);
                return input.tokenize().get();
            });

            processTokens(tokens, document, tokenEntries);
        } catch (ExecutionException e) {
            throw new IOException("Tokenization failed", e.getCause());
        }
    }
}
```
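
### Memory Management

```java
// Line-by-line processing for large files.
// Caveat: tokens that span lines (block comments, triple-quoted strings)
// are not handled correctly by a per-line tokenizer.
// tokenizeLine is an illustrative helper.
public void tokenizeLargeFile(TextDocument document, TokenFactory tokenEntries) throws IOException {
    try (Stream<Chars> lines = document.getText().lineStream()) {
        lines.forEach(line -> {
            try {
                tokenizeLine(line.toString(), tokenEntries);
            } catch (IOException e) {
                // Lambdas cannot throw checked exceptions, so wrap and unwrap below
                throw new UncheckedIOException(e);
            }
        });
    } catch (UncheckedIOException e) {
        throw e.getCause();
    }
}
```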

## Best Practices

### CPD Configuration Guidelines

1. **Minimum Token Size**: Start with a threshold of 50-100 tokens and raise it if the report is noisy
2. **Ignore Settings**: Enable identifier, literal, and annotation normalization only where the project benefits from it
3. **File Filtering**: Exclude generated code and test fixtures
4. **Report Format**: Choose the output format that fits the consumer (for example text, XML, or CSV)

### Integration Strategies

1. **CI/CD Integration**: Run CPD as part of the build pipeline
2. **Quality Gates**: Fail or flag the build when duplication exceeds an agreed threshold
3. **Trend Analysis**: Track duplication metrics over time
4. **Refactoring Guidance**: Use the reported locations to prioritize refactoring
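
A quality gate can be as simple as comparing the share of duplicated lines against a limit. The sketch below is one possible way to express that check; `DuplicationGateSketch`, its method names, and the idea of measuring duplication as a percentage of total lines are assumptions for illustration, not something CPD prescribes.

```java
/** Hypothetical quality gate based on the percentage of duplicated lines. */
final class DuplicationGateSketch {

    /** Percentage of lines that belong to a reported duplication. */
    static double duplicationPercent(int duplicatedLines, int totalLines) {
        if (totalLines <= 0) {
            throw new IllegalArgumentException("totalLines must be positive");
        }
        return 100.0 * duplicatedLines / totalLines;
    }

    /** The gate passes when duplication stays at or below the allowed percentage. */
    static boolean passes(int duplicatedLines, int totalLines, double maxPercent) {
        return duplicationPercent(duplicatedLines, totalLines) <= maxPercent;
    }

    public static void main(String[] args) {
        // 120 duplicated lines in a 4000-line codebase
        System.out.println(duplicationPercent(120, 4000));
        System.out.println(passes(120, 4000, 5.0));
        System.out.println(passes(120, 4000, 2.0));
    }
}
```

The duplicated line count would come from summing the `lines` attribute of each reported duplication, and recording the percentage per build gives the data needed for trend analysis.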

Together, the Scalameta-based lexer and PMD's CPD framework let teams find duplicated Scala code, track it as part of the regular build, and remove it through targeted refactoring.
