# Copy-Paste Detection

Scala tokenization support for PMD's copy-paste detection (CPD) system, providing language-specific tokenization and filtering for duplicate code analysis. The CPD system identifies duplicated code blocks across Scala source files.
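
To build intuition for what the tokenizer feeds into, here is a minimal, stdlib-only sketch of the underlying idea behind duplicate detection: hash fixed-size windows ("tiles") of consecutive tokens and report windows seen more than once. This is an illustration only, not PMD's actual algorithm; the class and method names are hypothetical.

```java
import java.util.*;

// Illustration only (not PMD's implementation): find repeated
// windows of `tile` consecutive tokens in a token stream.
public class TileSketch {

    // Returns the distinct windows of `tile` consecutive tokens
    // that occur more than once in the stream.
    static Set<List<String>> duplicateTiles(List<String> tokens, int tile) {
        Map<List<String>, Integer> seen = new HashMap<>();
        for (int i = 0; i + tile <= tokens.size(); i++) {
            // Copy the window so the map key is independent of the backing list
            seen.merge(new ArrayList<>(tokens.subList(i, i + tile)), 1, Integer::sum);
        }
        Set<List<String>> dups = new HashSet<>();
        for (Map.Entry<List<String>, Integer> e : seen.entrySet()) {
            if (e.getValue() > 1) {
                dups.add(e.getKey());
            }
        }
        return dups;
    }

    public static void main(String[] args) {
        // "val x = 1" appears twice, so its 4-token tile is reported
        List<String> tokens = Arrays.asList(
                "val", "x", "=", "1", ";", "val", "x", "=", "1");
        System.out.println(duplicateTiles(tokens, 4));
    }
}
```

Real CPD is considerably more involved (it tracks source locations and grows matches beyond the minimum tile size), but the tile-hashing idea is the core of it.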
## Capabilities

### Scala Language Registration

Registers Scala language support with PMD's copy-paste detection system.
```java { .api }
/**
 * Language implementation for Scala CPD.
 */
public class ScalaLanguage extends AbstractLanguage {

    /**
     * Creates a new Scala Language instance for CPD.
     * Registers with name "Scala", terse name "scala", and file extension ".scala".
     */
    public ScalaLanguage();
}
```
**Usage Examples:**

```java
// The language is registered automatically via SPI
// (META-INF/services/net.sourceforge.pmd.cpd.Language),
// so manual instantiation is not typically needed.
ScalaLanguage scalaLang = new ScalaLanguage();
System.out.println("Language: " + scalaLang.getName());         // "Scala"
System.out.println("Extensions: " + scalaLang.getExtensions()); // [.scala]
```
### Scala Tokenizer

Main tokenizer implementation that processes Scala source code into tokens for duplicate detection.
```java { .api }
/**
 * Scala Tokenizer class. Uses the Scalameta tokenizer, but adapts it
 * for use with PMD's generic token filtering.
 */
public class ScalaTokenizer implements Tokenizer {

    /**
     * Property key for specifying the Scala version dialect.
     * Valid values: "2.10", "2.11", "2.12", "2.13".
     */
    public static final String SCALA_VERSION_PROPERTY = "net.sourceforge.pmd.scala.version";

    /**
     * Creates the tokenizer using properties from the system environment.
     * Falls back to the default Scala version if SCALA_VERSION_PROPERTY is not set.
     */
    public ScalaTokenizer();

    /**
     * Creates the tokenizer from a given set of properties.
     *
     * @param properties the Properties object containing configuration
     */
    public ScalaTokenizer(Properties properties);

    /**
     * Tokenizes source code for copy-paste detection.
     *
     * @param sourceCode   the source code to tokenize
     * @param tokenEntries output collection for generated tokens
     * @throws IOException if reading the source fails
     */
    @Override
    public void tokenize(SourceCode sourceCode, Tokens tokenEntries) throws IOException;
}
```
**Usage Examples:**

```java
import net.sourceforge.pmd.cpd.*;
import java.io.IOException;
import java.util.Properties;

// Create a tokenizer with default settings
ScalaTokenizer tokenizer = new ScalaTokenizer();

// Create a tokenizer for a specific Scala version
Properties props = new Properties();
props.setProperty(ScalaTokenizer.SCALA_VERSION_PROPERTY, "2.12");
ScalaTokenizer tokenizer212 = new ScalaTokenizer(props);

// Tokenize source code (SourceCode is built from a CodeLoader)
String code = "object HelloWorld { def main(args: Array[String]): Unit = println(\"Hello\") }";
SourceCode sourceCode = new SourceCode(new SourceCode.StringCodeLoader(code, "HelloWorld.scala"));
Tokens tokens = new Tokens();

try {
    tokenizer.tokenize(sourceCode, tokens);

    // Process tokens for duplicate detection
    for (TokenEntry token : tokens.getTokens()) {
        System.out.println("Token: " + token.getValue()
                + " at line " + token.getBeginLine()
                + ", column " + token.getBeginColumn());
    }
} catch (IOException e) {
    System.err.println("Tokenization failed: " + e.getMessage());
}
```
### Token Filtering and Processing

The tokenizer filters the token stream so that only tokens meaningful for duplicate detection are kept, while irrelevant elements are ignored.

**Filtered Token Types:**

The tokenizer automatically filters out these Scalameta token types:

- `Token.Space` - whitespace characters
- `Token.Tab` - tab characters
- `Token.CR` - carriage returns
- `Token.LF` - line feeds
- `Token.FF` - form feeds
- `Token.LFLF` - double line feeds
- `Token.EOF` - end-of-file markers
- `Token.Comment` - comments (handled separately)
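
Conceptually, the filter is just a predicate over token kinds. The stdlib-only sketch below illustrates that decision; the real code matches Scalameta token classes rather than strings, and the names here are stand-ins for the list above.

```java
import java.util.*;

// Illustration only: the filtering decision as a predicate over
// token-kind names. The real tokenizer matches Scalameta token
// classes (Token.Space, Token.Comment, ...), not strings.
public class TriviaFilter {

    // Kinds that carry no signal for duplicate detection
    private static final Set<String> IGNORED = new HashSet<>(Arrays.asList(
            "Space", "Tab", "CR", "LF", "FF", "LFLF", "EOF", "Comment"));

    static boolean isSignificant(String tokenKind) {
        return !IGNORED.contains(tokenKind);
    }

    public static void main(String[] args) {
        System.out.println(isSignificant("Ident")); // true  - kept
        System.out.println(isSignificant("Space")); // false - filtered out
    }
}
```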
**Comment Handling:**

Comments receive special processing to support PMD's comment-based features:
```java { .api }
/**
 * Internal token adapter that wraps Scalameta tokens for PMD compatibility.
 */
public class ScalaTokenAdapter implements GenericToken {

    /**
     * Creates a token adapter with optional previous-comment context.
     *
     * @param scalaToken      the underlying Scalameta token
     * @param previousComment the most recent comment token, for context
     */
    public ScalaTokenAdapter(Token scalaToken, GenericToken previousComment);

    @Override
    public String getImage();

    @Override
    public int getBeginLine();

    @Override
    public int getBeginColumn();

    @Override
    public int getEndLine();

    @Override
    public int getEndColumn();

    @Override
    public GenericToken getPreviousComment();
}
```
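
The `previousComment` link means each significant token remembers the most recent comment that preceded it, which is what makes comment-based features (e.g. `CPD-OFF`/`CPD-ON` suppression markers) possible even though comments themselves are filtered out. A minimal stdlib-only sketch of that chaining, with hypothetical `Tok`/`Adapted` stand-ins for the real token types:

```java
import java.util.*;

// Illustration only: each emitted token carries the most recent
// comment seen before it; the comments themselves are dropped.
public class CommentChainSketch {
    record Tok(String image, boolean isComment) {}
    record Adapted(String image, String previousComment) {}

    static List<Adapted> adapt(List<Tok> tokens) {
        List<Adapted> out = new ArrayList<>();
        String lastComment = null;
        for (Tok t : tokens) {
            if (t.isComment()) {
                lastComment = t.image(); // remember, but emit nothing
            } else {
                out.add(new Adapted(t.image(), lastComment));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Adapted> r = adapt(List.of(
                new Tok("// CPD-OFF", true),
                new Tok("val", false)));
        // The "val" token carries the preceding comment as context
        System.out.println(r.get(0).previousComment()); // "// CPD-OFF"
    }
}
```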
### Version-Specific Tokenization

The tokenizer supports different Scala versions through dialect configuration, ensuring accurate tokenization of version-specific syntax.

**Supported Dialects:**
```java
// Version selection through properties
Properties versionProps = new Properties();

// Scala 2.10
versionProps.setProperty(ScalaTokenizer.SCALA_VERSION_PROPERTY, "2.10");
ScalaTokenizer tokenizer210 = new ScalaTokenizer(versionProps);

// Scala 2.11
versionProps.setProperty(ScalaTokenizer.SCALA_VERSION_PROPERTY, "2.11");
ScalaTokenizer tokenizer211 = new ScalaTokenizer(versionProps);

// Scala 2.12
versionProps.setProperty(ScalaTokenizer.SCALA_VERSION_PROPERTY, "2.12");
ScalaTokenizer tokenizer212 = new ScalaTokenizer(versionProps);

// Scala 2.13 (default)
versionProps.setProperty(ScalaTokenizer.SCALA_VERSION_PROPERTY, "2.13");
ScalaTokenizer tokenizer213 = new ScalaTokenizer(versionProps);

// Or use a system property
System.setProperty(ScalaTokenizer.SCALA_VERSION_PROPERTY, "2.12");
ScalaTokenizer systemTokenizer = new ScalaTokenizer();
```
### Error Handling in Tokenization

The tokenizer handles various error conditions during tokenization:

```java { .api }
// Exception types thrown by the tokenizer
import net.sourceforge.pmd.lang.ast.TokenMgrError;
import scala.meta.tokenizers.TokenizeException;
```
**Error Handling Examples:**

```java
try {
    tokenizer.tokenize(sourceCode, tokens);
} catch (IOException e) {
    // I/O errors while reading the source
    System.err.println("Failed to read source: " + e.getMessage());
} catch (TokenMgrError e) {
    // Tokenization errors from Scalameta
    System.err.println("Tokenization error at line " + e.getLine()
            + ", column " + e.getColumn() + ": " + e.getMessage());

    // The original Scalameta exception is available as the cause
    if (e.getCause() instanceof TokenizeException) {
        TokenizeException originalError = (TokenizeException) e.getCause();
        System.err.println("Scalameta error: " + originalError.getMessage());
    }
} catch (Exception e) {
    // Other unexpected errors
    System.err.println("Unexpected tokenization error: " + e.getMessage());
}
```
### Integration with PMD CPD

The Scala tokenizer integrates with PMD's copy-paste detection pipeline:

**Complete CPD Integration Example:**
```java
import net.sourceforge.pmd.cpd.*;
import java.io.File;
import java.util.Iterator;

// Create a CPD configuration for Scala
CPDConfiguration config = new CPDConfiguration();
config.setMinimumTileSize(50); // minimum duplicate size, in tokens
config.setLanguage(new ScalaLanguage());

// Run CPD analysis over Scala source files
// (CPD.add takes File arguments, not SourceCode)
CPD cpd = new CPD(config);
cpd.add(new File("File1.scala"));
cpd.add(new File("File2.scala"));
cpd.go();

// Process results
Iterator<Match> matches = cpd.getMatches();
while (matches.hasNext()) {
    Match match = matches.next();
    System.out.println("Duplicate found:");
    System.out.println("  Size: " + match.getTokenCount() + " tokens");
    System.out.println("  Lines: " + match.getLineCount());

    for (Mark mark : match.getMarkSet()) {
        System.out.println("  Location: " + mark.getFilename()
                + " (line " + mark.getBeginLine() + ")");
    }
}
```
### Advanced Tokenization Features

**Custom Token Processing:**
```java
import net.sourceforge.pmd.cpd.*;
import java.io.IOException;
import java.util.Properties;

// Post-process the token stream for custom analyses
public class CustomScalaTokenProcessor {
    public void processTokens(SourceCode sourceCode) {
        Properties props = new Properties();
        props.setProperty(ScalaTokenizer.SCALA_VERSION_PROPERTY, "2.12");

        ScalaTokenizer tokenizer = new ScalaTokenizer(props);
        Tokens tokens = new Tokens();

        try {
            tokenizer.tokenize(sourceCode, tokens);

            // Custom processing of the token stream
            for (TokenEntry token : tokens.getTokens()) {
                if (token.getValue().matches("[A-Z][a-zA-Z]*")) {
                    // Likely a class/object/type name
                    System.out.println("Potential type name: " + token.getValue());
                } else if (token.getValue().matches("[a-z][a-zA-Z]*")) {
                    // Likely a method/variable name (or a keyword)
                    System.out.println("Potential member name: " + token.getValue());
                }
            }
        } catch (IOException e) {
            System.err.println("Processing failed: " + e.getMessage());
        }
    }
}
```
**Token Statistics:**

```java
import net.sourceforge.pmd.cpd.*;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class TokenStatistics {
    public void analyzeTokens(SourceCode sourceCode) throws IOException {
        ScalaTokenizer tokenizer = new ScalaTokenizer();
        Tokens tokens = new Tokens();
        tokenizer.tokenize(sourceCode, tokens);

        Map<String, Integer> tokenCounts = new HashMap<>();
        int totalTokens = 0;

        for (TokenEntry token : tokens.getTokens()) {
            if (!token.getValue().equals(TokenEntry.EOF.getValue())) {
                tokenCounts.merge(token.getValue(), 1, Integer::sum);
                totalTokens++;
            }
        }

        System.out.println("Total tokens: " + totalTokens);
        System.out.println("Unique tokens: " + tokenCounts.size());

        // Ten most frequent tokens
        tokenCounts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(10)
                .forEach(entry -> System.out.println(entry.getKey() + ": " + entry.getValue()));
    }
}
```