
# Copy-Paste Detection

Scala tokenization support for PMD's copy-paste detection (CPD) system. It provides language-specific tokenization and token filtering for duplicate code analysis, allowing CPD to identify duplicated code blocks across Scala source files.
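To give a feel for what duplicate detection means at the token level, here is a small conceptual sketch (not PMD's actual implementation, and the class name is hypothetical): a CPD-style detector slides a window of N consecutive tokens over the stream and reports windows that occur at more than one position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Conceptual sketch of CPD-style duplicate detection: count every
// window of N consecutive tokens and keep the windows seen twice or more.
public class DuplicateSketch {

    // Returns the distinct token windows of the given size that occur
    // at more than one position in the token stream.
    public static Set<List<String>> duplicateWindows(List<String> tokens, int windowSize) {
        Map<List<String>, Integer> seen = new HashMap<>();
        for (int i = 0; i + windowSize <= tokens.size(); i++) {
            // Copy the subList view so the map key is stable.
            List<String> window = new ArrayList<>(tokens.subList(i, i + windowSize));
            seen.merge(window, 1, Integer::sum);
        }
        Set<List<String>> duplicates = new HashSet<>();
        for (Map.Entry<List<String>, Integer> e : seen.entrySet()) {
            if (e.getValue() > 1) {
                duplicates.add(e.getKey());
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "def", "add", "(", "a", ",", "b", ")",
                "def", "add", "(", "x", ",", "y", ")");
        // The 3-token sequence ["def", "add", "("] occurs twice.
        System.out.println(duplicateWindows(tokens, 3));
    }
}
```

Real CPD works on hashed, filtered token streams and tracks file/line positions, but the window-comparison idea is the same.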

## Capabilities


### Scala Language Registration


Registers Scala language support with PMD's copy-paste detection system.

```java { .api }
/**
 * Language implementation for Scala CPD
 */
public class ScalaLanguage extends AbstractLanguage {
    /**
     * Creates a new Scala Language instance for CPD.
     * Registers with name "Scala", terse name "scala", and file extension ".scala"
     */
    public ScalaLanguage();
}
```

**Usage Examples:**

```java
// The language is automatically registered via SPI
// (META-INF/services/net.sourceforge.pmd.cpd.Language)

// Manual instantiation (not typically needed)
ScalaLanguage scalaLang = new ScalaLanguage();
System.out.println("Language: " + scalaLang.getName());         // "Scala"
System.out.println("Extensions: " + scalaLang.getExtensions()); // [.scala]
```

### Scala Tokenizer


Main tokenizer implementation that processes Scala source code into tokens for duplicate detection.

```java { .api }
/**
 * Scala Tokenizer class. Uses the Scalameta tokenizer, but adapts it
 * for use with generic filtering.
 */
public class ScalaTokenizer implements Tokenizer {
    /**
     * Property key for specifying the Scala version dialect.
     * Valid values: "2.10", "2.11", "2.12", "2.13"
     */
    public static final String SCALA_VERSION_PROPERTY = "net.sourceforge.pmd.scala.version";

    /**
     * Create the Tokenizer using properties from the system environment.
     * Uses the default Scala version if SCALA_VERSION_PROPERTY is not set.
     */
    public ScalaTokenizer();

    /**
     * Create the Tokenizer given a set of properties.
     * @param properties the Properties object containing configuration
     */
    public ScalaTokenizer(Properties properties);

    /**
     * Tokenize source code for copy-paste detection.
     * @param sourceCode the source code to tokenize
     * @param tokenEntries output collection for generated tokens
     * @throws IOException if source reading fails
     */
    @Override
    public void tokenize(SourceCode sourceCode, Tokens tokenEntries) throws IOException;
}
```

**Usage Examples:**

```java
import net.sourceforge.pmd.cpd.*;
import java.io.IOException;
import java.util.Properties;

// Create tokenizer with default settings
ScalaTokenizer tokenizer = new ScalaTokenizer();

// Create tokenizer with a specific Scala version
Properties props = new Properties();
props.setProperty(ScalaTokenizer.SCALA_VERSION_PROPERTY, "2.12");
ScalaTokenizer tokenizer212 = new ScalaTokenizer(props);

// Tokenize source code
String code = "object HelloWorld { def main(args: Array[String]): Unit = println(\"Hello\") }";
SourceCode sourceCode = new SourceCode(new SourceCode.StringCodeLoader(code, "HelloWorld.scala"));
Tokens tokens = new Tokens();

try {
    tokenizer.tokenize(sourceCode, tokens);

    // Process tokens for duplicate detection
    for (TokenEntry token : tokens.getTokens()) {
        System.out.println("Token: " + token.getValue() +
                " at line " + token.getBeginLine() +
                ", column " + token.getBeginColumn());
    }
} catch (IOException e) {
    System.err.println("Tokenization failed: " + e.getMessage());
}
```

### Token Filtering and Processing


The tokenizer includes sophisticated filtering to identify meaningful tokens for duplicate detection while ignoring irrelevant elements.


**Filtered Token Types:**


The tokenizer automatically filters out these Scalameta token types:

- `Token.Space` - Whitespace characters
- `Token.Tab` - Tab characters
- `Token.CR` - Carriage returns
- `Token.LF` - Line feeds
- `Token.FF` - Form feeds
- `Token.LFLF` - Double line feeds
- `Token.EOF` - End-of-file markers
- `Token.Comment` - Comments (handled separately)
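The filtering step amounts to a membership test against this ignore set. A minimal sketch of that shape (illustrative only: plain strings stand in for the scala.meta token classes the real tokenizer matches on, and the class name is hypothetical):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch of CPD-style token filtering: drop layout and
// comment tokens, keep everything that matters for duplicate detection.
public class TokenFilterSketch {

    // Stand-ins for Token.Space, Token.Tab, Token.CR, Token.LF, Token.FF,
    // Token.LFLF, Token.EOF, and Token.Comment.
    private static final Set<String> IGNORED = new HashSet<>(Arrays.asList(
            "Space", "Tab", "CR", "LF", "FF", "LFLF", "EOF", "Comment"));

    // Keeps only the tokens that carry meaning for duplicate detection.
    public static List<String> filterSignificant(List<String> tokenTypes) {
        List<String> significant = new ArrayList<>();
        for (String type : tokenTypes) {
            if (!IGNORED.contains(type)) {
                significant.add(type);
            }
        }
        return significant;
    }

    public static void main(String[] args) {
        List<String> stream = Arrays.asList(
                "KwDef", "Space", "Ident", "Space", "Comment", "LF", "Ident", "EOF");
        System.out.println(filterSignificant(stream)); // [KwDef, Ident, Ident]
    }
}
```

Filtering layout tokens is what lets CPD treat two duplicates with different indentation or line breaks as identical token sequences.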


**Comment Handling:**


Comments receive special processing to support PMD's comment-based features:

```java { .api }
/**
 * Internal token adapter that wraps Scalameta tokens for PMD compatibility
 */
public class ScalaTokenAdapter implements GenericToken {
    /**
     * Create a token adapter with optional previous-comment context.
     * @param scalaToken the underlying Scalameta token
     * @param previousComment the most recent comment token, for context
     */
    public ScalaTokenAdapter(Token scalaToken, GenericToken previousComment);

    @Override
    public String getImage();

    @Override
    public int getBeginLine();

    @Override
    public int getBeginColumn();

    @Override
    public int getEndLine();

    @Override
    public int getEndColumn();

    @Override
    public GenericToken getPreviousComment();
}
```
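The `previousComment` threading can be pictured with a small standalone sketch (hypothetical types; the real adapter wraps scala.meta `Token` instances): comment tokens are remembered rather than emitted, and each subsequent significant token carries a reference to the most recent one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of previous-comment threading: comments are filtered out of the
// output stream, but each emitted token keeps a link to the last comment
// seen before it, so comment-based features can consult that context later.
public class CommentContextSketch {

    public static final class Tok {
        public final String text;
        public final String previousComment; // may be null

        Tok(String text, String previousComment) {
            this.text = text;
            this.previousComment = previousComment;
        }
    }

    public static List<Tok> attachComments(List<String> raw) {
        List<Tok> out = new ArrayList<>();
        String lastComment = null;
        for (String t : raw) {
            if (t.startsWith("//")) {
                lastComment = t;  // remember the comment, but do not emit it
            } else {
                out.add(new Tok(t, lastComment));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tok> toks = attachComments(Arrays.asList("// note", "val", "x"));
        System.out.println(toks.get(0).previousComment); // // note
    }
}
```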


### Version-Specific Tokenization


The tokenizer supports different Scala versions through dialect configuration, ensuring accurate tokenization for version-specific syntax.


**Supported Dialects:**

```java
// Version selection through properties
Properties versionProps = new Properties();

// Scala 2.10
versionProps.setProperty(ScalaTokenizer.SCALA_VERSION_PROPERTY, "2.10");
ScalaTokenizer tokenizer210 = new ScalaTokenizer(versionProps);

// Scala 2.11
versionProps.setProperty(ScalaTokenizer.SCALA_VERSION_PROPERTY, "2.11");
ScalaTokenizer tokenizer211 = new ScalaTokenizer(versionProps);

// Scala 2.12
versionProps.setProperty(ScalaTokenizer.SCALA_VERSION_PROPERTY, "2.12");
ScalaTokenizer tokenizer212 = new ScalaTokenizer(versionProps);

// Scala 2.13 (the default)
versionProps.setProperty(ScalaTokenizer.SCALA_VERSION_PROPERTY, "2.13");
ScalaTokenizer tokenizer213 = new ScalaTokenizer(versionProps);

// Or use a system property
System.setProperty(ScalaTokenizer.SCALA_VERSION_PROPERTY, "2.12");
ScalaTokenizer systemTokenizer = new ScalaTokenizer();
```

### Error Handling in Tokenization


The tokenizer handles various error conditions during tokenization:

```java { .api }
// Exception types thrown by the tokenizer
import net.sourceforge.pmd.lang.ast.TokenMgrError;
import scala.meta.tokenizers.TokenizeException;
```

**Error Handling Examples:**

```java
try {
    tokenizer.tokenize(sourceCode, tokens);
} catch (IOException e) {
    // I/O errors while reading the source
    System.err.println("Failed to read source: " + e.getMessage());
} catch (TokenMgrError e) {
    // Tokenization errors from Scalameta
    System.err.println("Tokenization error at line " + e.getLine() +
            ", column " + e.getColumn() + ": " + e.getMessage());

    // The original Scalameta exception is available as the cause
    if (e.getCause() instanceof TokenizeException) {
        TokenizeException originalError = (TokenizeException) e.getCause();
        System.err.println("Scalameta error: " + originalError.getMessage());
    }
} catch (Exception e) {
    // Other unexpected errors
    System.err.println("Unexpected tokenization error: " + e.getMessage());
}
```

### Integration with PMD CPD


The Scala tokenizer integrates seamlessly with PMD's copy-paste detection pipeline:


**Complete CPD Integration Example:**

```java
import net.sourceforge.pmd.cpd.*;
import java.io.File;
import java.util.Iterator;

// Create CPD configuration for Scala
CPDConfiguration config = new CPDConfiguration();
config.setMinimumTileSize(50); // Minimum duplicate size in tokens
config.setLanguage(new ScalaLanguage());

// Run CPD analysis over Scala source files
CPD cpd = new CPD(config);
cpd.add(new File("File1.scala"));
cpd.add(new File("File2.scala"));
cpd.go();

// Process results
Iterator<Match> matches = cpd.getMatches();
while (matches.hasNext()) {
    Match match = matches.next();
    System.out.println("Duplicate found:");
    System.out.println("  Size: " + match.getTokenCount() + " tokens");
    System.out.println("  Lines: " + match.getLineCount());

    for (Mark mark : match.getMarkSet()) {
        System.out.println("  Location: " + mark.getFilename() +
                " (line " + mark.getBeginLine() + ")");
    }
}
```

### Advanced Tokenization Features


**Custom Token Processing:**

```java
import net.sourceforge.pmd.cpd.*;
import java.io.IOException;
import java.util.Properties;

// Process the raw token stream for custom analyses
public class CustomScalaTokenProcessor {
    public void processTokens(SourceCode sourceCode) {
        Properties props = new Properties();
        props.setProperty(ScalaTokenizer.SCALA_VERSION_PROPERTY, "2.12");

        ScalaTokenizer tokenizer = new ScalaTokenizer(props);
        Tokens tokens = new Tokens();

        try {
            tokenizer.tokenize(sourceCode, tokens);

            // Custom processing of the token stream
            for (TokenEntry token : tokens.getTokens()) {
                if (token.getValue().matches("[A-Z][a-zA-Z]*")) {
                    // Likely class/object names
                    System.out.println("Potential type name: " + token.getValue());
                } else if (token.getValue().matches("[a-z][a-zA-Z]*")) {
                    // Likely method/variable names
                    System.out.println("Potential member name: " + token.getValue());
                }
            }
        } catch (IOException e) {
            System.err.println("Processing failed: " + e.getMessage());
        }
    }
}
```

**Token Statistics:**

```java
import net.sourceforge.pmd.cpd.*;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Collect simple frequency statistics over the token stream
public class TokenStatistics {
    public void analyzeTokens(SourceCode sourceCode) throws IOException {
        ScalaTokenizer tokenizer = new ScalaTokenizer();
        Tokens tokens = new Tokens();
        tokenizer.tokenize(sourceCode, tokens);

        Map<String, Integer> tokenCounts = new HashMap<>();
        int totalTokens = 0;

        for (TokenEntry token : tokens.getTokens()) {
            if (!token.getValue().equals(TokenEntry.EOF.getValue())) {
                tokenCounts.merge(token.getValue(), 1, Integer::sum);
                totalTokens++;
            }
        }

        System.out.println("Total tokens: " + totalTokens);
        System.out.println("Unique tokens: " + tokenCounts.size());

        // The ten most frequent tokens
        tokenCounts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(10)
                .forEach(entry -> System.out.println(entry.getKey() + ": " + entry.getValue()));
    }
}
```