# Copy-Paste Detection

Copy-paste detection (CPD) support for Scala uses Scalameta-based tokenization to identify duplicated code in Scala projects. It plugs into PMD's duplicate-code detection framework, which analyzes the resulting token sequences for similar code patterns across source files.

## Core CPD Components

### ScalaCpdLexer

Primary tokenizer for copy-paste detection that converts Scala source code into tokens for duplication analysis.

```java { .api }
public class ScalaCpdLexer implements CpdLexer {
    public ScalaCpdLexer(LanguagePropertyBundle bundle);
    public void tokenize(TextDocument document, TokenFactory tokenEntries) throws IOException;
}
```

**Usage Example**:

```java
// Create the CPD lexer with the Scala language's default properties (PMD 7 API)
ScalaLanguageModule scala = ScalaLanguageModule.getInstance();
LanguagePropertyBundle bundle = scala.newPropertyBundle();
ScalaCpdLexer lexer = new ScalaCpdLexer(bundle);

// Wrap the Scala source (a String held in sourceCode) in a read-only document
TextDocument document = TextDocument.readOnlyString(
        sourceCode, FileId.fromPathLikeString("Example.scala"), scala.getDefaultVersion());

try {
    // TokenFactory is an interface; this static helper creates and closes one for us
    Tokens tokens = CpdLexer.tokenize(lexer, document);
    System.out.println("Generated " + tokens.getTokens().size() + " tokens for CPD analysis");
} catch (IOException e) {
    System.err.println("Tokenization failed: " + e.getMessage());
}
```

### ScalaTokenAdapter

Adapter class that bridges Scalameta tokens to PMD's CPD token interface.

```java { .api }
public class ScalaTokenAdapter {
    // Internal adapter implementation
    // Converts scala.meta.Token to PMD CpdToken format
}
```

**Internal Usage**:

```java
// Used internally by ScalaCpdLexer (illustrative; not a public entry point)
scala.meta.Token scalametaToken = ...; // obtained from Scalameta tokenization
CpdToken pmdToken = ScalaTokenAdapter.adapt(scalametaToken, document);
tokenFactory.recordToken(pmdToken);
```

## Tokenization Process

### Token Generation Strategy

The CPD lexer processes Scala source code through the following steps:

1. **Scalameta Parsing**: Parse source code using Scalameta's tokenizer
2. **Token Filtering**: Filter out comments and whitespace tokens
3. **Token Adaptation**: Convert Scalameta tokens to PMD CPD format
4. **Position Mapping**: Maintain accurate source position information

```java
// Internal tokenization process
public void tokenize(TextDocument document, TokenFactory tokenEntries) throws IOException {
    try {
        // Parse with Scalameta
        Input input = Input.String(document.getText().toString());
        Tokens tokens = input.tokenize().get();

        // Filter and adapt tokens
        for (Token token : tokens) {
            if (shouldIncludeToken(token)) {
                CpdToken cpdToken = adaptToken(token, document);
                tokenEntries.recordToken(cpdToken);
            }
        }
    } catch (Exception e) {
        throw new IOException("Scala tokenization failed", e);
    }
}
```
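
The four steps above can also be illustrated without PMD or Scalameta on the classpath. The following is a minimal sketch under stated assumptions, not the actual `ScalaCpdLexer`: the class `ToyCpdPipeline`, its `Tok` type, and the regex-based lexer are hypothetical stand-ins that only show how filtering and position tracking fit together.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical toy pipeline: tokenize, filter, and keep source positions. */
final class ToyCpdPipeline {

    /** A kept token with its 1-based start line and column. */
    static final class Tok {
        final String image;
        final int line;
        final int column;

        Tok(String image, int line, int column) {
            this.image = image;
            this.line = line;
            this.column = column;
        }
    }

    // Alternatives, tried in order: line comment, whitespace, word,
    // string literal, any other single character.
    private static final Pattern LEXEME = Pattern.compile(
            "(?<comment>//[^\\n]*)|(?<space>\\s+)|(?<word>\\w+)"
            + "|(?<string>\"[^\"\\n]*\")|(?<other>.)");

    static List<Tok> tokenize(String source) {
        List<Tok> kept = new ArrayList<>();
        Matcher m = LEXEME.matcher(source);
        int line = 1;
        int lineStart = 0; // offset of the first character of the current line
        while (m.find()) {
            // Filtering: drop comments and whitespace.
            // Adaptation and position mapping: record image, line, and column.
            if (m.group("comment") == null && m.group("space") == null) {
                kept.add(new Tok(m.group(), line, m.start() - lineStart + 1));
            }
            // Keep the line counter in sync, including for skipped lexemes.
            for (int i = m.start(); i < m.end(); i++) {
                if (source.charAt(i) == '\n') {
                    line++;
                    lineStart = i + 1;
                }
            }
        }
        return kept;
    }
}
```

Positions are computed from the unfiltered input, so dropping comments and whitespace does not shift the line and column of the tokens that remain.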

### Token Filtering Rules

The tokenizer applies filtering rules to focus on semantically meaningful tokens:

```java
private boolean shouldIncludeToken(Token token) {
    // Include identifiers, keywords, literals, operators
    // Exclude comments, whitespace, formatting tokens
    return !(token instanceof Token.Comment ||
             token instanceof Token.Space ||
             token instanceof Token.Tab ||
             token instanceof Token.LF ||
             token instanceof Token.CRLF ||
             token instanceof Token.FF);
}
```

## Integration with PMD CPD

### Language Module Integration

CPD integration is handled through the language module:

```java { .api }
public class ScalaLanguageModule extends SimpleLanguageModuleBase {
    @Override
    public CpdLexer createCpdLexer(LanguagePropertyBundle bundle) {
        return new ScalaCpdLexer(bundle);
    }
}
```

**Usage Example**:

```java
// Get CPD lexer from language module
ScalaLanguageModule module = ScalaLanguageModule.getInstance();
LanguagePropertyBundle bundle = module.newPropertyBundle(); // configure properties as needed
CpdLexer lexer = module.createCpdLexer(bundle);

// Use lexer for CPD analysis
lexer.tokenize(document, tokenEntries);
```

### CPD Configuration

Configure CPD analysis parameters for Scala projects:

```java
// CPD configuration for Scala (PMD 7 CPDConfiguration setters)
CPDConfiguration config = new CPDConfiguration();
config.setMinimumTileSize(50);        // Minimum tokens for duplication
config.setOnlyRecognizeLanguage(      // Use Scala CPD lexer
        config.getLanguageRegistry().getLanguageById("scala"));
config.setIgnoreAnnotations(true);    // Ignore annotation differences
config.setIgnoreIdentifiers(false);   // Consider identifier names
config.setIgnoreLiterals(true);       // Ignore literal value differences
```
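
To make the minimum tile size concrete, here is a deliberately naive sketch of what a token threshold means. It is not PMD's matching algorithm; `TileSketch` and `findDuplicateTiles` are hypothetical names, and the method simply reports pairs of start positions whose next `minTile` tokens are identical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of a minimum-tile threshold over a token list. */
final class TileSketch {

    /** Returns {first, second} start-index pairs whose next minTile tokens match. */
    static List<int[]> findDuplicateTiles(List<String> tokens, int minTile) {
        Map<List<String>, Integer> firstSeen = new HashMap<>();
        List<int[]> matches = new ArrayList<>();
        for (int start = 0; start + minTile <= tokens.size(); start++) {
            // Copy the window so the map key does not alias the input list.
            List<String> window = new ArrayList<>(tokens.subList(start, start + minTile));
            Integer earlier = firstSeen.putIfAbsent(window, start);
            // Report only repeats that do not overlap the first occurrence.
            if (earlier != null && start - earlier >= minTile) {
                matches.add(new int[] {earlier, start});
            }
        }
        return matches;
    }
}
```

With the tokens `a b c x a b c`, a tile size of 3 reports one match (positions 0 and 4), while a tile size of 4 reports none. Raising the threshold trades recall for fewer, larger findings.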

## Duplication Detection Patterns

### Class-Level Duplication

CPD can detect duplicated code patterns at various levels:

```scala
// Example: Similar class structures
class UserService {
  def findById(id: Long): Option[User] = {
    val query = "SELECT * FROM users WHERE id = ?"
    executeQuery(query, id).map(parseUser)
  }

  def save(user: User): Boolean = {
    val query = "INSERT INTO users (name, email) VALUES (?, ?)"
    executeUpdate(query, user.name, user.email) > 0
  }
}

class ProductService {
  def findById(id: Long): Option[Product] = {
    val query = "SELECT * FROM products WHERE id = ?"
    executeQuery(query, id).map(parseProduct) // Similar pattern
  }

  def save(product: Product): Boolean = {
    val query = "INSERT INTO products (name, price) VALUES (?, ?)"
    executeUpdate(query, product.name, product.price) > 0 // Similar pattern
  }
}
```

### Method-Level Duplication

```scala
// Example: Similar method implementations
def processUsers(users: List[User]): List[ProcessedUser] = {
  users.map { user =>
    val validated = validateUser(user)
    val normalized = normalizeUser(validated)
    val enriched = enrichUser(normalized)
    ProcessedUser(enriched)
  }
}

def processProducts(products: List[Product]): List[ProcessedProduct] = {
  products.map { product =>
    val validated = validateProduct(product) // Similar structure
    val normalized = normalizeProduct(validated) // Similar structure
    val enriched = enrichProduct(normalized) // Similar structure
    ProcessedProduct(enriched)
  }
}
```

### Expression-Level Duplication

```scala
// Example: Similar expression patterns
val userResult = Try {
  val data = fetchUserData(id)
  val parsed = parseUserData(data)
  val validated = validateUserData(parsed)
  validated
}.recover {
  case _: NetworkException => DefaultUser
  case _: ParseException => DefaultUser
}.get

val productResult = Try {
  val data = fetchProductData(id) // Similar pattern
  val parsed = parseProductData(data) // Similar pattern
  val validated = validateProductData(parsed) // Similar pattern
  validated
}.recover {
  case _: NetworkException => DefaultProduct // Similar pattern
  case _: ParseException => DefaultProduct // Similar pattern
}.get
```

## CPD Analysis Results

### Duplication Report Format

CPD generates reports identifying duplicated code blocks:

```xml
<!-- Example CPD report for Scala -->
<pmd-cpd>
  <duplication lines="12" tokens="45">
    <file line="15" path="src/main/scala/UserService.scala"/>
    <file line="28" path="src/main/scala/ProductService.scala"/>
    <codefragment><![CDATA[
def findById(id: Long): Option[T] = {
  val query = "SELECT * FROM table WHERE id = ?"
  executeQuery(query, id).map(parseEntity)
}
    ]]></codefragment>
  </duplication>
</pmd-cpd>
```
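
A report in this shape can be read with the JDK's built-in DOM parser. The sketch below is one possible approach, assuming only the element and attribute names shown in the example report (`duplication`, `file`, `lines`, `tokens`, `path`, `line`); `CpdReportReader` is a hypothetical helper, not part of PMD.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper that summarizes a CPD XML report. */
final class CpdReportReader {

    /** Returns one summary line per duplication element in the report. */
    static List<String> summarize(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a readable CPD report", e);
        }
        List<String> summaries = new ArrayList<>();
        NodeList duplications = doc.getElementsByTagName("duplication");
        for (int i = 0; i < duplications.getLength(); i++) {
            Element dup = (Element) duplications.item(i);
            StringBuilder sb = new StringBuilder();
            sb.append(dup.getAttribute("tokens")).append(" tokens, ")
              .append(dup.getAttribute("lines")).append(" lines:");
            // Each file element marks one location of the duplicated block.
            NodeList files = dup.getElementsByTagName("file");
            for (int j = 0; j < files.getLength(); j++) {
                Element file = (Element) files.item(j);
                sb.append(' ').append(file.getAttribute("path"))
                  .append(':').append(file.getAttribute("line"));
            }
            summaries.add(sb.toString());
        }
        return summaries;
    }
}
```

Each summary line lists the token and line counts followed by every location of the duplicated block, which is usually enough for a build log or a review comment.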

### Programmatic Analysis

```java
// Analyze CPD results programmatically (accessor names follow the PMD 7 API)
public void analyzeCpdResults(CPDReport report) {
    for (Match duplication : report.getMatches()) {
        System.out.println("Found duplication:");
        System.out.println("  Tokens: " + duplication.getTokenCount());
        System.out.println("  Lines: " + duplication.getLineCount());

        for (Mark mark : duplication.getMarkSet()) {
            FileLocation location = mark.getLocation();
            System.out.println("  File: " + location.getFileId().getOriginalPath() +
                               " (line " + location.getStartLine() + ")");
        }

        System.out.println("  Code fragment:");
        System.out.println("  " + report.getSourceCodeSlice(duplication.getFirstMark()));
    }
}
```

## Advanced CPD Features

### Token Normalization

CPD can normalize tokens to detect duplicates that differ only in naming:

```java
// Configuration for token normalization (PMD 7 CPDConfiguration setters)
CPDConfiguration config = new CPDConfiguration();
config.setIgnoreIdentifiers(true);   // user/product → identifier
config.setIgnoreLiterals(true);      // "users"/"products" → string_literal
config.setIgnoreAnnotations(true);   // @Entity/@Component → annotation
```

This allows detection of functionally identical code with different names:

```scala
// These would be detected as duplicates with normalization
def saveUser(user: User) = repository.save(user)
def saveProduct(product: Product) = repository.save(product)

// Normalized tokens: def identifier(identifier: identifier) = identifier.identifier(identifier)
```
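
The effect of normalization can be sketched in a few lines of plain Java. This is an illustration only, with a toy regex lexer and a tiny keyword set; `NormalizationSketch` and the `<identifier>` and `<literal>` placeholders are hypothetical and do not reflect PMD's internal token images.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration of identifier and literal normalization. */
final class NormalizationSketch {

    // A deliberately tiny keyword set; real Scala has many more keywords.
    private static final Set<String> KEYWORDS =
            new HashSet<>(Arrays.asList("def", "val", "var", "class", "object"));

    // String literal, word, or any other single non-space character.
    private static final Pattern LEXEME = Pattern.compile("\"[^\"]*\"|\\w+|\\S");

    static List<String> normalize(String source) {
        List<String> out = new ArrayList<>();
        Matcher m = LEXEME.matcher(source);
        while (m.find()) {
            String t = m.group();
            char first = t.charAt(0);
            if ((first == '"' && t.length() >= 2) || Character.isDigit(first)) {
                out.add("<literal>");      // string and numeric literals
            } else if ((Character.isLetter(first) || first == '_') && !KEYWORDS.contains(t)) {
                out.add("<identifier>");   // any non-keyword name
            } else {
                out.add(t);                // keywords and punctuation stay as-is
            }
        }
        return out;
    }
}
```

Under this scheme `saveUser` and `saveProduct` from the example produce identical token lists, because only keywords and punctuation keep their original text.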

### Custom Token Filtering

```java
// Assumes ScalaCpdLexer exposes shouldIncludeToken as an overridable (protected) hook
public class CustomScalaCpdLexer extends ScalaCpdLexer {
    public CustomScalaCpdLexer(LanguagePropertyBundle bundle) {
        super(bundle);
    }

    @Override
    protected boolean shouldIncludeToken(Token token) {
        // Custom filtering logic
        if (token instanceof Token.KwPrivate || token instanceof Token.KwProtected) {
            return false; // Ignore visibility modifiers
        }

        // The cast is required because isTestMethodName takes a Token.Ident
        if (token instanceof Token.Ident && isTestMethodName((Token.Ident) token)) {
            return false; // Ignore test method names
        }

        return super.shouldIncludeToken(token);
    }

    private boolean isTestMethodName(Token.Ident token) {
        String name = token.value();
        return name.startsWith("test") || name.contains("should");
    }
}
```

### Integration with Build Tools

#### Maven Integration

```xml
<!-- CPD is provided by the Maven PMD plugin (cpd and cpd-check goals). -->
<!-- Scala support also requires the pmd-scala module among the plugin's dependencies. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-pmd-plugin</artifactId>
  <configuration>
    <format>xml</format>
    <language>scala</language>
    <minimumTokens>50</minimumTokens>
  </configuration>
</plugin>
```

#### SBT Integration

```scala
// build.sbt
libraryDependencies += "net.sourceforge.pmd" % "pmd-scala_2.12" % "7.13.0"

// Custom CPD task
lazy val cpd = taskKey[Unit]("Run copy-paste detection")

cpd := {
  val classpath = (Compile / dependencyClasspath).value
  val sourceDir = (Compile / scalaSource).value

  // Run CPD analysis on Scala sources.
  // runCpdAnalysis is a placeholder for project-specific code that invokes CPD.
  runCpdAnalysis(sourceDir, classpath)
}
```

## Performance Considerations

### Tokenization Performance

```java
// Optimize tokenization for large codebases (uses Guava's Cache and CacheBuilder)
public class OptimizedScalaCpdLexer extends ScalaCpdLexer {
    private final Cache<String, Tokens> tokenCache =
        CacheBuilder.newBuilder()
            .maximumSize(1000)
            .expireAfterWrite(10, TimeUnit.MINUTES)
            .build();

    // ScalaCpdLexer has no no-argument constructor, so the bundle must be passed through
    public OptimizedScalaCpdLexer(LanguagePropertyBundle bundle) {
        super(bundle);
    }

    @Override
    public void tokenize(TextDocument document, TokenFactory tokenEntries) throws IOException {
        String content = document.getText().toString();

        try {
            // Tokenize only on a cache miss; identical content reuses the cached tokens
            Tokens tokens = tokenCache.get(content, () -> {
                Input input = Input.String(content);
                return input.tokenize().get();
            });

            // processTokens is a placeholder for the filter-and-record loop
            processTokens(tokens, document, tokenEntries);
        } catch (ExecutionException e) {
            throw new IOException("Tokenization failed", e.getCause());
        }
    }
}
```
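
Where adding Guava is not an option, a bounded cache with similar load-on-miss behavior can be sketched with the JDK alone. The class below is a hypothetical, size-bounded LRU cache built on `LinkedHashMap`; it has no time-based expiry, so it is not a drop-in replacement for the `CacheBuilder` configuration above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical LRU cache: evicts the least recently accessed entry when full. */
final class LruCache<K, V> {
    private final Map<K, V> entries;

    LruCache(final int capacity) {
        // accessOrder = true keeps entries ordered from least to most recently used.
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > capacity;
            }
        };
    }

    /** Returns the cached value, computing and storing it on a miss. */
    synchronized V get(K key, Function<K, V> loader) {
        V value = entries.get(key);
        if (value == null) {
            value = loader.apply(key);
            entries.put(key, value);
        }
        return value;
    }

    synchronized int size() {
        return entries.size();
    }
}
```

A lexer could key such a cache by file content or by a content hash and store the token list as the value. Whether caching pays off depends on how often identical content is tokenized more than once.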

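### Memory Management

```java
// Line-by-line processing for large files.
// tokenizeLine is a placeholder; lexing one line at a time cannot handle tokens
// that span lines, such as block comments or multi-line strings.
public void tokenizeLargeFile(TextDocument document, TokenFactory tokenEntries) throws IOException {
    try (Stream<String> lines = document.getText().toString().lines()) {
        lines.forEach(line -> {
            try {
                tokenizeLine(line, tokenEntries);
            } catch (IOException e) {
                throw new UncheckedIOException(e); // unwrapped again below
            }
        });
    } catch (UncheckedIOException e) {
        throw e.getCause(); // surface the original IOException to the caller
    }
}
```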

## Best Practices

### CPD Configuration Guidelines

1. **Minimum Token Size**: Set appropriate threshold (50-100 tokens)
2. **Ignore Settings**: Configure based on project needs
3. **File Filtering**: Exclude generated code and test fixtures
4. **Report Format**: Choose appropriate output format (XML, JSON, CSV)
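
File filtering can be done before any source reaches CPD. The following sketch assumes glob-style exclude patterns and Unix-style paths; `SourceFilter` and the example patterns are hypothetical and would need adjusting to a project's actual layout.

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical pre-filter that keeps Scala sources and drops excluded paths. */
final class SourceFilter {
    private final List<PathMatcher> excludes = new ArrayList<>();

    SourceFilter(List<String> excludeGlobs) {
        for (String glob : excludeGlobs) {
            excludes.add(FileSystems.getDefault().getPathMatcher("glob:" + glob));
        }
    }

    /** True if the path is a .scala file that no exclude glob matches. */
    boolean accepts(String path) {
        if (!path.endsWith(".scala")) {
            return false;
        }
        Path p = Paths.get(path);
        for (PathMatcher matcher : excludes) {
            if (matcher.matches(p)) {
                return false;
            }
        }
        return true;
    }
}
```

For example, a filter built with `**/src_managed/**` and `**/fixtures/**` accepts `src/main/scala/UserService.scala` but rejects generated sources and test fixtures under those directories.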


### Integration Strategies

1. **CI/CD Integration**: Include CPD in build pipeline
2. **Quality Gates**: Set duplication thresholds
3. **Trend Analysis**: Track duplication metrics over time
4. **Refactoring Guidance**: Use results to guide code improvements
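
A quality gate can be as simple as comparing a duplication percentage against a limit. The sketch below is a hypothetical example; `DuplicationGate`, its inputs, and the 5% limit used in the note that follows are assumptions rather than PMD defaults.

```java
/** Hypothetical quality gate based on the share of duplicated lines. */
final class DuplicationGate {

    /** Percentage of lines that are duplicated, or 0 when there are no lines. */
    static double duplicationPercent(int duplicatedLines, int totalLines) {
        if (totalLines <= 0) {
            return 0.0;
        }
        return 100.0 * duplicatedLines / totalLines;
    }

    /** True if the duplication percentage does not exceed maxPercent. */
    static boolean passes(int duplicatedLines, int totalLines, double maxPercent) {
        return duplicationPercent(duplicatedLines, totalLines) <= maxPercent;
    }
}
```

With a 5% limit, 12 duplicated lines out of 400 (3%) pass the gate while 60 out of 400 (15%) fail it. Recording the percentage on every build also provides the data needed for trend analysis.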

Together, the Scala CPD lexer and PMD's CPD framework give teams a way to find duplicated Scala code, report it, and reduce it as part of regular quality checks.