C++ language module for PMD's Copy-Paste Detector providing lexical analysis and tokenization support for C++ source code
npx @tessl/cli install tessl/maven-net-sourceforge-pmd--pmd-cpp@7.13.0

PMD C++ Language Module provides C++ language support for PMD's Copy-Paste Detector (CPD). This library enables duplicate code detection in C++ source files by providing lexical analysis and tokenization capabilities specifically tailored for C++ syntax, including preprocessor directive handling and configurable code block filtering.

<dependency>
    <groupId>net.sourceforge.pmd</groupId>
    <artifactId>pmd-cpp</artifactId>
    <version>7.13.0</version>
</dependency>

import net.sourceforge.pmd.lang.cpp.CppLanguageModule;
import net.sourceforge.pmd.lang.cpp.cpd.CppCpdLexer;
import net.sourceforge.pmd.lang.cpp.cpd.CppEscapeTranslator;
import net.sourceforge.pmd.cpd.CpdLexer;
import net.sourceforge.pmd.lang.LanguagePropertyBundle;
import net.sourceforge.pmd.cpd.CpdLanguageProperties;

import net.sourceforge.pmd.lang.cpp.CppLanguageModule;
import net.sourceforge.pmd.cpd.CpdLexer;
import net.sourceforge.pmd.lang.LanguagePropertyBundle;
// Get the C++ language module instance
CppLanguageModule cppModule = CppLanguageModule.getInstance();
// Create a property bundle with default settings
LanguagePropertyBundle properties = cppModule.newPropertyBundle();
// Create a CPD lexer for C++ tokenization
CpdLexer lexer = cppModule.createCpdLexer(properties);
// The lexer can now be used to tokenize C++ source files for duplicate detection

The PMD C++ module is built around several key components:
Core language module functionality for registering and configuring C++ support within the PMD framework.
public class CppLanguageModule extends CpdOnlyLanguageModuleBase {
    public CppLanguageModule();
    public static CppLanguageModule getInstance();
    public LanguagePropertyBundle newPropertyBundle();
    public CpdLexer createCpdLexer(LanguagePropertyBundle bundle);
}

Properties:
public static final PropertyDescriptor<String> CPD_SKIP_BLOCKS;

Advanced C++ lexical analysis with support for preprocessor directives, line continuations, and configurable filtering.
public class CppCpdLexer extends JavaccCpdLexer {
    public CppCpdLexer(LanguagePropertyBundle cppProperties);
    protected TokenManager<JavaccToken> makeLexerImpl(TextDocument doc);
    protected TokenManager<JavaccToken> filterTokenStream(TokenManager<JavaccToken> tokenManager);
    protected void processToken(TokenFactory tokenEntries, JavaccToken currentToken);
}

Handles C++ backslash line continuation processing according to C++ language standards.
public class CppEscapeTranslator extends BackslashEscapeTranslator {
    public CppEscapeTranslator(TextDocument input);
    protected int handleBackslash(int maxOff, int backSlashOff);
}

The module supports several CPD configuration properties that can be set via the LanguagePropertyBundle:
// Skip code blocks matching start/end patterns (pipe-separated)
CppLanguageModule.CPD_SKIP_BLOCKS
Default: Skips conditionally compiled code (#if 0|#endif)
Usage: Set to empty string to disable, or provide custom start|end patterns
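
To illustrate how a pipe-separated start|end value such as the default #if 0|#endif can drive block skipping, here is a minimal plain-Java sketch. It is not PMD's implementation: the class SkipBlocksSketch and its filter method are hypothetical, it works on whole lines rather than on the token stream, and it does not handle nested blocks.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; PMD's own lexer filters tokens, not lines.
final class SkipBlocksSketch {

    // Drops every line from one that starts with the start marker through the
    // next line that starts with the end marker (both marker lines included).
    static List<String> filter(List<String> lines, String skipBlocks) {
        if (skipBlocks.isEmpty()) {
            return new ArrayList<>(lines); // an empty value disables skipping
        }
        String[] parts = skipBlocks.split("\\|", 2);
        String start = parts[0];
        // Assumption: a value without '|' uses the same marker for start and end.
        String end = parts.length > 1 ? parts[1] : parts[0];

        List<String> kept = new ArrayList<>();
        boolean skipping = false;
        for (String line : lines) {
            String trimmed = line.trim();
            if (!skipping && trimmed.startsWith(start)) {
                skipping = true;           // entering a skipped block
            } else if (skipping && trimmed.startsWith(end)) {
                skipping = false;          // leaving it; the end line is dropped too
            } else if (!skipping) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> source = List.of(
            "int a = 1;",
            "#if 0",
            "int dead = 2;",
            "#endif",
            "int b = 3;");
        System.out.println(filter(source, "#if 0|#endif"));
        System.out.println(filter(source, ""));
    }
}
```

With the default-style value, only the lines outside the #if 0 ... #endif region are kept. With an empty value, the input is returned unchanged.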
// Ignore literal sequences in duplicate detection
CpdLanguageProperties.CPD_IGNORE_LITERAL_SEQUENCES
// Ignore literal and identifier sequences
CpdLanguageProperties.CPD_IGNORE_LITERAL_AND_IDENTIFIER_SEQUENCES
// Replace identifiers with generic placeholders
CpdLanguageProperties.CPD_ANONYMIZE_IDENTIFIERS
// Replace literals with generic placeholders
CpdLanguageProperties.CPD_ANONYMIZE_LITERALS

The C++ language module automatically recognizes the following file extensions:
.h - C/C++ header files
.hpp - C++ header files
.hxx - C++ header files
.c - C source files
.cpp - C++ source files
.cxx - C++ source files
.cc - C++ source files
.C - C++ source files

The lexer recognizes and processes the following C++ token categories:
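Arithmetic operators: +, -, *, /, %, etc.
Logical operators: &&, ||, !, etc.
Comparison operators: ==, !=, <, >, <=, >=
Assignment operators: =, +=, -=, etc.
Punctuation: {, }, (, ), [, ], ;, ,, etc.
Single-line comments: // comment text
Multi-line comments: /* comment text */
Line endings: Unix (\n) and Windows (\r\n) line endings

The module provides robust error handling.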
Errors are reported through PMD's standard error reporting mechanisms with accurate source location information.
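
As a companion to the backslash line continuation handling described for CppEscapeTranslator, and to the Unix and Windows line endings the lexer accepts, the following plain-Java sketch shows the general idea of splicing continued lines. It is an illustration under stated assumptions, not PMD's code: LineContinuationSketch and splice are hypothetical names, and the sketch does not track the source offsets that a real translator needs for accurate location reporting.

```java
// Hypothetical illustration only; it does not use or mirror PMD internals.
final class LineContinuationSketch {

    // Removes each backslash that is immediately followed by "\n" or "\r\n",
    // together with that line ending, so the two physical lines become one.
    static String splice(String source) {
        StringBuilder out = new StringBuilder(source.length());
        int i = 0;
        while (i < source.length()) {
            char c = source.charAt(i);
            if (c == '\\' && i + 1 < source.length()) {
                char next = source.charAt(i + 1);
                if (next == '\n') {
                    i += 2; // skip backslash + Unix line ending
                    continue;
                }
                if (next == '\r' && i + 2 < source.length()
                        && source.charAt(i + 2) == '\n') {
                    i += 3; // skip backslash + Windows line ending
                    continue;
                }
            }
            out.append(c); // any other character, including a lone backslash
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String macro = "#define MAX(a, b) \\\n((a) > (b) ? (a) : (b))";
        System.out.println(splice(macro));
    }
}
```

A backslash that is not directly followed by a line ending, such as the one in an escape sequence inside a string literal, is left untouched by this sketch.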