Python language support module for PMD's Copy-Paste Detector (CPD), enabling detection of duplicated code in Python source files
npx @tessl/cli install tessl/maven-net-sourceforge-pmd--pmd-python@7.13.0

PMD Python is a Java library that provides Python language support for PMD's Copy-Paste Detector (CPD). It implements a CPD lexer for Python source files so that CPD can detect duplicated code in Python projects alongside PMD's other static analysis features.

<dependency>
    <groupId>net.sourceforge.pmd</groupId>
    <artifactId>pmd-python</artifactId>
    <version>7.13.0</version>
</dependency>

import net.sourceforge.pmd.lang.python.PythonLanguageModule;
import net.sourceforge.pmd.lang.python.cpd.PythonCpdLexer;
import net.sourceforge.pmd.lang.python.ast.PythonTokenKinds;
import net.sourceforge.pmd.cpd.CpdLexer;
import net.sourceforge.pmd.lang.LanguagePropertyBundle;
import net.sourceforge.pmd.lang.LanguageRegistry;
import net.sourceforge.pmd.lang.TokenManager;
import net.sourceforge.pmd.lang.ast.impl.javacc.JavaccToken;
import net.sourceforge.pmd.lang.ast.impl.javacc.CharStream;
import net.sourceforge.pmd.lang.ast.impl.javacc.JavaccTokenDocument.TokenDocumentBehavior;
import net.sourceforge.pmd.lang.document.TextDocument;
import java.util.regex.Pattern;

// Get the Python language module instance
PythonLanguageModule pythonModule = PythonLanguageModule.getInstance();
// Create a CPD lexer for Python code analysis
PythonCpdLexer lexer = new PythonCpdLexer();
// The lexer can be used with PMD's CPD framework to detect duplicate code

PMD Python integrates with PMD's language module framework through two key components:
PythonLanguageModule registers Python as a supported language for copy-paste detection
PythonCpdLexer provides Python-specific tokenization for duplicate code detection

// Core language module base class
abstract class CpdOnlyLanguageModuleBase {
    protected CpdOnlyLanguageModuleBase(LanguageMetadata metadata);
}
// Language property configuration
interface LanguagePropertyBundle {
    // PMD language property bundle for configuration
}
// CPD lexer interface
interface CpdLexer {
    // Copy-paste detector lexer interface
}
// JavaCC-based CPD lexer implementation
abstract class JavaccCpdLexer implements CpdLexer {
    protected abstract TokenManager<JavaccToken> makeLexerImpl(TextDocument doc);
    protected abstract String getImage(JavaccToken token);
}
// Token manager for JavaCC tokens
interface TokenManager<T extends JavaccToken> {
    // Manages tokens for JavaCC-based lexers
}
// JavaCC token representation
class JavaccToken {
    public int kind;
    public String getImage();
}
// Text document for processing
interface TextDocument {
    // Represents a text document for lexical analysis
}
// Language metadata builder
class LanguageMetadata {
    public static LanguageMetadata withId(String id);
    public LanguageMetadata name(String name);
    public LanguageMetadata extensions(String... extensions);
}
// Character stream for JavaCC parsing
class CharStream {
    public static CharStream create(TextDocument doc, TokenDocumentBehavior behavior);
}
// Token document behavior configuration
class TokenDocumentBehavior {
    public TokenDocumentBehavior(String[] tokenNames);
}

PythonLanguageModule is the main entry point for integrating Python language support into PMD's language registry.

public class PythonLanguageModule extends CpdOnlyLanguageModuleBase {
    public PythonLanguageModule();
    public static PythonLanguageModule getInstance();
    public CpdLexer createCpdLexer(LanguagePropertyBundle bundle);
}

PythonLanguageModule provides:
PythonLanguageModule() - Creates the language module with Python metadata (language ID "python", name "Python", file extension ".py")
getInstance() - Returns the singleton instance from PMD's language registry
createCpdLexer(LanguagePropertyBundle bundle) - Creates a Python CPD lexer instance for tokenization

PythonCpdLexer is a Python-specific tokenizer that integrates with PMD's Copy-Paste Detector to identify duplicated code in Python source files.

public class PythonCpdLexer extends JavaccCpdLexer {
    public PythonCpdLexer();
    protected TokenManager<JavaccToken> makeLexerImpl(TextDocument doc);
    protected String getImage(JavaccToken token);
}

PythonCpdLexer provides:
PythonCpdLexer() - Creates a Python CPD lexer instance
makeLexerImpl(TextDocument doc) - Creates the token manager that processes Python source documents using PythonTokenKinds
getImage(JavaccToken token) - Normalizes token images; for single-quoted Python string literals it removes escaped line breaks (a backslash followed by \r?\n)

Internal Implementation Details:
Uses Pattern.compile("\\\\\\r?\\n") for string escape normalization
Applies the normalization to the token kinds SINGLE_STRING, SINGLE_STRING2, SINGLE_BSTRING, SINGLE_BSTRING2, SINGLE_USTRING, SINGLE_USTRING2

The lexer recognizes Python tokens through the generated PythonTokenKinds class:

// Python token constants - generated from Python.jj grammar
class PythonTokenKinds {
    // String literal token types (handled specially for escape processing)
    public static final int SINGLE_STRING;   // Single-quoted strings: 'text'
    public static final int SINGLE_STRING2;  // Alternative single-quoted strings
    public static final int SINGLE_BSTRING;  // Byte strings: b'text'
    public static final int SINGLE_BSTRING2; // Alternative byte strings
    public static final int SINGLE_USTRING;  // Unicode strings: u'text'
    public static final int SINGLE_USTRING2; // Alternative unicode strings
    // Token creation and management
    public static TokenManager<JavaccToken> newTokenManager(CharStream charStream);
    public static final String[] TOKEN_NAMES;
}

Token Processing Features:
Removes escaped line breaks (a backslash followed by \r?\n) from single-quoted string tokens

// Service provider registration (META-INF/services/net.sourceforge.pmd.lang.Language)
net.sourceforge.pmd.lang.python.PythonLanguageModule
// Language metadata
String LANGUAGE_ID = "python";
String LANGUAGE_NAME = "Python";
String[] FILE_EXTENSIONS = {".py"};

Integration Features:
Registered as a service provider and retrievable via LanguageRegistry.CPD.getLanguageById("python")

Basic PMD Integration:
// Retrieve the Python language module
PythonLanguageModule module = PythonLanguageModule.getInstance();
// Create a lexer for Python code analysis; createCpdLexer returns the CpdLexer interface type.
// languagePropertyBundle is a LanguagePropertyBundle obtained elsewhere.
CpdLexer lexer = module.createCpdLexer(languagePropertyBundle);
// The lexer is now ready for use with PMD's CPD engine

Token Processing:
// The lexer handles Python-specific token processing automatically
// Including normalization of string literals with escape sequences
PythonCpdLexer lexer = new PythonCpdLexer();
// Token processing happens internally during CPD analysis

This module requires the PMD core dependency:
net.sourceforge.pmd:pmd-core - Core PMD functionality
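
The string normalization described for getImage can be tried in isolation. The following is a minimal sketch, not PMD's own code: it reuses the regular expression quoted in the implementation details inside a standalone class. The class name StringEscapeNormalization and the helper normalize are hypothetical names introduced only for this illustration.

```java
import java.util.regex.Pattern;

class StringEscapeNormalization {
    // Regex quoted in this document: a literal backslash, an optional CR, then an LF.
    private static final Pattern STRING_NL_ESCAPE = Pattern.compile("\\\\\\r?\\n");

    // Hypothetical helper (not part of PMD's API): strips escaped line breaks
    // from a token image, leaving everything else untouched.
    static String normalize(String image) {
        return STRING_NL_ESCAPE.matcher(image).replaceAll("");
    }

    public static void main(String[] args) {
        // A single-quoted Python literal continued across lines with a backslash.
        String unixImage = "'hello \\\nworld'";      // backslash + LF
        String windowsImage = "'hello \\\r\nworld'"; // backslash + CR LF

        System.out.println(normalize(unixImage));    // prints 'hello world'
        System.out.println(normalize(windowsImage)); // prints 'hello world'
    }
}
```

Because both images normalize to the same text, two Python string literals that differ only in where a backslash line continuation falls would yield the same token image, which is presumably why the lexer applies this step before duplicate detection.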