tessl/maven-net-sourceforge-pmd--pmd-python

Python language support module for PMD's Copy-Paste Detector (CPD), enabling detection of duplicated code in Python source files


PMD Python Language Module

Name: tessl/maven-net-sourceforge-pmd--pmd-python
Author: tessl

PMD Python is a Java library that adds Python language support to PMD's Copy-Paste Detector (CPD). It provides a CPD lexer for Python source files, so that CPD can detect duplicated code in Python projects.

Package Information

Package Name: pmd-python
Package Type: Maven
Language: Java
Group ID: net.sourceforge.pmd
Artifact ID: pmd-python
Version: 7.13.0
Installation: Add dependency to Maven pom.xml:

<dependency>
    <groupId>net.sourceforge.pmd</groupId>
    <artifactId>pmd-python</artifactId>
    <version>7.13.0</version>
</dependency>

Core Imports

import net.sourceforge.pmd.lang.python.PythonLanguageModule;
import net.sourceforge.pmd.lang.python.cpd.PythonCpdLexer;
import net.sourceforge.pmd.lang.python.ast.PythonTokenKinds;
import net.sourceforge.pmd.cpd.CpdLexer;
import net.sourceforge.pmd.lang.LanguagePropertyBundle;
import net.sourceforge.pmd.lang.LanguageRegistry;
import net.sourceforge.pmd.lang.TokenManager;
import net.sourceforge.pmd.lang.ast.impl.javacc.JavaccToken;
import net.sourceforge.pmd.lang.ast.impl.javacc.CharStream;
import net.sourceforge.pmd.lang.ast.impl.javacc.JavaccTokenDocument.TokenDocumentBehavior;
import net.sourceforge.pmd.lang.document.TextDocument;
import java.util.regex.Pattern;

Basic Usage

// Get the Python language module instance
PythonLanguageModule pythonModule = PythonLanguageModule.getInstance();

// Create a CPD lexer for Python code analysis
PythonCpdLexer lexer = new PythonCpdLexer();

// The lexer can be used with PMD's CPD framework to detect duplicate code

Architecture

PMD Python integrates with PMD's language module framework through several key components:

Language Module: PythonLanguageModule registers Python as a supported language for copy-paste detection
CPD Lexer: PythonCpdLexer provides Python-specific tokenization for duplicate code detection
Grammar Definition: JavaCC-based Python 2.7 grammar for lexical analysis
Service Registration: Automatic discovery through Java service provider interface

Types

PMD Framework Types

// Core language module base class
abstract class CpdOnlyLanguageModuleBase {
    protected CpdOnlyLanguageModuleBase(LanguageMetadata metadata);
}

// Language property configuration
interface LanguagePropertyBundle {
    // PMD language property bundle for configuration
}

// CPD lexer interface
interface CpdLexer {
    // Copy-paste detector lexer interface
}

// JavaCC-based CPD lexer implementation
abstract class JavaccCpdLexer implements CpdLexer {
    protected abstract TokenManager<JavaccToken> makeLexerImpl(TextDocument doc);
    protected abstract String getImage(JavaccToken token);
}

// Token manager for JavaCC tokens
interface TokenManager<T extends JavaccToken> {
    // Manages tokens for JavaCC-based lexers
}

// JavaCC token representation  
class JavaccToken {
    public int kind;
    public String getImage();
}

// Text document for processing
interface TextDocument {
    // Represents a text document for lexical analysis
}

// Language metadata builder
class LanguageMetadata {
    public static LanguageMetadata withId(String id);
    public LanguageMetadata name(String name);
    public LanguageMetadata extensions(String... extensions);
}

// Character stream for JavaCC parsing
class CharStream {
    public static CharStream create(TextDocument doc, TokenDocumentBehavior behavior);
}

// Token document behavior configuration
class TokenDocumentBehavior {
    public TokenDocumentBehavior(String[] tokenNames);
}

Capabilities

Language Module Registration

PythonLanguageModule is the main entry point for registering Python language support in PMD's language registry.

public class PythonLanguageModule extends CpdOnlyLanguageModuleBase {
    public PythonLanguageModule();
    public static PythonLanguageModule getInstance();
    public CpdLexer createCpdLexer(LanguagePropertyBundle bundle);
}

PythonLanguageModule provides:

Constructor: PythonLanguageModule() - Creates the language module with Python metadata (language ID "python", name "Python", file extension ".py")
Static method: getInstance() - Returns the singleton instance from PMD's language registry
Factory method: createCpdLexer(LanguagePropertyBundle bundle) - Creates a Python CPD lexer instance for tokenization

Copy-Paste Detection Lexer

PythonCpdLexer is the Python-specific tokenizer that PMD's Copy-Paste Detector uses to find duplicated code in Python source files.

public class PythonCpdLexer extends JavaccCpdLexer {
    public PythonCpdLexer();
    protected TokenManager<JavaccToken> makeLexerImpl(TextDocument doc);
    protected String getImage(JavaccToken token);
}

PythonCpdLexer provides:

Constructor: PythonCpdLexer() - Creates Python CPD lexer instance
Token manager factory: makeLexerImpl(TextDocument doc) - Creates token manager for processing Python source documents using PythonTokenKinds
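Token image processing: getImage(JavaccToken token) - Normalizes token images. For the SINGLE_* string token kinds (string literals that are not triple-quoted), it removes line-break escapes, that is, a backslash followed by an optional carriage return and a line feed (regex \\\r?\n). All other tokens keep their original image.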

Internal Implementation Details:

Uses Pattern.compile("\\\\\\r?\\n") for string escape normalization
Processes specific token kinds: SINGLE_STRING, SINGLE_STRING2, SINGLE_BSTRING, SINGLE_BSTRING2, SINGLE_USTRING, SINGLE_USTRING2
Returns normalized token images for consistent duplicate detection
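
The normalization described above can be sketched with the standard library alone. This is a minimal illustration, not the module's source: the class name StringEscapeSketch, the method name image, and the integer values of the token kinds are hypothetical stand-ins (the real constants are generated in PythonTokenKinds). Only the pattern literal is taken from the text above.

```java
import java.util.regex.Pattern;

final class StringEscapeSketch {
    // Hypothetical stand-ins for generated token kind constants; real values come from the grammar.
    static final int SINGLE_STRING = 1;
    static final int NAME = 2;

    // Pattern literal quoted in the text: a backslash, an optional CR, then LF.
    private static final Pattern STRING_NL_ESCAPE = Pattern.compile("\\\\\\r?\\n");

    // Returns the image used for comparison: string kinds lose line-break escapes,
    // every other kind is returned unchanged.
    static String image(int kind, String rawImage) {
        switch (kind) {
        case SINGLE_STRING:
            return STRING_NL_ESCAPE.matcher(rawImage).replaceAll("");
        default:
            return rawImage;
        }
    }

    public static void main(String[] args) {
        // A Python literal continued over two lines with a trailing backslash.
        String continued = "'abc\\\ndef'";
        System.out.println(image(SINGLE_STRING, continued)); // prints 'abcdef'
    }
}
```

With this kind of normalization, a literal written on one line and the same literal split with a backslash continuation produce the same image, so both forms compare as equal during duplicate detection.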

Python Token Types

The lexer's token kinds are defined in the generated PythonTokenKinds class:

// Python token constants - generated from Python.jj grammar
class PythonTokenKinds {
    // String literal token types (handled specially for escape processing)
    public static final int SINGLE_STRING;     // Non-triple-quoted string in single quotes: 'text'
    public static final int SINGLE_STRING2;    // Non-triple-quoted string in double quotes: "text"
    public static final int SINGLE_BSTRING;    // Byte string in single quotes: b'text'
    public static final int SINGLE_BSTRING2;   // Byte string in double quotes: b"text"
    public static final int SINGLE_USTRING;    // Unicode string in single quotes: u'text'
    public static final int SINGLE_USTRING2;   // Unicode string in double quotes: u"text"
    
    // Token creation and management
    public static TokenManager<JavaccToken> newTokenManager(CharStream charStream);
    public static final String[] TOKEN_NAMES;
}

Token Processing Features:

String normalization: Removes line-break escapes (a backslash followed by an optional carriage return and a line feed, regex \\\r?\n) from the SINGLE_* string token kinds
Token classification: Recognizes Python operators, keywords, separators, and literals
Grammar compliance: Based on Python 2.7 language specification from PyDev project
Generated constants: Token kind integers are generated at build time from Python.jj grammar
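
To see why consistent token images matter for duplicate detection, the toy sketch below reports windows of token images that repeat an earlier window. It is explicitly not PMD's matching algorithm; the class TokenWindowSketch and the method repeatedWindows are hypothetical and exist only to illustrate the idea of comparing token image sequences.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TokenWindowSketch {
    // Returns the start indices of windows whose token images already occurred earlier.
    static List<Integer> repeatedWindows(List<String> images, int windowSize) {
        Map<List<String>, Integer> firstSeen = new HashMap<>();
        List<Integer> repeats = new ArrayList<>();
        for (int i = 0; i + windowSize <= images.size(); i++) {
            // Copy the window so the map key does not depend on the backing list.
            List<String> window = new ArrayList<>(images.subList(i, i + windowSize));
            if (firstSeen.putIfAbsent(window, i) != null) {
                repeats.add(i);
            }
        }
        return repeats;
    }

    public static void main(String[] args) {
        // Token images for: x = 'ab'  followed by  y = 'ab'
        List<String> images = Arrays.asList("x", "=", "'ab'", "y", "=", "'ab'");
        System.out.println(repeatedWindows(images, 2)); // prints [4]
    }
}
```

If the second literal kept an unnormalized image such as 'a, backslash, line feed, b', the window starting at index 4 would no longer equal the earlier one and the repeat would be missed.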

Integration Points

// Service provider registration (META-INF/services/net.sourceforge.pmd.lang.Language)
net.sourceforge.pmd.lang.python.PythonLanguageModule

// Language metadata
String LANGUAGE_ID = "python";
String LANGUAGE_NAME = "Python";  
String[] FILE_EXTENSIONS = {".py"};

Integration Features:

Automatic discovery: Registered as Java service provider for PMD language system
CPD framework: Full integration with PMD's copy-paste detection engine
Token management: Uses PMD's JavaCC-based token processing infrastructure
Language registry: Available through LanguageRegistry.CPD.getLanguageById("python")
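
The discovery mechanism listed above relies on the JDK's java.util.ServiceLoader. The sketch below shows the general pattern using only the standard library. The Lang interface and the findById helper are hypothetical stand-ins for PMD's own language interface and registry lookup, so this should be read as an illustration of the mechanism rather than PMD's implementation.

```java
import java.util.Optional;
import java.util.ServiceLoader;

final class LanguageDiscoverySketch {
    // Hypothetical stand-in for a language service interface; only the ID matters here.
    interface Lang {
        String getId();
    }

    // Looks up a language by ID among whatever providers were discovered.
    static Optional<Lang> findById(Iterable<Lang> candidates, String id) {
        for (Lang lang : candidates) {
            if (lang.getId().equals(id)) {
                return Optional.of(lang);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // ServiceLoader reads provider class names from a META-INF/services file
        // named after the service interface, found on the classpath.
        ServiceLoader<Lang> discovered = ServiceLoader.load(Lang.class);
        // No provider file exists for this hypothetical interface, so nothing is found.
        System.out.println(findById(discovered, "python").isPresent());
    }
}
```

In the real module, the provider file named in the listing above plays this role, which is why adding the pmd-python jar to the classpath is enough for the registry to find the Python language.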

Usage Examples

Basic PMD Integration:

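// Retrieve Python language module
PythonLanguageModule module = PythonLanguageModule.getInstance();

// Create a property bundle holding the module's default settings
LanguagePropertyBundle languagePropertyBundle = module.newPropertyBundle();

// Create lexer for Python code analysis; createCpdLexer returns the CpdLexer interface type
CpdLexer lexer = module.createCpdLexer(languagePropertyBundle);

// The lexer is now ready for use with PMD's CPD engine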

Token Processing:

// The lexer handles Python-specific token processing automatically
// Including normalization of string literals with escape sequences
PythonCpdLexer lexer = new PythonCpdLexer();
// Token processing happens internally during CPD analysis

Dependencies

This module requires PMD core dependencies:

net.sourceforge.pmd:pmd-core - Core PMD functionality
Java 8+ runtime environment
JavaCC token processing support

Technical Notes

Grammar: Based on the Python 2.7 grammar from the PyDev project (taken from PyDev commit 32950d534139f286e03d34795aec99edab09c04c)
Token Processing: Special handling for Python string literals including escape sequence normalization for line breaks
CPD Integration: Designed specifically for copy-paste detection rather than full parsing
Language Support: Currently supports Python 2.7 syntax patterns including Unicode strings, byte strings, and backtick expressions
Performance: Optimized for large-scale code analysis across Python codebases
Testing: Comprehensive test coverage including special comments, Unicode handling, tab width processing, variable names with dollar signs, and backtick expressions
Build Integration: Uses JavaCC Maven plugin with ant wrapper for grammar processing during build phase

Install with Tessl CLI

npx tessl i tessl/maven-net-sourceforge-pmd--pmd-python

Workspace: tessl
Visibility: Public
Created: 6 months ago
Last updated: about 1 month ago
Describes: pkg:maven/net.sourceforge.pmd/pmd-python@7.13.x