tessl/maven-net-sourceforge-pmd--pmd-python

Python language support module for PMD's Copy-Paste Detector (CPD), enabling detection of duplicated code in Python source files

PMD Python Language Module

PMD Python is a Java library that provides Python language support for PMD's Copy-Paste Detector (CPD). It implements a CPD lexer for Python source files so that CPD can detect duplicated code in Python projects. The module targets copy-paste detection only and does not perform full parsing.

Package Information

  • Package Name: pmd-python
  • Package Type: Maven
  • Language: Java
  • Group ID: net.sourceforge.pmd
  • Artifact ID: pmd-python
  • Version: 7.13.0
  • Installation: Add dependency to Maven pom.xml:
<dependency>
    <groupId>net.sourceforge.pmd</groupId>
    <artifactId>pmd-python</artifactId>
    <version>7.13.0</version>
</dependency>

Core Imports

import net.sourceforge.pmd.lang.python.PythonLanguageModule;
import net.sourceforge.pmd.lang.python.cpd.PythonCpdLexer;
import net.sourceforge.pmd.lang.python.ast.PythonTokenKinds;
import net.sourceforge.pmd.cpd.CpdLexer;
import net.sourceforge.pmd.lang.LanguagePropertyBundle;
import net.sourceforge.pmd.lang.LanguageRegistry;
import net.sourceforge.pmd.lang.TokenManager;
import net.sourceforge.pmd.lang.ast.impl.javacc.JavaccToken;
import net.sourceforge.pmd.lang.ast.impl.javacc.CharStream;
import net.sourceforge.pmd.lang.ast.impl.javacc.JavaccTokenDocument.TokenDocumentBehavior;
import net.sourceforge.pmd.lang.document.TextDocument;
import java.util.regex.Pattern;

Basic Usage

// Get the Python language module instance
PythonLanguageModule pythonModule = PythonLanguageModule.getInstance();

// Create a CPD lexer for Python code analysis
PythonCpdLexer lexer = new PythonCpdLexer();

// The lexer can be used with PMD's CPD framework to detect duplicate code

Architecture

PMD Python integrates with PMD's language module framework through several key components:

  • Language Module: PythonLanguageModule registers Python as a supported language for copy-paste detection
  • CPD Lexer: PythonCpdLexer provides Python-specific tokenization for duplicate code detection
  • Grammar Definition: JavaCC-based Python 2.7 grammar for lexical analysis
  • Service Registration: Automatic discovery through Java service provider interface

Types

PMD Framework Types

// Core language module base class
abstract class CpdOnlyLanguageModuleBase {
    protected CpdOnlyLanguageModuleBase(LanguageMetadata metadata);
}

// Language property configuration
interface LanguagePropertyBundle {
    // PMD language property bundle for configuration
}

// CPD lexer interface
interface CpdLexer {
    // Copy-paste detector lexer interface
}

// JavaCC-based CPD lexer implementation
abstract class JavaccCpdLexer implements CpdLexer {
    protected abstract TokenManager<JavaccToken> makeLexerImpl(TextDocument doc);
    protected abstract String getImage(JavaccToken token);
}

// Token manager for JavaCC tokens
interface TokenManager<T extends JavaccToken> {
    // Manages tokens for JavaCC-based lexers
}

// JavaCC token representation  
class JavaccToken {
    public int kind;
    public String getImage();
}

// Text document for processing
interface TextDocument {
    // Represents a text document for lexical analysis
}

// Language metadata builder
class LanguageMetadata {
    public static LanguageMetadata withId(String id);
    public LanguageMetadata name(String name);
    public LanguageMetadata extensions(String... extensions);
}

// Character stream for JavaCC parsing
class CharStream {
    public static CharStream create(TextDocument doc, TokenDocumentBehavior behavior);
}

// Token document behavior configuration
class TokenDocumentBehavior {
    public TokenDocumentBehavior(String[] tokenNames);
}

Capabilities

Language Module Registration

PythonLanguageModule is the main entry point for integrating Python language support into PMD's language registry.

public class PythonLanguageModule extends CpdOnlyLanguageModuleBase {
    public PythonLanguageModule();
    public static PythonLanguageModule getInstance();
    public CpdLexer createCpdLexer(LanguagePropertyBundle bundle);
}

PythonLanguageModule provides:

  • Constructor: PythonLanguageModule() - Creates the language module with Python metadata (language ID "python", name "Python", file extension ".py")
  • Static method: getInstance() - Returns the singleton instance from PMD's language registry
  • Factory method: createCpdLexer(LanguagePropertyBundle bundle) - Creates a Python CPD lexer instance for tokenization

Copy-Paste Detection Lexer

PythonCpdLexer is a Python-specific tokenizer that plugs into PMD's Copy-Paste Detector to identify duplicated code in Python source files.

public class PythonCpdLexer extends JavaccCpdLexer {
    public PythonCpdLexer();
    protected TokenManager<JavaccToken> makeLexerImpl(TextDocument doc);
    protected String getImage(JavaccToken token);
}

PythonCpdLexer provides:

  • Constructor: PythonCpdLexer() - Creates Python CPD lexer instance
  • Token manager factory: makeLexerImpl(TextDocument doc) - Creates token manager for processing Python source documents using PythonTokenKinds
  • Token image processing: getImage(JavaccToken token) - Normalizes token images; for single-quoted string literals it removes escaped line breaks, that is, a backslash followed by an optional carriage return and a newline (regex \\\r?\n)

Internal Implementation Details:

  • Uses Pattern.compile("\\\\\\r?\\n") for string escape normalization
  • Processes specific token kinds: SINGLE_STRING, SINGLE_STRING2, SINGLE_BSTRING, SINGLE_BSTRING2, SINGLE_USTRING, SINGLE_USTRING2
  • Returns normalized token images for consistent duplicate detection
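
To make the normalization concrete, here is a minimal, self-contained sketch that applies the pattern literal listed above to a string image. The class name LineContinuationDemo and the method normalize are illustrative assumptions and are not part of pmd-python; only the pattern itself comes from the description above.

```java
import java.util.regex.Pattern;

// Illustrative stand-in (hypothetical name); not part of pmd-python.
final class LineContinuationDemo {
    // Java literal for the regex \\\r?\n: a backslash, an optional CR, then an LF.
    private static final Pattern STRING_NL_ESCAPE = Pattern.compile("\\\\\\r?\\n");

    // Removes every backslash-newline continuation from a token image.
    static String normalize(String image) {
        return STRING_NL_ESCAPE.matcher(image).replaceAll("");
    }

    public static void main(String[] args) {
        String unixImage = "'abc\\\ndef'";      // 'abc\<LF>def'
        String windowsImage = "'abc\\\r\ndef'"; // 'abc\<CR><LF>def'
        System.out.println(normalize(unixImage));    // prints 'abcdef'
        System.out.println(normalize(windowsImage)); // prints 'abcdef'
    }
}
```

Note that an ordinary escape such as a backslash followed by the letter n is left untouched, because the pattern only matches a backslash that is directly followed by a real line break.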

Python Token Types

The lexer recognizes Python tokens through the constants in the generated PythonTokenKinds class:

// Python token constants - generated from Python.jj grammar
class PythonTokenKinds {
    // String literal token types (handled specially for escape processing)
    public static final int SINGLE_STRING;     // Single-quoted strings: 'text'
    public static final int SINGLE_STRING2;    // Alternative single-quoted strings
    public static final int SINGLE_BSTRING;    // Byte strings: b'text'
    public static final int SINGLE_BSTRING2;   // Alternative byte strings
    public static final int SINGLE_USTRING;    // Unicode strings: u'text'  
    public static final int SINGLE_USTRING2;   // Alternative unicode strings
    
    // Token creation and management
    public static TokenManager<JavaccToken> newTokenManager(CharStream charStream);
    public static final String[] TOKEN_NAMES;
}

Token Processing Features:

  • String normalization: Removes escaped line breaks (a backslash followed by \r?\n) from single-quoted string tokens
  • Token classification: Full recognition of Python operators, keywords, separators, and literals
  • Grammar compliance: Based on Python 2.7 language specification from PyDev project
  • Generated constants: Token kind integers are generated at build time from Python.jj grammar

Integration Points

// Service provider registration (META-INF/services/net.sourceforge.pmd.lang.Language)
net.sourceforge.pmd.lang.python.PythonLanguageModule

// Language metadata
String LANGUAGE_ID = "python";
String LANGUAGE_NAME = "Python";  
String[] FILE_EXTENSIONS = {".py"};

Integration Features:

  • Automatic discovery: Registered as Java service provider for PMD language system
  • CPD framework: Full integration with PMD's copy-paste detection engine
  • Token management: Uses PMD's JavaCC-based token processing infrastructure
  • Language registry: Available through LanguageRegistry.CPD.getLanguageById("python")

Usage Examples

Basic PMD Integration:

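// Retrieve the Python language module from PMD's language registry
PythonLanguageModule module = PythonLanguageModule.getInstance();

// Create a lexer for Python code analysis.
// createCpdLexer returns the CpdLexer interface type, not PythonCpdLexer.
LanguagePropertyBundle bundle = module.newPropertyBundle();
CpdLexer lexer = module.createCpdLexer(bundle);

// The lexer is now ready for use with PMD's CPD engine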

Token Processing:

// The lexer handles Python-specific token processing automatically
// Including normalization of string literals with escape sequences
PythonCpdLexer lexer = new PythonCpdLexer();
// Token processing happens internally during CPD analysis

Dependencies

This module requires PMD core dependencies:

  • net.sourceforge.pmd:pmd-core - Core PMD functionality
  • Java 8+ runtime environment
  • JavaCC token processing support

Technical Notes

  • Grammar: Based on Python 2.7 language specification from PyDev project (originally from PyDev commit 32950d534139f286e03d34795aec99edab09c04c)
  • Token Processing: Special handling for Python string literals including escape sequence normalization for line breaks
  • CPD Integration: Designed specifically for copy-paste detection rather than full parsing
  • Language Support: Currently supports Python 2.7 syntax patterns including Unicode strings, byte strings, and backtick expressions
  • Performance: Optimized for large-scale code analysis across Python codebases
  • Testing: Comprehensive test coverage including special comments, Unicode handling, tab width processing, variable names with dollar signs, and backtick expressions
  • Build Integration: Uses JavaCC Maven plugin with ant wrapper for grammar processing during build phase