Tessl Tile for maven/net.sourceforge.pmd/pmd-python@7.13.0

# PMD Python Language Module

PMD Python is a Java library that provides Python language support for PMD's Copy-Paste Detector (CPD). It implements a CPD lexer for Python source files, so duplicated code in Python projects can be detected as part of a PMD analysis. The module is CPD-only: it tokenizes Python code but does not parse it fully.

## Package Information

- **Package Name**: pmd-python
- **Package Type**: Maven
- **Language**: Java
- **Group ID**: net.sourceforge.pmd
- **Artifact ID**: pmd-python
- **Version**: 7.13.0
- **Installation**: Add the dependency to your Maven `pom.xml`:

```xml
<dependency>
    <groupId>net.sourceforge.pmd</groupId>
    <artifactId>pmd-python</artifactId>
    <version>7.13.0</version>
</dependency>
```

## Core Imports

```java
import net.sourceforge.pmd.lang.python.PythonLanguageModule;
import net.sourceforge.pmd.lang.python.cpd.PythonCpdLexer;
import net.sourceforge.pmd.lang.python.ast.PythonTokenKinds;
import net.sourceforge.pmd.cpd.CpdLexer;
import net.sourceforge.pmd.lang.LanguagePropertyBundle;
import net.sourceforge.pmd.lang.LanguageRegistry;
import net.sourceforge.pmd.lang.TokenManager;
import net.sourceforge.pmd.lang.ast.impl.javacc.JavaccToken;
import net.sourceforge.pmd.lang.ast.impl.javacc.CharStream;
import net.sourceforge.pmd.lang.ast.impl.javacc.JavaccTokenDocument.TokenDocumentBehavior;
import net.sourceforge.pmd.lang.document.TextDocument;
import java.util.regex.Pattern;
```

## Basic Usage

```java
// Get the Python language module instance
PythonLanguageModule pythonModule = PythonLanguageModule.getInstance();

// Create a CPD lexer for Python code analysis
PythonCpdLexer lexer = new PythonCpdLexer();

// The lexer can be used with PMD's CPD framework to detect duplicate code
```

## Architecture

PMD Python integrates with PMD's language module framework through several key components:

- **Language Module**: `PythonLanguageModule` registers Python as a supported language for copy-paste detection
- **CPD Lexer**: `PythonCpdLexer` provides Python-specific tokenization for duplicate code detection
- **Grammar Definition**: JavaCC-based Python 2.7 grammar for lexical analysis
- **Service Registration**: Automatic discovery through Java service provider interface

## Types

### PMD Framework Types

```java { .api }
// Core language module base class
abstract class CpdOnlyLanguageModuleBase {
    protected CpdOnlyLanguageModuleBase(LanguageMetadata metadata);
}

// Language property configuration
interface LanguagePropertyBundle {
    // PMD language property bundle for configuration
}

// CPD lexer interface
interface CpdLexer {
    // Copy-paste detector lexer interface
}

// JavaCC-based CPD lexer implementation
abstract class JavaccCpdLexer implements CpdLexer {
    protected abstract TokenManager<JavaccToken> makeLexerImpl(TextDocument doc);
    protected abstract String getImage(JavaccToken token);
}

// Token manager for JavaCC tokens
interface TokenManager<T extends JavaccToken> {
    // Manages tokens for JavaCC-based lexers
}

// JavaCC token representation
class JavaccToken {
    public int kind;
    public String getImage();
}

// Text document for processing
interface TextDocument {
    // Represents a text document for lexical analysis
}

// Language metadata builder
class LanguageMetadata {
    public static LanguageMetadata withId(String id);
    public LanguageMetadata name(String name);
    public LanguageMetadata extensions(String... extensions);
}

// Character stream for JavaCC parsing
class CharStream {
    public static CharStream create(TextDocument doc, TokenDocumentBehavior behavior);
}

// Token document behavior configuration
class TokenDocumentBehavior {
    public TokenDocumentBehavior(String[] tokenNames);
}
```

## Capabilities

### Language Module Registration

The main entry point for integrating Python language support into PMD's language registry system.

```java { .api }
public class PythonLanguageModule extends CpdOnlyLanguageModuleBase {
    public PythonLanguageModule();
    public static PythonLanguageModule getInstance();
    public CpdLexer createCpdLexer(LanguagePropertyBundle bundle);
}
```

**PythonLanguageModule** provides:
- **Constructor**: `PythonLanguageModule()` - Creates the language module with Python metadata (language ID "python", name "Python", file extension ".py")
- **Static method**: `getInstance()` - Returns the singleton instance from PMD's language registry
- **Factory method**: `createCpdLexer(LanguagePropertyBundle bundle)` - Creates a Python CPD lexer instance for tokenization
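
The pattern in this list (metadata passed to a base constructor, a registry-backed `getInstance()`, and a lexer factory) can be sketched without PMD on the classpath. Every type below (`Lexer`, `Language`, `Registry`, `PythonModule`) is a hypothetical stand-in rather than a PMD class, and the hand-written `register` call is only assumed here to play the role of PMD's service-provider discovery.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: all nested types are hypothetical stand-ins, not PMD classes.
class ModuleSketch {

    /** Stand-in for a CPD lexer. */
    interface Lexer { }

    /** Stand-in for a CPD-only module base: holds metadata and creates lexers. */
    abstract static class Language {
        final String id;
        final String name;
        final List<String> extensions;

        Language(String id, String name, String... extensions) {
            this.id = id;
            this.name = name;
            this.extensions = Arrays.asList(extensions);
        }

        abstract Lexer createCpdLexer();
    }

    /** Stand-in for a language registry keyed by language ID. */
    static final class Registry {
        private static final Map<String, Language> BY_ID = new HashMap<>();

        static void register(Language language) {
            BY_ID.put(language.id, language);
        }

        static Language getLanguageById(String id) {
            return BY_ID.get(id);
        }
    }

    /** Module carrying the metadata listed above: ID, name, file extension. */
    static final class PythonModule extends Language {
        PythonModule() {
            super("python", "Python", ".py");
        }

        // Looks the shared instance up in the registry instead of creating a new one.
        static PythonModule getInstance() {
            return (PythonModule) Registry.getLanguageById("python");
        }

        @Override
        Lexer createCpdLexer() {
            return new Lexer() { };
        }
    }

    public static void main(String[] args) {
        Registry.register(new PythonModule());
        PythonModule module = PythonModule.getInstance();
        System.out.println(module.id + " / " + module.name + " / " + module.extensions);
    }
}
```

In this sketch `getInstance()` returns `null` until a module has been registered, which is why `main` registers one first.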

### Copy-Paste Detection Lexer

Python-specific tokenizer that integrates with PMD's Copy-Paste Detector to identify duplicate code patterns in Python source files.

```java { .api }
public class PythonCpdLexer extends JavaccCpdLexer {
    public PythonCpdLexer();
    protected TokenManager<JavaccToken> makeLexerImpl(TextDocument doc);
    protected String getImage(JavaccToken token);
}
```

**PythonCpdLexer** provides:
- **Constructor**: `PythonCpdLexer()` - Creates a Python CPD lexer instance
- **Token manager factory**: `makeLexerImpl(TextDocument doc)` - Creates the token manager for a Python source document using `PythonTokenKinds`
- **Token image processing**: `getImage(JavaccToken token)` - Normalizes token images. For single-quoted string literals it removes line break escapes (a backslash immediately followed by a line break, `\n` or `\r\n`); other tokens keep their original image

**Internal Implementation Details**:
- Uses `Pattern.compile("\\\\\\r?\\n")` for string escape normalization; as a regex this matches a literal backslash, an optional carriage return, and a line feed
- Processes specific token kinds: `SINGLE_STRING`, `SINGLE_STRING2`, `SINGLE_BSTRING`, `SINGLE_BSTRING2`, `SINGLE_USTRING`, `SINGLE_USTRING2`
- Returns normalized token images for consistent duplicate detection
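
The normalization step can be tried in isolation with only the JDK. The sketch below reuses the pattern quoted in this list; the class name `StringImageSketch` and the integer values of the two token kinds are hypothetical stand-ins for the generated `PythonTokenKinds` constants, so this shows the idea rather than the module's actual source.

```java
import java.util.regex.Pattern;

// Illustrative only: kind values are made up, not the generated PythonTokenKinds values.
class StringImageSketch {
    static final int SINGLE_STRING = 1; // hypothetical value
    static final int NAME = 2;          // hypothetical value

    // Pattern quoted in the list above: backslash, optional \r, then \n.
    private static final Pattern STRING_NL_ESCAPE = Pattern.compile("\\\\\\r?\\n");

    // Strips line break escapes from string images; other kinds pass through unchanged.
    static String getImage(int kind, String image) {
        switch (kind) {
        case SINGLE_STRING:
            return STRING_NL_ESCAPE.matcher(image).replaceAll("");
        default:
            return image;
        }
    }

    public static void main(String[] args) {
        // Python source: 'hello \<line break>world'
        String continued = "'hello \\\nworld'";
        System.out.println(getImage(SINGLE_STRING, continued)); // prints 'hello world'
    }
}
```

Because the carriage return is optional in the pattern, the same literal written with Windows line endings normalizes to the same image.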

### Python Token Types

The lexer recognizes comprehensive Python language tokens through the generated PythonTokenKinds class:

```java { .api }
// Python token constants - generated from Python.jj grammar
class PythonTokenKinds {
    // String literal token types (handled specially for escape processing)
    public static final int SINGLE_STRING;     // Single-quoted strings: 'text'
    public static final int SINGLE_STRING2;    // Alternative single-quoted strings
    public static final int SINGLE_BSTRING;    // Byte strings: b'text'
    public static final int SINGLE_BSTRING2;   // Alternative byte strings
    public static final int SINGLE_USTRING;    // Unicode strings: u'text'
    public static final int SINGLE_USTRING2;   // Alternative unicode strings

    // Token creation and management
    public static TokenManager<JavaccToken> newTokenManager(CharStream charStream);
    public static final String[] TOKEN_NAMES;
}
```

**Token Processing Features**:
- **String normalization**: Removes line break escapes (a backslash immediately followed by `\n` or `\r\n`) from single-quoted string tokens
- **Token classification**: Full recognition of Python operators, keywords, separators, and literals
- **Grammar compliance**: Based on Python 2.7 language specification from PyDev project
- **Generated constants**: Token kind integers are generated at build time from Python.jj grammar
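
To see why normalized images matter for duplicate detection, the sketch below compares token sequences with a naive sliding window. It is only an illustration: `NaiveDuplicateFinder` and `findDuplicates` are hypothetical names, the token lists are hand-written rather than produced by `PythonCpdLexer`, and this is not the matching algorithm CPD itself uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a naive window comparison, not CPD's matching algorithm.
class NaiveDuplicateFinder {

    /**
     * Returns {earlierStart, laterStart} index pairs for every window of
     * `window` token images that repeats an earlier, non-overlapping window.
     */
    static List<int[]> findDuplicates(List<String> images, int window) {
        Map<String, Integer> firstSeen = new HashMap<>();
        List<int[]> matches = new ArrayList<>();
        for (int i = 0; i + window <= images.size(); i++) {
            // NUL separator: assumed not to occur in the hand-written images.
            String key = String.join("\0", images.subList(i, i + window));
            Integer previous = firstSeen.putIfAbsent(key, i);
            if (previous != null && previous + window <= i) {
                matches.add(new int[] {previous, i});
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Images after normalization: both string literals read 'hello world'.
        List<String> normalized =
            Arrays.asList("a", "=", "'hello world'", "b", "=", "'hello world'");
        // Raw images: the first literal still contains a backslash line break.
        List<String> raw =
            Arrays.asList("a", "=", "'hello \\\nworld'", "b", "=", "'hello world'");
        System.out.println(findDuplicates(normalized, 2).size());
        System.out.println(findDuplicates(raw, 2).size());
    }
}
```

With the normalized images, the window `=`, `'hello world'` at index 4 matches the one at index 1. With the raw image the two literals differ, so no window repeats.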

### Integration Points

```java { .api }
// Service provider registration (META-INF/services/net.sourceforge.pmd.lang.Language)
net.sourceforge.pmd.lang.python.PythonLanguageModule

// Language metadata
String LANGUAGE_ID = "python";
String LANGUAGE_NAME = "Python";
String[] FILE_EXTENSIONS = {".py"};
```

**Integration Features**:
- **Automatic discovery**: Registered as Java service provider for PMD language system
- **CPD framework**: Full integration with PMD's copy-paste detection engine
- **Token management**: Uses PMD's JavaCC-based token processing infrastructure
- **Language registry**: Available through `LanguageRegistry.CPD.getLanguageById("python")`

### Usage Examples

**Basic PMD Integration**:
```java
// Retrieve Python language module
PythonLanguageModule module = PythonLanguageModule.getInstance();

// Create lexer for Python code analysis.
// createCpdLexer returns the CpdLexer interface type, so declare the variable as CpdLexer.
// languagePropertyBundle is assumed to be a LanguagePropertyBundle obtained elsewhere.
CpdLexer lexer = module.createCpdLexer(languagePropertyBundle);

// The lexer is now ready for use with PMD's CPD engine
```

**Token Processing**:
```java
// The lexer handles Python-specific token processing automatically
// Including normalization of string literals with escape sequences
PythonCpdLexer lexer = new PythonCpdLexer();
// Token processing happens internally during CPD analysis
```

## Dependencies

This module requires PMD core dependencies:

- `net.sourceforge.pmd:pmd-core` - Core PMD functionality
- Java 8+ runtime environment
- JavaCC token processing support

## Technical Notes

- **Grammar**: Based on Python 2.7 language specification from PyDev project (originally from PyDev commit 32950d534139f286e03d34795aec99edab09c04c)
- **Token Processing**: Special handling for Python string literals including escape sequence normalization for line breaks
- **CPD Integration**: Designed specifically for copy-paste detection rather than full parsing
- **Language Support**: Currently supports Python 2.7 syntax patterns including Unicode strings, byte strings, and backtick expressions
- **Performance**: Optimized for large-scale code analysis across Python codebases
- **Testing**: Comprehensive test coverage including special comments, Unicode handling, tab width processing, variable names with dollar signs, and backtick expressions
- **Build Integration**: Uses JavaCC Maven plugin with ant wrapper for grammar processing during build phase
