# PMD Python Language Module

PMD Python is a Java library that adds Python language support to PMD's Copy-Paste Detector (CPD). It implements a CPD lexer for Python source files, which lets CPD detect duplicated code in Python projects.

## Package Information

- **Package Name**: pmd-python
- **Package Type**: Maven
- **Language**: Java
- **Group ID**: net.sourceforge.pmd
- **Artifact ID**: pmd-python
- **Version**: 7.13.0
- **Installation**: Add the dependency to your Maven `pom.xml`:

```xml
<dependency>
    <groupId>net.sourceforge.pmd</groupId>
    <artifactId>pmd-python</artifactId>
    <version>7.13.0</version>
</dependency>
```

## Core Imports

```java
import net.sourceforge.pmd.lang.python.PythonLanguageModule;
import net.sourceforge.pmd.lang.python.cpd.PythonCpdLexer;
import net.sourceforge.pmd.lang.python.ast.PythonTokenKinds;
import net.sourceforge.pmd.cpd.CpdLexer;
import net.sourceforge.pmd.lang.LanguagePropertyBundle;
import net.sourceforge.pmd.lang.LanguageRegistry;
import net.sourceforge.pmd.lang.TokenManager;
import net.sourceforge.pmd.lang.ast.impl.javacc.JavaccToken;
import net.sourceforge.pmd.lang.ast.impl.javacc.CharStream;
import net.sourceforge.pmd.lang.ast.impl.javacc.JavaccTokenDocument.TokenDocumentBehavior;
import net.sourceforge.pmd.lang.document.TextDocument;
import java.util.regex.Pattern;
```

## Basic Usage

```java
// Get the Python language module instance
PythonLanguageModule pythonModule = PythonLanguageModule.getInstance();

// Create a CPD lexer for Python code analysis
PythonCpdLexer lexer = new PythonCpdLexer();

// The lexer can be used with PMD's CPD framework to detect duplicate code
```

## Architecture

PMD Python integrates with PMD's language module framework through several key components:

- **Language Module**: `PythonLanguageModule` registers Python as a supported language for copy-paste detection
- **CPD Lexer**: `PythonCpdLexer` provides Python-specific tokenization for duplicate code detection
- **Grammar Definition**: JavaCC-based Python 2.7 grammar for lexical analysis
- **Service Registration**: Automatic discovery through the Java service provider interface

## Types

### PMD Framework Types

```java { .api }
// Core language module base class
abstract class CpdOnlyLanguageModuleBase {
    protected CpdOnlyLanguageModuleBase(LanguageMetadata metadata);
}

// Language property configuration
interface LanguagePropertyBundle {
    // PMD language property bundle for configuration
}

// CPD lexer interface
interface CpdLexer {
    // Copy-paste detector lexer interface
}

// JavaCC-based CPD lexer implementation
abstract class JavaccCpdLexer implements CpdLexer {
    protected abstract TokenManager<JavaccToken> makeLexerImpl(TextDocument doc);
    protected abstract String getImage(JavaccToken token);
}

// Token manager for JavaCC tokens
interface TokenManager<T extends JavaccToken> {
    // Manages tokens for JavaCC-based lexers
}

// JavaCC token representation
class JavaccToken {
    public int kind;
    public String getImage();
}

// Text document for processing
interface TextDocument {
    // Represents a text document for lexical analysis
}

// Language metadata builder
class LanguageMetadata {
    public static LanguageMetadata withId(String id);
    public LanguageMetadata name(String name);
    public LanguageMetadata extensions(String... extensions);
}

// Character stream for JavaCC parsing
class CharStream {
    public static CharStream create(TextDocument doc, TokenDocumentBehavior behavior);
}

// Token document behavior configuration
class TokenDocumentBehavior {
    public TokenDocumentBehavior(String[] tokenNames);
}
```

## Capabilities

### Language Module Registration

The main entry point for integrating Python language support into PMD's language registry system.

```java { .api }
public class PythonLanguageModule extends CpdOnlyLanguageModuleBase {
    public PythonLanguageModule();
    public static PythonLanguageModule getInstance();
    public CpdLexer createCpdLexer(LanguagePropertyBundle bundle);
}
```

**PythonLanguageModule** provides:
- **Constructor**: `PythonLanguageModule()` - Creates the language module with Python metadata (language ID "python", name "Python", file extension ".py")
- **Static method**: `getInstance()` - Returns the singleton instance from PMD's language registry
- **Factory method**: `createCpdLexer(LanguagePropertyBundle bundle)` - Creates a Python CPD lexer instance for tokenization

### Copy-Paste Detection Lexer

Python-specific tokenizer that integrates with PMD's Copy-Paste Detector to identify duplicate code patterns in Python source files.

```java { .api }
public class PythonCpdLexer extends JavaccCpdLexer {
    public PythonCpdLexer();
    protected TokenManager<JavaccToken> makeLexerImpl(TextDocument doc);
    protected String getImage(JavaccToken token);
}
```

**PythonCpdLexer** provides:
- **Constructor**: `PythonCpdLexer()` - Creates a Python CPD lexer instance
- **Token manager factory**: `makeLexerImpl(TextDocument doc)` - Creates the token manager that processes Python source documents, using `PythonTokenKinds`
- **Token image processing**: `getImage(JavaccToken token)` - Normalizes token images. For single-quoted string literals it removes line-break escapes, that is, a backslash immediately followed by a line break (`\r?\n`)

**Internal Implementation Details**:
- Uses `Pattern.compile("\\\\\\r?\\n")` for string escape normalization (as a regex: a literal backslash, an optional carriage return, then a line feed)
- Processes specific token kinds: `SINGLE_STRING`, `SINGLE_STRING2`, `SINGLE_BSTRING`, `SINGLE_BSTRING2`, `SINGLE_USTRING`, `SINGLE_USTRING2`
- Returns normalized token images for consistent duplicate detection

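The normalization described above can be tried out with the JDK alone. The following is a minimal sketch, not the module's source: the class `StringImageDemo`, its `imageFor` helper, and the two integer constants are hypothetical stand-ins for the lexer and the generated `PythonTokenKinds` values. Only the pattern literal is taken from the notes above.

```java
import java.util.regex.Pattern;

// Hypothetical demo class. The integer constants are illustrative stand-ins,
// not the values generated into PythonTokenKinds.
final class StringImageDemo {

    static final int SINGLE_STRING = 1;  // stand-in for a single-quoted string kind
    static final int OTHER_KIND = 2;     // stand-in for any other token kind

    // The Java literal "\\\\\\r?\\n" is the regex \\\r?\n: a literal backslash,
    // an optional carriage return, then a line feed.
    private static final Pattern STRING_NL_ESCAPE = Pattern.compile("\\\\\\r?\\n");

    // Returns the image to compare: normalized for the string kind, raw otherwise.
    static String imageFor(int kind, String rawImage) {
        switch (kind) {
        case SINGLE_STRING:
            return STRING_NL_ESCAPE.matcher(rawImage).replaceAll("");
        default:
            return rawImage;
        }
    }

    public static void main(String[] args) {
        // Python literal 'hello \<LF>world', written with a line continuation.
        String continued = "'hello \\\nworld'";
        // After normalization it has the same image as 'hello world'.
        System.out.println(imageFor(SINGLE_STRING, continued).equals("'hello world'"));
        // Other token kinds are passed through untouched.
        System.out.println(imageFor(OTHER_KIND, continued).equals(continued));
    }
}
```

Because the continuation is stripped before comparison, two string literals that differ only in where they are wrapped produce the same token image, so a difference in line wrapping alone does not hide a duplicate.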
### Python Token Types

The lexer recognizes Python language tokens through the generated `PythonTokenKinds` class:

```java { .api }
// Python token constants - generated from Python.jj grammar
class PythonTokenKinds {
    // String literal token types (handled specially for escape processing)
    public static final int SINGLE_STRING;   // Single-quoted strings: 'text'
    public static final int SINGLE_STRING2;  // Alternative single-quoted strings
    public static final int SINGLE_BSTRING;  // Byte strings: b'text'
    public static final int SINGLE_BSTRING2; // Alternative byte strings
    public static final int SINGLE_USTRING;  // Unicode strings: u'text'
    public static final int SINGLE_USTRING2; // Alternative unicode strings

    // Token creation and management
    public static TokenManager<JavaccToken> newTokenManager(CharStream charStream);
    public static final String[] TOKEN_NAMES;
}
```

**Token Processing Features**:
- **String normalization**: Removes line-break escapes (a backslash followed by `\r?\n`) from single-quoted string tokens
- **Token classification**: Full recognition of Python operators, keywords, separators, and literals
- **Grammar compliance**: Based on Python 2.7 language specification from PyDev project
- **Generated constants**: Token kind integers are generated at build time from the Python.jj grammar

### Integration Points

```java { .api }
// Service provider registration (META-INF/services/net.sourceforge.pmd.lang.Language)
net.sourceforge.pmd.lang.python.PythonLanguageModule

// Language metadata
String LANGUAGE_ID = "python";
String LANGUAGE_NAME = "Python";
String[] FILE_EXTENSIONS = {".py"};
```

**Integration Features**:
- **Automatic discovery**: Registered as Java service provider for PMD language system
- **CPD framework**: Full integration with PMD's copy-paste detection engine
- **Token management**: Uses PMD's JavaCC-based token processing infrastructure
- **Language registry**: Available through `LanguageRegistry.CPD.getLanguageById("python")`

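The automatic discovery mentioned above relies on the JDK's `java.util.ServiceLoader`. As a rough illustration of that mechanism, the sketch below lists the implementation classes registered for a service type on the current classpath. `ProviderLister` and `providerNames` are hypothetical names used only for this example and are not part of PMD.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

// Hypothetical helper: lists the implementation classes that the JDK's
// ServiceLoader finds for a service type on the current classpath.
final class ProviderLister {

    static <S> List<String> providerNames(Class<S> service) {
        List<String> names = new ArrayList<>();
        // ServiceLoader reads META-INF/services/<fully.qualified.TypeName>
        // from the classpath and instantiates each class listed there.
        for (S provider : ServiceLoader.load(service)) {
            names.add(provider.getClass().getName());
        }
        return names;
    }

    public static void main(String[] args) {
        // No provider is registered for this demo class, so the list is empty.
        System.out.println(providerNames(ProviderLister.class));
    }
}
```

With pmd-python on the classpath, calling `ProviderLister.providerNames(net.sourceforge.pmd.lang.Language.class)` would be expected to include `net.sourceforge.pmd.lang.python.PythonLanguageModule`, since that is the class named in the services file. In normal use the lookup goes through `LanguageRegistry` rather than a direct `ServiceLoader` call.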
### Usage Examples

**Basic PMD Integration**:
```java
// Retrieve Python language module
PythonLanguageModule module = PythonLanguageModule.getInstance();

// Create lexer for Python code analysis.
// createCpdLexer is declared to return CpdLexer, so use that type (or cast explicitly).
// languagePropertyBundle is a LanguagePropertyBundle for this module,
// e.g. one obtained from module.newPropertyBundle().
CpdLexer lexer = module.createCpdLexer(languagePropertyBundle);

// The lexer is now ready for use with PMD's CPD engine
```

**Token Processing**:
```java
// The lexer handles Python-specific token processing automatically,
// including normalization of string literals that contain line-break escapes
PythonCpdLexer lexer = new PythonCpdLexer();
// Token processing happens internally during CPD analysis
```

## Dependencies

This module requires PMD core dependencies:

- `net.sourceforge.pmd:pmd-core` - Core PMD functionality
- Java 8+ runtime environment
- JavaCC token processing support

## Technical Notes

- **Grammar**: Based on Python 2.7 language specification from PyDev project (originally from PyDev commit 32950d534139f286e03d34795aec99edab09c04c)
- **Token Processing**: Special handling for Python string literals, including normalization of line-break escapes
- **CPD Integration**: Designed specifically for copy-paste detection rather than full parsing
- **Language Support**: Currently supports Python 2.7 syntax patterns including Unicode strings, byte strings, and backtick expressions
- **Performance**: Optimized for large-scale code analysis across Python codebases
- **Testing**: Test coverage includes special comments, Unicode handling, tab width processing, variable names with dollar signs, and backtick expressions
- **Build Integration**: Uses the JavaCC Maven plugin with an Ant wrapper to process the grammar during the build phase