
# PMD Python Language Module


PMD Python is a Java library that provides Python language support for PMD's Copy-Paste Detector (CPD). It implements a CPD lexer specifically for Python source files, enabling detection of duplicated code in Python projects as part of PMD's comprehensive static analysis capabilities.


## Package Information

- **Package Name**: pmd-python
- **Package Type**: Maven
- **Language**: Java
- **Group ID**: net.sourceforge.pmd
- **Artifact ID**: pmd-python
- **Version**: 7.13.0
- **Installation**: Add dependency to Maven pom.xml:

```xml
<dependency>
  <groupId>net.sourceforge.pmd</groupId>
  <artifactId>pmd-python</artifactId>
  <version>7.13.0</version>
</dependency>
```


## Core Imports

```java
import net.sourceforge.pmd.lang.python.PythonLanguageModule;
import net.sourceforge.pmd.lang.python.cpd.PythonCpdLexer;
import net.sourceforge.pmd.lang.python.ast.PythonTokenKinds;
import net.sourceforge.pmd.cpd.CpdLexer;
import net.sourceforge.pmd.lang.LanguagePropertyBundle;
import net.sourceforge.pmd.lang.LanguageRegistry;
import net.sourceforge.pmd.lang.TokenManager;
import net.sourceforge.pmd.lang.ast.impl.javacc.JavaccToken;
import net.sourceforge.pmd.lang.ast.impl.javacc.CharStream;
import net.sourceforge.pmd.lang.ast.impl.javacc.JavaccTokenDocument.TokenDocumentBehavior;
import net.sourceforge.pmd.lang.document.TextDocument;
import java.util.regex.Pattern;
```


## Basic Usage

```java
// Get the Python language module instance
PythonLanguageModule pythonModule = PythonLanguageModule.getInstance();

// Create a CPD lexer for Python code analysis
PythonCpdLexer lexer = new PythonCpdLexer();

// The lexer can be used with PMD's CPD framework to detect duplicate code
```


## Architecture


PMD Python integrates with PMD's language module framework through several key components:

- **Language Module**: `PythonLanguageModule` registers Python as a supported language for copy-paste detection
- **CPD Lexer**: `PythonCpdLexer` provides Python-specific tokenization for duplicate code detection
- **Grammar Definition**: JavaCC-based Python 2.7 grammar for lexical analysis
- **Service Registration**: Automatic discovery through Java service provider interface

## Types


### PMD Framework Types

```java { .api }
// Core language module base class
abstract class CpdOnlyLanguageModuleBase {
    protected CpdOnlyLanguageModuleBase(LanguageMetadata metadata);
}

// Language property configuration
interface LanguagePropertyBundle {
    // PMD language property bundle for configuration
}

// CPD lexer interface
interface CpdLexer {
    // Copy-paste detector lexer interface
}

// JavaCC-based CPD lexer implementation
abstract class JavaccCpdLexer implements CpdLexer {
    protected abstract TokenManager<JavaccToken> makeLexerImpl(TextDocument doc);
    protected abstract String getImage(JavaccToken token);
}

// Token manager for JavaCC tokens
interface TokenManager<T extends JavaccToken> {
    // Manages tokens for JavaCC-based lexers
}

// JavaCC token representation
class JavaccToken {
    public int kind;
    public String getImage();
}

// Text document for processing
interface TextDocument {
    // Represents a text document for lexical analysis
}

// Language metadata builder
class LanguageMetadata {
    public static LanguageMetadata withId(String id);
    public LanguageMetadata name(String name);
    public LanguageMetadata extensions(String... extensions);
}

// Character stream for JavaCC parsing
class CharStream {
    public static CharStream create(TextDocument doc, TokenDocumentBehavior behavior);
}

// Token document behavior configuration
class TokenDocumentBehavior {
    public TokenDocumentBehavior(String[] tokenNames);
}
```


## Capabilities


### Language Module Registration


The main entry point for integrating Python language support into PMD's language registry system.

```java { .api }
public class PythonLanguageModule extends CpdOnlyLanguageModuleBase {
    public PythonLanguageModule();
    public static PythonLanguageModule getInstance();
    public CpdLexer createCpdLexer(LanguagePropertyBundle bundle);
}
```

**PythonLanguageModule** provides:
- **Constructor**: `PythonLanguageModule()` - Creates the language module with Python metadata (language ID "python", name "Python", file extension ".py")
- **Static method**: `getInstance()` - Returns the singleton instance from PMD's language registry
- **Factory method**: `createCpdLexer(LanguagePropertyBundle bundle)` - Creates a Python CPD lexer instance for tokenization

### Copy-Paste Detection Lexer


Python-specific tokenizer that integrates with PMD's Copy-Paste Detector to identify duplicate code patterns in Python source files.

```java { .api }
public class PythonCpdLexer extends JavaccCpdLexer {
    public PythonCpdLexer();
    protected TokenManager<JavaccToken> makeLexerImpl(TextDocument doc);
    protected String getImage(JavaccToken token);
}
```

**PythonCpdLexer** provides:
- **Constructor**: `PythonCpdLexer()` - Creates a Python CPD lexer instance
- **Token manager factory**: `makeLexerImpl(TextDocument doc)` - Creates the token manager that processes Python source documents using `PythonTokenKinds`
- **Token image processing**: `getImage(JavaccToken token)` - Normalizes token images; for single-quoted string literals it removes escaped line breaks (a backslash followed by `\r?\n`)

**Internal Implementation Details**:
- Uses `Pattern.compile("\\\\\\r?\\n")` for string escape normalization (as a Java string literal, this regex matches a backslash, an optional carriage return, and a line feed)
- Processes specific token kinds: `SINGLE_STRING`, `SINGLE_STRING2`, `SINGLE_BSTRING`, `SINGLE_BSTRING2`, `SINGLE_USTRING`, `SINGLE_USTRING2`
- Returns normalized token images for consistent duplicate detection
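To make the normalization concrete, here is a minimal, self-contained sketch that uses only `java.util.regex`. The regex is the one quoted above; the `StringImageNormalizer` class and its `Kind` enum are hypothetical stand-ins (not PMD's `PythonCpdLexer` or `PythonTokenKinds`), so this illustrates the idea rather than reproducing the library's implementation.

```java
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of PMD.
final class StringImageNormalizer {

    // Hypothetical stand-in for the generated token kind constants.
    enum Kind { SINGLE_STRING, NAME }

    // Matches a backslash, an optional carriage return, then a line feed.
    private static final Pattern STRING_NL_ESCAPE = Pattern.compile("\\\\\\r?\\n");

    private StringImageNormalizer() { }

    // Strips escaped line breaks from string tokens; other tokens pass through unchanged.
    static String normalize(Kind kind, String image) {
        if (kind == Kind.SINGLE_STRING) {
            return STRING_NL_ESCAPE.matcher(image).replaceAll("");
        }
        return image;
    }

    public static void main(String[] args) {
        // A string literal continued across two lines with a trailing backslash.
        String wrapped = "'hello \\\nworld'";
        String flat = "'hello world'";
        // After normalization both spellings yield the same token image.
        System.out.println(normalize(Kind.SINGLE_STRING, wrapped).equals(flat));
    }
}
```

Because both spellings of the literal end up with the same image, a duplicate detector comparing token images would treat them as equal regardless of how the string was wrapped.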


### Python Token Types


The lexer recognizes comprehensive Python language tokens through the generated PythonTokenKinds class:

```java { .api }
// Python token constants - generated from Python.jj grammar
class PythonTokenKinds {
    // String literal token types (handled specially for escape processing)
    public static final int SINGLE_STRING;    // Single-quoted strings: 'text'
    public static final int SINGLE_STRING2;   // Alternative single-quoted strings
    public static final int SINGLE_BSTRING;   // Byte strings: b'text'
    public static final int SINGLE_BSTRING2;  // Alternative byte strings
    public static final int SINGLE_USTRING;   // Unicode strings: u'text'
    public static final int SINGLE_USTRING2;  // Alternative unicode strings

    // Token creation and management
    public static TokenManager<JavaccToken> newTokenManager(CharStream charStream);
    public static final String[] TOKEN_NAMES;
}
```

**Token Processing Features**:
- **String normalization**: Removes escaped line breaks (a backslash followed by `\r?\n`) from single-quoted string tokens
- **Token classification**: Recognizes Python operators, keywords, separators, and literals
- **Grammar compliance**: Based on the Python 2.7 grammar from the PyDev project
- **Generated constants**: Token kind integers are generated at build time from the Python.jj grammar

### Integration Points

```java { .api }
// Service provider registration: the file
// META-INF/services/net.sourceforge.pmd.lang.Language contains the line
//   net.sourceforge.pmd.lang.python.PythonLanguageModule

// Language metadata
String LANGUAGE_ID = "python";
String LANGUAGE_NAME = "Python";
String[] FILE_EXTENSIONS = {".py"};
```

**Integration Features**:
- **Automatic discovery**: Registered as Java service provider for PMD language system
- **CPD framework**: Full integration with PMD's copy-paste detection engine
- **Token management**: Uses PMD's JavaCC-based token processing infrastructure
- **Language registry**: Available through `LanguageRegistry.CPD.getLanguageById("python")`
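Service-provider registration of this kind is normally consumed through Java's standard `java.util.ServiceLoader`. The sketch below shows that mechanism in isolation. `LanguageDiscoveryDemo` and `DemoLanguage` are hypothetical names introduced for illustration, not PMD types, and the sketch makes no claim about how PMD's own registry is implemented internally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

// Hypothetical illustration class; not part of PMD.
final class LanguageDiscoveryDemo {

    // Hypothetical service interface standing in for a language SPI.
    interface DemoLanguage {
        String getId();
    }

    private LanguageDiscoveryDemo() { }

    // Collects the IDs of every provider listed under META-INF/services
    // for DemoLanguage that is visible to the given class loader.
    static List<String> discoverIds(ClassLoader loader) {
        List<String> ids = new ArrayList<>();
        for (DemoLanguage language : ServiceLoader.load(DemoLanguage.class, loader)) {
            ids.add(language.getId());
        }
        return ids;
    }

    public static void main(String[] args) {
        // No provider file ships with this sketch, so the list is expected to be empty.
        System.out.println(discoverIds(LanguageDiscoveryDemo.class.getClassLoader()));
    }
}
```

A provider becomes discoverable once a `META-INF/services/` file named after the service interface lists its implementation class, which is the same shape as the registration entry shown above for `PythonLanguageModule`.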


### Usage Examples


**Basic PMD Integration**:

```java
// Retrieve Python language module
PythonLanguageModule module = PythonLanguageModule.getInstance();

// Create lexer for Python code analysis.
// createCpdLexer is declared to return CpdLexer, so use that type here.
LanguagePropertyBundle languagePropertyBundle = module.newPropertyBundle();
CpdLexer lexer = module.createCpdLexer(languagePropertyBundle);

// The lexer is now ready for use with PMD's CPD engine
```


**Token Processing**:

```java
// The lexer handles Python-specific token processing automatically
// Including normalization of string literals with escape sequences
PythonCpdLexer lexer = new PythonCpdLexer();
// Token processing happens internally during CPD analysis
```


## Dependencies


This module requires PMD core dependencies:

- `net.sourceforge.pmd:pmd-core` - Core PMD functionality
- Java 8+ runtime environment
- JavaCC token processing support

## Technical Notes

- **Grammar**: Based on Python 2.7 language specification from PyDev project (originally from PyDev commit 32950d534139f286e03d34795aec99edab09c04c)
- **Token Processing**: Special handling for Python string literals including escape sequence normalization for line breaks
- **CPD Integration**: Designed specifically for copy-paste detection rather than full parsing
- **Language Support**: Currently supports Python 2.7 syntax patterns including Unicode strings, byte strings, and backtick expressions
- **Performance**: Optimized for large-scale code analysis across Python codebases
- **Testing**: Comprehensive test coverage including special comments, Unicode handling, tab width processing, variable names with dollar signs, and backtick expressions
- **Build Integration**: Uses JavaCC Maven plugin with Ant wrapper for grammar processing during build phase