# usaddress

A Python library for parsing unstructured United States address strings into labeled address components. It uses a conditional random field (CRF) model to make educated guesses about each component, which helps in challenging cases where rule-based parsers typically fail.

## Package Information

- **Package Name**: usaddress
- **Package Type**: pypi
- **Language**: Python
- **Installation**: `pip install usaddress`
- **Version**: 0.5.16
- **Dependencies**: python-crfsuite>=0.7, probableparsing

## Core Imports

```python
import usaddress
```

## Basic Usage

```python
import usaddress

# Example address string
addr = '123 Main St. Suite 100 Chicago, IL'

# Parse method: split address into components and label each
# Returns list of (token, label) tuples
parsed = usaddress.parse(addr)
# Output: [('123', 'AddressNumber'), ('Main', 'StreetName'), ('St.', 'StreetNamePostType'),
#          ('Suite', 'OccupancyType'), ('100', 'OccupancyIdentifier'),
#          ('Chicago,', 'PlaceName'), ('IL', 'StateName')]

# Tag method: merge consecutive components and return address type
# Returns (dict, address_type) tuple
tagged, address_type = usaddress.tag(addr)
# Output: ({'AddressNumber': '123', 'StreetName': 'Main',
#           'StreetNamePostType': 'St.', 'OccupancyType': 'Suite',
#           'OccupancyIdentifier': '100', 'PlaceName': 'Chicago',
#           'StateName': 'IL'}, 'Street Address')
```

## Capabilities

### Address Parsing

Core functionality for parsing US addresses into labeled components using probabilistic models.

```python { .api }
def parse(address_string: str) -> list[tuple[str, str]]:
    """
    Split an address string into components, and label each component.

    Args:
        address_string (str): The address to parse

    Returns:
        list[tuple[str, str]]: List of (token, label) pairs
    """

def tag(address_string: str, tag_mapping=None) -> tuple[dict[str, str], str]:
    """
    Parse and merge consecutive components & strip commas.
    Also return an address type ('Street Address', 'Intersection', 'PO Box', or 'Ambiguous').

    Because this method returns a dict with labels as keys, it will throw a
    RepeatedLabelError when multiple areas of an address have the same label.

    Args:
        address_string (str): The address to parse
        tag_mapping (dict, optional): Optional mapping to remap labels to custom format

    Returns:
        tuple[dict[str, str], str]: (tagged_address_dict, address_type)
    """
```

#### Usage Examples

```python
import usaddress

# Basic parsing - get individual tokens with labels
tokens = usaddress.parse("1600 Pennsylvania Avenue NW Washington DC 20500")
for token, label in tokens:
    print(f"{token}: {label}")

# Tagging - get consolidated address components
address, addr_type = usaddress.tag("1600 Pennsylvania Avenue NW Washington DC 20500")
print(f"Address type: {addr_type}")
print(f"Street number: {address.get('AddressNumber', 'N/A')}")
print(f"Street name: {address.get('StreetName', 'N/A')}")

# Custom label mapping
mapping = {'StreetName': 'Street', 'AddressNumber': 'Number'}
address, addr_type = usaddress.tag("123 Main St", tag_mapping=mapping)
print(address)  # Uses custom labels
```
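
To make the merging behavior described for `tag()` more concrete, the sketch below re-creates the idea in plain Python: adjacent tokens with the same label are joined, trailing separators are stripped, and a label that reappears after a different label is treated as an error. The function name `merge_labeled_tokens` and the use of `ValueError` are illustrative assumptions, not part of usaddress, and the input pairs are hand-written rather than real model output.

```python
def merge_labeled_tokens(pairs):
    """Merge adjacent (token, label) pairs that share a label into one dict entry."""
    merged = {}
    last_label = None
    for token, label in pairs:
        if label == last_label:
            # Same label as the previous token: extend the current component.
            merged[label].append(token)
        elif label not in merged:
            # First time this label appears.
            merged[label] = [token]
        else:
            # Label seen earlier but not adjacent: it cannot be merged safely.
            raise ValueError(f"repeated label: {label}")
        last_label = label
    # Join the tokens of each component and strip separators such as commas.
    return {label: " ".join(tokens).strip(" ,;") for label, tokens in merged.items()}


# Hand-written pairs in the shape returned by parse() (illustrative only).
pairs = [
    ("350", "AddressNumber"),
    ("Fifth", "StreetName"),
    ("Avenue", "StreetNamePostType"),
    ("New", "PlaceName"),
    ("York,", "PlaceName"),
    ("NY", "StateName"),
]
result = merge_labeled_tokens(pairs)
print(result["PlaceName"])  # New York
```

The error branch is why `parse()` remains useful as a fallback: it keeps every token even when a dict keyed by label cannot represent the result.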

### Address Tokenization

Low-level tokenization functionality for splitting addresses into unlabeled tokens.

```python { .api }
def tokenize(address_string: str) -> list[str]:
    """
    Split each component of an address into a list of unlabeled tokens.

    Args:
        address_string (str): The address to tokenize

    Returns:
        list[str]: The tokenized address components
    """
```

#### Usage Examples

```python
import usaddress

# Tokenize without labeling
tokens = usaddress.tokenize("123 Main St. Apt 4B")
print(tokens)  # ['123', 'Main', 'St.', 'Apt', '4B']
```
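
As a rough illustration of why punctuation stays attached to tokens such as `St.`, the sketch below uses a deliberately simplified regular expression. The pattern and the name `simple_tokenize` are assumptions made for this example; the tokenizer in usaddress may handle additional cases that this pattern does not.

```python
import re

# Simplified pattern (an assumption for illustration): a run of characters that
# are not whitespace, commas, or semicolons, followed by any trailing punctuation.
TOKEN_PATTERN = re.compile(r"[^\s,;]+[.,;]*")


def simple_tokenize(address_string):
    """Split an address on whitespace while keeping trailing punctuation attached."""
    return TOKEN_PATTERN.findall(address_string)


print(simple_tokenize("123 Main St. Apt 4B"))  # ['123', 'Main', 'St.', 'Apt', '4B']
print(simple_tokenize("Chicago, IL"))          # ['Chicago,', 'IL']
```

Keeping the comma on `Chicago,` matches the `parse()` output shown in Basic Usage, where the separator is only stripped later by `tag()`.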

### Feature Extraction

Functions for extracting machine learning features from address tokens.

```python { .api }
def tokenFeatures(token: str) -> Feature:
    """
    Return a Feature dict with attributes that describe a token.

    Args:
        token (str): The token to analyze

    Returns:
        Feature: Dict with attributes describing the token
            (abbrev, digits, word, trailing.zeros, length, endsinpunc,
            directional, street_name, has.vowels)
    """

def tokens2features(address: list[str]) -> list[Feature]:
    """
    Turn every token into a Feature dict, and return a list of each token as a Feature.
    Each attribute in a Feature describes the corresponding token.

    Args:
        address (list[str]): The address as a list of tokens

    Returns:
        list[Feature]: A list of all tokens with feature details and context
    """
```

#### Usage Examples

```python
import usaddress

# Extract features for a single token
features = usaddress.tokenFeatures("123")
print(features['digits'])  # 'all_digits'
print(features['length'])  # 'd:3'

# Extract features for all tokens with context
tokens = ["123", "Main", "St."]
features_list = usaddress.tokens2features(tokens)
print(features_list[0]['next']['word'])        # 'main'
print(features_list[1]['previous']['digits'])  # 'all_digits'
```
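
Because a Feature can nest `previous` and `next` dicts, flattening one into a single level can make it easier to inspect or log. The helper `flatten_feature` below is hypothetical, and the sample dict is hand-written with only a subset of the keys listed above. A colon is used as the separator because some feature names, such as `trailing.zeros`, already contain dots.

```python
def flatten_feature(feature, prefix=""):
    """Flatten a nested Feature dict into a single-level dict with prefixed keys."""
    flat = {}
    for key, value in feature.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Nested context ('previous' / 'next'): recurse with a longer prefix.
            flat.update(flatten_feature(value, prefix=f"{name}:"))
        else:
            flat[name] = value
    return flat


# Hand-written sample in the shape of one tokens2features() entry (illustrative only).
sample = {
    "digits": "all_digits",
    "length": "d:3",
    "next": {"word": "main", "digits": "no_digits"},
}
flat = flatten_feature(sample)
print(flat["next:word"])  # main
```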

### Utility Functions

Helper functions for analyzing token characteristics.

```python { .api }
def digits(token: str) -> typing.Literal["all_digits", "some_digits", "no_digits"]:
    """
    Identify whether the token string is all digits, has some digits, or has no digits.

    Args:
        token (str): The token to parse

    Returns:
        str: Label denoting digit presence ('all_digits', 'some_digits', 'no_digits')
    """

def trailingZeros(token: str) -> str:
    """
    Return any trailing zeros found at the end of a token.
    If none are found, then return an empty string.

    Args:
        token (str): The token to search for zeros

    Returns:
        str: The trailing zeros found, if any. Otherwise, an empty string.
    """
```

#### Usage Examples

```python
import usaddress

# Analyze digit content
print(usaddress.digits("123"))   # 'all_digits'
print(usaddress.digits("12th"))  # 'some_digits'
print(usaddress.digits("Main"))  # 'no_digits'

# Find trailing zeros
print(usaddress.trailingZeros("1200"))  # '00'
print(usaddress.trailingZeros("123"))   # ''
```

### Address Component Labels

A constant defining the complete set of address component labels used by the parser.

```python { .api }
LABELS: list[str]
```

The complete list of 26 address component labels, based on the United States Thoroughfare, Landmark, and Postal Address Data Standard:

- **Address Number Components**: AddressNumberPrefix, AddressNumber, AddressNumberSuffix
- **Street Name Components**: StreetNamePreModifier, StreetNamePreDirectional, StreetNamePreType, StreetName, StreetNamePostType, StreetNamePostDirectional
- **Subaddress Components**: SubaddressType, SubaddressIdentifier, BuildingName
- **Occupancy Components**: OccupancyType, OccupancyIdentifier
- **Intersection Components**: CornerOf, IntersectionSeparator
- **Location Components**: LandmarkName, PlaceName, StateName, ZipCode
- **USPS Box Components**: USPSBoxType, USPSBoxID, USPSBoxGroupType, USPSBoxGroupID
- **Other Components**: Recipient, NotAddress
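
The label groups above lend themselves to building a `tag_mapping` that collapses fine-grained labels into coarser fields. The sketch below shows one way to construct such a dict. The field names (`address1`, `address2`, `city`, `state`, `zip_code`) are hypothetical choices for this example; only the label names come from the list above.

```python
# Coarse output fields (hypothetical names) and the usaddress labels they absorb.
FIELD_GROUPS = {
    "address1": [
        "AddressNumberPrefix", "AddressNumber", "AddressNumberSuffix",
        "StreetNamePreModifier", "StreetNamePreDirectional", "StreetNamePreType",
        "StreetName", "StreetNamePostType", "StreetNamePostDirectional",
    ],
    "address2": [
        "SubaddressType", "SubaddressIdentifier",
        "OccupancyType", "OccupancyIdentifier",
    ],
    "city": ["PlaceName"],
    "state": ["StateName"],
    "zip_code": ["ZipCode"],
}

# Invert the grouping into the label -> field dict shape used for tag_mapping.
TAG_MAPPING = {
    label: field
    for field, labels in FIELD_GROUPS.items()
    for label in labels
}

print(TAG_MAPPING["StreetName"])  # address1
```

The resulting dict could then be passed as `usaddress.tag(addr, tag_mapping=TAG_MAPPING)`. Since `tag()` merges only consecutive tokens that share a label, grouping adjacent components this way is what lets a whole street line end up under a single key.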

### Reference Data

Built-in reference data for address parsing and feature extraction.

```python { .api }
DIRECTIONS: set[str]
STREET_NAMES: set[str]
PARENT_LABEL: str
GROUP_LABEL: str
```

#### Usage Examples

```python
import usaddress

token = "NW"  # example token; substitute any address token

# Check if token is a direction
if token.lower() in usaddress.DIRECTIONS:
    print("This is a directional")

# Check if token is a street type
if token.lower() in usaddress.STREET_NAMES:
    print("This is a street type")

# Access all available labels
print(f"Total labels: {len(usaddress.LABELS)}")
for label in usaddress.LABELS:
    print(label)
```

## Types

```python { .api }
Feature = dict[str, typing.Union[str, bool, "Feature"]]

class RepeatedLabelError(probableparsing.RepeatedLabelError):
    """
    Exception raised when tag() encounters repeated labels that cannot be merged.

    Attributes:
        REPO_URL (str): "https://github.com/datamade/usaddress/issues/new"
        DOCS_URL (str): "https://usaddress.readthedocs.io/"
    """
```

## Error Handling

The `tag()` function raises a `RepeatedLabelError` when the same label is assigned to non-adjacent parts of an address, so the tokens cannot be merged into a single value. This typically indicates one of the following:

1. The input string is not a valid address
2. Some tokens were labeled incorrectly by the model

```python
import usaddress

try:
    address, addr_type = usaddress.tag("123 Main St 456 Oak Ave")
except usaddress.RepeatedLabelError as e:
    print(f"Ambiguous address: {e}")
    # Fall back to parse() for detailed token analysis
    tokens = usaddress.parse("123 Main St 456 Oak Ave")
```
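
The fallback pattern above can be wrapped in a small helper. In this sketch the tagging function, parsing function, and exception type are passed in as parameters so the logic can be exercised without the model installed. The helper name `tag_with_fallback` and the stand-in functions are hypothetical; with the library available you would pass `usaddress.tag`, `usaddress.parse`, and `usaddress.RepeatedLabelError`.

```python
def tag_with_fallback(address, tag_func, parse_func, error_type):
    """Try tag_func first; if it raises error_type, return parse_func output instead.

    Returns a (result, used_fallback) tuple so callers can tell the two shapes apart.
    """
    try:
        return tag_func(address), False
    except error_type:
        # Tagging could not build a dict; keep the token-level labels instead.
        return parse_func(address), True


# Stand-ins for illustration only; they are not part of usaddress.
class FakeRepeatedLabelError(Exception):
    pass


def fake_tag(address):
    raise FakeRepeatedLabelError(address)


def fake_parse(address):
    return [(token, "Unknown") for token in address.split()]


result, used_fallback = tag_with_fallback(
    "123 Main St 456 Oak Ave", fake_tag, fake_parse, FakeRepeatedLabelError
)
print(used_fallback)  # True
```

Returning the flag alongside the result is a design choice for this sketch: the two code paths produce different shapes (a tuple from tagging, a list of pairs from parsing), and the flag avoids type checks at the call site.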

## Address Types

The `tag()` function returns one of four address types:

- **"Street Address"**: Standard street address with AddressNumber
- **"Intersection"**: Street intersection without AddressNumber
- **"PO Box"**: Postal box address with USPSBoxID
- **"Ambiguous"**: Cannot be classified into other categories
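
The rules above can be read as a simple decision order over the labels present in a parsed address. The sketch below encodes that reading in plain Python. It assumes an intersection is signaled by the `IntersectionSeparator` label and that the checks run in the order listed. The function name `classify_address_type` is hypothetical, and this is not presented as the library's own implementation.

```python
def classify_address_type(labels):
    """Return an address type string from the labels found in a parsed address."""
    has_number = "AddressNumber" in labels
    # Assumption: an intersection is signaled by the IntersectionSeparator label.
    is_intersection = "IntersectionSeparator" in labels
    if has_number and not is_intersection:
        return "Street Address"
    if is_intersection and not has_number:
        return "Intersection"
    if "USPSBoxID" in labels:
        return "PO Box"
    return "Ambiguous"


print(classify_address_type(["AddressNumber", "StreetName"]))  # Street Address
print(classify_address_type(["USPSBoxType", "USPSBoxID"]))     # PO Box
print(classify_address_type(["PlaceName", "StateName"]))       # Ambiguous
```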