# Site Management

Site configuration and data management for loading, filtering, and organizing information about supported social networks and platforms. The site management system handles the database of over 400 supported sites with their detection methods and metadata.

## Capabilities

### Individual Site Information

Container class that holds comprehensive information about a single social media platform or website, including detection methods and testing data.

```python { .api }
import secrets  # Needed for the username_unclaimed default below


class SiteInformation:
    """
    Information about a specific website/platform.

    Contains all data needed to check for username existence on a particular site.
    """

    def __init__(
        self,
        name: str,
        url_home: str,
        url_username_format: str,
        username_claimed: str,
        information: dict,
        is_nsfw: bool,
        username_unclaimed: str = secrets.token_urlsafe(10)
    ):
        """
        Create Site Information Object.

        Args:
            name: String which identifies the site
            url_home: String containing URL for home page of site
            url_username_format: String containing URL format for usernames on site
                (should contain "{}" placeholder for username substitution)
                Example: "https://somesite.com/users/{}"
            username_claimed: String containing a username known to be claimed on the site
            information: Dictionary containing site-specific detection information
                (includes custom detection methods and parameters)
            is_nsfw: Boolean indicating whether the site is Not Safe For Work
            username_unclaimed: String containing a username known to be unclaimed
                (defaults to secrets.token_urlsafe(10) if not provided)
        """

    name: str                 # Site identifier name
    url_home: str             # Homepage URL
    url_username_format: str  # URL template with {} placeholder
    username_claimed: str     # Known claimed username for testing
    username_unclaimed: str   # Known unclaimed username for testing
    information: dict         # Site-specific detection configuration
    is_nsfw: bool             # Not Safe For Work flag

    def __str__(self) -> str:
        """
        String representation showing site name and homepage.

        Returns:
            Formatted string with site name and homepage URL
        """
```

### Site Collection Management

Manager class that loads and organizes information about all supported sites, with filtering and querying capabilities.

```python { .api }
class SitesInformation:
    """
    Container for information about all supported sites.

    Manages the collection of site data and provides filtering and access methods.
    """

    def __init__(self, data_file_path: str = None):
        """
        Create Sites Information Object.

        Loads site data from a JSON file or URL. If no path is specified, the
        live data file from the GitHub repository is used, which is the most
        up-to-date source.

        Args:
            data_file_path: Path to JSON data file. Supports:
                - Absolute file path: "/path/to/data.json"
                - Relative file path: "data.json"
                - URL: "https://example.com/data.json"
                - None (default): Uses live GitHub data

        Raises:
            FileNotFoundError: If the data file cannot be accessed
            ValueError: If the JSON data cannot be parsed
        """

    sites: dict  # Dictionary mapping site names to SiteInformation objects

    def remove_nsfw_sites(self, do_not_remove: list = []):
        """
        Remove NSFW (Not Safe For Work) sites from the collection.

        Filters out sites marked with the isNSFW flag, with optional exceptions.

        Args:
            do_not_remove: List of site names to keep even if marked NSFW
                (case-insensitive matching)
        """

    def site_name_list(self) -> list:
        """
        Get a sorted list of all site names.

        Returns:
            List of strings containing site names, sorted alphabetically
            (case-insensitive)
        """

    def __iter__(self):
        """
        Iterate over SiteInformation objects.

        Yields:
            SiteInformation objects for each site in the collection
        """

    def __len__(self) -> int:
        """
        Get the number of sites in the collection.

        Returns:
            Integer count of sites
        """
```

## Site Data Structure

### JSON Configuration Format

Site data is stored in JSON format with the following structure for each site:

```json
{
  "SiteName": {
    "urlMain": "https://example.com/",
    "url": "https://example.com/user/{}",
    "username_claimed": "known_user",
    "errorType": "status_code",
    "isNSFW": false,
    "headers": {
      "User-Agent": "custom-agent"
    }
  }
}
```
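
The `"{}"` placeholder in the `url` field receives the username being checked. As a quick sketch of that mapping, assuming an entry that mirrors the structure above (the site values and the username `alice` are made up for illustration):

```python
import json

# A minimal entry mirroring the JSON structure above (illustrative values).
raw = """
{
  "SiteName": {
    "urlMain": "https://example.com/",
    "url": "https://example.com/user/{}",
    "username_claimed": "known_user",
    "errorType": "status_code",
    "isNSFW": false
  }
}
"""

data = json.loads(raw)
entry = data["SiteName"]

# The "{}" placeholder in "url" receives the username being checked.
profile_url = entry["url"].format("alice")
print(profile_url)  # https://example.com/user/alice
```
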

### Detection Methods

Sites use various methods to detect whether a username exists:

- **status_code**: HTTP status codes (e.g., 200 = exists, 404 = not found)
- **message**: Text content analysis of the response body
- **response_url**: URL redirection patterns

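A rough sketch of how these `errorType` values could drive a claimed/unclaimed decision; the `detect` function, its parameters, and the default error message here are illustrative placeholders, not the actual sherlock implementation:

```python
def detect(error_type: str, status_code: int, body: str,
           requested_url: str, final_url: str,
           error_msg: str = "Not Found") -> bool:
    """Return True if the username appears to be claimed (illustrative logic)."""
    if error_type == "status_code":
        # Any 2xx response means the profile page exists.
        return 200 <= status_code < 300
    if error_type == "message":
        # The site's error text appearing in the body means the username is free.
        return error_msg not in body
    if error_type == "response_url":
        # Being redirected away from the profile URL means the username is free.
        return final_url == requested_url
    raise ValueError(f"Unknown errorType: {error_type}")

print(detect("status_code", 200, "", "", ""))                               # True
print(detect("message", 200, "Error: Not Found", "", ""))                   # False
print(detect("response_url", 301, "", "https://s/u/a", "https://s/login"))  # False
```

Each site's `errorType` (plus any method-specific fields in its JSON entry) selects one of these branches.
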
## Usage Examples

### Load Default Site Data

```python
from sherlock_project.sites import SitesInformation

# Load the default site data from GitHub (most up-to-date)
sites = SitesInformation()

print(f"Loaded {len(sites)} sites")
print(f"Available sites: {sites.site_name_list()[:10]}")  # First 10 sites
```

### Load Custom Site Data

```python
# Load from a local file
sites = SitesInformation("custom_sites.json")

# Load from a URL
sites = SitesInformation("https://example.com/my_sites.json")

# Load from an absolute path
sites = SitesInformation("/path/to/sites.json")
```

### Filter NSFW Sites

```python
# Remove all NSFW sites
sites = SitesInformation()
print(f"Before filtering: {len(sites)} sites")

sites.remove_nsfw_sites()
print(f"After filtering: {len(sites)} sites")

# Keep specific NSFW sites while removing the rest
sites = SitesInformation()
sites.remove_nsfw_sites(do_not_remove=["Reddit", "Tumblr"])
```

### Explore Site Information

```python
# Iterate through all sites
for site in sites:
    print(f"Site: {site.name}")
    print(f"  Homepage: {site.url_home}")
    print(f"  URL Format: {site.url_username_format}")
    print(f"  NSFW: {site.is_nsfw}")
    print(f"  Test User: {site.username_claimed}")
    print()

# Access a specific site
github_site = sites.sites["GitHub"]
print(f"GitHub URL format: {github_site.url_username_format}")
print(f"GitHub test user: {github_site.username_claimed}")

# Check whether a site exists before accessing it
if "Twitter" in sites.sites:
    twitter_site = sites.sites["Twitter"]
    print(f"Twitter detection method: {twitter_site.information.get('errorType')}")
```

### Create Site Subsets

```python
# Create a subset of specific sites
social_media_sites = {
    name: sites.sites[name] for name in sites.sites
    if name in ["GitHub", "Twitter", "Instagram", "Facebook", "LinkedIn"]
}

print(f"Social media subset: {len(social_media_sites)} sites")

# Create a subset by category
tech_sites = {}
for name, site in sites.sites.items():
    if any(keyword in site.url_home.lower() for keyword in ["github", "gitlab", "stackoverflow", "dev"]):
        tech_sites[name] = site

print(f"Tech-related sites: {len(tech_sites)} sites")
```

### Analyze Site Configuration

```python
# Tally detection methods across the collection
detection_methods = {}
nsfw_count = 0

for site in sites:
    method = site.information.get('errorType', 'unknown')
    detection_methods[method] = detection_methods.get(method, 0) + 1

    if site.is_nsfw:
        nsfw_count += 1

print("Detection method distribution:")
for method, count in detection_methods.items():
    print(f"  {method}: {count} sites")

print(f"\nNSFW sites: {nsfw_count}/{len(sites)}")
```

### Custom Site Configuration

```python
# Create custom site information
custom_site = SiteInformation(
    name="CustomSite",
    url_home="https://customsite.com/",
    url_username_format="https://customsite.com/profile/{}",
    username_claimed="testuser",
    information={
        "errorType": "status_code",
        "headers": {
            "User-Agent": "Custom-Bot/1.0"
        }
    },
    is_nsfw=False
)

# Add it to an existing site collection
sites.sites["CustomSite"] = custom_site

print(f"Added custom site. Total sites: {len(sites)}")
```

### Site Data Export and Import

```python
import json

# Export the current site configuration
site_data = {}
for name, site in sites.sites.items():
    site_data[name] = {
        "urlMain": site.url_home,
        "url": site.url_username_format,
        "username_claimed": site.username_claimed,
        "isNSFW": site.is_nsfw,
        **site.information  # Include all detection-specific data
    }

# Save to a file
with open("exported_sites.json", "w") as f:
    json.dump(site_data, f, indent=2)

# Load and verify
test_sites = SitesInformation("exported_sites.json")
print(f"Exported and reloaded {len(test_sites)} sites")
```

### Site Performance Analysis

```python
from sherlock_project.sherlock import sherlock
from sherlock_project.notify import QueryNotifyPrint
import statistics

# Test a small subset for performance analysis
test_sites = {name: sites.sites[name] for name in list(sites.sites.keys())[:20]}

notify = QueryNotifyPrint(verbose=True)
results = sherlock("testuser", test_sites, notify)

# Analyze response times
response_times = []
for site_name, result_data in results.items():
    result = result_data['status']
    if result.query_time:
        response_times.append(result.query_time)

if response_times:
    print("\nPerformance Analysis:")
    print(f"  Average response time: {statistics.mean(response_times):.3f}s")
    print(f"  Median response time: {statistics.median(response_times):.3f}s")
    print(f"  Fastest site: {min(response_times):.3f}s")
    print(f"  Slowest site: {max(response_times):.3f}s")
```