# Site Management

Site configuration and data management for loading, filtering, and organizing information about supported social networks and platforms. The site management system handles the database of over 400 supported sites with their detection methods and metadata.

## Capabilities

### Individual Site Information

Container class that holds comprehensive information about a single social media platform or website, including detection methods and testing data.

```python { .api }
import secrets  # Needed for the username_unclaimed default below


class SiteInformation:
    """
    Information about a specific website/platform.

    Contains all data needed to check for username existence on a particular site.
    """

    def __init__(
        self,
        name: str,
        url_home: str,
        url_username_format: str,
        username_claimed: str,
        information: dict,
        is_nsfw: bool,
        username_unclaimed: str = secrets.token_urlsafe(10)
    ):
        """
        Create Site Information Object.

        Args:
            name: String which identifies the site
            url_home: String containing URL for home page of site
            url_username_format: String containing URL format for usernames on site
                (should contain "{}" placeholder for username substitution)
                Example: "https://somesite.com/users/{}"
            username_claimed: String containing a username known to be claimed on the site
            information: Dictionary containing site-specific detection information
                (includes custom detection methods and parameters)
            is_nsfw: Boolean indicating whether the site is Not Safe For Work
            username_unclaimed: String containing a username known to be unclaimed
                (defaults to secrets.token_urlsafe(10) if not provided)
        """

    name: str                 # Site identifier name
    url_home: str             # Homepage URL
    url_username_format: str  # URL template with {} placeholder
    username_claimed: str     # Known claimed username for testing
    username_unclaimed: str   # Known unclaimed username for testing
    information: dict         # Site-specific detection configuration
    is_nsfw: bool             # Not Safe For Work flag

    def __str__(self) -> str:
        """
        String representation showing site name and homepage.

        Returns:
            Formatted string with site name and homepage URL
        """
```

### Site Collection Management

Manager class that loads and organizes information about all supported sites, with filtering and querying capabilities.

```python { .api }
class SitesInformation:
    """
    Container for information about all supported sites.

    Manages the collection of site data and provides filtering and access methods.
    """

    def __init__(self, data_file_path: str = None):
        """
        Create Sites Information Object.

        Loads site data from a JSON file or URL. If no path is specified, the
        live data file from the GitHub repository is used, which is the most
        up-to-date source.

        Args:
            data_file_path: Path to JSON data file. Supports:
                - Absolute file path: "/path/to/data.json"
                - Relative file path: "data.json"
                - URL: "https://example.com/data.json"
                - None (default): Uses live GitHub data

        Raises:
            FileNotFoundError: If the data file cannot be accessed
            ValueError: If the JSON data cannot be parsed
        """

    sites: dict  # Dictionary mapping site names to SiteInformation objects

    def remove_nsfw_sites(self, do_not_remove: list = []):
        """
        Remove NSFW (Not Safe For Work) sites from the collection.

        Filters out sites marked with the isNSFW flag, with optional exceptions.

        Args:
            do_not_remove: List of site names to keep even if marked NSFW
                (case-insensitive matching)
        """

    def site_name_list(self) -> list:
        """
        Get a sorted list of all site names.

        Returns:
            List of strings containing site names, sorted alphabetically
            (case-insensitive)
        """

    def __iter__(self):
        """
        Iterate over SiteInformation objects.

        Yields:
            SiteInformation objects for each site in the collection
        """

    def __len__(self) -> int:
        """
        Get the number of sites in the collection.

        Returns:
            Integer count of sites
        """
```

## Site Data Structure

### JSON Configuration Format

Site data is stored in JSON format with the following structure for each site:

```json
{
  "SiteName": {
    "urlMain": "https://example.com/",
    "url": "https://example.com/user/{}",
    "username_claimed": "known_user",
    "errorType": "status_code",
    "isNSFW": false,
    "headers": {
      "User-Agent": "custom-agent"
    }
  }
}
```
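
The `"{}"` placeholder in the `url` field receives the username being checked. As a quick sketch of that mapping, assuming an entry that mirrors the structure above (the site values and the username `alice` are made up for illustration):

```python
import json

# A minimal entry mirroring the JSON structure above (illustrative values).
raw = """
{
  "SiteName": {
    "urlMain": "https://example.com/",
    "url": "https://example.com/user/{}",
    "username_claimed": "known_user",
    "errorType": "status_code",
    "isNSFW": false
  }
}
"""

data = json.loads(raw)
entry = data["SiteName"]

# The "{}" placeholder in "url" receives the username being checked.
profile_url = entry["url"].format("alice")
print(profile_url)  # https://example.com/user/alice
```
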

### Detection Methods

Sites use various methods to detect whether a username exists:

- **status_code**: HTTP status codes (e.g., 200 = exists, 404 = not found)
- **message**: Text content analysis of the response body
- **response_url**: URL redirection patterns

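A rough sketch of how these `errorType` values could drive a claimed/unclaimed decision; the `detect` function, its parameters, and the default error message here are illustrative placeholders, not the actual sherlock implementation:

```python
def detect(error_type: str, status_code: int, body: str,
           requested_url: str, final_url: str,
           error_msg: str = "Not Found") -> bool:
    """Return True if the username appears to be claimed (illustrative logic)."""
    if error_type == "status_code":
        # Any 2xx response means the profile page exists.
        return 200 <= status_code < 300
    if error_type == "message":
        # The site's error text appearing in the body means the username is free.
        return error_msg not in body
    if error_type == "response_url":
        # Being redirected away from the profile URL means the username is free.
        return final_url == requested_url
    raise ValueError(f"Unknown errorType: {error_type}")

print(detect("status_code", 200, "", "", ""))                               # True
print(detect("message", 200, "Error: Not Found", "", ""))                   # False
print(detect("response_url", 301, "", "https://s/u/a", "https://s/login"))  # False
```

Each site's `errorType` (plus any method-specific fields in its JSON entry) selects one of these branches.
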
## Usage Examples

### Load Default Site Data

```python
from sherlock_project.sites import SitesInformation

# Load the default site data from GitHub (most up-to-date)
sites = SitesInformation()

print(f"Loaded {len(sites)} sites")
print(f"Available sites: {sites.site_name_list()[:10]}")  # First 10 sites
```

### Load Custom Site Data

```python
# Load from a local file
sites = SitesInformation("custom_sites.json")

# Load from a URL
sites = SitesInformation("https://example.com/my_sites.json")

# Load from an absolute path
sites = SitesInformation("/path/to/sites.json")
```

### Filter NSFW Sites

```python
# Remove all NSFW sites
sites = SitesInformation()
print(f"Before filtering: {len(sites)} sites")

sites.remove_nsfw_sites()
print(f"After filtering: {len(sites)} sites")

# Keep specific NSFW sites while removing the rest
sites = SitesInformation()
sites.remove_nsfw_sites(do_not_remove=["Reddit", "Tumblr"])
```

### Explore Site Information

```python
# Iterate through all sites
for site in sites:
    print(f"Site: {site.name}")
    print(f"  Homepage: {site.url_home}")
    print(f"  URL Format: {site.url_username_format}")
    print(f"  NSFW: {site.is_nsfw}")
    print(f"  Test User: {site.username_claimed}")
    print()

# Access a specific site
github_site = sites.sites["GitHub"]
print(f"GitHub URL format: {github_site.url_username_format}")
print(f"GitHub test user: {github_site.username_claimed}")

# Check whether a site exists before accessing it
if "Twitter" in sites.sites:
    twitter_site = sites.sites["Twitter"]
    print(f"Twitter detection method: {twitter_site.information.get('errorType')}")
```

### Create Site Subsets

```python
# Create a subset of specific sites
social_media_sites = {
    name: sites.sites[name] for name in sites.sites
    if name in ["GitHub", "Twitter", "Instagram", "Facebook", "LinkedIn"]
}

print(f"Social media subset: {len(social_media_sites)} sites")

# Create a subset by category
tech_sites = {}
for name, site in sites.sites.items():
    if any(keyword in site.url_home.lower() for keyword in ["github", "gitlab", "stackoverflow", "dev"]):
        tech_sites[name] = site

print(f"Tech-related sites: {len(tech_sites)} sites")
```

### Analyze Site Configuration

```python
# Tally detection methods across the collection
detection_methods = {}
nsfw_count = 0

for site in sites:
    method = site.information.get('errorType', 'unknown')
    detection_methods[method] = detection_methods.get(method, 0) + 1

    if site.is_nsfw:
        nsfw_count += 1

print("Detection method distribution:")
for method, count in detection_methods.items():
    print(f"  {method}: {count} sites")

print(f"\nNSFW sites: {nsfw_count}/{len(sites)}")
```

### Custom Site Configuration

```python
# Create custom site information
custom_site = SiteInformation(
    name="CustomSite",
    url_home="https://customsite.com/",
    url_username_format="https://customsite.com/profile/{}",
    username_claimed="testuser",
    information={
        "errorType": "status_code",
        "headers": {
            "User-Agent": "Custom-Bot/1.0"
        }
    },
    is_nsfw=False
)

# Add it to an existing site collection
sites.sites["CustomSite"] = custom_site

print(f"Added custom site. Total sites: {len(sites)}")
```

### Site Data Export and Import

```python
import json

# Export the current site configuration
site_data = {}
for name, site in sites.sites.items():
    site_data[name] = {
        "urlMain": site.url_home,
        "url": site.url_username_format,
        "username_claimed": site.username_claimed,
        "isNSFW": site.is_nsfw,
        **site.information  # Include all detection-specific data
    }

# Save to a file
with open("exported_sites.json", "w") as f:
    json.dump(site_data, f, indent=2)

# Load and verify
test_sites = SitesInformation("exported_sites.json")
print(f"Exported and reloaded {len(test_sites)} sites")
```

### Site Performance Analysis

```python
from sherlock_project.sherlock import sherlock
from sherlock_project.notify import QueryNotifyPrint
import statistics

# Test a small subset for performance analysis
test_sites = {name: sites.sites[name] for name in list(sites.sites.keys())[:20]}

notify = QueryNotifyPrint(verbose=True)
results = sherlock("testuser", test_sites, notify)

# Analyze response times
response_times = []
for site_name, result_data in results.items():
    result = result_data['status']
    if result.query_time:
        response_times.append(result.query_time)

if response_times:
    print("\nPerformance Analysis:")
    print(f"  Average response time: {statistics.mean(response_times):.3f}s")
    print(f"  Median response time: {statistics.median(response_times):.3f}s")
    print(f"  Fastest site: {min(response_times):.3f}s")
    print(f"  Slowest site: {max(response_times):.3f}s")
```