# Googlesearch Python

A Python library for scraping Google search results. It uses requests for HTTP communication and BeautifulSoup4 for HTML parsing to extract result titles, URLs, and descriptions from Google's search pages.

## Package Information

- **Package Name**: googlesearch-python
- **Package Type**: Library
- **Language**: Python
- **Installation**: `pip install googlesearch-python`

## Core Imports

```python
from googlesearch import search
```

For advanced search results with structured data:

```python
from googlesearch import search, SearchResult
```

For user agent utilities:

```python
from googlesearch import get_useragent
```

## Basic Usage

```python
from googlesearch import search

# Simple search - returns URLs only
for url in search("Python programming", num_results=10):
    print(url)

# Advanced search - returns SearchResult objects with structured data
for result in search("Python programming", num_results=10, advanced=True):
    print(f"Title: {result.title}")
    print(f"URL: {result.url}")
    print(f"Description: {result.description}")
    print("---")

# Search with language and region settings
for url in search("Python programming", lang="en", region="us", num_results=5):
    print(url)
```
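
Results can repeat across result pages; the documented `unique` flag filters duplicate URLs. A minimal sketch (the query string is illustrative):

```python
from googlesearch import search

# Fetch a larger result set and let the library drop duplicate URLs
# (unique=True is the documented de-duplication flag).
urls = list(search("open source licensing", num_results=20, unique=True))
print(f"Fetched {len(urls)} unique URLs")
```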
## Capabilities

### Google Search

Performs Google search queries with extensive customization options including result count control, language and region specification, proxy support, and safe search toggles.

```python { .api }
def search(
    term: str,
    num_results: int = 10,
    lang: str = "en",
    proxy: str = None,
    advanced: bool = False,
    sleep_interval: int = 0,
    timeout: int = 5,
    safe: str = "active",
    ssl_verify: bool = None,
    region: str = None,
    start_num: int = 0,
    unique: bool = False
):
    """
    Search the Google search engine and yield results.

    Parameters:
    - term: Search query string
    - num_results: Number of results to return (default: 10)
    - lang: Language code for search results (default: "en")
    - proxy: HTTP/HTTPS proxy URL (optional)
    - advanced: Return SearchResult objects instead of URLs (default: False)
    - sleep_interval: Sleep time between requests in seconds (default: 0)
    - timeout: Request timeout in seconds (default: 5)
    - safe: Safe search setting - "active" or None (default: "active")
    - ssl_verify: SSL certificate verification (optional)
    - region: Country code for region-specific results (optional)
    - start_num: Starting result number for pagination (default: 0)
    - unique: Filter duplicate URLs (default: False)

    Yields:
    - str: URLs when advanced=False
    - SearchResult: Result objects when advanced=True

    Examples:
        Basic search returning URLs:
        >>> for url in search("machine learning", num_results=5):
        ...     print(url)

        Advanced search with structured results:
        >>> for result in search("AI research", advanced=True, num_results=3):
        ...     print(f"{result.title}: {result.url}")

        Search with language and region:
        >>> for url in search("café", lang="fr", region="fr", num_results=5):
        ...     print(url)

        Search with proxy and SSL settings:
        >>> proxy_url = "http://proxy.example.com:8080"
        >>> for url in search("secure search", proxy=proxy_url, ssl_verify=False):
        ...     print(url)

        Paginated search with rate limiting:
        >>> for url in search("large dataset", num_results=200, sleep_interval=2):
        ...     print(url)
    """
```
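
The `start_num` parameter offsets where results begin, which allows page-by-page fetching. A sketch of that pattern, assuming `start_num` maps directly onto Google's result offset as documented above (query and page size are illustrative):

```python
from googlesearch import search

page_size = 10
for page in range(3):
    # Offset each request by one full page and pause between requests
    # to reduce the chance of rate limiting.
    for url in search(
        "data engineering",
        num_results=page_size,
        start_num=page * page_size,
        sleep_interval=2,
    ):
        print(url)
```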

### User Agent Generation

Generates random user agent strings for HTTP requests to improve request diversity and reduce detection.

```python { .api }
def get_useragent() -> str:
    """
    Generate a random user agent string mimicking the Lynx browser format.

    The user agent string components:
    - Lynx version: Lynx/x.y.z where x is 2-3, y is 8-9, and z is 0-2
    - libwww version: libwww-FM/x.y where x is 2-3 and y is 13-15
    - SSL-MM version: SSL-MM/x.y where x is 1-2 and y is 3-5
    - OpenSSL version: OpenSSL/x.y.z where x is 1-3, y is 0-4, and z is 0-9

    Returns:
        str: A randomly generated user agent string in the format:
        "Lynx/x.y.z libwww-FM/x.y SSL-MM/x.y OpenSSL/x.y.z"

    Examples:
        >>> agent = get_useragent()
        >>> print(agent)
        "Lynx/2.8.1 libwww-FM/2.14 SSL-MM/1.4 OpenSSL/1.2.7"
    """
```
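
The helper can also be reused for your own HTTP calls. A minimal sketch; the target URL is a placeholder and not part of the library:

```python
import requests

from googlesearch import get_useragent

# Attach a randomly generated Lynx-style user agent to a requests session.
session = requests.Session()
session.headers["User-Agent"] = get_useragent()

# httpbin.org echoes the user agent back, which makes it a handy check.
resp = session.get("https://httpbin.org/user-agent", timeout=5)
print(resp.json())
```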

## Types

```python { .api }
class SearchResult:
    """
    Data structure for advanced search results. Each instance holds the
    structured information for one result: URL, title, and description.
    """

    def __init__(self, url: str, title: str, description: str):
        """
        Initialize a SearchResult object.

        Parameters:
        - url: The result URL
        - title: The result title
        - description: The result description/snippet
        """
        self.url = url
        self.title = title
        self.description = description

    def __repr__(self) -> str:
        """
        Return the string representation of the SearchResult.

        Returns:
            str: String representation in the format:
            "SearchResult(url={url}, title={title}, description={description})"
        """
```
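
Since `SearchResult` exposes plain `url`, `title`, and `description` attributes, advanced results are straightforward to serialize. A sketch (the query is illustrative):

```python
import json

from googlesearch import search

# Collect advanced results and dump them as JSON using the documented
# url/title/description attributes.
results = [
    {"url": r.url, "title": r.title, "description": r.description}
    for r in search("Python web scraping", num_results=5, advanced=True)
]
print(json.dumps(results, indent=2, ensure_ascii=False))
```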

## Configuration Options

### Language Codes

Use standard language codes like "en" (English), "fr" (French), "de" (German), "es" (Spanish), "ja" (Japanese), etc.

### Region Codes

Use [Country Codes](https://developers.google.com/custom-search/docs/json_api_reference#countryCodes) like "us" (United States), "uk" (United Kingdom), "ca" (Canada), "au" (Australia), etc.

### Safe Search Options

- `"active"`: Enable safe search filtering (default)
- `None`: Disable safe search filtering
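
For example, to turn the filter off (the query is illustrative):

```python
from googlesearch import search

# safe=None disables Google's safe search filtering for this query.
for url in search("forensic pathology research", safe=None, num_results=5):
    print(url)
```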

### Proxy Configuration

Supports both HTTP and HTTPS proxies:

```python
# HTTP proxy
proxy = "http://proxy.example.com:8080"

# HTTPS proxy
proxy = "https://proxy.example.com:8080"

# Proxy with authentication
proxy = "http://user:pass@proxy.example.com:8080"
```
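
The proxy URL is passed straight to `search()` via the `proxy` parameter. A sketch; `proxy.example.com` is a placeholder for your own proxy, and `ssl_verify=False` is only needed when the proxy intercepts TLS:

```python
from googlesearch import search

proxy = "http://user:pass@proxy.example.com:8080"

# Route the search through the proxy; disable certificate verification
# only if the proxy re-signs TLS traffic.
for url in search("proxied query", proxy=proxy, ssl_verify=False, num_results=5):
    print(url)
```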

## Error Handling

The library may raise the following exceptions:

- **requests.exceptions.RequestException**: Network-related errors (timeouts, connection errors)
- **requests.exceptions.HTTPError**: HTTP status errors from Google (via `resp.raise_for_status()`)
- **requests.exceptions.Timeout**: Raised when the `timeout` parameter is exceeded
- **requests.exceptions.ConnectionError**: Connection-related network errors
- **bs4.FeatureNotFound**: Raised by BeautifulSoup when the requested HTML parser is not installed
- **ValueError**: Invalid parameter values or parsing errors
- **AttributeError**: Raised during HTML parsing when expected elements are missing

Example error handling:

```python
import requests
from googlesearch import search

try:
    results = list(search("example query", num_results=10, timeout=10))
# Timeout is a subclass of RequestException, so catch it first.
except requests.exceptions.Timeout:
    print("Request timed out")
except requests.exceptions.RequestException as e:
    print(f"Network error: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")
```
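
For transient network failures, a retry wrapper with exponential backoff can be layered on top. A hypothetical helper, not part of the library:

```python
import time

import requests
from googlesearch import search

def search_with_retry(term, retries=3, **kwargs):
    """Hypothetical helper: retry a search with exponential backoff."""
    for attempt in range(retries):
        try:
            return list(search(term, **kwargs))
        except requests.exceptions.RequestException:
            if attempt == retries - 1:
                raise  # out of attempts, propagate the last error
            time.sleep(2 ** attempt)  # back off: 1s, then 2s, ...

urls = search_with_retry("resilient query", num_results=10)
print(len(urls))
```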

## Rate Limiting Best Practices

To avoid being blocked by Google:

1. **Use sleep intervals**: Set `sleep_interval` to 1-5 seconds for large result sets
2. **Limit concurrent requests**: Don't run multiple searches simultaneously
3. **Use reasonable result counts**: Avoid requesting excessive numbers of results
4. **Rotate user agents**: The library automatically uses random user agents
5. **Consider using proxies**: For high-volume usage, rotate through different proxies (see the rotation sketch below)

Example with rate limiting:

```python
from googlesearch import search

# Good practice for large result sets
for url in search("large query", num_results=100, sleep_interval=2):
    print(url)
```
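
For point 5 above, proxies can be rotated across queries. A sketch; the proxy URLs are placeholders for your own pool:

```python
import itertools

from googlesearch import search

# Cycle through a pool of proxies, using one per query.
proxies = itertools.cycle([
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
])

for query in ["topic one", "topic two", "topic three"]:
    for url in search(query, proxy=next(proxies), num_results=10, sleep_interval=2):
        print(url)
```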

## Dependencies

The package requires:

- **beautifulsoup4 >= 4.9**: HTML parsing for search result extraction
- **requests >= 2.20**: HTTP client for Google search requests