# HtmlUnit

HtmlUnit is a headless browser library for Java, used for automated testing and scraping of web applications. It is a pure Java browser implementation with HTML, CSS, and JavaScript support, and it handles form submission, cookie management, SSL certificates, and proxy configuration.

## Package Information

- **Package Name**: htmlunit
- **Package Type**: maven
- **Language**: Java
- **GroupId**: net.sourceforge.htmlunit
- **ArtifactId**: htmlunit
- **Installation**: Add to `pom.xml`: `<dependency><groupId>net.sourceforge.htmlunit</groupId><artifactId>htmlunit</artifactId><version>2.70.0</version></dependency>`

## Core Imports

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebRequest;
import com.gargoylesoftware.htmlunit.WebResponse;
```

## Basic Usage

```java
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlForm;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlSubmitInput;
import com.gargoylesoftware.htmlunit.html.HtmlTextInput;

// Create a web client instance; WebClient implements AutoCloseable
try (WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
    // Configure options
    webClient.getOptions().setJavaScriptEnabled(true);
    webClient.getOptions().setCssEnabled(false);

    // Load a web page (getPage throws IOException)
    HtmlPage page = webClient.getPage("http://example.com");

    // Find and interact with form elements;
    // getFormByName throws ElementNotFoundException if the form is missing
    HtmlForm form = page.getFormByName("loginForm");
    HtmlTextInput usernameField = form.getInputByName("username");
    usernameField.setValueAttribute("user123");

    // Locate the submit button with an XPath query relative to the form
    HtmlSubmitInput submitButton = form.getFirstByXPath(".//input[@type='submit']");
    HtmlPage resultPage = submitButton.click();

    // Extract page content (asNormalizedText() supersedes the deprecated asText())
    String pageTitle = resultPage.getTitleText();
    String pageText = resultPage.asNormalizedText();
}
```

## Architecture

HtmlUnit is built around several key components:

- **WebClient**: Main browser automation class that manages windows, connections, and global settings
- **Page Hierarchy**: Different page types (HtmlPage, TextPage, XmlPage) for different content types
- **DOM Tree**: Full DOM implementation with element manipulation and CSS selector support
- **JavaScript Engine**: Mozilla Rhino-based JavaScript execution with browser API simulation
- **HTTP Layer**: Configurable HTTP client with cookie management, authentication, and proxy support
- **Browser Simulation**: Accurate simulation of Chrome, Firefox, IE, and Edge browser behaviors
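
To make the page hierarchy concrete, the following is a minimal sketch, assuming HtmlUnit 2.70.0 is on the classpath. It serves the same client three canned responses and reports which page class was created for each content type. The class name `PageTypeSketch`, the URL, and the response bodies are hypothetical; `MockWebConnection` is used so that nothing touches the network.

```java
import com.gargoylesoftware.htmlunit.MockWebConnection;
import com.gargoylesoftware.htmlunit.Page;
import com.gargoylesoftware.htmlunit.WebClient;

import java.io.IOException;
import java.net.URL;

class PageTypeSketch {

    /** Serves the content with the given content type and returns the simple name of the resulting page class. */
    static String pageTypeFor(String content, String contentType) throws IOException {
        // Hypothetical URL, answered locally by the mock connection.
        URL url = new URL("http://localhost/resource");
        try (WebClient webClient = new WebClient()) {
            MockWebConnection connection = new MockWebConnection();
            connection.setResponse(url, content, contentType);
            webClient.setWebConnection(connection);

            // The declared type is the Page interface; the runtime class depends on the content type.
            Page page = webClient.getPage(url);
            return page.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(pageTypeFor("<html><body>hi</body></html>", "text/html"));
        System.out.println(pageTypeFor("plain text", "text/plain"));
        System.out.println(pageTypeFor("<root/>", "text/xml"));
    }
}
```

Because `getPage` is generic, callers that already know the content type can assign the result directly to `HtmlPage`, `TextPage`, or `XmlPage`; a mismatch surfaces as a `ClassCastException` at the call site.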

## Capabilities

### Browser Automation

Core browser automation functionality for loading pages, managing windows, and configuring browser behavior. Essential for web scraping and automated testing.

```java { .api }
public class WebClient implements AutoCloseable {
    public WebClient();
    public WebClient(BrowserVersion browserVersion);
    public WebClient(BrowserVersion browserVersion, String proxyHost, int proxyPort);

    public <P extends Page> P getPage(String url) throws IOException, FailingHttpStatusCodeException;
    public <P extends Page> P getPage(URL url) throws IOException, FailingHttpStatusCodeException;
    public <P extends Page> P getPage(WebRequest request) throws IOException, FailingHttpStatusCodeException;

    public WebClientOptions getOptions();
    public BrowserVersion getBrowserVersion();
    public void close();
}
```

[Browser Automation](./browser-automation.md)
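
As an illustration of the `WebClient` lifecycle listed above, here is a minimal sketch, assuming HtmlUnit 2.70.0. It creates a client, sets a few options, loads one page, and closes the client through try-with-resources. The class name `PageLoadSketch`, the URL, and the HTML are hypothetical, and `MockWebConnection` stands in for a real server so the example runs offline.

```java
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.MockWebConnection;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

import java.io.IOException;
import java.net.URL;

class PageLoadSketch {

    /** Loads a canned page with a Chrome-flavoured client and returns its title. */
    static String loadTitle() throws IOException {
        // Hypothetical URL, answered locally by the mock connection.
        URL url = new URL("http://localhost/hello");
        try (WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
            // Scripts and stylesheets are not needed to read a title, so switch them off.
            webClient.getOptions().setJavaScriptEnabled(false);
            webClient.getOptions().setCssEnabled(false);

            MockWebConnection connection = new MockWebConnection();
            connection.setResponse(url,
                    "<html><head><title>Hello</title></head><body></body></html>");
            webClient.setWebConnection(connection);

            HtmlPage page = webClient.getPage(url);
            return page.getTitleText();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(loadTitle());
    }
}
```

Against a real site, drop the `MockWebConnection` lines and pass the target URL to `getPage`.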

### HTML DOM Manipulation

Comprehensive HTML DOM access and manipulation with CSS selectors, XPath queries, and element interaction. Perfect for form automation and content extraction.

```java { .api }
public class HtmlPage extends SgmlPage {
    public HtmlElement getElementById(String id);
    public List<HtmlElement> getElementsByTagName(String tagName);
    public List<HtmlElement> getElementsByName(String name);
    public List<HtmlElement> getElementsByClassName(String className);

    public HtmlElement querySelector(String selectors);
    public List<HtmlElement> querySelectorAll(String selectors);

    public DomNode getFirstByXPath(String xpathExpr);
    public List<?> getByXPath(String xpathExpr);

    public String asNormalizedText();
    public String asXml();
    public String getTitleText();
}
```

[HTML DOM Manipulation](./html-dom.md)
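
The three lookup styles (by id, by CSS selector, by XPath) can be compared side by side. This is a minimal sketch, assuming HtmlUnit 2.70.0; the class name `DomQuerySketch`, the URL, and the markup are hypothetical. Results are held in the broad types `DomElement`, `DomNode`, and `List<?>`, which is a conservative assumption about the exact return types in a given release.

```java
import com.gargoylesoftware.htmlunit.MockWebConnection;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.DomElement;
import com.gargoylesoftware.htmlunit.html.DomNode;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

class DomQuerySketch {

    // Hypothetical markup used only for this illustration.
    static final String HTML = "<html><head><title>Catalog</title></head><body>"
            + "<h1 id='heading'>Books</h1>"
            + "<ul><li class='item'>Alpha</li><li class='item'>Beta</li></ul>"
            + "</body></html>";

    /** Returns the heading text, the text of each list item, and the XPath match count. */
    static List<String> query() throws IOException {
        URL url = new URL("http://localhost/catalog");
        try (WebClient webClient = new WebClient()) {
            MockWebConnection connection = new MockWebConnection();
            connection.setResponse(url, HTML);
            webClient.setWebConnection(connection);
            HtmlPage page = webClient.getPage(url);

            List<String> found = new ArrayList<>();

            // Lookup by id
            DomElement heading = page.getElementById("heading");
            found.add(heading.getTextContent());

            // Lookup by CSS selector
            for (DomNode item : page.querySelectorAll("li.item")) {
                found.add(item.getTextContent());
            }

            // Lookup by XPath
            List<?> viaXPath = page.getByXPath("//li[@class='item']");
            found.add(String.valueOf(viaXPath.size()));
            return found;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(query());
    }
}
```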

### Form Interaction

Form automation capabilities including field input, selection handling, and form submission. Ideal for login automation and data entry workflows.

```java { .api }
public class HtmlForm extends HtmlElement {
    public <P extends Page> P submit() throws IOException;
    public <P extends Page> P submit(SubmittableElement submitElement) throws IOException;
    public void reset();

    public HtmlElement getInputByName(String name);
    public List<HtmlElement> getInputsByName(String name);
    public HtmlTextInput getInputByValue(String value);
}

public abstract class HtmlInput extends HtmlElement implements SubmittableElement {
    public String getValueAttribute();
    public void setValueAttribute(String value);
    public String getNameAttribute();
    public boolean isDisabled();
    public void setDisabled(boolean disabled);
}
```

[Form Interaction](./forms.md)
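
A typical fill-and-submit flow might look like the following minimal sketch, assuming HtmlUnit 2.70.0. The class name `FormSubmitSketch`, the URLs, and the form markup are hypothetical. The sketch types into the field with `type(...)`, which simulates keystrokes, and submits by clicking the submit input rather than calling a submit method on the form; both are choices made for this illustration. It then reads back what the mock server received.

```java
import com.gargoylesoftware.htmlunit.MockWebConnection;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlForm;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlSubmitInput;
import com.gargoylesoftware.htmlunit.html.HtmlTextInput;
import com.gargoylesoftware.htmlunit.util.NameValuePair;

import java.io.IOException;
import java.net.URL;

class FormSubmitSketch {

    /** Returns { title of the page after submission, username value received by the server }. */
    static String[] submitLogin() throws IOException {
        // Hypothetical URLs, answered locally by the mock connection.
        URL formUrl = new URL("http://localhost/login");
        URL welcomeUrl = new URL("http://localhost/welcome");
        try (WebClient webClient = new WebClient()) {
            MockWebConnection connection = new MockWebConnection();
            connection.setResponse(formUrl, "<html><body>"
                    + "<form name='loginForm' action='/welcome' method='post'>"
                    + "<input type='text' name='username'>"
                    + "<input type='submit' name='go' value='Sign in'>"
                    + "</form></body></html>");
            connection.setResponse(welcomeUrl,
                    "<html><head><title>Welcome</title></head><body></body></html>");
            webClient.setWebConnection(connection);

            HtmlPage page = webClient.getPage(formUrl);
            HtmlForm form = page.getFormByName("loginForm");

            // Simulate the user typing into the text field.
            HtmlTextInput username = form.getInputByName("username");
            username.type("user123");

            // Clicking the submit input posts the form and yields the next page.
            HtmlSubmitInput submit = form.getInputByName("go");
            HtmlPage result = submit.click();

            // Inspect the request the mock connection received.
            String submittedUser = null;
            for (NameValuePair pair : connection.getLastWebRequest().getRequestParameters()) {
                if ("username".equals(pair.getName())) {
                    submittedUser = pair.getValue();
                }
            }
            return new String[] {result.getTitleText(), submittedUser};
        }
    }

    public static void main(String[] args) throws IOException {
        String[] outcome = submitLogin();
        System.out.println(outcome[0] + " / " + outcome[1]);
    }
}
```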

### HTTP Communication

HTTP request/response handling with full control over headers, methods, authentication, and connection settings. Essential for API testing and advanced web scraping.

```java { .api }
public class WebRequest {
    public WebRequest(URL url);
    public WebRequest(URL url, HttpMethod submitMethod);

    public URL getUrl();
    public void setUrl(URL url);
    public HttpMethod getHttpMethod();
    public void setHttpMethod(HttpMethod method);

    public String getRequestBody();
    public void setRequestBody(String requestBody);
    public void setAdditionalHeader(String name, String value);
    public Map<String, String> getAdditionalHeaders();
}

public class WebResponse {
    public int getStatusCode();
    public String getStatusMessage();
    public String getContentAsString();
    public String getContentAsString(Charset charset);
    public InputStream getContentAsStream();
    public List<NameValuePair> getResponseHeaders();
    public String getResponseHeaderValue(String headerName);
}
```

[HTTP Communication](./http.md)
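
For requests that need an explicit method, headers, or body, a `WebRequest` can be built by hand and the raw `WebResponse` read from the returned page. The following is a minimal sketch, assuming HtmlUnit 2.70.0; the class name `RequestSketch`, the endpoint, and the JSON payloads are hypothetical.

```java
import com.gargoylesoftware.htmlunit.HttpMethod;
import com.gargoylesoftware.htmlunit.MockWebConnection;
import com.gargoylesoftware.htmlunit.Page;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.WebRequest;
import com.gargoylesoftware.htmlunit.WebResponse;

import java.io.IOException;
import java.net.URL;

class RequestSketch {

    /** Returns { response status code, response body, request body seen by the server }. */
    static String[] postJson() throws IOException {
        // Hypothetical endpoint, answered locally by the mock connection.
        URL url = new URL("http://localhost/api/search");
        try (WebClient webClient = new WebClient()) {
            MockWebConnection connection = new MockWebConnection();
            connection.setResponse(url, "{\"ok\":true}", "application/json");
            webClient.setWebConnection(connection);

            // A request body is only accepted for POST, PUT, and PATCH.
            WebRequest request = new WebRequest(url, HttpMethod.POST);
            request.setAdditionalHeader("Content-Type", "application/json");
            request.setRequestBody("{\"q\":\"htmlunit\"}");

            // Non-HTML content still comes back as a Page; the response hangs off it.
            Page page = webClient.getPage(request);
            WebResponse response = page.getWebResponse();

            return new String[] {
                String.valueOf(response.getStatusCode()),
                response.getContentAsString(),
                connection.getLastWebRequest().getRequestBody()
            };
        }
    }

    public static void main(String[] args) throws IOException {
        for (String part : postJson()) {
            System.out.println(part);
        }
    }
}
```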

### JavaScript Execution

JavaScript engine integration for executing JavaScript code within web pages and handling browser API calls. Critical for modern web application automation.

```java { .api }
public class HtmlPage extends SgmlPage {
    public ScriptResult executeJavaScript(String sourceCode);
    public ScriptResult executeJavaScript(String sourceCode, String sourceName, int startLine);
}

public class ScriptResult {
    public Object getJavaScriptResult();
    public Page getNewPage();
}

public interface JavaScriptErrorListener {
    void scriptException(HtmlPage page, ScriptException scriptException);
    void timeoutError(HtmlPage page, long allowedTime, long executionTime);
    void malformedScriptURL(HtmlPage page, String url, MalformedURLException malformedURLException);
    void loadScriptError(HtmlPage page, URL scriptUrl, Exception exception);
}
```

[JavaScript Execution](./javascript.md)
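
Running a script from Java and observing its effect on the DOM can be sketched as follows, assuming HtmlUnit 2.70.0. The class name `ScriptSketch`, the URL, and the markup are hypothetical. The script result is converted with `String.valueOf` because the engine may hand back its own string or number wrapper types rather than `java.lang.String`.

```java
import com.gargoylesoftware.htmlunit.MockWebConnection;
import com.gargoylesoftware.htmlunit.ScriptResult;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

import java.io.IOException;
import java.net.URL;

class ScriptSketch {

    /** Returns { value of the script's last expression, text of the element the script changed }. */
    static String[] runScript() throws IOException {
        // Hypothetical URL, answered locally by the mock connection.
        URL url = new URL("http://localhost/docs");
        try (WebClient webClient = new WebClient()) {
            MockWebConnection connection = new MockWebConnection();
            connection.setResponse(url, "<html><head><title>Docs</title></head>"
                    + "<body><p id='out'>old</p></body></html>");
            webClient.setWebConnection(connection);
            HtmlPage page = webClient.getPage(url);

            // The script mutates the DOM, then evaluates to a string.
            ScriptResult result = page.executeJavaScript(
                    "document.getElementById('out').textContent = 'updated';"
                    + " document.title + '!'");

            String value = String.valueOf(result.getJavaScriptResult());
            // The change made by the script is visible through the Java DOM API.
            String text = page.getElementById("out").getTextContent();
            return new String[] {value, text};
        }
    }

    public static void main(String[] args) throws IOException {
        String[] outcome = runScript();
        System.out.println(outcome[0] + " / " + outcome[1]);
    }
}
```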

### Window Management

Browser window and frame management for handling pop-ups, iframes, and multi-window scenarios. Required for complex web application navigation.

```java { .api }
public interface WebWindow {
    public String getName();
    public void setName(String name);
    public Page getEnclosedPage();
    public void setEnclosedPage(Page page);
    public WebClient getWebClient();
    public WebWindow getParentWindow();
    public WebWindow getTopWindow();
    public History getHistory();
    public int getInnerHeight();
    public int getInnerWidth();
}

public class TopLevelWindow extends WebWindowImpl {
    // Implementation for top-level browser windows
}

public class DialogWindow extends WebWindowImpl {
    // Implementation for modal dialog windows
}
```

[Window Management](./windows.md)
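
Frames are windows nested inside a top-level window, so the `WebWindow` accessors above apply to them as well. This is a minimal sketch, assuming HtmlUnit 2.70.0, of walking from an outer page into an iframe. The class name `FrameSketch`, the URLs, and the markup are hypothetical; `getFrames()` on `HtmlPage` and the `FrameWindow` type are not part of the listing above and are used here on the assumption that they are available in this release.

```java
import com.gargoylesoftware.htmlunit.MockWebConnection;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.FrameWindow;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

import java.io.IOException;
import java.net.URL;
import java.util.List;

class FrameSketch {

    /** Returns { frame name, title of the framed page, whether the frame's top window is the outer window }. */
    static String[] inspectFrame() throws IOException {
        // Hypothetical URLs, answered locally by the mock connection.
        URL outerUrl = new URL("http://localhost/outer");
        URL innerUrl = new URL("http://localhost/inner");
        try (WebClient webClient = new WebClient()) {
            MockWebConnection connection = new MockWebConnection();
            connection.setResponse(outerUrl, "<html><head><title>Outer</title></head><body>"
                    + "<iframe name='inner' src='/inner'></iframe></body></html>");
            connection.setResponse(innerUrl,
                    "<html><head><title>Inner</title></head><body></body></html>");
            webClient.setWebConnection(connection);

            HtmlPage outer = webClient.getPage(outerUrl);

            // Each iframe on the page is backed by its own window.
            List<FrameWindow> frames = outer.getFrames();
            FrameWindow frame = frames.get(0);
            HtmlPage inner = (HtmlPage) frame.getEnclosedPage();

            // Walking up from the frame reaches the window that holds the outer page.
            boolean sameTop = frame.getTopWindow() == outer.getEnclosingWindow();

            return new String[] {frame.getName(), inner.getTitleText(), String.valueOf(sameTop)};
        }
    }

    public static void main(String[] args) throws IOException {
        for (String part : inspectFrame()) {
            System.out.println(part);
        }
    }
}
```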

### Cookie Management

HTTP cookie handling with domain scoping, expiration management, and security flags. Essential for session management and authentication workflows.

```java { .api }
public class CookieManager {
    public void addCookie(Cookie cookie);
    public Set<Cookie> getCookies();
    public Set<Cookie> getCookies(URL url);
    public void clearCookies();
    public boolean isCookiesEnabled();
    public void setCookiesEnabled(boolean enabled);
}

public class Cookie {
    public Cookie(String domain, String name, String value);
    public Cookie(String domain, String name, String value, String path, Date expires, boolean secure);

    public String getName();
    public String getValue();
    public String getDomain();
    public String getPath();
    public Date getExpires();
    public boolean isSecure();
    public boolean isHttpOnly();
}
```

[Cookie Management](./cookies.md)
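
Seeding and clearing cookies needs no page load at all. The following is a minimal sketch, assuming HtmlUnit 2.70.0; the class name `CookieSketch` and the cookie values are hypothetical. It obtains the manager through `WebClient.getCookieManager()` and looks the cookie up by name with `getCookie(String)`, neither of which appears in the listing above.

```java
import com.gargoylesoftware.htmlunit.CookieManager;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.util.Cookie;

class CookieSketch {

    /** Adds one cookie, reads it back, clears the jar, and returns "value:remainingCount". */
    static String roundTrip() {
        try (WebClient webClient = new WebClient()) {
            CookieManager cookies = webClient.getCookieManager();

            // Seed a session cookie, for example to skip a login step in a test.
            cookies.addCookie(new Cookie("example.com", "session", "abc123"));

            Cookie stored = cookies.getCookie("session");
            String value = stored.getValue();

            // Start the next scenario with an empty cookie jar.
            cookies.clearCookies();
            return value + ":" + cookies.getCookies().size();
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip());
    }
}
```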

## Common Types

```java { .api }
public enum HttpMethod {
    OPTIONS, GET, HEAD, POST, PUT, DELETE, TRACE, PATCH
}

public class BrowserVersion {
    public static final BrowserVersion CHROME;
    public static final BrowserVersion FIREFOX;
    public static final BrowserVersion FIREFOX_ESR;
    public static final BrowserVersion EDGE;
    public static final BrowserVersion INTERNET_EXPLORER;
    public static final BrowserVersion BEST_SUPPORTED;

    public String getApplicationName();
    public String getApplicationVersion();
    public String getUserAgent();
    public boolean hasFeature(BrowserVersionFeatures feature);
}

public class NameValuePair {
    public NameValuePair(String name, String value);
    public String getName();
    public String getValue();
}

public class FailingHttpStatusCodeException extends RuntimeException {
    public int getStatusCode();
    public String getStatusMessage();
    public WebResponse getResponse();
}

public class ElementNotFoundException extends RuntimeException {
    // Thrown when element lookups fail
}
```
```
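
Because `FailingHttpStatusCodeException` is unchecked, it is easy to forget to handle it. This is a minimal sketch, assuming HtmlUnit 2.70.0, of catching it and reading the status from the exception. The class name `StatusCodeSketch`, the URL, and the response body are hypothetical, and the six-argument `MockWebConnection.setResponse` overload is used to simulate a non-2xx response.

```java
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.MockWebConnection;
import com.gargoylesoftware.htmlunit.Page;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.util.NameValuePair;

import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;

class StatusCodeSketch {

    /** Fetches a canned response with the given status and describes the outcome. */
    static String fetch(int status, String message) throws IOException {
        // Hypothetical URL, answered locally by the mock connection.
        URL url = new URL("http://localhost/resource");
        try (WebClient webClient = new WebClient()) {
            MockWebConnection connection = new MockWebConnection();
            connection.setResponse(url, "<html><body>body</body></html>",
                    status, message, "text/html", new ArrayList<NameValuePair>());
            webClient.setWebConnection(connection);

            // This is the default; set explicitly here to make the behaviour visible.
            webClient.getOptions().setThrowExceptionOnFailingStatusCode(true);

            try {
                Page page = webClient.getPage(url);
                return "ok: " + page.getWebResponse().getStatusCode();
            } catch (FailingHttpStatusCodeException e) {
                // The exception carries the status line and the full response.
                return "failed: " + e.getStatusCode() + " " + e.getStatusMessage();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(fetch(200, "OK"));
        System.out.println(fetch(404, "Not Found"));
    }
}
```

Setting `setThrowExceptionOnFailingStatusCode(false)` instead returns the error page normally, which suits scrapers that want to inspect error bodies.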