# Two-Stage Least Squares Models

Two-stage least squares (TSLS) estimation for handling endogenous variables in regression models, with spatial diagnostic capabilities and regime-based analysis options.

## Capabilities

### Base TSLS Estimation

Core two-stage least squares estimation without diagnostics, providing instrumental variable estimation for models with endogeneity.

```python { .api }
class BaseTSLS:
    def __init__(self, y, x, yend, q=None, h=None, robust=None, gwk=None, sig2n_k=False):
        """
        Two-stage least squares estimation (no diagnostics).

        Parameters:
        - y (array): nx1 dependent variable
        - x (array): nxk exogenous independent variables (excluding constant)
        - yend (array): nxp endogenous variables
        - q (array, optional): nxq external instruments (cannot use with h)
        - h (array, optional): nxl all instruments (cannot use with q)
        - robust (str, optional): 'white' or 'hac' for robust standard errors
        - gwk (pysal W object, optional): Kernel weights for HAC estimation
        - sig2n_k (bool): If True, use n-k for sigma^2 estimation

        Attributes:
        - betas (array): kx1 estimated coefficients (for x and yend combined)
        - u (array): nx1 residuals
        - predy (array): nx1 predicted values
        - z (array): nxk combined exogenous and endogenous variables
        - h (array): nxl all instruments
        - vm (array): kxk variance-covariance matrix
        - sig2 (float): Sigma squared
        - n (int): Number of observations
        - k (int): Number of parameters
        - kstar (int): Number of endogenous variables
        """
```
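
The two stages behind this estimator can be sketched with plain NumPy. This is an illustrative sketch only, not the library's implementation: the helper name `tsls_by_hand` and the simulated data are assumptions, and unlike the documented `x` argument the helper adds the constant itself.

```python
import numpy as np

def tsls_by_hand(y, x, yend, q):
    """Two-stage least squares via explicit stages.

    Hypothetical helper for illustration; it is not part of spreg.
    """
    n = y.shape[0]
    const = np.ones((n, 1))
    z = np.hstack([const, x, yend])  # regressors: constant, exogenous, endogenous
    h = np.hstack([const, x, q])     # instruments: constant, exogenous, external
    # Stage 1: project every regressor onto the space spanned by the instruments
    z_hat = h @ np.linalg.lstsq(h, z, rcond=None)[0]
    # Stage 2: regress y on the projected regressors
    betas = np.linalg.lstsq(z_hat, y, rcond=None)[0]
    # Residuals are computed with the original regressors, not the projections
    u = y - z @ betas
    return betas, u

# Simulated data in which yend is correlated with the structural error e
rng = np.random.default_rng(0)
n = 5000
e = rng.standard_normal((n, 1))
x = rng.standard_normal((n, 1))
q = rng.standard_normal((n, 1))
yend = 2 * q + 0.5 * e + rng.standard_normal((n, 1))
y = 1 + 2 * x + 1.5 * yend + e

betas, u = tsls_by_hand(y, x, yend, q)
ols = np.linalg.lstsq(np.hstack([np.ones((n, 1)), x, yend]), y, rcond=None)[0]
print("TSLS coefficients:", betas.ravel())
print("OLS coefficients: ", ols.ravel())
```

In this simulation the OLS coefficient on `yend` is biased upward because `yend` shares the error `e` with `y`, while the TSLS coefficient should land close to the true value of 1.5.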

### Full TSLS with Diagnostics

Complete TSLS implementation with spatial diagnostics, endogeneity tests, and comprehensive output formatting.

```python { .api }
class TSLS:
    def __init__(self, y, x, yend, q, h=None, robust=None, gwk=None, sig2n_k=False,
                 nonspat_diag=True, spat_diag=False, w=None, slx_lags=0,
                 slx_vars='All', regimes=None, vm=False, constant_regi='one',
                 cols2regi='all', regime_err_sep=False, regime_lag_sep=False,
                 cores=False, name_y=None, name_x=None, name_yend=None,
                 name_q=None, name_h=None, name_w=None, name_ds=None, latex=False):
        """
        Two-stage least squares with diagnostics.

        Parameters:
        - y (array): nx1 dependent variable
        - x (array): nxk exogenous independent variables (constant added automatically)
        - yend (array): nxp endogenous variables
        - q (array): nxq external instruments
        - h (array, optional): nxl all instruments (alternative to q)
        - robust (str, optional): 'white' or 'hac' for robust standard errors
        - gwk (pysal W object, optional): Kernel weights for HAC
        - sig2n_k (bool): Use n-k for sigma^2 estimation
        - nonspat_diag (bool): Compute non-spatial diagnostics
        - spat_diag (bool): Compute Anselin-Kelejian test (requires w)
        - w (pysal W object, optional): Spatial weights for spatial diagnostics
        - slx_lags (int): Number of spatial lags of X to include
        - slx_vars (str/list): Variables to be spatially lagged
        - regimes (list/Series, optional): Regime identifier
        - vm (bool): Include variance-covariance matrix
        - constant_regi (str): Regime treatment of constant
        - cols2regi (str/list): Variables that vary by regime
        - regime_err_sep (bool): Separate error variance by regime
        - regime_lag_sep (bool): Separate spatial lag by regime
        - cores (bool): Use multiprocessing
        - name_y, name_x, name_yend, name_q, name_h, name_w, name_ds (str): Variable names
        - latex (bool): LaTeX formatting

        Attributes:
        - All BaseTSLS attributes plus:
        - pr2 (float): Pseudo R-squared
        - z_stat (list): z-statistics with p-values for each coefficient
        - ak_test (dict): Anselin-Kelejian test for spatial dependence (if spat_diag=True)
        - dwh (dict): Durbin-Wu-Hausman endogeneity test
        - summary (str): Comprehensive formatted results
        - output (DataFrame): Formatted results table
        """
```

## Usage Examples

### Basic TSLS Estimation

```python
import numpy as np
import spreg

# Generate data with endogeneity
n = 100
# Structural error and first-stage error
e1 = np.random.randn(n, 1)  # structural error
e2 = np.random.randn(n, 1)  # error in endogenous variable

# Exogenous variables and instruments
x = np.random.randn(n, 2)
z = np.random.randn(n, 1)  # external instrument

# Endogenous variable (correlated with the structural error e1)
yend = 2 * z + 0.5 * e1 + e2

# Dependent variable
y = 1 + 2 * x[:, 0:1] + 3 * x[:, 1:2] + 1.5 * yend + e1

# TSLS estimation
tsls_model = spreg.TSLS(y, x, yend, z, name_y='y',
                        name_x=['x1', 'x2'], name_yend=['yend'],
                        name_q=['instrument'])

print(tsls_model.summary)
print("Pseudo R-squared:", tsls_model.pr2)
print("Durbin-Wu-Hausman test:", tsls_model.dwh)
```

### TSLS with Multiple Instruments

```python
import numpy as np
import spreg

# Multiple endogenous variables and instruments
n = 100
x = np.random.randn(n, 2)
z1 = np.random.randn(n, 1)  # instrument for first endogenous var
z2 = np.random.randn(n, 1)  # instrument for second endogenous var
z3 = np.random.randn(n, 1)  # additional instrument (overidentification)

# Two variables treated as endogenous
# (simulated here without correlation with the structural error, for brevity)
yend1 = 1.5 * z1 + 0.3 * z3 + np.random.randn(n, 1)
yend2 = 2.0 * z2 + 0.4 * z3 + np.random.randn(n, 1)
yend = np.hstack([yend1, yend2])

# All external instruments
q = np.hstack([z1, z2, z3])

# Dependent variable
y = 1 + x[:, 0:1] + 2 * x[:, 1:2] + 0.5 * yend1 + 1.2 * yend2 + np.random.randn(n, 1)

# TSLS with multiple endogenous variables
multi_tsls = spreg.TSLS(y, x, yend, q,
                        name_y='y', name_x=['x1', 'x2'],
                        name_yend=['yend1', 'yend2'],
                        name_q=['z1', 'z2', 'z3'])

print(multi_tsls.summary)
# Compare the number of external instruments with the number of endogenous
# variables; h also contains the constant and x, so its width would overstate it
print(f"Model is {'over' if q.shape[1] > multi_tsls.kstar else 'just'}identified")
```

### TSLS with Spatial Diagnostics

```python
import numpy as np
import spreg
from libpysal import weights

# Spatial TSLS
n = 49  # 7x7 grid
x = np.random.randn(n, 1)
z = np.random.randn(n, 1)  # instrument
w = weights.lat2W(7, 7)  # spatial weights

# Endogenous variable
yend = 1.5 * z + np.random.randn(n, 1)

# Dependent variable (no spatial structure is simulated in this example)
y = 1 + x + 0.8 * yend + np.random.randn(n, 1)

# TSLS with Anselin-Kelejian test
spatial_tsls = spreg.TSLS(y, x, yend, z, w=w, spat_diag=True,
                          name_y='y', name_x=['x1'],
                          name_yend=['yend'], name_q=['instrument'])

print(spatial_tsls.summary)
print("Anselin-Kelejian test:", spatial_tsls.ak_test)

# ak_test is documented above as a dict; fall back to positional access in case
# the installed spreg version returns a (statistic, p-value) tuple instead
ak = spatial_tsls.ak_test
p_value = ak['p-value'] if isinstance(ak, dict) else ak[1]
if p_value < 0.05:
    print("Spatial dependence detected in TSLS residuals")
```

### TSLS with SLX Specification

```python
import numpy as np
import spreg
from libpysal import weights

# TSLS with spatial lag of X
n = 100
x = np.random.randn(n, 2)
z = np.random.randn(n, 1)
w = weights.KNN.from_array(np.random.randn(n, 2), k=5)

# Endogenous variable
yend = 2 * z + np.random.randn(n, 1)

# Dependent variable
y = 1 + x.sum(axis=1, keepdims=True) + 0.8 * yend + np.random.randn(n, 1)

# Include spatial lags of exogenous variables
slx_tsls = spreg.TSLS(y, x, yend, z, w=w, slx_lags=1, slx_vars='All',
                      name_y='y', name_x=['x1', 'x2'],
                      name_yend=['yend'], name_q=['instrument'])

print(slx_tsls.summary)
print("Includes spatial lags of X variables")
```

### TSLS with Robust Standard Errors

```python
import numpy as np
import spreg

# TSLS with heteroskedasticity-robust standard errors
# Placeholder data: pure noise, so the estimates themselves are not meaningful
n = 100
x = np.random.randn(n, 2)
z = np.random.randn(n, 2)  # two instruments
yend = np.random.randn(n, 1)
y = np.random.randn(n, 1)

# White-robust TSLS
robust_tsls = spreg.TSLS(y, x, yend, z, robust='white',
                         name_y='y', name_x=['x1', 'x2'],
                         name_yend=['yend'], name_q=['z1', 'z2'])

print(robust_tsls.summary)
print("Uses White-robust standard errors")
```

### Regime-Based TSLS

```python
import numpy as np
import spreg

# TSLS with regimes
# Placeholder data: pure noise, so the estimates themselves are not meaningful
n = 150
x = np.random.randn(n, 2)
z = np.random.randn(n, 2)
yend = np.random.randn(n, 1)
y = np.random.randn(n, 1)
# Convert to a list to match the documented type of the regimes argument
regimes = np.random.choice(['North', 'South', 'East'], n).tolist()

# TSLS allowing coefficients to vary by regime
# Note: name_regimes and the chow attribute are not listed in the signature and
# attribute list documented above; they are assumed to be provided by the
# regimes implementation that TSLS delegates to when regimes is given
regime_tsls = spreg.TSLS(y, x, yend, z, regimes=regimes,
                         constant_regi='many', cols2regi='all',
                         name_y='y', name_x=['x1', 'x2'],
                         name_yend=['yend'], name_q=['z1', 'z2'],
                         name_regimes='region')

print(regime_tsls.summary)
print("Coefficients vary by regime")
print("Chow test:", regime_tsls.chow)
```

## Key Diagnostic Tests

### Endogeneity Testing
- `dwh`: Durbin-Wu-Hausman test for endogeneity
  - Tests whether the OLS and TSLS estimates differ significantly
  - A significant result indicates that endogeneity is present, in which case OLS is inconsistent
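
One common regression-based form of this test can be sketched in NumPy: regress the endogenous variable on all instruments, add the first-stage residuals to the original regression, and inspect the t-statistic on those residuals. This is a sketch under assumptions (a single endogenous variable, homoskedastic errors, a hypothetical helper named `dwh_t_stat`); the way `dwh` is computed by the library may differ.

```python
import numpy as np

def ols(X, y):
    """Return OLS coefficients and residuals."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    return beta, y - X @ beta

def dwh_t_stat(y, x, yend, q):
    """t-statistic on the first-stage residual in the augmented regression.

    Hypothetical helper for illustration; assumes yend has a single column.
    """
    n = y.shape[0]
    const = np.ones((n, 1))
    h = np.hstack([const, x, q])
    _, v = ols(h, yend)                      # first-stage residuals
    X_aug = np.hstack([const, x, yend, v])   # original regressors plus v
    beta, resid = ols(X_aug, y)
    dof = n - X_aug.shape[1]
    sigma2 = (resid ** 2).sum() / dof
    cov = sigma2 * np.linalg.inv(X_aug.T @ X_aug)
    return beta[-1, 0] / np.sqrt(cov[-1, -1])

rng = np.random.default_rng(1)
n = 2000
e = rng.standard_normal((n, 1))
x = rng.standard_normal((n, 1))
q = rng.standard_normal((n, 1))
noise = rng.standard_normal((n, 1))

yend_endog = 2 * q + 0.5 * e + noise   # shares the structural error e with y
yend_exog = 2 * q + noise              # unrelated to e

t_endog = dwh_t_stat(1 + x + 1.5 * yend_endog + e, x, yend_endog, q)
t_exog = dwh_t_stat(1 + x + 1.5 * yend_exog + e, x, yend_exog, q)
print("t-statistic, endogenous case:", t_endog)
print("t-statistic, exogenous case: ", t_exog)
```

A large absolute t-statistic points to endogeneity; in the exogenous case the statistic should be much smaller in magnitude.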

### Spatial Dependence in TSLS
- `ak_test`: Anselin-Kelejian test for spatial dependence in TSLS residuals
  - Robust to heteroskedasticity and endogeneity
  - A significant result suggests spatial dependence in the error term

### Model Fit
- `pr2`: Pseudo R-squared for TSLS models
  - The standard R-squared is not appropriate because of the two-stage estimation
  - Measures explained variation while accounting for instrumentation
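
A pseudo R-squared of this kind is commonly computed as the squared correlation between the observed and predicted values. The sketch below assumes that definition; `pseudo_r2` is a hypothetical helper, and the arrays stand in for the `y` and `predy` of a fitted model.

```python
import numpy as np

def pseudo_r2(y, predy):
    """Squared Pearson correlation between observed and predicted values.

    Hypothetical helper for illustration; not part of spreg.
    """
    r = np.corrcoef(y.ravel(), predy.ravel())[0, 1]
    return r ** 2

# Stand-in values for observed and predicted y
y = np.array([[1.0], [2.0], [3.0], [4.0]])
predy = np.array([[1.1], [1.9], [3.2], [3.8]])

print("Pseudo R-squared:", pseudo_r2(y, predy))
# A perfect linear relationship gives 1 regardless of scale or shift
print("Perfect linear fit:", pseudo_r2(y, 2 * y + 1))
```

Because it is a squared correlation, this measure always lies between 0 and 1, whereas 1 - SSR/SST can fall below 0 for TSLS residuals.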

## Instrument Quality Guidelines

### Instrument Relevance
- Instruments must be strongly correlated with the endogenous variables
- Check first-stage F-statistics (a common rule of thumb treats F < 10 as a sign of weak instruments)
- Use multiple instruments when available so that overidentification tests are possible
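
The first-stage F-statistic mentioned above can be computed by comparing a first-stage regression with and without the external instruments. The sketch below assumes a single endogenous variable and homoskedastic errors; `first_stage_f` is a hypothetical helper, and the data are simulated.

```python
import numpy as np

def first_stage_f(x, yend, q):
    """F-statistic for the joint significance of q in the first stage.

    Hypothetical helper for illustration; assumes yend has a single column.
    """
    n = yend.shape[0]
    const = np.ones((n, 1))
    restricted = np.hstack([const, x])   # first stage without external instruments
    full = np.hstack([const, x, q])      # first stage with external instruments

    def ssr(X):
        beta = np.linalg.lstsq(X, yend, rcond=None)[0]
        return ((yend - X @ beta) ** 2).sum()

    ssr_r, ssr_f = ssr(restricted), ssr(full)
    m = q.shape[1]                       # number of excluded instruments
    dof = n - full.shape[1]
    return ((ssr_r - ssr_f) / m) / (ssr_f / dof)

rng = np.random.default_rng(3)
n = 500
x = rng.standard_normal((n, 1))
q = rng.standard_normal((n, 1))
noise = rng.standard_normal((n, 1))

f_strong = first_stage_f(x, 2.0 * q + noise, q)    # instrument drives yend
f_weak = first_stage_f(x, 0.02 * q + noise, q)     # instrument barely matters
print("First-stage F, strong instrument:", f_strong)
print("First-stage F, weak instrument:  ", f_weak)
```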

### Instrument Exogeneity
- Instruments must be uncorrelated with the structural error
- Exogeneity cannot be tested directly; it has to be justified by economic reasoning
- Overidentification tests can detect some violations
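
One widely used overidentification test is the Sargan test: regress the TSLS residuals on all instruments and multiply the resulting R-squared by n. The sketch below is illustrative only; `sargan_statistic` is a hypothetical helper that is not part of the documented API, and the statistic is only asymptotically chi-squared (with as many degrees of freedom as overidentifying restrictions) under homoskedastic errors.

```python
import numpy as np

def sargan_statistic(y, x, yend, q):
    """Sargan statistic n * R^2 and its degrees of freedom.

    Hypothetical helper for illustration; not part of spreg.
    """
    n = y.shape[0]
    const = np.ones((n, 1))
    z = np.hstack([const, x, yend])
    h = np.hstack([const, x, q])
    # TSLS estimate and residuals
    z_hat = h @ np.linalg.lstsq(h, z, rcond=None)[0]
    betas = np.linalg.lstsq(z_hat, y, rcond=None)[0]
    u = y - z @ betas
    # Auxiliary regression of the residuals on all instruments
    u_fit = h @ np.linalg.lstsq(h, u, rcond=None)[0]
    r2 = 1.0 - ((u - u_fit) ** 2).sum() / ((u - u.mean()) ** 2).sum()
    df = h.shape[1] - z.shape[1]   # number of overidentifying restrictions
    return n * r2, df

rng = np.random.default_rng(2)
n = 2000
e = rng.standard_normal((n, 1))
x = rng.standard_normal((n, 1))
q = rng.standard_normal((n, 2))
yend = q[:, 0:1] + q[:, 1:2] + 0.5 * e + rng.standard_normal((n, 1))

y_valid = 1 + x + 1.5 * yend + e
y_invalid = y_valid + q[:, 1:2]    # second instrument enters y directly

stat_valid, df = sargan_statistic(y_valid, x, yend, q)
stat_invalid, _ = sargan_statistic(y_invalid, x, yend, q)
print("Degrees of freedom:", df)
print("Sargan statistic, valid instruments:  ", stat_valid)
print("Sargan statistic, invalid instrument: ", stat_invalid)
```

A statistic that is large relative to the chi-squared reference distribution suggests that at least one instrument is correlated with the structural error. The test cannot say which one, and it has no power when all instruments are invalid in the same direction.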

### Identification Requirements
- At least as many external instruments as endogenous variables are needed
- Having more instruments than endogenous variables allows overidentification testing
- Instrument quality matters more than quantity
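
The counting rule above (the order condition) is easy to check before estimation. The sketch below uses a hypothetical helper, `identification_status`, and counts instruments by column rank so that duplicated columns are not counted twice. The order condition is necessary but not sufficient: instruments must also be relevant.

```python
import numpy as np

def identification_status(yend, q):
    """Classify identification by the order condition.

    Hypothetical helper for illustration; not part of spreg.
    """
    n_endog = yend.shape[1]
    # Column rank, so linearly dependent instruments are counted once
    n_instr = np.linalg.matrix_rank(q)
    if n_instr < n_endog:
        return "underidentified"
    if n_instr == n_endog:
        return "just identified"
    return "overidentified"

rng = np.random.default_rng(4)
yend = rng.standard_normal((50, 2))   # two endogenous variables
z1 = rng.standard_normal((50, 1))
z2 = rng.standard_normal((50, 1))
z3 = rng.standard_normal((50, 1))

print(identification_status(yend, z1))                        # underidentified
print(identification_status(yend, np.hstack([z1, z2, z1])))   # just identified
print(identification_status(yend, np.hstack([z1, z2, z3])))   # overidentified
```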

## Model Selection Strategy

1. **Identify endogenous variables** through economic theory and testing
2. **Find valid instruments** that are relevant and exogenous
3. **Estimate the TSLS model** and check the Durbin-Wu-Hausman test
4. **Test for spatial dependence** using the Anselin-Kelejian test if the data are spatial
5. **Consider spatial error models** if spatial dependence is detected
6. **Use robust standard errors** if heteroskedasticity is suspected
7. **Apply regime analysis** if parameters vary systematically across groups