Errors-in-variables models

In statistics and econometrics, errors-in-variables models or measurement error models are regression models that account for measurement errors in the independent variables. In contrast, standard regression models assume that those regressors have been measured exactly, or observed without error; as such, those models account only for errors in the dependent variables, or responses.
In the case when some regressors have been measured with errors, estimation based on the standard assumption leads to inconsistent estimates, meaning that the parameter estimates do not tend to the true values even in very large samples. For simple linear regression the effect is an underestimate of the coefficient, known as the attenuation bias. In nonlinear models the direction of the bias is likely to be more complicated.^{[1]}
Motivational example
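Consider a simple linear regression model of the form
\[ y_{t} = \alpha + \beta x^{*}_{t} + \varepsilon_{t}, \quad t = 1, \ldots, T, \]
where x*_{t} denotes the true but unobserved value of the regressor. Instead we observe this value with an error:
\[ x_{t} = x^{*}_{t} + \eta_{t}, \]
where the measurement error η_{t} is assumed to be independent from the true value x*_{t}.
If the y_{t}′s are simply regressed on the x_{t}′s (see simple linear regression), then the estimator for the slope coefficient is
\[ \hat{\beta} = \frac{\sum_{t=1}^{T} (x_{t} - \bar{x})(y_{t} - \bar{y})}{\sum_{t=1}^{T} (x_{t} - \bar{x})^{2}}, \]
which converges as the sample size T increases without bound:
\[ \hat{\beta} \xrightarrow{p} \frac{\operatorname{Cov}[x_{t}, y_{t}]}{\operatorname{Var}[x_{t}]} = \frac{\beta \sigma^{2}_{*}}{\sigma^{2}_{*} + \sigma^{2}_{\eta}} = \frac{\beta}{1 + \sigma^{2}_{\eta} / \sigma^{2}_{*}}, \]
where σ²_{∗} and σ²_{η} denote the variances of x*_{t} and η_{t}.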
The two variances here are positive, so that in the limit the estimate is smaller in magnitude than the true value of β, an effect which statisticians call attenuation or regression dilution.^{[2]} Thus the “naïve” least squares estimator is inconsistent in this setting. However, the estimator is a consistent estimator of the parameter required for a best linear predictor of y given x: in some applications this may be what is required, rather than an estimate of the “true” regression coefficient, although that would assume that the variance of the errors in observing x* remains fixed.
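The attenuation effect can be checked numerically. The following is a minimal simulation sketch in Python (NumPy); the parameter values, the normal distributions, and the function name simulate_attenuation are illustrative assumptions, not part of the original text. It compares the naive least squares slope with the limit β / (1 + σ²_{η}/σ²_{∗}), where σ²_{∗} is the variance of x*.

```python
import numpy as np

def simulate_attenuation(beta=2.0, alpha=1.0, sd_latent=1.0, sd_eta=0.5,
                         sd_eps=0.3, n=200_000, seed=0):
    """Return (naive OLS slope, theoretical attenuated limit)."""
    rng = np.random.default_rng(seed)
    x_star = rng.normal(0.0, sd_latent, n)            # latent regressor x*
    y = alpha + beta * x_star + rng.normal(0.0, sd_eps, n)
    x = x_star + rng.normal(0.0, sd_eta, n)           # observed with classical error
    # OLS slope of y on x: sample covariance over sample variance
    slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
    # Limit of the naive slope under classical measurement error
    limit = beta / (1.0 + sd_eta**2 / sd_latent**2)
    return slope, limit

slope, limit = simulate_attenuation()
print(f"naive slope = {slope:.3f}, predicted limit = {limit:.3f}")
```

With these assumed settings the naive slope should land near the predicted limit rather than near the true β, which is the attenuation described above.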
It can be argued that almost all existing data sets contain errors of different nature and magnitude, so that attenuation bias is extremely frequent (although in multivariate regression the direction of bias is ambiguous^{[citation needed]}). Jerry Hausman sees this as an iron law of econometrics: “The magnitude of the estimate is usually smaller than expected.”^{[3]}
Specification
Usually measurement error models are described using the latent variables approach. If y is the response variable and x are observed values of the regressors, then we assume there exist some latent variables y* and x* which follow the model's “true” functional relationship g, and such that the observed quantities are their noisy observations:
\[ x = x^{*} + \eta, \qquad y = y^{*} + \varepsilon, \qquad y^{*} = g(x^{*}, w \mid \theta), \]
where θ is the model's parameter and w are those regressors which are assumed to be error-free (for example when linear regression contains an intercept, the regressor which corresponds to the constant certainly has no “measurement errors”). Depending on the specification these error-free regressors may or may not be treated separately; in the latter case it is simply assumed that corresponding entries in the variance matrix of η's are zero.
The variables y, x, w are all observed, meaning that the statistician possesses a data set of n statistical units {y_{i}, x_{i}, w_{i}}_{i = 1, ..., n} which follow the data generating process described above; the latent variables x*, y*, ε, and η are not observed however.
This specification does not encompass all the existing EiV models. For example in some of them function g may be nonparametric or semiparametric. Other approaches model the relationship between y* and x* as distributional instead of functional, that is they assume that y* conditionally on x* follows a certain (usually parametric) distribution.
Terminology and assumptions
 The observed variable x may be called the manifest, indicator, or proxy variable.
 The unobserved variable x* may be called the latent or true variable. It may be regarded either as an unknown constant (in which case the model is called a functional model), or as a random variable (correspondingly a structural model).^{[4]}
 The relationship between the measurement error η and the latent variable x* can be modeled in different ways:
 Classical errors: the errors are independent from the latent variable. This is the most common assumption; it implies that the errors are introduced by the measuring device and their magnitude does not depend on the value being measured.
 Mean-independence: the errors are mean-zero for every value of the latent regressor. This is a less restrictive assumption than the classical one, as it allows for the presence of heteroscedasticity or other effects in the measurement errors.
 Berkson’s errors: the errors are independent from the observed regressor x. This assumption has very limited applicability. One example is round-off errors: for example if a person’s age* is a continuous random variable, whereas the observed age is truncated to the next smallest integer, then the truncation error is approximately independent from the observed age. Another possibility is with the fixed design experiment: for example if a scientist decides to make a measurement at a certain predetermined moment of time x, say at x = 10 s, then the real measurement may occur at some other value of x* (for example due to her finite reaction time) and such measurement error will be generally independent from the “observed” value of the regressor.
 Misclassification errors: special case used for the dummy regressors. If x* is an indicator of a certain event or condition (such as whether a person is male or female, whether some medical treatment was given, etc.), then the measurement error in such a regressor will correspond to an incorrect classification, similar to type I and type II errors in statistical testing. In this case the error η may take only 3 possible values, and its distribution conditional on x* is modeled with two parameters: α = Pr[η=−1 | x*=1], and β = Pr[η=1 | x*=0]. The necessary condition for identification is that α+β<1, that is misclassification should not happen “too often”. (This idea can be generalized to discrete variables with more than two possible values.)
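The difference between the classical and Berkson assumptions can be illustrated with a small simulation. The sketch below is only an illustration under assumed normal distributions and assumed parameter values (the function names are hypothetical). It relies on a standard property that the list above does not state: in a linear model, a Berkson error (x* = x + η with η independent of the observed x) leaves the least squares slope consistent, whereas a classical error (x = x* + η with η independent of x*) attenuates it.

```python
import numpy as np

def ols_slope(x, y):
    """Slope of the least squares regression of y on x."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

def compare_error_types(beta=2.0, sd_eta=0.5, n=200_000, seed=1):
    """Return (slope under classical error, slope under Berkson error)."""
    rng = np.random.default_rng(seed)
    eps = rng.normal(0.0, 0.3, n)
    # Classical: the error is independent of the latent x*, observed x = x* + eta
    x_star = rng.normal(0.0, 1.0, n)
    x_classical = x_star + rng.normal(0.0, sd_eta, n)
    slope_classical = ols_slope(x_classical, beta * x_star + eps)
    # Berkson: the error is independent of the observed x, latent x* = x + eta
    x_berkson = rng.normal(0.0, 1.0, n)
    x_star_b = x_berkson + rng.normal(0.0, sd_eta, n)
    slope_berkson = ols_slope(x_berkson, beta * x_star_b + eps)
    return slope_classical, slope_berkson

slope_classical, slope_berkson = compare_error_types()
print(f"classical: {slope_classical:.3f}, Berkson: {slope_berkson:.3f}")
```

Under these assumptions the classical slope should be pulled toward zero while the Berkson slope stays close to the true β; this holds for the linear case only and should not be read as a statement about nonlinear models.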
Linear model
Linear errors-in-variables models were studied first, probably because linear models were so widely used and are easier to analyze than nonlinear ones. Unlike standard least squares regression (OLS), extending errors-in-variables regression (EiV) from the simple to the multivariate case is not straightforward.
Simple linear model
The simple linear errors-in-variables model was already presented in the “motivation” section:
\[ y_{t} = \alpha + \beta x^{*}_{t} + \varepsilon_{t}, \qquad x_{t} = x^{*}_{t} + \eta_{t}, \]
where all variables are scalar. Here α and β are the parameters of interest, whereas σ_{ε} and σ_{η} (the standard deviations of the error terms) are the nuisance parameters. The “true” regressor x* is treated as a random variable (structural model), independent from the measurement error η (classical assumption).
This model is identifiable in two cases: (1) either the latent regressor x* is not normally distributed, or (2) x* has a normal distribution, but neither ε_{t} nor η_{t} is divisible by a normal distribution.^{[5]} That is, the parameters α, β can be consistently estimated from the data set without any additional information, provided the latent regressor is not Gaussian.
Before this identifiability result was established, statisticians attempted to apply the maximum likelihood technique by assuming that all variables are normal, and then concluded that the model is not identified. The suggested remedy was to assume that some of the parameters of the model are known or can be estimated from an outside source. Such estimation methods include:^{[6]}
 Deming regression: assumes that the ratio δ = σ²_{ε}/σ²_{η} is known. This could be appropriate for example when errors in y and x are both caused by measurements, and the accuracy of the measuring devices or procedures is known. The case when δ = 1 is also known as orthogonal regression.
 Regression with known reliability ratio λ = σ²_{∗}/ ( σ²_{η} + σ²_{∗}), where σ²_{∗} is the variance of the latent regressor. Such an approach may be applicable for example when repeated measurements of the same unit are available, or when the reliability ratio is known from an independent study. In this case the consistent estimate of the slope is equal to the least-squares estimate divided by λ.
 Regression with known σ²_{η} may occur when the source of the errors in x’s is known and their variance can be calculated. This could include rounding errors, or errors introduced by the measuring device. When σ²_{η} is known we can compute the reliability ratio as λ = ( σ²_{x} − σ²_{η}) / σ²_{x} and reduce the problem to the previous case.
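The three corrections listed above can be sketched in a few lines of Python. This is a hedged illustration, not a reference implementation: the function names and the simulated data are assumptions, and the Deming slope uses the standard closed form written in terms of the sample variances s_xx, s_yy and the sample covariance s_xy, with δ = σ²_{ε}/σ²_{η} as defined in the text.

```python
import numpy as np

def deming_slope(x, y, delta):
    """Deming slope with known delta = var(eps) / var(eta); needs s_xy != 0."""
    sxx = np.var(x, ddof=1)
    syy = np.var(y, ddof=1)
    sxy = np.cov(x, y)[0, 1]
    d = syy - delta * sxx
    return (d + np.sqrt(d**2 + 4.0 * delta * sxy**2)) / (2.0 * sxy)

def reliability_corrected_slope(x, y, lam):
    """Least squares slope divided by the known reliability ratio lam."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1) / lam

def slope_known_error_variance(x, y, var_eta):
    """Estimate lam = (s_xx - var_eta) / s_xx, then apply the same correction."""
    sxx = np.var(x, ddof=1)
    lam = (sxx - var_eta) / sxx
    return reliability_corrected_slope(x, y, lam)

# Simulated data with assumed values: beta = 2, var(x*) = 1,
# var(eta) = 0.25, var(eps) = 0.09
rng = np.random.default_rng(2)
n, beta = 200_000, 2.0
x_star = rng.normal(0.0, 1.0, n)
x = x_star + rng.normal(0.0, 0.5, n)
y = 1.0 + beta * x_star + rng.normal(0.0, 0.3, n)

print(deming_slope(x, y, delta=0.09 / 0.25))
print(reliability_corrected_slope(x, y, lam=1.0 / 1.25))
print(slope_known_error_variance(x, y, var_eta=0.25))
```

Each estimator needs exactly one piece of outside information (δ, λ, or σ²_{η}); with the correct value supplied, all three should recover a slope close to the assumed β.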
Newer estimation methods that do not assume knowledge of some of the parameters of the model include:
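 Method of moments: the GMM estimator based on the third- (or higher-) order joint cumulants of observable variables. The slope coefficient can be estimated from ^{[7]}
\[ \hat{\beta} = \frac{\hat{K}(n_{1}, n_{2}+1)}{\hat{K}(n_{1}+1, n_{2})}, \quad n_{1}, n_{2} > 0, \]
where K(n_{1}, n_{2}) denotes the joint cumulant of (x, y) of order (n_{1}, n_{2}), and (n_{1}, n_{2}) are chosen such that K(n_{1}+1, n_{2}) is not zero. When the third central moment of the latent regressor x* is non-zero, this reduces to
\[ \hat{\beta} = \frac{\frac{1}{T}\sum_{t=1}^{T} (x_{t} - \bar{x})(y_{t} - \bar{y})^{2}}{\frac{1}{T}\sum_{t=1}^{T} (x_{t} - \bar{x})^{2}(y_{t} - \bar{y})}. \]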
 Instrumental variables: a regression which requires that certain additional data variables z, called instruments, be available. These variables should be uncorrelated with the errors in the equation for the dependent variable, and they should also be correlated (relevant) with the true regressors x*. If such variables can be found then the estimator takes the form
\[ \hat{\beta} = \frac{\frac{1}{T}\sum_{t=1}^{T} (z_{t} - \bar{z})(y_{t} - \bar{y})}{\frac{1}{T}\sum_{t=1}^{T} (z_{t} - \bar{z})(x_{t} - \bar{x})}. \]
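Both approaches in this list can be tried on simulated data. The sketch below makes several assumptions that are not in the text: the latent regressor is drawn from an exponential distribution (so that its third central moment is non-zero), the instrument z is a hypothetical noisy function of x*, and the function names are illustrative. The moment estimator is the ratio of third-order sample moments E[(x − x̄)(y − ȳ)²] / E[(x − x̄)²(y − ȳ)], which equals β in the population when ε and η are mean-zero and independent of x* and of each other.

```python
import numpy as np

def third_moment_slope(x, y):
    """Slope from third-order sample moments; needs a skewed latent regressor."""
    xc = x - x.mean()
    yc = y - y.mean()
    return np.mean(xc * yc**2) / np.mean(xc**2 * yc)

def iv_slope(x, y, z):
    """Simple instrumental variables slope with a single instrument z."""
    zc = z - z.mean()
    return np.sum(zc * (y - y.mean())) / np.sum(zc * (x - x.mean()))

# Simulated data (all values are assumptions for illustration)
rng = np.random.default_rng(3)
n, beta = 200_000, 2.0
x_star = rng.exponential(1.0, n)                    # skewed latent regressor
x = x_star + rng.normal(0.0, 0.5, n)                # mismeasured regressor
y = 1.0 + beta * x_star + rng.normal(0.0, 0.3, n)
z = 0.8 * x_star + rng.normal(0.0, 1.0, n)          # hypothetical instrument

print("third-moment slope:", third_moment_slope(x, y))
print("IV slope:", iv_slope(x, y, z))
```

The moment estimator needs no extra data but breaks down when x* is symmetric (the denominator tends to zero), while the instrumental variables estimator needs the additional variable z.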
Multivariate linear model
The multivariate model looks exactly like the simple linear model, only this time β, η_{t}, x_{t} and x*_{t} are k×1 vectors.
The general identifiability condition for this model remains an open question. It is known however that in the case when (ε,η) are independent and jointly normal, the parameter β is identified if and only if it is impossible to find a nonsingular k×k block matrix [a A] (where a is a k×1 vector) such that a′x* is distributed normally and independently from A′x*.^{[8]}
Some of the estimation methods for multivariate linear models are:
 Total least squares is an extension of Deming regression to the multivariate setting. When all the k+1 components of the vector (ε,η) have equal variances and are independent, this is equivalent to running the orthogonal regression of y on the vector x, that is, the regression which minimizes the sum of squared distances between points (y_{t},x_{t}) and the k-dimensional hyperplane of “best fit”.
 The method of moments estimator ^{[9]} can be constructed based on the moment conditions E[z_{t}·(y_{t} − α − β'x_{t})] = 0, where the (5k+3)-dimensional vector of instruments z_{t} is defined as
This method can be extended to use moments higher than the third order, if necessary, and to accommodate variables measured without error.^{[11]}
 The instrumental variables approach requires finding additional data variables z_{t} which would serve as instruments for the mismeasured regressors x_{t}. This method is the simplest from the implementation point of view; however, its disadvantage is that it requires collecting additional data, which may be costly or even impossible. When the instruments can be found, the estimator takes the standard form
\[ \hat{\beta} = \big( X'Z (Z'Z)^{-1} Z'X \big)^{-1} X'Z (Z'Z)^{-1} Z'y, \]
where X, Z and y stack the observations x_{t}′, z_{t}′ and y_{t} row by row (with all variables demeaned, or with a constant included in both X and Z).
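Two of the multivariate estimators above, total least squares and instrumental variables, can be sketched with NumPy. The code is an illustration under stated assumptions rather than a definitive implementation: total least squares is computed from the singular value decomposition of the centered data matrix, the instrumental variables estimator is computed as two-stage least squares on demeaned data, and the function names, the instruments Z and the simulated values are all hypothetical.

```python
import numpy as np

def total_least_squares(X, y):
    """Orthogonal regression of y on X (equal, independent error variances)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    # Center so that the fitted hyperplane passes through the sample mean
    M = np.column_stack([X - x_mean, y - y_mean])
    # The right singular vector of the smallest singular value is the plane's normal
    _, _, vt = np.linalg.svd(M, full_matrices=False)
    normal = vt[-1]
    beta = -normal[:-1] / normal[-1]
    alpha = y_mean - x_mean @ beta
    return alpha, beta

def two_stage_least_squares(X, y, Z):
    """beta = (X'P X)^{-1} X'P y with P = Z (Z'Z)^{-1} Z', on demeaned data."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    Zc = Z - Z.mean(axis=0)
    # First stage: project the mismeasured regressors onto the instruments
    X_hat = Zc @ np.linalg.lstsq(Zc, Xc, rcond=None)[0]
    # Second stage: regress y on the projected regressors
    return np.linalg.lstsq(X_hat, yc, rcond=None)[0]

# Simulated data with k = 2 regressors (all values are assumptions)
rng = np.random.default_rng(4)
n, beta_true = 100_000, np.array([1.5, -0.7])
X_star = rng.normal(size=(n, 2))
y = 0.5 + X_star @ beta_true + rng.normal(0.0, 0.4, n)
X = X_star + rng.normal(0.0, 0.4, (n, 2))     # same error variance as in y
Z = X_star + rng.normal(0.0, 1.0, (n, 2))     # hypothetical instruments

alpha_tls, beta_tls = total_least_squares(X, y)
beta_iv = two_stage_least_squares(X, y, Z)
print(alpha_tls, beta_tls, beta_iv)
```

The total least squares sketch is only appropriate under the equal-variance condition stated in the list; with unequal error variances the data would first have to be rescaled.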
Nonlinear models
A generic nonlinear measurement error model takes the form
\[ y_{t} = g(x^{*}_{t}) + \varepsilon_{t}, \qquad x_{t} = x^{*}_{t} + \eta_{t}. \]
Here function g can be either parametric or nonparametric. When function g is parametric it will be written as g(x*, β).
For a general vector-valued regressor x* the conditions for model identifiability are not known. However in the case of scalar x* the model is identified unless the function g is of the “log-exponential” form ^{[12]}
and the latent regressor x* has density
where constants A,B,C,D,E,F may depend on a,b,c,d.
Despite this optimistic result, as of now no methods exist for estimating nonlinear errors-in-variables models without any extraneous information. However there are several techniques which make use of some additional data: either the instrumental variables, or repeated observations.
Instrumental variables methods
 Newey’s simulated moments method ^{[13]} for parametric models: requires that there is an additional set of observed predictor variables z_{t}, such that the true regressor can be expressed as
Repeated observations
In this approach two (or maybe more) repeated observations of the regressor x* are available. Both observations contain their own measurement errors, however those errors are required to be independent:
\[ x_{1t} = x^{*}_{t} + \eta_{1t}, \qquad x_{2t} = x^{*}_{t} + \eta_{2t}, \]
where x* ⊥ η_{1} ⊥ η_{2}. Variables η_{1}, η_{2} need not be identically distributed (although if they are, the efficiency of the estimator can be slightly improved). With only these two observations it is possible to consistently estimate the density function of x* using Kotlarski’s deconvolution technique.^{[14]}
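The idea behind the deconvolution step can be sketched for the characteristic function of x*. The code below is a rough illustration, not the estimator of the cited paper: it uses Kotlarski's identity φ_{x*}(t) = exp(∫₀ᵗ E[i x₁ e^{isx₂}] / E[e^{isx₂}] ds), replaces the expectations with sample means, and integrates with the trapezoidal rule. It assumes E[η₁] = 0, mutual independence of x*, η₁, η₂, and a characteristic function of x₂ that does not vanish on [0, t]; the function name, grid size and simulated distributions are hypothetical.

```python
import numpy as np

def kotlarski_cf(x1, x2, t, n_grid=41):
    """Estimate the characteristic function of x* at t from two noisy measurements."""
    s_grid = np.linspace(0.0, t, n_grid)
    ratio = np.empty(n_grid, dtype=complex)
    for j, s in enumerate(s_grid):
        e = np.exp(1j * s * x2)
        # Sample analogues of E[i x1 exp(i s x2)] and E[exp(i s x2)]
        ratio[j] = np.mean(1j * x1 * e) / np.mean(e)
    # Trapezoidal rule for the integral over [0, t]
    integral = np.sum((ratio[1:] + ratio[:-1]) * np.diff(s_grid)) / 2.0
    return np.exp(integral)

# Simulated check with x* ~ N(1, 1); the two errors have different distributions
rng = np.random.default_rng(5)
n = 100_000
x_star = rng.normal(1.0, 1.0, n)
x1 = x_star + rng.normal(0.0, 0.5, n)
x2 = x_star + rng.laplace(0.0, 0.5, n)

phi_hat = kotlarski_cf(x1, x2, t=1.0)
phi_true = np.exp(1j * 1.0 - 0.5)      # characteristic function of N(1, 1) at t = 1
print(phi_hat, phi_true)
```

Recovering the density of x* would additionally require evaluating this function on a grid of t values and applying a regularized inverse Fourier transform, which is omitted here.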
 Li’s conditional density method^{[15]} for parametric models. The regression equation can be written in terms of the observable variables as
Assuming for simplicity that η_{1}, η_{2} are identically distributed, this conditional density can be computed as
All densities in this formula can be estimated using inversion of the empirical characteristic functions. In particular,
 Schennach’s estimator^{[16]} for a parametric linear-in-parameters nonlinear-in-variables model. This is a model of the form
If not for the measurement errors, this would have been a standard linear model with the estimator ,
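 Schennach’s estimator ^{[17]} for a nonparametric model. The standard Nadaraya–Watson estimator for a nonparametric model takes the form
\[ \hat{g}(x) = \frac{\sum_{t} K_{h}(x^{*}_{t} - x)\, y_{t}}{\sum_{t} K_{h}(x^{*}_{t} - x)}, \]
for a suitable choice of the kernel K and the bandwidth h.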
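For reference, the standard Nadaraya–Watson estimator itself is easy to sketch. The code below is not Schennach's measurement-error-corrected estimator; it is the ordinary kernel regression with an assumed Gaussian kernel, applied once to the (normally unobservable) true regressor and once to the mismeasured one, to show why a correction is needed. The function name, bandwidth and simulated model g(x*) = sin(x*) are illustrative assumptions.

```python
import numpy as np

def nadaraya_watson(x_eval, x_obs, y_obs, bandwidth):
    """Kernel regression estimate of E[y | x] at the points x_eval."""
    x_eval = np.atleast_1d(np.asarray(x_eval, dtype=float))
    x_obs = np.asarray(x_obs, dtype=float)
    u = (x_eval[:, None] - x_obs[None, :]) / bandwidth
    # Gaussian kernel; its normalizing constant cancels in the ratio
    weights = np.exp(-0.5 * u**2)
    return weights @ y_obs / weights.sum(axis=1)

# Simulated data: y = sin(x*) + noise, with x* observed with classical error
rng = np.random.default_rng(6)
n = 5_000
x_star = rng.uniform(-3.0, 3.0, n)
y = np.sin(x_star) + rng.normal(0.0, 0.2, n)
x = x_star + rng.normal(0.0, 0.7, n)

at = np.pi / 2                                  # evaluate at the peak of sin
fit_clean = nadaraya_watson(at, x_star, y, bandwidth=0.2)[0]
fit_naive = nadaraya_watson(at, x, y, bandwidth=0.2)[0]
print(fit_clean, fit_naive)
```

In this assumed setup the fit based on the mismeasured regressor should be visibly flattened at the peak, which is the nonparametric counterpart of the attenuation seen in the linear model.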
Further reading
 Chen, Hong & Nekipelov 2009
 Söderström 2007
 An Historical Overview of Linear Regression with Errors in both Variables, J.W. Gillard 2006
Notes
 ^ Griliches & Ringstad 1970, Chesher 1991
 ^ Greene 2003, Chapter 5.6.1
 ^ Hausman 2001, p. 58
 ^ Fuller 1987, p. 2
 ^ Reiersøl 1950, p. 383. A somewhat more restrictive result was established earlier by R. C. Geary in “Inherent relations between random variables”, Proceedings of Royal Irish Academy, vol.47 (1950). He showed that under the additional assumption that (ε, η) are jointly normal, the model is not identified if and only if x*’s are normal.
 ^ Fuller 1987, ch. 1
 ^ Pal 1980, §6
 ^ Bekker 1986. An earlier proof by Y. Willassen in “Extension of some results by Reiersøl to multivariate models”, Scand. J. Statistics, 6(2) (1979) contained errors.
 ^ Dagenais & Dagenais 1997. The earlier paper (Pal 1980) considered a simpler case when all components in vector (ε, η) are independent and symmetrically distributed.
 ^ Fuller 1987, p. 184
 ^ Erickson & Whited 2002
 ^ Schennach, Hu & Lewbel 2007
 ^ Newey 2001
 ^ Li & Vuong 1998
 ^ Li 2002
 ^ Schennach 2004a
 ^ Schennach 2004b
References
 Bekker, Paul A. (1986), "Comment on identification in the linear errors in variables model", Econometrica 54 (1): 215–217, doi:10.2307/1914166, JSTOR 1914166
 Chen X., Hong H., and Nekipelov D. (2009), Nonlinear models of measurement errors, Working paper, http://www.stanford.edu/~doubleh/papers/surveyround2.pdf.
 Chesher, Andrew (1991), "The effect of measurement error", Biometrika 78 (3): 451–462, doi:10.1093/biomet/78.3.451, JSTOR 2337015
 Dagenais, Marcel G.; Dagenais, Denyse L. (1997), "Higher moment estimators for linear regression models with errors in the variables", Journal of Econometrics 76: 193–221, doi:10.1016/0304-4076(95)01789-5
 Erickson, Timothy; Whited, Toni M. (2002), "Two-step GMM estimation of the errors-in-variables model using high-order moments", Econometric Theory 18 (3): 776–799, JSTOR 3533649
 Fuller, Wayne A. (1987), Measurement error models, John Wiley & Sons, Inc, ISBN 0471861871
 Greene, William H. (2003), Econometric analysis (5th ed.), New Jersey: Prentice Hall, ISBN 0130661899, LCCN 2002 HB139.G74 2002
 Griliches, Zvi; Hausman, Jerry A. (1986), "Errors in variables in panel data", Journal of Econometrics 31 (1): 93–118, doi:10.1016/0304-4076(86)90058-8
 Griliches, Zvi; Ringstad, Vidar (1970), "Errors-in-the-variables bias in nonlinear contexts", Econometrica 38 (2): 368–370, doi:10.2307/1913020, JSTOR 1913020
 Hausman, Jerry A. (2001), "Mismeasured variables in econometric analysis: problems from the right and problems from the left", The Journal of Economic Perspectives 15 (4): 57–67, doi:10.1257/jep.15.4.57, JSTOR 2696516
 Hong, Han; Tamer, Elie (2003), "A simple estimator for nonlinear error in variable models", Journal of Econometrics 117 (1): 1–19, doi:10.1016/S0304-4076(03)00116-7
 Jung, Kang-Mo (2007), "Least Trimmed Squares Estimator in the Errors-in-Variables Model", Journal of Applied Statistics 34 (3): 331–338, doi:10.1080/02664760601004973
 Kummell, C. H. (1879), "Reduction of observation equations which contain more than one observed quantity", The Analyst 6 (4): 97–105, doi:10.2307/2635646, JSTOR 2635646
 Li, Tong (2002), "Robust and consistent estimation of nonlinear errors-in-variables models", Journal of Econometrics 110 (1): 1–26, doi:10.1016/S0304-4076(02)00120-3
 Li, Tong; Vuong, Quang (1998), "Nonparametric estimation of the measurement error model using multiple indicators", Journal of Multivariate Analysis 65 (2): 139–165, doi:10.1006/jmva.1998.1741
 Newey, Whitney K. (2001), "Flexible simulated moment estimation of nonlinear errors-in-variables model", The Review of Economics and Statistics 83 (4): 616–627, doi:10.1162/003465301753237704, JSTOR 3211757
 Pal, Manoranjan (1980), "Consistent moment estimators of regression coefficients in the presence of errors in variables", Journal of Econometrics 14 (3): 349–364, doi:10.1016/0304-4076(80)90032-9
 Reiersøl, Olav (1950), "Identifiability of a linear relation between variables which are subject to error", Econometrica 18 (4): 375–389, doi:10.2307/1907835, JSTOR 1907835
 Schennach, Susanne M. (2004), "Estimation of nonlinear models with measurement error", Econometrica 72 (1): 33–75, doi:10.1111/j.1468-0262.2004.00477.x, JSTOR 3598849.
 Schennach, Susanne M. (2004), "Nonparametric regression in the presence of measurement error", Econometric Theory 20 (6): 1046–1093, doi:10.1017/S0266466604206028.
 Schennach S., Hu Y., Lewbel A. (2007), Nonparametric identification of the classical errors-in-variables model without side information, Working paper, http://escholarship.bc.edu/cgi/viewcontent.cgi?article=1433&context=econ_papers.
 Söderström, Torsten (2007), "Errors-in-variables methods in system identification", Automatica 43 (6): 939–958, doi:10.1016/j.automatica.2006.11.025
Wikimedia Foundation. 2010.