High-Dimensional Statistics
High-dimensional statistics is the branch of mathematical and applied multivariate statistics that deals with statistical data whose dimension is so large that it is comparable in magnitude to the sample size and may be much greater.
History
The classical Fisher approach to statistics is based on the concept of a fixed population and a fixed model whose parameters can be determined ever more precisely as data accumulate. The main requirement on estimators is consistency, that is, convergence to the unknown true population parameters. Well-known classical statistical procedures provide satisfactory solutions for "p"-dimensional data only for samples of size "n" much greater than "p". Meanwhile, in multivariate statistics, practitioners often encounter situations in which the programs included in most standard statistical packages prove inefficient and do not guarantee stable results. The existing theory could recommend nothing other than ignoring part of the data in the hope of obtaining a plausible solution.
In 1968 A.N. Kolmogorov proposed a different setting for statistical problems and a different asymptotics, in which the dimension of the variables "p" increases along with the sample size "n" so that the ratio "p"/"n" tends to a constant. It was called “the increasing dimension asymptotics” or “the Kolmogorov asymptotics” (see [1]). This approach makes it easy to isolate the principal terms of error probabilities and of standard quality functions for large "p" and "n". On the other hand, the basic concept of traditional statistics changes: interest in estimating separate parameters and in consistency is replaced by the maximization of quality functions in the sense of Wald's optimal decision rules.
Mathematical Theory
Extensive mathematical investigations were carried out that resulted in a systematic theory of improved and asymptotically unimprovable versions of multivariate statistical procedures (see references at URL [2]). A special parameter "G", a function of the fourth moments of the variables, was found, whose small values give rise to a number of specifically many-parametric phenomena. When "p" and "n" increase so that "p"/"n" tends to a constant and "G" → 0, the principal terms of rotation-invariant functionals occurring in statistics prove to depend on only the first two moments of the variables. As "n" and "p" tend to infinity with "p"/"n" → "y" > 0 and "G" → 0, these functionals have vanishing variance and converge to constants that are limit functions of empirical means and variances. As a consequence, stable integral relations arise between functions of the parameters and functions of the observable variables. They were called “stochastic canonical equations” or “dispersion equations” (see [3]). Using them, one can express the principal parts of standard quality functions of regularized multivariate statistical procedures as functions of the observed variables only. This makes it possible to choose better procedures and to find asymptotically unimprovable solutions.
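The convergence of a rotation-invariant functional to a constant can be checked numerically in the simplest case. The sketch below is an illustration under stated assumptions, not the general theory of [3]: it assumes independent standard normal variables (identity covariance, zero mean known in advance) and takes as the functional the normalized trace h = (1/"p") tr (I + tS)^(-1) of the regularized inverse of the sample covariance matrix S. For this special case the limit is known to satisfy the Marchenko-Pastur-type relation h = 1 / (1 + t(1 - y + yh)), used here as a stand-in for a dispersion equation. The function names are hypothetical.

```python
import math
import random


def sample_covariance(n, p, rng):
    """S = X^T X / n for n independent N(0, I_p) observations (zero mean assumed known)."""
    rows = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    cols = list(zip(*rows))  # p columns, each of length n
    S = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i, p):
            S[i][j] = S[j][i] = sum(a * b for a, b in zip(cols[i], cols[j])) / n
    return S


def normalized_resolvent_trace(S, t):
    """(1/p) tr (I + t S)^{-1}, computed by Gauss-Jordan elimination.

    For t >= 0 the matrix I + t S is symmetric positive definite, so every
    pivot is positive and no row exchanges are needed.
    """
    p = len(S)
    # Augmented matrix [I + t S | I]; the right half becomes the inverse.
    A = [[float(i == j) + t * S[i][j] for j in range(p)]
         + [float(i == j) for j in range(p)]
         for i in range(p)]
    for k in range(p):
        pivot = A[k][k]
        A[k] = [v / pivot for v in A[k]]
        for i in range(p):
            if i != k:
                f = A[i][k]
                A[i] = [vi - f * vk for vi, vk in zip(A[i], A[k])]
    return sum(A[i][p + i] for i in range(p)) / p


def limit_value(t, y):
    """Positive root h of t*y*h^2 + (1 + t*(1 - y))*h - 1 = 0 (requires t > 0, y > 0).

    This is the same as h = 1 / (1 + t*(1 - y + y*h)).
    """
    b = 1.0 + t * (1.0 - y)
    return (-b + math.sqrt(b * b + 4.0 * t * y)) / (2.0 * t * y)


if __name__ == "__main__":
    rng = random.Random(0)
    n, p, t = 120, 60, 1.0
    S = sample_covariance(n, p, rng)
    print("empirical functional:", normalized_resolvent_trace(S, t))
    print("limit for y = p/n   :", limit_value(t, p / n))
    print("fixed-dimension value 1/(1+t):", 1.0 / (1.0 + t))
```

With "p"/"n" = 0.5 and t = 1 the limit is about 0.5616, whereas the fixed-dimension value obtained by substituting the true covariance is 0.5. The gap does not shrink as "n" grows with "p"/"n" held fixed, which is the kind of many-parametric effect the theory accounts for.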
More Efficient Methods
A number of more efficient “essentially multivariate” statistical procedures were suggested that have clear advantages over traditional consistent ones: they never degenerate, are applicable to observations of any dimension, and are approximately unimprovable for a wide class of populations. This method of statistical investigation was called “essentially multivariate analysis”, and the approach was called “multiparametric statistics” or “high-dimensional statistics”.
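What "never degenerate" means can be illustrated with a ridge-type regularization, chosen here only as a familiar example and not as one of the specific procedures referred to above. When "p" > "n", there is always a direction v orthogonal to every observation, so the sample covariance matrix S satisfies v'Sv = 0 and cannot be inverted. Adding αI with α > 0 keeps the quadratic form at least α in every direction. The sketch assumes zero-mean data, and the helper names are hypothetical.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def null_direction(X, rng):
    """Unit vector orthogonal to every row of X; one exists whenever p > n."""
    basis = []
    for x in X:  # Gram-Schmidt orthonormalization of the observations
        w = list(x)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > 1e-12:
            basis.append([wi / norm for wi in w])
    v = [rng.gauss(0.0, 1.0) for _ in range(len(X[0]))]
    for b in basis:  # remove the part of v lying in the span of the data
        c = dot(v, b)
        v = [vi - c * bi for vi, bi in zip(v, b)]
    norm = math.sqrt(dot(v, v))
    return [vi / norm for vi in v]


def quadratic_form(X, v, alpha=0.0):
    """v^T (S + alpha I) v with S = X^T X / n (zero mean assumed)."""
    n = len(X)
    return sum(dot(x, v) ** 2 for x in X) / n + alpha * dot(v, v)


if __name__ == "__main__":
    rng = random.Random(1)
    n, p = 20, 50  # more variables than observations
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    v = null_direction(X, rng)
    print("v' S v           :", quadratic_form(X, v))             # zero up to rounding
    print("v' (S + 0.1 I) v :", quadratic_form(X, v, alpha=0.1))  # equals alpha up to rounding
```

The regularization parameter α is a free choice in this sketch.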
New Regions of Applications
Meanwhile, in the last decade, progress in computer technology has brought forward a number of new and urgent statistical problems in which the dimension of the observations "p" is so high that it is much larger than "n". In this situation, none of the existing multivariate procedures (including the improved ones) provides a satisfactory solution, even under the strong assumption that the variables are independent. Such problems arise from the need to process huge amounts (terabytes) of genetic information, in image analysis, in natural-language text analysis, and in other applications. Theoretical and practical aspects of analyzing high-dimensional data were discussed intensively at a number of seminars and workshops [4–7]. Remarkable progress has been achieved both in the development of methods and in practical applications. Nevertheless, no regular methods yet exist for the efficient treatment of so many variables. This field of statistical investigation has acquired the generally accepted name “High-Dimensional Statistics” or “HD-Statistics” (see [4–7] and references at URL [2]).
REFERENCES
1. S. A. Aivasian, V. M. Buchstaber, I. S. Yenyukov, L. D. Meshalkin. Applied Statistics. Classification and Reduction of Dimensionality. Moscow, 1989 (in Russian).
2. URL [hd-stat.narod.ru]
3. V. L. Girko. Canonical Stochastic Equations, vol. 1, 2. Kluwer Academic Publishers, Dordrecht, 2000.
4. Program on High-Dimensional Inference for 2006-2007. SAMSI, USA.
5. Workshop in High-Dimensional Data Analysis, National University of Singapore. February 2008.
6. Workshops HD-statistics in biology, Isaac Newton Inst. for Math. Sci., Cambridge. 31.03-27.06 2008.
7. Young European Statistics Workshop (YES-2), Eindhoven, Netherlands. June 2008.
Wikimedia Foundation. 2010.