RANDOM MATRICES AND HIGHER DIMENSIONAL INFERENCE

Organizers: Peter Bickel, Christopher Jones, Helene Massam, and Don Richards

April 9, 2007--April 13, 2007

A random variable is a function whose values cannot be predicted with complete certainty. Examples of random variables are the highest daily temperature, the weight of a newborn baby, and the daily closing value of the Dow Jones Industrial Average. A random vector is a vector all of whose entries are random variables, and a random matrix is a matrix all of whose rows or columns are random vectors.

This AIM workshop is a follow-up to a semester-long event at the Statistical and Applied Mathematical Sciences Institute (SAMSI), located in Research Triangle Park in North Carolina. The workshop was dedicated to the study of random matrices and their applications in a variety of real-world problems. Particular emphasis was put on problems that give rise to data sets of large sample sizes, high-dimensional vectors, and commensurately large random matrices.

Among the working groups at SAMSI, and at this workshop, were

  • Climate and Weather
  • Wireless Communication
  • Universality
  • Regularization and Covariance
  • Geometric Methods
  • Multivariate Distributions
  • Graphical Models/Bayesian Methods
  • Estimating Functionals of High-Dimensional Sparse Vectors

The examples that we shall describe in detail below draw on ideas from several of these groups.

When confronted with a large random matrix constructed from high-dimensional data, statisticians often perform a "principal component analysis" (PCA) of the data. In principal component analysis, the eigenvalues and eigenvectors of the random matrix are used to develop low-dimensional approximations to the high-dimensional data. Principal component analysis is widely used in applications such as meteorology, the analysis of income tax returns, and the design of Internet search engines. The Climate and Weather working group focused primarily on principal component analysis and its applications to the detection and attribution of climate change.

Graphical models are statistical models designed to analyse complex high-dimensional data. Graphical models often are characterized by the nature of their covariance matrices or the inverse of their covariance matrices. The working group on Graphical Models/Bayesian Methods developed methods for reducing the dimension and the number of parameters of high-dimensional data sets.
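
As a small illustration of how the inverse covariance matrix can characterize a model, the sketch below uses a hypothetical chain of d variables in which the correlation between variables i and j is rho^|i-j|. The values of d and rho, and all variable names, are assumptions chosen for the example and do not come from the workshop. The covariance matrix has no zero entries, yet its inverse is tridiagonal, so far fewer parameters are needed to describe it.

```python
import numpy as np

# Hypothetical chain model: correlation between variables i and j is rho**|i - j|.
d, rho = 6, 0.7
idx = np.arange(d)
gap = np.abs(idx[:, None] - idx[None, :])

covariance = rho ** gap                    # dense: every entry is nonzero
precision = np.linalg.inv(covariance)      # inverse covariance matrix

# Entries of the inverse more than one step off the diagonal vanish
# (up to rounding), reflecting the chain structure of the model.
max_off_band = float(np.max(np.abs(precision[gap > 1])))
nonzero_count = int(np.sum(np.abs(precision) > 1e-10))

print("largest entry outside the tridiagonal band:", max_off_band)
print("nonzero entries in the inverse:", nonzero_count, "out of", d * d)
```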

The Internal Revenue Service (IRS) of the United States Government is one of the largest users of linear algebra. The IRS may view an individual taxpayer as a large vector with d entries, where d can be a fairly large positive integer (at least 100). The entries in this vector may include zip code, income, number of income streams, and other pieces of data that are relevant to the collection of taxes. Call this vector X.

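If X_1, X_2, ..., X_N are N different taxpayers, then the first piece of statistical data that may be derived from this set is the sample mean

\bar{X} = \frac{1}{N} \sum_{j=1}^{N} X_j .

Then we can define the sample covariance matrix (left unnormalized here; dividing by N - 1 gives the usual unbiased estimate)

M = \sum_{j=1}^{N} (X_j - \bar{X})(X_j - \bar{X})^T .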

We use here the superscript T to denote the transpose of a vector. Thus M is a d × d matrix. It is known that, if N is larger than d, then under mild conditions (for example, when the data vectors are drawn independently from a continuous distribution) M is positive definite almost surely. Here "almost surely" is probabilistic jargon that means that the event occurs with probability 1. In other words, it is a sure bet that this event will occur. A symmetric matrix A is positive definite if x^T A x > 0 for every nonzero vector x. Equivalently, a symmetric matrix A is positive definite if the determinant of each square upper-left submatrix is positive. For normally distributed data, the study of the distribution of M gives rise to the so-called Wishart distribution.
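
The definitions above can be checked numerically. The following is a minimal sketch on synthetic Gaussian data, not actual taxpayer data. The sizes N and d, the random seed, and all variable names are assumptions made for the example. It forms the sample mean and the matrix M, then tests positive definiteness both through the eigenvalues and through the determinants of the upper-left submatrices.

```python
import numpy as np

# Hypothetical synthetic data: N "taxpayers", each a vector with d entries.
rng = np.random.default_rng(0)
N, d = 500, 5
X = rng.normal(size=(N, d))            # row j plays the role of the vector X_j

# Sample mean: (1/N) * sum over j of X_j
X_bar = X.mean(axis=0)

# M = sum over j of (X_j - X_bar)(X_j - X_bar)^T, a d x d matrix
centered = X - X_bar
M = centered.T @ centered

# Test positive definiteness in two ways.
eigenvalues = np.linalg.eigvalsh(M)    # real eigenvalues, ascending order
leading_minors = [np.linalg.det(M[:k, :k]) for k in range(1, d + 1)]

is_pd_by_eigenvalues = bool(np.all(eigenvalues > 0))
is_pd_by_minors = all(minor > 0 for minor in leading_minors)

print("positive definite (eigenvalues):", is_pd_by_eigenvalues)
print("positive definite (upper-left determinants):", is_pd_by_minors)
```

Dividing M by N - 1 reproduces what `np.cov(X, rowvar=False)` returns.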

If the IRS determines that your vector X of taxpayer information deviates markedly from the average or mean of other taxpayers in your zip code, then it concludes that there is something unusual about your income tax status. Thus you are more likely to be the subject of an audit.

The IRS has various means of calculating your deviation from the mean. One of these, the previously mentioned technique of principal component analysis, is to calculate the eigenvalues λ_j and eigenvectors v_j of the matrix M. Since this matrix is (almost surely) positive definite, it has a full set of positive eigenvalues. Using the philosophy of the finite element method, the IRS will choose a fixed number of these, say 20 (typically those belonging to the largest eigenvalues), and calculate the inner products ⟨X, v_j⟩, j = 1, ..., 20. These numbers can of course be used to approximate your vector X by a linear combination of the corresponding eigenvectors.

If your X is well approximated by a linear combination of the eigenvectors v_j, then your data fits in well with that of the chosen population (in your zip code). So you fit the profile of an "average taxpayer" and it is not likely that you will get an audit. If instead your X is not well approximated by the eigenvectors, then you and your data deviate from the population and your likelihood of an audit is much higher. What is lurking in the background here is a sophisticated idea from Hilbert space theory called the spectral theorem. The spectral theorem expresses a self-adjoint linear operator on a Hilbert space, up to unitary equivalence, as a multiplication operator on a space of functions.
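
The approximation test described above can be sketched as follows. This is an illustration under stated assumptions, not a description of any procedure the IRS actually uses. The synthetic population, the choice of k = 2 retained eigenvectors, the subtraction of the sample mean before projecting, and the function name `residual_norm` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, k = 400, 6, 2

# Hypothetical population whose variation lies mostly in k directions.
basis = np.linalg.qr(rng.normal(size=(d, k)))[0]   # d x k, orthonormal columns
scores = rng.normal(scale=5.0, size=(N, k))
population = scores @ basis.T + 0.1 * rng.normal(size=(N, d))

mean = population.mean(axis=0)
centered = population - mean
M = centered.T @ centered

# eigh returns eigenvalues in ascending order; keep the k largest.
eigenvalues, eigenvectors = np.linalg.eigh(M)
V = eigenvectors[:, ::-1][:, :k]                   # columns v_1, ..., v_k

def residual_norm(x):
    """Norm of what is left of x - mean after projecting onto v_1, ..., v_k."""
    z = x - mean                                   # assumption: center first
    coefficients = V.T @ z                         # inner products <z, v_j>
    approximation = V @ coefficients               # linear combination of the v_j
    return float(np.linalg.norm(z - approximation))

typical = mean + basis @ np.array([3.0, -2.0])     # lies in the main directions
atypical = mean + 5.0 * eigenvectors[:, 0]         # along the weakest direction

typical_residual = residual_norm(typical)
atypical_residual = residual_norm(atypical)
print("typical residual:", typical_residual)
print("atypical residual:", atypical_residual)
```

A small residual means the vector is well approximated by the retained eigenvectors. A large one flags a vector that deviates from the population.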

At the SAMSI event, and at this AIM workshop, this group studied high-dimensional random matrices. For example, the data vectors X that would come from a question in the Human Genome Project would typically have 10,000 pieces of information. This would give rise to a very large covariance matrix. The corresponding calculations and analysis are orders of magnitude more difficult than those that correspond to small X and small M.

One of the matters of interest is discriminant analysis. This is a device for classifying an X and for determining the probability that it is misclassified. For example, imagine that there are two populations of citizens in a neighborhood: group A, consisting of those who work as executives for big corporations, and group B, consisting of those who are freelance consultants. The members of the first group have regular, fairly large salaries. The members of the second group have an irregular income stream of widely varying magnitude. Obviously these two different types of citizens will have different tax characteristics. Given a citizen's tax vector X, we want to be able to determine analytically whether X belongs in A or in B. Thus one needs a metric, or a notion of distance, to determine which of A or B the vector X is closer to. And this metric cannot be the standard isotropic Euclidean metric, which treats each coordinate in the same way. Some tax data will count more than other tax data, so one requires a non-isotropic metric that weights the different pieces of data differently. The positive definite matrix M gives a device for constructing such a metric.
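
One standard way to build a non-isotropic metric from a covariance matrix is the Mahalanobis distance, which measures the deviation from a group mean in units set by the inverse covariance. The text does not say which metric the working groups used, so the sketch below is an assumption. The two-feature data, the group parameters, and the helper names `fit`, `mahalanobis`, and `classify` are likewise hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical features: (average income, income variability), arbitrary units.
group_a = rng.normal(loc=[10.0, 1.0], scale=[1.0, 0.2], size=(300, 2))  # steady salaries
group_b = rng.normal(loc=[8.0, 4.0], scale=[3.0, 1.0], size=(300, 2))   # irregular income

def fit(group):
    """Return the sample mean and the normalized sample covariance of a group."""
    mean = group.mean(axis=0)
    covariance = np.cov(group, rowvar=False)   # scatter matrix divided by N - 1
    return mean, covariance

def mahalanobis(x, mean, covariance):
    """sqrt((x - mean)^T covariance^{-1} (x - mean))"""
    diff = x - mean
    return float(np.sqrt(diff @ np.linalg.solve(covariance, diff)))

def classify(x, model_a, model_b):
    """Assign x to whichever group is closer in its own Mahalanobis distance."""
    return "A" if mahalanobis(x, *model_a) <= mahalanobis(x, *model_b) else "B"

model_a = fit(group_a)
model_b = fit(group_b)

label_1 = classify(np.array([10.2, 1.1]), model_a, model_b)
label_2 = classify(np.array([7.0, 4.5]), model_a, model_b)
print(label_1, label_2)
```

Because each coordinate is rescaled by the spread of the group, a deviation of one unit in a tightly clustered coordinate counts for more than the same deviation in a widely scattered one.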

The analytical questions give rise to subtle considerations in Riemannian geometry. One wishes to know how to calculate distances, and geodesics, in a space of positive definite matrices. The resulting analyses draw on many parts of modern mathematics.
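
To make the idea of distances and geodesics between positive definite matrices concrete, the sketch below uses the affine-invariant Riemannian metric, a common choice. The text does not say which metric was studied, so this choice is an assumption. The function names and the example matrices are also hypothetical. Under this metric the distance between A and B is the square root of the sum of squared logarithms of the eigenvalues of A^{-1} B, and the geodesic from A to B is A^{1/2} (A^{-1/2} B A^{-1/2})^t A^{1/2} for t between 0 and 1.

```python
import numpy as np

def _sym_power(S, p):
    """Power of a symmetric positive definite matrix via its eigendecomposition."""
    w, U = np.linalg.eigh(S)
    return (U * w**p) @ U.T

def spd_distance(A, B):
    """Affine-invariant distance between positive definite matrices A and B."""
    L = np.linalg.cholesky(A)                  # A = L L^T
    Y = np.linalg.solve(L, B)                  # L^{-1} B
    C = np.linalg.solve(L, Y.T)                # L^{-1} B L^{-T}, since B is symmetric
    C = (C + C.T) / 2.0                        # remove rounding asymmetry
    lam = np.linalg.eigvalsh(C)                # same eigenvalues as A^{-1} B
    return float(np.sqrt(np.sum(np.log(lam) ** 2)))

def spd_geodesic(A, B, t):
    """Point at parameter t on the geodesic from A (t = 0) to B (t = 1)."""
    A_half = _sym_power(A, 0.5)
    A_inv_half = _sym_power(A, -0.5)
    middle = _sym_power(A_inv_half @ B @ A_inv_half, t)
    return A_half @ middle @ A_half

A = np.array([[2.0, 0.5], [0.5, 1.0]])
B = np.array([[1.0, 0.2], [0.2, 3.0]])

distance_ab = spd_distance(A, B)
midpoint = spd_geodesic(A, B, 0.5)
print("distance:", distance_ab)
print("distance from A to the midpoint:", spd_distance(A, midpoint))
```

The distance is unchanged when both matrices are replaced by G A G^T and G B G^T for any invertible G, which is why the metric is called affine-invariant.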

The Wireless Communication group at this workshop is concerned with using Bessel functions and other sophisticated notions from classical analysis to design multiple-input, multiple-output channels for cell phones. It uses covariance analysis to carry out these studies.

The Multivariate Distributions group wants to apply these techniques in medical studies. For example, in the testing of a new drug each X represents a patient. Many of the people who volunteer for a drug study are unreliable, so the data sets that arise from the study have gaps in them. The random matrix techniques provide methods for interpolating across these gaps and still drawing meaningful conclusions.

An important message here is that random matrix theory is a burgeoning and developing part of modern mathematics. It is used in many different disciplines, ranging from number theory to mathematical physics to probability. Moreover, it is used decisively in statistical studies as indicated in the present discussion. The subject is a source of new problems and new research directions, and should serve as an attractive venue for young researchers.