Advanced DSP
Algebra of Random Variables
Lecture 6
Conducted by: Udayan Kanade

When we say the probability of an event is 0.2, we are saying that of all the futures that we can imagine, 20% of the futures will have that event happening. Why we believe this is our problem.

A random variable X is an unknown real number sealed in a box. As long as it is unknown, our belief about it may be quantified as a probability distribution – our understanding of how many futures hold a particular value for the X box. The average over all the futures, called the average future, or, funnily enough, the expected value, is denoted E[X].

Suppose we have two secrets in two boxes, X and Y, which are allowed to collude with each other before we have opened either. The situation has to be modeled with a joint probability distribution – the description of how many futures hold a particular pair of values for (X, Y). From this description we can find the probability distribution of a single random variable, say X, which is called marginalization. If we open, say, the Y box, our expectations with respect to X will change. These changed expectations are called X|Y=y (“X given Y equals y”) or simply X|y. All that has happened is that our set of feasible futures has shrunk. (A small numerical sketch of marginalization and conditioning appears below.)

The random variable X+Y is defined as a procedure which will look inside the X box and the Y box, add up the values and keep the sum with itself. You may choose to open it – it will not tell you the values of X and Y, only their sum. Similarly we can define log(X) or XY, etc. as random variables. The expected value of any such function of random variables can be found directly from the joint probability distribution of those random variables; we do not need to histogram the function itself.

It is easy to see that E[X+Y] = E[X] + E[Y]. Average the sum, or sum the averages – it is the same thing. Similarly, it is obvious that E[aX] = aE[X]. Together these properties are called linearity of expectations.

Suppose we are asked to give a real number a which is our best possible estimate of X. Once we know X = x, our estimation error will be x - a. The closer this is to zero, the happier we will be. Suppose we become as sad as (x - a)². Since x is not known, this future sadness is itself a random variable, (X - a)². To design a we try to minimize the average future of this future sadness, i.e. minimize E[(X - a)²]. By linearity of expectations this is E[X²] - 2aE[X] + a²; differentiating with respect to a and setting 2a - 2E[X] to zero gives a = E[X]. Thus, the expected value of a random variable is the best single-number predictor.

Suppose we are allowed to open Y, get y, and then estimate X from it. The best single-number estimate will obviously be E[X|y]. Suppose instead I want an estimate of the form cy, where the constant c should be determined even before Y is opened. After X is opened, we will be as sad as (x - cy)². While deciding c we know neither Y nor X, making our future sadness the random variable (X - cY)². Its expected value is E[X²] - 2cE[XY] + c²E[Y²]; minimizing with respect to c gives c = E[XY]/E[Y²]. This looks so much like the single-vector least squares modeling formula that we claim it is exactly that.
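The joint-distribution machinery above is easy to play with numerically. The following is a minimal Python/NumPy sketch with a small made-up 3x3 joint pmf for (X, Y) – the numbers are purely illustrative, not from the lecture. It marginalizes, conditions on an opened Y box, computes E[X+Y] and E[XY] straight from the joint distribution, and checks linearity of expectations.

    import numpy as np

    # Made-up joint pmf: joint[i, j] = P(X = x_vals[i], Y = y_vals[j])
    x_vals = np.array([0.0, 1.0, 2.0])
    y_vals = np.array([0.0, 1.0, 2.0])
    joint = np.array([[0.10, 0.05, 0.05],
                      [0.05, 0.20, 0.15],
                      [0.05, 0.15, 0.20]])
    assert np.isclose(joint.sum(), 1.0)

    # Marginalization: count the futures for X regardless of what Y holds
    p_x = joint.sum(axis=1)                    # P(X = x)
    p_y = joint.sum(axis=0)                    # P(Y = y)
    E_X = np.dot(x_vals, p_x)                  # the "average future" of X
    E_Y = np.dot(y_vals, p_y)

    # Conditioning: open the Y box, keep only the compatible futures, renormalize
    j = 2                                      # suppose the Y box held y_vals[2]
    p_x_given_y = joint[:, j] / p_y[j]
    E_X_given_y = np.dot(x_vals, p_x_given_y)

    # Functions of random variables: E[X+Y], E[XY], etc. come straight from the
    # joint pmf; no separate histogram of X+Y or XY is needed
    Xg, Yg = np.meshgrid(x_vals, y_vals, indexing="ij")
    E_X_plus_Y = ((Xg + Yg) * joint).sum()
    E_XY = ((Xg * Yg) * joint).sum()

    # Linearity of expectations
    assert np.isclose(E_X_plus_Y, E_X + E_Y)                # average the sum = sum the averages
    assert np.isclose((3.0 * Xg * joint).sum(), 3.0 * E_X)  # E[aX] = a E[X]

    print(E_X, E_X_given_y, E_XY)

Opening the Y box moves E[X] = 1.2 to E[X|y] = 1.375 in this toy example: the feasible future set has shrunk, and the average over it has shifted.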
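Here is a small numerical check of the two estimators just derived, using Monte Carlo draws of a made-up correlated pair (X, Y) in place of the lecture's boxes; sample averages stand in for expected values. It confirms that a = E[X] beats nearby constants and that c = E[XY]/E[Y²] beats nearby scales, in the mean-square-sadness sense.

    import numpy as np

    rng = np.random.default_rng(0)

    # Made-up correlated pair: Y is standard normal, X leans on Y plus noise
    n = 200_000
    y = rng.standard_normal(n)
    x = 0.7 * y + 0.3 * rng.standard_normal(n) + 1.0

    def mean_sadness(err):
        """Average future sadness: the mean of the squared estimation errors."""
        return np.mean(err ** 2)

    # Best single-number predictor of X: a = E[X]
    a = x.mean()
    for a_try in (a - 0.1, a, a + 0.1):
        print("a =", round(a_try, 3), " E[(X-a)^2] ~", round(mean_sadness(x - a_try), 4))

    # Best predictor of the form cY, with c fixed before either box is opened:
    # c = E[XY] / E[Y^2]
    c = np.mean(x * y) / np.mean(y ** 2)
    for c_try in (c - 0.1, c, c + 0.1):
        print("c =", round(c_try, 3), " E[(X-cY)^2] ~", round(mean_sadness(x - c_try * y), 4))

Since E[Y] = 0 for this made-up Y, c comes out close to the 0.7 by which X actually leans on Y, and the middle entry of each printout is the smallest.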
In a slightly bizarre vector space, random variables are vectors (they can be added and scaled naturally enough...). Their dot product is E[XY]. Their energy is E[X²]. Orthogonality is defined as E[XY] = 0. The same dot-product-divided-by-energy formula gives the best scale of a random variable to fit another – in the “least squares” (actually minimum mean square error) sense. The error is orthogonal to the modeling random variable, and the Pythagoras theorem holds. Since all we used in the previous lectures were the above properties of vectors, we already have ready with us the modeling of a random variable from a bunch of others, the algorithms, and what to do when the random variables are (!!) Toeplitz – that is, when their correlation matrix is Toeplitz. (Both points are sketched at the end of this note.)

Links:
Last year's lecture: Probability and Statistics
Last year's lecture: Linear Prediction
Note: MMSE Linear Estimation

Relations:
Because of the vector way of looking at random variables, estimating a random variable from another becomes an application of the dot product. The MMSE estimate can be extended to multiple variables.
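A quick numerical check of the orthogonality and Pythagoras claims, in the inner product defined by E[UV], again with sample averages over a made-up correlated pair standing in for the expectations.

    import numpy as np

    rng = np.random.default_rng(1)

    # Made-up correlated pair; sample averages stand in for expected values
    n = 500_000
    y = rng.standard_normal(n)
    x = 0.7 * y + 0.3 * rng.standard_normal(n)

    def dot(u, v):
        # the dot product of two random variables: E[UV]
        return np.mean(u * v)

    def energy(u):
        # the energy of a random variable: E[U^2]
        return dot(u, u)

    c = dot(x, y) / energy(y)       # best scale of Y fitting X
    model = c * y
    error = x - model

    print("E[(X - cY) Y]        ~", dot(error, y))                    # orthogonality: ~ 0
    print("E[X^2]               ~", energy(x))
    print("E[(cY)^2] + E[err^2] ~", energy(model) + energy(error))    # Pythagoras

Both identities hold to floating-point precision here, because c is computed from the same sample averages that define the inner product.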
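And a sketch of the multiple-variable and Toeplitz cases: modeling X from a bunch of others Y1..Yp means solving the normal equations R c = b, with R[j, k] = E[Yj Yk] and b[j] = E[Yj X]. When the Yj are consecutive samples of a stationary signal, R is Toeplitz and a Levinson-style solver applies. This sketch assumes SciPy's scipy.linalg.solve_toeplitz is available, and the AR(2) signal is made up for illustration.

    import numpy as np
    from scipy.linalg import solve_toeplitz    # Levinson-style Toeplitz solver

    rng = np.random.default_rng(2)

    # Made-up stationary signal: an AR(2) process, so prediction has something to find
    n, p = 50_000, 4
    e = rng.standard_normal(n)
    s = np.zeros(n)
    for i in range(2, n):
        s[i] = 1.5 * s[i - 1] - 0.7 * s[i - 2] + e[i]

    # Sample autocorrelation r[k] ~ E[ s[i] s[i+k] ] stands in for the true expectations
    r = np.array([np.mean(s[: n - k] * s[k:]) for k in range(p + 1)])

    # Model s[i] from its p previous samples: normal equations R c = rhs,
    # with R[j, k] = E[ s[i-1-j] s[i-1-k] ] = r[|j-k|], a Toeplitz matrix
    R = np.array([[r[abs(j - k)] for k in range(p)] for j in range(p)])
    rhs = r[1 : p + 1]

    c_general  = np.linalg.solve(R, rhs)       # works for any correlation matrix
    c_toeplitz = solve_toeplitz(r[:p], rhs)    # exploits the Toeplitz structure

    print(np.allclose(c_general, c_toeplitz))  # same coefficients, cheaper route
    print(c_general)                           # close to the AR coefficients 1.5, -0.7, 0, 0

The error of this multi-variable model is again orthogonal to every one of the modeling variables; the normal equations say exactly that.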