As I mentioned earlier, I’ve been reading a little about the probabilistic method, especially in combinatorics and related fields, as a result of a course I’m TAing. This material is certainly fundamental to a lot of mathematics and computer science, but I’ve never really studied probability. I’ve been trying to change that. The material in this post is from Alon and Spencer’s The Probabilistic Method.

 

1. First moments

For the purposes of the probabilistic method, we often want to show that a certain event is possible, which is done by showing that it occurs with positive probability as the outcome of a random process. Let {X} be a nonnegative random variable on a probability space {\Omega}. A basic observation is that:

\displaystyle \mathbb{P}(X \geq \mathbb{E}( X)) > 0.

that is, a random variable takes a value at least as large as its expectation with positive probability. This alone is sufficient to obtain many interesting results.
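
To make this concrete, here is a minimal numerical sketch (the distribution below is an arbitrary choice of mine, not something from Alon and Spencer): on a finite probability space some outcome must be at least the mean, so the event {X \geq \mathbb{E}(X)} has positive probability.

values = [0, 1, 5]          # hypothetical outcomes of X
probs = [0.7, 0.2, 0.1]     # their probabilities

mean = sum(v * p for v, p in zip(values, probs))
p_at_least_mean = sum(p for v, p in zip(values, probs) if v >= mean)
print(mean, p_at_least_mean)   # 0.7 and 0.3; the latter is positive, as claimed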

Example: Let’s go back to the problem of bounding the Ramsey numbers from below. Consider a random 2-coloring of the edges of the complete graph {K_N}, let {X} be the random variable counting the monochromatic {k}-cliques, and set {Y = N - X}. We note that if we remove one vertex of {K_N} from each monochromatic {k}-clique, we obtain a 2-colored complete graph on at least {Y} vertices with no monochromatic {k}-clique.

In particular, for any 2-coloring of {K_N}, we can find a 2-coloring of {K_{Y}} with no monochromatic {k}-clique. Moreover, we find that:

\displaystyle \mathbb{E}(Y) = N - \mathbb{E}(X) = N - \sum_{A \subset [N], |A| = k} 2^{ 1 - \binom{k}{2}},

where we note that {X} is the sum of the indicator random variables for the {\binom{N}{k}} events “{A} is monochromatic,” each of which occurs with probability {2^{1 - \binom{k}{2}}}. It follows that {Y} is at least this value with positive probability, so that there exists a 2-coloring of a complete graph on at least {N - \binom{N}{k} 2^{1 - \binom{k}{2}}} vertices with no monochromatic {k}-clique. In particular, for every {N},

\displaystyle R(k, k) > N - \binom{N}{k} 2^{1 - \binom{k}{2}} .

This method of “alterations” gives a slight improvement over the probabilistic method in its first form applied to this problem, but it is not as good as the improvements obtained from the Lovász local lemma.
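
As a sanity check, here is a quick computational sketch (my own, not from the book) that optimizes the alteration bound over {N} for a fixed {k}; the choice {k = 10} and the search range are arbitrary.

from math import comb

def alteration_bound(N, k):
    # expected number of surviving vertices: N - C(N, k) * 2^(1 - C(k, 2))
    return N - comb(N, k) * 2 ** (1 - comb(k, 2))

k = 10
best_N = max(range(k, 2000), key=lambda N: alteration_bound(N, k))
print(best_N, alteration_bound(best_N, k))   # the optimal N and the resulting lower bound on R(10, 10)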

Example: A tournament is an orientation of the complete graph {K_N}. It is known that every tournament has a Hamiltonian path. The probabilistic method can be used to show that there is a tournament on {N} vertices with at least

\displaystyle N! 2^{-(N-1)}

Hamiltonian paths. Namely, for every ordering {\prec} of {[N] = \left\{1, 2, \dots, N\right\}}, say {x_1 \prec \dots \prec x_N}, we consider the event {E_{\prec}} that a random tournament on {[N]} (that is, an independent choice of each orientation with probability {\frac{1}{2}}) contains the directed Hamiltonian path {x_1 \rightarrow x_2 \rightarrow \dots \rightarrow x_N}. Since this requires each of the {N-1} edges {\left\{x_i, x_{i+1}\right\}} to be oriented correctly, its probability is {2^{-(N-1)}}.

Let {X} be the random variable assigning to a tournament the number of Hamiltonian paths. Then {X} is the sum of {N!} random variables of the form {\chi_{E_{\prec}}}, one for each linear ordering of {[N]}. We have seen that {\mathbb{P}( E_{\prec}) = 2^{-(N-1)}}, so that

\displaystyle \mathbb{E}(X) = N! 2^{-(N-1)}.

There is thus a tournament on which {X} takes at least this value, that is, a tournament with at least {N! 2^{-(N-1)}} Hamiltonian paths.
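
Here is a small simulation (my own sketch, not part of the argument) for {N = 5}: the average number of Hamiltonian paths over random tournaments should be close to {N! 2^{-(N-1)} = 7.5}, so some tournament has at least that many.

import itertools, math, random

def random_tournament(n):
    # t[(i, j)] = True means the edge i -- j is directed i -> j (stored for i < j)
    return {(i, j): random.random() < 0.5 for i in range(n) for j in range(i + 1, n)}

def beats(t, a, b):
    return t[(a, b)] if a < b else not t[(b, a)]

def hamiltonian_paths(t, n):
    # count permutations of the vertices that form a directed path
    return sum(all(beats(t, p[i], p[i + 1]) for i in range(n - 1))
               for p in itertools.permutations(range(n)))

N, trials = 5, 2000
avg = sum(hamiltonian_paths(random_tournament(N), N) for _ in range(trials)) / trials
print(avg, math.factorial(N) / 2 ** (N - 1))   # both should be near 7.5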

 

2. Chebyshev’s inequality

However, this method is a little crude; it used no information on {X} other than its expectation value. We can get better control on the distribution of {X} if we know not only the expectation value of {X}, but the higher moments. Let {\mathrm{Var}(X)} be the variance of {X}. Then we have Chebyshev’s inequality,

\displaystyle \mathbb{P}( |X - \mathbb{E}(X)| > c \sqrt{\mathrm{Var}(X)} ) \leq \frac{1}{c^2}, \quad c> 0,

which in turn follows from Markov’s inequality, applied to the nonnegative random variable {|X - \mathbb{E}(X)|^2}, by writing

\displaystyle \mathbb{P}( |X - \mathbb{E}(X)| > c \sqrt{\mathrm{Var}(X)} ) = \mathbb{P}( |X - \mathbb{E}(X)|^2 > c^2 \mathrm{Var}(X) ) \leq \frac{1}{c^2},

and using the definition {\mathrm{Var}(X) = \mathbb{E}( (X - \mathbb{E}(X))^2)}. This has two benefits: one, we get both lower and upper bounds on the values of {X} for “most” points in the probability space, though we are required to know the variance; and two, we get a quadratic term {c^2} in the denominator.
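
For a feel for how crude (but completely general) the bound is, here is a quick empirical check with a uniform distribution; the distribution and the values of {c} are arbitrary choices of mine.

import random, statistics

samples = [random.uniform(0, 1) for _ in range(100_000)]
mean, sd = statistics.fmean(samples), statistics.pstdev(samples)

for c in (1.5, 2.0, 3.0):
    p = sum(abs(x - mean) > c * sd for x in samples) / len(samples)
    print(c, p, 1 / c ** 2)   # empirical deviation probability vs. the Chebyshev bound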

For a simple example, let’s suppose that {X_1, \dots, X_N} are identically distributed, independent simple random variables with expectation zero. Then if {N} is large, we would expect that

\displaystyle \frac{1}{N}( X_1 + \dots + X_N)

tends to be fairly clustered near the origin. The law of large numbers and the central limit theorem provide justifications for this fact. From our point of view, we can see some of this by noting that the variance of {X_1 + \dots + X_N} is significantly controlled by the assumption of independence. If { \sigma} is the common standard deviation of the {X_i}, then {X_1 + \dots + X_N} has variance {N \sigma^2}, so that

\displaystyle \mathbb{P}( |\frac{1}{N}( X_1 + \dots + X_N) |> \epsilon\sigma) \leq \frac{1}{\epsilon^2 N}.

This is the weak law of large numbers: the probability of a deviation above any fixed level tends to zero as {N \rightarrow \infty}. Again, the observation that enabled us to do this was the independence of {X_1, X_2, \dots, X_N}. We can do a little better if we consider not just the first and second moments but rather the moment generating function; this leads to the Chernoff bound.
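
Here is a small simulation (mine, not from the text) of the weak law for fair {\pm 1} coin flips, for which {\sigma = 1}: the empirical deviation probability should sit below {\frac{1}{\epsilon^2 N}} and shrink as {N} grows.

import random

def deviation_probability(N, eps, trials=5_000):
    hits = 0
    for _ in range(trials):
        mean = sum(random.choice((-1, 1)) for _ in range(N)) / N
        hits += abs(mean) > eps   # sigma = 1 for a fair +/-1 flip
    return hits / trials

eps = 0.1
for N in (100, 400, 1600):
    print(N, deviation_probability(N, eps), 1 / (eps ** 2 * N))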

 

3. The Hardy-Ramanujan theorem

Let’s give an example: how many distinct prime factors does a random integer {\leq N} have? Let {\Omega = \left\{1, 2, \dots, N\right\}}, the probability space assigning each integer between {1} and {N} equal probability. In particular, we have a random variable {X} with the definition

\displaystyle X(i) \stackrel{\mathrm{def}}{=} | \left\{p \text{ prime}: p \mid i\right\}|,

so that {X} is the sum of the indicator variables {X_p\stackrel{\mathrm{def}}{=} \chi_{\left\{p \mid i\right\}}} testing whether a number is divisible by {p}. We’d like to understand the distribution of {X}.

To do this, note first that since

\displaystyle X = \sum_{p \leq N} X_p,

we can conclude

\displaystyle \mathbb{E}(X) = \sum_{p \leq N} \mathbb{E}(X_p) = \sum_{p \leq N} \left( \frac{1}{p} + O( \frac{1}{N}) \right) = \log \log N + O(1),

where we use Mertens’ estimate {\sum_{p \leq N} \frac{1}{p} = \log \log N + O(1)} (the error terms {O(\frac{1}{N})} contribute only {O(\frac{\pi(N)}{N}) = o(1)} in total). This suggests that a “random” number has about {\log \log N} prime factors. To make this precise, though, we need a second moment inequality: that is, we need to bound the variance of {X}. It turns out that we can do this, because {X} is the sum of the {X_p}, which are “almost” independent in the sense that their covariances are small.
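
As a quick numerical check (my own, not from the book), one can compute {\mathbb{E}(X) = \sum_{p \leq N} \frac{1}{N} \lfloor \frac{N}{p} \rfloor} exactly for a moderate {N} and compare it with {\log \log N}; the two differ by a bounded amount, as predicted.

import math

def primes_up_to(n):
    # simple sieve of Eratosthenes
    sieve = [True] * (n + 1)
    sieve[0] = sieve[1] = False
    for p in range(2, int(n ** 0.5) + 1):
        if sieve[p]:
            for m in range(p * p, n + 1, p):
                sieve[m] = False
    return [p for p in range(2, n + 1) if sieve[p]]

N = 10 ** 6
expectation = sum((N // p) / N for p in primes_up_to(N))   # exact value of E(X)
print(expectation, math.log(math.log(N)))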

The result is given by:

 

Theorem 1 (Hardy-Ramanujan) Let {\omega : \mathbb{N} \rightarrow \mathbb{R}} be any function tending to infinity, and let {i} be drawn uniformly from {\left\{1, 2, \dots, N\right\}}. Then

\displaystyle \mathbb{P}\left( |X(i) - \log \log N| > \omega(N) \sqrt{\log \log N} \right) \rightarrow 0

as {N \rightarrow \infty}.

 

The proof of this result rests on estimating the variance of the random variable {X} on {\left\{1, 2, \dots, N\right\}} and then applying Chebyshev’s inequality. In fact, the variance turns out to be of the same order as the expectation. Namely, we need to estimate

\displaystyle \mathrm{Var}(X) = \sum_p \mathrm{Var}(X_p) + \sum_{p \neq q} \mathrm{Cov}(X_p, X_q),

where {\mathrm{Cov}(X_p, X_q)} denotes the covariance

\displaystyle \mathrm{Cov}(X_p, X_q) = \mathbb{E}(X_p X_q) - \mathbb{E}(X_p) \mathbb{E}( X_q).

To estimate the first term, we use

\displaystyle \mathrm{Var}(X_p) = \mathbb{E}(X_p^2) - \mathbb{E}(X_p)^2 \leq \mathbb{E}(X_p) = \frac{1}{p} + O( \frac{1}{N}),

where we take advantage of the fact that {X_p} is an indicator random variable.

Next, we need to show that the covariances are small, so that the random variables {X_p} are “almost” independent. Since {X_p X_q} is the indicator of the event that an integer is divisible by {pq} (for distinct primes {p, q}), we find:

\displaystyle \mathrm{Cov}(X_p, X_q) = \frac{1}{N} \left( \lfloor \frac{N}{pq} \rfloor -\frac{1}{N} \lfloor \frac{N}{p}\rfloor \lfloor \frac{N}{q}\rfloor \right) .

This is at most

\displaystyle \frac{1}{pq} - \left( \frac{1}{p} - \frac{1}{N} \right) \left( \frac{1}{q} - \frac{1}{N} \right) \leq \frac{1}{N} \left( \frac{1}{p} + \frac{1}{q} \right) ,

so that

\displaystyle \sum_{p \neq q} \mathrm{Cov}(X_p, X_q) \leq \frac{1}{N} \sum_{p \neq q} \left( \frac{1}{p} + \frac{1}{q} \right).

For {N \gg 0}, there are at most {\frac{2N}{\log N}} primes {\leq N}, so that this sum is at most

\displaystyle \frac{4}{\log N} \sum_{p \leq N} \frac{1}{p} = o(1).

In particular, combining the bound on the covariance and {\mathrm{Var}(X_p) \leq \frac{1}{p} + O( \frac{1}{N})}, we find that

\displaystyle \mathrm{Var}(X) \leq \log \log N + O(1).

Chebyshev’s inequality, applied with {c = \omega(N)} and combined with {\mathbb{E}(X) = \log \log N + O(1)}, now gives the Hardy-Ramanujan theorem: the probabilities in the statement are in fact {O( \frac{1}{\omega(N)^2})}.
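
To see this concentration numerically, here is a self-contained sketch (my own, not from the book) that sieves {X(n)} for {n \leq N} and checks that the empirical variance, like the mean, is of order {\log \log N}, and that large deviations are rare; the constant {2} below stands in for {\omega(N)} and is an arbitrary choice.

import math, statistics

N = 10 ** 6
counts = [0] * (N + 1)          # counts[n] = number of distinct prime factors of n
for p in range(2, N + 1):
    if counts[p] == 0:          # no smaller prime divides p, so p is prime
        for m in range(p, N + 1, p):
            counts[m] += 1

loglog = math.log(math.log(N))
print(statistics.fmean(counts[1:]), statistics.pvariance(counts[1:]), loglog)

deviants = sum(abs(c - loglog) > 2 * math.sqrt(loglog) for c in counts[1:])
print(deviants / N, 1 / 2 ** 2)   # observed deviation fraction vs. the Chebyshev-style bound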