Lec 4

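Analyzing randomized algorithms requires a few tools from probability. Here we introduce two inequalities that bound how far a random variable can deviate from its expectation: Markov's inequality and Chebyshev's inequality.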

Markov and Chebyshev

Markov's Inequality

Let \(X\) be a real-valued random variable that takes only non-negative values. Then for any \(\alpha > 0\),

\[\Pr[X \geq \alpha] \leq \frac{\mathbb{E}[X]}{\alpha}.\]

Proof

We can prove this from three perspectives.

The first is the one given in the lecture notes. By the law of total expectation,

\[\mathbb{E}[X] = \mathbb{E}[X | X \geq \alpha]\Pr[X \geq \alpha] + \mathbb{E}[X|X < \alpha]\Pr[X < \alpha]\]

Since \(X \geq 0\), we have \(\mathbb{E}[X | X < \alpha] \geq 0\), so \(\mathbb{E}[X] \geq \mathbb{E}[X | X \geq \alpha]\Pr[X \geq \alpha]\). Moreover \(\mathbb{E}[X | X \geq \alpha] \geq \alpha\), and therefore \(\Pr[X \geq \alpha] \leq \frac{\mathbb{E}[X]}{\alpha}\).

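The second uses the integral expression for the expectation, \(\mathbb{E}[X] =\int_{0}^{\infty}x\Pr[X=x]\,\mathrm{d}x\) (where \(\Pr[X=x]\) denotes the density of \(X\) at \(x\)), which gives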

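\[\begin{aligned} \mathbb{E}[X] &= \int_{0}^{\alpha} x\Pr[X=x]\,\mathrm{d}x + \int_{\alpha}^{\infty} x\Pr[X=x]\,\mathrm{d}x \\ &\geq \int_{\alpha}^{\infty} x\Pr[X=x]\,\mathrm{d}x \\ &\geq \alpha \int_{\alpha}^{\infty} \Pr[X=x]\,\mathrm{d}x \\ &= \alpha \Pr[X \geq \alpha] \end{aligned}\]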

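The third is the picture of the PDF shown in the mini lecture.

Another common form follows by setting \(\alpha = c\,\mathbb{E}[X]\) with \(c > 0\): \(\Pr[X \geq c\mathbb{E}[X]] \leq \frac{1}{c}\).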
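As a quick sanity check of Markov's inequality, the sketch below compares the empirical tail probability with the bound \(\mathbb{E}[X]/\alpha\). The helper name `markov_check`, the exponential test distribution, and the seed are all illustrative assumptions, not part of the lecture.

```python
import random


def markov_check(samples, alpha):
    """Return (empirical Pr[X >= alpha], Markov bound E[X] / alpha).

    Both quantities are computed for the empirical distribution of
    `samples`, so the inequality holds exactly, not just approximately.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    if any(x < 0 for x in samples):
        raise ValueError("Markov's inequality needs non-negative values")
    mean = sum(samples) / len(samples)
    tail = sum(1 for x in samples if x >= alpha) / len(samples)
    return tail, mean / alpha


rng = random.Random(0)  # arbitrary fixed seed so the run is reproducible
samples = [rng.expovariate(1.0) for _ in range(10_000)]
for alpha in (0.5, 1.0, 2.0, 4.0):
    tail, bound = markov_check(samples, alpha)
    print(f"alpha={alpha}: tail={tail:.3f}, bound={bound:.3f}")
```

The bound is usually loose, but it can be tight: for the samples `[0, 0, 0, 4]` and `alpha = 4`, both sides equal \(1/4\).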

Chebyshev's Inequality

Let \(X\) be a real-valued random variable with finite variance. For any \(c>0\),

\[\Pr[|X - \mathbb{E}[X]| \geq c \sqrt{\mathrm{Var}[X]}] \leq \frac{1}{c^2}.\]

Proof

Let \(Y = X-\mathbb{E}[X]\), so that \(\mathrm{Var}[X]=\mathbb{E}[Y^{2}]\). Applying Markov's inequality to the non-negative random variable \(Y^2\) gives \(\Pr[Y^2 \geq c^2\mathbb{E}[Y^2]] \leq \frac{1}{c^2}\). Since \(|Y| \geq c\sqrt{\mathbb{E}[Y^2]}\) holds exactly when \(Y^2 \geq c^2\mathbb{E}[Y^2]\), we get \(\Pr\left[|Y| \geq c\sqrt{\mathbb{E}[Y^2]}\right] = \Pr\left[Y^{2} \geq c^2\mathbb{E}[Y^2]\right] \leq \frac{1}{c^2}\), that is, \(\Pr[|X - \mathbb{E}[X]| \geq c \sqrt{\mathrm{Var}[X]}] \leq \frac{1}{c^2}\).

Chebyshev's inequality is also often written as \(\Pr\left[|X - \mathbb{E}[X]| \geq c\right] \leq \frac{\mathrm{Var}[X]}{c^2}\).
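The second form of Chebyshev's inequality can be checked numerically in the same spirit. The sketch below is only an illustration: the name `chebyshev_check`, the Gaussian test data, and the seed are assumptions made for the example.

```python
import random


def chebyshev_check(samples, c):
    """Return (empirical Pr[|X - E[X]| >= c], Chebyshev bound Var[X] / c^2).

    Mean and variance are those of the empirical distribution
    (variance divides by n, not n - 1), so the inequality holds exactly.
    """
    if c <= 0:
        raise ValueError("c must be positive")
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    tail = sum(1 for x in samples if abs(x - mean) >= c) / n
    return tail, var / c ** 2


rng = random.Random(0)  # arbitrary fixed seed
samples = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
for c in (1.0, 2.0, 3.0):
    tail, bound = chebyshev_check(samples, c)
    print(f"c={c}: tail={tail:.4f}, bound={bound:.4f}")
```

For a two-point distribution such as `[-1, 1]` with `c = 1`, both sides equal \(1\), which shows that the bound cannot be improved in general.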

Sampling-Based Median Algorithm

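The main idea of the algorithm is to draw a small sample from \(S\) and pick two values \(a,b\) from the sample as endpoints. We hope that the median of \(S\) lies between these two values, and that the two values are not far apart in the sorted order of \(S\) (which is why we sort the sample before picking them). We can then keep only the elements of \(S\) that lie between \(a\) and \(b\) and sort them directly to find the median.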
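The idea above can be sketched in Python as follows. The notes do not spell out the algorithm line by line, so several details here are assumptions: the sample size \(t = n^{3/4}\), the choice of \(a\) and \(b\) as the sample elements of rank roughly \(t/2 \mp \sqrt{n}\), sampling with replacement, an odd \(n\), and the use of `None` to represent Fail. The function name `sampling_median` is hypothetical.

```python
import math
import random


def sampling_median(S, rng=random):
    """Return the median of S (odd length assumed), or None for Fail."""
    n = len(S)
    t = math.ceil(n ** 0.75)             # assumed sample size t = n^(3/4)
    root = math.ceil(math.sqrt(n))
    # Sample t elements with replacement and sort the sample R.
    R = sorted(rng.choice(S) for _ in range(t))
    # Endpoints: sample ranks about t/2 - sqrt(n) and t/2 + sqrt(n),
    # clamped so that they stay valid for very small inputs.
    a = R[max(t // 2 - root, 0)]
    b = R[min(t // 2 + root, t - 1)]
    # One pass over S, at most two comparisons per element.
    n_less = 0                           # N_{<a}
    T = []
    for x in S:
        if x < a:
            n_less += 1
        elif x <= b:
            T.append(x)
    k = (n + 1) // 2 - n_less            # 1-indexed rank of the median in T
    if len(T) >= 4 * t or not (1 <= k <= len(T)):
        return None                      # Fail: T too large or median not in T
    T.sort()
    return T[k - 1]


rng = random.Random(0)  # arbitrary fixed seed
S = list(range(10_001))
rng.shuffle(S)
print(sampling_median(S, rng))
```

Because the elements of \(T\) occupy ranks \(N_{<a}+1, \dots, N_{<a}+|T|\) in sorted \(S\), any value returned by this sketch is the true median; the randomness only affects whether it returns Fail.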

Question

Suppose that you could show that:

with probability \(\geq 0.9\), the median of \(S\) is in the list \(T\) ; and
with probability \(\geq 0.9\), \(|T |< 4t\).

Explain (to each other) why these two things would imply that the algorithm returns the correct answer with probability \(\geq 0.8\). And if it does not return the median then it returns Fail.

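By the description of the algorithm, when both conditions hold the algorithm never returns Fail. The two events are not necessarily independent, so we use the union bound: the probability that at least one of them fails is at most \(0.1 + 0.1 = 0.2\), hence \(\Pr[\text{output a number}] \geq 1 - 0.2 = 0.8\). In that case the median of \(S\) is in \(T\), and by the definition of the median its rank in \(T\) is \(i = (n+1) / 2 - N_{< a}\), so the number returned is correct. In summary, the algorithm returns the correct answer with probability \(\geq 0.8\).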

Question

Convince yourself that this algorithm uses at most \(O(n)\) operations. What is the leading constant in this big-Oh notation? (Assuming that "sample a random element of \(S\)", and comparing two numbers are each single operations).

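Sampling takes \(O(t)=o(n)\) operations.
Sorting \(R\) takes \(O(t \log t)=o(n)\) operations.
Computing \(N_{<a}\) and \(N_{>b}\) while constructing \(T\) takes \(O(n)\) operations: we iterate over the elements of \(S\) and compare each one with \(a\) and \(b\), for \(2n\) comparisons in total.
Sorting \(T\) takes \(O(t \log t)=o(n)\) operations, since \(|T| < 4t\).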

Hence the total cost is \(2n + o(n) = O(n)\), and the leading constant is \(2\).

Question

In the following parts, you will show that the median of \(S\) is in \(T\) , with probability at least \(0.9\). Let \(m\) be the median of \(S\). Consider two events:

\(\lvert \{r_{i} \in R : r_{i} < m\} \rvert < \frac{t}{2} + \sqrt{ n }\)
\(\lvert \{r_{i} \in R : r_{i} > m\} \rvert < \frac{t}{2} + \sqrt{ n }\)

(a) Explain why, if both of these events hold, then \(median(S) \in T\).

(b) Use Chebyshev’s inequality to bound the probability that the first event does not hold. (Hint: let \(X_{i}\) be the indicator random variable that is \(1\) iff \(r_{i} \leq m\), and consider \(\sum_{i}X_{i}\)).

(c) Convince yourself that the same argument will work for the second event, and write a statement of the form:

\[\Pr[median(S) \in T] \geq 1 - \_\_\_\_.\]

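(a) The two events imply \(b \geq m\) and \(a \leq m\) respectively, so if both hold then \(m\) lies between \(a\) and \(b\) and is therefore in \(T\). The second event is symmetric to the first, so we only explain the first. If \(\left| \{r_{i} \in R : r_{i} < m\} \right| < \frac{t}{2} + \sqrt{ n }\), then fewer than \(\frac{t}{2} + \sqrt{n}\) samples are smaller than \(m\), so the element of rank \(\frac{t}{2} + \sqrt{n}\) in the sorted \(R\) is at least \(m\); and \(b\) is exactly the element of rank \(\frac{t}{2} + \sqrt{n}\) in \(R\).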

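(b) Since the \(r_{i}\) are chosen independently and uniformly from \(S\), we have \(\mathbb{E}[X_{i}] = \frac{1}{2}\) and \(\mathrm{Var}[X_{i}]=\mathbb{E}[X_{i}^2]-\mathbb{E}[X_{i}]^{2}=\frac{1}{4}\). By linearity and independence, \(\mathbb{E}\left[\sum_{i}X_{i}\right] = \frac{t}{2}\) and \(\mathrm{Var}\left[\sum_{i}X_{i}\right] = \frac{t}{4}\). Then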

\[ \begin{aligned} \Pr\left[\left\lvert \{r_{i} \in R : r_{i} < m\} \right\rvert \geq \frac{t}{2} + \sqrt{ n }\right] &= \Pr\left[\sum_{i}X_{i} \geq \frac{t}{2} + \sqrt{n}\right] \\ &= \Pr\left[\sum_{i}X_{i} - \frac{t}{2} \geq \sqrt{n}\right] \\ &\leq \Pr\left[\left\lvert \sum_{i}X_{i}-\frac{t}{2} \right\rvert \geq \sqrt{n}\right] \\ &\leq \frac{\mathrm{Var}\left[\sum_{i}X_{i}\right]}{\left( \sqrt{n} \right)^2} = \frac{\frac{t}{4}}{n}\\ &= \frac{1}{4}n^{-1 / 4} = O(n^{-1 / 4}) \quad \text{(using } t = n^{3/4}\text{)} \end{aligned} \]

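(c) \(\Pr[median(S) \in T] \geq 1-O(n^{-1 / 4})\). The median can be missing from \(T\) only if at least one of the two events fails. Each event fails with probability \(O(n^{-1 / 4})\), so by the union bound the total failure probability is at most twice that, which is still \(O(n^{-1 / 4})\).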
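To get a feel for how loose the Chebyshev bound is, the sketch below evaluates \(\mathrm{Var}\left[\sum_i X_i\right] / n = \frac{t}{4n}\) and compares it with a simulated failure frequency of the first event. It assumes \(t = n^{3/4}\), takes \(S = \{0, 1, \dots, n-1\}\) with \(n\) odd, and uses hypothetical helper names and an arbitrary seed.

```python
import math
import random


def chebyshev_bound_first_event(n):
    """Chebyshev bound Var[sum X_i] / n = (t / 4) / n with t = n^(3/4)."""
    t = n ** 0.75
    return (t / 4) / n


def simulate_first_event(n, trials, rng):
    """Fraction of trials where at least t/2 + sqrt(n) samples fall below the median."""
    t = round(n ** 0.75)
    threshold = t / 2 + math.sqrt(n)
    m = n // 2                      # median of S = [0, 1, ..., n-1] for odd n
    failures = 0
    for _ in range(trials):
        below = sum(1 for _ in range(t) if rng.randrange(n) < m)
        if below >= threshold:
            failures += 1
    return failures / trials


rng = random.Random(0)  # arbitrary fixed seed
n = 625                 # 625 = 5^4, so t = 125 and sqrt(n) = 25
print("bound:    ", chebyshev_bound_first_event(n))
print("simulated:", simulate_first_event(n, 200, rng))
```

For \(n = 625\) the bound for one event is \(0.05\), so the union bound over both events gives \(0.1\), which matches the \(0.9\) success probability assumed in the earlier question.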

Question

Now, we turn our attention to the probability that \(|T |< 4t\).

(a) Explain why it is sufficient to show that \(a\) is not one of the smallest \(n/2-2t\) elements of \(S\), and \(b\) is not one of the largest \(n/2 - 2t\) elements of \(S\).

(b) Use Chebyshev's inequality to bound the probability that \(a\) is not one of the smallest \(n/2-2t\) elements of \(S\). (Hint: Consider the indicator random variable \(Y_{i}\) that is \(1\) if \(r_{i}\) is in the smallest \(n/2-2t\) elements of \(S\). Argue that \(a\) is one of the smallest \(n/2-2t\) elements of \(S\) iff \(\sum_{i}Y_{i} \geq t / 2 - \sqrt{ n }\) (why?) and apply Chebyshev's inequality.)

(c) Convince yourself that the analogous statement holds for \(b\), and write a statement of the form:

\[\Pr[\lvert T \rvert < 4t] \geq 1 - \_\_\_\_.\]

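(a) If \(a\) is not one of the smallest \(n / 2 - 2t\) elements of \(S\) and \(b\) is not one of the largest \(n / 2 - 2t\) elements of \(S\), then every element between \(a\) and \(b\) lies outside these two groups, so there are at most \(n - 2(n/2 - 2t) = 4t\) such elements, which is what we want.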

(b) Let \(p = \Pr[Y_{i}=1]=\frac{n / 2 - 2t}{n}\). Then \(\mathbb{E}[Y_{i}]=p\) and \(\mathrm{Var}[Y_{i}] = p-p^2\). With \(t = n^{3/4}\) we have \(tp = \frac{t}{2} - \frac{2t^2}{n} = \frac{t}{2} - 2\sqrt{n}\), so

\[ \begin{aligned} \Pr[a\text{ is one of the smallest } n / 2 - 2t \text{ elements of }S] &= \Pr\left[\sum_{i}Y_{i} \geq t / 2 - \sqrt{n}\right] \\ &= \Pr\left[\sum_{i}Y_{i} - tp \geq \sqrt{n}\right] \\ &\leq \Pr\left[\left\lvert \sum_{i}Y_{i} - tp \right\rvert \geq \sqrt{n}\right] \\ &\leq \frac{\mathrm{Var}\left[\sum_{i}Y_{i}\right]}{n} \\ &= \frac{t(p - p^2)}{n} \\ &= \frac{1}{4n^{1 / 4}} - \frac{4}{n^{3 / 4}} = O(n^{- 1 / 4}) \end{aligned} \]

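(c) If either \(a\) or \(b\) violates its condition, \(|T|\) may reach \(4t\) or more. Each of these two bad events has probability \(O(n^{-1 / 4})\), so by the union bound \(\Pr[|T| \geq 4t] \leq O(n^{-1 / 4})\), and hence \(\Pr[|T| < 4t] \geq 1 - O(n^{-1 / 4})\).

Finally, we obtain: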
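The closed form \(\frac{t(p - p^2)}{n} = \frac{1}{4n^{1/4}} - \frac{4}{n^{3/4}}\) can be verified with exact rational arithmetic. The sketch below assumes \(t = n^{3/4}\) and restricts to \(n = k^4\) so that \(t = k^3\) is an integer; the function name `size_event_bound` is hypothetical.

```python
from fractions import Fraction


def size_event_bound(k):
    """Return (t(p - p^2)/n, closed form) for n = k^4 and t = k^3, as exact fractions."""
    n, t = k ** 4, k ** 3
    p = (Fraction(n, 2) - 2 * t) / n          # p = (n/2 - 2t) / n
    direct = t * (p - p * p) / n              # Var[sum Y_i] / n
    closed = Fraction(1, 4 * k) - Fraction(4, k ** 3)
    return direct, closed


for k in (5, 10, 16):
    direct, closed = size_event_bound(k)
    print(f"n={k ** 4}: bound for a = {float(direct):.6f}, "
          f"union bound for a and b = {float(2 * direct):.6f}")
```

The restriction \(k \geq 5\) in the usage loop keeps \(p\) positive; for smaller \(n\) the set of the smallest \(n/2 - 2t\) elements is empty and the bound is not meaningful.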

Theorem

If the algorithm does not output Fail, then it correctly outputs the median. The probability the algorithm returns Fail is at most \(O(1 / n^{1 / 4})\) (and hence we can repeat until success without any significant increase in expected runtime) and the algorithm performs at most \(2n+o(n)\) pairwise comparisons.