Lec 5

This lecture introduces Chernoff bounds and the problem of randomized routing on the hypercube, where Chernoff bounds are used to estimate tail probabilities.

Introduction

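Chernoff bounds are commonly used to control sums of independent random variables (RVs). They say that the probability that the sum deviates from its expectation by more than \(c\) standard deviations decays roughly inverse exponentially in \(c^2\). The Central Limit Theorem points the same way: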

Central Limit Theorem

Let \(X_{1}, \dots, X_{n}\) be independent, identically distributed random variables with \(\mathbf{E}[X_{i}] = \mu\) and \(\mathbf{Var}[X_{i}]=\sigma^2\). Then, as \(n \to \infty\), the distribution of \(\frac{1}{\sigma\sqrt{n}}\sum_{i=1}^{n}(X_{i}-\mu)\) converges to the standard Gaussian, \(N(0, 1)\).

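Let \(Z \sim N(0, 1)\). The CLT suggests that, for sufficiently large \(n\), \(\Pr\left[\lvert \sum_{i=1}^{n}X_{i}-n\mu \rvert \geq c\sqrt{\mathbf{Var}\left[\sum_{i=1}^{n}X_{i}\right]}\right] \approx \Pr[\lvert Z \rvert \geq c] \leq 2e^{-c^2 / 2}\), where the last inequality follows from integrating the Gaussian PDF directly (it is twice the one-sided bound \(\Pr[Z \geq c] \leq e^{-c^2/2}\)).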
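As a quick numerical sanity check of the Gaussian tail estimate, the sketch below compares the exact one-sided tail \(\Pr[Z \geq c]\) with \(e^{-c^2/2}\); the two-sided tail simply doubles both sides. The helper names are illustrative and not part of the lecture.

```python
import math

def gaussian_upper_tail(c):
    """Exact Pr[Z >= c] for Z ~ N(0, 1), via the complementary error function."""
    return 0.5 * math.erfc(c / math.sqrt(2.0))

def gaussian_tail_bound(c):
    """The estimate exp(-c^2 / 2) for the one-sided tail, valid for c >= 0."""
    return math.exp(-c * c / 2.0)

for c in (0.5, 1.0, 2.0, 3.0, 4.0):
    exact = gaussian_upper_tail(c)
    bound = gaussian_tail_bound(c)
    print(f"c={c}: exact={exact:.3e}  bound={bound:.3e}")
```

The exact tail stays below the estimate for every \(c \geq 0\), and both shrink at the \(e^{-c^2/2}\) rate.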

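However, the CLT only applies when \(n\) is large, and it requires the \(X_{i}\) to follow the same distribution. When these conditions do not both hold, we can use Chernoff bounds instead.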

Moment-Generating Functions

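As everyone knows, \(f(x)=e^{x}\) is a wonderful function with many attractive properties:

It is always greater than \(0\);
Its domain is (almost) unrestricted: \((-\infty, +\infty)\);
It is not only monotonically increasing but also a convex function on its whole domain;
Its Taylor expansion is beautiful (and, most importantly, easy to remember);
There are many inequalities about it, several of them both simple and useful (yes, \(1+x \leq e^x\)), which makes it well suited to estimation;
It is even closely tied to trigonometric functions and imaginary numbers (via Euler's famous formula);
It can serve as a bridge between \(\sum\) and \(\prod\);
And many more that we will not list here.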

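We would like to map \(X\) to something related to \(e^{x}\) while preserving uniqueness: for any distribution \(Y \not= X\), the image of \(Y\) should differ from the image of \(X\). A simple choice is \(\pi_{X}(t) = e^{tX}\), which, after taking expectations, is exactly the form of the moment-generating function.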

Moment-Generating Function

The moment generating function of a random variable, \(X\), is a function \(M_{X}: \mathbb{R} \to \mathbb{R}\), defined by \(M_{X}(t)=\mathbf{E}\left[e^{tX}\right]\).

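The Taylor expansion of \(e^x\) shows how much information this function carries. Consider a neighborhood \((-\delta, \delta)\) of \(t=0\). Differentiating \(M_{X}(t)\) \(k\) times gives \(M_{X}^{(k)}(0)=\mathbf{E}\left[ X^{k}e^{0 \cdot X} \right] = \mathbf{E}\left[ X^{k} \right]\). So by taking derivatives of every order, a neighborhood of \(t=0\) alone already yields every moment \(\mathbf{E}[X^{k}]\) for nonnegative integers \(k\). Heuristically, if \(Y\) has a different distribution from \(X\), then \(M_{X}(t)\) and \(M_{Y}(t)\) cannot coincide on this neighborhood, because "when a function is well-behaved (as \(M_{X}(t)\) is whenever it is finite near \(0\)), its Taylor expansion around any point uniquely determines the function." The moment-generating function therefore meets the requirement stated above.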

Fact

Given random variables, \(X\) and \(Y\), if there is some \(\delta > 0\) such that \(M_{X}(t)=M_{Y}(t)\) for all \(t \in (-\delta, \delta)\), then the distribution of \(X\) is the same as the distribution of \(Y\).

Example

Let \(Z \sim N(\mu, \sigma^2)\). Then

\[M_{Z}(t) = \mathbf{E}\left[e^{tZ}\right] = \int_{-\infty}^{+\infty}e^{tz} \frac{1}{\sigma\sqrt{2\pi}}e^{- (z-\mu)^2/2\sigma^2}\mathrm{d}z=e^{\mu t + \frac{1}{2}\sigma^2t^2}.\]

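Now let \(X \sim N(\mu_{1}, \sigma_{1}^2)\) and \(Y \sim N(\mu_{2}, \sigma_{2}^2)\) be independent. Independence (not merely linearity of expectation) gives \(\mathbf{E}\left[e^{t(X+Y)}\right] = \mathbf{E}\left[e^{tX}\right]\mathbf{E}\left[e^{tY}\right] = M_{X}(t)M_{Y}(t)\), so the moment-generating function of \(X+Y\) is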

\[M_{X+Y}(t)=\mathbf{E}\left[ e^{t(X+Y)} \right] = e^{(\mu_{1}+\mu_{2})t+\frac{1}{2}(\sigma_{1}^2+\sigma_{2}^2)t^2}.\]

By the Fact above, \(X+Y \sim N(\mu_{1}+\mu_{2}, \sigma_{1}^2+\sigma_{2}^2)\).

Example

Let \(X_{i}\) be a \(0 / 1\)-valued RV that equals \(1\) with probability \(p_{i}\), and assume that all the \(X_{i}\) are independent. Then

\[M_{X_{i}}(t) = \mathbf{E}\left[ e^{tX_{i}} \right] = p_{i}e^{t}+(1-p_{i})e^{0}=1+p_{i}(e^t-1).\]

Let \(X=\sum_{i=1}^{n}X_{i}\). By independence, the expectation of the product factors, so

\[M_{X}(t) = \mathbf{E}\left[ e^{t\sum_{i=1}^{n}X_{i}} \right]=\prod_{i=1}^{n}\left( 1+p_{i}(e^t-1) \right).\]
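The product formula can be checked by brute force for a small example. The sketch below sums \(e^{tX}\) over all \(2^n\) outcomes and compares the result with the closed form; it also recovers \(\mathbf{E}[X] = M_{X}'(0)\) with a finite difference. The probabilities and function names are illustrative assumptions.

```python
import itertools
import math

def mgf_product(ps, t):
    """Closed form: prod_i (1 + p_i * (e^t - 1))."""
    return math.prod(1.0 + p * (math.exp(t) - 1.0) for p in ps)

def mgf_bruteforce(ps, t):
    """E[e^{tX}] computed by summing over all 2^n outcomes of (X_1, ..., X_n)."""
    total = 0.0
    for outcome in itertools.product((0, 1), repeat=len(ps)):
        prob = math.prod(p if x == 1 else 1.0 - p for p, x in zip(ps, outcome))
        total += prob * math.exp(t * sum(outcome))
    return total

def first_moment_from_mgf(ps, h=1e-5):
    """Central finite difference approximating M_X'(0) = E[X]."""
    return (mgf_product(ps, h) - mgf_product(ps, -h)) / (2.0 * h)

ps = [0.1, 0.5, 0.9]  # illustrative probabilities
print(mgf_product(ps, 0.7), mgf_bruteforce(ps, 0.7))
print(first_moment_from_mgf(ps), sum(ps))
```

Both lines should print two nearly identical numbers: the first pair confirms the product formula, the second that the derivative at \(0\) is the mean \(\sum_{i} p_{i}\).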

Chernoff Bounds

Chernoff bounds use moment-generating functions to bound RVs that are sums.

There are many Chernoff bounds. All of them come from applying Markov's inequality to the moment-generating function. The general form is:

For any \(t > 0\), \(\Pr[X \geq c] = \Pr\left[ e^{tX} \geq e^{tc} \right] \leq \frac{\mathbf{E}\left[e^{tX}\right]}{e^{tc}}\).
For any \(t < 0\), \(\Pr[X \leq c] = \Pr\left[ e^{tX} \geq e^{tc} \right] \leq \frac{\mathbf{E}\left[e^{tX}\right]}{e^{tc}}\).

Note that taking \(t < 0\) turns a \(\leq\) event into a \(\geq\) event, to which Markov's inequality applies. This lets us bound lower tails, something that neither Markov's nor Chebyshev's inequality handles conveniently.

We now present some Chernoff bounds for sums of \(0 / 1\)-valued RVs.

Let \(X = \sum_{i=1}^{n}X_{i}\), where the \(X_{i}\) are independent \(0 / 1\)-valued RVs with \(\Pr[X_{i}=1]=p_{i}\), and let \(\mu = \mathbf{E}[X] = \sum_{i=1}^{n}p_{i}\).

Theorem

Let \(X=\sum_{i=1}^n X_{i}\), where the \(X_{i}\) are independent 0/1-valued random variables with \(\Pr[X_{i}=1]=p_{i}\).

For any \(\delta>0\), \(\Pr[X \geq (1+\delta)\mathbf{E}[X]] \leq \left( \frac{e^{\delta}}{(1+\delta)^{1+\delta}} \right)^{\mathbf{E}[X]}.\)
For any \(\delta \in (0,1]\), \(\Pr[X \leq (1-\delta)\mathbf{E}[X]] \leq \left( \frac{e^{-\delta}}{(1-\delta)^{1-\delta}} \right)^{\mathbf{E}[X]}\).

\(\textit{Proof}\). The proof of the lower-tail bound is analogous to that of the upper-tail bound, so we only prove the upper-tail bound.

Applying Markov's inequality to the moment-generating function, for any \(t > 0\) we get

\[\begin{aligned} \Pr[X \geq (1+\delta)\mathbf{E}[X]] &= \Pr\left[e^{tX} \geq e^{(1+\delta)\mu t}\right]\\ &\leq \frac{\mathbf{E}\left[e^{tX}\right]}{e^{(1+\delta)\mu t}}\\ &= \frac{\prod_{i=1}^{n}\left(1+p_{i}(e^t - 1)\right)}{e^{(1+\delta)\mu t}}\\ &\leq \frac{\prod_{i=1}^{n}e^{p_{i}(e^t - 1)}}{e^{(1+\delta)\mu t}}\\ &= \frac{e^{\mu (e^t - 1)}}{e^{(1+\delta)\mu t}} = \left( \frac{e^{e^t - 1}}{e^{(1+\delta)t}} \right)^{\mu} \end{aligned}\]

Now choose \(t\) to make the bound as tight as possible. Let \(f(t)=\frac{e^{e^t - 1}}{e^{(1+\delta)t}}=e^{e^t - 1 - (1+\delta)t}\). To make \(f(t)\) as small as possible, let \(g(t) = e^t - 1 - (1+\delta)t\), so that \(g'(t)=e^t - (1+\delta)\), which vanishes at \(t=\log(1+\delta) > 0\). Since \(g\) is convex, \(f(t)\) is minimized there. Substituting \(t=\log(1+\delta)\) into the bound above gives \(\Pr[X \geq (1+\delta)\mathbf{E}[X]] \leq \left( \frac{e^\delta}{(1+\delta)^{1+\delta}} \right)^\mu\).
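To see how the upper-tail bound compares with the truth, the following sketch evaluates it against an exact binomial tail (the special case where all \(p_{i}\) are equal) and checks on a grid that \(t=\log(1+\delta)\) minimizes the exponent. The parameters \(n=100\), \(p=0.1\), \(\delta=1\) are illustrative choices, not values from the lecture.

```python
import math

def chernoff_upper_bound(mu, delta):
    """(e^delta / (1 + delta)^(1 + delta))^mu from the theorem."""
    return (math.exp(delta) / (1.0 + delta) ** (1.0 + delta)) ** mu

def binomial_upper_tail(n, p, k):
    """Exact Pr[X >= k] for X ~ Binomial(n, p)."""
    return sum(math.comb(n, j) * p**j * (1.0 - p) ** (n - j) for j in range(k, n + 1))

def exponent(t, delta):
    """g(t) = e^t - 1 - (1 + delta) * t, the exponent of the bound per unit of mu."""
    return math.exp(t) - 1.0 - (1.0 + delta) * t

n, p, delta = 100, 0.1, 1.0            # illustrative parameters, so mu = 10
mu = n * p
exact = binomial_upper_tail(n, p, 20)  # 20 = (1 + delta) * mu
print(f"exact tail = {exact:.3e}, Chernoff bound = {chernoff_upper_bound(mu, delta):.3e}")

# g is smallest at t = log(1 + delta) among a grid of positive t values.
t_star = math.log(1.0 + delta)
print(all(exponent(t_star, delta) <= exponent(k / 100.0, delta) for k in range(1, 300)))
```

The bound is loose by a constant factor in this example but captures the exponential decay.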

Corollary

Let \(X=\sum_{i=1}^{n}X_{i}\), where the \(X_{i}\) are independent \(0/1\)-valued random variables with \(\Pr[X_{i} = 1] = p_{i}\), and let \(\mu = \sum_{i} p_{i}\). Then the following bounds hold:

For any \(\delta \in (0,1]\), \(\Pr[X \geq (1+\delta)\mu] \leq e^{-\frac{\mu\delta^2}{3}}\).
For any \(\delta \in (0,1]\), \(\Pr[X \leq (1-\delta)\mu] \leq e^{-\frac{\mu\delta^2}{2}}\).
For any \(c \geq 6\), \(\Pr[X \geq c\mu] \leq 2^{-c\mu}\).

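Substituting the Taylor expansion of \(\log(1+\delta)\) into \(g(\log(1+\delta))=\delta-(1+\delta)\log(1+\delta)\) gives \(g(\log(1+\delta)) \leq -\delta^2 / 3\) for \(\delta \in (0,1]\), which yields the first bound. The other two bounds are obtained similarly.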
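The three simplifications in the corollary reduce to elementary inequalities between exponents, which can be checked numerically on a grid. This is only a sanity check under the stated parameter ranges, not a proof; the function names are illustrative.

```python
import math

def upper_exponent(delta):
    """delta - (1 + delta) * log(1 + delta): log of the upper-tail bound per unit of mu."""
    return delta - (1.0 + delta) * math.log1p(delta)

def lower_exponent(delta):
    """-delta - (1 - delta) * log(1 - delta): log of the lower-tail bound per unit of mu."""
    return -delta - (1.0 - delta) * math.log1p(-delta)

grid = [k / 1000.0 for k in range(1, 1000)]  # delta in (0, 1)
upper_ok = all(upper_exponent(d) <= -d * d / 3.0 for d in grid + [1.0])
lower_ok = all(lower_exponent(d) <= -d * d / 2.0 for d in grid)
# For c >= 6, with 1 + delta = c: (c - 1) - c * log(c) <= -c * log(2).
large_ok = all((c - 1.0) - c * math.log(c) <= -c * math.log(2.0) for c in range(6, 200))
print(upper_ok, lower_ok, large_ok)
```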

Corollary

For any \(\delta > 0\), and any \(c \geq \mathbf{E}[X]\), \(\Pr[X \geq (1+\delta)c] \leq \left(\frac{e^{\delta}}{(1+\delta)^{1+\delta}} \right)^c\).

For any \(\delta \in (0,1]\), and any \(c \leq \mathbf{E}[X]\), \(\Pr[X \leq (1-\delta)c] \leq \left( \frac{e^{-\delta}}{(1-\delta)^{1-\delta}} \right)^c\).

These two bounds extend the reference point of the deviation from \(\mathbf{E}[X]\) to any \(c \geq \mathbf{E}[X]\) and any \(c \leq \mathbf{E}[X]\), respectively. Like moment-generating functions, the proofs rely on a mapping trick.

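For the first bound, define independent 0/1 RVs \(Y_{1}, \dots, Y_m\), also independent of the \(X_{i}\), such that \(\mathbf{E}[\sum_{j}Y_{j}] = c - \mathbf{E}[\sum_{i}X_{i}]\). Defining the RV \(Z = \sum_{i}X_{i} + \sum_{j}Y_{j}\), we always have \(X \leq Z\), and \(\mathbf{E}[Z]=c\). Applying the theorem above to \(Z\) gives \(\Pr[X \geq (1+\delta)c] \leq \Pr[Z \geq (1+\delta)\mathbf{E}[Z]]\), which is the first bound.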

For the second bound, let \(Z = \sum_{i}X_{i}Y_{i}\), where the \(Y_{i}\) are independent 0/1 RVs, independent of the \(X_{i}\), with \(\Pr[Y_{i}=1]=c / \mathbf{E}[X]\), so that \(\mathbf{E}[Z]=c\). Then \(Z\) is a sum of independent 0/1 RVs and always satisfies \(Z \leq X\), so \(\Pr[X \leq (1-\delta)c] \leq \Pr[Z \leq (1-\delta)c] = \Pr[Z \leq (1-\delta)\mathbf{E}[Z]]\), and the theorem applies.

Remark

Other Chernoff bounds can be obtained by simplifying the moment-generating function \(\mathbf{E}[e^{tX}]=\prod_{i}(p_{i}e^t+(1-p_{i}))\) in different ways. Instead of using \(e^x \geq x+1\), one can use the arithmetic-mean geometric-mean ("AM-GM") inequality to get

\[\prod_{i}\left(p_{i}e^t + (1-p_{i})\right) \leq \left( pe^t + (1-p) \right)^n,\]

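where \(p = \frac{1}{n}\sum_{i}p_{i}\). Optimizing over \(t\) then yields a bound that is better than the theorem above, because the AM-GM step is tight when all the \(p_{i}\) are equal and nearly tight when they are close to one another.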
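The relationship between the exact moment-generating function, the AM-GM relaxation, and the \(e^{\mu(e^t-1)}\) relaxation used in the theorem can be seen numerically. The sketch below evaluates all three for an illustrative set of probabilities; the names are hypothetical.

```python
import math

def mgf_exact(ps, t):
    """prod_i (p_i e^t + (1 - p_i))."""
    return math.prod(p * math.exp(t) + (1.0 - p) for p in ps)

def mgf_amgm(ps, t):
    """(p e^t + (1 - p))^n with p the average of the p_i."""
    p_bar = sum(ps) / len(ps)
    return (p_bar * math.exp(t) + (1.0 - p_bar)) ** len(ps)

def mgf_exp(ps, t):
    """exp(mu (e^t - 1)), the relaxation obtained from 1 + x <= e^x."""
    return math.exp(sum(ps) * (math.exp(t) - 1.0))

ps = [0.05, 0.2, 0.9]  # illustrative, deliberately unequal probabilities
for t in (-1.0, 0.5, 2.0):
    print(t, mgf_exact(ps, t), mgf_amgm(ps, t), mgf_exp(ps, t))
```

For every \(t\) the three values are ordered exact \(\leq\) AM-GM \(\leq\) exponential, so the AM-GM route never gives a worse bound.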

There are also other useful Chernoff-like bounds. For example, Hoeffding's inequality extends the Chernoff bound from \(0 / 1\)-valued RVs to arbitrary bounded RVs:

Hoeffding's Inequality

Suppose that \(X_{1}, \dots, X_{n}\) are independent random variables with \(X_{i} \in [a_{i}, b_{i}]\) almost surely for all \(i\). Then for any \(t > 0\),

\[\Pr\left[\left\lvert \sum_{i}(X_{i} - \mathbf{E}[X_{i}]) \right\rvert \geq t \right] \leq 2\exp\left( \frac{-2t^{2}}{\sum_{i}(b_{i}-a_{i})^{2}} \right).\]

Bernstein's inequality gives a better bound when the variances of the \(X_{i}\) are small compared with their range:

Bernstein's Inequality

Suppose that \(X_{1}, \dots, X_{n}\) are independent mean-zero random variables with \(|X_{i}| \leq M\) almost surely for all \(i\). Then for any \(t > 0\),

\[\Pr\left[\left\lvert \sum_{i}X_{i} \right\rvert \geq t \right] \leq 2\exp\left( \frac{-t^2 / 2}{\sum_{i}\mathbf{E}[X_{i}^2] + Mt / 3} \right).\]
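To illustrate when Bernstein's inequality helps, the sketch below compares the exponents of the two bounds (ignoring constant prefactors) for centered Bernoulli variables with a small success probability, where the variance is much smaller than the range. The setup and function names are illustrative assumptions.

```python
def hoeffding_exponent(t, ranges):
    """-2 t^2 / sum_i (b_i - a_i)^2, where ranges holds the pairs (a_i, b_i)."""
    return -2.0 * t * t / sum((b - a) ** 2 for a, b in ranges)

def bernstein_exponent(t, second_moments, M):
    """-(t^2 / 2) / (sum_i E[X_i^2] + M t / 3) for mean-zero X_i with |X_i| <= M."""
    return -(t * t / 2.0) / (sum(second_moments) + M * t / 3.0)

# Illustrative setup: X_i = B_i - p with B_i ~ Bernoulli(p), so X_i is mean-zero,
# lies in [-p, 1 - p], has E[X_i^2] = p (1 - p), and |X_i| <= max(p, 1 - p).
n, p, t = 1000, 0.01, 20.0
h = hoeffding_exponent(t, [(-p, 1.0 - p)] * n)
b = bernstein_exponent(t, [p * (1.0 - p)] * n, max(p, 1.0 - p))
print(f"Hoeffding exponent = {h:.3f}, Bernstein exponent = {b:.3f}")
```

A more negative exponent means a smaller tail bound; in this low-variance regime the Bernstein exponent is far more negative than the Hoeffding one.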

Randomized Routing on the Hypercube

Suppose we want to design a network with \(M\) nodes and a routing protocol in such a way that
1. we have relatively few edges in the network (i.e., \(O(M)\) or \(O(M \log M)\)),
2. if each node has a message to send to some other node, the messages can all be routed to their destinations in a timely manner without too much congestion on the edges.

The hypercube version of the problem is set up as follows:

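Let \(H\) be the \(n\)-dimensional hypercube. It has \(2^n\) nodes, each labeled by an \(n\)-dimensional \(0 / 1\) vector, and two nodes are adjacent if and only if their vectors differ in exactly one position. For example, \(0101\) and \(1101\) are adjacent.
Each node \(i\) holds a packet (also denoted \(i\)) that must be routed to another node \(\pi(i)\), where \(\pi : \{0,1\}^{n} \to \{0,1\}^{n}\) is a permutation.
Each edge can carry only one packet at a time, time is discrete, and packets that cannot be sent wait in a FIFO queue.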

Bit-Fixing Scheme

The bit-fixing scheme sends packet \(i\) to node \(j\) by fixing one bit at a time, going from left to right. For example,

\[i = 001010 \to 101010 \to 101000 \to 101001 = j.\]
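The bit-fixing rule is short enough to write out directly. Here is a minimal sketch that represents nodes as bit strings and reproduces the example path above; the function name `bit_fixing_path` is an illustrative choice.

```python
def bit_fixing_path(src, dst):
    """Nodes visited when the differing bits are fixed one at a time, left to right."""
    if len(src) != len(dst):
        raise ValueError("source and destination must have the same length")
    path = [src]
    cur = list(src)
    for k, bit in enumerate(dst):
        if cur[k] != bit:        # position k still disagrees with the destination
            cur[k] = bit         # cross the edge that flips bit k
            path.append("".join(cur))
    return path

print(bit_fixing_path("001010", "101001"))
# ['001010', '101010', '101000', '101001']
```

The path length equals the Hamming distance between source and destination, so it is at most \(n\).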

Question

Suppose that every packet is trying to get to \(\vec{0}\) (the all-zero string). (Yes, I know that this isn’t a permutation). Show that if every packet used the bit-fixing scheme (or, any scheme at all) to get to its destination, the total time required is at least \((2^{n} - 1)/n\) steps.

\(\vec{0}\) has \(n\) incident edges, so at most \(n\) packets can arrive at \(\vec{0}\) in any one step. Since \(2^{n}-1\) packets must reach \(\vec{0}\), the total time is at least \((2^n - 1) / n\).

Question

Suppose that \(n\) is even. Come up with an example of a permutation \(\pi\) where the bit-fixing scheme requires at least \((2^{n / 2} - 1) / (n / 2)\) steps.

Hint

Consider what happens if \((\vec{a}, \vec{b}) \in \{0,1\}^n\) wants to go to \((\vec{b}, \vec{a}) \in \{0, 1\}^n\), where \(\vec{a}, \vec{b} \in \{0,1\}^{n / 2}\), and use part 1.

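Let \(\pi((\vec{a}, \vec{b}))=(\vec{b}, \vec{a})\), where \(\vec{a}, \vec{b} \in \{0,1\}^{n / 2}\). Under bit-fixing, the packet starting at \((\vec{a}, \vec{b})\) must pass through \((\vec{b}, \vec{b})\), so all packets whose second halves agree pass through the same node. A node of the form \((\vec{b}, \vec{b})\) must therefore handle \(2^{n / 2} - 1\) packets, and they all arrive over the \(n / 2\) edges that flip a bit in the first half. By Part 1, the total time is at least \((2^{n / 2} - 1) / (n / 2)\).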
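The congestion at a node \((\vec{b}, \vec{b})\) can be confirmed by enumeration for a small hypercube. The sketch below uses \(n = 6\) and the node \(101101\) (both illustrative choices) and counts the packets whose bit-fixing path under the transpose permutation visits that node.

```python
from itertools import product

def bit_fixing_path(src, dst):
    """Nodes visited when the differing bits are fixed one at a time, left to right."""
    path, cur = [src], list(src)
    for k, bit in enumerate(dst):
        if cur[k] != bit:
            cur[k] = bit
            path.append("".join(cur))
    return path

def transpose(node):
    """The permutation (a, b) -> (b, a) that swaps the two halves of the label."""
    half = len(node) // 2
    return node[half:] + node[:half]

def packets_through(node):
    """Number of packets, other than the one starting at `node`, whose path visits `node`."""
    count = 0
    for bits in product("01", repeat=len(node)):
        src = "".join(bits)
        if src != node and node in bit_fixing_path(src, transpose(src)):
            count += 1
    return count

n = 6
hot_node = "101101"                       # of the form (b, b) with b = 101
load = packets_through(hot_node)
print(load, "packets over", n // 2, "incoming edges ->", load / (n // 2), "steps at least")
```

Here `load` equals \(2^{n/2} - 1 = 7\), matching the count in the argument above.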

A Useful Lemma

Lemma 1

Let \(D(i)\) denote the delay in the \(i\)’th packet. That is, this is the number of timesteps it spends waiting.

Let \(P(i)\) denote the path that packet \(i\) takes under the bit-fixing map. (So, \(P(i)\) is a collection of directed edges).

Let \(N(i)\) denote the number of other packets \(j\) so that \(P(j) \cap P(i) \not= \emptyset\). That is, at some point \(j\) wants to traverse an edge that \(i\) also wants to traverse, in the same direction, although possibly at some other point in time.

Then \(D(i) \leq N(i)\).

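\(\textit{Proof}.\) Note that for every packet \(j\) with \(P(i) \cap P(j) \not= \emptyset\), the intersection of its path with the path of packet \(i\) is a contiguous segment of \(P(i)\) (and also contiguous in \(P(j)\)). We show that \(j\) effectively delays \(i\) by at most \(1\).

Imagine that \(i\) issues a certificate \(c_{\ell}\) to the packet \(j\) that causes its \(\ell\)-th delay. At the moment \(j\) receives \(c_{\ell}\), it moves on to the next node and separates from \(i\). If afterwards the next node of \(j\) differs from the next node of \(i\), then \(c_{\ell}\) never appears on \(P(i)\) again. If the path of \(j\) still overlaps \(P(i)\), then \(j\) hands \(c_{\ell}\) over to the packet that blocks \(j\).

We claim that \(j\) can never hold two certificates at once. If \(j\) is not blocked, then \(i\) and \(j\) do not appear in the same queue, so no further certificate is issued to \(j\). If \(j\) is blocked, it passes its certificate to the packet currently at the head of the queue. Likewise, a packet \(j' \not= i\) can hold a certificate only while it is at the tail of a queue, and a packet \(j\) is issued its next certificate only when it becomes the head. If at that moment (when \(j\) becomes the head) \(i\) is in the same queue, the packets between \(i\) and \(j\) are not at the tail and so issue no certificate to \(j\), while the packets behind \(i\) in the queue have not blocked \(i\) and so have no certificate to issue. Hence only \(i\) can issue a new certificate to \(j\), and \(j\) never holds more than one certificate at a time. Moreover, certificates are only ever issued to packets whose paths overlap the path of \(i\). Therefore the number of certificates \(= D(i) \leq N(i)\), which proves the lemma.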

Let \(\delta: \{0,1\}^{n} \to \{0,1\}^{n}\) be a uniformly random map. We now analyze the bit-fixing scheme when each packet \(i\) is sent to \(\delta(i)\).

Fix \(\delta(i)\) and keep \(\delta(j)\) random for all \(j \not= i\). Let \(X_{j}\) be the indicator variable of the event that \(P(i)\) and \(P(j)\) intersect.

Question

Assume that we are using the bit-fixing scheme. Argue that \(\mathbf{E}\left[ \sum_{j} X_{j} \right] \leq n / 2\).

Hint

In expectation, how many directed edges are in all of the paths \(P(j)\) taken together (with repetition)? Show that this is at most \(2^n \cdot n / 2\). Then argue that the expected number of paths \(P(j)\) that any single directed edge \(e\) is in is \(1/2\). Finally, bound \(\sum_{j} X_{j} \leq \sum_{e \in P(i)}\) (number of paths \(P(j)\) that \(e\) is in) and use linearity of expectation and the fact that \(\lvert P(i)\rvert \leq n\) to bound \(\mathbf{E}\left[ \sum_{j}X_{j} \right]\).

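We have \(X_{j}=1\) exactly when some edge \(e\) lies in both \(P(i)\) and \(P(j)\). Writing \(Y_{e,j}\) for the indicator variable of the event \(e \in P(j)\), this gives \(X_{j} \leq \sum_{e \in P(i)} Y_{e,j}\), and therefore \(\sum_{j} X_{j} \leq \sum_{e \in P(i)} \sum_{j} Y_{e,j}\), where the inner sum is the number of paths \(P(j)\) that contain \(e\).

Since each \(\delta(j)\) is uniformly random, every bit of \(j\) needs fixing with probability \(1/2\), so \(\mathbf{E}[\lvert P(j) \rvert] = n / 2\) and \(\mathbf{E}\left[\sum_{j}\lvert P(j) \rvert\right] = 2^n \cdot n / 2\). The hypercube has \(2^n \cdot n\) directed edges, and each of them is used by the same expected number of paths, namely \((2^n \cdot n / 2) / (2^n \cdot n) = 1/2\). (Directly: the edge that flips bit \(k\) at node \(u\) is used by packet \(j\) exactly when \(j\) agrees with \(u\) on bits \(k, \dots, n\) and \(\delta(j)\) agrees with the endpoint of the edge on bits \(1, \dots, k\); there are \(2^{k-1}\) such \(j\), each succeeding with probability \(2^{-k}\).) With \(\delta(i)\) fixed, \(P(i)\) is a fixed set of at most \(n\) edges, so by linearity of expectation \(\mathbf{E}\left[ \sum_{j}X_{j} \right] \leq \lvert P(i) \rvert \cdot \frac{1}{2} \leq n / 2\).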
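The claim from the hint that every directed edge lies on \(1/2\) paths in expectation can be verified exactly for a small hypercube. The sketch below enumerates all source and destination pairs for \(n = 4\) (an illustrative size), counts how often each directed edge is used, and divides by the \(2^n\) equally likely destinations per source.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def bit_fixing_edges(src, dst):
    """Directed edges (u, v) crossed when fixing bits left to right from src to dst."""
    edges, cur = [], list(src)
    for k, bit in enumerate(dst):
        if cur[k] != bit:
            before = "".join(cur)
            cur[k] = bit
            edges.append((before, "".join(cur)))
    return edges

def expected_paths_per_edge(n):
    """Expected number of paths through each directed edge, for uniform random destinations."""
    nodes = ["".join(bits) for bits in product("01", repeat=n)]
    counts = Counter()
    for src in nodes:
        for dst in nodes:                 # each destination has probability 1 / 2^n
            counts.update(bit_fixing_edges(src, dst))
    return {edge: Fraction(c, len(nodes)) for edge, c in counts.items()}

loads = expected_paths_per_edge(4)
print(len(loads), set(loads.values()))
```

All \(2^n \cdot n = 64\) directed edges appear, and each carries exactly \(1/2\) path in expectation.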

Question

Use a Chernoff bound to bound the probability that \(\sum_{j} X_{j}\) is larger than \(10n\).

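Since the destinations \(\delta(j)\) are independent, the \(X_{j}\) are independent \(0/1\) RVs, and \(n/2 \geq \mathbf{E}\left[\sum_{j}X_{j}\right]\). Applying the corollary for \(c \geq \mathbf{E}[X]\) with \(c = n/2\) and deviation parameter \(19\) (so that \(1 + 19 = 20\)) gives

\[\begin{aligned} \Pr\left[ \sum_{j}X_{j} \geq 10n \right] &= \Pr\left[ \sum_{j}X_{j} \geq (1+19) \cdot \frac{n}{2} \right]\\ &\leq \left( \frac{e^{19}}{20^{20}} \right)^{n/2}\\ &\leq \left( \frac{e}{20} \right)^{10n}. \end{aligned}\]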

Question

Use your answer to the previous question to bound the probability that the bit-fixing scheme takes more than \(11n\) timesteps to send every packet \(i\) to \(\delta(i)\), assuming that the destinations \(\delta(i)\) are completely random.

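By the Lemma above, \(D(i) \leq N(i) = \sum_{j}X_{j}\). The time packet \(i\) needs to reach \(\delta(i)\) is \(T(i) = \lvert P(i) \rvert + D(i) \leq n + \sum_{j}X_{j}\), so \(\Pr[T(i) \geq 11n] \leq \Pr\left[ \sum_{j}X_{j} \geq 10n \right] \leq \left( \frac{e}{20} \right)^{10n}\). This holds for every packet \(i\) and every fixed value of \(\delta(i)\), so by a union bound over the \(2^n\) packets, the probability that some packet needs more than \(11n\) steps is at most \(2^n \left( \frac{e}{20} \right)^{10n}\).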
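To get a feel for how small these probabilities are, the sketch below evaluates the per-packet bound \((e/20)^{10n}\) on a \(\log_{10}\) scale, together with the bound obtained by a union bound over all \(2^n\) packets. Working in logarithms avoids floating-point underflow; the function names are illustrative.

```python
import math

def log10_single_packet_bound(n):
    """log10 of (e / 20)^(10 n), the bound for one fixed packet."""
    return 10.0 * n * math.log10(math.e / 20.0)

def log10_union_bound(n):
    """log10 of 2^n * (e / 20)^(10 n), a union bound over all 2^n packets."""
    return n * math.log10(2.0) + log10_single_packet_bound(n)

for n in (4, 8, 16):
    print(n, round(log10_single_packet_bound(n), 1), round(log10_union_bound(n), 1))
```

Both exponents decrease linearly in \(n\), so the failure probability is exponentially small even after the union bound.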

Question

However, the destinations are not random! They are given by some worst-case permutation \(\pi\). Using what you’ve discovered above for random destinations, develop a randomized routing algorithm that gets every packet where it wants to go, with high probability, in at most \(22n\) steps.

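Each packet \(i\) first travels to a uniformly random intermediate node \(\delta(i)\) and then from \(\delta(i)\) to \(\pi(i)\), using bit-fixing in both phases. By the result above, and by symmetry for the second phase, with high probability this algorithm finishes within at most \(11n+11n=22n\) steps.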
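The two-phase algorithm can be tried out in a small simulation. The sketch below is a simplified model under stated assumptions: nodes are integers whose most significant bit plays the role of the leftmost bit, every directed edge forwards the head of its FIFO queue once per step, and the second phase starts only after the first has finished. It compares direct bit-fixing with the two-phase scheme on the transpose permutation for \(n = 8\); all names and the seed are illustrative.

```python
import random
from collections import defaultdict, deque

def bit_fixing_route(src, dst, n):
    """Nodes visited from src to dst, fixing differing bits from the leftmost bit down."""
    path, cur = [src], src
    for k in reversed(range(n)):
        mask = 1 << k
        if (cur ^ dst) & mask:
            cur ^= mask
            path.append(cur)
    return path

def route_all(paths):
    """Synchronous simulation: each directed edge forwards the head of its FIFO queue
    once per time step. Returns the number of steps until every packet has arrived."""
    queues = defaultdict(deque)      # directed edge (u, v) -> ids of waiting packets
    position = [0] * len(paths)      # index of each packet's current node in its path
    in_flight = 0
    for pid, path in enumerate(paths):
        if len(path) > 1:
            queues[(path[0], path[1])].append(pid)
            in_flight += 1
    steps = 0
    while in_flight:
        steps += 1
        # Pop one packet per non-empty edge first, so no packet moves twice in a step.
        moved = [queue.popleft() for queue in queues.values() if queue]
        for pid in moved:
            position[pid] += 1
            path = paths[pid]
            if position[pid] == len(path) - 1:
                in_flight -= 1
            else:
                here, nxt = path[position[pid]], path[position[pid] + 1]
                queues[(here, nxt)].append(pid)
    return steps

def transpose(node, n):
    """The permutation (a, b) -> (b, a) on n-bit labels, n even."""
    half = n // 2
    return ((node & ((1 << half) - 1)) << half) | (node >> half)

def direct_steps(n, perm):
    return route_all([bit_fixing_route(i, perm[i], n) for i in range(1 << n)])

def two_phase_steps(n, perm, rng):
    size = 1 << n
    mid = [rng.randrange(size) for _ in range(size)]   # random intermediate nodes
    phase1 = route_all([bit_fixing_route(i, mid[i], n) for i in range(size)])
    phase2 = route_all([bit_fixing_route(mid[i], perm[i], n) for i in range(size)])
    return phase1 + phase2

n = 8
perm = [transpose(i, n) for i in range(1 << n)]
print("direct bit-fixing:", direct_steps(n, perm))
print("two-phase        :", two_phase_steps(n, perm, random.Random(0)), "vs 22n =", 22 * n)
```

Running the phases back to back mirrors the \(11n + 11n\) accounting in the analysis; a real implementation could let packets start the second phase as soon as they reach their intermediate node.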