Lec 6

TL;DR

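Part 1 introduces Poissonization and depoissonization as tools for estimating the maximum bin load. Part 2 introduces the power of two choices, which reduces the maximum bin load at a low cost (for example, communication cost). Both parts use the balls-in-bins problem as the running example.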

Introduction

Example 1. (Balls in Bins)

Toss \(n\) balls, each independently and uniformly at random, into one of \(m\) different bins, and then examine certain properties of the resulting allocation of balls to bins.

Many problems can be modeled as balls-in-bins problems. For example, the coupon collector problem studied earlier can be phrased as: what is the minimum value of \(n\), as a function of \(m\), such that we expect there to be zero empty bins after tossing the \(n\) balls? The birthday paradox belongs to the same family.

Here we focus on the maximum bin load.
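To make the setup concrete, the following is a minimal simulation sketch of Example 1 with \(m = n\). The helper name `toss_balls`, the seed, and the choice \(n = 10{,}000\) are illustrative assumptions, not part of the lecture.

```python
import random

def toss_balls(n_balls, n_bins, rng):
    """Return the list of bin loads after tossing each ball into a uniform random bin."""
    loads = [0] * n_bins
    for _ in range(n_balls):
        loads[rng.randrange(n_bins)] += 1
    return loads

rng = random.Random(0)          # fixed seed, chosen arbitrarily
n = 10_000
loads = toss_balls(n, n, rng)

empty_bins = loads.count(0)     # the quantity behind the coupon collector question
max_load = max(loads)           # the quantity studied in the rest of these notes
print("empty bins:", empty_bins, "max load:", max_load)
```

Both statistics are read off the same allocation, which is why so many questions reduce to this one model.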

Proposition 2.

Consider tossing \(n\) balls into \(n\) bins. There is a constant \(c\) such that with high probability, the maximum load will be at most \(c \log n / \log \log n\), for sufficiently large \(n\).

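The proof only needs elementary probability estimates and Stirling's approximation.

\(\textit{Proof.}\) It suffices to show that every \(k > c \log n / \log \log n\) satisfies \(\Pr[\text{bin 1 has load exactly }k]=o(1 / n^2)\). A union bound over the \(n\) bins and over the at most \(n\) possible values of \(k\) then shows that some bin has load \(\geq c \log n / \log \log n\) with probability \(o(1)\).

Clearly,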

\[\begin{aligned}\Pr[\text{bin 1 has load exactly }k] &\leq {n \choose k} \left( \frac{1}{n} \right)^{k} \left( 1 - \frac{1}{n} \right)^{n-k}\\ &\leq \frac{n^k}{k!} \cdot \frac{1}{n^k} \cdot \left( 1 - \frac{1}{n} \right)^{n-k}\\ &\leq \frac{1}{k!} \end{aligned}\]

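Substituting \(k! \geq \left( \frac{k}{e} \right)^k\) gives \(\frac{1}{k!} \leq \left( \frac{e}{k} \right)^k = e^{-k(\log k - 1)}\). We need this to be \(o\left( \frac{1}{n^2} \right)\); for concreteness it is enough to make it \(O\left( \frac{1}{n^3} \right)\), that is, \(k(\log k - 1) \geq 3 \log n\). Taking \(k=c \log n\) certainly works, since then \(k \log k\) is of order \(\log n \log\log n\), but we want \(k\) to be as small as possible. So try \(k = \frac{c \log n}{\log \log n}\): then \(\log k = \log\log n - \log\log\log n + \log c = (1 - o(1))\log\log n\), so \(k(\log k - 1) = (c - o(1))\log n\) and hence \(\frac{1}{k!} \leq n^{-c+o(1)}\). Dividing by \(\log\log n\) shrinks \(k\) by a factor that grows with \(n\), while the lower-order terms are absorbed into the \(o(1)\) and do not affect the bound.

Summing over the at most \(n\) values of \(k\) gives \(\Pr[\text{bin 1 has load} \geq c \log n / \log\log n] \leq n^{-c+1+o(1)}\), and a union bound over all \(n\) bins gives \(n^{-c+2+o(1)}\). Any \(c > 3\) makes this \(o(1)\), which proves the proposition.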
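As a numeric sanity check of the calculation above, the sketch below finds the smallest \(k\) with \(1/k! \leq n^{-3}\) and compares it with \(\log n / \log\log n\). The function name `smallest_k` and the sample values of \(n\) are assumptions made for illustration.

```python
import math

def smallest_k(n, power=3):
    """Smallest k with 1/k! <= n**(-power), i.e. log(k!) >= power * log(n)."""
    k = 1
    while math.lgamma(k + 1) < power * math.log(n):   # lgamma(k + 1) = log(k!)
        k += 1
    return k

for n in (10**3, 10**6, 10**9, 10**12):
    scale = math.log(n) / math.log(math.log(n))
    k = smallest_k(n)
    print(n, k, round(k / scale, 2))   # the last column is the implied constant c
```

For these values of \(n\) the implied constant stays in a narrow range, which matches the claim that the threshold scales like \(\log n / \log\log n\) rather than like \(\log n\).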

Next we use the Poisson distribution to show that the bound \(c \log n / \log \log n\) is tight.

The Poisson Distribution

Prerequisites

For \(X \leftarrow Poi(\lambda)\) and any integer \(k \geq 0\), \(\Pr[X = k] = \frac{e^{-\lambda}\lambda^{k}}{k!}\).
An alternate definition of the Poisson distribution is the limit, as \(n \to \infty\), of the binomial distribution corresponding to \(n\) independent tosses of a coin that lands heads with probability \(\lambda / n\).
Either definition yields that for \(X \leftarrow Poi(\lambda)\), \(\mathbf{E}[X]=\mathbf{Var}[X]=\lambda\).
For independent random variables \(X\) and \(Y\) with \(X \leftarrow Poi(\lambda_{1})\) and \(Y \leftarrow Poi(\lambda_{2})\), the sum \(X + Y\) is distributed according to \(Poi(\lambda_{1} + \lambda_{2})\).
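The binomial-limit definition can be checked numerically. The sketch below computes the total variation distance between \(Binomial(n, \lambda/n)\) and \(Poi(\lambda)\); the helper names, the truncation point `k_max`, and the choice \(\lambda = 2\) are illustrative assumptions.

```python
import math

def poisson_pmf(lam, k):
    """Pr[Poi(lam) = k], computed in log space for numerical stability."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def binomial_pmf(n, p, k):
    """Pr[Binomial(n, p) = k] for 0 < p < 1, computed in log space."""
    log_comb = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return math.exp(log_comb + k * math.log(p) + (n - k) * math.log1p(-p))

def tv_distance(n, lam, k_max=100):
    """Total variation distance, ignoring the negligible mass beyond min(n, k_max)."""
    p = lam / n
    gap = sum(abs(binomial_pmf(n, p, k) - poisson_pmf(lam, k))
              for k in range(min(n, k_max) + 1))
    return gap / 2

for n in (10, 100, 1_000, 10_000):
    print(n, tv_distance(n, 2.0))
```

The distance shrinks as \(n\) grows, which is the sense in which the two definitions agree.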

Fact 3.

Poisson distributions satisfy strong tail bounds: Letting \(X \leftarrow Poi(\lambda)\), for any \(c > 0\),

\[\Pr[|X - \lambda| \geq c] \leq 2e^{- \frac{c^2}{2(c+\lambda)}}.\]
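To see how much slack Fact 3 has, the sketch below compares the bound with the two-sided tail obtained by summing the Poisson probability mass function directly. The function names, the truncation at `k_max`, and the sample pairs \((\lambda, c)\) are assumptions for illustration.

```python
import math

def poisson_pmf(lam, k):
    """Pr[Poi(lam) = k], computed in log space."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def two_sided_tail(lam, c, k_max=2000):
    """Pr[|X - lam| >= c] for X ~ Poi(lam), by direct summation up to k_max."""
    return sum(poisson_pmf(lam, k) for k in range(k_max + 1) if abs(k - lam) >= c)

def fact3_bound(lam, c):
    """Right-hand side of Fact 3."""
    return 2 * math.exp(-c * c / (2 * (c + lam)))

for lam, c in [(10, 5), (10, 15), (100, 30), (100, 60)]:
    print(lam, c, two_sided_tail(lam, c), fact3_bound(lam, c))
```

In each case the summed tail should sit below the bound; the bound is convenient rather than tight.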

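Because a Poisson variable stays close to its mean with high probability, this fact lets us replace a fixed quantity (such as the total number of balls) by a Poisson random variable in our calculations. Doing so yields some very convenient properties, such as the following theorem: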

Theorem 1.

Suppose we draw \(k \leftarrow Poi(n)\), and then toss \(k\) balls uniformly at random into \(m\) bins. Then the numbers of balls in bin \(1\), bin \(2\), etc., are all independent and distributed according to \(Poi(n/m)\).

\(\textit{Proof idea.}\) First show that for \(m=2\) and any integers \(i,j \geq 0\),

\[\Pr[X_{1}=i, X_{2}=j] = \Pr[k=i+j]\Pr[Binomial(i+j, 1 / 2)=i]\]

Substituting the probability mass functions then gives

\[\Pr[k=i+j]\Pr[Binomial(i+j, 1 / 2)=i] = \Pr[Poi(n / 2)=i]\Pr[Poi(n / 2)=j]\]

Then, using \(\Pr[X_{1}=i]=\sum_{j}\Pr[X_{1}=i,X_{2}=j]=\Pr[Poi(n / 2) = i]\) and likewise \(\Pr[X_{2}=j]=\Pr[Poi(n / 2) = j]\), we get that for all \(i,j\),

\[\Pr[X_{1}=i, X_{2}=j] = \Pr[X_{1}=i]\cdot \Pr[X_{2}=j]\]

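This proves the theorem for bin 1 and bin 2. The general case follows by induction on the number of bins, treating the bins handled so far as a single merged bin.

Once the bin loads are independent, we can analyze them with Chernoff bounds.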
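The key identity in the proof idea for \(m = 2\) can be verified numerically. The sketch below evaluates both sides on a grid of \((i, j)\); the function names, the value \(n = 6\), and the grid size are illustrative assumptions.

```python
import math

def poisson_pmf(lam, k):
    """Pr[Poi(lam) = k], computed in log space."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def joint_via_total(n, i, j):
    """Pr[k = i + j] * Pr[Binomial(i + j, 1/2) = i] with k ~ Poi(n)."""
    return poisson_pmf(n, i + j) * math.comb(i + j, i) * 0.5 ** (i + j)

def product_of_marginals(n, i, j):
    """Pr[Poi(n/2) = i] * Pr[Poi(n/2) = j]."""
    return poisson_pmf(n / 2, i) * poisson_pmf(n / 2, j)

n = 6  # arbitrary mean for the total number of balls
max_gap = max(abs(joint_via_total(n, i, j) - product_of_marginals(n, i, j))
              for i in range(30) for j in range(30))
print("largest discrepancy:", max_gap)
```

The discrepancy is at the level of floating-point rounding, as expected from an exact identity.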

Proposition 4. (Coupon Collector)

Assuming we get a uniformly random one of \(n\) distinct coupons each day, letting \(X\) denote the number of days until we see at least one of each coupon, we have that for any (possibly negative) constant \(c\),

\[\lim_{n \to \infty}\Pr[X \geq n\log n + cn] = 1 - e^{-e^{-c}}.\]

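\(\textit{Proof idea.}\) Draw \(k \leftarrow Poi(n \log n + cn)\) coupons. By Theorem 1, the numbers of copies of the different coupon types are independent and each is distributed as \(Poi(\lambda)\) with \(\lambda = \log n + c\). Each type is therefore missing with probability \(e^{-\lambda} = e^{-c}/n\), and the probability that all types have been seen is \((1 - e^{-c} / n)^n \to e^{-e^{-c}}\).

Next we depoissonize. Compare drawing \(n^{0.9}\) more and \(n^{0.9}\) fewer coupons than \(n \log n + cn\). The collection can be completed at a given draw only if exactly one type is still missing and that draw hits it, which has probability at most \(\frac{1}{n}\). A union bound over the \(2n^{0.9}\) draws in this window shows that the collection is completed inside the window with probability at most \(\frac{1}{n} + \frac{1}{n} + \cdots + \frac{1}{n} = \frac{2n^{0.9}}{n} = o(1)\), that is, \(\Pr[X > n \log n + cn - n^{0.9}] = \Pr[X > n \log n + cn + n^{0.9}] + o(1)\). Since the probability that some type is still missing can only decrease as more coupons are drawn, it remains to show that the Poisson count \(k\) falls inside the window with high probability. Fact 3 with deviation \(n^{0.9}\) gives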

\[\Pr[|k - (n\log n + cn)| \geq n^{0.9}] \leq 2e^{-n^{1.8} / O(n \log n)} = o(1).\]

The claimed limit then follows by combining the two estimates. This last tail estimate can also be obtained directly from a Chernoff bound.

Proposition 5.

Consider tossing \(n\) balls into \(n\) bins. There is a constant \(c > 0\) such that with high probability, the maximum load will be at least \(c \log n/ \log \log n\).

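\(\textit{Proof idea.}\) Start from the Poisson distribution of each bin load. Since the loads are independent after Poissonization, the probability that the maximum load is less than \(b = c \log n / \log \log n\), that is, that every bin load is less than \(b\), is easy to compute as a product over the bins. To make the depoissonization step need an estimate of the form \(\Pr[k\geq n]\) rather than \(\Pr[|k-n|\geq c]\), draw \(k \leftarrow Poi\left( \frac{n}{2} \right)\) instead of \(k \leftarrow Poi(n)\) at the outset. Then use monotonicity: adding balls can only increase the maximum bin load, so the estimate for \(k\) balls carries over to \(n\) balls whenever \(k \leq n\). This trick is worth remembering for other settings.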
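The two probabilities in this argument can be evaluated directly in the Poissonized model. The sketch below does so in log space; the constant \(c = 0.5\), the value \(n = 10^6\), and the function name are illustrative assumptions and not the constants from the actual proof.

```python
import math

def prob_poisson_below(lam, b):
    """Pr[Poi(lam) < b] for a real threshold b."""
    return sum(math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
               for k in range(math.ceil(b)))

n = 10**6
c = 0.5                                   # illustrative constant only
b = c * math.log(n) / math.log(math.log(n))

# With k ~ Poi(n/2) balls, the n bin loads are independent Poi(1/2), so
# Pr[every load < b] is a product of n equal factors; keep its logarithm.
log_p_all_small = n * math.log(prob_poisson_below(0.5, b))

# Fact 3 with lambda = n/2 and deviation n/2 bounds Pr[k >= n]; keep the exponent.
log_p_too_many_balls = math.log(2) - (n / 2) ** 2 / (2 * (n / 2 + n / 2))

print("b =", b)
print("log Pr[all loads < b] =", log_p_all_small)
print("log bound on Pr[k >= n] =", log_p_too_many_balls)
```

Both logarithms are hugely negative, so with overwhelming probability \(k \leq n\) and some bin already has load at least \(b\); monotonicity then transfers the conclusion to exactly \(n\) balls.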

Note

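Since the Poisson distribution is the limit of the binomial distribution as the number of trials \(n \to \infty\) and the success probability \(p \to 0\) (with \(\lambda = np\) held fixed), it inherits some properties of the binomial distribution and additionally has independent increments (the numbers of events in disjoint intervals are independent). It is also additive and closed under thinning, which makes it very useful for random allocation problems.

Additivity: if \(X \sim Poi(\lambda_{1})\) and \(Y \sim Poi(\lambda_{2})\) are independent, then \(X+Y \sim Poi(\lambda_{1}+\lambda_{2})\).
Thinning: if each event of a Poisson process with rate \(\lambda\) is assigned to a category with probability \(p\), the count in that category is again Poisson distributed (with parameter \(\lambda p\)), and the counts of different categories are independent.

These properties let us replace a fixed quantity (such as the total number of balls) by a Poisson random variable, introducing extra randomness in order to remove dependence. The independence of the per-category counts after splitting is not shared by other distributions such as the binomial or negative binomial, which is why the Poisson distribution is such a powerful tool for analyzing random allocation problems.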

Power of Two Choices

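Propositions 2 and 5 show that when \(n\) balls are thrown independently and uniformly at random into \(n\) bins, the maximum bin load is \(\Theta\left( \frac{\log{n}}{\log \log n} \right)\) w.h.p. Given that the average bin load is \(\frac{n}{n} = 1\), this is not great. Placing every ball into an empty bin would achieve a maximum load of 1, but locating an empty bin for each ball incurs a large communication cost, so we give up some of the randomness and balance it against the communication cost. In fact, it suffices to pick 2 random bins for each ball and place the ball into the less loaded one to get a maximum load of \(\Theta(\log \log n)\). This is the power of two choices.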

Theorem 2.

Suppose we allocate \(n\) balls to \(n\) bins as follows: the balls are allocated one at a time, and for each ball, two bins are selected uniformly at random, with the ball "choosing" the least full out of these two options, breaking ties in any way. With high probability, the maximum bin load will be at most \(\log_{2}\log{n} + O(1)\).
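The gap between one choice and two choices is visible even in a small simulation. The following is a minimal sketch; the function names, the seed, the value \(n = 20{,}000\), and the tie-breaking rule (ties go to the first candidate) are illustrative assumptions, and the theorem allows any tie-breaking rule.

```python
import random

def max_load_one_choice(n, rng):
    """Each ball goes to a single uniformly random bin."""
    loads = [0] * n
    for _ in range(n):
        loads[rng.randrange(n)] += 1
    return max(loads)

def max_load_two_choices(n, rng):
    """Each ball samples two bins (with replacement) and joins the less loaded one."""
    loads = [0] * n
    for _ in range(n):
        a, b = rng.randrange(n), rng.randrange(n)
        target = a if loads[a] <= loads[b] else b   # ties go to the first candidate
        loads[target] += 1
    return max(loads)

rng = random.Random(1)      # fixed seed, chosen arbitrarily
n = 20_000
one = max_load_one_choice(n, rng)
two = max_load_two_choices(n, rng)
print("one choice:", one, "two choices:", two)
```

Each ball inspects only two bins, so the extra communication is constant per ball, yet the maximum load drops noticeably.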

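\(\textit{Proof idea.}\) The idea resembles dynamic programming: how the bin loads change at each step depends on how many bins had each load after the previous step. So let \(B(i,t)\) denote the number of bins with load \(\geq i\) after step \(t\). Start with the base case \(B(2,n) \leq \frac{n}{2}\), which holds because there are only \(n\) balls. Next, the ball at step \(t\) can create a bin of load \(\geq i\) only if both of its candidate bins already have load \(\geq i-1\), which gives the recurrence \(\mathbf{E}[B(i,t)] \leq n\left( \frac{B(i-1,t-1)}{n} \right)^2\) (e1). Setting \(\beta_{2}=\frac{n}{2}, \beta_{i}=\frac{\beta_{i-1}^{2}}{n}\), we get \(\mathbf{E}[B(i,t)] \leq \beta_{i}\) (e2). Finally, \(\beta_{i} = n / 2^{2^{i-2}}\), so at \(i^{\ast}=\log_{2} \log_{2} n + 2\) it has dropped to \(1\), and one more level takes it below a single bin.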
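The doubly exponential decay of \(\beta_{i}\) is easy to see by iterating the recurrence. A small sketch, assuming \(n = 2^{16}\) so that every value is an exact power of two; the function name is hypothetical.

```python
import math

def beta_levels(n):
    """Iterate beta_2 = n/2, beta_i = beta_{i-1}^2 / n until beta_i <= 1."""
    levels = {2: n / 2}
    i = 2
    while levels[i] > 1:
        i += 1
        levels[i] = levels[i - 1] ** 2 / n
    return levels

n = 2**16
levels = beta_levels(n)
i_star = max(levels)                      # first level at which beta_i <= 1
for i, beta in levels.items():
    print(i, beta)
print("i* =", i_star, "log2(log2(n)) + 2 =", math.log2(math.log2(n)) + 2)
```

The exponent of 2 in \(n / \beta_{i}\) doubles at every level, which is where the \(\log_{2}\log n\) in Theorem 2 comes from.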

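The argument above has two gaps:

e1: The calculation multiplies \(\frac{B(i-1,t-1)}{n} \cdot \frac{B(i-1,t-1)}{n}\) as if the events involved were independent, whereas \(B(i-1,t-1)\) is itself random and depends on all earlier allocations.
e2: The final bound is on an expectation \(\mathbf{E}\). Unless the distribution of the maximum bin load is sufficiently concentrated, this does not give a statement that holds with high probability.

Both gaps can be closed with a single lemma: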

Lemma 6.

Let \(X_{1}, \dots, X_{n}\) be a set of 0/1 random variables, and let \(Z_{1}, \dots, Z_{n}\) be a set of random variables such that \(X_{i}\) depends on \(Z_{1}, \dots, Z_{i}\). If \(\Pr[X_{i}=1 | Z_{1}, \dots, Z_{i-1}] \leq p\) for all \(i\), then for any \(c\),

\[\Pr\left[\sum_{i=1}^{n}X_{i} \geq c\right] \leq \Pr\left[Binomial(n,p) \geq c\right].\]

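For e1, apply Lemma 6 directly to obtain an upper bound. For e2, once Lemma 6 has reduced the count to a binomial variable, a Chernoff-style bound shows that the result is tightly concentrated.

\(\textit{Proof idea of Lemma 6.}\) Consider the \(X_{i}\) in order. The hypothesis says that, whatever the values of \(Z_{1}, \dots, Z_{i-1}\) on which \(X_{i}\) may depend, the probability that \(X_{i}=1\) is at most the probability that an independent \(Bernoulli(p)\) variable equals 1. Coupling each \(X_{i}\) with such a variable and inducting on \(i\) shows that \(\sum_{i} X_{i}\) is dominated by a \(Binomial(n,p)\) variable.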
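Lemma 6 can be checked exactly on a small adaptive example. In the sketch below, \(X_{i} = 1\) with probability \(p\) when the running sum is even and \(p/2\) when it is odd, so the conditional probability never exceeds \(p\). This particular process, the function names, and the values \(n = 30\), \(p = 0.3\) are hypothetical choices made only for illustration.

```python
import math

def adaptive_sum_pmf(n, p):
    """Exact pmf of X_1 + ... + X_n for the adaptive process described above."""
    pmf = [1.0] + [0.0] * n
    for _ in range(n):
        new = [0.0] * (n + 1)
        for s, mass in enumerate(pmf):
            if mass == 0.0:
                continue
            q = p if s % 2 == 0 else p / 2    # conditional success probability, always <= p
            new[s] += mass * (1 - q)
            new[s + 1] += mass * q
        pmf = new
    return pmf

def binomial_pmf_list(n, p):
    """pmf of Binomial(n, p) as a list indexed by the number of successes."""
    return [math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

def upper_tail(pmf, c):
    """Pr[sum >= c] for an integer threshold c."""
    return sum(pmf[c:])

n, p = 30, 0.3
adaptive = adaptive_sum_pmf(n, p)
binom = binomial_pmf_list(n, p)
for c in (5, 10, 15):
    print(c, upper_tail(adaptive, c), upper_tail(binom, c))
```

For every threshold the adaptive tail should be no larger than the binomial tail, even though the \(X_{i}\) are dependent.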

Beyond Two Choices

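Does it help to increase the number of candidates at each step? A little: the improvement is only a constant factor, namely the base of the logarithm changes from 2 to the number of candidates \(d\).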

Theorem 3.

Suppose we allocate \(n\) balls to \(n\) bins as follows: the balls are allocated one at a time, and for each ball, \(d\) bins are selected uniformly at random, with the ball "choosing" the least full out of these \(d\) options, breaking ties in any way. With high probability, the maximum bin load will be at most \(\log_{d}\log{n} + O(1) = \frac{\log{\log{n}}}{\log{d}} + O(1)\).
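A simulation sketch for general \(d\) shows how small the additional gain is beyond \(d = 2\). The function name, the seed, the value \(n = 20{,}000\), and the tie-breaking rule (the earliest sampled candidate wins) are illustrative assumptions.

```python
import random

def max_load_d_choices(n, d, rng):
    """Each ball samples d bins (with replacement) and joins the least loaded one."""
    loads = [0] * n
    for _ in range(n):
        candidates = [rng.randrange(n) for _ in range(d)]
        target = min(candidates, key=loads.__getitem__)   # first minimum wins ties
        loads[target] += 1
    return max(loads)

rng = random.Random(2)      # fixed seed, chosen arbitrarily
n = 20_000
results = {d: max_load_d_choices(n, d, rng) for d in (1, 2, 4, 8)}
for d, load in results.items():
    print("d =", d, "max load =", load)
```

The large drop happens between \(d = 1\) and \(d = 2\); further candidates change the maximum load by at most a small constant at this scale, consistent with the \(\frac{\log\log n}{\log d}\) bound.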