This short lecture discusses the paper by Ma et al. (2024) (Microsoft) about 1-bit LLMs.
Current LLMs store their weights in half precision, i.e. 2 bytes per weight (FP16, BF16), or in full precision, i.e. 4 bytes per weight (FP32).
What would be the implications of having all the weights of an LLM in $\{-1, 0, 1\}$?
At the core of LLMs are matrix multiplications. Having all weights in $\{-1, 0, 1\}$ removes the need for multiplications: only additions and subtractions remain (see the code sketch below). → Need for new, dedicated hardware.
Reduced latency and memory, with perplexity matching (or even beating) the full-precision baseline once the model is large enough (≈3B parameters) → Reduction in energy consumption.
Source: https://arxiv.org/pdf/2402.17764v1
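To make the "no multiplications" point above concrete, here is a minimal sketch in plain Python (illustrative only, not code from the paper; the function name is ours): a dot product with weights restricted to $\{-1, 0, 1\}$ reduces to additions and subtractions.

```python
# Illustrative sketch (not from the paper): a dot product with ternary
# weights in {-1, 0, 1} needs no multiplications, only adds and subtracts.

def ternary_dot(weights, activations):
    """Dot product where every weight is -1, 0, or +1."""
    acc = 0.0
    for w, x in zip(weights, activations):
        if w == 1:       # +1 weight: add the activation
            acc += x
        elif w == -1:    # -1 weight: subtract the activation
            acc -= x
        # w == 0: the feature is filtered out, nothing to accumulate
    return acc

print(ternary_dot([1, 0, -1, 1], [0.5, 2.0, -1.5, 3.0]))  # 0.5 + 1.5 + 3.0 = 5.0
```

This is the argument for new hardware: the floating-point multiply-accumulate units that dominate the cost of matrix multiplication could be replaced by much cheaper integer additions.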
Significant reductions in memory usage and latency.
Energy reduction. On the left: the energy components of arithmetic operations at a 7nm process node. On the right: the end-to-end energy cost across different model sizes.
Using two 80GB A100 cards, they increased the batch size until the GPU memory limit was reached. The throughput (the rate at which the model can process tokens) was 8.9 times higher for the BitNet b1.58 model.
Why do we use the term “1.58 bits”?
This comes from the Shannon entropy formula:
$$ H(X) = - \sum_{x \in \mathcal{X}} p(x) \log_b p(x) $$
In this case, since we want to encode a variable with 3 possible states $(-1,0,1)$ with equal probability, we have:
$$ -3 \times \frac{1}{3} \log_2(\frac{1}{3}) \simeq 1.584962500... $$
This means that we require around 1.58 bits to encode a variable $X$ which can take 3 values.
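A one-line check of this arithmetic with Python's math module:

```python
import math

# Entropy of a uniform 3-state variable: -3 * (1/3) * log2(1/3) = log2(3).
print(-3 * (1 / 3) * math.log2(1 / 3))  # 1.584962500721156
print(math.log2(3))                     # same value
```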
They trained BitNet b1.58 from scratch, replacing all the `nn.Linear` layers with `BitLinear`. The activations, however, are quantized to 8 bits, along the lines of the sketch below.
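A minimal PyTorch sketch of per-token absmax quantization of the activations to 8 bits, in the spirit of the BitNet papers; the function name, the epsilon, and the exact clipping bounds are assumptions on our part, see the paper for the precise formulation.

```python
import torch

def quantize_activations_8bit(x, eps=1e-5):
    """Per-token absmax quantization to 8-bit values (sketch).

    Each row (token) is scaled so that its largest absolute value maps to
    127, then rounded; the scale is kept to dequantize the output later.
    """
    scale = 127.0 / x.abs().max(dim=-1, keepdim=True).values.clamp(min=eps)
    x_q = (x * scale).round().clamp(-128, 127)
    return x_q, scale

x = torch.randn(2, 8)           # two "tokens" with 8 features each
x_q, scale = quantize_activations_8bit(x)
x_deq = x_q / scale             # approximate reconstruction
print((x - x_deq).abs().max())  # quantization error is small
```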
BitNet b1.58 is based on BitNet (Wang et al., 2023), which encoded weights with only two values (+1 or −1); the additional 0 value allows for feature filtering, which improves performance.
Note that they also introduce a new quantization function (absmean quantization) to map the weights into $\{-1, 0, 1\}$.
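The absmean quantization scales the weight matrix by its mean absolute value and then rounds every entry to the nearest value in $\{-1, 0, 1\}$. A PyTorch sketch (the function name and the epsilon safeguard are ours):

```python
import torch

def quantize_weights_ternary(w, eps=1e-5):
    """Absmean-style ternary quantization (sketch).

    Scale the whole weight matrix by its mean absolute value, then round
    every entry to the nearest value in {-1, 0, 1}.
    """
    gamma = w.abs().mean().clamp(min=eps)   # scaling factor (mean |w|)
    w_q = (w / gamma).round().clamp(-1, 1)  # RoundClip to {-1, 0, 1}
    return w_q, gamma

w = torch.randn(4, 4)
w_q, gamma = quantize_weights_ternary(w)
print(w_q)  # every entry is -1, 0, or +1
```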
The quantization of the activations (before the non-linear functions) also differs from the original BitNet. We refer the student to the original paper for the details.