These are reading notes on the trendy topic of LLMs. I collected some papers myself here; the list is still being updated.

Attention/Transformer

The goal is to encode sequential data: $x_1 \to x_2 \to \ldots \to x_t$.

As usual, $x_i$ is discrete: $x_i \in \mathcal{V}$. Each $v \in \mathcal{V}$ is represented by a trainable embedding vector. In the Transformer, this vector is mapped (via learned projections) to 3 vectors:

  • $\textbf{q} \in \mathbb{R}^{d_k}$
  • $\textbf{k} \in \mathbb{R}^{d_k}$
  • $\textbf{v} \in \mathbb{R}^{d_v}$

This way, a sentence is encoded by 3 matrices $\textbf{Q} \in \mathbb{R}^{t \times d_k}, \textbf{K} \in \mathbb{R}^{t \times d_k}, \textbf{V} \in \mathbb{R}^{t \times d_v}$.

Next idea: the representation of $x_i$ is a convex combination of the value vectors of the sequence, that is, \(\text{atten}_i = \sum_{\ell =1}^{t} a_\ell \textbf{v}_\ell = \textbf{a}^{\sf T} \textbf{V}, \quad \textbf{a} \in \mathbb{R}^{t}.\) (In the causal/decoder setting, positions $\ell > i$ are masked out, so the sum effectively stops at $i$.)

Now the coefficients $a_\ell$ must be learned somehow. Attention suggests \(\begin{align*} &\widetilde{\textbf{a}}_i = [\widetilde{a}_1, \ldots , \widetilde{a}_t] = [\textbf{q}_i^{\sf T} \textbf{k}_1, \ldots , \textbf{q}_i^{\sf T} \textbf{k}_\ell, \ldots , \textbf{q}_i^{\sf T} \textbf{k}_t] = \textbf{q}_i^{\sf T} \textbf{K}^{\sf T} \\ &\widetilde{\textbf{A}} = [\widetilde{\textbf{a}}_1; \ldots ; \widetilde{\textbf{a}}_t] = \textbf{Q} \textbf{K}^{\sf T} \in \mathbb{R}^{t \times t} \\ &\textbf{A} = \text{softmax} \big(\widetilde{\textbf{A}}/\sqrt{d_k}\big) \triangleq [\text{softmax}(\widetilde{\textbf{a}}_1/\sqrt{d_k}); \ldots ; \text{softmax}(\widetilde{\textbf{a}}_t/\sqrt{d_k})] \in \mathbb{R}^{t \times t} \end{align*}\) Each row $\textbf{a}_i$ of $\textbf{A}$ represents the distribution of “attention” that word $i$ pays over the whole sentence; the $1/\sqrt{d_k}$ scaling just keeps the logits well-behaved as $d_k$ grows.

So, taking everything as matrices, we have \(\begin{align*} \text{attention} &= \textbf{A}\, \textbf{V}, \quad \textbf{A} \in \mathbb{R}^{t \times t} \\ &= \text{softmax}\!\left(\frac{\textbf{Q} \textbf{K}^{\sf T}}{\sqrt{d_k}}\right) \textbf{V} \in \mathbb{R}^{t \times d_v}. \end{align*}\)
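To make the shapes concrete, here is a minimal NumPy sketch of this single-head scaled dot-product attention (the variable names are mine, not from the paper):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention.
    Q: (t, d_k), K: (t, d_k), V: (t, d_v)  ->  output: (t, d_v)"""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (t, t) matrix of q_i^T k_j / sqrt(d_k)
    A = softmax(scores, axis=-1)      # each row is a distribution over the sentence
    return A @ V                      # convex combination of the value vectors

# toy example: a "sentence" of t = 4 tokens
t, d_k, d_v = 4, 8, 16
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(t, d_k)), rng.normal(size=(t, d_k)), rng.normal(size=(t, d_v))
out = attention(Q, K, V)              # shape (4, 16)
```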

Multi-head attention

Then, in order to allow for multiple learned attention patterns, each word is now represented by $H$ different triples, obtained via learned projections: \(\text{head}_h = \text{attention} (\textbf{Q}\textbf{W}_h^{(1)}, \textbf{K}\textbf{W}_h^{(2)}, \textbf{V}\textbf{W}_h^{(3)}), \quad h=1, \ldots , H.\) Then, as usual, everything is concatenated and fed to a final FC layer.
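Continuing the NumPy sketch above (reusing `attention`; the per-head projection matrices and the output layer are random stand-ins here, just to show the shapes):

```python
def multi_head_attention(Q, K, V, H=2, d_head=4):
    """Project Q, K, V into H low-dimensional heads, attend in each head,
    then concatenate and map back with a final fully connected layer."""
    t, d_k = Q.shape
    d_v = V.shape[-1]
    rng = np.random.default_rng(1)
    heads = []
    for h in range(H):
        W1 = rng.normal(size=(d_k, d_head))   # stand-in for W_h^{(1)}
        W2 = rng.normal(size=(d_k, d_head))   # stand-in for W_h^{(2)}
        W3 = rng.normal(size=(d_v, d_head))   # stand-in for W_h^{(3)}
        heads.append(attention(Q @ W1, K @ W2, V @ W3))   # (t, d_head)
    concat = np.concatenate(heads, axis=-1)   # (t, H * d_head)
    W_O = rng.normal(size=(H * d_head, d_v))  # final FC layer
    return concat @ W_O                       # (t, d_v)
```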

Motivation: different heads can attend to different kinds of relations, each in its own lower-dimensional subspace, instead of forcing a single attention pattern to capture everything.

Order Encoding

Now, as the representation is just a convex combination of a set of vectors, there is no notion of order. Hence the order information must be injected into the input vectors themselves, which the Transformer does by adding a positional encoding to each token embedding.
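For reference, a minimal sketch of the sinusoidal positional encoding used in the original paper (added to the embeddings before attention):

```python
import numpy as np

def positional_encoding(t, d_model):
    """Sinusoidal positional encodings, shape (t, d_model); d_model must be even."""
    pos = np.arange(t)[:, None]                      # (t, 1)
    i = np.arange(d_model // 2)[None, :]             # (1, d_model / 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((t, d_model))
    pe[:, 0::2] = np.sin(angles)                     # even dimensions
    pe[:, 1::2] = np.cos(angles)                     # odd dimensions
    return pe

# embeddings = token_embeddings + positional_encoding(t, d_model)
```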

So that’s basically it.

Flamingo: a visual language model for few-shot learning

Task:

Mix text and images and predict the next word token. The language side is a pretrained LLM; the vision input goes through a pretrained feature extractor, then through a trainable network that produces a fixed-size set of vectors for each image/video input.

The dataset is crawled from webpages; each image is replaced by a special token in the text.

The vision module produces a fixed number of tokens. These tokens are treated like word tokens.
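A hypothetical sketch of how an interleaved webpage could be flattened into a single sequence, with each image standing in the text as a special `<image>` token whose visual tokens come from the vision module (`tokenize` and `vision_module` are placeholders, not Flamingo's actual API):

```python
def interleave(page, tokenize, vision_module):
    """page: list of ("text", str) or ("image", img) chunks in reading order.
    Returns word-token ids with <image> markers, plus the visual tokens
    (a fixed number per image) produced by the vision module."""
    token_ids, visual_tokens = [], []
    for kind, content in page:
        if kind == "text":
            token_ids.extend(tokenize(content))
        else:  # image or video: appears in the text as one special token
            token_ids.append(tokenize("<image>")[0])
            visual_tokens.append(vision_module(content))
    return token_ids, visual_tokens
```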

Method

Input example:

In more detail …

Data collection:

  • 43 million webpages. Sample a random subsequence of $L = 256$ tokens and take up to the first $N = 5$ images included in the sampled sequence (see the sketch after this list)
  • For image text pairs,
    • ALIGN [50] dataset contains 1.8 billion images paired with alt-text
    • LTIP dataset consists of 312 million image and text pairs
    • VTP dataset contains 27 million short videos (approximately 22 seconds on average) paired with sentence descriptions
  • beam search for decoding
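A small sketch of that sampling rule, assuming the webpage has already been flattened into token ids with recorded image positions (my own illustration, not the paper's code):

```python
import random

def sample_training_sequence(token_ids, image_positions, L=256, N=5):
    """Take a random L-token window, then keep up to the first N images
    whose <image> markers fall inside that window."""
    start = random.randint(0, max(0, len(token_ids) - L))
    window = token_ids[start:start + L]
    images_in_window = [p for p in image_positions if start <= p < start + L]
    return window, images_in_window[:N]
```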

Evaluation

  • What can it do? It can learn to perform new tasks pretty quickly using “in-context learning”, like what has been used in GPT-3.

  • Few-shot learning: using only 4 examples
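A hypothetical example of how such a 4-shot prompt could be assembled, interleaving each support image with its expected output before the query image (the exact prompt format is my guess, not quoted from the paper):

```python
def build_few_shot_prompt(support, query_marker="<image> Output:"):
    """support: list of (image, expected_output) pairs used as in-context examples."""
    parts = [f"<image> Output: {text}" for _, text in support]
    parts.append(query_marker)   # the query image; the model continues from here
    return " ".join(parts)

# e.g. 4 captioning shots:
# prompt = build_few_shot_prompt([(img1, "a cat on a sofa"), (img2, "two dogs"),
#                                 (img3, "a red bus"), (img4, "people at a market")])
```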

LLM knowledge retrieval

Setting: given a dataset of text pairs $(x, y)$, e.g. $x$: a question, $y$: its answer.

Idea

  • Model: receives a sequence $x$ and outputs a predicted sequence $\widehat{y}$

An LLM contains knowledge somehow and can be seen as having a parametric memory. Let's extend that by adding a non-parametric external memory, in this case built from Wikipedia. So given, for example, a question, the model uses its internal knowledge, retrieves external resources, combines the two, and generates an answer.
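A hedged sketch of that retrieve-then-generate loop (`encode_query`, `retriever`, and `generator` are placeholders for DPR-style and BART-style components, not the exact RAG API):

```python
def rag_answer(x, encode_query, retriever, generator, k=5):
    """Retrieve the top-k documents for question x, then let the generator
    condition on x concatenated with each retrieved document."""
    q = encode_query(x)                        # dense query vector
    docs, scores = retriever.top_k(q, k)       # external (non-parametric) memory lookup
    candidates = [generator.generate(f"{x} [SEP] {d}") for d in docs]
    # in RAG the candidates are combined by marginalizing over the documents,
    # weighting each by its retrieval score p(z | x)
    return candidates, scores
```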

More concretely, the authors propose a probabilistic model with two ways to do (approximate) inference: the RAG-Sequence model and the RAG-Token model.
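For reference, the two marginalizations over the retrieved (latent) document $z$, with $p_\eta(z \mid x)$ the retriever and $p_\theta$ the generator: \(\begin{align*} p_{\text{RAG-Sequence}}(y \mid x) &\approx \sum_{z \in \text{top-}k} p_\eta(z \mid x) \prod_{i} p_\theta(y_i \mid x, z, y_{1:i-1}) \\ p_{\text{RAG-Token}}(y \mid x) &\approx \prod_{i} \sum_{z \in \text{top-}k} p_\eta(z \mid x)\, p_\theta(y_i \mid x, z, y_{1:i-1}) \end{align*}\)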

  • Diving into the model architecture:

    • The retriever: a DPR-style dense retriever (a query encoder plus a pre-built index of Wikipedia passages); this provides the ‘non-parametric memory’.
    • The generator: BART-large, 400M parameters. Its input is the concatenation of $x$ and a retrieved latent document $z$ (one of the top-k). This BART-large model accounts for the ‘parametric memory’.
  • Train both the query encoder and the generator (the document encoder and index stay fixed). The training objective is the marginal log-likelihood of the target, as usual in sequence generation.
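A minimal sketch of that marginal negative log-likelihood in RAG-Sequence form, assuming we already have the retrieval log-probabilities and the generator's per-document sequence log-likelihoods (PyTorch here just for `logsumexp`):

```python
import torch

def rag_sequence_nll(log_p_z_given_x, log_p_y_given_xz):
    """log_p_z_given_x:  (k,) log p(z | x) over the top-k retrieved documents
    log_p_y_given_xz: (k,) log p(y | x, z) from the generator for each document
    Returns -log sum_z p(z | x) * p(y | x, z), marginalizing the latent document."""
    return -torch.logsumexp(log_p_z_given_x + log_p_y_given_xz, dim=0)

# example with k = 3 retrieved documents
loss = rag_sequence_nll(torch.log_softmax(torch.randn(3), dim=0),
                        torch.tensor([-12.0, -9.5, -11.2]))
```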

Thoughts?

  • Knowledge vs overfitting?
  • What could be extended?
    • Offer supporting evidence/citations, like Bing does.
    • Instead of using Wiki, get the top 5 articles from a Google search and feed them to the generator (BART). Or, more generally, hot-swap the memory? Why do they have to replace the whole Wikipedia index instead of substituting just the relevant articles?