Here are some readings on the currently trendy LLMs. I collected these papers myself; the list is still being updated.
Attention/Transformer
The goal is to encode sequential data: $x_1 \to x_2 \to \ldots \to x_t$.
As usual, since $x_i$ is discrete, $x_i \in \mathcal{V}$. Each $v \in \mathcal{V}$ is represented as a trainable vector. In the Transformer, three vectors are derived from this embedding (via learned linear projections):
- $\textbf{q} \in \mathbb{R}^{d_k}$
- $\textbf{k} \in \mathbb{R}^{d_k}$
- $\textbf{v} \in \mathbb{R}^{d_v}$
This way, a sentence is encoded by 3 matrices $\textbf{Q} \in \mathbb{R}^{t \times d_k}, \textbf{K} \in \mathbb{R}^{t \times d_k}, \textbf{V} \in \mathbb{R}^{t \times d_v}$.
Next idea: the representation of $x_i$ is a convex combination of the value vectors of the other positions (in the decoder, only positions $\ell \le i$; in the encoder, the whole sentence). Concretely, \(\text{attn}_i = \sum_{\ell =1}^{t} a_{i\ell} \textbf{v}_\ell = \textbf{a}_i^{\sf T} \textbf{V}, \quad \textbf{a}_i \in \mathbb{R}^{t}, \; a_{i\ell} \ge 0, \; \sum_\ell a_{i\ell} = 1\)
Now the coefficients $a_{i\ell}$ must be learned somehow. Attention suggests \(\begin{align*} &\widetilde{\textbf{a}}_i = [\widetilde{a}_{i1}, \ldots , \widetilde{a}_{it}] = [\textbf{q}_i^{\sf T} \textbf{k}_1, \ldots , \textbf{q}_i^{\sf T} \textbf{k}_\ell, \ldots , \textbf{q}_i^{\sf T} \textbf{k}_t] = \textbf{q}_i^{\sf T} \textbf{K}^{\sf T} \\ &\widetilde{\textbf{A}} = [\widetilde{\textbf{a}}_1; \ldots ; \widetilde{\textbf{a}}_t] = \textbf{Q} \textbf{K}^{\sf T} \in \mathbb{R}^{t \times t} \\ &\textbf{A} = \text{softmax} (\widetilde{\textbf{A}}) \triangleq [\text{softmax}(\widetilde{\textbf{a}}_1); \ldots ; \text{softmax}(\widetilde{\textbf{a}}_t)] \in \mathbb{R}^{t \times t} \end{align*}\) where the softmax is applied row-wise (the paper additionally scales the scores by $1/\sqrt{d_k}$). Each row $\textbf{a}_i$ represents the distribution of "attention" that word $i$ pays over the whole sentence.
So, taking everything as matrices, we have \(\begin{align*} \text{attention} &= \textbf{A} \textbf{V}, \quad \textbf{A} \in \mathbb{R}^{t \times t} \\ &= \text{softmax}\!\left(\frac{\textbf{Q} \textbf{K}^{\sf T}}{\sqrt{d_k}}\right) \textbf{V} \in \mathbb{R}^{t \times d_v} \end{align*}\)
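As a sanity check, here is a minimal PyTorch sketch of the computation above (tensor names are mine; the $1/\sqrt{d_k}$ scaling is the one used in the paper):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (t, d_k), V: (t, d_v). Returns (t, d_v)."""
    d_k = Q.size(-1)
    scores = Q @ K.T / d_k**0.5          # (t, t), row i = scores of query i
    A = F.softmax(scores, dim=-1)        # row-wise softmax: each row sums to 1
    return A @ V                         # convex combination of the value vectors

t, d_k, d_v = 5, 8, 16
Q, K, V = torch.randn(t, d_k), torch.randn(t, d_k), torch.randn(t, d_v)
out = scaled_dot_product_attention(Q, K, V)   # (5, 16)
```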
Multi-head Attention
Then, in order to allow for multiple learned attention patterns, each word is now represented with $H$ different triples $(\textbf{Q}_h, \textbf{K}_h, \textbf{V}_h)$: \(\text{attention} (\textbf{Q}\textbf{W}_h^{(1)}, \textbf{K}\textbf{W}_h^{(2)}, \textbf{V}\textbf{W}_h^{(3)}), \quad h=1, \ldots , H\) Then, as usual, everything is concatenated and fed into a final FC layer.
Motivation: each head can learn to attend to a different kind of relation (different positions, different representation subspaces), instead of averaging everything into a single attention pattern.
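A minimal sketch of multi-head attention along the same lines; the fused per-projection linear layers and the dimensions are my own choices, not anything prescribed by the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        assert d_model % n_heads == 0
        self.h, self.d_head = n_heads, d_model // n_heads
        # all heads' projections fused into one linear layer each for Q, K, V
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)   # final FC after concatenation

    def forward(self, x):                        # x: (t, d_model)
        t = x.size(0)
        # project, then split the last dim into heads: (h, t, d_head)
        q = self.W_q(x).view(t, self.h, self.d_head).transpose(0, 1)
        k = self.W_k(x).view(t, self.h, self.d_head).transpose(0, 1)
        v = self.W_v(x).view(t, self.h, self.d_head).transpose(0, 1)
        scores = q @ k.transpose(-2, -1) / self.d_head**0.5   # (h, t, t)
        A = F.softmax(scores, dim=-1)
        heads = A @ v                                          # (h, t, d_head)
        concat = heads.transpose(0, 1).reshape(t, -1)          # (t, d_model)
        return self.W_o(concat)

mha = MultiHeadAttention()
y = mha(torch.randn(10, 64))    # (10, 64)
```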
Order Encoding
Now, since the representation is just a convex combination of a set of vectors, there is no notion of order. Hence the order information must be injected into the token vectors themselves; the Transformer does this by adding a positional encoding to the input embeddings.
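One concrete choice in the original paper is the fixed sinusoidal encoding added to the input embeddings; a minimal sketch:

```python
import torch

def sinusoidal_positional_encoding(t, d_model):
    """Returns a (t, d_model) matrix with PE[pos, 2i]   = sin(pos / 10000^(2i/d))
    and PE[pos, 2i+1] = cos(pos / 10000^(2i/d)), as in 'Attention Is All You Need'."""
    pos = torch.arange(t, dtype=torch.float32).unsqueeze(1)            # (t, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)               # (d_model/2,)
    angles = pos / torch.pow(torch.tensor(10000.0), i / d_model)       # (t, d_model/2)
    pe = torch.zeros(t, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

x = torch.randn(10, 64)                              # token embeddings
x = x + sinusoidal_positional_encoding(10, 64)       # order information injected
```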
So that’s basically it.
Flamingo: a visual language model for few-shot learning
Task:
Mixing text and images, the task is to predict the next word token with a pretrained LLM. The vision input goes through a pretrained feature extractor and then a trainable network that produces a fixed-length representation for each image/video input.
The dataset is crawled from webpages; each image is replaced by a special token in the text.
The vision module produces a fixed number of tokens; the language model attends to them (through interleaved gated cross-attention layers) much like it attends to word tokens. A sketch of such a module is below.
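Roughly, the module that maps a variable number of visual features to a fixed number of tokens can be sketched as a set of learned latent queries cross-attending to the features (a simplified, Perceiver-Resampler-style sketch; the sizes and layer layout are my assumptions, not Flamingo's exact architecture):

```python
import torch
import torch.nn as nn

class VisualResampler(nn.Module):
    """Maps (n_features, d) visual features to a fixed set of n_latents tokens."""
    def __init__(self, d=512, n_latents=64, n_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, d))    # learned queries
        self.xattn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, visual_feats):             # (batch, n_features, d), n_features varies
        b = visual_feats.size(0)
        q = self.latents.unsqueeze(0).expand(b, -1, -1)            # (b, n_latents, d)
        out, _ = self.xattn(q, visual_feats, visual_feats)         # latents attend to features
        return out + self.ff(out)                # always (b, n_latents, d) visual "tokens"

tokens = VisualResampler()(torch.randn(2, 197, 512))   # e.g. ViT patch features -> (2, 64, 512)
```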
Method
Input example:
In more detail…
Data collection:
- 43 million webpages. Sample a random subsequence of $L = 256$ tokens and take up to the first $N = 5$ images included in the sampled sequence.
- For image-text pairs:
- ALIGN [50] dataset contains 1.8 billion images paired with alt-text
- LTIP dataset consists of 312 million image and text pairs
- VTP dataset contains 27 million short videos (approximately 22 seconds on average) paired with sentence descriptions
- beam search for decoding
Evaluation
- What can it do? It can learn to perform new tasks quickly via in-context learning, as used in GPT-3 (an illustrative prompt layout is sketched after this list).
- Few-shot learning: using only 4 examples per task.
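For intuition, a few-shot prompt is just an interleaved sequence of images and text. A hypothetical 2-shot layout might look roughly like this (illustrative only, not the paper's exact template; `<image>` marks where each image's visual tokens go):

```python
# Hypothetical layout of an interleaved few-shot prompt (not Flamingo's exact template).
prompt = (
    "<image> Question: What is in the photo? Answer: a flamingo.\n"   # shot 1
    "<image> Question: What is in the photo? Answer: a dog.\n"        # shot 2
    "<image> Question: What is in the photo? Answer:"                 # the model completes this
)
```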
LLM knowledge retrieval
Setting: Given a dataset of text pairs $(x, y)$, e.g. $x$: a question, $y$: its answer.
Idea
- Model: receives a sequence $x$ and outputs a predicted sequence $\widehat{y}$.
An LLM contains knowledge somehow and can be seen as having a parametric memory. Let's extend that by adding a non-parametric external memory, in this case built from Wikipedia. So given, for example, a question, the model uses its internal knowledge, retrieves external documents, combines the two, and generates an answer.
More concretely, the authors propose a probabilistic model that treats the retrieved document $z$ as a latent variable, with two ways to marginalize over it (approximately, using the top-$k$ retrieved documents): the RAG-Sequence model and the RAG-Token model, \(\begin{align*} p_{\text{RAG-Sequence}}(y \mid x) &\approx \sum_{z \in \text{top-}k} p_\eta(z \mid x) \prod_{i} p_\theta(y_i \mid x, z, y_{1:i-1}) \\ p_{\text{RAG-Token}}(y \mid x) &\approx \prod_{i} \sum_{z \in \text{top-}k} p_\eta(z \mid x)\, p_\theta(y_i \mid x, z, y_{1:i-1}) \end{align*}\) i.e. RAG-Sequence uses the same document for the whole output sequence, while RAG-Token can draw on a different document at each output position.
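A toy sketch of the two marginalizations, given log retrieval scores and per-document token log-probabilities (the shapes and random tensors here are placeholders; in the real model these come from the retriever and the generator):

```python
import torch

k, T, vocab = 5, 7, 50      # top-k docs, target length, toy vocab size
doc_logp = torch.log_softmax(torch.randn(k), dim=-1)             # log p(z | x)
tok_logp = torch.log_softmax(torch.randn(k, T, vocab), dim=-1)   # log p(y_i | x, z, y_<i)
y = torch.randint(vocab, (T,))                                   # target tokens
y_logp = tok_logp[:, torch.arange(T), y]                         # (k, T): target-token log-prob per doc

# RAG-Sequence: the same document is used for the whole output sequence.
rag_sequence_logp = torch.logsumexp(doc_logp + y_logp.sum(dim=1), dim=0)

# RAG-Token: marginalize over documents at every output position.
rag_token_logp = torch.logsumexp(doc_logp.unsqueeze(1) + y_logp, dim=0).sum()

# Training minimizes the negative of these marginal log-likelihoods.
```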
Diving into the model architecture:
- The generator: BART-large, 400M parameters. Its input is the concatenation of $x$ with a retrieved latent document $z$ (the marginalization above handles the top-$k$ documents). This BART-large model provides the 'parametric memory'.
- The retriever: DPR, a BERT-based query encoder plus a BERT-based document encoder; documents are retrieved by maximum inner-product search over a dense index of Wikipedia. This index is the 'non-parametric memory'.
Both the query encoder and the generator are trained (the document encoder and the index are kept fixed). The training objective is the marginal log-likelihood of the target, as usual in sequence generation.
Thoughts?
- Knowledge vs overfitting?
- What could be extended?
- Offer evidence like in Bing.
- Instead of using Wiki, get the top 5 articles from a Google search and feed them to the BART generator. Or, more generally, hot-swap the memory: why replace the whole Wikipedia index instead of just substituting the relevant articles?