Tri Nguyen

Some random papers on LLMs

Sep 25, 2023

We have readings on the trendy topic of LLMs. I collected some papers myself here. The list is still being updated.

Attention/Transformer

The goal is to encode sequential data: $x_1 \to x_2 \to \ldots \to x_t$.

As usual, since $x_i$ is discrete, $x_i \in \mathcal{V}$. Each $v \in \mathcal{V}$ is represented as a trainable vector. In the Transformer, this vector is partitioned into 3 disjoint parts: a query vector $\textbf{q} \in \mathbb{R}^{d_k}$, a key vector $\textbf{k} \in \mathbb{R}^{d_k}$, and a value vector $\textbf{v} \in \mathbb{R}^{d_v}$.

This way, a sentence is encoded by 3 matrices $\textbf{Q} \in \mathbb{R}^{t \times d_k}, \textbf{K} \in \mathbb{R}^{t \times d_k}, \textbf{V} \in \mathbb{R}^{t \times d_v}$.
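A minimal NumPy sketch of this bookkeeping (all names and sizes below are made up for illustration; standard Transformers usually obtain $\textbf{q}, \textbf{k}, \textbf{v}$ as learned linear projections of a shared embedding rather than a literal partition, but the result is the same three matrices per sentence):

```python
import numpy as np

rng = np.random.default_rng(0)
t, d_k, d_v = 5, 8, 16        # toy sentence length and dimensions
vocab_size = 100

# One trainable vector per vocabulary item, holding the three disjoint
# parts described above: [ query | key | value ].
embedding = rng.normal(size=(vocab_size, d_k + d_k + d_v))

token_ids = rng.integers(0, vocab_size, size=t)   # a toy "sentence"
vectors = embedding[token_ids]                    # (t, d_k + d_k + d_v)

Q = vectors[:, :d_k]               # (t, d_k)
K = vectors[:, d_k:2 * d_k]        # (t, d_k)
V = vectors[:, 2 * d_k:]           # (t, d_v)
print(Q.shape, K.shape, V.shape)   # (5, 8) (5, 8) (5, 16)
```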

Next idea: the representation of $x_i$ is a convex combination over the sequence, realized through the value vectors: $$\text{atten}_i = \sum_{\ell =1}^{t} a_\ell \textbf{v}_\ell = \textbf{a}^{\sf T} \textbf{V}, \quad \textbf{a} \in \mathbb{R}^{t}$$ (in a decoder-style, causal setting the sum would be restricted to $\ell \le i$).

Now the coefficient vector $\textbf{a}_i$ must be learned somehow. Attention suggests that \begin{align*} &\widetilde{\textbf{a}}_i = [\widetilde{a}_{i1}, \ldots , \widetilde{a}_{it}] = [\textbf{q}_i^{\sf T} \textbf{k}_1, \ldots , \textbf{q}_i^{\sf T} \textbf{k}_\ell, \ldots , \textbf{q}_i^{\sf T} \textbf{k}_t] = \textbf{q}_i^{\sf T} \textbf{K}^{\sf T} \\ &\widetilde{\textbf{A}} = [\widetilde{\textbf{a}}_1; \ldots ; \widetilde{\textbf{a}}_t] = \textbf{Q} \textbf{K}^{\sf T} \\ &\textbf{A} = \text{softmax} (\widetilde{\textbf{A}}) \triangleq [\text{softmax}(\widetilde{\textbf{a}}_1); \ldots ; \text{softmax}(\widetilde{\textbf{a}}_t)] \in \mathbb{R}^{t \times t} \end{align*} where $\widetilde{\textbf{a}}_i$ is the $i$-th row of $\widetilde{\textbf{A}}$ and the softmax is applied row-wise. Each row $\textbf{a}_i$ represents the distribution of “attention” that word $i$ pays over the whole sentence.

So, taking everything as matrices, we have \begin{align*} \text{attention} &= \textbf{A} \textbf{V}, \quad \textbf{A} \in \mathbb{R}^{t \times t} \\ &= \text{softmax}(\textbf{Q} \textbf{K}^{\sf T}) \textbf{V} \in \mathbb{R}^{t \times d_v}, \end{align*} so row $i$ of the output is exactly $\sum_{\ell} a_{i\ell} \textbf{v}_\ell$. (The original paper also scales the logits $\textbf{Q}\textbf{K}^{\sf T}$ by $1/\sqrt{d_k}$ before the softmax.)
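A minimal NumPy sketch of this computation, with random $\textbf{Q}, \textbf{K}, \textbf{V}$ standing in for the ones built above (the $1/\sqrt{d_k}$ scaling is the one from the original paper):

```python
import numpy as np

rng = np.random.default_rng(0)
t, d_k, d_v = 5, 8, 16
Q = rng.normal(size=(t, d_k))
K = rng.normal(size=(t, d_k))
V = rng.normal(size=(t, d_v))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (t, t), entry (i, l) = q_i . k_l / sqrt(d_k)
    A = softmax(scores, axis=-1)              # row i is the attention distribution a_i
    return A @ V                              # (t, d_v): row i = sum_l a_{il} v_l

print(attention(Q, K, V).shape)               # (5, 16)
```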

Multiheads

Then, in order to allow for multiple learned attention patterns, each word is now represented by $H$ different triples $(\textbf{Q}_h, \textbf{K}_h, \textbf{V}_h)$: $$\text{attention} (\textbf{Q}\textbf{W}_h^{(1)}, \textbf{K}\textbf{W}_h^{(2)}, \textbf{V}\textbf{W}_h^{(3)}), \quad h=1, \ldots , H.$$ Then, as usual, everything is concatenated and fed into a final FC layer.
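A sketch of the multi-head version in the same NumPy style (the per-head sizes and the final output projection shape are my choices for illustration, not taken from any particular codebase):

```python
import numpy as np

rng = np.random.default_rng(0)
t, d_model, H = 5, 32, 4
d_k = d_v = d_model // H                           # common choice of per-head sizes

# For simplicity Q, K, V all live in d_model here; each head projects them down.
Q = rng.normal(size=(t, d_model))
K = rng.normal(size=(t, d_model))
V = rng.normal(size=(t, d_model))
W1 = rng.normal(size=(H, d_model, d_k)) * 0.1      # W_h^(1), one per head
W2 = rng.normal(size=(H, d_model, d_k)) * 0.1      # W_h^(2)
W3 = rng.normal(size=(H, d_model, d_v)) * 0.1      # W_h^(3)
W_out = rng.normal(size=(H * d_v, d_model)) * 0.1  # final FC layer

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1) @ V

heads = [attention(Q @ W1[h], K @ W2[h], V @ W3[h]) for h in range(H)]
out = np.concatenate(heads, axis=-1) @ W_out       # concatenate heads, then final FC layer
print(out.shape)                                   # (5, 32)
```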

Motivation: a single attention pattern averages everything into one distribution; with several heads, each head can learn to attend to a different kind of relation (different positions, different representation subspaces).

Order Encoding

Now, as the representation is just a convex combination of a set of vectors, there is no notion of order. Hence it is necessary that the order information is encoded in the token vectors themselves; the original paper does this by adding fixed sinusoidal position encodings to the embeddings.
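A NumPy sketch of those sinusoidal encodings (learned position embeddings are a common alternative):

```python
import numpy as np

def sinusoidal_positions(t, d_model):
    # Fixed sinusoidal position encodings as in "Attention Is All You Need".
    pos = np.arange(t)[:, None]                  # (t, 1)
    dim = np.arange(0, d_model, 2)[None, :]      # (1, d_model / 2), assumes d_model is even
    angle = pos / (10000 ** (dim / d_model))
    pe = np.zeros((t, d_model))
    pe[:, 0::2] = np.sin(angle)                  # even dimensions
    pe[:, 1::2] = np.cos(angle)                  # odd dimensions
    return pe

t, d_model = 5, 32
X = np.random.default_rng(0).normal(size=(t, d_model))  # token vectors
X = X + sinusoidal_positions(t, d_model)                # order information injected additively
```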

So that’s basically it.

Flamingo: a visual language model for few-shot learning

Task:

Mixing text and images to predict the next word token: a pretrained LLM handles the text, while the vision input goes through a pretrained feature extractor and then a trainable network that produces a fixed-length vector for each image/video input.

The dataset is crawled from webpages; in the text, each image is replaced by a special token.

The vision module produces a fixed number of tokens. These tokens are treated like word tokens.
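To make “a fixed number of tokens” concrete, here is a toy NumPy caricature of the idea: a small set of trainable latent queries cross-attends over however many visual features the encoder emits, loosely in the spirit of Flamingo’s Perceiver Resampler (every name and size below is invented for illustration; this is not Flamingo’s actual architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_visual_tokens, d = 197, 8, 64             # many patch features in, 8 tokens out

visual_features = rng.normal(size=(n_features, d))      # from the pretrained vision encoder
latent_queries = rng.normal(size=(n_visual_tokens, d))  # trainable, fixed count

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Cross-attention: the fixed set of latent queries attends over however many
# visual features came in, so the output always has n_visual_tokens rows.
A = softmax(latent_queries @ visual_features.T / np.sqrt(d), axis=-1)
visual_tokens = A @ visual_features                     # (8, 64), regardless of n_features
print(visual_tokens.shape)
```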

Method

Input example:

In more detail …

Data collection:

LLM knowledge retrieval

Setting: given a dataset of text pairs (x, y), e.g., x: a question, y: its answer.

Idea

An LLM contains knowledge somehow and can be seen as having a parametric memory. Let’s extend that by adding a non-parametric external memory, in this case built from Wikipedia. So, given a question for example, the model uses its internal knowledge, retrieves external resources, combines them, and generates an answer.

More concretely, the authors proposed a probabilistic model with two ways to do approximate inference: the RAG-Sequence model and the RAG-Token model.
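For reference, the two factorizations as I recall them from the paper, with $z$ a retrieved passage, $p_\eta(z \mid x)$ the retriever, and $p_\theta$ the generator: \begin{align*} p_{\text{RAG-Sequence}}(y \mid x) &\approx \sum_{z \in \text{top-}k(p_\eta(\cdot \mid x))} p_\eta(z \mid x) \prod_{i} p_\theta(y_i \mid x, z, y_{1:i-1}) \\ p_{\text{RAG-Token}}(y \mid x) &\approx \prod_{i} \sum_{z \in \text{top-}k(p_\eta(\cdot \mid x))} p_\eta(z \mid x) \, p_\theta(y_i \mid x, z, y_{1:i-1}) \end{align*} That is, RAG-Sequence uses a single retrieved passage for the whole answer (and marginalizes over the choices), while RAG-Token can draw on a different passage for each generated token.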

Thoughts?

