Transformers & Large Language Models - 1 of 9


• Background on NLP and tasks

NLP Tasks

1. Classification

- Sentiment analysis: Amazon reviews, IMDB critiques, Twitter

- Intent detection

- Language detection

- Topic modeling

2. "Multi"-Classification

- Part of speech tagging

- Named entity recognition (NER): Dataset = annotated Reuters news articles (CoNLL-2003, CoNLL+)

- Dependency parsing

- Constituency parsing

3. Generation

- Machine translation: Dataset = WMT'14

- Question answering

- Summarization

- Text generation

History of LLM

1980s RNN

1997 LSTM (Theoretical Foundation) 

2013 Word2Vec

2020s LLM

• Tokenization

1. Arbitrary (n/a)

2. Word level: variations of the same word carry similar meanings and should share an embedding, but they become distinct tokens, so word variations are not handled well

3. Sub-word level: focuses on common roots (word pieces). Sequence length increases compared to word level. Tokenization is more complex

4. Character level: robust to misspelled words and irregular casing (e.g., CasINg). Sequence length is much longer. No OOV. (The sketch below compares the three levels.)
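
A quick way to see the difference between these levels is a minimal Python sketch; it assumes the Hugging Face transformers library is installed and uses the bert-base-uncased sub-word (WordPiece) tokenizer purely as an example.

from transformers import AutoTokenizer

# Sub-word (WordPiece) tokenizer; bert-base-uncased is just one example choice.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "unbelievably misspeled"

# Word level: naive whitespace split, one token per word (OOV-prone)
print(text.split())                # ['unbelievably', 'misspeled']

# Sub-word level: rare or misspelled words are split into common pieces,
# e.g. something like 'un', '##bel', ... (exact pieces depend on the vocabulary)
print(tokenizer.tokenize(text))

# Character level: longest sequences, but no OOV at all
print(list(text))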

• Embeddings

A word (token) is represented by a vector

OHE = One Hot Encoding

Cosine similarity: measures how close two embedding vectors are in direction (see the sketch below)
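
A minimal NumPy sketch (with made-up vectors) contrasting one-hot encodings with dense embeddings under cosine similarity:

import numpy as np

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# One-hot encodings: every pair of distinct words has similarity 0
cat_ohe, dog_ohe = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
print(cosine_similarity(cat_ohe, dog_ohe))   # 0.0

# Dense (learned) embeddings: related words can have high similarity
cat_emb, dog_emb = np.array([0.8, 0.1, 0.3]), np.array([0.7, 0.2, 0.4])
print(cosine_similarity(cat_emb, dog_emb))   # close to 1.0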

• Word2vec, RNN, LSTM

1. Word2Vec

It is an artificial neural network trained with a proxy task

1. CBOW (Continuous Bag of Words): you predict the target word from the surrounding context words

2. Skip-gram: you take the target word and predict the words around it

Word order does not matter

Embeddings are not context-aware

Example embedding dimension: 768

Special token to indicate "end of sequence" 
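
A minimal sketch of training such embeddings, assuming the gensim library (4.x) is installed; the toy corpus below is made up for illustration.

from gensim.models import Word2Vec

corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

# sg=0 -> CBOW (predict the target word from its context),
# sg=1 -> skip-gram (predict the context from the target word)
model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1)

vec = model.wv["cat"]                        # 50-dimensional, non-contextual embedding
print(model.wv.most_similar("cat", topn=2))  # nearest words by cosine similarity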

2. RNN = Recurrent Neural Network

Connections form a temporal sequence

H = Hidden state = A = Activation Vector = Context Vector. 

RNN is used for all 3 NLP tasks

1. Classification

2. "Multi"-Classification

3. Generation

An RNN keeps forgetting the distant past. This phenomenon is related to the "vanishing gradient" problem

Word order matters in RNN
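
A minimal NumPy sketch of a vanilla RNN update (dimensions and inputs are made up): the hidden state h is updated one token at a time, which is why word order matters and why information from early tokens gradually fades.

import numpy as np

d_in, d_hidden = 8, 16
rng = np.random.default_rng(0)
W_x = rng.normal(size=(d_hidden, d_in)) * 0.1
W_h = rng.normal(size=(d_hidden, d_hidden)) * 0.1
b = np.zeros(d_hidden)

def rnn_step(h_prev, x_t):
    # h_t = tanh(W_h h_{t-1} + W_x x_t + b): the hidden state is the context vector
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

h = np.zeros(d_hidden)                      # initial hidden state
for x_t in rng.normal(size=(5, d_in)):      # 5 "token" vectors processed sequentially
    h = rnn_step(h, x_t)
print(h.shape)                              # (16,)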

3. LSTM = Long short-term memory

1. hidden state (short-term memory)

2. cell state (long-term memory, which helps mitigate the vanishing gradient)

• Attention mechanism

Attention creates a direct link between the next word being predicted and relevant words from the past.

"Self-attention" is the main principle of the "Attention Is All You Need" (2017) paper.

"Self-attention" = instead of processing the text sequentially, every token has a direct connection to all parts of the text at once.

Concept of Query, Key and Value

We compare Q to each K to measure how similar they are, and then take a weighted combination of the corresponding values V.

Softmax converts unnormalized network outputs into probabilities over the different classes, such that each value is in [0, 1] and the values sum to 1.

Formula – Given a query Q, we want to know which key K the query should pay "attention" to with respect to the associated value V. 

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, where d_k is the dimension of K
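
A minimal NumPy sketch of this formula (shapes and values are made up):

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # (n_q, n_k): similarity of each query to each key
    weights = softmax(scores, axis=-1)           # each row is a probability distribution
    return weights @ V                           # weighted sum of the values

n, d_k, d_v = 4, 8, 8
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(n, d_k)), rng.normal(size=(n, d_k)), rng.normal(size=(n, d_v))
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 8)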

There are three attention layers

1. Self-attention layer in the encoder: computes contextual embeddings from the input.

2. Decoder self-attention layer (decoder-decoder attention): it is masked because it may only look at tokens that have already been generated (translated). It determines which other tokens of the output sentence are useful for predicting the next token.

3. Cross-attention layer: the decoder's representation is expressed as a function of what is seen in the input. The output of the last encoder is fed to the decoder.

We have a direct link to all tokens, so word order by itself does not matter (unlike an RNN). Hence we add positional encoding, to inform the model of the position of each word in the sequence.
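
A minimal NumPy sketch of the sinusoidal positional encoding from the original Transformer paper (assuming an even model dimension); it is added to the token embeddings so the model knows each word's position.

import numpy as np

def positional_encoding(n_positions, d_model):
    pos = np.arange(n_positions)[:, None]           # (n, 1)
    i = np.arange(d_model // 2)[None, :]            # (1, d/2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                    # even dimensions
    pe[:, 1::2] = np.cos(angles)                    # odd dimensions
    return pe

print(positional_encoding(n_positions=10, d_model=768).shape)   # (10, 768)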


BOS Token: Beginning of Sequence. 

EOS Token: End of Sequence


• Transformer architecture

Self-attention is achieved by the Transformer, which is made of an encoder and a decoder.

1. The encoder computes meaningful embeddings from the input text. We have N such encoders stacked. The input layer generates a position-aware embedding matrix of width d = model size and length n = length of the input sequence.

The encoder projects the input sequence onto 3 spaces through the matrices W_Q, W_K, and W_V, which the model learns.

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V

Projecting with W_Q gives a matrix where each row represents the query Q of one token. A final matrix W_O projects the attention output back to the original embedding dimension.

In K^T, each column represents the key of one token.

When we multiply Q with K^T, each row represents the projection of one query onto every key; the softmax then turns each row into a probability distribution.

Now multiply with the matrix V.

This is the self-attention mechanism: the representation of each token is computed as a function of the other tokens. It is done by the attention layer.

Multi-Head Attention (MHA) means this computation is done several times in parallel, in different ways, so the model can learn

- different representations

- different projections

so that all tokens of the input text attend to each other.

In the decoder, this becomes a masked self-attention layer.

A Multi-Head Attention (MHA) layer performs attention computations across multiple heads, then projects the result in the output space.

Variations of MHA:
- Grouped-Query Attention (GQA)
- Multi-Query Attention (MQA)
Both reduce computational overhead by sharing keys and values across attention heads.

A "head" is the term given to the set of projection matrices used to obtain Q, K, and V. With more heads, the model learns different projections; this is like having multiple filters in a convolutional layer in computer vision.
h = number of heads
With h heads, the output of attention is h such matrices. Because of gradient descent (and random initialization), each head converges to a different result, each with its own degrees of freedom. We concatenate the outputs of all heads along the columns and project back with W_O (see the sketch below).
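
A minimal NumPy sketch of multi-head attention (dimensions and random weights are made up; in a real model the projection matrices W_Q, W_K, W_V, and W_O are learned by gradient descent):

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1) @ V

def multi_head_attention(X, h, d_model, rng):
    d_head = d_model // h
    heads = []
    for _ in range(h):
        # each head has its own projection matrices W_Q, W_K, W_V
        W_Q, W_K, W_V = (rng.normal(size=(d_model, d_head)) * 0.1 for _ in range(3))
        heads.append(attention(X @ W_Q, X @ W_K, X @ W_V))
    concat = np.concatenate(heads, axis=-1)          # (n, h * d_head) = (n, d_model)
    W_O = rng.normal(size=(d_model, d_model)) * 0.1
    return concat @ W_O                              # project back to the embedding dimension

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 64))                         # n = 6 tokens, d_model = 64
print(multi_head_attention(X, h=8, d_model=64, rng=rng).shape)   # (6, 64)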

2. FFNN (Feed-Forward Neural Network): lets the model learn another kind of projection,

so we get a richer representation of each input token.

In LLMs, the hidden layer has a higher dimension, so the model has enough degrees of freedom to learn a useful representation (see the sketch below).
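
A minimal NumPy sketch of this position-wise feed-forward block (made-up dimensions; the original Transformer uses ReLU and a hidden layer 4x wider than the model dimension):

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def feed_forward(X, W1, b1, W2, b2):
    # expand to the wider hidden dimension, apply the nonlinearity, project back
    return relu(X @ W1 + b1) @ W2 + b2

d_model, d_ff = 64, 256                        # d_ff is typically 4 * d_model
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.1, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.1, np.zeros(d_model)

X = rng.normal(size=(6, d_model))              # one row per input token
print(feed_forward(X, W1, b1, W2, b2).shape)   # (6, 64)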

3. The encoder output is fed to the decoder.

Cross-attention in the decoder takes Q from the decoder (output side),

and K, V from the encoder.

We have N decoders stacked.

New Terms

  • Perplexity is an evaluation metric for language models, used e.g. in machine translation. It quantifies how 'surprised' the model is to see some words together. Lower is better.
  • OOV = out of vocabulary
  • Vanishing gradient: an RNN keeps forgetting the distant past because gradients shrink as they are backpropagated through time.
  • Label smoothing purpose:

    - prevent overfitting

    - introduce noise

    - let the model be a little unsure about its predictions

    It improves accuracy and the BLEU score in translation (a small numerical sketch follows below).
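
A small numerical sketch of label smoothing on a one-hot target (made-up 5-class example with smoothing factor eps = 0.1):

import numpy as np

def smooth_labels(one_hot, eps=0.1):
    n_classes = one_hot.shape[-1]
    # keep (1 - eps) on the true class and spread eps uniformly over all classes
    return (1 - eps) * one_hot + eps / n_classes

target = np.array([0.0, 0.0, 1.0, 0.0, 0.0])
print(smooth_labels(target))   # [0.02 0.02 0.92 0.02 0.02]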

References

https://cme295.stanford.edu/

Syllabus : https://cme295.stanford.edu/syllabus/

Cheatsheet: https://cme295.stanford.edu/cheatsheet/

https://github.com/afshinea/stanford-cme-295-transformers-large-language-models/tree/main/en 

https://www.youtube.com/watch?v=Ub3GoFaUcds

Textbook: Super Study Guides


------------------------------------------------------

Some more relevant stuff: 

Each layer has 

1. Attention and 

2. Feed Forward


Between two layers we have a high-dimensional "hidden state vector" in activation space.


An LLM encodes concepts as distributed patterns across layers = superposition.

Anthropic has a series of papers on superposition and monosemanticity

https://www.youtube.com/watch?v=F2jd5WuT-zg

https://www.neuronpedia.org

https://huggingface.co/collections/dlouapre/sparse-auto-encoders-saes-for-mechanistic-interpretability

https://huggingface.co/spaces/dlouapre/eiffel-tower-llama

------------------------------------------------------------