Epic history of LLM


RNNs handle sequence NLP tasks, which come in three shapes:

1. Many to one: sentiment analysis

2. One to many: image captioning

3. Many to Many: 

- Synchronous many to many: number of inputs = number of outputs. E.g. part-of-speech tagging, named entity recognition

- Asynchronous many to many: translation, text summarization, question answering, chatbots, speech to text

A seq2seq model is used for many-to-many tasks

Stage 1: 2014 Encoder decoder network

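The encoder and decoder are both LSTMs. A plain RNN or a GRU are other options.

It works well for short sentences, but not for sentences of 30+ words, because the whole input is squeezed into one fixed-size context vector.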

Translation quality is measured with the BLEU score
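To make the BLEU idea concrete, here is a minimal sketch of a sentence-level score using clipped n-gram precision and a brevity penalty. It is a simplification: it assumes a single reference and n-grams up to length 2 only, and the function names are made up for illustration. Real evaluations use standard tooling and usually n-grams up to length 4.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, reference, n):
    """Share of candidate n-grams found in the reference, with counts clipped."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0

def simple_bleu(candidate, reference, max_n=2):
    """Geometric mean of clipped precisions times a brevity penalty (simplified)."""
    precisions = [clipped_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    # Penalize candidates that are shorter than the reference.
    if len(candidate) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(log_avg)

reference = "the cat is on the mat".split()
candidate = "the cat sat on the mat".split()
score = simple_bleu(candidate, reference)  # about 0.707
```

Clipping is what stops a degenerate output such as "the the the the the the" from scoring well just by repeating a common word.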

Stage 2: 2015 Attention Mechanism

The encoder is the same as in Stage 1

Attention mechanism: an attention layer in the decoder works out which encoder hidden states are useful at each decoding step and generates a context vector for that step. So the decoder gets multiple context vectors, one per output step, each built from the encoder's hidden states (the LSTM hidden state, i.e. the c_t and h_t vectors).

Training takes longer.

2015 to 2017: Many types of attention mechanisms were introduced.
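A minimal sketch of how one context vector could be computed for a single decoder step. It assumes a plain dot-product score between the decoder state and each encoder hidden state, chosen here for brevity; the original 2015 formulation scored pairs with a small learned network instead. All names and numbers are illustrative.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(decoder_state, encoder_states):
    """Weight each encoder hidden state by its relevance to the decoder state."""
    # Assumption: relevance is a dot product (not the learned scorer of the 2015 paper).
    scores = [sum(d * h for d, h in zip(decoder_state, state)) for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * state[i] for w, state in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context

# Three encoder hidden states (one per input word), toy 2-dimensional values.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = context_vector([2.0, 0.0], encoder_states)
```

Calling `context_vector` again with a different decoder state gives a different context vector, which is the "one context vector per decoding step" idea.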

Stage 3: 2017 Transformer

No LSTM

No RNN Cell

Self-attention was introduced

Both the encoder and the decoder use attention

Transformer can process all words in parallel 

1. Attention layer = Multi Head Attention

2. Normalization Layer

3. Dense Layer

4. Input embeddings

Training it needs a lot of hardware, time, and data
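A rough sketch of self-attention, where every word attends to every word of the same sentence. It assumes the queries, keys, and values are the input vectors themselves and uses a single head; a real transformer applies learned projections first and runs several heads side by side. The function names are illustrative.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x (an assumption)."""
    dim = len(x[0])
    output = []
    for query in x:
        # Score this word against every word in the sentence, itself included.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in x]
        weights = softmax(scores)
        output.append([sum(w * value[i] for w, value in zip(weights, x))
                       for i in range(dim)])
    return output

# Three toy word embeddings of size 2.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(embeddings)
```

Each output row depends only on the inputs, not on the other output rows, so all positions can be computed at once (in practice as one matrix multiplication). An LSTM cannot do this because step t needs the hidden state from step t-1.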

Stage 4: 2018 Jan Transfer Learning

Challenges

1. A single model cannot perform all tasks, such as sentiment analysis, translation, and summarization

2. Each task needs lots of labeled data

Universal Language Model Fine-tuning (ULMFiT) proposed using language modelling as the pre-training task. Language modelling is the NLP task of predicting the next word. Advantages:

1. Rich feature training

2. unsupervised task
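A small sketch of why next-word prediction needs no labels: the training targets come straight from the raw text. A bigram count model stands in for the neural language model here, purely for illustration; it is not the AWD-LSTM that ULMFiT uses, and the helper names are made up.

```python
from collections import Counter, defaultdict

def training_pairs(text):
    """Build (current word, next word) pairs from raw, unlabeled text."""
    words = text.lower().split()
    return [(words[i], words[i + 1]) for i in range(len(words) - 1)]

def fit_bigram(pairs):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for current, following in pairs:
        counts[current][following] += 1
    return counts

def predict_next(counts, word):
    """Most frequent follower of `word`, or None if the word was never seen."""
    if word not in counts:
        return None
    return counts[word].most_common(1)[0][0]

corpus = "the cat sat on the mat . the cat ate ."
model = fit_bigram(training_pairs(corpus))
guess = predict_next(model, "the")  # "cat" follows "the" twice, "mat" once
```

Any text corpus can be turned into training pairs this way, which is why a source like Wikipedia can be used for pre-training without anyone labeling it.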

Model: AWD-LSTM

Dataset: Wikipedia

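Fine-tuning: the output layer is replaced with a classifier, and the model is fine-tuned on many different datasets.

Training from scratch took about 10,000 labeled examples. Fine-tuning with about 100 still gives better results.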

- No transformer

Now in 2018, we have two technologies

1. Architecture: transformer

2. Training: pre-training and transfer learning

Stage 5: 2018 Oct LLM

Transfer learning on transformer

1. Google: BERT (encoder-only model)

2. OpenAI: GPT (decoder-only model)
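One practical difference between the two families is which positions a word may attend to. The sketch below builds the two attention masks as plain lists (1 = may attend, 0 = blocked). The function names are illustrative and not taken from either model's code.

```python
def bidirectional_mask(n):
    """Encoder-only style (BERT): every position sees every position."""
    return [[1] * n for _ in range(n)]

def causal_mask(n):
    """Decoder-only style (GPT): position i sees only positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

encoder_view = bidirectional_mask(3)
decoder_view = causal_mask(3)
for row in decoder_view:
    print(row)
```

The causal mask is what lets a decoder-only model be trained on next-word prediction: when predicting word i+1, it has not seen any later words.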

LM to LLM: what going large requires

1. Data

2. Hardware: GPU clusters

3. Time: days to weeks

4. Cost = hardware + electricity + people + infrastructure

5. Energy consumption

---------------

GPT-3 -> ChatGPT

1. RLHF : Reinforcement Learning from Human Feedback

2. Safety and ethical guidelines built in

3. Better handling of conversational context

4. Dialogue-specific training

5. Continuous improvement based on user feedback


Reference: https://www.youtube.com/watch?v=8fX3rOjTloc&list=PPSV
