Prefix-Tuning: Optimizing Continuous Prompts for Generation (2021) Paper Review

준나이 2023. 5. 7. 23:06

Abstract

Fine-tuning
- Fine-tuning is the de facto way to adapt large pre-trained language models (LMs) to downstream tasks
- However, it updates every LM parameter, so each task needs its own full copy of the LM

Prefix-tuning
- A lightweight alternative to fine-tuning: the LM parameters stay frozen and only a small continuous task-specific vector, called the prefix, is optimised
- Inspired by prompting; the prefix behaves like a sequence of virtual tokens and influences the tokens that follow it

Experiment
- tasks: table-to-text (GPT-2), summarisation (BART)
- results:
  - full-data setting: comparable to fine-tuning
  - low-data setting and extrapolation: outperforms fine-tuning

1. Introduction

Fine-tuning
- The standard way to use LMs on downstream tasks
- However, a separate copy of the LM must be kept for every task, which is very expensive
- e.g. GPT-2 (774M parameters), GPT-3 (175B parameters)

Adapter-tuning
- Inserts additional task-specific layers between the layers of the pre-trained LM
- Requires roughly 2-4% additional parameters per task

Prompting (= in-context learning)
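- Used with GPT-3: instead of task-specific tuning, a natural language instruction and a few examples are prepended to the input text
- Adds no task-specific parameters, but the amount of context is bounded by the sequence length the LM can process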

Prefix-tuning
- A relatively lightweight alternative to fine-tuning for natural language generation (NLG) tasks
- Consider generating a description of a data table (table-to-text): the input is a linearised table and the output is a textual description (e.g. input = {name: starbucks, type: coffee shop}, output = starbucks serves coffee)
- Concretely, a sequence of continuous task-specific vectors, called the prefix, is prepended to the input
- The Transformer attends to the prefix as if it were virtual tokens when processing the tokens that follow (unlike prompting, the prefix does not consist of real tokens)
- Fine-tuning tunes every LM parameter and stores a separate copy per task, whereas prefix-tuning optimises only the prefix, which is far cheaper
- Because a separate prefix is stored per task, prefix-tuning is also the more modular approach
- For personalisation, each user can be treated as a task, which brings benefits such as avoiding data cross-contamination between users
- Table-to-text generation and summarisation are the downstream tasks in the experiments, with GPT-2 and BART as the respective pre-trained LMs
- In the full-data setting the results are comparable to fine-tuning; in the low-data setting and under extrapolation prefix-tuning outperforms fine-tuning
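To make the storage argument concrete, here is a rough back-of-the-envelope sketch. The 774M figure is the GPT-2 size quoted above and the 0.1% prefix ratio is the table-to-text figure from section 6.1; the number of tasks is a made-up value purely for illustration.

```python
# Rough storage comparison. NUM_TASKS is a hypothetical value.
LM_PARAMS = 774_000_000          # GPT-2 size quoted in this post
PREFIX_PARAMS = LM_PARAMS // 1000  # ~0.1% task-specific parameters
NUM_TASKS = 100                  # assumption for illustration


def fine_tuning_params(num_tasks: int) -> int:
    """Every task stores its own full copy of the LM."""
    return num_tasks * LM_PARAMS


def prefix_tuning_params(num_tasks: int) -> int:
    """One shared frozen LM plus one small prefix per task."""
    return LM_PARAMS + num_tasks * PREFIX_PARAMS


full = fine_tuning_params(NUM_TASKS)
light = prefix_tuning_params(NUM_TASKS)
print(f"fine-tuning  : {full:,} parameters")
print(f"prefix-tuning: {light:,} parameters")
print(f"ratio        : {full / light:.1f}x")
```

The shared LM dominates the prefix-tuning total, so the saving grows with the number of tasks.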

2. Related Work

Fine-tuning for natural language generation

Introduces prior work on NLG tasks
The paper only evaluates prefix-tuning on table-to-text generation and summarisation, but it can also be applied to other tasks such as machine translation and dialogue generation

Lightweight fine-tuning

Freezes most of the pre-trained model's parameters and updates the model through small trainable modules
key challenge: finding high-performing module architectures and deciding which subset of the pre-trained parameters to tune
removing parameters: disables some of the pre-trained model's weights by masking them
adding parameters: inserts task-specific layers (adapters) between the layers of the pre-trained LM (needs about 3.6% additional parameters, roughly 30 times as many as prefix-tuning)

Prompting

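Prepends a natural language instruction and a few examples to the text input and lets the LM generate the output
GPT-3 uses manually designed prompts to produce outputs for different tasks; this is also called in-context learning
However, in-context learning is limited by the sequence length the LM can process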
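The length limit on in-context learning can be sketched as follows. This is a toy prompt builder, not GPT-3's actual interface: `build_prompt`, the `=>` separator and whitespace token counting are all assumptions made for illustration, and the second example is invented.

```python
def build_prompt(instruction, examples, query, max_tokens):
    """Prepend an instruction and as many examples as the token budget allows."""

    def n_tokens(text):
        # Whitespace tokens only; a real LM counts subword tokens.
        return len(text.split())

    budget = max_tokens - n_tokens(instruction) - n_tokens(query)
    kept = []
    for source, target in examples:
        shot = f"{source} => {target}"
        if n_tokens(shot) > budget:
            break  # context window is full; remaining examples are dropped
        kept.append(shot)
        budget -= n_tokens(shot)
    return "\n".join([instruction, *kept, query])


examples = [
    ("name: Starbucks | type: coffee shop", "Starbucks serves coffee"),
    ("name: The Eagle | type: pub", "The Eagle is a pub"),  # made-up example
]
prompt = build_prompt(
    "Describe the table in one sentence.",
    examples,
    "name: Cafe Rouge | type: restaurant =>",
    max_tokens=25,
)
print(prompt)
```

With a budget of 25 tokens only the first example fits, which is the constraint a trained prefix avoids: it does not compete with the examples for context space in the same way.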
prompt engineering:
- Prompting with keywords that control attributes such as topic or sentiment
- e.g. AutoPrompt: searches for a discrete sequence of trigger words that elicits sentiment or factual knowledge from the LM
- Prefix-tuning uses continuous vectors, which are more expressive than discrete tokens

continuous vectors
- Prior work optimises a continuous vector representation for each input text so that a pre-trained LM can reconstruct arbitrary sentences (input-specific)
- Prefix-tuning is task-specific: one prefix applies to every instance of a task

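Controllable generation

Steers a pre-trained model so that its output satisfies a sentence-level attribute
e.g. training time: the pre-trained LM is trained to condition on metadata such as keywords or URLs
e.g. decoding time: weighted decoding, or iteratively updating the past activations
However, there is no straightforward way to apply these techniques to tasks that need fine-grained control over the generated content, such as table-to-text and summarisation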

3. Problem Statement

input: a context $x$
output: a sequence of tokens $y$
table-to-text: $x$ = a linearised data table, $y$ = a textual description
summarisation: $x$ = an article, $y$ = a short summary
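For the table-to-text case, turning a table into the context $x$ could look like the sketch below. The `field: value | field: value` layout and the `linearise` helper are assumptions for illustration; the exact linearisation format is not specified in this post.

```python
def linearise(table: dict) -> str:
    """Flatten a field -> value table into a single text sequence x."""
    return " | ".join(f"{field}: {value}" for field, value in table.items())


x = linearise({"name": "Starbucks", "type": "coffee shop"})
y = "Starbucks serves coffee"
print("x =", x)
print("y =", y)
```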

3.1. Autoregressive LM (GPT)

$p_{\phi}(y|x)$: Transformer-based autoregressive model parameterised by $\phi$
$z=[x;y]$: concatenation of $x$ and $y$
$\mathtt{X_{idx}}$: a sequence of $x$ indices
$\mathtt{Y_{idx}}$: a sequence of $y$ indices
$h_i = [h_i^{(1)}; ... ; h_i^{(n)}]$: concatenation of all activation layers at time step $i$
$$ h_i = LM_{\phi}(z_i, h_{<i})$$
The autoregressive Transformer computes $h_i$ as a function of $z_i$ and the past activations in its left context
The last layer of $h_i$ is used to predict the next token
$p_{\phi}(z_{i+1}|h_{\leq i}) = softmax(W_{\phi}h_i^{(n)})$: the pre-trained matrix $W_{\phi}$ maps $h_i^{(n)}$ to logits over the vocabulary
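The last step, mapping the top-layer activation to a distribution over the vocabulary, can be sketched in a few lines. The matrix `W` and vector `h_last` below are toy values (3-word vocabulary, hidden size 2), not anything taken from a real model.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_distribution(W, h_last):
    """softmax(W h_i^(n)): one logit per vocabulary entry."""
    logits = [sum(w * h for w, h in zip(row, h_last)) for row in W]
    return softmax(logits)


W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy stand-in for W_phi
h_last = [0.5, -0.5]                      # toy stand-in for h_i^(n)
probs = next_token_distribution(W, h_last)
print(probs)
```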

3.2. Encoder-Decoder Architecture (BART)

$x$: encoded by the bidirectional encoder
$y$: predicted autoregressively by the decoder
All other notation is the same as above

3.3. Method: Fine-tuning

$\phi$: pretrained parameters
$p_{\phi}$: the trainable LM distribution, trained by maximising the following log-likelihood objective
$$\max_{\phi} \log p_{\phi}(y|x) = \sum_{i\in\mathtt{Y_{idx}}} \log p_{\phi}(z_i|h_{<i})$$
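Note that the sum runs over $\mathtt{Y_{idx}}$ only, so the positions belonging to $x$ do not contribute to the objective. A minimal sketch, where `token_probs[i]` is assumed to hold the probability the model assigned to the actual token $z_i$ (the numbers are invented):

```python
import math


def sequence_log_likelihood(token_probs, y_idx):
    """Sum log p(z_i | h_<i) over the output positions only."""
    return sum(math.log(token_probs[i]) for i in y_idx)


# z = [x; y]: positions 0-2 belong to x, positions 3-4 belong to y.
token_probs = [0.9, 0.8, 0.7, 0.5, 0.25]  # toy values
y_idx = [3, 4]
ll = sequence_log_likelihood(token_probs, y_idx)
print(ll)
```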

4. Prefix-Tuning

An alternative to fine-tuning for conditional generation tasks

4.1. Intuition

Prompting shows that a model can be steered in the desired direction without directly updating its parameters
e.g. to make the LM generate a particular token ($Obama$), its common collocation ($Barack$) can be prepended to the input as context
Prefix-tuning goes beyond a single word or sentence and looks for a context that steers the LM to solve an NLG task
Role of the context: when encoding $x$ it guides what to extract, and when generating $y$ it shapes the next-token distribution (whether such a context actually exists is not obvious)
Natural language instructions, as in prompting, work for humans but are rarely effective for pre-trained models
A data-driven search for the context could help, but optimising over discrete tokens is computationally challenging

4.2. Method

Prefix-tuning prepends a prefix to the input of the pre-trained LM: $z=[prefix;x;y]$ for an autoregressive LM, or $z=[prefix;x;prefix';y]$ for an encoder-decoder model
$\mathtt{P_{idx}}$: a sequence of $prefix$ indices
$|\mathtt{P_{idx}}|$: the length of $prefix$
The prefix parameters are stored in a trainable matrix $P_{\theta} \in \mathbb{R}^{|\mathtt{P_{idx}}| \times dim(h_i)}$, parameterised by $\theta$
The training objective is the log-likelihood from 3.3, but the pre-trained LM parameters $\phi$ are fixed and only $\theta$ is trained
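The size of $P_{\theta}$ follows directly from its shape. The sketch below counts its entries under two assumptions that are not stated in this post: that each layer contributes a key and a value vector to $h_i$ (hence `per_layer_vectors=2`), and that the model has 24 layers with hidden size 1024. Treat the numbers as illustrative only.

```python
def prefix_param_count(prefix_len, n_layers, d_model, per_layer_vectors=2):
    """Entries in P_theta: |P_idx| rows, dim(h_i) columns."""
    dim_h = n_layers * per_layer_vectors * d_model  # assumed layout of h_i
    return prefix_len * dim_h


# Hypothetical configuration: prefix length 10, 24 layers, hidden size 1024.
count = prefix_param_count(prefix_len=10, n_layers=24, d_model=1024)
print(f"{count:,} trainable prefix parameters")
```

Only these entries receive gradient updates; everything belonging to $\phi$ stays fixed.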

$h_i$
- if $i \in \mathtt{P_{idx}}$: $h_i = P_{\theta}[i,:]$, i.e. the activation is copied directly from $P_{\theta}$
- otherwise: $h_i = LM_{\phi}(z_i, h_{<i})$; the prefix activations are always in the left context and affect every activation to their right, so $h_i$ still depends on $P_{\theta}$
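The two cases can be illustrated with a toy recurrence. This is a sketch only: `toy_lm` is an invented stand-in for $LM_{\phi}$ (a real Transformer uses attention, and activations are vectors rather than scalars), but it shows how changing the prefix changes every later activation while $\phi$ stays fixed.

```python
import math


def toy_lm(z_i, past, phi):
    """Stand-in for LM_phi(z_i, h_<i): mixes the input with the mean of past activations."""
    context = sum(past) / len(past) if past else 0.0
    return math.tanh(phi * z_i + context)


def run(prefix, tokens, phi):
    """Activations for z = [prefix; x; y]."""
    h = list(prefix)                   # i in P_idx: copied from P_theta
    for z_i in tokens:                 # i not in P_idx: computed by the LM
        h.append(toy_lm(z_i, h, phi))
    return h


phi = 0.5                              # frozen LM parameter
tokens = [1.0, -2.0, 0.5]              # toy stand-ins for the x and y tokens
h_a = run(prefix=[0.0, 0.0], tokens=tokens, phi=phi)
h_b = run(prefix=[1.0, -0.3], tokens=tokens, phi=phi)
print(h_a)
print(h_b)
```

The first two entries are just the prefix; the remaining ones differ between the two runs even though the tokens and `phi` are identical.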

4.3. Parameterisation of $P_{\theta}$

Empirically, optimising $P_{\theta}$ directly makes training unstable and lowers performance
$$P_{\theta}[i,:] = MLP_{\theta}(P'_{\theta}[i,:])$$
$P_{\theta}$ is therefore reparameterised by a smaller matrix $P'_{\theta}$ passed through a large feed-forward network
$P_{\theta}$ and $P'_{\theta}$ have the same number of rows and differ only in the column dimension
Once training is complete, the reparameterisation parameters ($P'_{\theta}$ and the MLP) can be dropped and only $P_{\theta}$ needs to be saved
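A minimal sketch of the reparameterisation, with the training loop omitted. The sizes, the two-layer MLP and the tanh activation are assumptions made for illustration; the point is only the shapes and the fact that the final $P_{\theta}$ can be computed once and stored on its own.

```python
import math
import random

random.seed(0)
PREFIX_LEN, K, HIDDEN, DIM_H = 3, 2, 4, 6  # hypothetical toy sizes, K < DIM_H


def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


P_prime = rand_matrix(PREFIX_LEN, K)  # smaller matrix P'_theta
W1 = rand_matrix(HIDDEN, K)           # MLP weights, trained together with P'
W2 = rand_matrix(DIM_H, HIDDEN)


def mlp(row):
    """Maps one row of P' (size K) to one row of P (size DIM_H)."""
    hidden = [math.tanh(v) for v in matvec(W1, row)]
    return matvec(W2, hidden)


# During training, P_theta is always recomputed from P' through the MLP.
P = [mlp(row) for row in P_prime]

# After training, only P is kept; P' and the MLP weights can be discarded.
saved_prefix = P
print(len(saved_prefix), "rows x", len(saved_prefix[0]), "columns")
```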

5. Experimental Setup

5.1. Datasets and Metrics

table-to-text task
- datasets: E2E, WebNLG, DART
- metrics: BLEU, METEOR, TER, ROUGE, BERTScore, BLEURT, etc.

summarisation task
- datasets: XSUM
- metrics: ROUGE family

5.2. Methods

table-to-text task: fine-tuning, fine-tuning only the top 2 layers, and adapter-tuning
summarisation task: fine-tuning BART

5.3. Architectures and Hyperparameters

6. Main Results

6.1. Table-to-text Generation

With only 0.1% task-specific parameters, prefix-tuning matches fine-tuning and outperforms the other lightweight baselines
Experiments that vary the number of parameters show that prefix-tuning is the more Pareto-efficient method
It also generalises better to categories and domains unseen during training (extrapolation performance)
more time- and space-efficient, more expressive
The same trends hold when scaling from GPT-2 Medium to GPT-2 Large

6.2. Summarisation

Scores lower than fine-tuning
The authors attribute this to differences between XSUM (the summarisation dataset) and the table-to-text datasets
XSUM has over 4 times as many examples, its texts are about 17 times longer, and summarisation itself is a more complex task

6.3. Low-data Setting

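Outperforms fine-tuning across the low-data experiments while adding only a very small number of parameters (the gap narrows as the dataset size grows)
In the qualitative evaluation, its outputs are also more faithful to the input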

6.4. Extrapolation

Prefix-tuning extrapolates better than fine-tuning on both tasks
Adapter-tuning also extrapolates well, which suggests that keeping the LM parameters frozen helps extrapolation

7. Intrinsic Evaluation

7.1. Prefix Length

Performance improves as the prefix gets longer up to a threshold, after which it drops slightly
Inference speed is barely affected, because the computation over the whole sequence is parallelised on the GPU

7.2. Full vs Embedding-only

Skipped: I did not follow this part in enough detail

7.3. Prefixing vs Infixing

Prefixing $[prefix;x;y]$ outperforms infixing $[x;infix;y]$
The explanation given is that prefixing can affect the activations of both $x$ and $y$, whereas infixing can only affect $y$
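The explanation follows from left-to-right (causal) attention: a position can only be influenced by what sits to its left. The sketch below simply lists which ordinary positions come after the first tuned position in each layout; the `influenced_positions` helper and the sequence lengths are made up for illustration.

```python
def influenced_positions(layout):
    """Under a causal mask, tuned positions affect only the positions to their right."""
    first_tuned = next(i for i, kind in enumerate(layout) if kind == "tuned")
    return [kind for kind in layout[first_tuned:] if kind != "tuned"]


prefixing = ["tuned", "tuned", "x", "x", "x", "y", "y"]  # [prefix; x; y]
infixing = ["x", "x", "x", "tuned", "tuned", "y", "y"]   # [x; infix; y]

print(influenced_positions(prefixing))  # ['x', 'x', 'x', 'y', 'y']
print(influenced_positions(infixing))   # ['y', 'y']
```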

7.4. Initialisation

Initialisation matters more in the low-data setting
Random initialisation gives low performance with high variance
Initialising the prefix from real words gives clearly better generation performance
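The two initialisation strategies can be contrasted as below. This is a simplified sketch: the embedding table, its words and the dimension are invented, and plain embedding vectors stand in for whatever real-word representation is actually used to seed the prefix.

```python
import random

random.seed(0)
DIM = 4
# Hypothetical toy embedding table; a real LM would supply these vectors.
embedding = {
    word: [random.gauss(0, 1) for _ in range(DIM)]
    for word in ["summarize", "table", "to", "text", "banana"]
}


def random_init(prefix_len):
    """Each prefix position starts from random noise."""
    return [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(prefix_len)]


def word_init(words):
    """Each prefix position starts from the vector of a real word (copied, not shared)."""
    return [list(embedding[w]) for w in words]


prefix_random = random_init(3)
prefix_words = word_init(["table", "to", "text"])
print(prefix_words[0])
```

Copying the vectors matters: the prefix is trained afterwards, and the LM's own embeddings must stay frozen.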

8. Discussion

8.1. Personalisation

Prefix-tuning can be trained independently for each task
This is a major advantage for personalisation, where personalised results must be served to a very large number of users (user privacy, modularity, efficiency)

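8.2. Batching Across Users

Requests from many different users can be placed in a single batch
Only the prefix at the front of each sequence is user-specific; the Transformer layers are shared, so the batch can be processed efficiently
In contrast, adapter-tuning places its task-specific parameters between the Transformer layers, so this kind of batching is not possible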
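The batching idea can be sketched as follows. Everything here is a toy stand-in: the user names, the scalar "prefixes" and `frozen_lm` are invented, and a real system would prepend per-layer prefix activations rather than numbers. The point is that one call to the shared model handles rows that belong to different users.

```python
# Hypothetical per-user prefixes (scalars here for brevity).
user_prefix = {"alice": [0.1, 0.2], "bob": [-0.4, 0.3]}


def build_batch(requests):
    """Prepend each user's own prefix; everything after that is identical for all rows."""
    return [user_prefix[user] + tokens for user, tokens in requests]


def frozen_lm(batch):
    """Stand-in for the shared, frozen Transformer: one call handles every row."""
    return [sum(row) / len(row) for row in batch]


requests = [("alice", [1.0, 2.0, 3.0]), ("bob", [1.0, 2.0, 3.0])]
batch = build_batch(requests)
outputs = frozen_lm(batch)
print(outputs)
```

Both users send the same tokens, yet the outputs differ because each row carries its own prefix.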

8.3. Inductive Bias of Prefix-tuning

Still an open question, but because prefix-tuning and adapter-tuning leave the pre-trained parameters untouched, they appear better suited to generalisation

9. Conclusion

Proposes prefix-tuning, a lightweight alternative to fine-tuning that prepends a trainable continuous prefix
Despite learning about 1000 times fewer parameters, it is comparable to fine-tuning in the full-data setting and outperforms it in the low-data and extrapolation settings
