BERT and PALs: Projected attention layers for efficient adaptation in multi-task learning (2019) Paper Review

준나이 2023. 5. 1. 03:55

Abstract

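In multi-task learning, sharing information across tasks is standard practice, and keeping the number of parameters this requires small is important.
Previously, a model had to be fine-tuned separately for each task, so $I$ tasks required $I$ separate models.
This paper introduces a way to perform many tasks with a single model using only a small number of additional parameters.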

1. Introduction

Prior work on adaptation

Sharing all parameters of the pre-trained model
However, the input and output shapes must then be identical across tasks

Approach proposed in the paper

Share most parameters (improves generalisation performance)
Add a small number of task-specific parameters (improves task-specific performance)
Goal: build a model that matches or exceeds the existing state of the art

Without parameter sharing

A separate model per task -> higher space complexity and more overhead
Redundant computation -> higher energy cost
The problem is more severe on mobile devices with limited battery life

Existing BERT fine-tuning

A separate output layer is added on top of BERT (the part closest to the output) and fine-tuned
The whole BERT model is fine-tuned in the process, so each task ends up needing its own model
This is very inefficient in both time and space

Contributions

1) Projected Attention Layer (PAL): a low-dimensional multi-head attention (MHA) added in parallel to the existing BERT layers
2) Training schedule: a dedicated task-sampling method that addresses the problems arising when the amount of data is imbalanced across tasks
3) An empirical comparison of other adaptation methods that can be applied to self-attention-based models

2. Background

Multi-task learning sharing approaches

1) hard parameter sharing: shared hidden layers + task-specific output layers
2) soft parameter sharing: a separate model per task + regularisation on the distance between the models' parameters (e.g. L2 norm, trace norm)
3) This paper uses hard parameter sharing: adapters are added to the existing shared layers, with a separate output layer per task

2.1. Adaptation Parameters

Learning hidden unit contributions (LHUC)

A learnable scalar is added for the hidden units of each layer, and the hidden value multiplied by it is used as the output
Requires very few additional parameters
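
As a rough illustration of the LHUC idea (a minimal sketch, not the original implementation), each hidden unit can be multiplied by its own learnable amplitude. The `2 * sigmoid(r)` parameterisation and the function name `lhuc_scale` are assumptions made for this example.

```python
import math

def lhuc_scale(hidden, r):
    """Rescale each hidden unit by its own learnable amplitude.

    Assumption for this sketch: the amplitude is 2 * sigmoid(r_k),
    so it stays in (0, 2) and r_k = 0 leaves the unit unchanged.
    """
    return [h * 2.0 / (1.0 + math.exp(-r_k)) for h, r_k in zip(hidden, r)]

hidden = [0.5, -1.0, 2.0]
r = [0.0, 0.0, 0.0]            # one extra parameter per hidden unit
print(lhuc_scale(hidden, r))   # amplitude 1 everywhere, values unchanged
```

Only the vector `r` is task-specific here, which is why the number of added parameters stays so small.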

Residual adapter modules

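Proposed for use in computer vision
Each module has a 1 x 1 filter bank and a skip connection
Depending on how the module is inserted, there are in-series (added between layers) and parallel (the input is fed to the module separately) variants (both come up again later, but the in-series variant is dropped because it performs poorly)
Each task needs its own $C \times C$ matrix, but a low-rank approximation can reduce the number of parameters required; this paper relies on it heavily (details later)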

2.2. Fine-tuning Approaches

transfer learning trend

1) Pre-train the model with a language modelling objective, then
2) attach an output layer suited to each task and fine-tune

BERT

BERT is the representative model following this recipe
Pre-training objectives
1) masked language modelling
2) next sentence prediction

auto-encoding model:

At prediction time, both left and right context are used (note: the opposite is an auto-regressive model, of which RNNs and GPT are representative)
Based on the transformer encoder (GPT uses the decoder)

Houlsby's approach

Proposes an adapter similar to this paper's low-rank layers
However, the BERT model is frozen while the adapters are trained (this greatly reduces training computation, i.e. lower time complexity)
In contrast, this paper fine-tunes the entire BERT model
- Downside 1: interference + forgetting of stored knowledge (interference is discussed later)
- Downside 2: training requires access to examples from all tasks at the same time
- However, fewer parameters are needed than with the approach above (lower space complexity)

3. Adapting Self Attention

3.1. Model Architecture and Multi-head Attention

Overview of the original BERT architecture

input: a sequence (tokens - hidden vectors)
output: a vector representation of that sequence
first token $[CLS]$: the final state of $[CLS]$ is used as the vector for classification or regression tasks

Multi-head attention - $MH(\vec{h})$

$n$ different dot-product attention
attention: represents a sequence element with a weighted sum of the hidden states of all the sequence elements
MHA: the weights in the sum use dot-product similarity between transformed hidden states
Each head applies dot-product attention and produces its own final hidden state
The outputs of all $n$ heads are concatenated
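
The description above can be sketched in plain Python as follows. This is a toy illustration under stated assumptions, not BERT's actual code: the scores are scaled by the square root of the head dimension (the usual Transformer convention, not stated in this post), the final output projection is omitted, and all function names are hypothetical.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(H, Wq, Wk, Wv):
    """One dot-product attention head over a sequence of hidden states H."""
    Q = [matvec(Wq, h) for h in H]
    K = [matvec(Wk, h) for h in H]
    V = [matvec(Wv, h) for h in H]
    scale = math.sqrt(len(Q[0]))  # assumption: scaled dot-product
    out = []
    for q in Q:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in K])
        # weighted sum of the value vectors of all sequence elements
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

def multi_head(H, heads):
    """Run every head and concatenate their outputs position by position."""
    per_head = [attention_head(H, *params) for params in heads]
    return [sum((head[t] for head in per_head), []) for t in range(len(H))]

I2 = [[1.0, 0.0], [0.0, 1.0]]          # toy projection matrices
H = [[1.0, 0.0], [0.0, 1.0]]           # two tokens, hidden size 2
out = multi_head(H, [(I2, I2, I2), (I2, I2, I2)])
print(len(out), len(out[0]))           # 2 positions, 2 heads x 2 dims each
```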

Self attention - $SA(\vec{h})$

The sum of $\vec{h}$ and $MH(\vec{h})$ (the MHA output), i.e. a residual connection, is layer-normalised ($LN$), and the result is passed through a feed-forward network ($FFN$)
$$SA(\vec{h}) = FFN(LN(\vec{h} + MH(\vec{h})))$$
$$FFN(\vec{h}) = W_2f(W_1\vec{h} + b_1) + b_2$$

Bert Layer - $BL(\vec{h})$

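Finally, $\vec{h}$ and $SA(\vec{h})$ are added once more (a second residual connection) and passed through LN to produce the layer output
$$BL(\vec{h}) = LN(\vec{h} + SA(\vec{h}))$$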
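
A minimal per-position sketch of this layer, assuming the layer output is $LN(\vec{h} + SA(\vec{h}))$ (consistent with the update rule in Section 3.3) and that $f$ is GELU as in BERT. The attention output `mh_out` is taken as given, the learnable gain and bias of layer normalisation are left out, and all function names are hypothetical.

```python
import math

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def layer_norm(x, eps=1e-12):
    # Normalise to zero mean and unit variance (no learnable gain/bias here)
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def gelu(v):
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def ffn(h, W1, b1, W2, b2):
    # FFN(h) = W2 f(W1 h + b1) + b2
    hidden = [gelu(a + b) for a, b in zip(matvec(W1, h), b1)]
    return [a + b for a, b in zip(matvec(W2, hidden), b2)]

def self_attention_block(h, mh_out, W1, b1, W2, b2):
    # SA(h) = FFN(LN(h + MH(h)))
    return ffn(layer_norm([a + b for a, b in zip(h, mh_out)]), W1, b1, W2, b2)

def bert_layer(h, mh_out, W1, b1, W2, b2):
    # BL(h) = LN(h + SA(h))
    sa = self_attention_block(h, mh_out, W1, b1, W2, b2)
    return layer_norm([a + b for a, b in zip(h, sa)])

h = [1.0, 3.0]
mh_out = [0.5, -0.5]                               # stand-in for MH(h)
W1 = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
b1 = [0.0, 0.0, 0.0, 0.0]
W2 = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
b2 = [0.0, 0.0]
print(bert_layer(h, mh_out, W1, b1, W2, b2))
```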

output:

The final state of $[CLS]$ is used, after passing through what is called the pooling layer (a $d \times d$ linear transformation)
A non-linearity is applied to the pooling layer output, which is then projected into the output space

3.2. Adding Parameters to the Top

Task-specific layers are added on top of the BERT model
BERT itself is left unchanged, and a task-specific function $TS()$ is added at the top
$$\vec{h^f} = TS(BERT(\{\vec{h}_t\}_{t=0}^{l}))$$
$\vec{h^f}$: final hidden state for $[CLS]$
$TS()$ is applied to the $[CLS]$ vector obtained after passing through all the BERT layers (from $0$ to $l$)

$BERT()$

Shared identically across all tasks, so even when $n$ tasks are run, the $BERT()$ forward pass only needs to be computed once

$TS()$

Relatively lightweight, with several options to choose from -> their performance is compared in the experiment section
linear transform + non-linearity: the simplest option, requiring relatively few additional parameters
BERT layers: a separate BERT layer added for each task
$V^Dg(V^E\vec{h})$: low-rank approximation + $g()$

low-rank approximation

encoder matrix $V^E$ ($d_s \times d_m$)
decoder matrix $V^D$ ($d_m \times d_s$)
$d_s < d_m$: a full transformation would need a $d_m \times d_m$ matrix, but choosing a small $d_s$ reduces the parameter count from $d_m^2$ to $2 d_s d_m$ ($d_m$: model size)
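
To make the saving concrete, here is a small worked count. $d_m = 768$ is the hidden size of BERT-base; $d_s = 64$ is a hypothetical value picked only for illustration, and biases are ignored.

```python
def full_matrix_params(d_m):
    """Parameters of a full d_m x d_m transformation."""
    return d_m * d_m

def low_rank_params(d_m, d_s):
    """Parameters of the encoder V^E (d_s x d_m) plus the decoder V^D (d_m x d_s)."""
    return d_s * d_m + d_m * d_s

d_m = 768   # hidden size of BERT-base
d_s = 64    # hypothetical projection size for this example

print(full_matrix_params(d_m))      # 589824
print(low_rank_params(d_m, d_s))    # 98304
```

With these numbers the low-rank pair uses one sixth of the parameters of the full matrix.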

$g()$

$MHA$: Projected Attention (a residual connection + layer-norm is optional)
$FFN$: one or two layer feed-forward network (followed by a residual connection + layer-norm)

3.3. Adding Parameters within BERT

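Task-specific layers are added inside the BERT model itself, which changes the $BERT()$ function (the idea comes from residual adapter modules)
As mentioned above, depending on where the module is added there are in-parallel and serial variants, and only the in-parallel variant performs well (the reason is discussed later)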
$$\vec{h^{l+1}} = LN(\vec{h^l} + SA(\vec{h^l}) + TS(\vec{h^l}))$$
$$TS(\vec{h}) = V^Dg(V^E\vec{h})$$

$g()$

1) identity function (just low-rank layer)
2) MHA
3) PAL: MHA + shared $V^D$, $V^E$ across layers (not tasks)
4) FFN + shared $V^D$, $V^E$
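
The update rule $\vec{h^{l+1}} = LN(\vec{h^l} + SA(\vec{h^l}) + TS(\vec{h^l}))$ with $TS(\vec{h}) = V^Dg(V^E\vec{h})$ can be sketched as below. This is an illustration under assumptions rather than the paper's code: $g$ is passed in as a function over the whole projected sequence (the identity gives the plain low-rank layer; for a PAL it would be a small multi-head attention), the $SA$ output is taken as given, and the helper names are hypothetical.

```python
import math

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def layer_norm(x, eps=1e-12):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def task_specific(H, V_E, V_D, g):
    """TS(h) = V^D g(V^E h), applied to a sequence of hidden states H."""
    projected = [matvec(V_E, h) for h in H]    # d_m -> d_s
    mixed = g(projected)                       # task-specific function in the small space
    return [matvec(V_D, s) for s in mixed]     # d_s -> d_m

def adapted_layer(H, SA_out, V_E, V_D, g):
    """h^{l+1} = LN(h^l + SA(h^l) + TS(h^l)) for every position."""
    TS_out = task_specific(H, V_E, V_D, g)
    return [layer_norm([a + b + c for a, b, c in zip(h, sa, ts)])
            for h, sa, ts in zip(H, SA_out, TS_out)]

def identity(seq):
    return seq

# Toy sizes: d_m = 4, d_s = 2
V_E = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
V_D = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
H = [[1.0, 2.0, 3.0, 4.0]]
SA_out = [[0.0, 0.0, 0.0, 0.0]]                # stand-in for SA(h)

print(task_specific(H, V_E, V_D, identity))    # [[1.0, 2.0, 0.0, 0.0]]
print(adapted_layer(H, SA_out, V_E, V_D, identity))
```

For options 3) and 4), the same `V_E` and `V_D` would be reused by every layer of one task, while each layer keeps its own `g`; each task still has its own pair.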

4. Multi-task Training and Experiment Setup

4.1. Sampling Tasks

1) round robin:

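The simplest approach: cycle through the tasks in a fixed order so that each task is trained equally often
When data is imbalanced across tasks, this can overfit the small tasks and underfit the large ones
Adding a regularisation hyperparameter for each task can mitigate the problem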

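2) proportional

Tasks are sampled in proportion to the share of the dataset each task's data makes up
This has the advantage that all task data is trained on evenly
However, a task trained for comparatively many steps can degrade the performance of the other tasks (interference)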

3) square root sampling or annealed sampling

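Devised to address the problem above
$p_i$: the probability of sampling task $i$; $N_i$: the number of training examples for task $i$ ($\alpha = 1$ recovers proportional sampling)
$\alpha = 0.5$ gives square root sampling
For annealed sampling, $\alpha$ is decreased over the course of training:
E: the total number of epochs
e: the current epoch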
$$p_i \propto N_i^{\alpha}$$
$$\alpha = 1 - 0.8\frac{e-1}{E-1}$$
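
The two formulas above can be turned into a short sampler. This is a sketch under assumptions: the dataset sizes are made-up round numbers, the function names are hypothetical, and the guard for a single-epoch run is my own addition to avoid dividing by zero.

```python
import random

def annealed_alpha(epoch, total_epochs):
    """alpha = 1 - 0.8 * (e - 1) / (E - 1), with epochs counted from 1."""
    if total_epochs <= 1:
        return 1.0   # assumption: fall back to proportional sampling
    return 1.0 - 0.8 * (epoch - 1) / (total_epochs - 1)

def task_probabilities(sizes, alpha):
    """p_i proportional to N_i ** alpha, normalised to sum to 1."""
    weights = [n ** alpha for n in sizes]
    total = sum(weights)
    return [w / total for w in weights]

def sample_task(sizes, alpha, rng=random):
    """Draw the index of the task to train on for the next step."""
    probs = task_probabilities(sizes, alpha)
    return rng.choices(range(len(sizes)), weights=probs, k=1)[0]

sizes = [400_000, 2_500]   # hypothetical: one large task, one small task
for epoch in (1, 5, 10):
    alpha = annealed_alpha(epoch, 10)
    probs = task_probabilities(sizes, alpha)
    print(epoch, round(alpha, 3), [round(p, 3) for p in probs])
```

As $\alpha$ falls from 1 towards 0.2, the small task is sampled more and more often relative to its size, so later epochs are closer to training the tasks evenly.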

4.2. Setup

Introduces the hyperparameters
The number of heads in the MHA does not have a large effect
Using a pre-trained model gives better performance than training from scratch
Some experiments freeze BERT while training PALs and the other methods, but most experiments fine-tune BERT as well

4.3. Details of GLUE Tasks

Of the 9 GLUE tasks, 8 are used; Winograd NLI is excluded

5. Experiments and Discussion

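The main baseline is a set of 8 fine-tuned BERT models, one trained per task (assumed to be the upper bound on performance)
The largest improvement was on RTE, the task with the least data, presumably because it benefits directly or indirectly from information shared by the other tasks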

5.1. PALs and Alternative

Low-rank layers and PALs generally performed best

5.2. Where should we add Adaptation Modules?

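$\text{within} > \text{top}$
$\text{every layer} > \text{final half} > \text{first half}$: however, from the standpoint of sharing operations, every layer is the worst choice (PALs are task-specific layers, so a different $BERT()$ has to be run for each task)
$\text{parallel} > \text{serial}$: presumably because a parallel module is only a small change (perturbation), whereas a serial module alters the knowledge the model already holds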

6. Further Discussion

1) annealing method

How should training examples be scheduled?
Annealing gradually reduces the influence of data size on the sampling probability
It increases not only mean performance but also the variance of performance across random seeds

2) Projected Attention Layers & low-rank transformations

Projected Attention Layers: higher performance than the other methods for the number of parameters used
Low-rank transformations: good performance despite being the simplest method
With no constraint on the number of parameters: adding a BERT layer on top of the existing BERT model is recommended
With no constraint on shared operations: adding PALs to every layer is recommended
Adapting only the final half of the base model gives gains in both performance and operation sharing
