์ƒˆ์†Œ์‹

IN DEPTH CAKE/ML-WIKI

Inductive Bias, ๊ทธ๋ฆฌ๊ณ  Vision Transformer (ViT)


 

๋“ค์–ด๊ฐ€๋Š” ๋ง

 

Transformer๋Š”  (CNN๋ณด๋‹ค)  Inductive Bias๊ฐ€ ์•ฝํ•œ ๋„คํŠธ์›Œํฌ๋กœ, general-purpose ๋„คํŠธ์›Œํฌ์˜ ์ƒˆ๋กœ์šด ์ง€ํ‰์„ ์—ฐ ๊ตฌ์กฐ๋กœ ํ‰๊ฐ€๋ฐ›์Šต๋‹ˆ๋‹ค. Inductive Bias๊ฐ€ ์ ๋‹ค๋Š” ๊ฒƒ์€ ์–‘๋‚ ์˜ ๊ฒ€์ธ๋ฐ, ์ด๋ฅผ ์ดํ•ดํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” Inductive Bias๊ฐ€ ๋ฌด์—‡์ธ์ง€ ๊ทธ๋ฆฌ๊ณ  Inductive Bias๊ฐ€ ํ•™์Šต์— ๋ผ์น˜๋Š” ์˜ํ–ฅ์„ ์ดํ•ดํ•  ํ•„์š”๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค. ๋ณธ ๊ธ€์—์„œ๋Š” ์ถ”์ƒํ™”๋œ ํ˜•ํƒœ๋กœ Inductive Bias๋ฅผ ์„ค๋ช…ํ•ด๋ณด๋ ค ํ•ฉ๋‹ˆ๋‹ค.

 

In computer vision, there has recently been a surge of interest in end-to-end Transformers, prompting efforts to replace hand-wired features or inductive biases with general-purpose neural architectures powered by data-driven training.
— Chen et al., ICLR 2022

 

 

 

 

_________________

๐Ÿซฅ ์ผ๋‹จ ํ•œ๋งˆ๋””๋กœ ์„ค๋ช…ํ•ด ๋ณด์ž๋ฉด,

 

 

Inductive Bias๋ž€, ‘๋ชจ๋ธ’์ด ์ž์ฒด์ ์œผ๋กœ (๊ตฌ์กฐ์ ์œผ๋กœ) ๊ฐ€์ง€๊ณ  ์žˆ๋Š” ํŽธ๊ฒฌ์ด๋‹ค.

 

์ฐธ๊ณ ๋กœ, ์—ฌ๊ธฐ์„œ ๋งํ•˜๋Š” “๋ชจ๋ธ์ด ์ž์ฒด์ ์œผ๋กœ”๋ผ๋Š” ๋ง์€ ๋ชจ๋ธ์˜ ํŒŒ๋ผ๋ฏธํ„ฐ์˜ ๊ฐ’์— ๋Œ€ํ•œ ์ด์•ผ๊ธฐ๋ผ๊ธฐ๋ณด๋‹ค๋Š” ๋ชจ๋ธ์˜ ๊ตฌ์กฐ๋ฅผ ๋งํ•œ๋‹ค. ์˜ˆ๋ฅผ ๋“ค๋ฉด RNN์ด๋ผ๊ณ  ํ•œ๋‹ค๋ฉด ‘์ž…๋ ฅ์—์„œ ์ถœ๋ ฅ์œผ๋กœ๊นŒ์ง€ ๋‚˜์˜ค๊ธฐ๊นŒ์ง€ ๋ชจ๋ธ์ด ์ž…๋ ฅ ๋ฐ์ดํ„ฐ์˜ ํŠน์ง•๋“ค์„ ์žฌ๊ท€์ ์œผ๋กœ ์ฒ˜๋ฆฌํ•˜๋Š” ๊ตฌ์กฐ’, CNN์ด๋ผ๊ณ  ํ•œ๋‹ค๋ฉด ‘์ž…๋ ฅ ์ด๋ฏธ์ง€์— ๋Œ€ํ•ด์„œ ์ง€์—ญ์ (local)์ธ ํ”ฝ์…€ ๊ฐ’๋“ค์„ ๊ฐ€์ง€๊ณ  ๊ทธ๋‹ค์Œ ์ƒ์œ„ ์ •๋ณด๋ฅผ ์‚ฐ์ถœํ•˜๋Š” ๊ตฌ์กฐ’ ๋ง์ด๋‹ค.

 

๋‹ค๋ฅด๊ฒŒ ๋งํ•˜๋ฉด,

 

 

๋ชจ๋ธ์ด ์ž…๋ ฅ๋ฐ์ดํ„ฐ์™€ ์ถœ๋ ฅ ๋ฐ์ดํ„ฐ์˜ ๊ด€๊ณ„์— ๋Œ€ํ•ด
๋‚ด์žฌ์ ์œผ๋กœ ๊ฐ€์ง€๊ณ  ์žˆ๋Š” ์ผ์ข…์˜ ๊ด€์ ์ด๋ผ๊ณ  ๋งํ•  ์ˆ˜ ์žˆ๊ฒ ๋‹ค.

 

 

 

๐Ÿค” ์กฐ๊ธˆ ๋” ์ž์„ธํžˆ ์„ค๋ช…ํ•ด ๋ณด์ž,

 

Inductive Bias์— ๋Œ€ํ•ด์„œ ์กฐ๊ธˆ ๋” ์ž์„ธํžˆ ์„ค๋ช…ํ•˜๋ ค๋ฉด, ๊ธฐ๊ณ„ ํ•™์Šต์—์„œ ๋งํ•˜๋Š” ‘ํ•™์Šต (learning)์ด๋ž€ ๋ฌด์—‡์ธ๊ฐ€’๋กœ ๊ฑฐ์Šฌ๋Ÿฌ ์˜ฌ๋ผ๊ฐ€์•ผ ํ•œ๋‹ค. 

 

๊ธฐ๊ณ„ํ•™์Šต์—์„œ ํ•™์Šต์€ ๊ฒฐ๊ตญ ์ตœ์ ์˜ ํ•จ์ˆ˜๋ฅผ ์ฐพ๋Š” ๊ณผ์ •์ด๋ผ๊ณ  ๋ณผ ์ˆ˜ ์žˆ๋‹ค. ๋ฌธ์ œ๋Š” ์„ธ์ƒ์˜ ๋ชจ๋“  ํ•จ์ˆ˜ (hypothesis)๋“ค ์ค‘์—์„œ '์ตœ์ '์„ ์ฐพ๋Š” ๊ฒƒ์ด ์–ด๋ ต๊ธฐ ๋•Œ๋ฌธ์— ๊ฐœ๋ฐœ์ž๋“ค์€ '์ด ์ค‘์—์„œ ์ฐพ์•„'๋ผ๊ณ  ์ผ์ข…์˜ ํ•œ์ •๋œ ์ง‘๋‹จ์„ ์ •ํ•ด์ค€๋‹ค. ์ด๋ฅผ Hypothesis Set์ด๋ผ๊ณ  ํ•œ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ๊ธฐ๊ณ„ ํ•™์Šต์—์„œ Hypothesis Set์€ ๋ชจ๋ธ์˜ ๊ตฌ์กฐ์— ์˜ํ•ด์„œ ๊ฒฐ์ •๋œ๋‹ค.

๋”๋ณด๊ธฐ

์ฐธ๊ณ ๋กœ, ์ตœ์ ์˜ hypothesis๋ฅผ ์ฐพ๋Š”๋‹ค๋Š” ๊ฒƒ์€ ํ•™์Šต ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด์„œ๋งŒ ์ž˜ ๋งž์„ ๋ฟ ์•„๋‹ˆ๋ผ ๋ณด์ง€ ๋ชปํ•œ ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด์„œ๋„ ์ž˜ ๋งž์ถ˜๋‹ค๋Š” ๊ฒƒ์„ ์˜๋ฏธํ•œ๋‹ค (์ฆ‰, ์ผ๋ฐ˜ํ™”(generalization)๊ฐ€ ์ž˜ ๋˜๋Š” ๊ฒƒ์„ ์˜๋ฏธํ•œ๋‹ค)

 

 

 

์˜ˆ๋ฅผ ๋“ค์–ด, ๋ฌธ์ œ๊ฐ€ ์ฃผ์–ด์ง€๊ณ  ์ด๋ฅผ Linear Regression ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•ด์„œ ํ’€๊ณ ์ž ํ•œ๋‹ค๊ณ  ์ƒ๊ฐํ•ด ๋ณด์ž. ๊ทธ๋Ÿฌ๋ฉด Hypothesis Set์€ ์ •์˜ํ•œ Linear Regression ๋ชจ๋ธ์ด ํ‘œํ˜„ํ•  ์ˆ˜ ์žˆ๋Š” ์ง‘ํ•ฉ์œผ๋กœ ํ•œ์ •๋  ๊ฒƒ์ด๊ณ , Linear Regression ๋ชจ๋ธ์˜ ์ผ๋ฐ˜ํ™” ์ •ํ™•๋„๋ฅผ ์ตœ๋Œ€๋กœ ํ•  ์ˆ˜ ์žˆ๋Š” ์ตœ์ ์˜ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์ฐพ๊ธฐ ์œ„ํ•ด ํ•™์Šต์„ ์ˆ˜ํ–‰ํ•  ๊ฒƒ์ด๋‹ค.

์ด๋•Œ, Linear Reression ๋ชจ๋ธ์„ ์„ ํƒํ•œ๋‹ค๋Š” ๊ฒƒ์€ ์ข…์† ๋ณ€์ˆ˜ y์™€ ํ•œ ๊ฐœ ์ด์ƒ์˜ ๋…๋ฆฝ ๋ณ€์ˆ˜ X ์‚ฌ์ด์— ์„ ํ˜• ๊ด€๊ณ„๊ฐ€ ์žˆ๋‹ค๋Š” ๊ฐ€์ •์ด ๋‚ด์žฌ๋˜๋Š” ๊ฒƒ์ด๋‹ค. ๋‹ค์‹œ ๋งํ•ด์„œ, Linear Regression ๋ชจ๋ธ์€ ์ข…์† ๋ณ€์ˆ˜์™€ ๋…๋ฆฝ ๋ณ€์ˆ˜ ์‚ฌ์ด์˜ ์„ ํ˜• ๊ด€๊ณ„๋ผ๋Š” ์ผ์ข…์˜ "๊ท€๋‚ฉ์ ์ธ ํŽธํ–ฅ"์„ ๊ฐ–๊ฒŒ ๋˜๋Š” ๊ฒƒ์ด๋‹ค.

 

๋”ฅ๋Ÿฌ๋‹ ๋ชจ๋ธ๋“ค๊ณผ Inductive Bias

 

์ด์ œ ์ „ํ˜•์ ์ธ ๋”ฅ๋Ÿฌ๋‹ ๋ชจ๋ธ๋“ค๋กœ ๋„˜์–ด์™€์„œ ์กฐ๊ธˆ ๋” ์ด์•ผ๊ธฐํ•ด ๋ณด์ž. ๊ทธ๋Ÿฌ๋ฉด Fully Connected ๋ชจ๋ธ์ด๋‚˜ CNN, RNN ๋“ฑ์ด ๊ฐ–๋Š” Inductive Bias๋Š” ๋ฌด์—‡์ผ๊นŒ? ์•„๋ž˜ Table 1๊ณผ Figure 1์€ (Battaglia et al. 18)์— ์ •๋ฆฌ๋œ ๋‚ด์šฉ์ด๋‹ค. ๋Œ€ํ‘œ์ ์œผ๋กœ CNN์„ ๋“ค์—ฌ๋‹ค๋ณด๋ฉด, CNN์˜ convolution layer๋Š” weight parameter๋ฅผ ์ง€์—ญ์ ์œผ๋กœ ๊ณต์œ ํ•จ์œผ๋กœ์จ "์ธ์ ‘ํ•œ ๊ฐ’๋“ค ๊ฐ„์— ์œ ์˜๋ฏธํ•œ ์ƒ๊ด€๊ด€๊ณ„๋ฅผ ๊ฐ–๋Š”๋‹ค"๋Š” ํŽธํ–ฅ์„ ๊ฐ–๊ฒŒ ๋œ๋‹ค. 

 

 

Inductive Bias์˜ ์—ญํ• 

Inductive Bias๋Š” ์•„๊นŒ ๋งํ•œ ๊ฒƒ์ฒ˜๋Ÿผ Hypothesis Space๋ฅผ ๊ฒฐ์ •ํ•˜๋Š” ์—ญํ• ์„ ํ•ฉ๋‹ˆ๋‹ค. Hypothesis Space๋Š” ๋‹ค๋ฅธ ํ‘œํ˜„์œผ๋กœ ์„ค๋ช…ํ•˜์ž๋ฉด '์ตœ์ ์˜ ๋ชจ๋ธ์„ ์ฐพ๋Š” ๊ณต๊ฐ„'์„ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค. ์ „์ฒด ํƒ์ƒ‰ ๊ณต๊ฐ„์˜ ํฌ๊ธฐ๊ฐ€ ํฌ๋‹ค๋ฉด ๋ฐ์ดํ„ฐ์˜ ์ผ๋ฐ˜ํ™” ๊ด€๊ณ„๋ฅผ ๋” ์ž˜ ํ‘œํ˜„ํ•˜๋Š” hypothesis๋ฅผ ์ฐพ๊ธฐ ์œ„ํ•ด ์ƒ๋Œ€์ ์œผ๋กœ ๋งŽ์€ ๋ฐ์ดํ„ฐ๊ฐ€ ํ•„์š”ํ•˜๊ฒŒ ๋˜๊ณ  ํƒ์ƒ‰ ๊ณต๊ฐ„์˜ ํฌ๊ธฐ๊ฐ€ ์ž‘๊ณ  ์ ์ ˆํ•œ ๋ฒ”์œ„๋กœ ์ œํ•œ์‹œ์ผœ ์ค€๋‹ค๋ฉด ์ƒ๋Œ€์ ์œผ๋กœ ์ ์€ ๋ฐ์ดํ„ฐ๋กœ๋„ ์ตœ์ ์˜ hypothesis๋ฅผ ์ฐพ๋Š” ๊ฒƒ์ด ์‰ฌ์›Œ์ง‘๋‹ˆ๋‹ค. ๊ทธ๋ ‡๊ธฐ ๋•Œ๋ฌธ์— ๋” ์ ์ ˆํ•œ Inductive Bias๋ฅผ ์ œ๊ณตํ•ด ์ค„ ์ˆ˜ ์žˆ๋‹ค๋ฉด ๋” ์ ์€ ๋ฐ์ดํ„ฐ์…‹์œผ๋กœ๋„ ์ผ๋ฐ˜ํ™”๋œ ๋ชจ๋ธ์„ ์ž˜ ์ฐพ์„ ์ˆ˜ ์žˆ๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

 

ํ•˜์ง€๋งŒ ์—ญ์œผ๋กœ ์ƒ๊ฐํ•ด ๋ณด๋ฉด Inductive Bias๋Š” ๋ง ๊ทธ๋Œ€๋กœ Bias, ์ฆ‰ ํŽธ๊ฒฌ์„ ์ œ๊ณตํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. ๋ชจ๋ธ์ด ๊ฐ–๊ณ  ์žˆ๋Š” Inductive Bias๊ฐ€ ๋ฐ์ดํ„ฐ์˜ ์ƒ๊ด€๊ด€๊ณ„๋ฅผ ์ถฉ๋ถ„ํžˆ ์ž˜ ํ‘œํ˜„ํ•  ์ˆ˜ ์žˆ๋‹ค๋ฉด ๋ฌธ์ œ ๋˜์ง€ ์•Š๊ฒ ์ง€๋งŒ, ๊ทธ๋ ‡์ง€ ์•Š๋‹ค๋ฉด ์˜คํžˆ๋ ค ํ‘œํ˜„ํ•ด์•ผ ํ•˜๋Š” ์˜์—ญ์„ ์•„์˜ˆ ํ‘œํ˜„ํ•˜์ง€ ๋ชปํ•˜๊ฒŒ ๋  ์ˆ˜๋„ ์žˆ์Šต๋‹ˆ๋‹ค. ๊ทธ๋ ‡๊ธฐ ๋•Œ๋ฌธ์— Inductive Bias๋Š” ์–‘๋‚ ์˜ ๊ฒ€์ด ๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

๊ทธ๋Ÿฌํ•œ ๊ด€์ ์—์„œ Vision Transformer (ViT)๋Š” Inductive Bias๊ฐ€ ์ƒ๋Œ€์ ์œผ๋กœ ๋‚ฎ์€ ๋ชจ๋ธ๋กœ์„œ ๊ธฐ์กด์˜ CNN ์ด ์ง€์—ญ์  (local) ํŠน์ง•์œผ๋กœ๋ถ€ํ„ฐ ์ „์—ญ์  (global) ํŠน์ง•์„ ์ฐพ์•„๊ฐ”๋˜ ๊ฒƒ๊ณผ ๋‹ฌ๋ฆฌ, ์ฒ˜์Œ๋ถ€ํ„ฐ ์ „์—ญ์ ์ธ ํŠน์ง•์„ ์ฐพ์œผ๋ ค๊ณ  ํ•ฉ๋‹ˆ๋‹ค. ํ•œ๋งˆ๋””๋กœ, ํŽธ๊ฒฌ ์—†์ด ๋ฌธ์ œ๋ฅผ ํ’€๋ ค๊ณ  ํ•œ๋‹ค๊ณ  ์ƒ๊ฐํ•˜๋ฉด ๋ฉ๋‹ˆ๋‹ค. CNN์€ ์ด๋ฏธ์ง€๋ผ๋Š” ๋ฐ์ดํ„ฐ๋ฅผ ํ•ด์„ํ•˜๋Š” ๋ฐฉ์‹์„ ์ง€์—ญ์  ํŠน์ง• โžฃ ๊ณ ์ฐจ์›์  ํŠน์ง•์„ ํ•ด์„ํ•˜๋„๋ก ์ œ์‹œํ•˜๊ณ  ์žˆ๋Š” ๋ฐ˜๋ฉด Transformer๋Š” ์ „์ฒด ๋ฐ์ดํ„ฐ๋ฅผ ํ•œ ๋ฒˆ์— ํ•ด์„ํ•˜๋ ค๊ณ  ํ•œ๋‹ค๊ณ  ์ดํ•ดํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๊ทธ๋ ‡๊ธฐ ๋•Œ๋ฌธ์— 'ํŽธ๊ฒฌ ์—†๋Š” ๋” ์ผ๋ฐ˜ํ™”๋œ hypothesis'๋ฅผ ์ฐพ์„ ์ˆ˜๋„ ์žˆ์ง€๋งŒ, ๋ฐ˜๋ฉด์— ๊ทธ๋งŒํผ ์ถฉ๋ถ„ํ•œ ๋ฐ์ดํ„ฐ๊ฐ€ ์—†์œผ๋ฉด ๋ถˆ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค. Transformer์—์„œ ๋ฐ์ดํ„ฐ๊ฐ€ ์ ์€ ๊ฒฝ์šฐ์— ๋Œ€ํ•œ ์—ฐ๊ตฌ (Pre-Training, Augmentation ๋“ฑ)๊ฐ€ ํ™œ๋ฐœํ•œ ์ด๋ฃจ์–ด์ง„ ์ด์œ ์ž…๋‹ˆ๋‹ค.

 

Transformers lack some of the inductive biases inherent to CNNs, such as translation equivariance and locality, and therefore do not generalize well when trained on insufficient amounts of data.
— Dosovitskiy et al., 2020

 

 

 

์ถ”๊ฐ€๋กœ ์กฐ๊ธˆ ๋” ์„ค๋ช…ํ•ด๋ณด์ž๋ฉด

 

Vision Transformer (ViT) ๋…ผ๋ฌธ์€ ์™œ ViT๊ฐ€ CNN์— ๋น„ํ•ด์„œ ์ด๋ฏธ์ง€ ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•œ Inductive Bias๊ฐ€ ๋ถ€์กฑํ•œ์ง€๋ฅผ ์ถ”๊ฐ€๋กœ ์„ค๋ช…ํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋ฏธ์ง€ ๋ฐ์ดํ„ฐ๋ฅผ ๋ถ„์„ํ•˜๋Š” ๋ฐ ์žˆ์–ด์„œ CNN์€ parameter sharing์„ ํ†ตํ•œ hierarchical view๋ฅผ ์ œ๊ณตํ•˜๋Š” ๋ฐ˜๋ฉด, ViT์—์„œ ํŒŒ๋ผ๋ฏธํ„ฐ ๊ณต์œ ๋Š” MLP ๋ ˆ์ด์–ด๊ฐ€ ๊ฑฐ์˜ ์œ ์ผํ•ฉ๋‹ˆ๋‹ค. ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ ViT๋Š” ๊ตฌ์กฐ์ ์œผ๋กœ ํŒจ์น˜ ๊ฐ„์˜ ์ƒ๊ด€ ๊ด€๊ณ„ ํ•ด์„ ๋ฐฉ์‹ ์กฐ์ฐจ๋„ ์ž์ฒด์ ์œผ๋กœ ํ•™์Šตํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. (CNN์€ ํ”ฝ์…€ ๊ฐ„ ํ•ด์„ ๋ฐฉ์‹์„ 'hierarchical'์ด๋ผ๊ณ  ๊ตฌ์กฐ์ ์œผ๋กœ ์ œ์‹œ) ์ฆ‰, ์ด๋ฏธ์ง€ ํ•ด์„์— ๋Œ€ํ•ด ํŒจ์น˜ ๊ฐ„์˜ ์ƒ๊ด€๊ด€๊ณ„ ์ž์ฒด๋ฅผ ํ•™์Šตํ•˜๋„๋ก ํ•˜๋Š” ์ž์œ ๋ฅผ ๋” ์ฃผ๊ธฐ ๋•Œ๋ฌธ์— ์ƒ๋Œ€์ ์œผ๋กœ Inductive Bias๊ฐ€ ๋” ๋‚ฎ๊ณ  ์ด๋ฅผ ํ•™์Šตํ•˜๊ธฐ ์œ„ํ•ด ํ•„์š”ํ•œ ๋ฐ์ดํ„ฐ ์ˆ˜๊ฐ€ ๋” ๋งŽ๋‹ค๊ณ  ์ด์•ผ๊ธฐํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

 

๐Ÿ“š Reference

  • Battaglia et al. "Relational inductive biases, deep learning, and graph networks." arXiv preprint arXiv:1806.01261 (2018).
  • Chen et al. "When Vision Transformers Outperform ResNets Without Pre-training or Strong Data Augmentations." ICLR 2022.
  • Dosovitskiy, Alexey, et al. "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale." arXiv preprint arXiv:2010.11929 (2020).

 

 

๋ฐ˜์‘ํ˜•
Contents

ํฌ์ŠคํŒ… ์ฃผ์†Œ๋ฅผ ๋ณต์‚ฌํ–ˆ์Šต๋‹ˆ๋‹ค

์ด ๊ธ€์ด ๋„์›€์ด ๋˜์—ˆ๋‹ค๋ฉด ๊ณต๊ฐ ๋ถ€ํƒ๋“œ๋ฆฝ๋‹ˆ๋‹ค.