[Paper Reading] GPT-2 - Language Models are Unsupervised Multitask Learners

Paper Reading: posts where I read a paper following my own train of thought and summarize it in my own way.

If you were hoping for a polished write-up, I apologize in advance.

I'm starting to read papers and study LLMs in earnest.

Following Transformer, BERT, and GPT-1, this time it's

GPT-2 - Language Models are Unsupervised Multitask Learners (OpenAI, 2019)

https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

Let's read through the paper and study it.

The core topic is the zero-shot setting for language models.

1. Abstract

Natural language processing tasks, such as question answering, machine translation, reading comprehension, and summarization, are typically approached with supervised learning on task-specific datasets.

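NLP tasks are usually handled with supervised learning on task-specific datasets,

producing a separate model for each task.

For example,

question answering uses SQuAD,

machine translation uses WMT,

reading comprehension uses CoQA,

and summarization uses CNN/Daily Mail:

a model is trained on each dataset, which yields a model specialized for that one task.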

We demonstrate that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText.

GPT-2 is trained on WebText, a dataset of millions of web pages, and begins to learn these tasks without any explicit supervision.

(So supervised learning isn't needed..?)

When conditioned on a document plus questions, the answers generated by the language model reach 55 F1 on the CoQA dataset - matching or exceeding the performance of 3 out of 4 baseline systems without using the 127,000+ training examples.

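Given a document plus questions as input, the model reaches 55 F1 on CoQA without training on any of the 127,000+ CoQA training examples, which matches or exceeds 3 of the 4 baseline systems.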
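
To make "conditioned on a document plus questions" concrete, here is a minimal sketch of how such a prompt could be assembled. The paper describes conditioning on the document, the conversation history, and a final "A:" token; the "Q:" prefix, the line layout, and the function name `build_coqa_prompt` are my own assumptions for illustration, not the authors' code.

```python
def build_coqa_prompt(document, history, question):
    """Join a passage, earlier Q/A turns, and the new question into one prompt."""
    lines = [document.strip(), ""]
    for past_question, past_answer in history:
        lines.append(f"Q: {past_question}")  # "Q:" prefix is an assumed format
        lines.append(f"A: {past_answer}")
    lines.append(f"Q: {question}")
    lines.append("A:")  # the language model simply continues the text from here
    return "\n".join(lines)


prompt = build_coqa_prompt(
    "Tom has a red bike. He rides it to school.",
    [("What does Tom have?", "A bike.")],
    "What color is it?",
)
print(prompt)
```

There is no answer head and no fine-tuning here: the task is expressed entirely in the text that the model is asked to continue.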

The capacity of the language model is essential to the success of zero-shot task transfer and increasing it improves performance in a log-linear fashion across tasks.

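Increasing the model's capacity (size, number of parameters) is essential to the success of *zero-shot task transfer,

and performance improves in a *log-linear fashion as capacity grows.

In other words, each extra parameter buys less and less, but every multiplicative increase in size (say, 10x) keeps adding a roughly constant gain.

* zero-shot task transfer: solving a task with general language ability alone, without any training on that specific task.

* log-linear: with the x-axis (capacity) on a log scale, the y-axis (performance) rises roughly along a straight line.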
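
A tiny sketch of what log-linear means numerically. The parameter counts are the four model sizes reported in the paper, but the `intercept`, the `slope`, and therefore every score printed are made-up values purely for illustration, not results from the paper.

```python
import math


def log_linear_score(num_params, intercept=10.0, slope=5.0):
    """Hypothetical score that grows linearly in log10(parameter count)."""
    return intercept + slope * math.log10(num_params)


sizes = [117e6, 345e6, 762e6, 1542e6]  # model sizes; the scores are synthetic
for n in sizes:
    print(f"{n:>16,.0f} params -> score {log_linear_score(n):.2f}")

# Every 10x increase in size adds the same amount (the slope).
gain_first_10x = log_linear_score(1e9) - log_linear_score(1e8)
gain_second_10x = log_linear_score(1e10) - log_linear_score(1e9)
print(gain_first_10x, gain_second_10x)
```

The gain per 10x is constant, while the gain per added parameter keeps shrinking, which is the "slower but steady" behavior described above.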

Our largest model, GPT-2, is a 1.5B parameter Transformer that achieves state of the art results on 7 out of 8 tested language modeling datasets in a zero-shot setting but still underfits WebText.

GPT-2 is a Transformer with 1.5 billion parameters, and in a zero-shot setting it achieves SOTA on 7 of the 8 language modeling datasets tested. It also still underfits WebText.

That suggests there is still room to improve by increasing capacity.

Samples from the model reflect these improvements and contain coherent paragraphs of text. These findings suggest a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.

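Samples generated by the model reflect these improvements and stay coherent at the paragraph level.

That is, the output is more accurate and natural, and the model writes paragraph-length text that follows the flow of the context.

These results point to a promising direction for building language processing systems that learn to perform tasks purely from naturally occurring examples of language use, much as humans do.

(=> The goal is a system that, like a human, just reads documents and naturally picks up language processing tasks such as question answering, machine translation, reading comprehension, and summarization.)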

2. Introduction

Machine learning systems now excel (in expectation) at tasks they are trained for by using a combination of large datasets, high-capacity models, and supervised learning. Yet these systems are brittle and sensitive to slight changes in the data distribution and task specification. Current systems are better characterized as narrow experts rather than competent generalists.

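Today's machine learning systems perform very well on the tasks they are trained for, by combining large datasets, high-capacity models, and supervised learning.

But these systems are brittle: they break down easily when the data distribution or the task specification changes even slightly.

In other words, current ML systems are narrow experts focused on specific tasks rather than competent generalists (models that handle many kinds of problems well).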

We would like to move towards more general systems which can perform many tasks – eventually without the need to manually create and label a training dataset for each one.

GPT-2 aims at a general system that can perform many tasks without a manually created and labeled training dataset for each one.

Our suspicion is that the prevalence of single task training on single domain datasets is a major contributor to the lack of generalization observed in current systems.

The authors suspect that the main obstacle to the general system GPT-2 is after is, as mentioned above, the prevalence of training on a single task with a single-domain dataset.

Current ML systems need hundreds to thousands of examples to induce functions which generalize well. This suggests that multitask training may need just as many effective training pairs to realize its promise with current approaches. It will be very difficult to continue to scale the creation of datasets and the design of objectives to the degree that may be required to brute force our way there with current techniques.

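Current ML systems need hundreds to thousands of examples to induce a single function (model) that generalizes well.

If multitask training works the same way, it will need an enormous number of training pairs, and it is hard to keep scaling dataset creation and objective design that far.

So GPT-2 steps away from that approach and takes a new angle.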

We demonstrate language models can perform down-stream tasks in a zero-shot setting – without any parameter or architecture modification. We demonstrate this approach shows potential by highlighting the ability of language models to perform a wide range of tasks in a zero-shot setting. We achieve promising, competitive, and state of the art results depending on the task.

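From that new angle, they show that a language model can perform downstream tasks in a zero-shot setting, without any parameter or architecture modification, and that this works across a wide range of tasks.

With this approach, GPT-2's results are promising, competitive, or SOTA (best on that task's dataset) depending on the task.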

3. Approach

At the core of our approach is language modeling. Language modeling is usually framed as unsupervised distribution estimation from a set of examples (x1, x2, ..., xn) each composed of variable length sequences of symbols (s1, s2, ..., sn). Since language has a natural sequential ordering, it is common to factorize the joint probabilities over symbols as the product of conditional probabilities

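GPT-2's approach is, at its core, to build an LM (language model).

That is, instead of being trained for a specific task, the model is trained to predict the next token.

A language model is trained, without supervision, to estimate the distribution of a set of examples (x1, x2, ..., xn), each of which is a variable-length sequence of symbols (s1, s2, ..., sn). Because language has a natural order, the *joint probability of the whole sequence is

factorized into a *product of conditional probabilities.

* Joint probability of the whole sequence: the probability of all the symbols s1, ..., sn occurring together in that order.

* Factorization into a product of conditional probabilities: by the chain rule, the joint probability equals the product of each symbol's probability given all the symbols before it.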
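
Written out in the paper's notation, with x = (s_1, ..., s_n), the factorization looks like this. The three-word sentence on the second line is just my own illustrative expansion.

```latex
p(x) = \prod_{i=1}^{n} p(s_i \mid s_1, \ldots, s_{i-1})

p(\text{the cat sat}) = p(\text{the}) \cdot p(\text{cat} \mid \text{the}) \cdot p(\text{sat} \mid \text{the cat})
```

Each factor is exactly a "predict the next symbol given everything so far" problem, which is why next-token prediction is enough to model the whole sequence.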

I covered this LM background earlier in my post on language models.

https://aigaeddo.tistory.com/88


Learning to perform a single task can be expressed in a probabilistic framework as estimating a conditional distribution p(output|input).

Learning a single task can be framed as estimating the conditional distribution p(output | input).

Since a general system should be able to perform many different tasks, even for the same input, it should condition not only on the input but also on the task to be performed. That is, it should model p(output|input, task).

A general system has to perform many different tasks even for the same input, so it should condition not only on the input but also on the task: it should model p(output | input, task).
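
One way to picture p(output | input, task) for a language model: the task, the input, and the output are all just written into one text sequence. The paper gives examples of the form (translate to french, english text, french text) and (answer the question, document, question, answer); the `" | "` separator and the function name below are my own assumptions for this sketch.

```python
def as_sequence(*parts, separator=" | "):
    """Flatten (task, input..., output) into a single text sequence for an LM."""
    return separator.join(parts)


translation = as_sequence("translate to french", "good morning", "bonjour")
qa = as_sequence(
    "answer the question",
    "Tom has a red bike.",
    "What color is the bike?",
    "red",
)
print(translation)
print(qa)
```

The same input text can be paired with different task descriptions, so the task becomes part of what the model conditions on rather than part of the architecture.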

Language modeling is also able to, in principle, learn the tasks of McCann et al. (2018) without the need for explicit supervision of which symbols are the outputs to be predicted.

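In principle, language modeling can learn the tasks of McCann et al. (2018) without being told explicitly which symbols are the outputs to be predicted.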

Our speculation is that a language model with sufficient capacity will begin to learn to infer and perform the tasks demonstrated in natural language sequences in order to better predict them, regardless of their method of procurement. If a language model is able to do this it will be, in effect, performing unsupervised multitask learning.

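The authors speculate that an LM with sufficient capacity (in terms of both data and model) will start to infer and perform the tasks demonstrated in natural language sequences, even without being told the task, simply because doing so helps it predict the text better. If it can, it is in effect performing unsupervised multitask learning.

In other words, once the LM has enough capacity, it moves beyond the conventional explicit input/output format: given nothing but natural language text, it can work out which task is being asked and carry it out.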

We test whether this is the case by analyzing the performance of language models in a zero-shot setting on a wide variety of tasks.

To check this, they analyze LM performance in a zero-shot setting across a wide variety of tasks.

2.1. Training Dataset

Most prior work trained language models on a single domain of text, such as news articles (Jozefowicz et al., 2016), Wikipedia (Merity et al., 2016), or fiction books (Kiros et al., 2015).

Most prior LM work trained on text from a single domain.

A single domain here seems to mean data with a strong character and a clear role, such as news articles, Wikipedia, or fiction.

Instead, we created a new web scrape which emphasizes document quality. To do this we only scraped web pages which have been curated/filtered by humans. Manually filtering a full web scrape would be exceptionally expensive so as a starting point, we scraped all outbound links from Reddit, a social media platform, which received at least 3 karma. This can be thought of as a heuristic indicator for whether other users found the link interesting, educational, or just funny.

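Instead of a single domain, GPT-2 wanted to train on data where real people write about diverse topics in context, with an emphasis on document quality.

So they scraped every outbound link posted on Reddit, a social media platform, that received at least 3 karma.

Karma is the score a Reddit post accumulates from user votes.

Since nobody can judge every page by hand, 3 or more karma serves as a heuristic indicator that other users found the link interesting, educational, or just funny, so the linked page is probably of decent quality.

In short, this simple rule gives a quality check using data that humans have already curated/filtered...! (heuristically, to some extent.)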
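
The karma rule is simple enough to sketch in a few lines. The `Submission` record and the sample URLs below are hypothetical stand-ins I made up; the only part taken from the paper is the "at least 3 karma" threshold.

```python
from dataclasses import dataclass


@dataclass
class Submission:
    """Hypothetical record for one Reddit post that links to an external page."""
    url: str
    karma: int


MIN_KARMA = 3  # threshold stated in the paper


def select_links(submissions, min_karma=MIN_KARMA):
    """Keep outbound links whose submission received at least `min_karma`."""
    return [s.url for s in submissions if s.karma >= min_karma]


posts = [
    Submission("https://example.com/great-essay", 57),
    Submission("https://example.com/spam", 0),
    Submission("https://example.org/tutorial", 3),
]
print(select_links(posts))
```

Note that the filter never looks at the page itself: the human votes are doing all of the quality judgment.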

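* heuristic: a practical rule of thumb that gives a good-enough judgment quickly when a complete, careful evaluation is too costly.

Source: Wikipedia https://ko.wikipedia.org/wiki/%ED%9C%B4%EB%A6%AC%EC%8A%A4%ED%8B%B1_%EC%9D%B4%EB%A1%A0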

The resulting dataset, WebText, contains the text subset of these 45 million links. To extract the text from HTML responses we use a combination of the Dragnet (Peters & Lecocq, 2013) and Newspaper content extractors. All results presented in this paper use a preliminary version of WebText which does not include links created after Dec 2017 and which after de-duplication and some heuristic based cleaning contains slightly over 8 million documents for a total of 40 GB of text. We removed all Wikipedia documents from WebText since it is a common data source for other datasets and could complicate analysis due to overlapping training data with test evaluation tasks.

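WebText, GPT-2's dataset, is the text subset of the 45 million links collected this way.

(The text was extracted from the HTML responses using a combination of the Dragnet and Newspaper content extractors.)

The version of WebText used in the paper excludes links created after December 2017 and, after de-duplication and some heuristic-based cleaning (the paper does not spell out the rules), contains slightly over 8 million documents and 40 GB of text.

All Wikipedia documents were removed from WebText, because Wikipedia is a common source for other datasets and the overlap could complicate later analysis.

(Many NLP datasets contain Wikipedia data, so the authors seem to have been worried that data leakage at evaluation time would make a fair assessment difficult.)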
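
A rough sketch of the two cleaning steps mentioned above, de-duplication and dropping Wikipedia. The paper does not say how de-duplication was actually done, so the exact-match hashing here is my assumption, as are the function names and sample URLs.

```python
import hashlib
from urllib.parse import urlparse


def is_wikipedia(url):
    """True if the URL's host is wikipedia.org or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return host == "wikipedia.org" or host.endswith(".wikipedia.org")


def clean_corpus(documents):
    """documents: iterable of (url, text). Returns de-duplicated non-Wikipedia texts."""
    seen = set()
    kept = []
    for url, text in documents:
        if is_wikipedia(url):
            continue  # avoid overlap with Wikipedia-based benchmarks
        # Assumed strategy: exact-match de-duplication on whitespace-stripped text.
        digest = hashlib.sha256(text.strip().encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(text)
    return kept


docs = [
    ("https://en.wikipedia.org/wiki/Language_model", "wiki text"),
    ("https://example.com/a", "hello"),
    ("https://example.org/b", "hello "),
    ("https://example.com/c", "world"),
]
print(clean_corpus(docs))
```

A real pipeline would likely need near-duplicate detection as well, but even this version shows why the Wikipedia filter is keyed on the URL rather than on the text.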

Examples showing that French-English translation appears naturally, in context, in the WebText training set

2.2. Input Representation

A general language model (LM) should be able to compute the probability of (and also generate) any string. Current large scale LMs include pre-processing steps such as lowercasing, tokenization, and out-of-vocabulary tokens which restrict the space of model-able strings.

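An LM should be able to compute the probability of (and generate) any string.

Large-scale LMs so far apply pre-processing such as lowercasing, tokenization, and out-of-vocabulary (OOV) tokens, which restricts the set of strings the model can represent.

* Out-of-vocabulary (OOV) token: a token that is not in the LM's vocabulary (one that never appeared in the training set)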
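
Here is a small sketch of why lowercasing and OOV tokens restrict what a model can represent. The three-word vocabulary and the function name are made up for illustration.

```python
def encode_word_level(text, vocab, unk="<UNK>"):
    """Lowercase, split on whitespace, and map unknown words to the OOV token."""
    tokens = text.lower().split()
    return [tok if tok in vocab else unk for tok in tokens]


vocab = {"the", "dog", "runs"}  # toy vocabulary
encoded = encode_word_level("The Dog zooms", vocab)
print(encoded)
```

After encoding, the original casing is gone and "zooms" has collapsed into `<UNK>`, so the model can never assign a probability to the exact original string.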

Byte Pair Encoding (BPE) (Sennrich et al., 2015) is a practical middle ground between character and word level language modeling which effectively interpolates between word level inputs for frequent symbol sequences and character level inputs for infrequent symbol sequences.

However, directly applying BPE to the byte sequence results in suboptimal merges due to BPE using a greedy frequency based heuristic for building the token vocabulary. We observed BPE including many versions of common words like dog since they occur in many variations such as dog. dog! dog? . This results in a sub-optimal allocation of limited vocabulary slots and model capacity.

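Byte Pair Encoding (BPE) is a middle ground between character-level and word-level modeling. (BERT uses WordPiece, a closely related subword method.)

Frequent symbol sequences are merged into word-like units, while infrequent sequences stay at the character level, so it effectively interpolates between the two.

(Interpolation creates in-between values, so I read this as a compromise between the two levels: a frequent word like "cannot" ends up as a single token, while a rare, meaningless string like "fjdfljdsl" gets split into characters such as "f", "j", ...)

However, applying BPE directly to the byte sequence has a problem. Because merges are chosen greedily by frequency, the vocabulary fills up with variants of common words such as dog. dog! dog?, which wastes limited vocabulary slots and model capacity.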

To avoid this, we prevent BPE from merging across character categories for any byte sequence.

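To avoid this, GPT-2 keeps BPE operating on bytes but groups them by character category and prevents merges across categories.

"letter + letter => merge allowed"

"letter + punctuation => not allowed, so tokens like dog! and dog? are never created"

The paper also adds an exception for spaces, which it reports significantly improves compression efficiency while only minimally fragmenting words across multiple tokens.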
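
To see the category rule in action, here is a toy byte-level BPE trainer. It is a simplified sketch, not GPT-2's actual tokenizer: the ASCII-only pattern in `CHUNK_PATTERN` is my own approximation of "split by character category, allowing a leading space", and the merge loop is the textbook greedy algorithm.

```python
import re
from collections import Counter

# Assumed, ASCII-only approximation: letters, digits, and other symbols form
# separate chunks, and a chunk may keep one leading space.
CHUNK_PATTERN = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")


def split_by_category(text):
    """Split text so that no chunk mixes character categories."""
    return CHUNK_PATTERN.findall(text)


def learn_merges(text, num_merges):
    """Greedy BPE over bytes; merges never cross chunk boundaries."""
    chunks = Counter(
        tuple(bytes([b]) for b in chunk.encode("utf-8"))
        for chunk in split_by_category(text)
    )
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in chunks.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every chunk is already a single token
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_chunks = Counter()
        for symbols, freq in chunks.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_chunks[tuple(merged)] += freq
        chunks = new_chunks
    return merges


pieces = split_by_category("dog. dog! dog?")
merges = learn_merges("dog. dog! dog? dog", num_merges=10)
print(pieces)
print(merges)
```

Because "dog" and the punctuation that follows it land in different chunks, the trainer can learn `dog` and ` dog` but never `dog!` or `dog?`, so those vocabulary slots stay free for something more useful.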

This input representation allows us to combine the empirical benefits of word-level LMs with the generality of byte-level approaches. Since our approach can assign a probability to any Unicode string, this allows us to evaluate our LMs on any dataset regardless of pre-processing, tokenization, or vocab size.

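This input representation combines the empirical benefits of word-level LMs with the generality of byte-level approaches.

With it, a probability can be assigned to any *Unicode string (any string that can be represented).

As a result, GPT-2 can be evaluated on any dataset regardless of pre-processing, tokenization, or vocabulary size.

* Unicode is a character encoding standard that unifies the world's characters and symbols into a single system, so covering every Unicode string means covering essentially any text.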
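
A quick check of why a byte-level base vocabulary never needs an OOV token: any Unicode string becomes a sequence of UTF-8 bytes, each in the range 0 to 255. The helper names and sample strings are just for illustration.

```python
def to_byte_ids(text):
    """Encode any Unicode string as a list of byte values (0..255)."""
    return list(text.encode("utf-8"))


def from_byte_ids(ids):
    """Invert to_byte_ids; the round trip is lossless."""
    return bytes(ids).decode("utf-8")


samples = ["dog!", "안녕하세요", "naïve café", "🙂"]
for s in samples:
    ids = to_byte_ids(s)
    print(repr(s), len(ids), "bytes")
```

So 256 base symbols are enough to cover every language and emoji, and BPE merges on top of them only make common sequences shorter.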

2.3. Model

We use a Transformer (Vaswani et al., 2017) based architecture for our LMs. The model largely follows the details of the OpenAI GPT model (Radford et al., 2018) with a few modifications. Layer normalization (Ba et al., 2016) was moved to the input of each sub-block, similar to a pre-activation residual network (He et al., 2016) and an additional layer normalization was added after the final self-attention block. A modified initialization which accounts for the accumulation on the residual path with model depth is used. We scale the weights of residual layers at initialization by a factor of 1/√N where N is the number of residual layers. The vocabulary is expanded to 50,257. We also increase the context size from 512 to 1024 tokens and a larger batchsize of 512 is used.

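Like GPT-1, it is Transformer-based. (GPT-1's defining feature is the Transformer decoder-only structure.)

There are a few changes from GPT-1.

1. GPT-1 applied LayerNorm at the output of each sub-block; GPT-2 moves it to the input of each sub-block.

2. An additional LayerNorm is added after the final self-attention block.

3. The weights of residual layers are scaled at initialization by 1/√N (N is the number of residual layers), to account for accumulation along the residual path as depth grows.

4. The vocabulary is expanded to 50,257.

5. The context window is increased from 512 to 1024 tokens, and a larger batch size of 512 is used.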
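
Changes 1 and 3 are easier to see in code. This is a pure-Python sketch under simplifying assumptions: `layer_norm` has no learnable gain or bias, `sublayer` stands in for attention or the MLP, and all function names are mine rather than from any GPT-2 codebase.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned parameters)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def post_ln_block(x, sublayer):
    """GPT-1 style: normalize after adding the residual."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def pre_ln_block(x, sublayer):
    """GPT-2 style: normalize the sub-block input, leave the residual path untouched."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


def residual_init_scale(num_residual_layers):
    """Factor applied to residual-layer weights at initialization: 1/sqrt(N)."""
    return 1.0 / math.sqrt(num_residual_layers)


def zero_sublayer(x):
    return [0.0] * len(x)


x = [1.0, 2.0, 3.0]
print(pre_ln_block(x, zero_sublayer))   # input passes through unchanged
print(post_ln_block(x, zero_sublayer))  # input is re-normalized
print(residual_init_scale(48))
```

With pre-LN the residual path is a clean identity, so contributions from many layers simply add up, which is presumably why the paper pairs it with the 1/√N scaling at initialization.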

Hyperparameters of the four GPT-2 models: the number of layers and the model dimension of the decoder-only Transformer.
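
As a sanity check on the table, parameter counts can be roughly estimated from the layer count and model dimension. The (layers, d_model) pairs below are the ones listed in the paper's Table 2; the `12 * n_layer * d_model^2` rule of thumb (which assumes an MLP hidden size of 4 x d_model and ignores biases and LayerNorm) and the function name are my own assumptions.

```python
def estimate_params(n_layer, d_model, vocab_size=50257, n_ctx=1024):
    """Rough decoder-only Transformer parameter count."""
    blocks = 12 * n_layer * d_model ** 2         # attention (4 d^2) + MLP (8 d^2) per layer
    embeddings = (vocab_size + n_ctx) * d_model  # token + position embeddings
    return blocks + embeddings


# Model label in the paper -> (layers, d_model)
configs = {
    "117M": (12, 768),
    "345M": (24, 1024),
    "762M": (36, 1280),
    "1542M": (48, 1600),
}
for label, (n_layer, d_model) in configs.items():
    print(label, f"~{estimate_params(n_layer, d_model) / 1e6:.0f}M estimated")
```

The estimates land in the same ballpark as the labels but do not match them exactly, so treat this as an order-of-magnitude check rather than a reproduction of the reported numbers.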

3. Experiments

We trained and benchmarked four LMs with approximately log-uniformly spaced sizes. The architectures are summarized in Table 2. The smallest model is equivalent to the original GPT, and the second smallest equivalent to the largest model from BERT (Devlin et al., 2018). Our largest model, which we call GPT-2, has over an order of magnitude more parameters than GPT. The learning rate of each model was manually tuned for the best perplexity on a 5% held-out sample of WebText. All models still underfit WebText and held-out perplexity has as of yet improved given more training time.

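They trained and benchmarked the four models in Table 2 above (117M, 345M, 762M, 1542M).

Notably, the smallest model (117M) is equivalent in size to GPT-1,

and the second smallest (345M) is the same size as the largest BERT model.

(These days people try to shrink models instead, but back then the race to scale up was fierce..)

For each model, the learning rate was tuned by hand for the best (lowest) *perplexity,

measured on a held-out sample of 5% of WebText.

*Perplexity: a metric of how well the model predicts the next token.

It is e raised to the cross-entropy (measured in nats).

Exponentiating the average negative log-likelihood turns it into an effective number of equally likely choices, which is easier to interpret.

Lower is better.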
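
The definition fits in a few lines. The probabilities passed in below are made-up values for illustration, not model outputs.

```python
import math


def perplexity(token_probs):
    """exp of the average negative log-likelihood of the correct tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


# A model that always gives the correct token probability 1/4 is as uncertain
# as a uniform choice among 4 options.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
# More confident correct predictions give lower perplexity.
print(perplexity([0.9, 0.8, 0.95]))
```

Reading perplexity as "the number of equally likely options the model is choosing between" is what makes it more intuitive than raw cross-entropy.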

3.1. Language Modeling

As an initial step towards zero-shot task transfer, we are interested in understanding how WebText LM’s perform at zero-shot domain transfer on the primary task they are trained for – language modeling.

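As a first step toward zero-shot task transfer, the authors want to understand how the LMs trained on WebText (WebText LMs) perform at zero-shot domain transfer on the task they were actually trained for, language modeling.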

Since our model operates on a byte level and does not require lossy pre-processing or tokenization, we can evaluate it on any language model benchmark.

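Because GPT-2 operates at the byte level and needs no lossy pre-processing or tokenization of the input,

it can be evaluated on any LM benchmark (a quantitative evaluation on a standard dataset).

(A model tied to a particular word-level tokenization and pre-processing has to be evaluated on text prepared the same way, so some benchmarks cannot be evaluated at all.

WebText LMs do use a BPE tokenizer, but it is built on bytes, so any benchmark text can be encoded and scored as-is.

And as explained in 2.2, byte-level processing means every character in any dataset, whatever the language and even emoji, can be represented as a sequence.)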
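
Since the model can score any string, comparing against a benchmark comes down to bookkeeping: the paper describes computing the log-probability of the dataset under the LM and dividing by the number of canonical units (characters, bytes, or words). A minimal sketch with made-up numbers and my own function names:

```python
import math


def perplexity_per_unit(total_log_prob, num_units):
    """Perplexity per canonical unit, given the total natural-log probability."""
    return math.exp(-total_log_prob / num_units)


def bits_per_unit(total_log_prob, num_units):
    """Same quantity in bits per unit (e.g. bits per byte or per character)."""
    return -total_log_prob / (num_units * math.log(2))


# Hypothetical: the LM assigns total log-probability -30 nats to a text
# that the benchmark counts as 10 words and 50 bytes.
total_log_prob = -30.0
print(perplexity_per_unit(total_log_prob, num_units=10))  # per-word perplexity
print(bits_per_unit(total_log_prob, num_units=50))        # bits per byte
```

The model's own token boundaries never appear in the final number, which is what makes results comparable with models that use a different tokenization.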

For many of these datasets, WebText LMs would be tested significantly out-of-distribution, having to predict aggressively standardized text, tokenization artifacts such as disconnected punctuation and contractions, shuffled sentences, and even the string <UNK> which is extremely rare in WebText - occurring only 26 times in 40 billion bytes.

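On the other hand, for many of these datasets the WebText LMs are tested significantly out-of-distribution.

They have to predict things unlike the WebText training data: aggressively standardized text, tokenization artifacts such as disconnected punctuation and contractions, shuffled sentences, and even the string <UNK>, which appears only 26 times in 40 billion bytes of WebText.

(It has to predict data it has barely seen. So isn't that all the more impressive..? is the emphasis.)

Table 3. GPT-2's zero-shot results on various benchmark datasets, obtained without any training or fine-tuning on those datasets.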

We observe gains of 2.5 to 5 perplexity for GPT-2 with these de-tokenizers.

WebText LMs transfer well across domains and datasets, improving the state of the art on 7 out of the 8 datasets in a zero-shot setting.

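Using de-tokenizers improved GPT-2's perplexity by 2.5 to 5. (It was trained on naturally written text, so undoing tokenization artifacts helps a lot.)

The WebText LMs transfer well across domains and datasets, setting a new zero-shot SOTA on 7 of the 8 benchmark datasets.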
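
To get a feel for what a de-tokenizer does, here is a tiny sketch that undoes two common artifacts, detached contractions and detached punctuation. The rules are my own simplified guesses; the paper uses invertible de-tokenizers whose exact rules it does not list.

```python
import re


def detokenize(text):
    """Re-attach split contractions and punctuation (simplified, assumed rules)."""
    text = re.sub(r" (n't|'s|'re|'ve|'m|'ll|'d)\b", r"\1", text)
    text = re.sub(r" ([.,!?;:])", r"\1", text)
    return text


print(detokenize("he did n't go , really ."))
print(detokenize("it 's fine !"))
```

Text like "did n't" is rare on the web, so restoring "didn't" moves the benchmark closer to what the model saw in training.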

Our model is still significantly worse than prior work on the One Billion Word Benchmark (Chelba et al., 2013). This is likely due to a combination of it being both the largest dataset and having some of the most destructive pre-processing - 1BW’s sentence level shuffling removes all long-range structure

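Unfortunately, on one dataset (the One Billion Word benchmark, 1BW, listed last in Table 3) it performed significantly worse than prior work.

The likely reason is a combination of 1BW being the largest dataset and having some of the most destructive pre-processing: its sentence-level shuffling removes all long-range structure.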

4. Discussion

Much research has been dedicated to learning (Hill et al., 2016), understanding (Levy & Goldberg, 2014), and critically evaluating (Wieting & Kiela, 2019) the representations of both supervised and unsupervised pre-training methods. Our results suggest that unsupervised task learning is an additional promising area of research to explore. These findings potentially help explain the widespread success of pre-training techniques for down-stream NLP tasks as we show that, in the limit, one of these pre-training techniques begins to learn to perform tasks directly without the need for supervised adaption or modification.

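A lot of research has gone into learning, understanding, and critically evaluating the representations of supervised and unsupervised pre-training methods,

and GPT-2's results suggest unsupervised task learning as an additional promising area to explore.

(Zero-shot means the task itself gets transferred through unsupervised task learning..)

These findings may help explain why pre-training works so well for downstream NLP tasks: in the limit, a pre-trained model begins to perform tasks directly, without supervised adaptation or modification.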

However, on other tasks such as summarization, while it is qualitatively performing the task, its performance is still only rudimentary according to quantitative metrics. While suggestive as a research result, in terms of practical applications, the zero-shot performance of GPT-2 is still far from use-able.

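However, on other tasks such as summarization, the model qualitatively performs the task but is still only rudimentary by quantitative metrics.

The authors say GPT-2's zero-shot performance is still far from usable in practical applications.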

There are undoubtedly many practical tasks where the performance of GPT-2 is still no better than random.

They add that there are undoubtedly many practical tasks where GPT-2 is still no better than random.

Even on common tasks that we evaluated on, such as question answering and translation, language models only begin to outperform trivial baselines when they have sufficient capacity.

Even on common tasks such as question answering and translation, language models only start to beat trivial baselines once they have sufficient capacity.

Given the prior success of fine-tuning GPT, we plan to investigate fine-tuning on benchmarks such as decaNLP and GLUE, especially since it is unclear whether the additional training data and capacity of GPT-2 is sufficient to overcome the inefficiencies of uni-directional representations demonstrated by BERT (Devlin et al., 2018).

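So, building on the earlier success of fine-tuning GPT, they plan to investigate fine-tuning on benchmarks such as decaNLP and GLUE (benchmark suites covering multiple NLP tasks). (In other words, they intend to get good results there.)

That is because it is still unclear whether GPT-2's additional training data and capacity are enough to overcome the inefficiency of uni-directional representations that BERT demonstrated.

=> They'll overcome it through more research.. (and by now it has been overcome, and then some..)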

5. Conclusion

When a large language model is trained on a sufficiently large and diverse dataset it is able to perform well across many domains and datasets. GPT-2 zero-shots to state of the art performance on 7 out of 8 tested language modeling datasets. The diversity of tasks the model is able to perform in a zero-shot setting suggests that high-capacity models trained to maximize the likelihood of a sufficiently varied text corpus begin to learn how to perform a surprising amount of tasks without the need for explicit supervision

Conclusion.

Zero-shot rocks...
