Vocabulary Expansion - Deep Learning Language Models

Vocabulary and Embedding Size¶

BERT¶

from transformers import AutoTokenizer, AutoModel

def 임베딩관찰(model_id, texts):
    """Print the embedding layer, vocabulary size, and tokenization of `texts` for a BERT-style model."""
    print(model_id)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    print(model.embeddings)
    어휘수 = len(tokenizer)
    print(f"어휘수: {어휘수:,}")
    # The tokenizer vocabulary must have one entry per row of the word-embedding table.
    assert 어휘수 == model.embeddings.word_embeddings.num_embeddings
    print('특수 토큰:', tokenizer.special_tokens_map)
    encoded_texts = tokenizer(texts)
    for i, text in enumerate(texts):
        input_ids = encoded_texts['input_ids'][i]
        # Count the tokens that fell back to [UNK] (out of vocabulary).
        unknown_count = sum(1 for token_id in input_ids if token_id == tokenizer.unk_token_id)
        print(f"원본 텍스트: {text}")
        # Subtract 2 to exclude the [CLS] and [SEP] tokens that BERT adds.
        print(f"토큰화된 텍스트: {input_ids} (어휘밖단어: {unknown_count}/{len(input_ids) - 2})")
        print('/'.join(tokenizer.convert_ids_to_tokens(input_ids)))

texts = ['I ate an apple in the Apple Store.', '배 타고 배 멀미한다.']
임베딩관찰('google-bert/bert-base-uncased', texts)

google-bert/bert-base-uncased
BertEmbeddings(
  (word_embeddings): Embedding(30522, 768, padding_idx=0)
  (position_embeddings): Embedding(512, 768)
  (token_type_embeddings): Embedding(2, 768)
  (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
  (dropout): Dropout(p=0.1, inplace=False)
)
어휘수: 30,522
특수 토큰: {'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}
원본 텍스트: I ate an apple in the Apple Store.
토큰화된 텍스트: [101, 1045, 8823, 2019, 6207, 1999, 1996, 6207, 3573, 1012, 102] (어휘밖단어: 0/9)
[CLS]/i/ate/an/apple/in/the/apple/store/./[SEP]
원본 텍스트: 배 타고 배 멀미한다.
토큰화된 텍스트: [101, 1460, 30007, 1467, 30006, 29991, 30011, 1460, 30007, 1459, 30008, 30022, 29995, 30019, 30005, 30006, 30021, 29993, 30006, 1012, 102] (어휘밖단어: 0/19)
[CLS]/ᄇ/##ᅢ/ᄐ/##ᅡ/##ᄀ/##ᅩ/ᄇ/##ᅢ/ᄆ/##ᅥ/##ᆯ/##ᄆ/##ᅵ/##ᄒ/##ᅡ/##ᆫ/##ᄃ/##ᅡ/./[SEP]
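The jamo-level pieces above (ᄇ/##ᅢ rather than 배) are consistent with the accent-stripping step of the uncased BERT tokenizer, which applies Unicode NFD normalization and so splits each precomposed Hangul syllable into conjoining jamo. The sketch below uses only the standard library and illustrates that normalization step, not the tokenizer itself; `decompose` is a hypothetical helper name.

```python
import unicodedata

def decompose(text):
    """Return the NFD form of `text`; precomposed Hangul syllables become conjoining jamo."""
    return unicodedata.normalize("NFD", text)

for syllable in "배멀":
    jamo = decompose(syllable)
    # Print the syllable, how many code points it decomposes into, and their code points.
    print(syllable, len(jamo), [f"U+{ord(ch):04X}" for ch in jamo])

# Recomposing with NFC restores the original syllable.
print(unicodedata.normalize("NFC", decompose("배")) == "배")
```

A syllable without a final consonant (배) yields two jamo and one with a final consonant (멀) yields three, which matches the number of pieces shown for those syllables in the output above.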

texts = ['I ate an apple in the Apple Store.', '배 타고 배 멀미한다.']
임베딩관찰('google-bert/bert-base-cased', texts)

google-bert/bert-base-cased
BertEmbeddings(
  (word_embeddings): Embedding(28996, 768, padding_idx=0)
  (position_embeddings): Embedding(512, 768)
  (token_type_embeddings): Embedding(2, 768)
  (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
  (dropout): Dropout(p=0.1, inplace=False)
)
어휘수: 28,996
특수 토큰: {'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}
원본 텍스트: I ate an apple in the Apple Store.
토큰화된 텍스트: [101, 146, 8756, 1126, 12075, 1107, 1103, 7302, 10422, 119, 102] (어휘밖단어: 0/9)
[CLS]/I/ate/an/apple/in/the/Apple/Store/./[SEP]
원본 텍스트: 배 타고 배 멀미한다.
토큰화된 텍스트: [101, 100, 100, 100, 100, 119, 102] (어휘밖단어: 4/5)
[CLS]/[UNK]/[UNK]/[UNK]/[UNK]/./[SEP]
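Each Korean word above collapses to a single [UNK] because WordPiece segments a word greedily, longest match first, and gives up on the whole word as soon as one span has no matching piece. The following is a minimal sketch of that rule; `wordpiece` and `toy_vocab` are hypothetical and far smaller than a real BERT vocabulary.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation of one whitespace-delimited word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # one unmatched span makes the whole word unknown
        pieces.append(match)
        start = end
    return pieces

toy_vocab = {"app", "##le", "ate", "Store"}  # hypothetical vocabulary
print(wordpiece("apple", toy_vocab))
print(wordpiece("배", toy_vocab))
```

With this toy vocabulary "apple" splits into a word-initial piece plus a continuation piece, while a word whose characters are absent from the vocabulary becomes a single unknown token.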

texts = ['I ate an apple in the Apple Store.', '배 타고 배 멀미한다.']
임베딩관찰('google-bert/bert-base-multilingual-uncased', texts)

google-bert/bert-base-multilingual-uncased
BertEmbeddings(
  (word_embeddings): Embedding(105879, 768, padding_idx=0)
  (position_embeddings): Embedding(512, 768)
  (token_type_embeddings): Embedding(2, 768)
  (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
  (dropout): Dropout(p=0.1, inplace=False)
)
어휘수: 105,879
특수 토큰: {'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}
원본 텍스트: I ate an apple in the Apple Store.
토큰화된 텍스트: [101, 151, 12811, 10144, 17006, 10104, 10103, 17006, 12913, 119, 102] (어휘밖단어: 0/9)
[CLS]/i/ate/an/apple/in/the/apple/store/./[SEP]
원본 텍스트: 배 타고 배 멀미한다.
토큰화된 텍스트: [101, 1170, 26179, 1179, 67384, 1170, 26179, 1169, 84098, 22699, 14624, 119, 102] (어휘밖단어: 0/11)
[CLS]/ᄇ/##ᅢ/ᄐ/##ᅡ고/ᄇ/##ᅢ/ᄆ/##ᅥᆯ/##미/##한다/./[SEP]

texts = ['I ate an apple in the Apple Store.', '배 타고 배 멀미한다.']
임베딩관찰('google-bert/bert-base-multilingual-cased', texts)

google-bert/bert-base-multilingual-cased
BertEmbeddings(
  (word_embeddings): Embedding(119547, 768, padding_idx=0)
  (position_embeddings): Embedding(512, 768)
  (token_type_embeddings): Embedding(2, 768)
  (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
  (dropout): Dropout(p=0.1, inplace=False)
)
어휘수: 119,547
특수 토큰: {'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}
원본 텍스트: I ate an apple in the Apple Store.
토큰화된 텍스트: [101, 146, 10160, 10112, 10151, 72894, 10284, 10106, 10105, 17216, 21812, 119, 102] (어휘밖단어: 0/11)
[CLS]/I/at/##e/an/app/##le/in/the/Apple/Store/./[SEP]
원본 텍스트: 배 타고 배 멀미한다.
토큰화된 텍스트: [101, 9330, 9845, 11664, 9330, 9268, 22458, 14102, 119, 102] (어휘밖단어: 0/8)
[CLS]/배/타/##고/배/멀/##미/##한다/./[SEP]

GPT-2¶

model_id = 'gpt2'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
print(f'어휘수: {len(tokenizer):,}, 임베딩 크기: {model.wte.weight.shape}')
assert model.wte.weight.shape[0] == len(tokenizer)
print('특수 토큰:', tokenizer.special_tokens_map)

texts = ['I ate an apple in the Apple Store.', '배 타고 배 멀미한다.']
encoded_texts = tokenizer(texts)
for i, text in enumerate(texts):
    input_ids = encoded_texts['input_ids'][i]
    unknown_count = sum(1 for token_id in input_ids if token_id == tokenizer.unk_token_id)
    print(f"원본 텍스트: {text}")
    # GPT-2 adds no [CLS]/[SEP]-style tokens, so every id counts toward the total.
    print(f"토큰화된 텍스트: {input_ids} (어휘밖단어: {unknown_count}/{len(input_ids)})")
    print('/'.join(tokenizer.convert_ids_to_tokens(input_ids)))

어휘수: 50,257, 임베딩 크기: torch.Size([50257, 768])
특수 토큰: {'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'}
원본 텍스트: I ate an apple in the Apple Store.
토큰화된 텍스트: [40, 15063, 281, 17180, 287, 262, 4196, 9363, 13] (어휘밖단어: 0/9)
I/Ġate/Ġan/Ġapple/Ġin/Ġthe/ĠApple/ĠStore/.
원본 텍스트: 배 타고 배 멀미한다.
토큰화된 텍스트: [167, 108, 108, 220, 169, 225, 222, 166, 111, 254, 31619, 108, 108, 31619, 102, 222, 167, 107, 116, 47991, 250, 46695, 97, 13] (어휘밖단어: 0/24)
ë/°/°/Ġ/í/ĥ/Ģ/ê/³/ł/Ġë/°/°/Ġë/©/Ģ/ë/¯/¸/íķ/ľ/ëĭ/¤/.
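The odd-looking symbols above come from GPT-2's byte-level BPE: text is first encoded as UTF-8, and each byte is then shown as a printable stand-in character. The sketch below re-implements that byte-to-character table for illustration; it is modeled on the mapping published with GPT-2, and `to_symbols` is a hypothetical helper. It covers only the display mapping, not the BPE merges.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable character."""
    printable = (list(range(ord("!"), ord("~") + 1))
                 + list(range(ord("¡"), ord("¬") + 1))
                 + list(range(ord("®"), ord("ÿ") + 1)))
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Non-printable bytes (controls, space, ...) are moved to code points from 256 up.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}

BYTE_TO_CHAR = bytes_to_unicode()

def to_symbols(text):
    """Show the UTF-8 bytes of `text` as their stand-in characters."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))

print(list("배".encode("utf-8")))  # the three UTF-8 bytes of one Hangul syllable
print(to_symbols("배"), to_symbols(" 배"))
```

One Hangul syllable occupies three UTF-8 bytes, so a vocabulary with few Korean merges spends up to three tokens per syllable, which is why the Korean sentence needs far more tokens than the English one.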

KoGPT2¶

model_id = 'skt/kogpt2-base-v2'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
print(f'어휘수: {len(tokenizer):,}, 임베딩 크기: {model.wte.weight.shape}')
assert model.wte.weight.shape[0] == len(tokenizer) - 1  # the tokenizer reports one more entry than the embedding table: SKT KoGPT2 leaves the special token out of the embeddings
print('특수 토큰:', tokenizer.special_tokens_map)

texts = ['I ate an apple in the Apple Store.', '배 타고 배 멀미한다.']
encoded_texts = tokenizer(texts)
for i, text in enumerate(texts):
    input_ids = encoded_texts['input_ids'][i]
    unknown_count = sum(1 for token_id in input_ids if token_id == tokenizer.unk_token_id)
    print(f"원본 텍스트: {text}")
    # No [CLS]/[SEP]-style tokens are added here, so every id counts toward the total.
    print(f"토큰화된 텍스트: {input_ids} (어휘밖단어: {unknown_count}/{len(input_ids)})")
    print('/'.join(tokenizer.convert_ids_to_tokens(input_ids)))

어휘수: 51,201, 임베딩 크기: torch.Size([51200, 768])
특수 토큰: {'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'}
원본 텍스트: I ate an apple in the Apple Store.
토큰화된 텍스트: [10054, 739, 16642, 29145, 13612, 30258, 10397, 13799, 11370, 9772, 30258, 10397, 14227, 21663, 389] (어휘밖단어: 0/15)
▁I/▁/ate/▁an/▁a/pp/le/▁in/▁the/▁A/pp/le/▁St/ore/.
원본 텍스트: 배 타고 배 멀미한다.
토큰화된 텍스트: [9208, 12814, 9208, 11781, 17045, 9016] (어휘밖단어: 0/6)
▁배/▁타고/▁배/▁멀/미한/다.
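To compare the tokenizers on equal footing, the outputs above can be reduced to tokens per character for the Korean sentence. The counts below are copied by hand from the id lists printed above, with [CLS] and [SEP] excluded for the BERT models; `tokens_per_char` is a hypothetical name, and the script only rearranges numbers already shown.

```python
sentence = '배 타고 배 멀미한다.'

# Number of ids per model for the Korean sentence, excluding [CLS]/[SEP].
token_counts = {
    'bert-base-uncased': 19,
    'bert-base-cased': 5,               # four of these five are [UNK]
    'bert-base-multilingual-uncased': 11,
    'bert-base-multilingual-cased': 8,
    'gpt2': 24,
    'skt/kogpt2-base-v2': 6,
}

tokens_per_char = {name: count / len(sentence) for name, count in token_counts.items()}
for name, ratio in sorted(tokens_per_char.items(), key=lambda item: item[1]):
    print(f"{name:35s} {ratio:.2f} tokens per character")
```

A low ratio is only meaningful when the tokens carry information: bert-base-cased scores well here solely because it replaces every Korean word with [UNK].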

Dataset¶

from datasets import load_dataset

dataset_id = 'traintogpb/aihub-koen-translation-integrated-small-100k'
한영쌍 = load_dataset(dataset_id)

Morphological Analysis¶

import sentencepiece as spm

with open('ko.txt', 'w', encoding='utf-8') as 파일:
    for 문장 in 한영쌍['train']['ko']:
        파일.write(문장 + '\n')
# Check the saved file
with open('ko.txt', 'r', encoding='utf-8') as 파일:
    for 줄, 문장 in zip(range(5), 파일):
        print(f'[{줄}]', 문장.strip())

# Train a SentencePiece model (UNIGRAM by default) on the Korean sentences
spm.SentencePieceTrainer.train(
    input='ko.txt',
    model_prefix='spm_ko',
    vocab_size=40000,)

[0] 지방세법 개정안에 따른 종합부동산세는 부과징수권자의 경우를 제외하고는 대부분 종전 종 합부동산세와 동일하다.
[1] 실신 당시 5초 정도 의식소실이 있었고 바로 의식은 회복되으며 신경학적 이상증상은 나타나지 않았다고 한다.
[2] 문 대통령의 10월 방일이 성사될 경우 한·일 양국은 일본군 위안부 합의 이행 문제와 독도 문제 등으로 삐걱대는 양국 관계를 개선할 계기를 만들 것이라고 신문은 전망했다.
[3] 인덱스 펀드에 대한 투자에서는 장기적인 관점에서 수익을 얻을 수 있습니다.
[4] 돌봐야 하는 비용도 있지.

sentencepiece_trainer.cc(78) LOG(INFO) Starts training with : 
trainer_spec {
  input: ko.txt
  input_format: 
  model_prefix: spm_ko
  model_type: UNIGRAM
  vocab_size: 40000
  self_test_sample_size: 0
  character_coverage: 0.9995
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  pretokenization_delimiter: 
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  seed_sentencepieces_file: 
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 0
  bos_id: 1
  eos_id: 2
  pad_id: -1
  unk_piece: <unk>
  bos_piece: <s>
  eos_piece: </s>
  pad_piece: <pad>
  unk_surface:  ⁇ 
  enable_differential_privacy: 0
  differential_privacy_noise_level: 0
  differential_privacy_clipping_threshold: 0
}
normalizer_spec {
  name: nmt_nfkc
  add_dummy_prefix: 1
  remove_extra_whitespaces: 1
  escape_whitespaces: 1
  normalization_rule_tsv: 
}
denormalizer_spec {}
trainer_interface.cc(353) LOG(INFO) SentenceIterator is not specified. Using MultiFileSentenceIterator.
trainer_interface.cc(185) LOG(INFO) Loading corpus: ko.txt
trainer_interface.cc(409) LOG(INFO) Loaded all 83332 sentences
trainer_interface.cc(425) LOG(INFO) Adding meta_piece: <unk>
trainer_interface.cc(425) LOG(INFO) Adding meta_piece: <s>
trainer_interface.cc(425) LOG(INFO) Adding meta_piece: </s>
trainer_interface.cc(430) LOG(INFO) Normalizing sentences...
trainer_interface.cc(539) LOG(INFO) all chars count=5197963
trainer_interface.cc(550) LOG(INFO) Done: 99.95% characters are covered.
trainer_interface.cc(560) LOG(INFO) Alphabet size=1365
trainer_interface.cc(561) LOG(INFO) Final character coverage=0.9995
trainer_interface.cc(592) LOG(INFO) Done! preprocessed 83332 sentences.
unigram_model_trainer.cc(265) LOG(INFO) Making suffix array...
unigram_model_trainer.cc(269) LOG(INFO) Extracting frequent sub strings... node_num=1817254
unigram_model_trainer.cc(312) LOG(INFO) Initialized 203677 seed sentencepieces
trainer_interface.cc(598) LOG(INFO) Tokenizing input sentences with whitespace: 83332
trainer_interface.cc(609) LOG(INFO) Done! 295960
unigram_model_trainer.cc(602) LOG(INFO) Using 295960 sentences for EM training
unigram_model_trainer.cc(618) LOG(INFO) EM sub_iter=0 size=110685 obj=14.9298 num_tokens=667928 num_tokens/piece=6.03449
unigram_model_trainer.cc(618) LOG(INFO) EM sub_iter=1 size=98304 obj=13.7462 num_tokens=670805 num_tokens/piece=6.82378
unigram_model_trainer.cc(618) LOG(INFO) EM sub_iter=0 size=73714 obj=13.7792 num_tokens=695971 num_tokens/piece=9.4415
unigram_model_trainer.cc(618) LOG(INFO) EM sub_iter=1 size=73616 obj=13.7319 num_tokens=696341 num_tokens/piece=9.4591
unigram_model_trainer.cc(618) LOG(INFO) EM sub_iter=0 size=55208 obj=13.9058 num_tokens=728603 num_tokens/piece=13.1974
unigram_model_trainer.cc(618) LOG(INFO) EM sub_iter=1 size=55202 obj=13.8628 num_tokens=728693 num_tokens/piece=13.2005
unigram_model_trainer.cc(618) LOG(INFO) EM sub_iter=0 size=44000 obj=14.0171 num_tokens=753397 num_tokens/piece=17.1227
unigram_model_trainer.cc(618) LOG(INFO) EM sub_iter=1 size=44000 obj=13.9812 num_tokens=753396 num_tokens/piece=17.1226
trainer_interface.cc(687) LOG(INFO) Saving model: spm_ko.model
trainer_interface.cc(699) LOG(INFO) Saving vocabs: spm_ko.vocab
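The `size=` values in the EM lines of the log shrink by roughly a quarter each round and then stop at 44000. That pattern is what one would expect if the unigram trainer prunes to `shrinking_factor` times the current size each round, but never below about 1.1 times the requested `vocab_size`, before trimming to the final vocabulary. The sketch below is an assumption inferred from the log, not taken from the SentencePiece source; `pruning_schedule` is a hypothetical function.

```python
def pruning_schedule(current_size, vocab_size, shrinking_factor=0.75):
    """Sizes after each pruning round under the assumed rule described above."""
    target = vocab_size * 11 // 10  # assumed intermediate target: 1.1 x vocab_size
    sizes = []
    while current_size > target:
        current_size = max(target, int(current_size * shrinking_factor))
        sizes.append(current_size)
    return sizes

# Start from the size reported after the first EM round in the log.
print(pruning_schedule(98304, 40000))
```

The log reports slightly smaller intermediate sizes (73714 and 55208), so the real trainer evidently drops a few more pieces per round than this simplified rule does.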

한국어_형태분석기 = spm.SentencePieceProcessor(model_file='spm_ko.model')

예문 = '한국에서는 한글을 사용합니다.'
한국어_형태분석기.encode_as_pieces(예문)

['▁한국에서', '는', '▁한글', '을', '▁사용', '합니다', '.']
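In the pieces above, "▁" marks where a whitespace-delimited word begins, which makes the segmentation reversible. The sketch below rebuilds the sentence from the pieces with plain string operations; `pieces_to_text` is a hypothetical helper that mirrors what the processor's own decoding would be expected to produce for this simple case.

```python
def pieces_to_text(pieces, marker="▁"):
    """Join the pieces, turn each word-start marker back into a space, and trim the leading one."""
    return "".join(pieces).replace(marker, " ").strip()

pieces = ['▁한국에서', '는', '▁한글', '을', '▁사용', '합니다', '.']
print(pieces_to_text(pieces))

# Pieces that start a word versus pieces that continue one
word_initial = [piece for piece in pieces if piece.startswith("▁")]
print(len(word_initial), "word-initial pieces out of", len(pieces))
```

Three of the seven pieces start a word; the other four are particles, endings, and punctuation attached to the preceding piece.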

Vocabulary Acquisition¶

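어휘목록 = []
with open("spm_ko.vocab", encoding="utf-8") as f:
    for line in f:
        # Each line of the .vocab file holds a piece and its score, separated by a tab
        token, _ = line.strip().split("\t")
        if not token.startswith("<"):  # skip special tokens such as <unk>, <s>, </s>
            어휘목록.append(token)

print(어휘목록[:20])  # inspect the top tokens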

['.', '▁', '의', '을', ',', '에', '를', '이', '는', '가', '은', '▁수', '▁있다', '과', '로', '한', '에서', '도', '으로', '와']

Pretrained Model¶

from transformers import GPT2LMHeadModel, PreTrainedTokenizerFast

model_id = 'skt/kogpt2-base-v2'
model = GPT2LMHeadModel.from_pretrained(model_id)
tokenizer = PreTrainedTokenizerFast.from_pretrained(
    model_id,
    # Define the special tokens
    unk_token='<unk>',  # out of vocabulary (OOV), i.e. "unknown"
    bos_token='<s>',  # beginning of sentence (BOS)
    eos_token='</s>',  # end of sentence (EOS)
    pad_token='<pad>')  # padding token (used to equalize sequence lengths)

loading file https://huggingface.co/skt/kogpt2-base-v2/resolve/main/added_tokens.json from cache at None
loading file https://huggingface.co/skt/kogpt2-base-v2/resolve/main/special_tokens_map.json from cache at None
loading file https://huggingface.co/skt/kogpt2-base-v2/resolve/main/tokenizer_config.json from cache at None
loading file https://huggingface.co/skt/kogpt2-base-v2/resolve/main/tokenizer.json from cache at /home/me/.cache/huggingface/transformers/fd8418e6675550cbca8ad6c102d717aa89372eb7a632ad3168300c7fed43491c.db074bfdd88bec54455de5ee2400efdbc64d4acf449a44d5f314e79c1eadc611
loading configuration file https://huggingface.co/skt/kogpt2-base-v2/resolve/main/config.json from cache at /home/me/.cache/huggingface/transformers/13bb826cf24517d7849a701e02452715a67c5e560142be3d4735442b2a545809.6b384eec6effdd44287f67715cd55bd0dff2cf846d843b932b43ba7b632b8b1e
Model config GPT2Config {
  "_num_labels": 1,
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "author": "Heewon Jeon(madjakarta@gmail.com)",
  "bos_token_id": 0,
  "created_date": "2021-04-28",
  "embd_pdrop": 0.1,
  "eos_token_id": 1,
  "gradient_checkpointing": false,
  "id2label": {
    "0": "LABEL_0"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0
  },
  "layer_norm_epsilon": 1e-05,
  "license": "CC-BY-NC-SA 4.0",
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "pad_token_id": 3,
  "resid_pdrop": 0.1,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.11.3",
  "use_cache": true,
  "vocab_size": 51200
}

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'GPT2Tokenizer'. 
The class this function is called from is 'PreTrainedTokenizerFast'.
loading weights file https://huggingface.co/skt/kogpt2-base-v2/resolve/main/pytorch_model.bin from cache at /home/me/.cache/huggingface/transformers/495b405e3742953dbcc56685d1560fa02a2d86fc50b891868990a4471b06c934.4ebf112d34c2c8fc657866680005d92d21859c52c0ef5e941fa640129b2f8f88
All model checkpoint weights were used when initializing GPT2LMHeadModel.

All the weights of GPT2LMHeadModel were initialized from the model checkpoint at skt/kogpt2-base-v2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use GPT2LMHeadModel for predictions without further training.

Adding Vocabulary¶

print(len(tokenizer))

기존어휘 = tokenizer.get_vocab()  # build the vocabulary dict once instead of once per candidate
tokens_to_add = [토큰 for 토큰 in 어휘목록 if 토큰 not in 기존어휘]
tokenizer.add_tokens(tokens_to_add)
print(len(tokenizer))
print("추가된 토큰 수:", len(tokens_to_add))

51200
53603
추가된 토큰 수: 2403

tokens_to_add[:10]

['이다', '했다', '▁있습니다', '▁것이다', '된다', '하였다', '▁AAA', '입니다', '합니다', '▁저희']
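Several of the tokens listed above still carry the SentencePiece word-start marker "▁". As an assumption about how added tokens behave: they are matched against the input as literal strings, so a piece such as '▁있습니다' would only match text that contains the "▁" character itself. If that holds for this tokenizer, one possible preprocessing step is to strip the marker before calling `add_tokens`. The helper below is a hypothetical sketch of that step using plain Python; it gives up the distinction between word-initial and word-internal pieces.

```python
def strip_word_marker(tokens, marker="▁"):
    """Drop a leading marker, skip pieces that become empty, and de-duplicate in order."""
    seen, cleaned = set(), []
    for token in tokens:
        surface = token[len(marker):] if token.startswith(marker) else token
        if surface and surface not in seen:
            seen.add(surface)
            cleaned.append(surface)
    return cleaned

# Hypothetical sample in the same shape as the list above
sample = ['이다', '했다', '▁있습니다', '▁것이다', '된다', '▁', '있습니다']
print(strip_word_marker(sample))
```

Whether this is desirable depends on the tokenizer: a marker-free token can also match in the middle of a word, so the effect on tokenization should be checked on real sentences before training.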

text = 'AAA는 BBB 서비스를 제공합니다.'
print("원본 텍스트:", text)
print("토큰화된 입력:", tokenizer.tokenize(text))
encoded_input = tokenizer.encode(text)
print("인코딩된 입력:", encoded_input)
inputs = tokenizer(text, truncation=True, padding="max_length", max_length=10)
print('PAD 토큰:', tokenizer.pad_token_id)
print("토큰화된 입력:", inputs['input_ids'])

원본 텍스트: AAA는 BBB 서비스를 제공합니다.
토큰화된 입력: ['AAA', '▁는', '▁', 'BBB', '▁서비스를', '▁제공', '합니다', '▁.']
인코딩된 입력: [53200, 33297, 739, 53262, 13796, 10312, 51208, 36510]
PAD 토큰: 3
토큰화된 입력: [53200, 33297, 739, 53262, 13796, 10312, 51208, 36510, 3, 3]
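The last line above shows the effect of `padding="max_length"`: the eight ids are extended to ten with the PAD id 3. The sketch below reproduces that behavior for a single sequence, assuming right-side padding and right-side truncation; `pad_to_length` is a hypothetical helper, not part of the tokenizer API.

```python
def pad_to_length(input_ids, max_length, pad_token_id):
    """Truncate to max_length, then right-pad and build the matching attention mask."""
    ids = list(input_ids[:max_length])
    padding = max_length - len(ids)
    return {
        "input_ids": ids + [pad_token_id] * padding,
        "attention_mask": [1] * len(ids) + [0] * padding,  # 0 marks padded positions
    }

encoded = [53200, 33297, 739, 53262, 13796, 10312, 51208, 36510]
padded = pad_to_length(encoded, max_length=10, pad_token_id=3)
print(padded["input_ids"])
print(padded["attention_mask"])
```

The attention mask is what lets the model, and later the loss, tell real tokens from padding.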

Fine-Tuning¶

Data Preparation¶

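def tokenize(example):
    tokens = tokenizer(example["ko"], truncation=True, padding="max_length", max_length=512)
    # Use the inputs themselves as the targets (causal language modeling),
    # but set padded positions to -100 so the loss ignores them.
    tokens["labels"] = [
        [token_id if mask == 1 else -100 for token_id, mask in zip(ids, masks)]
        for ids, masks in zip(tokens["input_ids"], tokens["attention_mask"])
    ]
    return tokens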

tokenized_dataset = 한영쌍.map(
    tokenize, batched=True, remove_columns=['ko', 'en', 'source'])

tokenized_dataset.set_format(
    type="torch",
    columns=["input_ids", "attention_mask", "labels"]
)
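The labels can be a copy of the inputs because a causal language model shifts them internally: the prediction made at position i is scored against the label at position i + 1, and labels equal to -100 (the default ignore index of PyTorch's cross-entropy loss) are skipped. The sketch below spells out which (context, target) pairs contribute to the loss for one short sequence; the ids and the helper `next_token_targets` are hypothetical.

```python
IGNORE_INDEX = -100

def next_token_targets(input_ids, labels):
    """Pair each prefix with the label of the following position, skipping ignored labels."""
    pairs = []
    for position in range(len(input_ids) - 1):
        target = labels[position + 1]
        if target != IGNORE_INDEX:
            pairs.append((input_ids[:position + 1], target))
    return pairs

input_ids = [9208, 12814, 9208, 3, 3]       # the last two ids are padding (PAD id 3)
labels = [9208, 12814, 9208, -100, -100]    # padded positions are masked out

for context, target in next_token_targets(input_ids, labels):
    print(context, '->', target)
```

Only the real tokens produce training signal here. With sequences padded to 512 positions, leaving the PAD ids in the labels would let padding dominate the loss.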

Training Configuration¶

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./kogpt2-extended",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    logging_steps=100,
    learning_rate=5e-5,
    warmup_steps=500,
    weight_decay=0.01,
    fp16=True,
    save_strategy="steps",
    evaluation_strategy='no'
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
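The batch settings above determine how many optimizer updates a run performs. The sketch below is a back-of-the-envelope calculation assuming a single device and assuming that leftover micro-batches at the end of an epoch do not count as an update; the exact rounding may differ between library versions, and `total_optimization_steps` is a hypothetical helper.

```python
import math

def total_optimization_steps(num_examples, per_device_batch_size,
                             gradient_accumulation_steps, epochs, num_devices=1):
    """Estimate the number of optimizer updates for a training run."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    updates_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
    return updates_per_epoch * epochs

# 83,332 training sentences, batch size 2, 8 accumulation steps
print(total_optimization_steps(83332, 2, 8, epochs=1))
print(total_optimization_steps(83332, 2, 8, epochs=3))
```

The effective batch size is 2 × 8 = 16 sequences per update. Under these assumptions, the 15624 total optimization steps reported in the training log correspond to three epochs.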

Training¶

from transformers import default_data_collator

# Grow the embedding table so that every newly added token id has a row
model.resize_token_embeddings(len(tokenizer))

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['validation'],
    tokenizer=tokenizer,
    data_collator=default_data_collator,
)
trainer.train()

Using amp fp16 backend
/home/me/.conda/envs/pytorch/lib/python3.10/site-packages/transformers/trainer.py:444: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
  self.scaler = torch.cuda.amp.GradScaler()
***** Running training *****
  Num examples = 83332
  Num Epochs = 3
  Instantaneous batch size per device = 2
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 8
  Total optimization steps = 15624
/home/me/.conda/envs/pytorch/lib/python3.10/site-packages/transformers/trainer.py:1846: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with autocast():

Saving model checkpoint to ./kogpt2-extended/checkpoint-500
Configuration saved in ./kogpt2-extended/checkpoint-500/config.json
Model weights saved in ./kogpt2-extended/checkpoint-500/pytorch_model.bin
tokenizer config file saved in ./kogpt2-extended/checkpoint-500/tokenizer_config.json
Special tokens file saved in ./kogpt2-extended/checkpoint-500/special_tokens_map.json
Saving model checkpoint to ./kogpt2-extended/checkpoint-1000
Configuration saved in ./kogpt2-extended/checkpoint-1000/config.json
Model weights saved in ./kogpt2-extended/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in ./kogpt2-extended/checkpoint-1000/tokenizer_config.json
Special tokens file saved in ./kogpt2-extended/checkpoint-1000/special_tokens_map.json
Saving model checkpoint to ./kogpt2-extended/checkpoint-1500
Configuration saved in ./kogpt2-extended/checkpoint-1500/config.json
Model weights saved in ./kogpt2-extended/checkpoint-1500/pytorch_model.bin
tokenizer config file saved in ./kogpt2-extended/checkpoint-1500/tokenizer_config.json
Special tokens file saved in ./kogpt2-extended/checkpoint-1500/special_tokens_map.json
Saving model checkpoint to ./kogpt2-extended/checkpoint-2000
Configuration saved in ./kogpt2-extended/checkpoint-2000/config.json
Model weights saved in ./kogpt2-extended/checkpoint-2000/pytorch_model.bin
tokenizer config file saved in ./kogpt2-extended/checkpoint-2000/tokenizer_config.json
Special tokens file saved in ./kogpt2-extended/checkpoint-2000/special_tokens_map.json
Saving model checkpoint to ./kogpt2-extended/checkpoint-2500
Configuration saved in ./kogpt2-extended/checkpoint-2500/config.json
Model weights saved in ./kogpt2-extended/checkpoint-2500/pytorch_model.bin
tokenizer config file saved in ./kogpt2-extended/checkpoint-2500/tokenizer_config.json
Special tokens file saved in ./kogpt2-extended/checkpoint-2500/special_tokens_map.json
Saving model checkpoint to ./kogpt2-extended/checkpoint-3000
Configuration saved in ./kogpt2-extended/checkpoint-3000/config.json
Model weights saved in ./kogpt2-extended/checkpoint-3000/pytorch_model.bin
tokenizer config file saved in ./kogpt2-extended/checkpoint-3000/tokenizer_config.json
Special tokens file saved in ./kogpt2-extended/checkpoint-3000/special_tokens_map.json
Saving model checkpoint to ./kogpt2-extended/checkpoint-3500
Configuration saved in ./kogpt2-extended/checkpoint-3500/config.json
Model weights saved in ./kogpt2-extended/checkpoint-3500/pytorch_model.bin
tokenizer config file saved in ./kogpt2-extended/checkpoint-3500/tokenizer_config.json
Special tokens file saved in ./kogpt2-extended/checkpoint-3500/special_tokens_map.json
Saving model checkpoint to ./kogpt2-extended/checkpoint-4000
Configuration saved in ./kogpt2-extended/checkpoint-4000/config.json
Model weights saved in ./kogpt2-extended/checkpoint-4000/pytorch_model.bin
tokenizer config file saved in ./kogpt2-extended/checkpoint-4000/tokenizer_config.json
Special tokens file saved in ./kogpt2-extended/checkpoint-4000/special_tokens_map.json
Saving model checkpoint to ./kogpt2-extended/checkpoint-4500
Configuration saved in ./kogpt2-extended/checkpoint-4500/config.json
Model weights saved in ./kogpt2-extended/checkpoint-4500/pytorch_model.bin
tokenizer config file saved in ./kogpt2-extended/checkpoint-4500/tokenizer_config.json
Special tokens file saved in ./kogpt2-extended/checkpoint-4500/special_tokens_map.json
Saving model checkpoint to ./kogpt2-extended/checkpoint-5000
Configuration saved in ./kogpt2-extended/checkpoint-5000/config.json
Model weights saved in ./kogpt2-extended/checkpoint-5000/pytorch_model.bin
tokenizer config file saved in ./kogpt2-extended/checkpoint-5000/tokenizer_config.json
Special tokens file saved in ./kogpt2-extended/checkpoint-5000/special_tokens_map.json
Saving model checkpoint to ./kogpt2-extended/checkpoint-5500
Configuration saved in ./kogpt2-extended/checkpoint-5500/config.json
Model weights saved in ./kogpt2-extended/checkpoint-5500/pytorch_model.bin
tokenizer config file saved in ./kogpt2-extended/checkpoint-5500/tokenizer_config.json
Special tokens file saved in ./kogpt2-extended/checkpoint-5500/special_tokens_map.json
Saving model checkpoint to ./kogpt2-extended/checkpoint-6000
Configuration saved in ./kogpt2-extended/checkpoint-6000/config.json
Model weights saved in ./kogpt2-extended/checkpoint-6000/pytorch_model.bin
tokenizer config file saved in ./kogpt2-extended/checkpoint-6000/tokenizer_config.json
Special tokens file saved in ./kogpt2-extended/checkpoint-6000/special_tokens_map.json
Saving model checkpoint to ./kogpt2-extended/checkpoint-6500
Configuration saved in ./kogpt2-extended/checkpoint-6500/config.json
Model weights saved in ./kogpt2-extended/checkpoint-6500/pytorch_model.bin
tokenizer config file saved in ./kogpt2-extended/checkpoint-6500/tokenizer_config.json
Special tokens file saved in ./kogpt2-extended/checkpoint-6500/special_tokens_map.json
Saving model checkpoint to ./kogpt2-extended/checkpoint-7000
Configuration saved in ./kogpt2-extended/checkpoint-7000/config.json
Model weights saved in ./kogpt2-extended/checkpoint-7000/pytorch_model.bin
tokenizer config file saved in ./kogpt2-extended/checkpoint-7000/tokenizer_config.json
Special tokens file saved in ./kogpt2-extended/checkpoint-7000/special_tokens_map.json
Saving model checkpoint to ./kogpt2-extended/checkpoint-7500
Configuration saved in ./kogpt2-extended/checkpoint-7500/config.json
Model weights saved in ./kogpt2-extended/checkpoint-7500/pytorch_model.bin
tokenizer config file saved in ./kogpt2-extended/checkpoint-7500/tokenizer_config.json
Special tokens file saved in ./kogpt2-extended/checkpoint-7500/special_tokens_map.json
Saving model checkpoint to ./kogpt2-extended/checkpoint-8000
Configuration saved in ./kogpt2-extended/checkpoint-8000/config.json
Model weights saved in ./kogpt2-extended/checkpoint-8000/pytorch_model.bin
tokenizer config file saved in ./kogpt2-extended/checkpoint-8000/tokenizer_config.json
Special tokens file saved in ./kogpt2-extended/checkpoint-8000/special_tokens_map.json
Saving model checkpoint to ./kogpt2-extended/checkpoint-8500
Configuration saved in ./kogpt2-extended/checkpoint-8500/config.json
Model weights saved in ./kogpt2-extended/checkpoint-8500/pytorch_model.bin
tokenizer config file saved in ./kogpt2-extended/checkpoint-8500/tokenizer_config.json
Special tokens file saved in ./kogpt2-extended/checkpoint-8500/special_tokens_map.json
Saving model checkpoint to ./kogpt2-extended/checkpoint-9000
Configuration saved in ./kogpt2-extended/checkpoint-9000/config.json
Model weights saved in ./kogpt2-extended/checkpoint-9000/pytorch_model.bin
tokenizer config file saved in ./kogpt2-extended/checkpoint-9000/tokenizer_config.json
Special tokens file saved in ./kogpt2-extended/checkpoint-9000/special_tokens_map.json
Saving model checkpoint to ./kogpt2-extended/checkpoint-9500
Configuration saved in ./kogpt2-extended/checkpoint-9500/config.json
Model weights saved in ./kogpt2-extended/checkpoint-9500/pytorch_model.bin
tokenizer config file saved in ./kogpt2-extended/checkpoint-9500/tokenizer_config.json
Special tokens file saved in ./kogpt2-extended/checkpoint-9500/special_tokens_map.json
Saving model checkpoint to ./kogpt2-extended/checkpoint-10000
Configuration saved in ./kogpt2-extended/checkpoint-10000/config.json
Model weights saved in ./kogpt2-extended/checkpoint-10000/pytorch_model.bin
tokenizer config file saved in ./kogpt2-extended/checkpoint-10000/tokenizer_config.json
Special tokens file saved in ./kogpt2-extended/checkpoint-10000/special_tokens_map.json
Saving model checkpoint to ./kogpt2-extended/checkpoint-10500
Configuration saved in ./kogpt2-extended/checkpoint-10500/config.json
Model weights saved in ./kogpt2-extended/checkpoint-10500/pytorch_model.bin
tokenizer config file saved in ./kogpt2-extended/checkpoint-10500/tokenizer_config.json
Special tokens file saved in ./kogpt2-extended/checkpoint-10500/special_tokens_map.json
Saving model checkpoint to ./kogpt2-extended/checkpoint-11000
Configuration saved in ./kogpt2-extended/checkpoint-11000/config.json
Model weights saved in ./kogpt2-extended/checkpoint-11000/pytorch_model.bin
tokenizer config file saved in ./kogpt2-extended/checkpoint-11000/tokenizer_config.json
Special tokens file saved in ./kogpt2-extended/checkpoint-11000/special_tokens_map.json
Saving model checkpoint to ./kogpt2-extended/checkpoint-11500
Configuration saved in ./kogpt2-extended/checkpoint-11500/config.json
Model weights saved in ./kogpt2-extended/checkpoint-11500/pytorch_model.bin
tokenizer config file saved in ./kogpt2-extended/checkpoint-11500/tokenizer_config.json
Special tokens file saved in ./kogpt2-extended/checkpoint-11500/special_tokens_map.json
Saving model checkpoint to ./kogpt2-extended/checkpoint-12000
Configuration saved in ./kogpt2-extended/checkpoint-12000/config.json
Model weights saved in ./kogpt2-extended/checkpoint-12000/pytorch_model.bin
tokenizer config file saved in ./kogpt2-extended/checkpoint-12000/tokenizer_config.json
Special tokens file saved in ./kogpt2-extended/checkpoint-12000/special_tokens_map.json
Saving model checkpoint to ./kogpt2-extended/checkpoint-12500
Configuration saved in ./kogpt2-extended/checkpoint-12500/config.json
Model weights saved in ./kogpt2-extended/checkpoint-12500/pytorch_model.bin
tokenizer config file saved in ./kogpt2-extended/checkpoint-12500/tokenizer_config.json
Special tokens file saved in ./kogpt2-extended/checkpoint-12500/special_tokens_map.json
Saving model checkpoint to ./kogpt2-extended/checkpoint-13000
Configuration saved in ./kogpt2-extended/checkpoint-13000/config.json
Model weights saved in ./kogpt2-extended/checkpoint-13000/pytorch_model.bin
tokenizer config file saved in ./kogpt2-extended/checkpoint-13000/tokenizer_config.json
Special tokens file saved in ./kogpt2-extended/checkpoint-13000/special_tokens_map.json
Saving model checkpoint to ./kogpt2-extended/checkpoint-13500
Configuration saved in ./kogpt2-extended/checkpoint-13500/config.json
Model weights saved in ./kogpt2-extended/checkpoint-13500/pytorch_model.bin
tokenizer config file saved in ./kogpt2-extended/checkpoint-13500/tokenizer_config.json
Special tokens file saved in ./kogpt2-extended/checkpoint-13500/special_tokens_map.json
Saving model checkpoint to ./kogpt2-extended/checkpoint-14000
Configuration saved in ./kogpt2-extended/checkpoint-14000/config.json
Model weights saved in ./kogpt2-extended/checkpoint-14000/pytorch_model.bin
tokenizer config file saved in ./kogpt2-extended/checkpoint-14000/tokenizer_config.json
Special tokens file saved in ./kogpt2-extended/checkpoint-14000/special_tokens_map.json
Saving model checkpoint to ./kogpt2-extended/checkpoint-14500
Configuration saved in ./kogpt2-extended/checkpoint-14500/config.json
Model weights saved in ./kogpt2-extended/checkpoint-14500/pytorch_model.bin
tokenizer config file saved in ./kogpt2-extended/checkpoint-14500/tokenizer_config.json
Special tokens file saved in ./kogpt2-extended/checkpoint-14500/special_tokens_map.json
Saving model checkpoint to ./kogpt2-extended/checkpoint-15000
Configuration saved in ./kogpt2-extended/checkpoint-15000/config.json
Model weights saved in ./kogpt2-extended/checkpoint-15000/pytorch_model.bin
tokenizer config file saved in ./kogpt2-extended/checkpoint-15000/tokenizer_config.json
Special tokens file saved in ./kogpt2-extended/checkpoint-15000/special_tokens_map.json
Saving model checkpoint to ./kogpt2-extended/checkpoint-15500
Configuration saved in ./kogpt2-extended/checkpoint-15500/config.json
Model weights saved in ./kogpt2-extended/checkpoint-15500/pytorch_model.bin
tokenizer config file saved in ./kogpt2-extended/checkpoint-15500/tokenizer_config.json
Special tokens file saved in ./kogpt2-extended/checkpoint-15500/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)

TrainOutput(global_step=15624, training_loss=0.185698296098421, metrics={'train_runtime': 7202.4335, 'train_samples_per_second': 34.71, 'train_steps_per_second': 2.169, 'total_flos': 6.5320917663744e+16, 'train_loss': 0.185698296098421, 'epoch': 3.0})
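The numbers in this TrainOutput can be cross-checked with a little arithmetic. The sketch below is only a rough sanity check: the 500-step save interval is an assumption inferred from the checkpoint directory names in the log, and the samples-per-step figure is an estimate derived from the rounded throughput metrics, not a value read from the training arguments.

```python
# Values copied from the TrainOutput line above.
global_step = 15624
epochs = 3.0
train_samples_per_second = 34.71
train_steps_per_second = 2.169

# Assumption: checkpoints are written every 500 steps, as the
# checkpoint-9000, checkpoint-9500, ... directory names suggest.
save_steps = 500

# Optimizer steps taken in one pass over the training data.
steps_per_epoch = int(global_step / epochs)

# Samples consumed per optimizer step (an estimate of the effective batch size).
samples_per_step = round(train_samples_per_second / train_steps_per_second)

# Upper bound on the training-set size (the last batch may be partial).
approx_samples_per_epoch = steps_per_epoch * samples_per_step

# Last step that has a checkpoint directory, and the steps trained after it.
last_checkpoint = (global_step // save_steps) * save_steps
unsaved_steps = global_step - last_checkpoint

print(f"steps per epoch:       {steps_per_epoch:,}")
print(f"samples per step:      {samples_per_step}")
print(f"samples per epoch (≈): {approx_samples_per_epoch:,}")
print(f"last checkpoint:       checkpoint-{last_checkpoint}")
print(f"steps after it:        {unsaved_steps}")
```

Under these assumptions the final checkpoint directory is checkpoint-15500, so the last 124 steps exist only in the in-memory model; the explicit save_pretrained call that follows is what writes them to disk.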

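# Save the final weights and the extended tokenizer to the same directory
# so that the model's vocabulary size and the tokenizer stay in sync.
model.save_pretrained("kogpt2-finetuned")
tokenizer.save_pretrained("kogpt2-finetuned")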

Configuration saved in kogpt2-finetuned/config.json
Model weights saved in kogpt2-finetuned/pytorch_model.bin
tokenizer config file saved in kogpt2-finetuned/tokenizer_config.json
Special tokens file saved in kogpt2-finetuned/special_tokens_map.json

('kogpt2-finetuned/tokenizer_config.json',
 'kogpt2-finetuned/special_tokens_map.json',
 'kogpt2-finetuned/tokenizer.json')

Application¶

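from transformers import pipeline

# Load the fine-tuned model from disk and reuse the in-memory extended tokenizer.
generator = pipeline("text-generation", model="kogpt2-finetuned", tokenizer=tokenizer)
# max_length is the total length in tokens, prompt included.
outputs = generator("AAA가 BBB에게", max_length=100)
print(outputs[0]['generated_text'])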

loading configuration file kogpt2-finetuned/config.json
Model config GPT2Config {
  "_name_or_path": "skt/kogpt2-base-v2",
  "_num_labels": 1,
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "author": "Heewon Jeon(madjakarta@gmail.com)",
  "bos_token_id": 0,
  "created_date": "2021-04-28",
  "embd_pdrop": 0.1,
  "eos_token_id": 1,
  "gradient_checkpointing": false,
  "id2label": {
    "0": "LABEL_0"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0
  },
  "layer_norm_epsilon": 1e-05,
  "license": "CC-BY-NC-SA 4.0",
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "pad_token_id": 3,
  "resid_pdrop": 0.1,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "torch_dtype": "float32",
  "transformers_version": "4.11.3",
  "use_cache": true,
  "vocab_size": 53603
}

loading weights file kogpt2-finetuned/pytorch_model.bin
All model checkpoint weights were used when initializing GPT2LMHeadModel.

All the weights of GPT2LMHeadModel were initialized from the model checkpoint at kogpt2-finetuned.
If your task is similar to the task the model of the checkpoint was trained on, you can already use GPT2LMHeadModel for predictions without further training.

AAA가 BBB에게 도달했네.