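Introduction to llama.cpp - Deep Learning Language Models

llama.cpp is an open-source project that aims to enable large language model (LLM) inference with minimal setup and state-of-the-art performance on a wide range of hardware. It supports efficient LLM inference both locally and in the cloud.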

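Key Features

Plain C/C++ implementation: written in plain C/C++ with no external dependencies, so it is easy to use on many platforms.
Apple Silicon optimization: optimized for Apple Silicon through ARM NEON, Accelerate, and the Metal framework.
x86 architecture support: uses the AVX, AVX2, AVX512, and AMX instruction sets to improve performance.
Quantization: integer quantization from 1.5-bit to 8-bit speeds up inference and reduces memory use.
GPU acceleration: CUDA kernels for NVIDIA GPUs, with support for AMD GPUs via HIP and Moore Threads MTT GPUs via MUSA.
Multiple backends: the Vulkan and SYCL backends allow it to run on a wide range of hardware.
CPU+GPU hybrid inference: models larger than the available VRAM can still be partially accelerated.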
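
To make the quantization item above concrete, here is a minimal Python sketch of an 8-bit block format in the style of Q8_0. It assumes the commonly described layout of 32 weights per block, stored as one fp16 scale plus 32 signed 8-bit integers (34 bytes, or 8.5 bits per weight). The function names are hypothetical and this is not llama.cpp's actual implementation. Under that assumption, the size helper reproduces the figures in the quantization log later in this document (1187.00 MiB -> 630.59 MiB for the 4096 x 151936 tensor).

```python
import struct

QK8_0 = 32               # assumed number of weights per block
BLOCK_BYTES = 2 + QK8_0  # one fp16 scale (2 bytes) + 32 int8 values


def quantize_q8_0_block(values):
    """Pack 32 floats into a 34-byte block: fp16 scale, then 32 int8 values."""
    assert len(values) == QK8_0
    amax = max(abs(v) for v in values)
    scale = amax / 127.0                  # map the largest magnitude to +/-127
    inv = 1.0 / scale if scale else 0.0   # an all-zero block keeps scale 0
    quants = [round(v * inv) for v in values]
    return struct.pack("<e32b", scale, *quants)


def dequantize_q8_0_block(block):
    """Recover approximate floats from a 34-byte block."""
    scale, *quants = struct.unpack("<e32b", block)
    return [scale * q for q in quants]


def f16_size_mib(n_weights):
    """Size in MiB of n_weights stored as 16-bit floats."""
    return n_weights * 2 / 2**20


def q8_0_size_mib(n_weights):
    """Size in MiB of n_weights stored as 34-byte blocks of 32 weights."""
    return n_weights // QK8_0 * BLOCK_BYTES / 2**20


if __name__ == "__main__":
    n = 4096 * 151936  # shape of output.weight in the log
    print(f"{f16_size_mib(n):.2f} MiB -> {q8_0_size_mib(n):.2f} MiB")
    # prints: 1187.00 MiB -> 630.59 MiB
```

The round trip is lossy: each restored weight can differ from the original by roughly half a quantization step, which is the price paid for the smaller file.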

In addition, the llama.cpp project serves as the main platform for developing new features for the GGML library.

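Installation and Usage

Clone the repository and build it. Recent versions build with CMake; the plain make build shown in older guides has been deprecated:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release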

Models Supported by llama.cpp

llama.cpp supports inference for many language models, and most fine-tuned versions of the base models are compatible as well.

Major open-source models

Creator	Model
Meta	LLaMA 🦙, LLaMA 2 🦙🦙, LLaMA 3 🦙🦙🦙
Microsoft	Phi-2, Phi-3 Mini, Phi-3 Small, Phi-3 Medium
Google	mT5, Gemma

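Adding a Model

To add a new model architecture to llama.cpp, follow these steps.

Convert the model to GGUF
- Convert the model to GGUF format with a Python conversion script that uses the gguf library.
- Depending on the model architecture, use either convert_hf_to_gguf.py or examples/convert_legacy_llama.py.
- This step translates the model configuration, tokenizer, tensor names, and data into GGUF metadata and tensors.
Define the model architecture in llama.cpp
- Define a new llm_arch and set the model's hyperparameters and tensor layout.
- If needed, specify the model's RoPE (Rotary Position Embedding) type.
Build the GGML graph implementation
- Implement the inference graph for the new architecture in the llama_build_graph function.
- Use existing implementations (build_llama, build_dbrx, build_bert, and so on) as references.
- Verify that the new architecture works correctly on the main GGML backends (CUDA, METAL, CPU).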
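
The tensor-name part of the first step can be pictured as a small lookup. The sketch below is illustrative only: the GGUF-side names are the ones that appear in the quantization log in this document, while the Hugging Face-side names are an assumption based on the usual Llama/Qwen-style checkpoint layout. The tables and the to_gguf_name helper are hypothetical, not the gguf library's own mapping.

```python
import re

# Hand-written subset for illustration; NOT the gguf library's mapping table.
# Left side: assumed Hugging Face module names. Right side: GGUF names as
# seen in the quantization log.
PER_LAYER = {
    "self_attn.q_proj": "attn_q",
    "self_attn.k_proj": "attn_k",
    "self_attn.v_proj": "attn_v",
    "self_attn.o_proj": "attn_output",
    "self_attn.q_norm": "attn_q_norm",
    "self_attn.k_norm": "attn_k_norm",
    "input_layernorm": "attn_norm",
    "post_attention_layernorm": "ffn_norm",
    "mlp.gate_proj": "ffn_gate",
    "mlp.up_proj": "ffn_up",
    "mlp.down_proj": "ffn_down",
}
GLOBAL = {
    "model.embed_tokens": "token_embd",
    "model.norm": "output_norm",
    "lm_head": "output",
}
LAYER_RE = re.compile(r"^model\.layers\.(\d+)\.(.+)$")


def to_gguf_name(hf_name):
    """Translate one checkpoint tensor name into its GGUF-style name."""
    base, _, suffix = hf_name.rpartition(".")  # suffix is e.g. "weight"
    if base in GLOBAL:
        return f"{GLOBAL[base]}.{suffix}"
    match = LAYER_RE.match(base)
    if match and match.group(2) in PER_LAYER:
        layer, module = match.groups()
        return f"blk.{layer}.{PER_LAYER[module]}.{suffix}"
    raise KeyError(f"no mapping for tensor {hf_name!r}")


if __name__ == "__main__":
    print(to_gguf_name("model.layers.3.self_attn.q_proj.weight"))
    # prints: blk.3.attn_q.weight
```

Raising on an unknown name, rather than passing it through, makes a missing mapping visible at conversion time instead of at load time.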

Once these steps are complete, you can submit a Pull Request (PR) to share your changes.
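
A converted file can be sanity-checked before it is loaded. The sketch below reads only the fixed-size start of a GGUF file, assuming the little-endian layout used from version 2 onward (4-byte magic GGUF, uint32 version, uint64 tensor count, uint64 metadata key-value count). read_gguf_header is a hypothetical helper, not part of the gguf library, and the demo uses a synthetic header rather than a real model file.

```python
import io
import struct


def read_gguf_header(stream):
    """Read the magic, version, tensor count, and KV count from a binary stream."""
    magic = stream.read(4)
    if magic != b"GGUF":
        raise ValueError(f"not a GGUF file (magic={magic!r})")
    (version,) = struct.unpack("<I", stream.read(4))
    # Assumption: version >= 2, where both counts are 64-bit unsigned integers.
    tensor_count, kv_count = struct.unpack("<QQ", stream.read(16))
    return {"version": version, "tensor_count": tensor_count, "kv_count": kv_count}


if __name__ == "__main__":
    # Synthetic header mirroring the numbers reported in the quantization log:
    # GGUF V3, 399 tensors, 27 key-value pairs.
    fake = io.BytesIO(b"GGUF" + struct.pack("<IQQ", 3, 399, 27))
    print(read_gguf_header(fake))
    # prints: {'version': 3, 'tensor_count': 399, 'kv_count': 27}
```

For a real file, the same function can be called on open(path, "rb"); the metadata and tensor descriptors that follow the header are not parsed here.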

Model Conversion and Quantization

Docker Image

docker pull ghcr.io/ggml-org/llama.cpp:full

Convert a Hugging Face model to GGUF, then quantize it to Q8_0 (bash syntax; in PowerShell, ${pwd} works in place of $(pwd)):

docker run -v "$(pwd)":/models ghcr.io/ggml-org/llama.cpp:full --convert /models/Qwen3-8B

docker run -v "$(pwd)":/models ghcr.io/ggml-org/llama.cpp:full --quantize /models/Qwen3-8B/Qwen3-8B-F16.gguf Q8_0

main: build = 6135 (25ff6f76)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: quantizing '/models/Qwen3-8B/Qwen3-8B-F16.gguf' to '/models/Qwen3-8B/ggml-model-Q8_0.gguf' as Q8_0
llama_model_loader: loaded meta data with 27 key-value pairs and 399 tensors from /models/Qwen3-8B/Qwen3-8B-F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3 8B
llama_model_loader: - kv   3:                           general.basename str              = Qwen3
llama_model_loader: - kv   4:                         general.size_label str              = 8B
llama_model_loader: - kv   5:                          qwen3.block_count u32              = 36
llama_model_loader: - kv   6:                       qwen3.context_length u32              = 40960
llama_model_loader: - kv   7:                     qwen3.embedding_length u32              = 4096
llama_model_loader: - kv   8:                  qwen3.feed_forward_length u32              = 12288
llama_model_loader: - kv   9:                 qwen3.attention.head_count u32              = 32
llama_model_loader: - kv  10:              qwen3.attention.head_count_kv u32              = 8
llama_model_loader: - kv  11:                       qwen3.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  12:     qwen3.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  13:                 qwen3.attention.key_length u32              = 128
llama_model_loader: - kv  14:               qwen3.attention.value_length u32              = 128
llama_model_loader: - kv  15:                          general.file_type u32              = 1
llama_model_loader: - kv  16:               general.quantization_version u32              = 2
llama_model_loader: - kv  17:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  18:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  19:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  20:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  21:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  22:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  23:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  24:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  25:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  26:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - type  f32:  145 tensors
llama_model_loader: - type  f16:  254 tensors
[   1/ 399]                        output.weight - [ 4096, 151936,     1,     1], type =    f16, converting to q8_0 .. size =  1187.00 MiB ->   630.59 MiB
[   2/ 399]                   output_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[   3/ 399]                    token_embd.weight - [ 4096, 151936,     1,     1], type =    f16, converting to q8_0 .. size =  1187.00 MiB ->   630.59 MiB
[   4/ 399]                  blk.0.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[   5/ 399]             blk.0.attn_k_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[   6/ 399]               blk.0.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[   7/ 399]             blk.0.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[   8/ 399]                  blk.0.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[   9/ 399]             blk.0.attn_q_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[  10/ 399]                  blk.0.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[  11/ 399]                blk.0.ffn_down.weight - [12288,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[  12/ 399]                blk.0.ffn_gate.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[  13/ 399]                blk.0.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  14/ 399]                  blk.0.ffn_up.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[  15/ 399]                  blk.1.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[  16/ 399]             blk.1.attn_k_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[  17/ 399]               blk.1.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  18/ 399]             blk.1.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[  19/ 399]                  blk.1.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[  20/ 399]             blk.1.attn_q_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[  21/ 399]                  blk.1.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[  22/ 399]                blk.1.ffn_down.weight - [12288,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[  23/ 399]                blk.1.ffn_gate.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[  24/ 399]                blk.1.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  25/ 399]                  blk.1.ffn_up.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[  26/ 399]                  blk.2.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[  27/ 399]             blk.2.attn_k_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[  28/ 399]               blk.2.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  29/ 399]             blk.2.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[  30/ 399]                  blk.2.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[  31/ 399]             blk.2.attn_q_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[  32/ 399]                  blk.2.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[  33/ 399]                blk.2.ffn_down.weight - [12288,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[  34/ 399]                blk.2.ffn_gate.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[  35/ 399]                blk.2.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  36/ 399]                  blk.2.ffn_up.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[  37/ 399]                  blk.3.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[  38/ 399]             blk.3.attn_k_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[  39/ 399]               blk.3.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  40/ 399]             blk.3.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[  41/ 399]                  blk.3.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[  42/ 399]             blk.3.attn_q_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[  43/ 399]                  blk.3.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[  44/ 399]                blk.3.ffn_down.weight - [12288,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[  45/ 399]                blk.3.ffn_gate.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[  46/ 399]                blk.3.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  47/ 399]                  blk.3.ffn_up.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[  48/ 399]                  blk.4.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[  49/ 399]             blk.4.attn_k_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[  50/ 399]               blk.4.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  51/ 399]             blk.4.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[  52/ 399]                  blk.4.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[  53/ 399]             blk.4.attn_q_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[  54/ 399]                  blk.4.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[  55/ 399]                blk.4.ffn_down.weight - [12288,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[  56/ 399]                blk.4.ffn_gate.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[  57/ 399]                blk.4.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  58/ 399]                  blk.4.ffn_up.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[  59/ 399]                  blk.5.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[  60/ 399]             blk.5.attn_k_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[  61/ 399]               blk.5.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  62/ 399]             blk.5.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[  63/ 399]                  blk.5.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[  64/ 399]             blk.5.attn_q_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[  65/ 399]                  blk.5.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[  66/ 399]                blk.5.ffn_down.weight - [12288,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[  67/ 399]                blk.5.ffn_gate.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[  68/ 399]                blk.5.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  69/ 399]                  blk.5.ffn_up.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[  70/ 399]                  blk.6.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[  71/ 399]             blk.6.attn_k_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[  72/ 399]               blk.6.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  73/ 399]             blk.6.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[  74/ 399]                  blk.6.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[  75/ 399]             blk.6.attn_q_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[  76/ 399]                  blk.6.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[  77/ 399]                blk.6.ffn_down.weight - [12288,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[  78/ 399]                blk.6.ffn_gate.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[  79/ 399]                blk.6.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  80/ 399]                  blk.6.ffn_up.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[  81/ 399]                  blk.7.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[  82/ 399]             blk.7.attn_k_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[  83/ 399]               blk.7.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  84/ 399]             blk.7.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[  85/ 399]                  blk.7.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[  86/ 399]             blk.7.attn_q_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[  87/ 399]                  blk.7.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[  88/ 399]                blk.7.ffn_down.weight - [12288,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[  89/ 399]                blk.7.ffn_gate.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[  90/ 399]                blk.7.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  91/ 399]                  blk.7.ffn_up.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[  92/ 399]                  blk.8.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[  93/ 399]             blk.8.attn_k_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[  94/ 399]               blk.8.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  95/ 399]             blk.8.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[  96/ 399]                  blk.8.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[  97/ 399]             blk.8.attn_q_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[  98/ 399]                  blk.8.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[  99/ 399]                blk.8.ffn_down.weight - [12288,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 100/ 399]                blk.8.ffn_gate.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 101/ 399]                blk.8.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 102/ 399]                  blk.8.ffn_up.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 103/ 399]                  blk.9.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[ 104/ 399]             blk.9.attn_k_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[ 105/ 399]               blk.9.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 106/ 399]             blk.9.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[ 107/ 399]                  blk.9.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[ 108/ 399]             blk.9.attn_q_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[ 109/ 399]                  blk.9.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[ 110/ 399]                blk.9.ffn_down.weight - [12288,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 111/ 399]                blk.9.ffn_gate.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 112/ 399]                blk.9.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 113/ 399]                  blk.9.ffn_up.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 114/ 399]                 blk.10.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[ 115/ 399]            blk.10.attn_k_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[ 116/ 399]              blk.10.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 117/ 399]            blk.10.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[ 118/ 399]                 blk.10.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[ 119/ 399]            blk.10.attn_q_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[ 120/ 399]                 blk.10.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[ 121/ 399]               blk.10.ffn_down.weight - [12288,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 122/ 399]               blk.10.ffn_gate.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 123/ 399]               blk.10.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 124/ 399]                 blk.10.ffn_up.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 125/ 399]                 blk.11.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[ 126/ 399]            blk.11.attn_k_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[ 127/ 399]              blk.11.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 128/ 399]            blk.11.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[ 129/ 399]                 blk.11.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[ 130/ 399]            blk.11.attn_q_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[ 131/ 399]                 blk.11.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[ 132/ 399]               blk.11.ffn_down.weight - [12288,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 133/ 399]               blk.11.ffn_gate.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 134/ 399]               blk.11.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 135/ 399]                 blk.11.ffn_up.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 136/ 399]                 blk.12.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[ 137/ 399]            blk.12.attn_k_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[ 138/ 399]              blk.12.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 139/ 399]            blk.12.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[ 140/ 399]                 blk.12.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[ 141/ 399]            blk.12.attn_q_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[ 142/ 399]                 blk.12.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[ 143/ 399]               blk.12.ffn_down.weight - [12288,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 144/ 399]               blk.12.ffn_gate.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 145/ 399]               blk.12.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 146/ 399]                 blk.12.ffn_up.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 147/ 399]                 blk.13.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[ 148/ 399]            blk.13.attn_k_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[ 149/ 399]              blk.13.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 150/ 399]            blk.13.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[ 151/ 399]                 blk.13.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[ 152/ 399]            blk.13.attn_q_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[ 153/ 399]                 blk.13.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[ 154/ 399]               blk.13.ffn_down.weight - [12288,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 155/ 399]               blk.13.ffn_gate.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 156/ 399]               blk.13.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 157/ 399]                 blk.13.ffn_up.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 158/ 399]                 blk.14.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[ 159/ 399]            blk.14.attn_k_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[ 160/ 399]              blk.14.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 161/ 399]            blk.14.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[ 162/ 399]                 blk.14.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[ 163/ 399]            blk.14.attn_q_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[ 164/ 399]                 blk.14.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[ 165/ 399]               blk.14.ffn_down.weight - [12288,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 166/ 399]               blk.14.ffn_gate.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 167/ 399]               blk.14.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 168/ 399]                 blk.14.ffn_up.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 169/ 399]                 blk.15.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[ 170/ 399]            blk.15.attn_k_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[ 171/ 399]              blk.15.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 172/ 399]            blk.15.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[ 173/ 399]                 blk.15.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[ 174/ 399]            blk.15.attn_q_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[ 175/ 399]                 blk.15.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[ 176/ 399]               blk.15.ffn_down.weight - [12288,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 177/ 399]               blk.15.ffn_gate.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 178/ 399]               blk.15.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 179/ 399]                 blk.15.ffn_up.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 180/ 399]                 blk.16.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[ 181/ 399]            blk.16.attn_k_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[ 182/ 399]              blk.16.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 183/ 399]            blk.16.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[ 184/ 399]                 blk.16.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[ 185/ 399]            blk.16.attn_q_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[ 186/ 399]                 blk.16.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[ 187/ 399]               blk.16.ffn_down.weight - [12288,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 188/ 399]               blk.16.ffn_gate.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 189/ 399]               blk.16.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 190/ 399]                 blk.16.ffn_up.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 191/ 399]                 blk.17.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[ 192/ 399]            blk.17.attn_k_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[ 193/ 399]              blk.17.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 194/ 399]            blk.17.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[ 195/ 399]                 blk.17.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[ 196/ 399]            blk.17.attn_q_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[ 197/ 399]                 blk.17.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[ 198/ 399]               blk.17.ffn_down.weight - [12288,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 199/ 399]               blk.17.ffn_gate.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 200/ 399]               blk.17.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 201/ 399]                 blk.17.ffn_up.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 202/ 399]                 blk.18.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[ 203/ 399]            blk.18.attn_k_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[ 204/ 399]              blk.18.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 205/ 399]            blk.18.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[ 206/ 399]                 blk.18.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[ 207/ 399]            blk.18.attn_q_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[ 208/ 399]                 blk.18.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[ 209/ 399]               blk.18.ffn_down.weight - [12288,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 210/ 399]               blk.18.ffn_gate.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 211/ 399]               blk.18.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 212/ 399]                 blk.18.ffn_up.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 213/ 399]                 blk.19.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[ 214/ 399]            blk.19.attn_k_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[ 215/ 399]              blk.19.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 216/ 399]            blk.19.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[ 217/ 399]                 blk.19.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[ 218/ 399]            blk.19.attn_q_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[ 219/ 399]                 blk.19.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[ 220/ 399]               blk.19.ffn_down.weight - [12288,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 221/ 399]               blk.19.ffn_gate.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 222/ 399]               blk.19.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 223/ 399]                 blk.19.ffn_up.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 224/ 399]                 blk.20.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[ 225/ 399]            blk.20.attn_k_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[ 226/ 399]              blk.20.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 227/ 399]            blk.20.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[ 228/ 399]                 blk.20.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[ 229/ 399]            blk.20.attn_q_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[ 230/ 399]                 blk.20.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[ 231/ 399]               blk.20.ffn_down.weight - [12288,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 232/ 399]               blk.20.ffn_gate.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 233/ 399]               blk.20.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 234/ 399]                 blk.20.ffn_up.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 235/ 399]                 blk.21.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[ 236/ 399]            blk.21.attn_k_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[ 237/ 399]              blk.21.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 238/ 399]            blk.21.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[ 239/ 399]                 blk.21.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[ 240/ 399]            blk.21.attn_q_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[ 241/ 399]                 blk.21.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[ 242/ 399]               blk.21.ffn_down.weight - [12288,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 243/ 399]               blk.21.ffn_gate.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 244/ 399]               blk.21.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 245/ 399]                 blk.21.ffn_up.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 246/ 399]                 blk.22.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[ 247/ 399]            blk.22.attn_k_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[ 248/ 399]              blk.22.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 249/ 399]            blk.22.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[ 250/ 399]                 blk.22.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[ 251/ 399]            blk.22.attn_q_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[ 252/ 399]                 blk.22.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[ 253/ 399]               blk.22.ffn_down.weight - [12288,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 254/ 399]               blk.22.ffn_gate.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 255/ 399]               blk.22.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 256/ 399]                 blk.22.ffn_up.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 257/ 399]                 blk.23.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[ 258/ 399]            blk.23.attn_k_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[ 259/ 399]              blk.23.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 260/ 399]            blk.23.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[ 261/ 399]                 blk.23.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[ 262/ 399]            blk.23.attn_q_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[ 263/ 399]                 blk.23.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[ 264/ 399]               blk.23.ffn_down.weight - [12288,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 265/ 399]               blk.23.ffn_gate.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 266/ 399]               blk.23.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 267/ 399]                 blk.23.ffn_up.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 268/ 399]                 blk.24.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[ 269/ 399]            blk.24.attn_k_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[ 270/ 399]              blk.24.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 271/ 399]            blk.24.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[ 272/ 399]                 blk.24.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[ 273/ 399]            blk.24.attn_q_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[ 274/ 399]                 blk.24.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[ 275/ 399]               blk.24.ffn_down.weight - [12288,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 276/ 399]               blk.24.ffn_gate.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 277/ 399]               blk.24.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 278/ 399]                 blk.24.ffn_up.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 279/ 399]                 blk.25.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[ 280/ 399]            blk.25.attn_k_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[ 281/ 399]              blk.25.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 282/ 399]            blk.25.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[ 283/ 399]                 blk.25.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[ 284/ 399]            blk.25.attn_q_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[ 285/ 399]                 blk.25.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[ 286/ 399]               blk.25.ffn_down.weight - [12288,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 287/ 399]               blk.25.ffn_gate.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 288/ 399]               blk.25.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 289/ 399]                 blk.25.ffn_up.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 290/ 399]                 blk.26.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[ 291/ 399]            blk.26.attn_k_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[ 292/ 399]              blk.26.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 293/ 399]            blk.26.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[ 294/ 399]                 blk.26.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[ 295/ 399]            blk.26.attn_q_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[ 296/ 399]                 blk.26.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[ 297/ 399]               blk.26.ffn_down.weight - [12288,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 298/ 399]               blk.26.ffn_gate.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 299/ 399]               blk.26.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 300/ 399]                 blk.26.ffn_up.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 301/ 399]                 blk.27.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[ 302/ 399]            blk.27.attn_k_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[ 303/ 399]              blk.27.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 304/ 399]            blk.27.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[ 305/ 399]                 blk.27.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[ 306/ 399]            blk.27.attn_q_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[ 307/ 399]                 blk.27.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[ 308/ 399]               blk.27.ffn_down.weight - [12288,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 309/ 399]               blk.27.ffn_gate.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 310/ 399]               blk.27.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 311/ 399]                 blk.27.ffn_up.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 312/ 399]                 blk.28.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[ 313/ 399]            blk.28.attn_k_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[ 314/ 399]              blk.28.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 315/ 399]            blk.28.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[ 316/ 399]                 blk.28.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[ 317/ 399]            blk.28.attn_q_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[ 318/ 399]                 blk.28.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[ 319/ 399]               blk.28.ffn_down.weight - [12288,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 320/ 399]               blk.28.ffn_gate.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 321/ 399]               blk.28.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 322/ 399]                 blk.28.ffn_up.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 323/ 399]                 blk.29.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[ 324/ 399]            blk.29.attn_k_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[ 325/ 399]              blk.29.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 326/ 399]            blk.29.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[ 327/ 399]                 blk.29.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[ 328/ 399]            blk.29.attn_q_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[ 329/ 399]                 blk.29.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[ 330/ 399]               blk.29.ffn_down.weight - [12288,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 331/ 399]               blk.29.ffn_gate.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 332/ 399]               blk.29.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 333/ 399]                 blk.29.ffn_up.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 334/ 399]                 blk.30.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[ 335/ 399]            blk.30.attn_k_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[ 336/ 399]              blk.30.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 337/ 399]            blk.30.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[ 338/ 399]                 blk.30.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[ 339/ 399]            blk.30.attn_q_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[ 340/ 399]                 blk.30.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[ 341/ 399]               blk.30.ffn_down.weight - [12288,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 342/ 399]               blk.30.ffn_gate.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 343/ 399]               blk.30.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 344/ 399]                 blk.30.ffn_up.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 345/ 399]                 blk.31.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[ 346/ 399]            blk.31.attn_k_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[ 347/ 399]              blk.31.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 348/ 399]            blk.31.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[ 349/ 399]                 blk.31.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[ 350/ 399]            blk.31.attn_q_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[ 351/ 399]                 blk.31.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[ 352/ 399]               blk.31.ffn_down.weight - [12288,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 353/ 399]               blk.31.ffn_gate.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 354/ 399]               blk.31.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 355/ 399]                 blk.31.ffn_up.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 356/ 399]                 blk.32.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[ 357/ 399]            blk.32.attn_k_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[ 358/ 399]              blk.32.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 359/ 399]            blk.32.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[ 360/ 399]                 blk.32.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[ 361/ 399]            blk.32.attn_q_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[ 362/ 399]                 blk.32.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[ 363/ 399]               blk.32.ffn_down.weight - [12288,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 364/ 399]               blk.32.ffn_gate.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 365/ 399]               blk.32.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 366/ 399]                 blk.32.ffn_up.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 367/ 399]                 blk.33.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[ 368/ 399]            blk.33.attn_k_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[ 369/ 399]              blk.33.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 370/ 399]            blk.33.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[ 371/ 399]                 blk.33.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[ 372/ 399]            blk.33.attn_q_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[ 373/ 399]                 blk.33.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[ 374/ 399]               blk.33.ffn_down.weight - [12288,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 375/ 399]               blk.33.ffn_gate.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 376/ 399]               blk.33.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 377/ 399]                 blk.33.ffn_up.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 378/ 399]                 blk.34.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[ 379/ 399]            blk.34.attn_k_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[ 380/ 399]              blk.34.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 381/ 399]            blk.34.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[ 382/ 399]                 blk.34.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[ 383/ 399]            blk.34.attn_q_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[ 384/ 399]                 blk.34.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[ 385/ 399]               blk.34.ffn_down.weight - [12288,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 386/ 399]               blk.34.ffn_gate.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 387/ 399]               blk.34.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 388/ 399]                 blk.34.ffn_up.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 389/ 399]                 blk.35.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[ 390/ 399]            blk.35.attn_k_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[ 391/ 399]              blk.35.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 392/ 399]            blk.35.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[ 393/ 399]                 blk.35.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[ 394/ 399]            blk.35.attn_q_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
[ 395/ 399]                 blk.35.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[ 396/ 399]               blk.35.ffn_down.weight - [12288,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 397/ 399]               blk.35.ffn_gate.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
[ 398/ 399]               blk.35.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 399/ 399]                 blk.35.ffn_up.weight - [ 4096, 12288,     1,     1], type =    f16, converting to q8_0 .. size =    96.00 MiB ->    51.00 MiB
llama_model_quantize_impl: model size  = 15623.18 MB
llama_model_quantize_impl: quant size  =  8300.36 MB

main: quantize time = 179290.17 ms
main:    total time = 179290.17 ms
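
The per-tensor sizes in the log above can be checked by hand. The sketch below assumes the ggml Q8_0 layout: blocks of 32 weights, each stored as 32 int8 values plus one fp16 scale, for 34 bytes per block or 8.5 bits per weight. The function names are illustrative and are not part of llama.cpp.

```python
# Reproduce the f16 -> q8_0 sizes printed in the quantization log.
# Assumption: Q8_0 stores 32 weights per block as 32 int8 values plus one
# fp16 scale, so each block takes 34 bytes.
Q8_0_BLOCK_WEIGHTS = 32
Q8_0_BLOCK_BYTES = 32 + 2
MIB = 1024 * 1024


def element_count(shape):
    """Number of weights in a tensor with the given dimensions."""
    n = 1
    for dim in shape:
        n *= dim
    return n


def f16_mib(shape):
    """Size in MiB when every weight is a 2-byte half float."""
    return element_count(shape) * 2 / MIB


def q8_0_mib(shape):
    """Size in MiB after Q8_0 quantization (whole blocks only)."""
    n = element_count(shape)
    if n % Q8_0_BLOCK_WEIGHTS != 0:
        raise ValueError("element count must be a multiple of 32")
    return n // Q8_0_BLOCK_WEIGHTS * Q8_0_BLOCK_BYTES / MIB


# Shapes taken from the log lines above.
for name, shape in [
    ("ffn_up", (4096, 12288)),
    ("attn_q", (4096, 4096)),
    ("attn_k", (4096, 1024)),
]:
    print(f"{name}: {f16_mib(shape):.2f} MiB -> {q8_0_mib(shape):.2f} MiB")

# Overall ratio reported by llama_model_quantize_impl.
overall_ratio = 8300.36 / 15623.18
print(f"overall ratio: {overall_ratio:.4f} (34/64 = {34 / 64:.4f})")
```

The overall ratio comes out slightly above 34/64, which is consistent with the f32 norm tensors in the log being left unquantized.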

Server¶

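When the Docker host runs Windows, set the paths as PowerShell variables, for example:

Model directory on the host: $HOST_MODEL_DIR="D:\models"
Model file path inside the container (under the /models mount): $MODEL_PATH="/models/Qwen3-8B/Qwen3-8B-Q8_0.gguf"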

docker run --name llama -v ${HOST_MODEL_DIR}:/models -p 12345:12345 --gpus all ghcr.io/ggml-org/llama.cpp:full-cuda --server -m $MODEL_PATH --port 12345 --n-gpu-layers 999 --host 0.0.0.0
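
Once the container is up, the server can be queried over HTTP. The following is a minimal sketch, assuming the server is reachable at localhost on the port 12345 published above and that it serves the OpenAI-compatible `/v1/chat/completions` route. The helper names `build_chat_request` and `ask`, the prompt, and the `max_tokens` value are illustrative choices, not part of llama.cpp.

```python
import json
import urllib.request

# Assumption: the container from the docker run command is running on this
# machine and port 12345 is published to the host.
BASE_URL = "http://localhost:12345"


def build_chat_request(prompt, base_url=BASE_URL, max_tokens=64):
    """Build (but do not send) a chat completion request."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, timeout=120):
    """Send the request and return the text of the first choice."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Building the request needs no running server, so it is safe to inspect.
request = build_chat_request("Hello")
print(request.get_method(), request.full_url)
```

With the server running, `print(ask("Hello"))` should print the model's reply. The timeout is generous because the first request may wait for the model to finish loading.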