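Introduction to llama.cpp
llama.cpp is an open-source project that aims to enable large language model (LLM) inference with minimal setup and state-of-the-art performance on a wide range of hardware. It supports efficient LLM inference both locally and in the cloud.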
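Key features
Plain C/C++ implementation: written in plain C/C++ with no external dependencies, so it is easy to use across platforms.
Apple Silicon optimization: optimized for Apple Silicon through ARM NEON, Accelerate, and the Metal framework.
x86 architecture support: uses the AVX, AVX2, AVX512, and AMX instruction sets to improve performance.
Quantization support: integer quantization from 1.5-bit to 8-bit speeds up inference and reduces memory use.
GPU acceleration: CUDA kernels for NVIDIA GPUs, HIP for AMD GPUs, and MUSA for Moore Threads MTT GPUs.
Multiple backends: the Vulkan and SYCL backends allow execution on a wider range of hardware.
CPU+GPU hybrid inference: models larger than the available VRAM can still run with partial GPU acceleration.
In addition, the llama.cpp project serves as the main platform for developing new features for the GGML library.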
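To make the quantization feature listed above more concrete, here is a minimal Python sketch of an 8-bit block scheme in the style of Q8_0. It assumes blocks of 32 weights that share one half-precision scale, which gives 34 bytes per block. The function names are hypothetical and this is not the GGML implementation; it only illustrates where the memory savings come from. The size check uses the 4096 × 151936 tensor shape that appears in the quantization log later in this document.

```python
import math
import struct

BLOCK_SIZE = 32        # weights per block (assumed Q8_0-style layout)
BLOCK_BYTES = 2 + 32   # one fp16 scale plus 32 int8 values


def to_fp16(x):
    """Round a Python float to the nearest half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def quantize_block(values):
    """Return (fp16 scale, list of int8 values) for one block of 32 floats."""
    if len(values) != BLOCK_SIZE:
        raise ValueError("a block holds exactly 32 values")
    amax = max(abs(v) for v in values)
    d = amax / 127.0                      # largest magnitude maps to +/-127
    inv = 1.0 / d if d else 0.0
    quants = [max(-127, min(127, round(v * inv))) for v in values]
    return to_fp16(d), quants


def dequantize_block(scale, quants):
    """Reconstruct approximate floats from a quantized block."""
    return [scale * q for q in quants]


def size_mib(n_elements, bytes_per_block, block_size):
    """Storage size in MiB for a tensor stored in fixed-size blocks."""
    return n_elements // block_size * bytes_per_block / (1024 * 1024)


# Round trip on one block of sample data.
values = [math.sin(i) for i in range(BLOCK_SIZE)]
scale, quants = quantize_block(values)
restored = dequantize_block(scale, quants)
max_error = max(abs(a - b) for a, b in zip(values, restored))

# Size of a 4096 x 151936 tensor before and after quantization.
n = 4096 * 151936
f16_mib = size_mib(n, 2, 1)                      # 2 bytes per weight
q8_mib = size_mib(n, BLOCK_BYTES, BLOCK_SIZE)    # 34 bytes per 32 weights
print(f"F16: {f16_mib:.2f} MiB, 8-bit blocks: {q8_mib:.2f} MiB")
```

Under these assumptions the F16 tensor takes 1187.00 MiB and the 8-bit version 630.59 MiB, which agrees with the sizes reported for that tensor in the log.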
Installation and usage
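Clone the repository and build:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

In recent llama.cpp releases the Makefile build has been replaced by CMake, so the last step becomes:

cmake -B build
cmake --build build --config Release
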
Models supported by llama.cpp
llama.cpp supports inference for a wide range of language models. Most fine-tuned versions of the base models are compatible as well.
Major open-source models
| Developer | Models |
|---|---|
| Meta | LLaMA 🦙, LLaMA 2 🦙🦙, LLaMA 3 🦙🦙🦙 |
| Microsoft | Phi-2, Phi-3 Mini, Phi-3 Small, Phi-3 Medium |
| Google | mT5, Gemma |
Adding a model
To add a new model architecture to llama.cpp, follow the steps below.
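Convert the model to GGUF
The model is converted to GGUF by a Python conversion script that uses the gguf library.
Depending on the model architecture, use either convert_hf_to_gguf.py or examples/convert_legacy_llama.py.
The script converts the model configuration, the tokenizer, and the tensor names and data into GGUF metadata and tensors.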
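The renaming part of this conversion step can be sketched in a few lines of Python. The tables below are an illustrative subset, not the mapping shipped with the gguf library: the Hugging Face key and tensor names on the left are assumptions based on common LLaMA-style checkpoints, while the GGUF names on the right are the ones visible in the quantization log later in this document. The helper names `gguf_metadata` and `gguf_tensor_name` are hypothetical.

```python
import re

# Hugging Face config key -> GGUF metadata key suffix (illustrative subset).
CONFIG_KEYS = {
    "num_hidden_layers": "block_count",
    "max_position_embeddings": "context_length",
    "hidden_size": "embedding_length",
    "intermediate_size": "feed_forward_length",
    "num_attention_heads": "attention.head_count",
    "num_key_value_heads": "attention.head_count_kv",
    "rope_theta": "rope.freq_base",
}

# Hugging Face tensor name pattern -> GGUF tensor name (illustrative subset).
TENSOR_RULES = [
    (r"model\.embed_tokens\.weight", "token_embd.weight"),
    (r"model\.norm\.weight", "output_norm.weight"),
    (r"lm_head\.weight", "output.weight"),
    (r"model\.layers\.(\d+)\.input_layernorm\.weight", r"blk.\1.attn_norm.weight"),
    (r"model\.layers\.(\d+)\.self_attn\.q_proj\.weight", r"blk.\1.attn_q.weight"),
    (r"model\.layers\.(\d+)\.self_attn\.k_proj\.weight", r"blk.\1.attn_k.weight"),
    (r"model\.layers\.(\d+)\.self_attn\.v_proj\.weight", r"blk.\1.attn_v.weight"),
    (r"model\.layers\.(\d+)\.self_attn\.o_proj\.weight", r"blk.\1.attn_output.weight"),
    (r"model\.layers\.(\d+)\.post_attention_layernorm\.weight", r"blk.\1.ffn_norm.weight"),
    (r"model\.layers\.(\d+)\.mlp\.gate_proj\.weight", r"blk.\1.ffn_gate.weight"),
    (r"model\.layers\.(\d+)\.mlp\.up_proj\.weight", r"blk.\1.ffn_up.weight"),
    (r"model\.layers\.(\d+)\.mlp\.down_proj\.weight", r"blk.\1.ffn_down.weight"),
]


def gguf_metadata(arch, hf_config):
    """Translate known config entries into '<arch>.<key>' metadata pairs."""
    meta = {"general.architecture": arch}
    for hf_key, suffix in CONFIG_KEYS.items():
        if hf_key in hf_config:
            meta[f"{arch}.{suffix}"] = hf_config[hf_key]
    return meta


def gguf_tensor_name(hf_name):
    """Map one checkpoint tensor name to its GGUF name, or raise KeyError."""
    for pattern, replacement in TENSOR_RULES:
        match = re.fullmatch(pattern, hf_name)
        if match:
            return match.expand(replacement)
    raise KeyError(f"no mapping for tensor {hf_name!r}")


# Sample values copied from the metadata dump in the log (Qwen3 8B).
hf_config = {
    "num_hidden_layers": 36,
    "hidden_size": 4096,
    "intermediate_size": 12288,
    "num_attention_heads": 32,
    "num_key_value_heads": 8,
}
metadata = gguf_metadata("qwen3", hf_config)
name = gguf_tensor_name("model.layers.3.self_attn.k_proj.weight")
print(metadata["qwen3.block_count"], name)
```

A real conversion script also has to write the tokenizer and the tensor data themselves; the sketch covers only the naming.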
Define the model architecture in llama.cpp
Define a new llm_arch and set up the model's hyperparameters and tensor layout.
If needed, specify the model's RoPE (Rotary Position Embedding) type.
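Why the RoPE type matters can be seen in a small Python sketch. It assumes the two common pairing layouts: an interleaved one that rotates elements (2i, 2i+1), and a "NeoX"-style one that rotates elements (i, i + n/2). The function `rope` and its parameters are hypothetical, and the default frequency base is simply the qwen3.rope.freq_base value from the log in this document. A model converted with one layout but evaluated with the other would rotate the wrong pairs.

```python
import math


def rope(vec, pos, freq_base=1_000_000.0, neox=False):
    """Apply rotary position embedding to one head vector at position pos."""
    n = len(vec)
    if n % 2:
        raise ValueError("head dimension must be even")
    half = n // 2
    out = list(vec)
    for i in range(half):
        # Each pair rotates by its own angle; higher pairs rotate more slowly.
        theta = pos * freq_base ** (-2.0 * i / n)
        c, s = math.cos(theta), math.sin(theta)
        a, b = (i, i + half) if neox else (2 * i, 2 * i + 1)
        x0, x1 = vec[a], vec[b]
        out[a] = x0 * c - x1 * s
        out[b] = x0 * s + x1 * c
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.5, -1.0, 2.0, 0.25]
k = [1.0, 0.5, -0.5, 2.0]

# Attention scores depend only on the distance between positions (here 2).
score_a = dot(rope(q, 5), rope(k, 3))
score_b = dot(rope(q, 2), rope(k, 0))

# The two layouts give different vectors for the same input.
interleaved = rope(q, 1)
neox_style = rope(q, 1, neox=True)
```

The two scores agree up to floating-point error, which is the relative-position property that makes RoPE useful, while the two layouts produce different outputs for the same input.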
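Build the GGML graph implementation
Implement the inference graph for the new model architecture inside the llama_build_graph function.
Use existing implementations such as build_llama, build_dbrx, and build_bert as references when writing the new graph.
Verify that the new architecture works correctly on the main GGML backends (CUDA, METAL, CPU).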
Once these steps are complete, you can submit a pull request (PR) to share your changes.
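The graph-building step follows a build-then-compute pattern: the model code first describes the operations, and a backend evaluates them afterwards. The toy Python sketch below illustrates that idea for a LLaMA-style feed-forward block, using stand-ins for the ffn_norm, ffn_gate, ffn_up, and ffn_down tensors seen in the log. Every class and function here is hypothetical and much simpler than GGML's actual API.

```python
import math


class Node:
    """One deferred operation; nothing is evaluated until compute() runs."""

    def __init__(self, op, inputs=(), data=None):
        self.op, self.inputs, self.data = op, inputs, data


def tensor(data):
    return Node("leaf", data=data)


def rms_norm(x, eps=1e-6):
    return Node("rms_norm", (x,), data=eps)


def mul(a, b):
    return Node("mul", (a, b))


def add(a, b):
    return Node("add", (a, b))


def silu(x):
    return Node("silu", (x,))


def matvec(w, x):
    return Node("matvec", (w, x))


def compute(node, cache=None):
    """Evaluate the graph depth-first, computing each node only once."""
    cache = {} if cache is None else cache
    if id(node) in cache:
        return cache[id(node)]
    args = [compute(i, cache) for i in node.inputs]
    if node.op == "leaf":
        out = node.data
    elif node.op == "rms_norm":
        x = args[0]
        scale = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + node.data)
        out = [v * scale for v in x]
    elif node.op == "mul":
        out = [a * b for a, b in zip(*args)]
    elif node.op == "add":
        out = [a + b for a, b in zip(*args)]
    elif node.op == "silu":
        out = [v / (1.0 + math.exp(-v)) for v in args[0]]
    elif node.op == "matvec":
        w, x = args
        out = [sum(wi * xi for wi, xi in zip(row, x)) for row in w]
    else:
        raise ValueError(f"unknown op {node.op}")
    cache[id(node)] = out
    return out


def build_ffn_block(x, w_norm, w_gate, w_up, w_down):
    """Describe norm -> gated feed-forward -> residual add, without running it."""
    h = mul(rms_norm(x), w_norm)
    gated = mul(silu(matvec(w_gate, h)), matvec(w_up, h))
    return add(x, matvec(w_down, gated))


x = tensor([1.0, -2.0])
graph = build_ffn_block(
    x,
    tensor([1.0, 1.0]),
    tensor([[0.5, 0.0], [0.0, 0.5], [1.0, 1.0]]),
    tensor([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]),
    tensor([[0.1, 0.2, 0.3], [0.3, 0.2, 0.1]]),
)
result = compute(graph)   # evaluation happens only here
```

Keeping the description separate from the evaluation is what lets the same graph run on different backends, which is why the last step above asks for a check on each of them.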
Model conversion and quantization
Docker image

docker pull ghcr.io/ggml-org/llama.cpp:full

Convert a Hugging Face model to a GGUF model

docker run -v ${PWD}:/models ghcr.io/ggml-org/llama.cpp:full --convert /models/Qwen3-8B

Quantize the converted F16 model to Q8_0

docker run -v ${PWD}:/models ghcr.io/ggml-org/llama.cpp:full --quantize /models/Qwen3-8B/Qwen3-8B-F16.gguf Q8_0

main: build = 6135 (25ff6f76)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: quantizing '/models/Qwen3-8B/Qwen3-8B-F16.gguf' to '/models/Qwen3-8B/ggml-model-Q8_0.gguf' as Q8_0
llama_model_loader: loaded meta data with 27 key-value pairs and 399 tensors from /models/Qwen3-8B/Qwen3-8B-F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen3 8B
llama_model_loader: - kv 3: general.basename str = Qwen3
llama_model_loader: - kv 4: general.size_label str = 8B
llama_model_loader: - kv 5: qwen3.block_count u32 = 36
llama_model_loader: - kv 6: qwen3.context_length u32 = 40960
llama_model_loader: - kv 7: qwen3.embedding_length u32 = 4096
llama_model_loader: - kv 8: qwen3.feed_forward_length u32 = 12288
llama_model_loader: - kv 9: qwen3.attention.head_count u32 = 32
llama_model_loader: - kv 10: qwen3.attention.head_count_kv u32 = 8
llama_model_loader: - kv 11: qwen3.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 12: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 13: qwen3.attention.key_length u32 = 128
llama_model_loader: - kv 14: qwen3.attention.value_length u32 = 128
llama_model_loader: - kv 15: general.file_type u32 = 1
llama_model_loader: - kv 16: general.quantization_version u32 = 2
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 22: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 23: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 24: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 25: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 26: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - type f32: 145 tensors
llama_model_loader: - type f16: 254 tensors
[ 1/ 399] output.weight - [ 4096, 151936, 1, 1], type = f16, converting to q8_0 .. size = 1187.00 MiB -> 630.59 MiB
[ 2/ 399] output_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 3/ 399] token_embd.weight - [ 4096, 151936, 1, 1], type = f16, converting to q8_0 .. size = 1187.00 MiB -> 630.59 MiB
[ 4/ 399] blk.0.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 5/ 399] blk.0.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 6/ 399] blk.0.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 7/ 399] blk.0.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 8/ 399] blk.0.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 9/ 399] blk.0.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 10/ 399] blk.0.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 11/ 399] blk.0.ffn_down.weight - [12288, 4096, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 12/ 399] blk.0.ffn_gate.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 13/ 399] blk.0.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 14/ 399] blk.0.ffn_up.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 15/ 399] blk.1.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 16/ 399] blk.1.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 17/ 399] blk.1.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 18/ 399] blk.1.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 19/ 399] blk.1.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 20/ 399] blk.1.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 21/ 399] blk.1.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 22/ 399] blk.1.ffn_down.weight - [12288, 4096, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 23/ 399] blk.1.ffn_gate.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 24/ 399] blk.1.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 25/ 399] blk.1.ffn_up.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 26/ 399] blk.2.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 27/ 399] blk.2.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 28/ 399] blk.2.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 29/ 399] blk.2.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 30/ 399] blk.2.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 31/ 399] blk.2.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 32/ 399] blk.2.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 33/ 399] blk.2.ffn_down.weight - [12288, 4096, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 34/ 399] blk.2.ffn_gate.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 35/ 399] blk.2.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 36/ 399] blk.2.ffn_up.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 37/ 399] blk.3.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 38/ 399] blk.3.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 39/ 399] blk.3.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 40/ 399] blk.3.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 41/ 399] blk.3.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 42/ 399] blk.3.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 43/ 399] blk.3.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 44/ 399] blk.3.ffn_down.weight - [12288, 4096, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 45/ 399] blk.3.ffn_gate.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 46/ 399] blk.3.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 47/ 399] blk.3.ffn_up.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 48/ 399] blk.4.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 49/ 399] blk.4.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 50/ 399] blk.4.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 51/ 399] blk.4.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 52/ 399] blk.4.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 53/ 399] blk.4.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 54/ 399] blk.4.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 55/ 399] blk.4.ffn_down.weight - [12288, 4096, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 56/ 399] blk.4.ffn_gate.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 57/ 399] blk.4.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 58/ 399] blk.4.ffn_up.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 59/ 399] blk.5.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 60/ 399] blk.5.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 61/ 399] blk.5.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 62/ 399] blk.5.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 63/ 399] blk.5.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 64/ 399] blk.5.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 65/ 399] blk.5.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 66/ 399] blk.5.ffn_down.weight - [12288, 4096, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 67/ 399] blk.5.ffn_gate.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 68/ 399] blk.5.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 69/ 399] blk.5.ffn_up.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 70/ 399] blk.6.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 71/ 399] blk.6.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 72/ 399] blk.6.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 73/ 399] blk.6.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 74/ 399] blk.6.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 75/ 399] blk.6.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 76/ 399] blk.6.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 77/ 399] blk.6.ffn_down.weight - [12288, 4096, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 78/ 399] blk.6.ffn_gate.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 79/ 399] blk.6.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 80/ 399] blk.6.ffn_up.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 81/ 399] blk.7.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 82/ 399] blk.7.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 83/ 399] blk.7.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 84/ 399] blk.7.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 85/ 399] blk.7.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 86/ 399] blk.7.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 87/ 399] blk.7.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 88/ 399] blk.7.ffn_down.weight - [12288, 4096, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 89/ 399] blk.7.ffn_gate.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 90/ 399] blk.7.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 91/ 399] blk.7.ffn_up.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 92/ 399] blk.8.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 93/ 399] blk.8.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 94/ 399] blk.8.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 95/ 399] blk.8.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 96/ 399] blk.8.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 97/ 399] blk.8.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 98/ 399] blk.8.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 99/ 399] blk.8.ffn_down.weight - [12288, 4096, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 100/ 399] blk.8.ffn_gate.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 101/ 399] blk.8.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 102/ 399] blk.8.ffn_up.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 103/ 399] blk.9.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 104/ 399] blk.9.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 105/ 399] blk.9.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 106/ 399] blk.9.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 107/ 399] blk.9.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 108/ 399] blk.9.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 109/ 399] blk.9.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 110/ 399] blk.9.ffn_down.weight - [12288, 4096, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 111/ 399] blk.9.ffn_gate.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 112/ 399] blk.9.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 113/ 399] blk.9.ffn_up.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 114/ 399] blk.10.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 115/ 399] blk.10.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 116/ 399] blk.10.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 117/ 399] blk.10.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 118/ 399] blk.10.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 119/ 399] blk.10.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 120/ 399] blk.10.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 121/ 399] blk.10.ffn_down.weight - [12288, 4096, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 122/ 399] blk.10.ffn_gate.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 123/ 399] blk.10.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 124/ 399] blk.10.ffn_up.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 125/ 399] blk.11.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 126/ 399] blk.11.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 127/ 399] blk.11.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 128/ 399] blk.11.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 129/ 399] blk.11.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 130/ 399] blk.11.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 131/ 399] blk.11.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 132/ 399] blk.11.ffn_down.weight - [12288, 4096, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 133/ 399] blk.11.ffn_gate.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 134/ 399] blk.11.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 135/ 399] blk.11.ffn_up.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 136/ 399] blk.12.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 137/ 399] blk.12.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 138/ 399] blk.12.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 139/ 399] blk.12.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 140/ 399] blk.12.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 141/ 399] blk.12.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 142/ 399] blk.12.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 143/ 399] blk.12.ffn_down.weight - [12288, 4096, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 144/ 399] blk.12.ffn_gate.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 145/ 399] blk.12.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 146/ 399] blk.12.ffn_up.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 147/ 399] blk.13.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 148/ 399] blk.13.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 149/ 399] blk.13.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 150/ 399] blk.13.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 151/ 399] blk.13.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 152/ 399] blk.13.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 153/ 399] blk.13.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 154/ 399] blk.13.ffn_down.weight - [12288, 4096, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 155/ 399] blk.13.ffn_gate.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 156/ 399] blk.13.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 157/ 399] blk.13.ffn_up.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 158/ 399] blk.14.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 159/ 399] blk.14.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 160/ 399] blk.14.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 161/ 399] blk.14.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 162/ 399] blk.14.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 163/ 399] blk.14.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 164/ 399] blk.14.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 165/ 399] blk.14.ffn_down.weight - [12288, 4096, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 166/ 399] blk.14.ffn_gate.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 167/ 399] blk.14.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 168/ 399] blk.14.ffn_up.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 169/ 399] blk.15.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 170/ 399] blk.15.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 171/ 399] blk.15.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 172/ 399] blk.15.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 173/ 399] blk.15.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 174/ 399] blk.15.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 175/ 399] blk.15.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 176/ 399] blk.15.ffn_down.weight - [12288, 4096, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 177/ 399] blk.15.ffn_gate.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 178/ 399] blk.15.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 179/ 399] blk.15.ffn_up.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 180/ 399] blk.16.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 181/ 399] blk.16.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 182/ 399] blk.16.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 183/ 399] blk.16.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 184/ 399] blk.16.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 185/ 399] blk.16.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 186/ 399] blk.16.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 187/ 399] blk.16.ffn_down.weight - [12288, 4096, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 188/ 399] blk.16.ffn_gate.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 189/ 399] blk.16.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 190/ 399] blk.16.ffn_up.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 191/ 399] blk.17.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 192/ 399] blk.17.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 193/ 399] blk.17.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 194/ 399] blk.17.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 195/ 399] blk.17.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 196/ 399] blk.17.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 197/ 399] blk.17.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 198/ 399] blk.17.ffn_down.weight - [12288, 4096, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 199/ 399] blk.17.ffn_gate.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 200/ 399] blk.17.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 201/ 399] blk.17.ffn_up.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 202/ 399] blk.18.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 203/ 399] blk.18.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 204/ 399] blk.18.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 205/ 399] blk.18.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 206/ 399] blk.18.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 207/ 399] blk.18.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 208/ 399] blk.18.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 209/ 399] blk.18.ffn_down.weight - [12288, 4096, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 210/ 399] blk.18.ffn_gate.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 211/ 399] blk.18.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 212/ 399] blk.18.ffn_up.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 213/ 399] blk.19.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 214/ 399] blk.19.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 215/ 399] blk.19.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 216/ 399] blk.19.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 217/ 399] blk.19.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 218/ 399] blk.19.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 219/ 399] blk.19.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 220/ 399] blk.19.ffn_down.weight - [12288, 4096, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 221/ 399] blk.19.ffn_gate.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 222/ 399] blk.19.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 223/ 399] blk.19.ffn_up.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 224/ 399] blk.20.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 225/ 399] blk.20.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 226/ 399] blk.20.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 227/ 399] blk.20.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 228/ 399] blk.20.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 229/ 399] blk.20.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 230/ 399] blk.20.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 231/ 399] blk.20.ffn_down.weight - [12288, 4096, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 232/ 399] blk.20.ffn_gate.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 233/ 399] blk.20.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 234/ 399] blk.20.ffn_up.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 235/ 399] blk.21.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 236/ 399] blk.21.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 237/ 399] blk.21.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 238/ 399] blk.21.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 239/ 399] blk.21.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 240/ 399] blk.21.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 241/ 399] blk.21.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 242/ 399] blk.21.ffn_down.weight - [12288, 4096, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 243/ 399] blk.21.ffn_gate.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 244/ 399] blk.21.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 245/ 399] blk.21.ffn_up.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 246/ 399] blk.22.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 247/ 399] blk.22.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 248/ 399] blk.22.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 249/ 399] blk.22.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 250/ 399] blk.22.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 251/ 399] blk.22.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 252/ 399] blk.22.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 253/ 399] blk.22.ffn_down.weight - [12288, 4096, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 254/ 399] blk.22.ffn_gate.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 255/ 399] blk.22.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 256/ 399] blk.22.ffn_up.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 257/ 399] blk.23.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 258/ 399] blk.23.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 259/ 399] blk.23.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 260/ 399] blk.23.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 261/ 399] blk.23.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 262/ 399] blk.23.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 263/ 399] blk.23.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 264/ 399] blk.23.ffn_down.weight - [12288, 4096, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 265/ 399] blk.23.ffn_gate.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 266/ 399] blk.23.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 267/ 399] blk.23.ffn_up.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 268/ 399] blk.24.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 269/ 399] blk.24.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 270/ 399] blk.24.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 271/ 399] blk.24.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 272/ 399] blk.24.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 273/ 399] blk.24.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 274/ 399] blk.24.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 275/ 399] blk.24.ffn_down.weight - [12288, 4096, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 276/ 399] blk.24.ffn_gate.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 277/ 399] blk.24.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 278/ 399] blk.24.ffn_up.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 279/ 399] blk.25.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 280/ 399] blk.25.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 281/ 399] blk.25.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 282/ 399] blk.25.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 283/ 399] blk.25.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 284/ 399] blk.25.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 285/ 399] blk.25.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 286/ 399] blk.25.ffn_down.weight - [12288, 4096, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 287/ 399] blk.25.ffn_gate.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 288/ 399] blk.25.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 289/ 399] blk.25.ffn_up.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 290/ 399] blk.26.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 291/ 399] blk.26.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 292/ 399] blk.26.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 293/ 399] blk.26.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 294/ 399] blk.26.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 295/ 399] blk.26.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 296/ 399] blk.26.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 297/ 399] blk.26.ffn_down.weight - [12288, 4096, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 298/ 399] blk.26.ffn_gate.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 299/ 399] blk.26.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 300/ 399] blk.26.ffn_up.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 301/ 399] blk.27.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 302/ 399] blk.27.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 303/ 399] blk.27.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 304/ 399] blk.27.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 305/ 399] blk.27.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 306/ 399] blk.27.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 307/ 399] blk.27.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 308/ 399] blk.27.ffn_down.weight - [12288, 4096, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 309/ 399] blk.27.ffn_gate.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 310/ 399] blk.27.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 311/ 399] blk.27.ffn_up.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 312/ 399] blk.28.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 313/ 399] blk.28.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 314/ 399] blk.28.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 315/ 399] blk.28.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 316/ 399] blk.28.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 317/ 399] blk.28.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 318/ 399] blk.28.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 319/ 399] blk.28.ffn_down.weight - [12288, 4096, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 320/ 399] blk.28.ffn_gate.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 321/ 399] blk.28.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 322/ 399] blk.28.ffn_up.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 323/ 399] blk.29.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 324/ 399] blk.29.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 325/ 399] blk.29.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 326/ 399] blk.29.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 327/ 399] blk.29.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 328/ 399] blk.29.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 329/ 399] blk.29.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 330/ 399] blk.29.ffn_down.weight - [12288, 4096, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 331/ 399] blk.29.ffn_gate.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 332/ 399] blk.29.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 333/ 399] blk.29.ffn_up.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 334/ 399] blk.30.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 335/ 399] blk.30.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 336/ 399] blk.30.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 337/ 399] blk.30.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 338/ 399] blk.30.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 339/ 399] blk.30.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 340/ 399] blk.30.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 341/ 399] blk.30.ffn_down.weight - [12288, 4096, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 342/ 399] blk.30.ffn_gate.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 343/ 399] blk.30.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 344/ 399] blk.30.ffn_up.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 345/ 399] blk.31.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 346/ 399] blk.31.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 347/ 399] blk.31.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 348/ 399] blk.31.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 349/ 399] blk.31.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 350/ 399] blk.31.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 351/ 399] blk.31.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 352/ 399] blk.31.ffn_down.weight - [12288, 4096, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 353/ 399] blk.31.ffn_gate.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 354/ 399] blk.31.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 355/ 399] blk.31.ffn_up.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 356/ 399] blk.32.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 357/ 399] blk.32.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 358/ 399] blk.32.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 359/ 399] blk.32.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 360/ 399] blk.32.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 361/ 399] blk.32.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 362/ 399] blk.32.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 363/ 399] blk.32.ffn_down.weight - [12288, 4096, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 364/ 399] blk.32.ffn_gate.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 365/ 399] blk.32.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 366/ 399] blk.32.ffn_up.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 367/ 399] blk.33.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 368/ 399] blk.33.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 369/ 399] blk.33.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 370/ 399] blk.33.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 371/ 399] blk.33.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 372/ 399] blk.33.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 373/ 399] blk.33.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 374/ 399] blk.33.ffn_down.weight - [12288, 4096, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 375/ 399] blk.33.ffn_gate.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 376/ 399] blk.33.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 377/ 399] blk.33.ffn_up.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 378/ 399] blk.34.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 379/ 399] blk.34.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 380/ 399] blk.34.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 381/ 399] blk.34.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 382/ 399] blk.34.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 383/ 399] blk.34.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 384/ 399] blk.34.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 385/ 399] blk.34.ffn_down.weight - [12288, 4096, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 386/ 399] blk.34.ffn_gate.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 387/ 399] blk.34.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 388/ 399] blk.34.ffn_up.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 389/ 399] blk.35.attn_k.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 390/ 399] blk.35.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 391/ 399] blk.35.attn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 392/ 399] blk.35.attn_output.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 393/ 399] blk.35.attn_q.weight - [ 4096, 4096, 1, 1], type = f16, converting to q8_0 .. size = 32.00 MiB -> 17.00 MiB
[ 394/ 399] blk.35.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[ 395/ 399] blk.35.attn_v.weight - [ 4096, 1024, 1, 1], type = f16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
[ 396/ 399] blk.35.ffn_down.weight - [12288, 4096, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 397/ 399] blk.35.ffn_gate.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
[ 398/ 399] blk.35.ffn_norm.weight - [ 4096, 1, 1, 1], type = f32, size = 0.016 MB
[ 399/ 399] blk.35.ffn_up.weight - [ 4096, 12288, 1, 1], type = f16, converting to q8_0 .. size = 96.00 MiB -> 51.00 MiB
llama_model_quantize_impl: model size = 15623.18 MB
llama_model_quantize_impl: quant size = 8300.36 MB
main: quantize time = 179290.17 ms
main: total time = 179290.17 ms
Server¶
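Before serving the quantized file, the sizes reported in the quantization log above can be sanity-checked by hand. The sketch below assumes the standard Q8_0 layout (blocks of 32 weights stored as 32 int8 values plus one fp16 scale, i.e. 34 bytes per block); the helper names are hypothetical and exist only for this illustration.

```python
# Sanity check for the f16 -> q8_0 sizes printed in the quantization log.
# Assumption: Q8_0 stores 32 int8 weights plus one fp16 scale per block.
from math import prod

Q8_0_BLOCK_WEIGHTS = 32
Q8_0_BLOCK_BYTES = 32 + 2   # 32 int8 values + 2-byte fp16 scale
F16_BYTES = 2
MIB = 1024 * 1024


def f16_size_mib(shape):
    """Size of an f16 tensor with the given shape, in MiB."""
    return prod(shape) * F16_BYTES / MIB


def q8_0_size_mib(shape):
    """Size of the same tensor after Q8_0 quantization, in MiB."""
    n = prod(shape)
    if n % Q8_0_BLOCK_WEIGHTS != 0:
        raise ValueError("element count must be a multiple of 32")
    return n // Q8_0_BLOCK_WEIGHTS * Q8_0_BLOCK_BYTES / MIB


# Shapes copied from the log lines above.
tensors = {
    "attn_v": (4096, 1024),
    "attn_q": (4096, 4096),
    "ffn_down": (12288, 4096),
}

for name, shape in tensors.items():
    print(f"{name}: {f16_size_mib(shape):.2f} MiB -> {q8_0_size_mib(shape):.2f} MiB")

# Whole-model ratio from the two summary lines of the log.
model_ratio = 8300.36 / 15623.18
print(f"model ratio: {model_ratio:.4f} (per-tensor ratio: {34 / 64:.4f})")
```

The per-tensor ratio is 34/64, or 8.5 bits per weight. The whole-model ratio comes out slightly higher because the f32 norm tensors are left unquantized.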
If the Docker host is Windows, the paths look like the following example.
Model directory on the host:
$HOST_MODEL_DIR="D:\models"
Model path as mounted inside the container:
$MODEL_PATH="/models/Qwen3-8B/Qwen3-8B-Q8_0.gguf"
docker run --name llama -v $HOST_MODEL_DIR:/models -p 12345:12345 --gpus all ghcr.io/ggml-org/llama.cpp:full-cuda --server -m $MODEL_PATH --port 12345 --n-gpu-layers 999 --host 0.0.0.0
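Once the container is running, the server can be queried over HTTP. The following is a minimal sketch, assuming the server is reachable at localhost on the port 12345 published above and that it exposes the OpenAI-compatible `/v1/chat/completions` endpoint; `build_chat_request` and `ask` are hypothetical helper names, not part of llama.cpp.

```python
# Minimal client sketch for the server started above.
# Assumptions: the container is running and port 12345 is published on localhost.
import json
import urllib.request

BASE_URL = "http://localhost:12345"


def build_chat_request(prompt, max_tokens=128):
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        BASE_URL + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, timeout=120):
    """Send the prompt to the server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the container up, `print(ask("Hello"))` should print the model's reply. If the connection is refused, check that the `-p 12345:12345` mapping and `--host 0.0.0.0` from the command above are in place.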