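Not long ago, right after Meta released its open-source large language model LLaMA, someone leaked an unrestricted download link, and the model was "opened up" whether Meta liked it or not. The news set the community buzzing, and everyone rushed to download and test it. Those without a top-tier graphics card, however, could only gaze at the model wistfully. Not a big problem, though: Georgi Gerganov recently built a project called "llama.cpp" that runs LLaMA without a GPU.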
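-p 'The first person to land on the moon was'
The first person to land on the moon was 38-year-old astronaut Neil A. Armstrong. On July 20, 1969, Apollo 11 landed on the moon. Apollo 11 was the first crewed mission to land on the moon and the climax of the Space Race. In July 1969, Neil Armstrong and Edwin "Buzz" Aldrin became the first humans to land on the moon. Apollo 11 launched from the Kennedy Space Center on July 16.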
-p 'def open_and_return_content(filename):'
def open_and_return_content(filename):
    """
    Opens file (returning the content) and performs basic sanity checks
    """
    if os.path.isfile(filename):
        with open(filename) as f:
            content = f.read()
            return content
    else:
        print('WARNING: file "{}" does not exist'.format(filename), file=sys.stderr)
        return ''

def get_file_info(filename, fullpath):
    """
    Get file information (i.e., permission, owner, group, size)
    """
xcode-select --install
brew install pkgconfig cmake
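Before building, it can help to confirm that the toolchain actually landed on your PATH. This is a minimal sketch; the executable names (`cc`, `make`, `cmake`, `pkg-config`) are assumptions about what the commands above provide, and `missing_tools` is a hypothetical helper name.

```python
import shutil

# Executables the build relies on (assumed names; adjust for your system).
REQUIRED_TOOLS = ("cc", "make", "cmake", "pkg-config")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the subset of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("missing:", ", ".join(missing))
    else:
        print("all build tools found")
```

`shutil.which` performs the same PATH lookup a shell would, so an empty result means the build commands should at least be able to start.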
To set up the environment, if you are using Python 3.11, you can create a virtual environment:
/opt/homebrew/bin/python3.11 -m venv venv
. venv/bin/activate.fish   # fish shell; in bash or zsh use: . venv/bin/activate
pip3 install --pre torch torchvision --extra-index-url https://download.pytorch.org/whl/nightly/cpu
python
Python 3.11.2 (main, Feb 16 2023, 02:55:59) [Clang 14.0.0 (clang-1400.0.29.202)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch; torch.backends.mps.is_available()
True
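The same check can be scripted so that it degrades gracefully when PyTorch is missing or was built without MPS. A sketch: `pick_torch_device` is a hypothetical helper, while `torch.backends.mps.is_available()` is the call used in the session above. As far as I can tell, llama.cpp itself runs on the CPU, and PyTorch is needed here mainly so the conversion script can load the original checkpoint.

```python
import importlib.util


def pick_torch_device():
    """Return "mps", "cpu", or None if PyTorch is not installed."""
    if importlib.util.find_spec("torch") is None:
        return None
    import torch

    # torch.backends.mps only exists in newer PyTorch builds; guard for older ones.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


if __name__ == "__main__":
    print(pick_torch_device())
```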
Step 3: Compile LLaMA CPP
git clone git@github.com:ggerganov/llama.cpp.git
cd llama.cpp
make
I llama.cpp build info:
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I. -O3 -DNDEBUG -std=c11 -fPIC -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread
I LDFLAGS:  -framework Accelerate
I CC:       Apple clang version 14.0.0 (clang-1400.0.29.202)
I CXX:      Apple clang version 14.0.0 (clang-1400.0.29.202)

cc  -I. -O3 -DNDEBUG -std=c11 -fPIC -pthread -DGGML_USE_ACCELERATE -c ggml.c -o ggml.o
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread -c utils.cpp -o utils.o
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread main.cpp ggml.o utils.o -o main -framework Accelerate
./main -h
usage: ./main [options]

options:
  -h, --help            show this help message and exit
  -s SEED, --seed SEED  RNG seed (default: -1)
  -t N, --threads N     number of threads to use during computation (default: 4)
  -p PROMPT, --prompt PROMPT
                        prompt to start generation with (default: random)
  -n N, --n_predict N   number of tokens to predict (default: 128)
  --top_k N             top-k sampling (default: 40)
  --top_p N             top-p sampling (default: 0.9)
  --temp N              temperature (default: 0.8)
  -b N, --batch_size N  batch size for prompt processing (default: 8)
  -m FNAME, --model FNAME
                        model path (default: models/llama-7B/ggml-model.bin)

c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread quantize.cpp ggml.o utils.o -o quantize -framework Accelerate
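If you want to drive `./main` from Python, the flags in the help output above map directly onto an argument list. A sketch under that assumption: `build_main_argv` and `run_main` are hypothetical helpers, and they use only options listed by `./main -h`; anything left as `None` is omitted so the binary's own defaults apply.

```python
import subprocess


def build_main_argv(model, prompt, threads=None, n_predict=None,
                    top_k=None, top_p=None, temp=None, seed=None,
                    binary="./main"):
    """Build an argv list for ./main from the flags shown by `./main -h`."""
    argv = [binary, "-m", model, "-p", prompt]
    optional = [
        ("-t", threads),
        ("-n", n_predict),
        ("--top_k", top_k),
        ("--top_p", top_p),
        ("--temp", temp),
        ("-s", seed),
    ]
    for flag, value in optional:
        if value is not None:  # skip unset options so ./main uses its defaults
            argv += [flag, str(value)]
    return argv


def run_main(**kwargs):
    """Run ./main and return stdout (needs the built binary and a model file)."""
    result = subprocess.run(build_main_argv(**kwargs),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

Passing a list instead of a shell string sidesteps quoting problems, such as the trailing space inside a prompt like `'The first president of the USA was '`.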
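Assuming you have already placed the model under models/ in the llama.cpp repo, convert it to ggml format (the trailing 1 selects float16 output, which is why the result is named ggml-model-f16.bin):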
python convert-pth-to-ggml.py models/7B 1
{'dim': 4096, 'multiple_of': 256, 'n_heads': 32, 'n_layers': 32, 'norm_eps': 1e-06, 'vocab_size': 32000}
n_parts = 1
Processing part 0
Processing variable: tok_embeddings.weight with shape: torch.Size([32000, 4096]) and type: torch.float16
Processing variable: norm.weight with shape: torch.Size([4096]) and type: torch.float16
  Converting to float32
Processing variable: output.weight with shape: torch.Size([32000, 4096]) and type: torch.float16
Processing variable: layers.0.attention.wq.weight with shape: torch.Size([4096, 4096]) and type: torch.float16
Processing variable: layers.0.attention.wk.weight with shape: torch.Size([4096, 4096]) and type: torch.float16
Processing variable: layers.0.attention.wv.weight with shape: torch.Size([4096, 4096]) and type: torch.float16
Processing variable: layers.0.attention.wo.weight with shape: torch.Size([4096, 4096]) and type: torch.float16
Processing variable: layers.0.feed_forward.w1.weight with shape: torch.Size([11008, 4096]) and type: torch.float16
Processing variable: layers.0.feed_forward.w2.weight with shape: torch.Size([4096, 11008]) and type: torch.float16
Processing variable: layers.0.feed_forward.w3.weight with shape: torch.Size([11008, 4096]) and type: torch.float16
Processing variable: layers.0.attention_norm.weight with shape: torch.Size([4096]) and type: torch.float16
...
Done. Output file: models/7B/ggml-model-f16.bin, (part 0 )
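The shapes printed during conversion are enough to reconstruct the model's size. A back-of-the-envelope sketch, assuming every layer has the same four attention matrices, three feed-forward matrices and two norm vectors as `layers.0` (the log elides the other layers with `...`):

```python
# Hyperparameters read off the conversion log above.
DIM, N_LAYERS, VOCAB, FFN = 4096, 32, 32000, 11008
MIB = 1024 * 1024


def llama_param_count(dim=DIM, n_layers=N_LAYERS, vocab=VOCAB, ffn=FFN):
    attention = 4 * dim * dim        # wq, wk, wv, wo
    feed_forward = 3 * ffn * dim     # w1, w2, w3
    norms = 2 * dim                  # attention_norm, ffn_norm
    per_layer = attention + feed_forward + norms
    embeddings = 2 * vocab * dim     # tok_embeddings and output
    return n_layers * per_layer + embeddings + dim  # + final norm.weight


params = llama_param_count()
print(params)                        # 6738415616
print(f"{params * 4 / MIB:.2f} MB")  # f32: 25705.02 MB
print(f"{params * 2 / MIB:.2f} MB")  # f16: 12852.51 MB
```

The f32 figure equals the `model size = 25705.02 MB` that `./quantize` reports in the next step, which suggests that number counts every weight at 4 bytes.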
./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin 2
llama_model_quantize: loading model from './models/7B/ggml-model-f16.bin'
llama_model_quantize: n_vocab = 32000
llama_model_quantize: n_ctx = 512
llama_model_quantize: n_embd = 4096
llama_model_quantize: n_mult = 256
llama_model_quantize: n_head = 32
llama_model_quantize: n_layer = 32
llama_model_quantize: f16 = 1
...
layers.31.attention_norm.weight - [ 4096, 1], type = f32 size = 0.016 MB
layers.31.ffn_norm.weight - [ 4096, 1], type = f32 size = 0.016 MB
llama_model_quantize: model size = 25705.02 MB
llama_model_quantize: quant size = 4017.27 MB
llama_model_quantize: hist: 0.000 0.022 0.019 0.033 0.053 0.078 0.104 0.125 0.134 0.125 0.104 0.078 0.053 0.033 0.019 0.022

main: quantize time = 29389.45 ms
main: total time = 29389.45 ms
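The final argument `2` selects the 4-bit `q4_0` type. As I understand the early ggml format, Q4_0 cuts each weight row into blocks of 32 values and stores one float32 scale plus 32 four-bit codes per block, i.e. 20 bytes per 32 weights. The sketch below mimics that scheme in pure Python; the rounding and nibble layout are assumptions rather than a copy of ggml's C code, and the helper names are hypothetical. The size check at the end assumes only the 1-D norm weights stay in f32, with element counts derived from the shapes in the conversion log.

```python
import math
import struct

QK = 32  # weights per block (assumed to match early ggml Q4_0)


def _round_half_away(v):
    # C's roundf() rounds halves away from zero; Python's round() does not.
    return int(math.copysign(math.floor(abs(v) + 0.5), v))


def quantize_block_q4_0(xs):
    """Encode 32 floats as a float32 scale followed by 16 bytes of 4-bit codes."""
    assert len(xs) == QK
    amax = max(abs(x) for x in xs)
    d = amax / 7.0                      # integer codes cover [-7, 7]
    inv_d = 1.0 / d if d else 0.0
    codes = [_round_half_away(x * inv_d) + 8 for x in xs]  # shift to 1..15
    packed = bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, QK, 2))
    return struct.pack("<f", d) + packed


def dequantize_block_q4_0(block):
    """Lossy inverse of quantize_block_q4_0."""
    (d,) = struct.unpack("<f", block[:4])
    out = []
    for byte in block[4:]:
        out.append(((byte & 0x0F) - 8) * d)  # low nibble holds the even index
        out.append(((byte >> 4) - 8) * d)    # high nibble holds the odd index
    return out


# Size check: 2-D tensors quantized, 1-D norm weights kept as f32.
N_QUANTIZED = 6_738_149_376
N_F32 = 266_240
size_mib = (N_QUANTIZED // QK * 20 + N_F32 * 4) / (1024 * 1024)
print(f"quant size = {size_mib:.2f} MB")  # quant size = 4017.27 MB
```

At 20 bytes per 32 weights this is 5 bits per weight, roughly a 6.4x reduction from f32. In this sketch the codes range over 1..15 and never use 0, which would be consistent with the first `hist` bin in the log being 0.000.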
./main -m ./models/7B/ggml-model-q4_0.bin \
  -t 8 \
  -n 128 \
  -p 'The first president of the USA was '
main: seed = 1678615879
llama_model_load: loading model from './models/7B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 4096
llama_model_load: n_mult = 256
llama_model_load: n_head = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 11008
llama_model_load: n_parts = 1
llama_model_load: ggml ctx size = 4529.34 MB
llama_model_load: memory_size = 512.00 MB, n_mem = 16384
llama_model_load: loading model part 1/1 from './models/7B/ggml-model-q4_0.bin'
llama_model_load: .................................... done
llama_model_load: model size = 4017.27 MB / num tensors = 291

main: prompt: 'The first president of the USA was '
main: number of tokens in prompt = 9
     1 -> ''
  1576 -> 'The'
   937 -> ' first'
  6673 -> ' president'
   310 -> ' of'
   278 -> ' the'
  8278 -> ' USA'
   471 -> ' was'
 29871 -> ' '

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000

The first president of the USA was 57 years old when he assumed office (George Washington). Nowadays, the US electorate expects the new president to be more young at heart. President Donald Trump was 70 years old when he was inaugurated. In contrast to his predecessors, he is physically fit, healthy and active. And his fitness has been a prominent theme of his presidency. During the presidential campaign, he famously said he would be the “most active president ever” — a statement Trump has not yet achieved, but one that fits his approach to the office. His tweets demonstrate his physical activity.

main: mem per token = 14434244 bytes
main: load time = 1311.74 ms
main: sample time = 278.96 ms
main: predict time = 7375.89 ms / 54.23 ms per token
main: total time = 9216.61 ms
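The `sampling parameters` line (temp, top_k, top_p) describes how each next token is drawn from the model's output scores. Below is a generic sketch of temperature plus top-k plus top-p (nucleus) sampling over a list of logits. It illustrates the technique only and is not llama.cpp's actual implementation, whose order of operations may differ; `sample_token` is a hypothetical name.

```python
import math
import random


def sample_token(logits, temp=0.8, top_k=40, top_p=0.95, rng=random):
    """Sample an index from `logits` (temp must be > 0 in this sketch)."""
    # 1. Temperature: lower values sharpen the distribution.
    scaled = [x / temp for x in logits]
    # 2. Top-k: keep the k highest-scoring candidates, best first.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # 3. Softmax over the survivors (subtract the max for numerical stability).
    m = scaled[ranked[0]]
    weights = [math.exp(scaled[i] - m) for i in ranked]
    total = sum(weights)
    probs = [w / total for w in weights]
    # 4. Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for idx, p in zip(ranked, probs):
        kept.append((idx, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # 5. Draw one candidate; random.choices normalizes the weights itself.
    ids, ps = zip(*kept)
    return rng.choices(ids, weights=ps, k=1)[0]
```

Passing `rng=random.Random(1678615879)` makes the draws reproducible, which plays a role similar to `./main`'s `-s`/`--seed` flag.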