Using Data Caching to Accelerate Large Model Loading and Application Startup - Developer Community - Alibaba Cloud

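Starting with a typical application scenario

I found a language model on HuggingFace, stabilityai/stablelm-base-alpha-7b, downloaded it to my development environment, and packaged it together with the application into a single container image of about 35GB (2GB for the application itself, 32GB+ for the model). I then built a container image cache, deployed the container, and started the application. The main flow is summarized in the figure below: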

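Once the application is up, I test the results:

The results are mediocre, because this model has not been fine-tuned.

So I kept looking for a better model (professional model developers would, of course, also fine-tune a model themselves and produce a new one). When I found another, already fine-tuned model, stabilityai/stablelm-tuned-alpha-7b, I needed to verify its results. That meant repeating the container image build to replace the previously packaged model, rebuilding the container image cache, deploying the container, and starting the new application:

With the fine-tuned model, the inference results are much better.

However, there are far too many large language models to choose from: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard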


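Until we find the most suitable model, every round of testing requires republishing the application image.

Once the model was settled, I found the application's interface too simple and decided to improve the application. That again meant rebuilding the container image, rebuilding the container image cache, deploying the container, and starting the new application, over and over.

Looking back over the whole process, I noticed several problems:

1. Packaging the model directly into the container image gives us the image's version management and image caching for free, but the image becomes enormous and has to be republished far more often: whenever either the application or the model changes, the whole image release process has to be repeated.

2. As the concept of MaaS takes hold, models are gaining relatively independent version management and repository storage, for example Alibaba Cloud's ModelScope and HuggingFace, and are even becoming increasingly similar to OCI for container images, for example [1][2]. On the other hand, a large application can usually be split into microservices and deployed separately, whereas a model cannot, so the model is often far larger than the image itself.

Separating container images from models has therefore become an inevitable trend.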

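Choosing model storage

Back to the case in the previous section: if the image and the model can be separated, my container image stays at around 2GB, far smaller than the combined image. Image builds, registry pushes, and image cache creation all take noticeably less time, and iterating on the model no longer requires an image release, which clearly lowers maintenance cost. But a new question arises: with the model baked into the image, we could rely on the container image's native caching to speed up application startup. After decoupling, how do we achieve the same or even better startup performance?

First, for storing the model, the main options today are:

1. Store it in a model repository similar to an image registry and download it directly when needed, for example modelDB[1], ModelScope[4], and HuggingFace[5].

2. Store it in file storage such as NAS/OSS or similar external storage. Alibaba Cloud's PAI, for example, mainly stores models this way.

3. Distribute the model to the nodes and mount it into containers directly via hostPath, sharing it at the single-machine or cluster level.

Options 1 and 2 are essentially the same. Option 1 wraps the model in an extra layer for easier management, but the underlying file storage is still OSS or similar and still requires a network download. Under high concurrency this approach runs into a bandwidth bottleneck, which is analyzed in detail later. Cache acceleration technologies such as fluid[3] can relieve the bandwidth pressure on the data source, but the cost and maintenance of cache nodes are complex problems, and the learning curve for users is also somewhat steep.

Option 3 does not suit our cloud applications, especially serverless and multi-tenant scenarios; it is more of an approach for on-premises IDCs.

So how do we accelerate model loading in our cloud serverless scenario?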

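Data cache acceleration

Since container images can be cached, we can cache large-model data in the same way. If the container image and the model are cached in separate cache objects, we get both decoupling and acceleration.

Our data cache acceleration was designed against this background. The new application release flow becomes the one shown in the figure below:

Data sources

User data is generally stored on NFS or OSS, and large models may also live in repositories such as Alibaba Cloud's ModelScope[4] and HuggingFace[5]. We provide cache support for all of these data sources.

In addition, we have optimized for ModelScope and HuggingFace: the more common and popular large models in these repositories are pre-warmed, so users can finish caching a model within seconds and get startup acceleration immediately, without pulling model data over the user's network or the public internet. For example, creating a cache of the 80GB decapoda-research/llama-30b-hf model can complete within 1s.

API design

We provide both the standard Alibaba Cloud OpenAPI and a k8s CRD for managing cache resources.

Caches are managed uniformly through buckets. Each bucket is an independent, complete file system, and users can decide exactly which files are cached under each path. For example, a user can store HuggingFace and ModelScope models in the different buckets and directories below:

For NFS and OSS, the parameters of our cache creation API are fully compatible with the k8s CSI parameter format. For example, if the model is already stored in a directory on NAS, setting the following parameters is enough to cache it automatically: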

{
    "options": {
        "path": "/models/damo/cv_gpen_image-portrait-enhancement/",
        "server": "test-ort70.cn-hangzhou.nas.aliyuncs.com",
        "vers": "3",
        "options": "nolock,proto=tcp,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport"
    },
    "type": "NAS"
}

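For mainstream model repositories such as ModelScope and HuggingFace, the user only needs to set the repository type, the repository name, and the version to cache the data, with no need to first download it to a local development environment or some other intermediate storage system.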
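The two kinds of data source described in this section (a NAS path and a model repository) share the same `type` plus `options` shape, so they can be assembled programmatically. The following is a minimal sketch rather than anything from the ECI SDK: the helper names `repo_data_source` and `nas_data_source` are hypothetical, and only the field names and sample values are taken from the examples in this article.

```python
import json

# Repository sources that appear in this article; the service may accept others.
REPO_SOURCES = ("ModelScope/Model", "HuggingFace/Model")


def repo_data_source(repo_source, repo_id):
    """Hypothetical helper: URL-type data source for a model repository."""
    if repo_source not in REPO_SOURCES:
        raise ValueError(f"unexpected repoSource: {repo_source!r}")
    return {"type": "URL", "options": {"repoId": repo_id, "repoSource": repo_source}}


def nas_data_source(server, path, vers="3", mount_options=None):
    """Hypothetical helper: NAS data source using CSI-style options."""
    options = {"path": path, "server": server, "vers": vers}
    if mount_options:
        # Comma-separated NFS mount options, passed through unchanged.
        options["options"] = mount_options
    return {"type": "NAS", "options": options}


hf = repo_data_source("HuggingFace/Model", "decapoda-research/llama-13b-hf")
nas = nas_data_source(
    "test-ort70.cn-hangzhou.nas.aliyuncs.com",
    "/models/damo/cv_gpen_image-portrait-enhancement/",
)
print(json.dumps(hf, indent=2))
print(json.dumps(nas, indent=2))
```

Rejecting unknown `repoSource` values early is only a convenience for catching typos before a request is sent; the real service decides which sources it accepts.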

A model from Alibaba Cloud ModelScope:

{
    "options": {
        "repoId": "damo/cv_gpen_image-portrait-enhancement",
        "repoSource": "ModelScope/Model"
    },
    "type": "URL"
}

A model from HuggingFace:

{
    "options": {
        "repoId": "decapoda-research/llama-13b-hf",
        "repoSource": "HuggingFace/Model"
    },
    "type": "URL"
}

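Usage

On the usage side, we did not add a new volume type. Instead we build directly on the k8s-native hostPath: each bucket + directory maps uniquely to a virtual host path in the serverless scenario. Each time a container starts, the user simply mounts the cached directory into any directory in the container via hostPath, with the same effect as mounting local data via hostPath.

For example, if I have stored the model llama-7b-hf under /models/llama-7b-hf in the cache and the weights alpaca-lora-7b under /weights/alpaca-lora-7b in the cache, then when creating a k8s pod I only need to mount them via hostPath as follows: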

volumes:
  - name: llama-model
    hostPath:
      path: /models/llama-7b-hf
  - name: alpacalora-weight
    hostPath:
      path: /weights/alpaca-lora-7b
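Because every cached directory needs a `volumes` entry and a matching `volumeMounts` entry with the same name, the two lists can be generated from one mapping. The sketch below is illustrative only: `cache_volumes` is a hypothetical helper, and the container-side mount paths are the ones used in the alpaca-lora deployment later in this article.

```python
import json


def cache_volumes(mounts):
    """Build matching `volumes` and `volumeMounts` lists.

    `mounts` maps a volume name to a (cache_path, container_path) pair.
    """
    volumes, volume_mounts = [], []
    for name, (cache_path, container_path) in mounts.items():
        # The cache path is exposed to the pod as a (virtual) hostPath.
        volumes.append({"name": name, "hostPath": {"path": cache_path}})
        volume_mounts.append({"name": name, "mountPath": container_path})
    return volumes, volume_mounts


volumes, volume_mounts = cache_volumes({
    "llama-model": ("/models/llama-7b-hf", "/data/model/llama-7b-hf"),
    "alpacalora-weight": ("/weights/alpaca-lora-7b", "/data/weight/alpaca-lora-7b"),
})
print(json.dumps({"volumes": volumes, "volumeMounts": volume_mounts}, indent=2))
```

Generating both lists from a single mapping keeps the volume names in sync, which is the usual source of mistakes when these manifests are written by hand.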

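Log in to the container and check the model files:

Hello World: ModelScope

We use damo/nlp_structbert_word-segmentation_chinese-base, a Chinese word segmentation model open-sourced by Alibaba DAMO Academy on ModelScope, to walk through the model cache workflow.

Create the model cache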

apiVersion: eci.aliyun.com/v1alpha1
kind: DataCache
metadata:
  name: word-seg
spec:
  path: /model/ms/ # mount path
  dataSource:
    type: URL # data source type
    options:
      repoSource: ModelScope/Model # for ModelScope and HuggingFace, repoId can be specified directly
      repoId: damo/nlp_structbert_word-segmentation_chinese-base
  securityGroupId: sg-*** # security group used to access the data
  vSwitchId: vsw-*** # vSwitch that can access the data
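When several caches have to be created, the same manifest can be generated rather than copied by hand. The following is a minimal sketch under stated assumptions: `data_cache_manifest` is a hypothetical helper, it only mirrors the fields that appear in this article's DataCache examples, and the `sg-***` and `vsw-***` values are the article's own placeholders.

```python
import json


def data_cache_manifest(name, path, repo_source, repo_id,
                        security_group_id, vswitch_id, bucket=None):
    """Hypothetical helper mirroring the DataCache fields used in this article."""
    spec = {
        "path": path,
        "dataSource": {
            "type": "URL",
            "options": {"repoSource": repo_source, "repoId": repo_id},
        },
        "securityGroupId": security_group_id,
        "vSwitchId": vswitch_id,
    }
    if bucket is not None:
        # When bucket is left out, the article states it falls back to "default".
        spec["bucket"] = bucket
    return {
        "apiVersion": "eci.aliyun.com/v1alpha1",
        "kind": "DataCache",
        "metadata": {"name": name},
        "spec": spec,
    }


manifest = data_cache_manifest(
    name="word-seg",
    path="/model/ms/",
    repo_source="ModelScope/Model",
    repo_id="damo/nlp_structbert_word-segmentation_chinese-base",
    security_group_id="sg-***",
    vswitch_id="vsw-***",
)
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`, since kubectl accepts JSON manifests as well as YAML.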

Besides the k8s CRD, you can also use the API[6]:

# Assumes CreateDataCacheRequest has already been imported from the Alibaba Cloud ECI SDK (see [6]).
request = CreateDataCacheRequest()
request.set_Name("word-seg")
request.set_Path("/model/ms")
data_source = {
    "Type": "URL",
    "Options": {
        "repoId": "damo/nlp_structbert_word-segmentation_chinese-base",
        "repoSource": "ModelScope/Model"
    }
}
request.set_DataSource(data_source)
request.set_Size(20)
request.set_VSwitchId("vsw-***")
request.set_SecurityGroupId("sg-***")
# Optional: attach an EIP for the cache creation.
# eip_create_param = {'Bandwidth': 100}
# request.set_EipCreateParam(eip_create_param)

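After creation, you can log in to the ECI console to view the caching progress and cache details. The cache is ready to use once its status becomes Available.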
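In automation, waiting for the Available status is usually done by polling instead of watching the console. The sketch below shows only the polling loop: `get_status` is a caller-supplied function (hypothetical here) that would query the ECI API or the DataCache resource, and the intermediate status name "Creating" in the demo is a placeholder, not a documented value.

```python
import time


def wait_until_available(get_status, timeout_s=600.0, interval_s=5.0, sleep=time.sleep):
    """Poll `get_status()` until it returns "Available" or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status == "Available":
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"cache still in status {status!r} after {timeout_s}s")
        sleep(interval_s)


# Demo with a fake status source standing in for a real API query.
fake_statuses = iter(["Creating", "Creating", "Available"])
final_status = wait_until_available(lambda: next(fake_statuses), interval_s=0.0)
print(final_status)
```

Passing `sleep` in as a parameter keeps the loop easy to test without real delays.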

Build the container image

Reference: https://modelscope.cn/models/damo/nlp_structbert_word-segmentation_chinese-base/quickstart

You only need to install a Python environment and the modelscope package.

Deploy the application

apiVersion: v1
kind: Pod
metadata:
  name: word-seg
  labels:
    alibabacloud.com/eci: "true"
  annotations:
    k8s.aliyun.com/eci-data-cache-bucket: "default"
spec:
  containers:
    - name: modelscope
      image: registry.cn-hangzhou.aliyuncs.com/modelscope-repo/modelscope:ubuntu20.04-py38-torch1.11.0-tf1.15.5-1.6.1
      command: ["sleep", "999999"]
      volumeMounts:
        - name: "model"
          mountPath: "/model"
  volumes:
    - name: "model"
      hostPath:
        path: "/model/ms/"

Verify

Prepare the script:

from modelscope.models import Model
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from modelscope.preprocessors import TokenClassificationTransformersPreprocessor

model_id = '/model'  # path where the model files are mounted in the container
model = Model.from_pretrained(model_id)
tokenizer = TokenClassificationTransformersPreprocessor(model.model_dir)
pipeline_ins = pipeline(task=Tasks.word_segmentation, model=model, preprocessor=tokenizer)
result = pipeline_ins(input="今天天气不错，适合出去游玩")
print(result)

Run:

kubectl cp code.py word-seg:/code.py
kubectl exec word-seg -- python /code.py
2023-07-06 20:02:48,259 - modelscope - INFO - PyTorch version 1.11.0+cpu Found.
2023-07-06 20:02:48,264 - modelscope - INFO - Loading ast index from /mnt/workspace/.cache/modelscope/ast_indexer
2023-07-06 20:02:48,296 - modelscope - INFO - Loading done! Current index file version is 1.6.1, with md5 3eb986e537c8ebddda950f15ed9a2bae and a total number of 849 components indexed
2023-07-06 20:02:49,524 - modelscope - INFO - initialize model from /model
2023-07-06 20:02:55,689 - modelscope - INFO - cuda is not available, using cpu instead.
{'output': ['今天', '天气', '不错', '，', '适合', '出去', '游玩']}

Hello World: HuggingFace

We use decapoda-research/llama-7b-hf, an open-source llama language model on HuggingFace, to walk through the model cache workflow.

Create the model cache

apiVersion: eci.aliyun.com/v1alpha1
kind: DataCache
metadata:
  name: llama-7b-hf
spec:
  bucket: huggingFace-model # defaults to "default" when omitted
  path: /models/llama-7b-hf # mount path
  dataSource:
    type: URL # data source type
    options:
      repoSource: HuggingFace/Model # for ModelScope and HuggingFace, repoId can be specified directly
      repoId: decapoda-research/llama-7b-hf
  securityGroupId: sg-*** # security group used to access the data
  vSwitchId: vsw-*** # vSwitch that can access the data

Besides the k8s CRD, you can also use the API[6]:

# Assumes CreateDataCacheRequest has already been imported from the Alibaba Cloud ECI SDK (see [6]).
request = CreateDataCacheRequest()
request.set_Name("llama-7b-hf")
request.set_Bucket("huggingFace-model")
request.set_Path("/models/llama-7b-hf")
data_source = {
    "Type": "URL",
    "Options": {
        "repoId": "decapoda-research/llama-7b-hf",
        "repoSource": "HuggingFace/Model"
    }
}
request.set_DataSource(data_source)
request.set_Size(20)
request.set_VSwitchId("vsw-***")
request.set_SecurityGroupId("sg-***")
# Optional: attach an EIP for the cache creation.
# eip_create_param = {'Bandwidth': 100}
# request.set_EipCreateParam(eip_create_param)

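After creation, you can log in to the ECI console to view the caching progress and cache details. The cache is ready to use once its status becomes Available.

Create the cache for the model weights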

apiVersion: eci.aliyun.com/v1alpha1
kind: DataCache
metadata:
  name: alpaca-lora-7b
spec:
  bucket: huggingFace-model
  path: /weights/alpaca-lora-7b # mount path
  dataSource:
    type: URL # data source type
    options:
      repoSource: HuggingFace/Model # for ModelScope and HuggingFace, repoId can be specified directly
      repoId: tloen/alpaca-lora-7b
  securityGroupId: sg-*** # security group used to access the data
  vSwitchId: vsw-*** # vSwitch that can access the data

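Because we have already pre-warmed these, both caches are created within seconds.

Build the container image

Reference: https://github.com/tloen/alpaca-lora

You can also use the image published by ECI directly: registry.cn-hangzhou.aliyuncs.com/eci_open/alpaca-lora:1.0.0. A public image cache has already been built for it, so no user needs to pull it.

Deploy the application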

{
    "metadata": {
        "annotations": {
            "k8s.aliyun.com/eci-image-cache": "true",
            "k8s.aliyun.com/eci-data-cache-provisionedIops": "35000",
            "k8s.aliyun.com/eci-data-cache-burstingEnabled": "true",
            "k8s.aliyun.com/eci-use-specs": "64-128G",
            "k8s.aliyun.com/eci-data-cache-bucket": "huggingFace-model",
            "k8s.aliyun.com/eci-extra-ephemeral-storage": "30Gi"
        },
        "name": "alpacalora",
        "namespace": "default"
    },
    "spec": {
        "containers": [
            {
                "args": [
                    "-c",
                    "python3.10 generate.py --load_8bit --base_model /data/model/llama-7b-hf --lora_weights /data/weight/alpaca-lora-7b"
                ],
                "command": [
                    "/bin/sh"
                ],
                "image": "registry.cn-hangzhou.aliyuncs.com/eci_open/alpaca-lora:1.0.0",
                "imagePullPolicy": "IfNotPresent",
                "name": "alpacalora",
                "volumeMounts": [
                    {
                        "mountPath": "/data/weight/alpaca-lora-7b",
                        "name": "alpacalora-weight"
                    },
                    {
                        "mountPath": "/data/model/llama-7b-hf",
                        "name": "llama-model"
                    }
                ]
            }
        ],
        "restartPolicy": "Never",
        "volumes": [
            {
                "hostPath": {
                    "path": "/models/llama-7b-hf"
                },
                "name": "llama-model"
            },
            {
                "hostPath": {
                    "path": "/weights/alpaca-lora-7b"
                },
                "name": "alpacalora-weight"
            }
        ]
    }
}

Parameter description:

"k8s.aliyun.com/eci-image-cache": "true", # enabling the container image cache speeds up image loading
"k8s.aliyun.com/eci-data-cache-provisionedIops": "35000",
"k8s.aliyun.com/eci-data-cache-burstingEnabled": "true", # enabling burst speeds up loading the cache into memory
"k8s.aliyun.com/eci-use-specs": "64-128G", # this language model can run inference on CPU directly; choosing a GPU would be faster
"k8s.aliyun.com/eci-data-cache-bucket": "huggingFace-model", # bucket that the virtual hostPath belongs to
"k8s.aliyun.com/eci-extra-ephemeral-storage": "30Gi" # expand the ephemeral storage

Verify:

For more available Llama models, see [7].
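One detail of the deployment in this section is easy to get wrong when manifests are generated by code: Kubernetes annotation values must be strings, so booleans and numbers such as the IOPS setting have to be rendered as "true" or "35000". The helper below is a hypothetical sketch of that conversion; only the annotation keys and values are taken from this article.

```python
def to_annotations(settings, prefix="k8s.aliyun.com/"):
    """Render typed settings as string-valued pod annotations."""
    annotations = {}
    for key, value in settings.items():
        if isinstance(value, bool):
            # str(True) would give "True"; the manifests here use lowercase.
            text = "true" if value else "false"
        else:
            text = str(value)
        annotations[prefix + key] = text
    return annotations


annotations = to_annotations({
    "eci-image-cache": True,
    "eci-data-cache-provisionedIops": 35000,
    "eci-data-cache-burstingEnabled": True,
    "eci-use-specs": "64-128G",
    "eci-data-cache-bucket": "huggingFace-model",
    "eci-extra-ephemeral-storage": "30Gi",
})
for key, value in annotations.items():
    print(f"{key}: {value!r}")
```

The boolean check comes before the generic `str()` call because `bool` is a subclass of `int` in Python and would otherwise be rendered with a capital letter.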

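Performance comparison

For models of different sizes, we compared application startup time (mainly the time to load the model into memory plus the application's own startup time) under different levels of concurrency. We recorded the load time by instrumenting the startup flow. A sample follows: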

Sat Jul 15 08:27:03 UTC 2023
/usr/local/lib/python3.10/dist-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'. 
The class this function is called from is 'LlamaTokenizer'.

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cpu.so
/usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32
CUDA SETUP: Loading binary /usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cpu.so...

Loading checkpoint shards:   0%|          | 0/41 [00:00<?, ?it/s]
Loading checkpoint shards:   2%|▏         | 1/41 [00:01<00:45,  1.13s/it]
Loading checkpoint shards:   5%|▍         | 2/41 [00:02<00:42,  1.09s/it]
Loading checkpoint shards:   7%|▋         | 3/41 [00:03<00:41,  1.09s/it]
Loading checkpoint shards:  10%|▉         | 4/41 [00:04<00:38,  1.05s/it]
Loading checkpoint shards:  12%|█▏        | 5/41 [00:05<00:37,  1.05s/it]
Loading checkpoint shards:  15%|█▍        | 6/41 [00:06<00:36,  1.05s/it]
Loading checkpoint shards:  17%|█▋        | 7/41 [00:07<00:35,  1.05s/it]
Loading checkpoint shards:  20%|█▉        | 8/41 [00:08<00:35,  1.06s/it]
Loading checkpoint shards:  22%|██▏       | 9/41 [00:09<00:33,  1.06s/it]
Loading checkpoint shards:  24%|██▍       | 10/41 [00:10<00:32,  1.05s/it]
Loading checkpoint shards:  27%|██▋       | 11/41 [00:11<00:31,  1.06s/it]
Loading checkpoint shards:  29%|██▉       | 12/41 [00:12<00:31,  1.07s/it]
Loading checkpoint shards:  32%|███▏      | 13/41 [00:13<00:30,  1.07s/it]
Loading checkpoint shards:  34%|███▍      | 14/41 [00:14<00:28,  1.07s/it]
Loading checkpoint shards:  37%|███▋      | 15/41 [00:15<00:27,  1.06s/it]
Loading checkpoint shards:  39%|███▉      | 16/41 [00:16<00:26,  1.05s/it]
Loading checkpoint shards:  41%|████▏     | 17/41 [00:17<00:25,  1.04s/it]
Loading checkpoint shards:  44%|████▍     | 18/41 [00:19<00:23,  1.04s/it]
Loading checkpoint shards:  46%|████▋     | 19/41 [00:20<00:22,  1.02s/it]
Loading checkpoint shards:  49%|████▉     | 20/41 [00:21<00:21,  1.04s/it]
Loading checkpoint shards:  51%|█████     | 21/41 [00:22<00:20,  1.03s/it]
Loading checkpoint shards:  54%|█████▎    | 22/41 [00:23<00:19,  1.05s/it]
Loading checkpoint shards:  56%|█████▌    | 23/41 [00:24<00:19,  1.06s/it]
Loading checkpoint shards:  59%|█████▊    | 24/41 [00:25<00:18,  1.06s/it]
Loading checkpoint shards:  61%|██████    | 25/41 [00:26<00:17,  1.07s/it]
Loading checkpoint shards:  63%|██████▎   | 26/41 [00:27<00:16,  1.08s/it]
Loading checkpoint shards:  66%|██████▌   | 27/41 [00:28<00:15,  1.10s/it]
Loading checkpoint shards:  68%|██████▊   | 28/41 [00:29<00:14,  1.09s/it]
Loading checkpoint shards:  71%|███████   | 29/41 [00:30<00:13,  1.09s/it]
Loading checkpoint shards:  73%|███████▎  | 30/41 [00:31<00:12,  1.11s/it]
Loading checkpoint shards:  76%|███████▌  | 31/41 [00:33<00:10,  1.10s/it]
Loading checkpoint shards:  78%|███████▊  | 32/41 [00:34<00:09,  1.08s/it]
Loading checkpoint shards:  80%|████████  | 33/41 [00:35<00:08,  1.07s/it]
Loading checkpoint shards:  83%|████████▎ | 34/41 [00:36<00:07,  1.05s/it]
Loading checkpoint shards:  85%|████████▌ | 35/41 [00:37<00:06,  1.06s/it]
Loading checkpoint shards:  88%|████████▊ | 36/41 [00:38<00:05,  1.06s/it]
Loading checkpoint shards:  90%|█████████ | 37/41 [00:39<00:04,  1.06s/it]
Loading checkpoint shards:  93%|█████████▎| 38/41 [00:40<00:03,  1.05s/it]
Loading checkpoint shards:  95%|█████████▌| 39/41 [00:41<00:02,  1.04s/it]
Loading checkpoint shards:  98%|█████████▊| 40/41 [00:42<00:01,  1.02s/it]
Loading checkpoint shards: 100%|██████████| 41/41 [00:43<00:00,  1.03s/it]
Loading checkpoint shards: 100%|██████████| 41/41 [00:43<00:00,  1.06s/it]
84.67508792877197
Sat Jul 15 08:28:40 UTC 2023
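The article does not show how the startup flow was instrumented, so the following is only a sketch of one way such numbers could be collected: a small timer around the load call (which would produce a value like the 84.67... line in the sample) and a helper that computes the end-to-end time from the two `date`-style stamps that bracket the log. Both `timed` and `elapsed_between` are hypothetical names.

```python
import time
from contextlib import contextmanager
from datetime import datetime


@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


def elapsed_between(start_stamp, end_stamp):
    """Seconds between two UTC stamps formatted like the ones in the log."""
    fmt = "%a %b %d %H:%M:%S UTC %Y"
    delta = datetime.strptime(end_stamp, fmt) - datetime.strptime(start_stamp, fmt)
    return delta.total_seconds()


durations = {}
with timed("model_load", durations):
    time.sleep(0.01)  # stand-in for the real model-loading call

print(durations["model_load"] > 0)
print(elapsed_between("Sat Jul 15 08:27:03 UTC 2023",
                      "Sat Jul 15 08:28:40 UTC 2023"))  # 97.0
```

For the sample above, the two stamps are 97 seconds apart, which covers process startup as well as the model load itself.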

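Group 1

Model: decapoda-research/llama-7b-hf

Model size: 13GB

Group 2

Model: decapoda-research/llama-13b-hf

Model size: 38GB

Group 3

Loading from NAS while varying the application concurrency, and recording the application's startup time for each model

From the comparison, we found:

1. Without concurrency, the cache has no clear advantage over NAS in loading speed; it is only about 20% faster.

2. As concurrency increases, the cache's advantage becomes clear: its loading speed is almost unaffected by concurrency, whereas NAS loading slows down markedly, with startup time growing nearly linearly with the number of concurrent instances.

3. The larger the model, the greater the advantage of data cache + Burst over data cache alone.

So if the model is already stored in NAS and the data does not need to be loaded concurrently, you can load directly from NAS. If concurrent loading matters, we recommend pre-warming with the data cache, which solves the slow concurrent loading problem. In practice, a model repository source is shared within some scope, so concurrent pulls are hard to avoid and caching is well worth doing.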
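The near-linear growth of NAS startup time with concurrency is what a simple shared-bandwidth model predicts. The sketch below is a back-of-the-envelope illustration, not a reproduction of the measurements: it assumes a single source whose total throughput is split evenly among readers, ignores disk and protocol overhead, and uses a hypothetical 10 Gbit/s figure for both paths. Only the 13GB model size comes from the article.

```python
def shared_source_seconds(model_gb, concurrency, source_gbit_per_s):
    """Load time when `concurrency` readers split one source's bandwidth evenly."""
    per_reader = source_gbit_per_s / concurrency
    return model_gb * 8 / per_reader


def per_instance_seconds(model_gb, instance_gbit_per_s):
    """Load time when every instance reads its own cache at a fixed rate."""
    return model_gb * 8 / instance_gbit_per_s


MODEL_GB = 13            # llama-7b-hf size quoted in the article
ASSUMED_GBIT_PER_S = 10  # hypothetical throughput, not a measured figure

for n in (1, 2, 4, 8):
    shared = shared_source_seconds(MODEL_GB, n, ASSUMED_GBIT_PER_S)
    cached = per_instance_seconds(MODEL_GB, ASSUMED_GBIT_PER_S)
    print(f"concurrency={n}: shared={shared:.1f}s cached={cached:.1f}s")
```

Under these assumptions the shared-source time scales in proportion to the number of readers while the per-instance time stays flat, which matches the shape of findings 1 and 2, though not their exact values.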

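Summary

Container image acceleration is now very mature: for example the Alibaba Cloud container image cache used in this article, p2p distribution, and open-source on-demand loading technologies such as dadi and nydus. However, none of these has much effect on loading large model files.

Unlike container images, which are built in layers, a model cannot be reused across versions. Even though packaging the model directly into the container image lets you use these existing technologies, loading a single image layer is still a bottleneck, because the model files usually correspond to one very large layer in the image.

The concept of MaaS has recently been put forward. It rests on the fact that models are gradually gaining relatively independent storage and version management, and OCI-like concepts have also been proposed. Decoupling models from applications is an inevitable trend.

To decouple model loading from container image loading, we provide model caching: the model neither has to be loaded from a remote repository nor packaged into the application image, and can be used just like a local file. We have also greatly simplified creating and using model caches. Of course, large models are only one typical use case for our data cache. Any data that needs elastic, high-performance loading can be split out of the container image into its own cache.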