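Background
ModelScope is, in effect, a Hugging Face-style community better suited to users in China; if nothing else, it serves well as a mirror for downloading and obtaining models. Beyond model downloads, ModelScope provides a set of interfaces whose syntax and definitions resemble those of the Transformers library. At the time of writing (January 2024), the ModelScope community had not announced support for Ascend hardware. After reading through the ModelScope source, however, I found that a few basic changes are enough to run the original ModelScope code on Ascend hardware, thanks to the code's readability and loose coupling; several of ModelScope's official examples run successfully.
The specific changes are described below.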
Test environment
pytorch == 2.1.0
modelscope == 1.9.4
hardware == 910B1
Official example
from modelscope.pipelines import pipeline
word_segmentation = pipeline('word-segmentation',model='damo/nlp_structbert_word-segmentation_chinese-base')
input_str = '今天天气不错,适合出去游玩'
print(word_segmentation(input_str))
To make it easier to see where the model runs, we slightly modify the print statement so that it also reports the device:
from modelscope.pipelines import pipeline
word_segmentation = pipeline('word-segmentation',model='damo/nlp_structbert_word-segmentation_chinese-base')
input_str = '今天天气不错,适合出去游玩'
print("word segment result is {} on device {}".format(word_segmentation(input_str), next(word_segmentation.model.parameters()).device))
The output is:
word segment result is {'output': ['今天', '天气', '不错', ',', '适合', '出去', '游玩']} on device cpu
Following the usual way of specifying a device, we now set the device to the NPU:
from modelscope.pipelines import pipeline
import torch_npu
device = "npu:0"
word_segmentation_npu = pipeline('word-segmentation',model='damo/nlp_structbert_word-segmentation_chinese-base', device = device)
input_str = '今天天气不错,适合出去游玩'
print("word segment result is {} on device {}".format(word_segmentation_npu(input_str), next(word_segmentation_npu.model.parameters()).device))
Running the code above raises an error:
(PyTorch-2.1.0) [root@4bfd19a25abf playground]# python npu_orig.py
2024-01-16 09:05:49,901 - modelscope - INFO - PyTorch version 2.1.0 Found.
2024-01-16 09:05:49,902 - modelscope - INFO - Loading ast index from /root/.cache/modelscope/ast_indexer
2024-01-16 09:05:50,107 - modelscope - INFO - Loading done! Current index file version is 1.9.4, with md5 6354b5190fb2274895e8f10bfc329a7d and a total number of 945 components indexed
Warning : ASCEND_HOME_PATH environment variable is not set.
2024-01-16 09:05:53,885 - modelscope - WARNING - Model revision not specified, use revision: v1.0.3
Traceback (most recent call last):
File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/modelscope/utils/registry.py", line 212, in build_from_cfg
return obj_cls(**args)
File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/modelscope/pipelines/nlp/token_classification_pipeline.py", line 50, in __init__
super().__init__(
File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/modelscope/pipelines/base.py", line 95, in __init__
verify_device(device)
File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/modelscope/utils/device.py", line 27, in verify_device
assert eles[0] in ['cpu', 'cuda', 'gpu'], err_msg
AssertionError: device should be either cpu, cuda, gpu, gpu:X or cuda:X where X is the ordinal for gpu device.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/aicc/playground/npu_orig.py", line 5, in <module>
word_segmentation_npu = pipeline('word-segmentation',model='damo/nlp_structbert_word-segmentation_chinese-base', device = device)
File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/modelscope/pipelines/builder.py", line 164, in pipeline
return build_pipeline(cfg, task_name=task)
File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/modelscope/pipelines/builder.py", line 67, in build_pipeline
return build_from_cfg(
File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/modelscope/utils/registry.py", line 215, in build_from_cfg
raise type(e)(f'{obj_cls.__name__}: {e}')
AssertionError: WordSegmentationPipeline: device should be either cpu, cuda, gpu, gpu:X or cuda:X where X is the ordinal for gpu device.
(PyTorch-2.1.0) [root@4bfd19a25abf playground]#
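The assertion can be reproduced in isolation. The following is a minimal sketch that mirrors the prefix check shown in the traceback; the names `ALLOWED_PREFIXES` and `original_check` are hypothetical and are not part of ModelScope. It illustrates that any device prefix outside cpu, cuda, and gpu is rejected by a plain string comparison, before PyTorch is consulted at all.

```python
# The allowed prefixes, copied from the assertion shown in the traceback.
ALLOWED_PREFIXES = ['cpu', 'cuda', 'gpu']


def original_check(device_name: str) -> bool:
    """Hypothetical stand-in for the prefix check in modelscope 1.9.4."""
    prefix = device_name.lower().split(':')[0]
    return prefix in ALLOWED_PREFIXES


for name in ['cpu', 'cuda:0', 'gpu:1', 'npu:0']:
    status = 'accepted' if original_check(name) else 'rejected'
    print(f'{name} -> {status}')
```

Since the rejection is purely a string check, extending the allowed list is a natural first step toward NPU support.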
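The devices currently registered in modelscope do not include npu, so the next step is to make a small change to modelscope/utils/device.py. Three functions in this file need to be modified: verify_device, device_placement, and create_device.
The modified device.py is pasted here in full for reference: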
# Copyright (c) Alibaba, Inc. and its affiliates.
import os
from contextlib import contextmanager
from modelscope.utils.constant import Devices, Frameworks
from modelscope.utils.logger import get_logger
logger = get_logger()
def verify_device(device_name):
    """ Verify device is valid, device should be either cpu, cuda, gpu, npu, cuda:X, gpu:X or npu:X.
    Args:
        device_name (str): device str, should be either cpu, cuda, gpu, npu, gpu:X, cuda:X or npu:X
            where X is the ordinal for gpu or npu device.
    Return:
        device info (tuple): device_type and device_id, if device_id is not set, will use 0 as default.
    """
    err_msg = 'device should be either cpu, cuda, gpu, npu, gpu:X, cuda:X or npu:X where X is the ordinal for gpu or npu device.'
    assert device_name is not None and device_name != '', err_msg
    device_name = device_name.lower()
    eles = device_name.split(':')
    assert len(eles) <= 2, err_msg
    # 'npu' is added to the list of accepted device prefixes.
    assert eles[0] in ['cpu', 'cuda', 'gpu', 'npu'], err_msg
    device_type = eles[0]
    device_id = None
    if len(eles) > 1:
        device_id = int(eles[1])
    if device_type == 'cuda':
        device_type = Devices.gpu
    # Default the ordinal to 0 for npu as well, so a bare 'npu' does not become 'npu:None'.
    if device_type in (Devices.gpu, 'npu') and device_id is None:
        device_id = 0
    return device_type, device_id
@contextmanager
def device_placement(framework, device_name='gpu:0'):
    """ Device placement function, allow user to specify which device to place model or tensor
    Args:
        framework (str): tensorflow or pytorch.
        device_name (str): gpu, npu or cpu to use, if you want to specify certain gpu,
            use gpu:$gpu_id or cuda:$gpu_id; for a certain npu, use npu:$npu_id.
    Returns:
        Context manager
    Examples:
        >>> # Requests for using model on cuda:0 for gpu
        >>> with device_placement('pytorch', device_name='gpu:0'):
        >>>     model = Model.from_pretrained(...)
    """
    device_type, device_id = verify_device(device_name)
    if framework == Frameworks.tf:
        import tensorflow as tf
        if device_type == Devices.gpu and not tf.test.is_gpu_available():
            logger.debug(
                'tensorflow: cuda is not available, using cpu instead.')
            device_type = Devices.cpu
        if device_type == Devices.cpu:
            with tf.device('/CPU:0'):
                yield
        else:
            if device_type == Devices.gpu:
                with tf.device(f'/device:gpu:{device_id}'):
                    yield
    elif framework == Frameworks.torch:
        import torch
        import torch_npu
        if device_type == Devices.gpu:
            if torch.cuda.is_available():
                torch.cuda.set_device(f'cuda:{device_id}')
            else:
                logger.debug(
                    'pytorch: cuda is not available, using cpu instead.')
        elif device_type == "npu":
            # New branch: select the requested Ascend NPU.
            torch.npu.set_device(f'npu:{device_id}')
        yield
    else:
        yield
def create_device(device_name):
    """ create torch device
    Args:
        device_name (str): cpu, gpu, gpu:0, cuda:0 etc.
    """
    import torch
    import torch_npu
    device_type, device_id = verify_device(device_name)
    use_cuda = False
    if device_type == Devices.gpu:
        use_cuda = True
        if not torch.cuda.is_available():
            logger.info('cuda is not available, using cpu instead.')
            use_cuda = False
    if device_type == "npu":
        torch_npu.npu.set_device(f"npu:{device_id}")
        device = torch.device(f"npu:{device_id}")
    elif use_cuda:
        device = torch.device(f'cuda:{device_id}')
    else:
        device = torch.device('cpu')
    return device
def get_device():
    import torch
    from torch import distributed as dist
    if torch.cuda.is_available():
        if dist.is_available() and dist.is_initialized(
        ) and 'LOCAL_RANK' in os.environ:
            device_id = f"cuda:{os.environ['LOCAL_RANK']}"
        else:
            device_id = 'cuda:0'
    else:
        device_id = 'cpu'
    return torch.device(device_id)
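Because the modified device.py imports torch_npu, a script that is meant to run on hosts both with and without an NPU may want to choose the device string up front. The following is a minimal sketch of such a choice; `pick_device` is a hypothetical helper, not part of ModelScope or torch_npu, and it assumes that `torch_npu.npu.is_available()` reports whether an Ascend device is usable.

```python
def pick_device(preferred: str = 'npu:0') -> str:
    """Return `preferred` if an Ascend NPU appears usable, otherwise 'cpu'.

    Hypothetical helper for illustration only.
    """
    try:
        import torch  # noqa: F401  (imported first so torch_npu can extend it)
        import torch_npu
        if torch_npu.npu.is_available():
            return preferred
    except Exception:
        # torch or torch_npu is missing, or the Ascend runtime is not set up.
        pass
    return 'cpu'


device = pick_device()
print(f'using device {device}')
```

The resulting string can then be passed as the `device` argument of `pipeline(...)`, so the same script falls back to the CPU where no NPU is present.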
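Result comparison
We add timing around each call and compare the two runs. In this single-run measurement the NPU is about 5.2x faster than the CPU (roughly 0.56 s versus 2.92 s per call). Note that result is a dict with a single key, so len(result) is 1 and the figure printed as tokens/s is really calls per second.
The benchmark script is: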
from modelscope.pipelines import pipeline
import torch_npu
import time
word_segmentation = pipeline('word-segmentation',model='damo/nlp_structbert_word-segmentation_chinese-base')
input_str = '今天天气不错,适合出去游玩'
tik = time.time()
result = word_segmentation(input_str)
tok = time.time()
print("word segment result is {} on device {} with perf {} tokens/s".format(result, next(word_segmentation.model.parameters()).device, len(result)/(tok-tik)))
device = "npu:0"
word_segmentation_npu = pipeline('word-segmentation',model='damo/nlp_structbert_word-segmentation_chinese-base', device = device)
input_str = '今天天气不错,适合出去游玩'
tik = time.time()
result = word_segmentation_npu(input_str)
tok = time.time()
print("word segment result is {} on device {} with perf {} tokens/s".format(result, next(word_segmentation_npu.model.parameters()).device, len(result)/(tok-tik)))
The output is (some redundant log lines have been removed):
word segment result is {'output': ['今天', '天气', '不错', ',', '适合', '出去', '游玩']} on device cpu with perf 0.34250816868505096 tokens/s
word segment result is {'output': ['今天', '天气', '不错', ',', '适合', '出去', '游玩']} on device npu:0 with perf 1.7934692348675776 tokens/s
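The script above times a single call per device. A somewhat more robust measurement would discard a warm-up call, repeat the timed call several times, and count the items in `result['output']` rather than the keys of the result dict. The following is a minimal sketch under those assumptions; `tokens_per_second` and `fake_pipeline` are hypothetical names, and the fake pipeline only stands in for the real one so that the sketch runs without ModelScope.

```python
import time


def tokens_per_second(run, text, warmup=1, repeats=3):
    """Time `run(text)` and return throughput in output tokens per second.

    `run` is any callable returning a dict with an 'output' list, like the
    word-segmentation pipeline. Hypothetical helper for illustration only.
    """
    for _ in range(warmup):
        run(text)  # untimed call: absorbs one-off initialisation cost
    n_tokens = 0
    start = time.perf_counter()
    for _ in range(repeats):
        n_tokens += len(run(text)['output'])
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed


def fake_pipeline(text):
    """Stand-in for a real pipeline: one 'token' per whitespace-separated word."""
    time.sleep(0.01)
    return {'output': text.split()}


rate = tokens_per_second(fake_pipeline, 'a b c d')
print(f'{rate:.1f} tokens/s')
```

With the real pipelines, `tokens_per_second(word_segmentation, input_str)` and `tokens_per_second(word_segmentation_npu, input_str)` would be the corresponding calls; the numbers reported above were not produced this way.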