🙋 ModelScope community highlights this week:
📟 4848 models: the Emu3 series, GLM-4-Voice, stable-diffusion-3.5-large, Janus-1.3B, and more;
📁 45 datasets: CCI3-HQ-Annotation-Benchmark, SWE-bench, simpletuner_venv, and more;
🎨 46 innovative apps: SD3.5-turbo fast image generation, Alibaba Tora trajectory-oriented video generation, open-notebooklm-demo, and more;
📄 7 articles:
- GLM-4-Voice: Zhipu's open-source take on "Her" is here!
- A unified multimodal model arrives! BAAI releases Emu3, a multimodal world model!
- Compass Arena releases its multimodal LLM arena leaderboard!
- Today's highlight: "AI uses phones and computers like a human", and the ModelScope community's open-source project is already a step ahead
- DeepSeek open-sources Janus, a multimodal LLM framework: ModelScope best practices
- Students with an edu email address: come claim your exclusive (free) GPU!
- MemoryScope: a long-term memory system for LLM chatbots
Featured Models
GLM-4-Voice
Zhipu AI has released and open-sourced GLM-4-Voice, an end-to-end speech model. GLM-4-Voice can directly understand and generate Chinese and English speech, hold real-time voice conversations, and follow user instructions to change attributes of the speech such as emotion, intonation, speaking rate, and dialect.
GLM-4-Voice consists of three components:
- GLM-4-Voice-Tokenizer: adds vector quantization to the Whisper encoder and is trained with supervision on ASR data, converting continuous speech input into discrete tokens; on average, one second of audio needs only 12.5 discrete tokens (a standalone sketch of this stage follows the list).
- GLM-4-Voice-Decoder: a speech decoder with streaming inference support, trained on the flow-matching architecture from CosyVoice, which turns the discrete speech tokens back into a continuous waveform. Generation can start from as few as 10 speech tokens, reducing end-to-end conversation latency.
- GLM-4-Voice-9B: GLM-4-9B with additional speech-modality pre-training and alignment, so that it can understand and generate the discrete speech tokens.
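To make the tokenizer stage concrete, here is a minimal sketch that turns an audio clip into discrete tokens. It reuses the speech_tokenizer utilities from the GLM-4-Voice repo (so it must be run from inside the cloned directory, as in the full example further below), and example.wav is a placeholder path:

from modelscope import snapshot_download
from speech_tokenizer.modeling_whisper import WhisperVQEncoder
from speech_tokenizer.utils import extract_speech_token
from transformers import WhisperFeatureExtractor

tokenizer_path = snapshot_download('ZhipuAI/glm-4-voice-tokenizer')
whisper_model = WhisperVQEncoder.from_pretrained(tokenizer_path).eval().to('cuda:0')
feature_extractor = WhisperFeatureExtractor.from_pretrained(tokenizer_path)

# 'example.wav' is a placeholder; at roughly 12.5 tokens per second of audio,
# a 4-second clip should come back as about 50 discrete token ids.
speech_tokens = extract_speech_token(whisper_model, feature_extractor, ['example.wav'])[0]
print(len(speech_tokens), speech_tokens[:10])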
Model links:
GLM-4-Voice-Tokenizer:
https://modelscope.cn/models/ZhipuAI/glm-4-voice-tokenizer
GLM-4-Voice-9B:
https://modelscope.cn/models/ZhipuAI/glm-4-voice-9b
GLM-4-Voice-Decoder:
https://modelscope.cn/models/ZhipuAI/glm-4-voice-decoder
Code example:
Environment setup:
git clone https://github.com/THUDM/GLM-4-Voice.git
pip install matcha-tts torchaudio hyperpyyaml
cd GLM-4-Voice
# If you run into environment issues, run the following command
pip install -r requirements.txt
Then run the following code from inside the `GLM-4-Voice` directory:
import os
import uuid
from typing import List, Optional, Tuple

import torch
import torchaudio
from flow_inference import AudioDecoder
from modelscope import snapshot_download
from speech_tokenizer.modeling_whisper import WhisperVQEncoder
from speech_tokenizer.utils import extract_speech_token
from transformers import AutoModel, AutoTokenizer, GenerationConfig, WhisperFeatureExtractor


class GLM4Voice:

    def _prepare_model(self):
        model_path = snapshot_download('ZhipuAI/glm-4-voice-9b')
        decoder_path = snapshot_download('ZhipuAI/glm-4-voice-decoder')
        tokenizer_path = snapshot_download('ZhipuAI/glm-4-voice-tokenizer')
        flow_config = os.path.join(decoder_path, 'config.yaml')
        flow_checkpoint = os.path.join(decoder_path, 'flow.pt')
        hift_checkpoint = os.path.join(decoder_path, 'hift.pt')

        # GLM
        self.model = AutoModel.from_pretrained(model_path, trust_remote_code=True, device_map=self.device).eval()
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, use_fast=False)

        # Flow & Hift
        self.audio_decoder = AudioDecoder(
            config_path=flow_config,
            flow_ckpt_path=flow_checkpoint,
            hift_ckpt_path=hift_checkpoint,
            device=self.device)

        # Speech tokenizer
        self.whisper_model = WhisperVQEncoder.from_pretrained(tokenizer_path).eval().to(self.device)
        self.feature_extractor = WhisperFeatureExtractor.from_pretrained(tokenizer_path)

    def clear(self):
        self.previous_tokens = ''

    def __init__(self, generation_config=None):
        if generation_config is None:
            generation_config = GenerationConfig(top_p=0.8, temperature=0.2, max_new_tokens=2000, do_sample=True)
        self.generation_config = generation_config
        self.device = 'cuda:0'
        self._prepare_model()
        self.audio_offset = self.tokenizer.convert_tokens_to_ids('<|audio_0|>')
        self.end_token_id = self.tokenizer.convert_tokens_to_ids('<|user|>')
        self.clear()

    def infer(self, audio_path: Optional[str] = None, text: Optional[str] = None) -> Tuple[str, str]:
        if audio_path is not None:
            audio_tokens = extract_speech_token(self.whisper_model, self.feature_extractor, [audio_path])[0]
            audio_tokens = ''.join([f'<|audio_{x}|>' for x in audio_tokens])
            audio_tokens = '<|begin_of_audio|>' + audio_tokens + '<|end_of_audio|>'
            user_input = audio_tokens
            system_prompt = 'User will provide you with a speech instruction. Do it step by step. First, think about the instruction and respond in a interleaved manner, with 13 text token followed by 26 audio tokens. '
        else:
            user_input = text
            system_prompt = 'User will provide you with a text instruction. Do it step by step. First, think about the instruction and respond in a interleaved manner, with 13 text token followed by 26 audio tokens.'

        text = self.previous_tokens
        text = text.strip()
        if '<|system|>' not in text:
            text += f'<|system|>\n{system_prompt}'
        text += f'<|user|>\n{user_input}<|assistant|>streaming_transcription\n'

        inputs = self.tokenizer([text], return_tensors='pt').to(self.device)
        generate_ids = self.model.generate(**inputs, generation_config=self.generation_config)[0]
        generate_ids = generate_ids[inputs['input_ids'].shape[1]:]
        # `text` already contains the previous turns, so assign (not append) to keep a single copy of the history
        self.previous_tokens = text + self.tokenizer.decode(generate_ids, spaces_between_special_tokens=False)
        return self._parse_generate_ids(generate_ids)

    def _parse_generate_ids(self, generate_ids: List[int]) -> Tuple[str, str]:
        text_tokens, audio_tokens = [], []
        this_uuid = str(uuid.uuid4())
        for token_id in generate_ids.tolist():
            if token_id >= self.audio_offset:
                audio_tokens.append(token_id - self.audio_offset)
            elif token_id != self.end_token_id:
                text_tokens.append(token_id)
        audio_tokens_pt = torch.tensor(audio_tokens, device=self.device)[None]
        tts_speech, _ = self.audio_decoder.token2wav(audio_tokens_pt, uuid=this_uuid, finalize=True)
        audio_path = f'{this_uuid}.wav'
        with open(audio_path, 'wb') as f:
            torchaudio.save(f, tts_speech.cpu(), 22050, format='wav')
        response = self.tokenizer.decode(text_tokens)
        return response, audio_path


if __name__ == '__main__':
    generation_config = GenerationConfig(top_p=0.8, temperature=0.2, max_new_tokens=2000, do_sample=True)
    glm_voice = GLM4Voice(generation_config=generation_config)

    audio_path = 'http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/weather.wav'
    response, output_path = glm_voice.infer(audio_path=audio_path)
    print(f'response: {response}\noutput_path: {output_path}')

    text = '请用英文回答'  # "Please answer in English"
    response, output_path = glm_voice.infer(text=text)
    print(f'response: {response}\noutput_path: {output_path}')

    glm_voice.clear()  # clear the dialogue history
    text = '请用英文回答'
    response, output_path = glm_voice.infer(text=text)
    print(f'response: {response}\noutput_path: {output_path}')
    """
    response: 是啊,阳光明媚的,真是个出门走走的好日子!你今天有什么计划吗?
    output_path: 7f146cb5-4c1f-4c2c-85d0-0a8c985c90c0.wav
    response: Sure! Today's weather is really nice, isn't it? It's a great day to go out and enjoy some fresh air. Do you have any plans for today?
    output_path: 9326df35-aeec-4292-856b-5c0b1688e3f8.wav
    response: Sure, I'll answer in English. What would you like to know?
    output_path: e6e7c94b-7532-475f-bea7-e41566a954b6.wav
    """
More usage tutorials are available in the related community articles listed below.
Emu3 Series
On October 21, 2024, the Beijing Academy of Artificial Intelligence (BAAI) officially released Emu3, a native multimodal world model. The model is trained with a single Transformer by converting images, text, and video into tokens in a discrete space. Using next-token prediction alone, with no diffusion models or compositional pipelines, it can both understand and generate data across the three modalities (text, image, video), outperforming traditional task-specific models and reaching SOTA results on both generation and perception tasks.
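The unifying idea is that every modality ends up as ids in one shared discrete vocabulary, so a single decoder-only model can be trained purely with next-token prediction. Here is a toy illustration of that setup (not Emu3's actual code: the vocabulary sizes and the embedding-plus-linear "trunk" are stand-ins for the real tokenizers and Transformer):

import torch
import torch.nn as nn
import torch.nn.functional as F

TEXT_VOCAB, VISION_VOCAB = 1000, 4096     # stand-in sizes, not Emu3's real ones
VOCAB = TEXT_VOCAB + VISION_VOCAB         # one shared discrete vocabulary

text_ids = torch.randint(0, TEXT_VOCAB, (1, 8))          # tokenized text prompt
vision_ids = torch.randint(TEXT_VOCAB, VOCAB, (1, 16))   # discretized image tokens
sequence = torch.cat([text_ids, vision_ids], dim=1)      # a single token stream

# Stand-in for the decoder-only Transformer: embed each token and predict the
# next id over the shared vocabulary; the loss is plain next-token prediction.
embed = nn.Embedding(VOCAB, 64)
head = nn.Linear(64, VOCAB)
logits = head(embed(sequence[:, :-1]))
loss = F.cross_entropy(logits.reshape(-1, VOCAB), sequence[:, 1:].reshape(-1))
print(loss.item())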
Model links:
Emu3-Stage1:
https://modelscope.cn/models/BAAI/Emu3-Stage1
Emu3-VisionTokenizer:
https://modelscope.cn/models/BAAI/Emu3-VisionTokenizer
Emu3-Gen:
https://modelscope.cn/collections/Emu3-9eacc8668b1043
Emu3-Chat:
https://modelscope.cn/models/BAAI/Emu3-Chat
Code example:
from PIL import Image
from transformers import AutoTokenizer, AutoModel, AutoImageProcessor, AutoModelForCausalLM
from transformers.generation.configuration_utils import GenerationConfig
from transformers.generation import LogitsProcessorList, PrefixConstrainedLogitsProcessor, UnbatchedClassifierFreeGuidanceLogitsProcessor
import torch

from modelscope import snapshot_download

# model path
EMU_HUB = snapshot_download("BAAI/Emu3-Stage1")
VQ_HUB = snapshot_download("BAAI/Emu3-VisionTokenizer")

import sys
sys.path.append(EMU_HUB)
from processing_emu3 import Emu3Processor

# prepare model and processor
model = AutoModelForCausalLM.from_pretrained(
    EMU_HUB,
    device_map="cuda:0",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained(EMU_HUB, trust_remote_code=True, padding_side="left")
image_processor = AutoImageProcessor.from_pretrained(VQ_HUB, trust_remote_code=True)
image_tokenizer = AutoModel.from_pretrained(VQ_HUB, device_map="cuda:0", trust_remote_code=True).eval()
processor = Emu3Processor(image_processor, image_tokenizer, tokenizer, chat_template="{image_prompt}{text_prompt}")

# Image Generation
# prepare input
POSITIVE_PROMPT = " masterpiece, film grained, best quality."
NEGATIVE_PROMPT = "lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry."

classifier_free_guidance = 3.0
prompt = "a portrait of young girl."
prompt += POSITIVE_PROMPT

kwargs = dict(
    mode='G',
    ratio="1:1",
    image_area=model.config.image_area,
    return_tensors="pt",
    padding="longest",
)
pos_inputs = processor(text=prompt, **kwargs)
neg_inputs = processor(text=NEGATIVE_PROMPT, **kwargs)

# prepare hyper parameters
GENERATION_CONFIG = GenerationConfig(
    use_cache=True,
    eos_token_id=model.config.eos_token_id,
    pad_token_id=model.config.pad_token_id,
    max_new_tokens=40960,
    do_sample=True,
    top_k=2048,
)

h = pos_inputs.image_size[:, 0]
w = pos_inputs.image_size[:, 1]
constrained_fn = processor.build_prefix_constrained_fn(h, w)
logits_processor = LogitsProcessorList([
    UnbatchedClassifierFreeGuidanceLogitsProcessor(
        classifier_free_guidance,
        model,
        unconditional_ids=neg_inputs.input_ids.to("cuda:0"),
    ),
    PrefixConstrainedLogitsProcessor(
        constrained_fn,
        num_beams=1,
    ),
])

# generate
outputs = model.generate(
    pos_inputs.input_ids.to("cuda:0"),
    GENERATION_CONFIG,
    logits_processor=logits_processor,
    attention_mask=pos_inputs.attention_mask.to("cuda:0"),
)

mm_list = processor.decode(outputs[0])
for idx, im in enumerate(mm_list):
    if not isinstance(im, Image.Image):
        continue
    im.save(f"result_{idx}.png")

# Multimodal Understanding
text = "The image depicts "
image = Image.open("assets/demo.png")

inputs = processor(
    text=text,
    image=image,
    mode='U',
    padding="longest",
    return_tensors="pt",
)
GENERATION_CONFIG = GenerationConfig(
    pad_token_id=tokenizer.pad_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=1024,
)

outputs = model.generate(
    inputs.input_ids.to("cuda:0"),
    GENERATION_CONFIG,
    attention_mask=inputs.attention_mask.to("cuda:0"),
)
outputs = outputs[:, inputs.input_ids.shape[-1]:]
answers = processor.batch_decode(outputs, skip_special_tokens=True)
for ans in answers:
    print(ans)
More usage tutorials are available in the related community articles listed below.
stable-diffusion-3.5-large
Stability AI recently released its latest model family, stable-diffusion-3.5-large. SD3.5 features a comprehensive architectural overhaul, now ships under an updated and more permissive community license, improves image fidelity, prompt following, and controllability, and runs comfortably on consumer GPUs.
Model link:
https://modelscope.cn/models/AI-ModelScope/stable-diffusion-3.5-large
Example code:
Install dependencies:
!pip install diffusers -U
Inference code:
import torch
from diffusers import StableDiffusion3Pipeline
from modelscope import snapshot_download

model_dir = snapshot_download("AI-ModelScope/stable-diffusion-3.5-large")
pipe = StableDiffusion3Pipeline.from_pretrained(model_dir, torch_dtype=torch.bfloat16)
pipe = pipe.to("cuda")

image = pipe(
    "A capybara holding a sign that reads Hello World",
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("capybara.png")
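If the full bfloat16 SD3.5 pipeline does not fit into VRAM on a consumer GPU, Diffusers' model CPU offload (which requires the accelerate package) keeps only the currently running sub-model on the GPU. A sketch of the same example with offloading enabled, slower but lighter on memory:

import torch
from diffusers import StableDiffusion3Pipeline
from modelscope import snapshot_download

model_dir = snapshot_download("AI-ModelScope/stable-diffusion-3.5-large")
pipe = StableDiffusion3Pipeline.from_pretrained(model_dir, torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()  # move each sub-model to the GPU only while it runs

image = pipe(
    "A capybara holding a sign that reads Hello World",
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("capybara_offload.png")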
Janus-1.3B
DeepSeek recently released Janus, a simple, unified, and flexible multimodal framework that handles both multimodal understanding and generation. Unlike previous work, Janus decouples visual encoding into separate pathways while processing everything with a single, unified Transformer architecture. This decoupling not only eases the conflict between the visual encoder's roles in understanding and generation, but also makes the framework more flexible.
Model link:
https://modelscope.cn/models/deepseek-ai/Janus-1.3B
Example code:
Environment setup:
!git clone https://github.com/deepseek-ai/Janus.git
%cd Janus
!pip install -e .
Visual understanding
import torch
from transformers import AutoModelForCausalLM
from janus.models import MultiModalityCausalLM, VLChatProcessor
from janus.utils.io import load_pil_images
from modelscope import snapshot_download

# specify the path to the model
model_path = snapshot_download("deepseek-ai/Janus-1.3B")
vl_chat_processor: VLChatProcessor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer

vl_gpt: MultiModalityCausalLM = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True
)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()

conversation = [
    {
        "role": "User",
        "content": "<image_placeholder>\nConvert the formula into latex code.",
        "images": ["/mnt/workspace/Janus/images/equation.png"],
    },
    {"role": "Assistant", "content": ""},
]

# load images and prepare for inputs
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation, images=pil_images, force_batchify=True
).to(vl_gpt.device)

# run image encoder to get the image embeddings
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)

# run the model to get the response
outputs = vl_gpt.language_model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True,
)

answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(f"{prepare_inputs['sft_format'][0]}", answer)
Image generation
import os
import PIL.Image
import torch
import numpy as np
from transformers import AutoModelForCausalLM
from janus.models import MultiModalityCausalLM, VLChatProcessor
from modelscope import snapshot_download

# specify the path to the model
model_path = snapshot_download("deepseek-ai/Janus-1.3B")
vl_chat_processor: VLChatProcessor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer

vl_gpt: MultiModalityCausalLM = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True
)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()

conversation = [
    {
        "role": "User",
        "content": "A stunning princess from kabul in red, white traditional clothing, blue eyes, brown hair",
    },
    {"role": "Assistant", "content": ""},
]

sft_format = vl_chat_processor.apply_sft_template_for_multi_turn_prompts(
    conversations=conversation,
    sft_format=vl_chat_processor.sft_format,
    system_prompt="",
)
prompt = sft_format + vl_chat_processor.image_start_tag


@torch.inference_mode()
def generate(
    mmgpt: MultiModalityCausalLM,
    vl_chat_processor: VLChatProcessor,
    prompt: str,
    temperature: float = 1,
    parallel_size: int = 16,
    cfg_weight: float = 5,
    image_token_num_per_image: int = 576,
    img_size: int = 384,
    patch_size: int = 16,
):
    input_ids = vl_chat_processor.tokenizer.encode(prompt)
    input_ids = torch.LongTensor(input_ids)

    # duplicate the prompt: even rows are conditional, odd rows are unconditional (for CFG)
    tokens = torch.zeros((parallel_size * 2, len(input_ids)), dtype=torch.int).cuda()
    for i in range(parallel_size * 2):
        tokens[i, :] = input_ids
        if i % 2 != 0:
            tokens[i, 1:-1] = vl_chat_processor.pad_id

    inputs_embeds = mmgpt.language_model.get_input_embeddings()(tokens)

    generated_tokens = torch.zeros((parallel_size, image_token_num_per_image), dtype=torch.int).cuda()

    for i in range(image_token_num_per_image):
        outputs = mmgpt.language_model.model(
            inputs_embeds=inputs_embeds,
            use_cache=True,
            past_key_values=outputs.past_key_values if i != 0 else None)
        hidden_states = outputs.last_hidden_state

        # classifier-free guidance on the image-token head
        logits = mmgpt.gen_head(hidden_states[:, -1, :])
        logit_cond = logits[0::2, :]
        logit_uncond = logits[1::2, :]
        logits = logit_uncond + cfg_weight * (logit_cond - logit_uncond)

        probs = torch.softmax(logits / temperature, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        generated_tokens[:, i] = next_token.squeeze(dim=-1)

        next_token = torch.cat([next_token.unsqueeze(dim=1), next_token.unsqueeze(dim=1)], dim=1).view(-1)
        img_embeds = mmgpt.prepare_gen_img_embeds(next_token)
        inputs_embeds = img_embeds.unsqueeze(dim=1)

    # decode the generated image tokens back to pixels
    dec = mmgpt.gen_vision_model.decode_code(
        generated_tokens.to(dtype=torch.int),
        shape=[parallel_size, 8, img_size // patch_size, img_size // patch_size])
    dec = dec.to(torch.float32).cpu().numpy().transpose(0, 2, 3, 1)

    dec = np.clip((dec + 1) / 2 * 255, 0, 255)

    visual_img = np.zeros((parallel_size, img_size, img_size, 3), dtype=np.uint8)
    visual_img[:, :, :] = dec

    os.makedirs('generated_samples', exist_ok=True)
    for i in range(parallel_size):
        save_path = os.path.join('generated_samples', "img_{}.jpg".format(i))
        PIL.Image.fromarray(visual_img[i]).save(save_path)


generate(
    vl_gpt,
    vl_chat_processor,
    prompt,
)
Recommended Datasets
CCI3-HQ-Annotation-Benchmark
CCI3-HQ-Annotation-Benchmark is a high-quality Chinese web corpus provided by BAAI, intended as a data-annotation benchmark for a variety of natural language processing tasks.
Dataset link:
https://modelscope.cn/datasets/BAAI/CCI3-HQ-Annotation-Benchmark
SWE-bench
SWE-bench is a benchmark for evaluating language models on real-world software engineering tasks: given a GitHub issue and the repository it was filed against, the model must produce a patch that resolves the issue.
Dataset link:
https://modelscope.cn/datasets/AI-ModelScope/SWE-bench
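A minimal sketch of pulling the dataset through the ModelScope SDK (the split name here is an assumption; check the dataset card for the actual subsets and splits):

from modelscope.msdatasets import MsDataset

# split='test' is an assumption; adjust it to match the dataset card
ds = MsDataset.load('AI-ModelScope/SWE-bench', split='test')
print(next(iter(ds)))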
simpletuner_venv
simpletuner_venv packages a ready-to-use Python virtual environment for the SimpleTuner fine-tuning toolkit, letting developers set up a diffusion-model training environment without installing the dependencies one by one.
Dataset link:
https://modelscope.cn/datasets/livehouse/simpletuner_venv
Featured Apps
SD3.5-turbo Fast Image Generation
Following Stability AI's recent release of the stable-diffusion-3.5-large model family, SD3.5-turbo is a fast image-generation demo built on the Stable Diffusion 3.5 Large model, offering a quick and efficient image-generation experience.
Try it out:
https://modelscope.cn/studios/AI-ModelScope/stable-diffusion-3.5-large-turbo
Alibaba Tora: Trajectory-Oriented Video Generation
Tora is a trajectory-controlled video generation tool recently open-sourced by Alibaba: draw any number of trajectories and enter a text prompt to generate a 6-second trajectory-controlled video. You can either pick from the provided preset trajectories or draw your own for more personalized results.
Try it out:
https://modelscope.cn/studios/xiaoche/Tora
open-notebooklm-demo
open-notebooklm-demo is an open-source demo in the style of NotebookLM, showing how large language models can turn a document into a conversational, podcast-style presentation of its content.
Try it out:
https://modelscope.cn/studios/studio-test/open-notebooklm-demo
Selected Community Articles
- GLM-4-Voice: Zhipu's open-source take on "Her" is here!
- A unified multimodal model arrives! BAAI releases Emu3, a multimodal world model!
- Compass Arena releases its multimodal LLM arena leaderboard!
- Today's highlight: "AI uses phones and computers like a human", and the ModelScope community's open-source project is already a step ahead
- DeepSeek open-sources Janus, a multimodal LLM framework: ModelScope best practices
- Students with an edu email address: come claim your exclusive (free) GPU!
- MemoryScope: a long-term memory system for LLM chatbots