I am fine-tuning the nlp_mt5_zero-shot-augment_chinese-base model, but the pytorch_model.bin file is never written, and the following error is raised:
MemoryError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "F:\study\graduationProject\vue2_mt5\vue_flask\finetune.py", line 61, in <module>
    trainer.train()
  File "D:\tool\Anaconda\anaconda3\envs\modelscope\lib\site-packages\modelscope\trainers\trainer.py", line 711, in train
    self.train_loop(self.train_dataloader)
  File "D:\tool\Anaconda\anaconda3\envs\modelscope\lib\site-packages\modelscope\trainers\trainer.py", line 1243, in train_loop
    self.invoke_hook(TrainerStages.after_train_epoch)
  File "D:\tool\Anaconda\anaconda3\envs\modelscope\lib\site-packages\modelscope\trainers\trainer.py", line 1395, in invoke_hook
    getattr(hook, fn_name)(self)
  File "D:\tool\Anaconda\anaconda3\envs\modelscope\lib\site-packages\modelscope\trainers\hooks\checkpoint\checkpoint_hook.py", line 177, in after_train_epoch
    self._do_save(trainer, CheckpointStrategy.by_epoch)
  File "D:\tool\Anaconda\anaconda3\envs\modelscope\lib\site-packages\modelscope\trainers\hooks\checkpoint\checkpoint_hook.py", line 160, in _do_save
    self._save_checkpoint(trainer, prefix)
  File "D:\tool\Anaconda\anaconda3\envs\modelscope\lib\site-packages\modelscope\trainers\hooks\checkpoint\checkpoint_hook.py", line 224, in _save_checkpoint
    self.processor.save_checkpoints(trainer, checkpoint_path_prefix,
  File "D:\tool\Anaconda\anaconda3\envs\modelscope\lib\site-packages\modelscope\trainers\hooks\checkpoint\checkpoint_processor.py", line 126, in save_checkpoints
    self.save_trainer_state(trainer, model, _train_state_file, meta,
  File "D:\tool\Anaconda\anaconda3\envs\modelscope\lib\site-packages\modelscope\trainers\hooks\checkpoint\checkpoint_processor.py", line 192, in save_trainer_state
    save_checkpoint(
  File "D:\tool\Anaconda\anaconda3\envs\modelscope\lib\site-packages\modelscope\utils\checkpoint.py", line 114, in save_checkpoint
    torch.save(checkpoint, f)
  File "D:\tool\Anaconda\anaconda3\envs\modelscope\lib\site-packages\torch\serialization.py", line 620, in save
    return
  File "D:\tool\Anaconda\anaconda3\envs\modelscope\lib\site-packages\torch\serialization.py", line 482, in __exit__
    self.file_like.write_end_of_file()
RuntimeError: [enforce fail at inline_container.cc:424] . unexpected pos 1237267904 vs 1237267856
Reduce the batch size: try a smaller training batch size to lower memory consumption.
training_args = TrainingArguments(
    ...,
    per_device_train_batch_size=8,  # adjust the batch size
    ...
)
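Note that the traceback comes from ModelScope's own trainer, while TrainingArguments is the Hugging Face style of configuration. A minimal sketch of the equivalent change for ModelScope follows, assuming the trainer accepts a cfg_modify_fn callback and that the model's configuration exposes train.dataloader.batch_size_per_gpu (both should be checked against the installed version). The SimpleNamespace object is only a hypothetical stand-in for the real configuration so the snippet runs on its own.

```python
from types import SimpleNamespace


def cfg_modify_fn(cfg):
    """Lower the per-GPU batch size before training starts.

    Assumption: the config has a train.dataloader.batch_size_per_gpu field.
    """
    cfg.train.dataloader.batch_size_per_gpu = 2
    return cfg


# Hypothetical stand-in for the configuration object loaded by the trainer.
demo_cfg = SimpleNamespace(
    train=SimpleNamespace(dataloader=SimpleNamespace(batch_size_per_gpu=8))
)

demo_cfg = cfg_modify_fn(demo_cfg)
print(demo_cfg.train.dataloader.batch_size_per_gpu)
```

In a real script the function would be passed to the trainer when it is built, instead of being called on a stand-in object.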
File system or path issues. Check disk space: make sure the disk has enough free space.
df -h  # show disk space usage
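Since the paths in the traceback are Windows paths, where df is usually not available, the same check can be done from Python. The sketch below uses shutil.disk_usage from the standard library. The helper names and the choice of required size are illustrative assumptions: the error reports a write position of 1237267904 bytes, so the checkpoint is at least about 1.2 GB, and some extra margin is sensible.

```python
import shutil


def free_gib(path="."):
    """Return the free space, in GiB, on the volume that contains `path`."""
    return shutil.disk_usage(path).free / 2**30


def has_room(path, needed_bytes):
    """Return True if the volume containing `path` has at least `needed_bytes` free."""
    return shutil.disk_usage(path).free >= needed_bytes


if __name__ == "__main__":
    # Assumption: twice the write position from the error message as a safety margin.
    needed = 2 * 1237267904
    print(f"free space: {free_gib('.'):.1f} GiB")
    print("enough room for the checkpoint:", has_room(".", needed))
```

Run it with the path set to the directory where checkpoints are written (the trainer's work directory), since that may be on a different drive from the script.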