System environment:
OS: Linux
Python version: 3.8.12
Code editor: VSCode + Jupyter Notebook
datasets version: 2.0.0
For the dataset:
Code:
import datasets

dataset = datasets.load_dataset("yelp_review_full")
Error message:
ConnectionError                           Traceback (most recent call last)
/tmp/ipykernel_21708/3707219471.py in <module>
----> 1 dataset=datasets.load_dataset("yelp_review_full")

myenv/lib/python3.8/site-packages/datasets/load.py in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, ignore_verifications, keep_in_memory, save_infos, revision, use_auth_token, task, streaming, **config_kwargs)
   1658
   1659     # Create a dataset builder
-> 1660     builder_instance = load_dataset_builder(
   1661         path=path,
   1662         name=name,

myenv/lib/python3.8/site-packages/datasets/load.py in load_dataset_builder(path, name, data_dir, data_files, cache_dir, features, download_config, download_mode, revision, use_auth_token, **config_kwargs)
   1484     download_config = download_config.copy() if download_config else DownloadConfig()
   1485     download_config.use_auth_token = use_auth_token
-> 1486     dataset_module = dataset_module_factory(
   1487         path,
   1488         revision=revision,

myenv/lib/python3.8/site-packages/datasets/load.py in dataset_module_factory(path, revision, download_config, download_mode, force_local_path, dynamic_modules_path, data_dir, data_files, **download_kwargs)
   1236                 f"Couldn't find '{path}' on the Hugging Face Hub either: {type(e1).__name__}: {e1}"
   1237             ) from None
-> 1238         raise e1 from None
   1239     else:
   1240         raise FileNotFoundError(

myenv/lib/python3.8/site-packages/datasets/load.py in dataset_module_factory(path, revision, download_config, download_mode, force_local_path, dynamic_modules_path, data_dir, data_files, **download_kwargs)
   1173     if path.count("/") == 0:  # even though the dataset is on the Hub, we get it from GitHub for now
   1174         # TODO(QL): use a Hub dataset module factory instead of GitHub
-> 1175         return GithubDatasetModuleFactory(
   1176             path,
   1177             revision=revision,

myenv/lib/python3.8/site-packages/datasets/load.py in get_module(self)
    531         revision = self.revision
    532         try:
--> 533             local_path = self.download_loading_script(revision)
    534         except FileNotFoundError:
    535             if revision is not None or os.getenv("HF_SCRIPTS_VERSION", None) is not None:

myenv/lib/python3.8/site-packages/datasets/load.py in download_loading_script(self, revision)
    511         if download_config.download_desc is None:
    512             download_config.download_desc = "Downloading builder script"
--> 513         return cached_path(file_path, download_config=download_config)
    514
    515     def download_dataset_infos_file(self, revision: Optional[str]) -> str:

myenv/lib/python3.8/site-packages/datasets/utils/file_utils.py in cached_path(url_or_filename, download_config, **download_kwargs)
    232     if is_remote_url(url_or_filename):
    233         # URL, so get it from the cache (downloading if necessary)
--> 234         output_path = get_from_cache(
    235             url_or_filename,
    236             cache_dir=cache_dir,

myenv/lib/python3.8/site-packages/datasets/utils/file_utils.py in get_from_cache(url, cache_dir, force_download, proxies, etag_timeout, resume_download, user_agent, local_files_only, use_etag, max_retries, use_auth_token, ignore_url_params, download_desc)
    580         _raise_if_offline_mode_is_enabled(f"Tried to reach {url}")
    581         if head_error is not None:
--> 582             raise ConnectionError(f"Couldn't reach {url} ({repr(head_error)})")
    583         elif response is not None:
    584             raise ConnectionError(f"Couldn't reach {url} (error {response.status_code})")

ConnectionError: Couldn't reach https://raw.githubusercontent.com/huggingface/datasets/2.0.0/datasets/yelp_review_full/yelp_review_full.py (ReadTimeout(ReadTimeoutError("HTTPSConnectionPool(host='raw.githubusercontent.com', port=443): Read timed out. (read timeout=100)")))
Clearly, the problem is that the machine cannot reach raw.githubusercontent.com.
If you are able to use a proxy, the best solution is simply to run the whole process through the proxy.
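For example, since datasets downloads files via requests, you can point it at a proxy through environment variables before calling load_dataset. This is just a minimal sketch; the proxy address 127.0.0.1:7890 is a placeholder and must be replaced with your own:

import os
# placeholder proxy address; replace with the address of your actual proxy
os.environ["HTTP_PROXY"] = "http://127.0.0.1:7890"
os.environ["HTTPS_PROXY"] = "http://127.0.0.1:7890"

import datasets
dataset = datasets.load_dataset("yelp_review_full")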
For situations where using a proxy on the server is not convenient, here is the solution I used: run the download through a proxy on a local machine, then upload the resulting files to the runtime environment. (Note that the local machine and the server can run different operating systems.)
I did try downloading the loading script (the Python file) and uploading it to the server, but after a lot of fiddling it still did not work: the download links inside that script point to Google Cloud, and neither downloading that data manually and uploading it, nor changing the download link to the S3 file, worked for me. In short, it did not work; if you know a workable way, please tell me.
Roughly speaking, the approach that worked for me was: first load the dataset locally, save it to disk, upload the folder to the server, and then load the dataset directly from disk on the server.
Load the dataset locally and save it to the local disk (note that these are Windows-style paths):
import datasets dataset=datasets.load_dataset("yelp_review_full",cache_dir='mypath\data\huggingfacedatasetscache') dataset.save_to_disk('mypath\\data\\yelp_review_full_disk')
Upload that folder to the server:
You can do this with bypy and Baidu Netdisk; for details see my earlier post bypy:使用Linux命令行上传及下载百度云盘文件(远程服务器大文件传输必备) on CSDN.
First upload the folder to 我的应用数据/bypy on Baidu Netdisk (a sketch of the upload command follows below), then download it on the server. Note that downloading a folder means downloading all files inside the remote folder into a local folder, not downloading the folder itself as a single item:
bypy downdir yelp_review_full_disk mypath/datasets/yelp_review_full_disk
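For completeness, the upload side on the local Windows machine looks roughly like this (a sketch only; it assumes bypy is already installed and authorized there, and that the local path is the folder saved above):

bypy upload mypath\data\yelp_review_full_disk yelp_review_full_disk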
Then load the dataset from disk on the server:
dataset = datasets.load_from_disk("mypath/datasets/yelp_review_full_disk")
After that the dataset can be used normally.
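For example, you can inspect the splits and look at one sample (yelp_review_full has train and test splits, each with label and text columns):

print(dataset)              # DatasetDict with 'train' and 'test' splits
print(dataset['train'][0])  # a dict with 'label' and 'text' fields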
Note that, according to the datasets documentation, a dataset can also be saved directly to an S3FileSystem (https://huggingface.co/docs/datasets/v2.0.0/en/package_reference/main_classes#datasets.filesystems.S3FileSystem). I assume this is roughly an API for publicly downloadable files, similar to Google Cloud or Baidu Netdisk; it would probably be more convenient than saving locally and then transferring to the server.
I have not looked into this feature, so I did not use it.
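For reference, the linked documentation sketches the S3 route roughly as below. This is untested by me; the bucket name and credentials are placeholders:

from datasets.filesystems import S3FileSystem

s3 = S3FileSystem(key='my_aws_access_key_id', secret='my_aws_secret_access_key')  # placeholder credentials
# on the local machine, save straight to the bucket instead of a local folder
dataset.save_to_disk('s3://my-bucket/yelp_review_full_disk', fs=s3)
# on the server, load it back from the same bucket
dataset = datasets.load_from_disk('s3://my-bucket/yelp_review_full_disk', fs=s3)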
For the metric:
Code:
metric=datasets.load_metric('accuracy')
Error message:
ConnectionError                           Traceback (most recent call last)
/tmp/ipykernel_24141/2186493793.py in <module>
----> 1 metric=datasets.load_metric('accuracy')

myenv/lib/python3.8/site-packages/datasets/load.py in load_metric(path, config_name, process_id, num_process, cache_dir, experiment_id, keep_in_memory, download_config, download_mode, revision, **metric_init_kwargs)
   1390     """
   1391     download_mode = DownloadMode(download_mode or DownloadMode.REUSE_DATASET_IF_EXISTS)
-> 1392     metric_module = metric_module_factory(
   1393         path, revision=revision, download_config=download_config, download_mode=download_mode
   1394     ).module_path

myenv/lib/python3.8/site-packages/datasets/load.py in metric_module_factory(path, revision, download_config, download_mode, force_local_path, dynamic_modules_path, **download_kwargs)
   1322         except Exception as e2:  # noqa: if it's not in the cache, then it doesn't exist.
   1323             if not isinstance(e1, FileNotFoundError):
-> 1324                 raise e1 from None
   1325             raise FileNotFoundError(
   1326                 f"Couldn't find a metric script at {relative_to_absolute_path(combined_path)}. "

myenv/lib/python3.8/site-packages/datasets/load.py in metric_module_factory(path, revision, download_config, download_mode, force_local_path, dynamic_modules_path, **download_kwargs)
   1310     elif is_relative_path(path) and path.count("/") == 0 and not force_local_path:
   1311         try:
-> 1312             return GithubMetricModuleFactory(
   1313                 path,
   1314                 revision=revision,

myenv/lib/python3.8/site-packages/datasets/load.py in get_module(self)
    598         revision = self.revision
    599         try:
--> 600             local_path = self.download_loading_script(revision)
    601             revision = self.revision
    602         except FileNotFoundError:

myenv/lib/python3.8/site-packages/datasets/load.py in download_loading_script(self, revision)
    592         if download_config.download_desc is None:
    593             download_config.download_desc = "Downloading builder script"
--> 594         return cached_path(file_path, download_config=download_config)
    595
    596     def get_module(self) -> MetricModule:

myenv/lib/python3.8/site-packages/datasets/utils/file_utils.py in cached_path(url_or_filename, download_config, **download_kwargs)
    232     if is_remote_url(url_or_filename):
    233         # URL, so get it from the cache (downloading if necessary)
--> 234         output_path = get_from_cache(
    235             url_or_filename,
    236             cache_dir=cache_dir,

myenv/lib/python3.8/site-packages/datasets/utils/file_utils.py in get_from_cache(url, cache_dir, force_download, proxies, etag_timeout, resume_download, user_agent, local_files_only, use_etag, max_retries, use_auth_token, ignore_url_params, download_desc)
    580         _raise_if_offline_mode_is_enabled(f"Tried to reach {url}")
    581         if head_error is not None:
--> 582             raise ConnectionError(f"Couldn't reach {url} ({repr(head_error)})")
    583         elif response is not None:
    584             raise ConnectionError(f"Couldn't reach {url} (error {response.status_code})")

ConnectionError: Couldn't reach https://raw.githubusercontent.com/huggingface/datasets/2.0.0/metrics/accuracy/accuracy.py (ReadTimeout(ReadTimeoutError("HTTPSConnectionPool(host='raw.githubusercontent.com', port=443): Read timed out. (read timeout=100)")))
The metric case is simpler: just download that Python file to the machine (no proxy is needed for this; I have not written a dedicated post on downloading GitHub files without a proxy, but you can refer to my earlier post on a similar topic: PyG的Planetoid无法直接下载Cora等数据集的3个解决方式 on CSDN), then point load_metric at the local file instead:
metric=datasets.load_metric('mypath/accuracy.py')
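After that it behaves the same as the hosted version; for example, with toy inputs just to show the call:

print(metric.compute(predictions=[0, 1, 1], references=[0, 1, 0]))  # -> {'accuracy': 0.666...}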