Importing a Large Dataset into MongoDB in One Pass

2023-01-04

0. Before we start

Linux: Ubuntu 16.04 Kylin
MongoDB: 3.2.7
Data file size: 13,518 records

1. Prerequisites

The mongoimport command imports a data file into a MongoDB database.

Its usage is as follows:

zhangsan@node01:/usr/local/mongodb-3.2.7/bin$ ./mongoimport --help
options:
  --help                  produce help message
  -v [ --verbose ]        be more verbose (include multiple times for more
                          verbosity e.g. -vvvvv)
  --version               print the program's version and exit
  -h [ --host ] arg       mongo host to connect to ( <set name>/s1,s2 for sets)
  --port arg              server port. Can also use --host hostname:port
  --ipv6                  enable IPv6 support (disabled by default)
  -u [ --username ] arg   username
  -p [ --password ] arg   password
  --dbpath arg            directly access mongod database files in the given
                          path, instead of connecting to a mongod  server -
                          needs to lock the data directory, so cannot be used
                          if a mongod is currently accessing the same path
  --directoryperdb        if dbpath specified, each db is in a separate
                          directory
  --journal               enable journaling
  -d [ --db ] arg         database to use
  -c [ --collection ] arg collection to use (some commands)
  -f [ --fields ] arg     comma separated list of field names e.g. -f name,age
  --fieldFile arg         file with fields names - 1 per line
  --ignoreBlanks          if given, empty fields in csv and tsv will be ignored
  --type arg              type of file to import.  default: json (json,csv,tsv)
  --file arg              file to import from; if not specified stdin is used
  --drop                  drop collection first
  --headerline            CSV,TSV only - use first line as headers
  --upsert                insert or update objects that already exist
  --upsertFields arg      comma-separated fields for the query part of the
                          upsert. You should make sure this is indexed
  --stopOnError           stop importing at first error rather than continuing
  --jsonArray             load a json array, not one item per line. Currently
                          limited to 4MB.

As the --type option shows, mongoimport imports JSON by default and also supports CSV and TSV.

The raw data for this article was in txt format, so it was converted to JSON beforehand with Python.
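
The article does not show the conversion script or the layout of the txt file, so the following is only a minimal sketch of such a conversion. It assumes one record per line with tab-separated values; the field names in FIELDS are hypothetical placeholders, not the real schema.

```python
import json

# Hypothetical field names: the article does not show the txt layout.
FIELDS = ["title", "author", "price"]


def txt_lines_to_docs(lines, fields=FIELDS, sep="\t"):
    """Turn delimited text lines into a list of dicts, skipping blank lines."""
    docs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # ignore empty lines in the source file
        values = line.split(sep)
        docs.append(dict(zip(fields, values)))
    return docs


def to_json_array(docs):
    """Serialize all documents as a single top-level JSON array."""
    return json.dumps(docs, ensure_ascii=False)


# Tiny in-memory sample standing in for the real txt file.
sample = ["Dune\tFrank Herbert\t9.99\n", "\n", "Emma\tJane Austen\t7.50\n"]
docs = txt_lines_to_docs(sample)
print(to_json_array(docs))
```

All values stay strings in this sketch; numeric fields would need an explicit conversion before serializing.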

The --jsonArray option will be needed later.

2. mongoimport fails to import the JSON file

The data is imported into the collection tb_books in the database db_books with the following command:

zhangsan@node01:/usr/local/mongodb-3.2.7/bin$ ./mongoimport --db db_books --collection tb_books --file /home/zhangsan/windowsUpload/data/tb_books.json

It fails with the following error, however:

2022-11-20T22:11:00.034-0700    connected to: localhost
2022-11-20T22:11:00.035-0700    Failed: error unmarshaling bytes on document #0: JSON decoder out of sync - data changing underfoot?
2022-11-20T22:11:00.035-0700    imported 0 documents

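The import fails on the very first document (#0). The JSON in the data file itself checked out as valid; after some searching, the fix turned out to be adding the --jsonArray option:

zhangsan@node01:/usr/local/mongodb-3.2.7/bin$ ./mongoimport --db db_books --collection tb_books --jsonArray --file /home/zhangsan/data/tb_books.json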
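
A likely explanation is that the file holds one top-level JSON array, while mongoimport without --jsonArray expects separate documents. The sketch below, using only the Python standard library, checks which layout a file's text has and, as an alternative to the flag, rewrites an array as one document per line. The helper names and sample data are made up for illustration.

```python
import json


def looks_like_json_array(text):
    """True when the first non-whitespace character opens a JSON array."""
    return text.lstrip().startswith("[")


def array_to_json_lines(text):
    """Re-serialize a top-level JSON array as one document per line."""
    docs = json.loads(text)
    if not isinstance(docs, list):
        raise ValueError("expected a top-level JSON array")
    return "\n".join(json.dumps(doc, ensure_ascii=False) for doc in docs)


# Illustrative sample, not taken from the article's data file.
array_text = '[{"_id": 0, "title": "Dune"}, {"_id": 1, "title": "Emma"}]'
print(looks_like_json_array(array_text))
print(array_to_json_lines(array_text))
```

If the check returns True, either pass --jsonArray or import the converted, line-per-document output instead.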

For a CSV file, the import command is:

zhangsan@node01:/usr/local/mongodb-3.2.7/bin$ ./mongoimport --db db_books --collection tb_books --type csv --file /home/zhangsan/data/tb_books.csv --headerline

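3. db.COLLECTION.count() returns the wrong value

A total of 13,518 records were imported, but running count() in the mongo shell returns a smaller number.

If every document's _id increases from 0 up to 13518, adding the following query argument makes count() return the correct number of documents: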

db.tb_books.count({_id: {$exists: true}})

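This form of the query is slow, though: passing a query to count() forces it to scan the collection instead of using the collection's metadata.

As for the cause: if MongoDB went through a hard crash and was not shut down cleanly, the values returned by 'db.stats.objects', 'db.<coll>.stats.count', and 'db.<coll>.count()' may be invalid. When no query is given, MongoDB probably just falls back on the collected statistics.

The fix is to run validate(true):

> db.tb_books.count()
10137
> db.tb_books.find({}).toArray().length
13518
> db.tb_books.validate(true)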
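
The same comparison between a metadata-based count and a scanned count can be sketched from Python. The function below assumes an object that behaves like a pymongo Collection (pymongo 3.7 or later), where estimated_document_count() reads collection metadata and count_documents({}) counts the matching documents. FakeCollection is a made-up stand-in that reproduces the two numbers from the shell session above, so the sketch runs without a server.

```python
def compare_counts(collection):
    """Return (metadata_count, scanned_count, mismatch) for a collection.

    `collection` is assumed to expose the pymongo Collection methods
    estimated_document_count() and count_documents(filter).
    """
    fast = collection.estimated_document_count()  # from collection metadata
    exact = collection.count_documents({})        # counts matching documents
    return fast, exact, fast != exact


class FakeCollection:
    """Hypothetical stand-in mimicking the stale metadata seen above."""

    def estimated_document_count(self):
        return 10137

    def count_documents(self, query):
        return 13518


print(compare_counts(FakeCollection()))  # (10137, 13518, True)
```

A mismatch like this suggests the stored statistics are stale and that running validate on the collection is worth a try.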

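4. Incomplete data import

When mongoimport does not import all of the JSON data, add the --batchSize xxxx option, which sets how many documents are sent in each insert batch. (The number of insertion workers is a separate option, --numInsertionWorkers.)

zhangsan@node01:/usr/local/mongodb-3.2.7/bin$ ./mongoimport --db db_books --collection tb_books --file /home/zhangsan/windowsUpload/data/tb_books.json --batchSize 100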
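
To give a feel for what a batch size of 100 means for this dataset, here is a small illustration of splitting 13,518 documents into batches. It only demonstrates the batching idea and is not mongoimport's actual implementation; the documents are generated placeholders.

```python
def batches(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Placeholder documents matching the article's record count.
docs = [{"_id": i} for i in range(13518)]
groups = list(batches(docs, 100))
print(len(groups), len(groups[-1]))  # 136 18
```

With 100 documents per batch, the 13,518 records go out as 135 full batches plus a final batch of 18.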
