MapReduce with MongoDB and Python[ZT]

本文涉及的产品
云数据库 MongoDB,独享型 2核8GB
推荐场景:
构建全方位客户视图
简介:

MapReduce with MongoDB and Python

从  Artificial Intelligence in Motion 作者: Marcel Pinheiro Caraciolo (由于Artificial Intelligence in Motion发布的图在墙外,所以将图换到cnblogs)
Hi all,

 In this post, I'll present a demonstration of a map-reduce example with MongoDB and server side JavaScript.  Based on the fact that I've been working  with this technology recently, I thought it would be useful to present here a simple example of  how it works and how to integrate with Python.

But What is MongoDb ?

For you, who doesn't know what is and the basics of how to use  MongoDB, it is important to explain a little bit about the  No-SQL movement. Currently, there are several databases that break with the requirements present in the traditional relational database systems. I present as follows the main keypoints shown at several No-SQL databases:
  • SQL commands are not used as query API (Examples of APIs used include JSON, BSON, etc.)
  • Doesn't guarantee atomic operations.
  • Distributed and horizontally scalable.
  • It doesn't have to predefine schemas. (Non-Schema)
  • Non-tabular data storing (eg; key-value, object, graphs, etc).
Although it is not so obvious, No-SQL is an abbreviation  to  Not Only SQL. The effort and development of this new approach have been doing a lot of noise since 2009. You can find more information about it  here and  here.  It is important to notice that the non-relational databases does not represent a complete replacement for relational databases. It is necessary to know the pros and cons of each approach and decide the most appropriate for your needs in the scenario that you're facing.

MongoDB is one of the most popular  No-SQL today and what this article will focus on. It is a schemaless, document oriented, high performance, scalable database  that uses the key-values concepts to store documents as JSON structured documents. It also includes some relational database features such as indexing models and dynamic queries. It is used today in production in over than 40 websites, including web services such as  SourceForgeGitHubEletronic Arts and  The New York Times..

One of the best functionalities that I like in MongoDb is the  Map-Reduce. In the next section I will explain  how it works illustrated with a simple example using MongoDb and Python.

If you want to install MongoDb or get more information, you can download it  here and read a nice tutorial  here.

Map- Reduce 

MapReduce is a programming model for processing and generating large data sets. It is a framework introduced by Google for support parallel computations large data sets spread over clusters of computers.  Now MapReduce is considered a popular model in distributed computing, inspired by the functions map and reduce commonly used in functional programming.  It can be considered  'Data-Oriented' which process data in two primary steps: Map and Reduce.  On top of that, the query is now executed on simultaneous data sources. The process of mapping the request of the input reader to the data set is called 'Map', and the process of aggregation of the intermediate results from the mapping function in a consolidated result is called 'Reduce'.  The paper about the MapReduce with more details it can be read  here.

Today there are several implementations of MapReduce such as Hadoop, Disco, Skynet, etc. The most famous is Hadoop and is implemented in Java as an open-source project.  In MongoDB there is also a similar implementation in spirit like Hadoop with all input coming from a collection and output going to a collection. For a practical definition, Map-Reduce in MongoDB is useful for batch manipulation of data and aggregation operations.  In real case scenarios, in a situation where  you would have used GROUP BY in SQL,  map/reduce is the equivalent tool in MongoDB.

Now thtat we have introduced Map-Reduce, let's see how access the MongoDB by Python.

PyMongo


PyMongo is a Python distribution containing tools for working with  MongoDB, and is the recommended way to work with MongoDB from Python. It's easy to install and to use. See  here how to install  and use it.

Map-Reduce in Action

Now let's see Map-Reduce in action. For demonstrate the map-reduce I've decided to used of the classical problems solved using it: Word Frequency count across a series of documents. It's a simple problem and is suited to being solved by a map-reduce query.

I've decided to use two samples for this task. The first one is a list of simple sentences to illustrate how the map reduce works.  The second one is the 2009 Obama's Speech at his election for president. It will be used to show a real example illustrated by the code.

Let's consider the diagram below in order to help demonstrate how the map-reduce could be distributed. It shows four sentences that are split  in words and grouped by the function  map and after reduced independently (aggregation)  by the function  reduce. This is interesting as it means our query can be distributed into separate nodes (computers), resulting in faster processing in word count frequency runtime. It's also important to notice the example below shows a balanced tree, but it could be unbalanced or even show some redundancy.

http://images.cnblogs.com/cnblogs_com/zhengyun_ustc/255879/o_Picture%206.png

Map-Reduce Distribution

 

Some notes you need to know before developing your  map and  reduce functions:
  •  The MapReduce engine may invoke reduce functions iteratively; thus; these functions must be idempotent. That is, the following must hold for your reduce function:
                  for all k,vals : reduce( k, [reduce(k,vals)] ) == reduce(k,vals)
  •  Currently, the return value from a reduce function cannot be an array (it's typically an object or a number)
  • If you need to perform an operation only once, use a finalize function.

 Let's go now to the code. For this task, I'll use the  Pymongo framework, which has support for Map/Reduce. As I said earlier, the input text will be the Obama's speech, which has by the way many repeated words. Take a look at the tags cloud (cloud of words which each word fontsize is evaluated based on its frequency) of Obama's Speech.

http://images.cnblogs.com/cnblogs_com/zhengyun_ustc/255879/o_Picture%207.png

 

Obama's Speech in 2009

 

 


For writing our map and reduce functions,  MongoDB allows clients to send JavaScript map and reduce implementations that will get evaluated and run on the server. Here is our map function.
http://images.cnblogs.com/cnblogs_com/zhengyun_ustc/255879/o_Picture%207.1.png

 

 

wordMap.js

 


As you can see the 'this' variable refers to the context from which the function is called. That is, MongoDB will call the map function on each document in the collection we are querying, and it will be pointing to document where it will have the access the key of a document such as 'text', by callingthis.text.  The map function doesn't return a list, instead it calls an emit function which it expects to be defined. This parameters of this function (key, value) will be grouped with others  intermediate results from another map evaluations that have the same key (key, [value1, value2]) and passed to the function reduce that we will define now. 

http://images.cnblogs.com/cnblogs_com/zhengyun_ustc/255879/o_Picture%208.png

 

wordReduce.js

 

 

The  reduce function must reduce a list of a chosen type to a single value of that same type; it must be transitive so it doesn't matter how the mapped items are grouped.

Now let's code our word count example using the  Pymongo  client and passing the map/reduce functions to the server.

 

http://images.cnblogs.com/cnblogs_com/zhengyun_ustc/255879/o_Picture%209.png

 

mapReduce.py

 

 

Let's see the result now:

http://images.cnblogs.com/cnblogs_com/zhengyun_ustc/255879/o_Picture%2010.png

 

And it works! :D

With Map-Reduce function the word frequency count is extremely efficient and even performs better in a distributed environment. With this brief experiment we  can see the potential of map-reduce model for distributed computing, specially on large data sets.

All code used in this article can be download  here .

My next posts will be about  performance evaluation on machine learning techniques.  Wait for news!

Marcel Caraciolo

References 

 

相关实践学习
MongoDB数据库入门
MongoDB数据库入门实验。
快速掌握 MongoDB 数据库
本课程主要讲解MongoDB数据库的基本知识,包括MongoDB数据库的安装、配置、服务的启动、数据的CRUD操作函数使用、MongoDB索引的使用(唯一索引、地理索引、过期索引、全文索引等)、MapReduce操作实现、用户管理、Java对MongoDB的操作支持(基于2.x驱动与3.x驱动的完全讲解)。 通过学习此课程,读者将具备MongoDB数据库的开发能力,并且能够使用MongoDB进行项目开发。   相关的阿里云产品:云数据库 MongoDB版 云数据库MongoDB版支持ReplicaSet和Sharding两种部署架构,具备安全审计,时间点备份等多项企业能力。在互联网、物联网、游戏、金融等领域被广泛采用。 云数据库MongoDB版(ApsaraDB for MongoDB)完全兼容MongoDB协议,基于飞天分布式系统和高可靠存储引擎,提供多节点高可用架构、弹性扩容、容灾、备份回滚、性能优化等解决方案。 产品详情: https://www.aliyun.com/product/mongodb
目录
相关文章
|
5月前
|
NoSQL MongoDB Python
【Python】已完美解决(MongoDB安装报错)Service ‘MongoDB Server (MongoDB)’ (MongoDB) failed tostart
【Python】已完美解决(MongoDB安装报错)Service ‘MongoDB Server (MongoDB)’ (MongoDB) failed tostart
325 1
|
1月前
|
存储 分布式计算 NoSQL
MongoDB Map Reduce
10月更文挑战第23天
27 1
|
3月前
|
NoSQL MongoDB 数据库
python3操作MongoDB的crud以及聚合案例,代码可直接运行(python经典编程案例)
这篇文章提供了使用Python操作MongoDB数据库进行CRUD(创建、读取、更新、删除)操作的详细代码示例,以及如何执行聚合查询的案例。
38 6
|
3月前
|
NoSQL JavaScript Java
Java Python访问MongoDB
Java Python访问MongoDB
24 4
|
7月前
|
存储 NoSQL MongoDB
MongoDB数据库转换为表格文件的Python实现
MongoDB数据库转换为表格文件的Python实现
228 0
|
4月前
|
NoSQL 安全 MongoDB
用python安装mongodb
用python安装mongodb
32 0
|
5月前
|
NoSQL Shell MongoDB
【Python】已解决:(MongoDB安装报错)‘mongo’ 不是内部或外部命令,也不是可运行的程序
【Python】已解决:(MongoDB安装报错)‘mongo’ 不是内部或外部命令,也不是可运行的程序
588 0
|
7月前
|
NoSQL MongoDB Redis
Python与NoSQL数据库(MongoDB、Redis等)面试问答
【4月更文挑战第16天】本文探讨了Python与NoSQL数据库(如MongoDB、Redis)在面试中的常见问题,包括连接与操作数据库、错误处理、高级特性和缓存策略。重点介绍了使用`pymongo`和`redis`库进行CRUD操作、异常捕获以及数据一致性管理。通过理解这些问题、易错点及避免策略,并结合代码示例,开发者能在面试中展现其技术实力和实践经验。
496 9
Python与NoSQL数据库(MongoDB、Redis等)面试问答
|
7月前
|
分布式计算 C语言 Python
基于Python实现MapReduce
一、什么是MapReduce 首先,将这个单词分解为Map、Reduce。 • Map阶段:在这个阶段,输入数据集被分割成小块,并由多个Map任务处理。每个Map任务将输入数据映射为一系列(key, value)对,并生成中间结果。 • Reduce阶段:在这个阶段,中间结果被重新分组和排序,以便相同key的中间结果被传递到同一个Reduce任务。每个Reduce任务将具有相同key的中间结果合并、计算,并生成最终的输出。
|
6月前
|
NoSQL Shell MongoDB
python操作MongoDB部分
python操作MongoDB部分
40 0
下一篇
DataWorks