
(Applies to Hadoop 2.7 and later.) The following RESTful APIs are involved:

ResourceManager REST APIs: https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html
WebHDFS REST API: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/WebHDFS.html
MapReduce History Server REST APIs: https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/HistoryServerRest.html
Spark Monitoring and Instrumentation: http://spark.apache.org/docs/latest/monitoring.html

1. Real-time HDFS file system usage

URL: http://emr-header-1:50070/webhdfs/v1/?user.name=hadoop&op=GETCONTENTSUMMARY

Response:

{
  "ContentSummary": {
    "directoryCount": 2,
    "fileCount": 1,
    "length": 24930,
    "quota": -1,
    "spaceConsumed": 24930,
    "spaceQuota": -1
  }
}

Field descriptions:

{
  "name": "ContentSummary",
  "properties": {
    "ContentSummary": {
      "type": "object",
      "properties": {
        "directoryCount": {"description": "The number of directories.", "type": "integer", "required": true},
        "fileCount": {"description": "The number of files.", "type": "integer", "required": true},
        "length": {"description": "The number of bytes used by the content.", "type": "integer", "required": true},
        "quota": {"description": "The namespace quota of this directory.", "type": "integer", "required": true},
        "spaceConsumed": {"description": "The disk space consumed by the content.", "type": "integer", "required": true},
        "spaceQuota": {"description": "The disk space quota.", "type": "integer", "required": true}
      }
    }
  }
}

Note the relationship between length and spaceConsumed: it depends on the HDFS replication factor.

To get the usage of each group's working directory, use a request like: http://emr-header-1:50070/webhdfs/v1/user/feed_aliyun?user.name=hadoop&op=GETCONTENTSUMMARY

2. Real-time cluster information and status

URL: http://emr-header-1:8088/ws/v1/cluster

Response:

{
  "clusterInfo": {
    "id": 1495123166259,
    "startedOn": 1495123166259,
    "state": "STARTED",
    "haState": "ACTIVE",
    "rmStateStoreName": "org.apache.hadoop.yarn.server.resourcemanager.recovery.NullRMStateStore",
    "resourceManagerVersion": "2.7.2",
    "resourceManagerBuildVersion": "2.7.2 from 4bee04d3d1c27d7ef559365d3bdd2a8620807bfc by root source checksum c63f7cc71b8f63249e35126f0f7492d",
    "resourceManagerVersionBuiltOn": "2017-04-17T12:28Z",
    "hadoopVersion": "2.7.2",
    "hadoopBuildVersion": "2.7.2 from 4bee04d3d1c27d7ef559365d3bdd2a8620807bfc by root source checksum 3329b146070a2bc9e249fa9ba9fb55",
    "hadoopVersionBuiltOn": "2017-04-17T12:18Z",
    "haZooKeeperConnectionState": "ResourceManager HA is not enabled."
  }
}
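Both endpoints above are plain HTTP GETs, so they can be checked quickly with curl (a minimal sketch; the emr-header-1 host name and hadoop user follow the examples in this post — substitute your own):

# HDFS content summary for the root directory and for one group's working directory
curl -s 'http://emr-header-1:50070/webhdfs/v1/?user.name=hadoop&op=GETCONTENTSUMMARY'
curl -s 'http://emr-header-1:50070/webhdfs/v1/user/feed_aliyun?user.name=hadoop&op=GETCONTENTSUMMARY'

# Cluster information from the ResourceManager
curl -s 'http://emr-header-1:8088/ws/v1/cluster'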
3. Real-time queue information, including each queue's quota configuration and current resource usage

URL: http://emr-header-1:8088/ws/v1/cluster/scheduler

Response:

{
  "scheduler": {
    "schedulerInfo": {
      "type": "capacityScheduler",
      "capacity": 100, "usedCapacity": 0, "maxCapacity": 100,
      "queueName": "root",
      "queues": { "queue": [
        { "type": "capacitySchedulerLeafQueueInfo",
          "capacity": 1, "usedCapacity": 0, "maxCapacity": 90,
          "absoluteCapacity": 1, "absoluteMaxCapacity": 90, "absoluteUsedCapacity": 0,
          "numApplications": 0, "queueName": "algorithm_aliyun", "state": "RUNNING",
          "resourcesUsed": { "memory": 0, "vCores": 0 },
          "hideReservationQueues": false, "nodeLabels": [ "*" ],
          "numActiveApplications": 0, "numPendingApplications": 0, "numContainers": 0,
          "maxApplications": 100, "maxApplicationsPerUser": 100,
          "userLimit": 100, "users": null, "userLimitFactor": 1,
          "AMResourceLimit": { "memory": 11776, "vCores": 7 },
          "usedAMResource": { "memory": 0, "vCores": 0 },
          "userAMResourceLimit": { "memory": 160, "vCores": 1 },
          "preemptionDisabled": true },
        { "type": "capacitySchedulerLeafQueueInfo",
          "capacity": 1, "usedCapacity": 0, "maxCapacity": 90,
          "absoluteCapacity": 1, "absoluteMaxCapacity": 90, "absoluteUsedCapacity": 0,
          "numApplications": 0, "queueName": "dcps_aliyun", "state": "RUNNING",
          "resourcesUsed": { "memory": 0, "vCores": 0 },
          "hideReservationQueues": false, "nodeLabels": [ "*" ],
          "numActiveApplications": 0, "numPendingApplications": 0, "numContainers": 0,
          "maxApplications": 100, "maxApplicationsPerUser": 100,
          "userLimit": 100, "users": null, "userLimitFactor": 1,
          "AMResourceLimit": { "memory": 11776, "vCores": 7 },
          "usedAMResource": { "memory": 0, "vCores": 0 },
          "userAMResourceLimit": { "memory": 160, "vCores": 1 },
          "preemptionDisabled": true },
        { "type": "capacitySchedulerLeafQueueInfo",
          "capacity": 31, "usedCapacity": 0, "maxCapacity": 100,
          "absoluteCapacity": 31, "absoluteMaxCapacity": 100, "absoluteUsedCapacity": 0,
          "numApplications": 0, "queueName": "default", "state": "RUNNING",
          "resourcesUsed": { "memory": 0, "vCores": 0 },
          "hideReservationQueues": false, "nodeLabels": [ "*" ],
          "numActiveApplications": 0, "numPendingApplications": 0, "numContainers": 0,
          "maxApplications": 3100, "maxApplicationsPerUser": 3100,
          "userLimit": 100, "users": null, "userLimitFactor": 1,
          "AMResourceLimit": { "memory": 13088, "vCores": 8 },
          "usedAMResource": { "memory": 0, "vCores": 0 },
          "userAMResourceLimit": { "memory": 4064, "vCores": 3 },
          "preemptionDisabled": true },
        { "type": "capacitySchedulerLeafQueueInfo",
          "capacity": 15.000001, "usedCapacity": 0, "maxCapacity": 100,
          "absoluteCapacity": 15.000001, "absoluteMaxCapacity": 100, "absoluteUsedCapacity": 0,
          "numApplications": 0, "queueName": "feed_aliyun", "state": "RUNNING",
          "resourcesUsed": { "memory": 0, "vCores": 0 },
          "hideReservationQueues": false, "nodeLabels": [ "*" ],
          "numActiveApplications": 0, "numPendingApplications": 0, "numContainers": 0,
          "maxApplications": 1500, "maxApplicationsPerUser": 7500,
          "userLimit": 100, "users": null, "userLimitFactor": 5,
          "AMResourceLimit": { "memory": 12320, "vCores": 8 },
          "usedAMResource": { "memory": 0, "vCores": 0 },
          "userAMResourceLimit": { "memory": 9856, "vCores": 7 },
          "preemptionDisabled": true },
        { "type": "capacitySchedulerLeafQueueInfo",
          "capacity": 51, "usedCapacity": 0, "maxCapacity": 90,
          "absoluteCapacity": 51, "absoluteMaxCapacity": 90, "absoluteUsedCapacity": 0,
          "numApplications": 0, "queueName": "hot_aliyun", "state": "RUNNING",
          "resourcesUsed": { "memory": 0, "vCores": 0 },
          "hideReservationQueues": false, "nodeLabels": [ "*" ],
          "numActiveApplications": 0, "numPendingApplications": 0, "numContainers": 0,
          "maxApplications": 5100, "maxApplicationsPerUser": 5100,
          "userLimit": 100, "users": null, "userLimitFactor": 1,
          "AMResourceLimit": { "memory": 11776, "vCores": 7 },
          "usedAMResource": { "memory": 0, "vCores": 0 },
          "userAMResourceLimit": { "memory": 6688, "vCores": 5 },
          "preemptionDisabled": true },
        { "type": "capacitySchedulerLeafQueueInfo",
          "capacity": 1, "usedCapacity": 0, "maxCapacity": 90,
          "absoluteCapacity": 1, "absoluteMaxCapacity": 90, "absoluteUsedCapacity": 0,
          "numApplications": 0, "queueName": "push_aliyun", "state": "RUNNING",
          "resourcesUsed": { "memory": 0, "vCores": 0 },
          "hideReservationQueues": false, "nodeLabels": [ "*" ],
          "numActiveApplications": 0, "numPendingApplications": 0, "numContainers": 0,
          "maxApplications": 100, "maxApplicationsPerUser": 100,
          "userLimit": 100, "users": null, "userLimitFactor": 1,
          "AMResourceLimit": { "memory": 11776, "vCores": 7 },
          "usedAMResource": { "memory": 0, "vCores": 0 },
          "userAMResourceLimit": { "memory": 160, "vCores": 1 },
          "preemptionDisabled": true }
      ] }
    }
  }
}

For the details of these parameters see: https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Application_Queue_API

4. Real-time application list. Each entry carries the application's runtime details: name, id, state, start and finish time, and resource usage.

URL: http://emr-header-1:8088/ws/v1/cluster/apps

Response:

{
  "apps": { "app": [
    { "finishedTime": 1326815598530,
      "amContainerLogs": "http://host.domain.com:8042/node/containerlogs/container_1326815542473_0001_01_000001",
      "trackingUI": "History", "state": "FINISHED", "user": "user1",
      "id": "application_1326815542473_0001", "clusterId": 1326815542473,
      "finalStatus": "SUCCEEDED", "amHostHttpAddress": "host.domain.com:8042",
      "progress": 100, "name": "word count",
      "startedTime": 1326815573334, "elapsedTime": 25196, "diagnostics": "",
      "trackingUrl": "http://host.domain.com:8088/proxy/application_1326815542473_0001/jobhistory/job/job_1326815542473_1_1",
      "queue": "default", "allocatedMB": 0, "allocatedVCores": 0, "runningContainers": 0,
      "memorySeconds": 151730, "vcoreSeconds": 103 },
    { "finishedTime": 1326815789546,
      "amContainerLogs": "http://host.domain.com:8042/node/containerlogs/container_1326815542473_0002_01_000001",
      "trackingUI": "History", "state": "FINISHED", "user": "user1",
      "id": "application_1326815542473_0002", "clusterId": 1326815542473,
      "finalStatus": "SUCCEEDED", "amHostHttpAddress": "host.domain.com:8042",
      "progress": 100, "name": "Sleep job",
      "startedTime": 1326815641380, "elapsedTime": 148166, "diagnostics": "",
      "trackingUrl": "http://host.domain.com:8088/proxy/application_1326815542473_0002/jobhistory/job/job_1326815542473_2_2",
      "queue": "default", "allocatedMB": 0, "allocatedVCores": 0, "runningContainers": 1,
      "memorySeconds": 640064, "vcoreSeconds": 442 }
  ] }
}

To restrict the statistics to a fixed time window, append "?finishedTimeBegin={timestamp}&finishedTimeEnd={timestamp}", e.g. http://emr-header-1:8088/ws/v1/cluster/apps?finishedTimeBegin=1496742124000&finishedTimeEnd=1496742134000
5. Measuring how much data a job scanned

The amount of data a job scans has to be queried through the History Server's RESTful API, and the MapReduce and Spark cases differ slightly.

5.1 MapReduce job input volume

URL: http://emr-header-1:19888/ws/v1/history/mapreduce/jobs/job_1495123166259_0962/counters

Response:

{
  "jobCounters": {
    "id": "job_1326381300833_2_2",
    "counterGroup": [
      { "counterGroupName": "Shuffle Errors",
        "counter": [
          {"reduceCounterValue": 0, "mapCounterValue": 0, "totalCounterValue": 0, "name": "BAD_ID"},
          {"reduceCounterValue": 0, "mapCounterValue": 0, "totalCounterValue": 0, "name": "CONNECTION"},
          {"reduceCounterValue": 0, "mapCounterValue": 0, "totalCounterValue": 0, "name": "IO_ERROR"},
          {"reduceCounterValue": 0, "mapCounterValue": 0, "totalCounterValue": 0, "name": "WRONG_LENGTH"},
          {"reduceCounterValue": 0, "mapCounterValue": 0, "totalCounterValue": 0, "name": "WRONG_MAP"},
          {"reduceCounterValue": 0, "mapCounterValue": 0, "totalCounterValue": 0, "name": "WRONG_REDUCE"}
        ] },
      { "counterGroupName": "org.apache.hadoop.mapreduce.FileSystemCounter",
        "counter": [
          {"reduceCounterValue": 0, "mapCounterValue": 0, "totalCounterValue": 2483, "name": "FILE_BYTES_READ"},
          {"reduceCounterValue": 0, "mapCounterValue": 0, "totalCounterValue": 108525, "name": "FILE_BYTES_WRITTEN"},
          {"reduceCounterValue": 0, "mapCounterValue": 0, "totalCounterValue": 0, "name": "FILE_READ_OPS"},
          {"reduceCounterValue": 0, "mapCounterValue": 0, "totalCounterValue": 0, "name": "FILE_LARGE_READ_OPS"},
          {"reduceCounterValue": 0, "mapCounterValue": 0, "totalCounterValue": 0, "name": "FILE_WRITE_OPS"},
          {"reduceCounterValue": 0, "mapCounterValue": 0, "totalCounterValue": 48, "name": "HDFS_BYTES_READ"},
          {"reduceCounterValue": 0, "mapCounterValue": 0, "totalCounterValue": 0, "name": "HDFS_BYTES_WRITTEN"},
          {"reduceCounterValue": 0, "mapCounterValue": 0, "totalCounterValue": 1, "name": "HDFS_READ_OPS"},
          {"reduceCounterValue": 0, "mapCounterValue": 0, "totalCounterValue": 0, "name": "HDFS_LARGE_READ_OPS"},
          {"reduceCounterValue": 0, "mapCounterValue": 0, "totalCounterValue": 0, "name": "HDFS_WRITE_OPS"}
        ] },
      { "counterGroupName": "org.apache.hadoop.mapreduce.TaskCounter",
        "counter": [
          {"reduceCounterValue": 0, "mapCounterValue": 0, "totalCounterValue": 1, "name": "MAP_INPUT_RECORDS"},
          {"reduceCounterValue": 0, "mapCounterValue": 0, "totalCounterValue": 1200, "name": "MAP_OUTPUT_RECORDS"},
          {"reduceCounterValue": 0, "mapCounterValue": 0, "totalCounterValue": 4800, "name": "MAP_OUTPUT_BYTES"},
          {"reduceCounterValue": 0, "mapCounterValue": 0, "totalCounterValue": 2235, "name": "MAP_OUTPUT_MATERIALIZED_BYTES"},
          {"reduceCounterValue": 0, "mapCounterValue": 0, "totalCounterValue": 48, "name": "SPLIT_RAW_BYTES"},
          {"reduceCounterValue": 0, "mapCounterValue": 0, "totalCounterValue": 0, "name": "COMBINE_INPUT_RECORDS"},
          {"reduceCounterValue": 0, "mapCounterValue": 0, "totalCounterValue": 0, "name": "COMBINE_OUTPUT_RECORDS"},
          {"reduceCounterValue": 0, "mapCounterValue": 0, "totalCounterValue": 1200, "name": "REDUCE_INPUT_GROUPS"},
          {"reduceCounterValue": 0, "mapCounterValue": 0, "totalCounterValue": 2235, "name": "REDUCE_SHUFFLE_BYTES"},
          {"reduceCounterValue": 0, "mapCounterValue": 0, "totalCounterValue": 1200, "name": "REDUCE_INPUT_RECORDS"},
          {"reduceCounterValue": 0, "mapCounterValue": 0, "totalCounterValue": 0, "name": "REDUCE_OUTPUT_RECORDS"},
          {"reduceCounterValue": 0, "mapCounterValue": 0, "totalCounterValue": 2400, "name": "SPILLED_RECORDS"},
          {"reduceCounterValue": 0, "mapCounterValue": 0, "totalCounterValue": 1, "name": "SHUFFLED_MAPS"},
          {"reduceCounterValue": 0, "mapCounterValue": 0, "totalCounterValue": 0, "name": "FAILED_SHUFFLE"},
          {"reduceCounterValue": 0, "mapCounterValue": 0, "totalCounterValue": 1, "name": "MERGED_MAP_OUTPUTS"},
          {"reduceCounterValue": 0, "mapCounterValue": 0, "totalCounterValue": 113, "name": "GC_TIME_MILLIS"},
          {"reduceCounterValue": 0, "mapCounterValue": 0, "totalCounterValue": 1830, "name": "CPU_MILLISECONDS"},
          {"reduceCounterValue": 0, "mapCounterValue": 0, "totalCounterValue": 478068736, "name": "PHYSICAL_MEMORY_BYTES"},
          {"reduceCounterValue": 0, "mapCounterValue": 0, "totalCounterValue": 2159284224, "name": "VIRTUAL_MEMORY_BYTES"},
          {"reduceCounterValue": 0, "mapCounterValue": 0, "totalCounterValue": 378863616, "name": "COMMITTED_HEAP_BYTES"}
        ] },
      { "counterGroupName": "org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter",
        "counter": [
          {"reduceCounterValue": 0, "mapCounterValue": 0, "totalCounterValue": 0, "name": "BYTES_READ"}
        ] },
      { "counterGroupName": "org.apache.hadoop.mapreduce.lib.output.FileOutputFormatCounter",
        "counter": [
          {"reduceCounterValue": 0, "mapCounterValue": 0, "totalCounterValue": 0, "name": "BYTES_WRITTEN"}
        ] }
    ]
  }
}

The BYTES_READ counter in the org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter group is the amount of data the job scanned. Parameter details: https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/HistoryServerRest.html#Job_Counters_API

5.2 Spark job input volume

URL: http://emr-header-1:18080/api/v1/applications/application_1495123166259_1050/executors

The sum of totalInputBytes over all executors is the whole job's input volume. More details: http://spark.apache.org/docs/latest/monitoring.html
Background

The previous post introduced the principles of YARN's capacity scheduler and experimented with using it on an EMR cluster to isolate cluster resources and enforce quotas. This post covers the ACL side of the capacity scheduler on an EMR cluster.

Why bother? The previous post split the cluster's resources into several queues, each with its own resource share and scheduling priority. If every tenant sticks to the convention and submits jobs only to its own queue, all is well. But a user who understands how the capacity scheduler works can just as easily occupy another group's queue. Hence the capacity scheduler's ACL settings.

Key parameters

yarn.scheduler.capacity.queue-mappings — maps users to queues. With a mapping in place, a user's jobs land in the right queue without specifying any queue parameter, which is convenient. The format is: [u|g]:[name]:[queue_name][,next mapping]*

yarn.scheduler.capacity.root.{queue-path}.acl_administer_queue — who can administer the jobs in this queue; in the docs: "The ACL of who can administer jobs on the queue." An asterisk (*) means everyone; a single space means no one.

yarn.scheduler.capacity.root.{queue-path}.acl_submit_applications — who can submit jobs to this queue; in the docs: "The ACL of who can submit applications to the queue." Again, * means everyone and a single space means no one.

Steps on an EMR cluster

Create an EMR cluster.

Change the configuration to enable queue ACLs:

yarn-site: yarn.acl.enable=true
mapred-site: mapreduce.cluster.acls.enabled=true
hdfs-site: dfs.permissions.enabled=true (this has nothing to do with the capacity scheduler queue ACLs — it controls HDFS permissions — but is set here in passing)
mapred-site: mapreduce.job.acl-view-job=* (needed once dfs.permissions.enabled=true is set, otherwise job details cannot be viewed in the Hadoop UI)

Restart YARN and HDFS so the configuration takes effect (as root):

su -l hdfs -c '/usr/lib/hadoop-current/sbin/stop-dfs.sh'
su -l hadoop -c '/usr/lib/hadoop-current/sbin/stop-yarn.sh'
su -l hdfs -c '/usr/lib/hadoop-current/sbin/start-dfs.sh'
su -l hadoop -c '/usr/lib/hadoop-current/sbin/start-yarn.sh'
su -l hadoop -c '/usr/lib/hadoop-current/sbin/yarn-daemon.sh start proxyserver'

The complete capacity scheduler configuration:

<configuration>
<property> <name>yarn.scheduler.capacity.maximum-applications</name> <value>10000</value> <description>Maximum number of applications that can be pending and running.</description> </property>
<property> <name>yarn.scheduler.capacity.maximum-am-resource-percent</name> <value>0.25</value> <description>Maximum percent of resources in the cluster which can be used to run application masters i.e. controls number of concurrent running applications.</description> </property>
<property> <name>yarn.scheduler.capacity.resource-calculator</name> <value>org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator</value> <description>The ResourceCalculator implementation to be used to compare Resources in the scheduler. The default i.e. DefaultResourceCalculator only uses Memory while DominantResourceCalculator uses dominant-resource to compare multi-dimensional resources such as Memory, CPU etc.</description> </property>
<property> <name>yarn.scheduler.capacity.root.queues</name> <value>a,b,default</value> <description>The queues at the this level (root is the root queue).</description> </property>
<property> <name>yarn.scheduler.capacity.root.default.capacity</name> <value>20</value> <description>Default queue target capacity.</description> </property>
<property> <name>yarn.scheduler.capacity.root.a.capacity</name> <value>30</value> <description>Default queue target capacity.</description> </property>
<property> <name>yarn.scheduler.capacity.root.b.capacity</name> <value>50</value> <description>Default queue target capacity.</description> </property>
<property> <name>yarn.scheduler.capacity.root.default.user-limit-factor</name> <value>1</value> <description>Default queue user limit a percentage from 0.0 to 1.0.</description> </property>
<property> <name>yarn.scheduler.capacity.root.default.maximum-capacity</name> <value>100</value> <description>The maximum capacity of the default queue.</description> </property>
<property> <name>yarn.scheduler.capacity.root.default.state</name> <value>RUNNING</value> <description>The state of the default queue. State can be one of RUNNING or STOPPED.</description> </property>
<property> <name>yarn.scheduler.capacity.root.a.state</name> <value>RUNNING</value> <description>The state of the default queue. State can be one of RUNNING or STOPPED.</description> </property>
<property> <name>yarn.scheduler.capacity.root.b.state</name> <value>RUNNING</value> <description>The state of the default queue. State can be one of RUNNING or STOPPED.</description> </property>
<property> <name>yarn.scheduler.capacity.root.acl_submit_applications</name> <value> </value> <description>The ACL of who can submit jobs to the root queue.</description> </property>
<property> <name>yarn.scheduler.capacity.root.a.acl_submit_applications</name> <value>root</value> <description>The ACL of who can submit jobs to the default queue.</description> </property>
<property> <name>yarn.scheduler.capacity.root.b.acl_submit_applications</name> <value>hadoop</value> <description>The ACL of who can submit jobs to the default queue.</description> </property>
<property> <name>yarn.scheduler.capacity.root.default.acl_submit_applications</name> <value>root</value> <description>The ACL of who can submit jobs to the default queue.</description> </property>
<property> <name>yarn.scheduler.capacity.root.acl_administer_queue</name> <value> </value> <description>The ACL of who can administer jobs on the default queue.</description> </property>
<property> <name>yarn.scheduler.capacity.root.default.acl_administer_queue</name> <value>root</value> <description>The ACL of who can administer jobs on the default queue.</description> </property>
<property> <name>yarn.scheduler.capacity.root.a.acl_administer_queue</name> <value>root</value> <description>The ACL of who can administer jobs on the default queue.</description> </property>
<property> <name>yarn.scheduler.capacity.root.b.acl_administer_queue</name> <value>root</value> <description>The ACL of who can administer jobs on the default queue.</description> </property>
<property> <name>yarn.scheduler.capacity.node-locality-delay</name> <value>40</value> <description>Number of missed scheduling opportunities after which the CapacityScheduler attempts to schedule rack-local containers. Typically this should be set to number of nodes in the cluster, By default is setting approximately number of nodes in one rack which is 40.</description> </property>
<property> <name>yarn.scheduler.capacity.queue-mappings</name> <value>u:hadoop:b,u:root:a</value> </property>
<property> <name>yarn.scheduler.capacity.queue-mappings-override.enable</name> <value>false</value> <description>If a queue mapping is present, will it override the value specified by the user? This can be used by administrators to place jobs in queues that are different than the one specified by the user. The default is false.</description> </property>
</configuration>

This configuration defines three queues with their resource shares, and maps user hadoop to queue b by default (i.e. when no queue is specified) and root to queue a. hadoop may submit jobs only to queue b, root may submit to every queue, and other users have no permission to submit jobs at all.

Pitfalls

The acl_administer_queue setting

Two operations can be ACL'd: acl_administer_queue and acl_submit_applications. Semantically — and according to the documentation — controlling who may submit jobs should only require the queue's acl_submit_applications property. In practice that is not enough: anyone with administer permission can submit jobs too. This took a long time to track down; only reading the source made it clear:

@Override
public void submitApplication(ApplicationId applicationId, String userName, String queue)
    throws AccessControlException {
  // Careful! Locking order is important!
  // Check queue ACLs
  UserGroupInformation userUgi = UserGroupInformation.createRemoteUser(userName);
  if (!hasAccess(QueueACL.SUBMIT_APPLICATIONS, userUgi)
      && !hasAccess(QueueACL.ADMINISTER_QUEUE, userUgi)) {
    throw new AccessControlException("User " + userName + " cannot submit"
        + " applications to queue " + getQueuePath());
  }

The root queue setting

To restrict users' access to a queue, the root queue must be configured as well — configuring only the leaf queues is not enough, because the parent's permission takes precedence. As the code comment puts it: "// recursively look up the queue to see if parent queue has the permission". This is not what most people would expect, so lock down the root queue's permissions first; otherwise none of the per-child-queue restrictions has any effect.

<property> <name>yarn.scheduler.capacity.root.acl_submit_applications</name> <value> </value> <description>The ACL of who can submit jobs to the root queue.</description> </property>
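After refreshing the configuration, the effective ACLs can be double-checked from the command line; mapred queue -showacls prints, for the current user, which operations are allowed on each queue (a quick verification step, not part of the original walkthrough):

su -l hadoop -c 'mapred queue -showacls'
# hadoop should show SUBMIT_APPLICATIONS only on queue b;
# running the same command as root should show access to a, b and default as well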
Background

Anyone who has used Hadoop has had to think about resource scheduling and priorities on the cluster. Suppose your company has one large Hadoop cluster shared by several business groups: project group A regularly produces scheduled BI reports, while group B often runs ad-hoc jobs with various tools. The two will inevitably submit jobs at the same time — so how should resources be allocated to satisfy both? Run A's job first and then B's, or run both at once?

Some large companies using EMR run one big cluster shared by all internal business groups rather than many small clusters. Compared with several small clusters, a big cluster offers a higher peak capacity, and by scheduling around the peaks and troughs of different workloads it achieves higher resource utilization and lower total cost. It also makes it easier to share resources and move data between business groups. Below, using an EMR cluster, is how to enforce resource quotas on such a large cluster.

YARN ships with two scheduling policies, the capacity scheduler and the fair scheduler; the capacity scheduler is currently the more widely used. For the implementation principles and scheduling source code, search for "capacity scheduler".

What is the capacity scheduler

Put plainly, the capacity scheduler is a set of resource queues that you define yourself. For instance, the cluster might be split into two queues, queue1 and queue2, queue1 for one business group and queue2 for another. If the first group itself has two directions — say dedicated BI and real-time analytics — the queue can be subdivided further. The queue layout can then follow this tree:

root
------default[20%]
------q1[60%]
 |---q1.q11[70%]
 |---q1.q12[30%]
------q2[20%]

Every queue in the cluster must hang under root. Here there are three queues — default, q1 and q2 — with cluster resource quotas of 20%, 60% and 20%. The default queue must exist.

This allocation does not mean that a job submitted to q2 can only ever use 20% of the resources while the other 80% sit idle. It can be arranged (through configuration) that as long as resources are idle, q2 may use up to 100% of them; but once other queues submit jobs, q2 must hand resources back as they are released, until its usage falls to the preset share. That is the coarse-grained picture; inside each queue, resources are still allocated FIFO.

Capacity scheduler features

- Hierarchical queue design: the hierarchy guarantees that child queues can use all of the resources configured on their parent; hierarchical management makes it easier to allocate and cap resource usage sensibly.
- Capacity guarantees: every queue is assigned a resource share, so no single queue can occupy the whole cluster.
- Security: each queue has strict access control; users can only submit to their own queues, and cannot modify or access jobs in other queues.
- Elastic allocation: idle resources can be given to any queue; when several queues contend, resources are balanced back according to the configured ratios.
- Multi-tenancy: with per-queue capacity limits, many users can share one cluster while each queue is still guaranteed its allocated capacity, raising overall utilization.
- Operability: YARN supports changing capacities, permissions and other settings dynamically at runtime, and provides an administration UI showing the current queue state. An administrator can add a queue at runtime but cannot delete one, and can stop a queue at runtime so the cluster accepts no further tasks for it: once a queue is set to STOPPED, nothing can be submitted to it or its child queues.

Configuring the capacity scheduler

Log in to the EMR cluster's master node and edit the configuration file /etc/emr/hadoop-conf/capacity-scheduler.xml. Here first is a complete, somewhat involved configuration; the details are explained afterwards:

<?xml version="1.0" encoding="utf-8"?>
<configuration>
<property> <name>yarn.scheduler.capacity.maximum-applications</name> <value>10000</value> <description>Maximum number of applications that can be pending and running.</description> </property>
<property> <name>yarn.scheduler.capacity.maximum-am-resource-percent</name> <value>0.25</value> <description>Maximum percent of resources in the cluster which can be used to run application masters i.e. controls number of concurrent running applications.</description> </property>
<property> <name>yarn.scheduler.capacity.resource-calculator</name> <value>org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator</value> <description>The ResourceCalculator implementation to be used to compare Resources in the scheduler. The default i.e. DefaultResourceCalculator only uses Memory while DominantResourceCalculator uses dominant-resource to compare multi-dimensional resources such as Memory, CPU etc.</description> </property>
<property> <name>yarn.scheduler.capacity.root.queues</name> <value>default,q1,q2</value> <description>The queues at the this level (root is the root queue).</description> </property>
<property> <name>yarn.scheduler.capacity.root.q1.queues</name> <value>q11,q12</value> <description>The queues at the this level (root is the root queue).</description> </property>
<property> <name>yarn.scheduler.capacity.root.default.capacity</name> <value>20</value> <description>Default queue target capacity.</description> </property>
<property> <name>yarn.scheduler.capacity.root.q1.capacity</name> <value>60</value> <description>Default queue target capacity.</description> </property>
<property> <name>yarn.scheduler.capacity.root.q2.capacity</name> <value>20</value> <description>Default queue target capacity.</description> </property>
<property> <name>yarn.scheduler.capacity.root.q1.q11.capacity</name> <value>70</value> <description>Default queue target capacity.</description> </property>
<property> <name>yarn.scheduler.capacity.root.q1.q11.maximum-capacity</name> <value>90</value> <description>Default queue target capacity.</description> </property>
<property> <name>yarn.scheduler.capacity.root.q1.q11.minimum-user-limit-percent</name> <value>25</value> <description>Default queue target capacity.</description> </property>
<property> <name>yarn.scheduler.capacity.root.q1.q12.capacity</name> <value>30</value> <description>Default queue target capacity.</description> </property>
<property> <name>yarn.scheduler.capacity.root.q1.q12.user-limit-factor</name> <value>0.7</value> <description>Default queue target capacity.</description> </property>
<property> <name>yarn.scheduler.capacity.root.q2.user-limit-factor</name> <value>0.4</value> <description>Default queue user limit a percentage from 0.0 to 1.0.</description> </property>
<property> <name>yarn.scheduler.capacity.root.q2.maximum-capacity</name> <value>100</value> <description>The maximum capacity of the default queue.</description> </property>
<property> <name>yarn.scheduler.capacity.root.q1.state</name> <value>RUNNING</value> <description>The state of the default queue. State can be one of RUNNING or STOPPED.</description> </property>
<property> <name>yarn.scheduler.capacity.root.q2.state</name> <value>RUNNING</value> <description>The state of the default queue. State can be one of RUNNING or STOPPED.</description> </property>
<property> <name>yarn.scheduler.capacity.root.q1.acl_submit_applications</name> <value>*</value> <description>The ACL of who can submit jobs to the default queue.</description> </property>
<property> <name>yarn.scheduler.capacity.root.q2.acl_submit_applications</name> <value>*</value> <description>The ACL of who can submit jobs to the default queue.</description> </property>
<property> <name>yarn.scheduler.capacity.root.q1.acl_administer_queue</name> <value>*</value> <description>The ACL of who can administer jobs on the default queue.</description> </property>
<property> <name>yarn.scheduler.capacity.root.q2.acl_administer_queue</name> <value>*</value> <description>The ACL of who can administer jobs on the default queue.</description> </property>
<property> <name>yarn.scheduler.capacity.node-locality-delay</name> <value>40</value> <description>Number of missed scheduling opportunities after which the CapacityScheduler attempts to schedule rack-local containers. Typically this should be set to number of nodes in the cluster, By default is setting approximately number of nodes in one rack which is 40.</description> </property>
<property> <name>yarn.scheduler.capacity.queue-mappings</name> <value/> <description>A list of mappings that will be used to assign jobs to queues The syntax for this list is [u|g]:[name]:[queue_name][,next mapping]* Typically this list will be used to map users to queues, for example, u:%user:%user maps all users to queues with the same name as the user.</description> </property>
<property> <name>yarn.scheduler.capacity.queue-mappings-override.enable</name> <value>false</value> <description>If a queue mapping is present, will it override the value specified by the user? This can be used by administrators to place jobs in queues that are different than the one specified by the user. The default is false.</description> </property>
</configuration>

Then, as the hadoop user, refresh the queue configuration:

/usr/lib/hadoop-current/bin/yarn rmadmin -refreshQueues

Open the Hadoop UI and the new queues are visible.

Parameter notes

The queue property is yarn.scheduler.capacity.${queue-path}.capacity. For example, the following declares three top-level queues — default, q1, q2 — with q1 further split into q11 and q12:

<property> <name>yarn.scheduler.capacity.root.queues</name> <value>default,q1,q2</value> <description>The queues at the this level (root is the root queue).</description> </property>
<property> <name>yarn.scheduler.capacity.root.q1.queues</name> <value>q11,q12</value> <description>The queues at the this level (root is the root queue).</description> </property>

The following sets each top-level queue's resource quota:

<property> <name>yarn.scheduler.capacity.root.default.capacity</name> <value>20</value> <description>Default queue target capacity.</description> </property>
<property> <name>yarn.scheduler.capacity.root.q1.capacity</name> <value>60</value> <description>Default queue target capacity.</description> </property>
<property> <name>yarn.scheduler.capacity.root.q2.capacity</name> <value>20</value> <description>Default queue target capacity.</description> </property>

And the quotas of the two sub-queues under q1:

<property> <name>yarn.scheduler.capacity.root.q1.q11.capacity</name> <value>70</value> <description>Default queue target capacity.</description> </property>
<property> <name>yarn.scheduler.capacity.root.q1.q12.capacity</name> <value>30</value> <description>Default queue target capacity.</description> </property>

The next property caps how much of q1's resources q1.q11 may use. q11's configured share is 70% of q1 (see above), but if nothing else is running in q1 and it is otherwise idle, q11 may grow up to 90% of q1's total resources; if q1 does not have that much idle capacity, q11 only gets its 70%:

<property> <name>yarn.scheduler.capacity.root.q1.q11.maximum-capacity</name> <value>90</value> <description>Default queue target capacity.</description> </property>

The following sets the per-user floor inside q1.q11: when several users have jobs pending in q11, each active user is guaranteed at least 25% of the queue's resources:

<property> <name>yarn.scheduler.capacity.root.q1.q11.minimum-user-limit-percent</name> <value>25</value> <description>Default queue target capacity.</description> </property>

Other important ACL-related settings: yarn.scheduler.capacity.root.q1.acl_submit_applications says which users/groups may submit jobs to q1; yarn.scheduler.capacity.queue-mappings is the powerful one — it maps users/groups to queues, with the format [u|g]:[name]:[queue_name][,next mapping]*

Worked examples

The examples below were run on an EMR cluster configured as above. Note that if no queue is specified, jobs go to default:

hadoop jar /usr/lib/hadoop-current/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar randomtextwriter -D mapreduce.randomtextwriter.totalbytes=100000000000 /HiBench/Wordcount/Input

Specifying a queue that does not exist produces an error:

hadoop jar /usr/lib/hadoop-current/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar randomtextwriter -D mapreduce.randomtextwriter.totalbytes=100000000000 -D mapreduce.job.queuename=notExist /HiBench/Wordcount/Input3

Caused by: org.apache.hadoop.yarn.exceptions.YarnException: Failed to submit application_1494398749894_0010 to YARN : Application application_1494398749894_0010 submitted by user hadoop to unknown queue: notExist

Submitting another job to default leaves it waiting, because default's 20% quota is already used up.

Next, submit a job to q2:

hadoop jar /usr/lib/hadoop-current/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar randomtextwriter -D mapreduce.randomtextwriter.totalbytes=100000000000 -D mapreduce.job.queuename=q2 /HiBench/Wordcount/Input3

q2's user-limit-factor of 0.4 caps a single user at 40% of the queue's resources; the UI shows a little over 50% in use, and even with other maps waiting, no more resources are taken. This is not perfectly precise: it depends on the overall cluster configuration and on how much each map consumes. One map's allotment is the minimum scheduling unit, so usage will not land exactly on the configured ratio and may overshoot it slightly.

Submit a job to q1.q12:

hadoop jar /usr/lib/hadoop-current/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar randomtextwriter -D mapreduce.randomtextwriter.totalbytes=100000000000 -D mapreduce.job.queuename=q12 /HiBench/Wordcount/Input4

q12 has yarn.scheduler.capacity.root.q1.q12.user-limit-factor set to 0.7, and indeed only about 78% of q12 ends up used.

Submit a job to q11 and it fills q11 right up, because yarn.scheduler.capacity.root.q1.q11.maximum-capacity is 90%, i.e. q11 may take up to 90% of q1's total resources:

hadoop jar /usr/lib/hadoop-current/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar randomtextwriter -D mapreduce.randomtextwriter.totalbytes=100000000000 -D mapreduce.job.queuename=q11 /HiBench/Wordcount/Input5

As these examples show, YARN's capacity scheduler is quite powerful for controlling cluster resources and scheduling priorities. The examples used Hadoop jobs, where the queue is selected with -D mapreduce.job.queuename=${queue-name}; for a Spark job, use --queue ${queue-name}, as sketched below. These submission options can also be set directly on jobs in the EMR console.
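For a Spark job the same queues apply; a hypothetical submission pinned to q11 could look like the following (spark-submit's --queue flag is standard, but the jar path and class here are placeholders — adjust to your installation):

su -l hadoop -c 'spark-submit --master yarn --queue q11 --class org.apache.spark.examples.SparkPi /path/to/spark-examples.jar 100'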
This post is a brief introduction to using Oozie workflows on an E-MapReduce cluster, with sample code that can be used directly. The whole walkthrough is done in a Zeppelin notebook. Feel free to download and import [notebook.json] and run it straight away in Zeppelin, or browse it on the ZeppelinHub Viewer page: [ZeppelinHub Viewer].
1. Overview

RRDtool stands for "Round Robin Database tool"; its author also created MRTG. The official site is http://oss.oetiker.ch/rrdtool/.

"Round robin" here is a way of storing data: a fixed-size space holds the data, and a pointer marks the position of the newest entry. Think of the database's storage space as a circle with many slots on its rim; each slot is a place where data is stored, and the pointer is a line from the center to one of the slots. The pointer advances automatically as data is read and written. Since the circle has no beginning or end, the pointer can keep moving without ever reaching an endpoint where it can go no further. After a while, once all the slots are full, writing starts again from the oldest position, so the total storage size stays a fixed value. RRDtool stores data in exactly this way; its database files use the '.rrd' suffix. Ganglia, the well-known Hadoop cluster monitoring tool, also uses RRD databases to store and display its data.

2. Characteristics

- RRDtool stores data, playing the role of a back-end tool; but it can also create graphs, which makes it look like a front-end tool as well. Other databases can only store data, not create graphs.
- Every .rrd file has a fixed size, whereas an ordinary database file grows over time.
- Other databases passively accept whatever data arrives; RRDtool can compute on incoming data — for example the rate of change between two consecutive values — and store the result.
- RRDtool expects data at fixed intervals; other databases have no such requirement. If no value arrives within a heartbeat interval, UNKN (unknown) is stored in its place; other databases do not do this.

3. Installing RRDtool

3.1 Install the yum repositories

[root@node1 ~]# rpm -ivh http://download.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
[root@node1 ~]# rpm -ivh http://pkgs.repoforge.org/rpmforge-release/rpmforge-release-0.5.3-1.el6.rf.x86_64.rpm

3.2 Sync the clock

[root@node1 ~]# ntpdate 202.120.2.101

3.3 Download the rrdtool RPM

[root@node1 ~]# wget ftp://195.220.108.108/linux/centos/6.4/os/x86_64/Packages/rrdtool-1.3.8-6.el6.x86_64.rpm

3.4 Install it with yum

[root@node1 ~]# yum -y localinstall --nogpgcheck rrdtool-1.3.8-6.el6.x86_64.rpm

3.5 Check the installed files

[root@node1 ~]# rpm -qa | grep rrdtool
rrdtool-1.3.8-6.el6.x86_64
[root@node1 ~]# rpm -ql rrdtool
/usr/bin/rrdcgi
/usr/bin/rrdtool              # the command-line tool
/usr/bin/rrdupdate
/usr/lib64/librrd.so.4        # libraries below
/usr/lib64/librrd.so.4.0.7
/usr/lib64/librrd_th.so.4
/usr/lib64/librrd_th.so.4.0.7
/usr/share/man/man1/bin_dec_hex.1.gz   # man pages below
/usr/share/man/man1/cdeftutorial.1.gz
/usr/share/man/man1/rpntutorial.1.gz
/usr/share/man/man1/rrd-beginners.1.gz
/usr/share/man/man1/rrdbuild.1.gz
/usr/share/man/man1/rrdcgi.1.gz
/usr/share/man/man1/rrdcreate.1.gz
/usr/share/man/man1/rrddump.1.gz
/usr/share/man/man1/rrdfetch.1.gz
/usr/share/man/man1/rrdfirst.1.gz
/usr/share/man/man1/rrdgraph.1.gz
/usr/share/man/man1/rrdgraph_data.1.gz
/usr/share/man/man1/rrdgraph_examples.1.gz
/usr/share/man/man1/rrdgraph_graph.1.gz
/usr/share/man/man1/rrdgraph_rpn.1.gz
/usr/share/man/man1/rrdinfo.1.gz
/usr/share/man/man1/rrdlast.1.gz
/usr/share/man/man1/rrdlastupdate.1.gz
/usr/share/man/man1/rrdresize.1.gz
/usr/share/man/man1/rrdrestore.1.gz
/usr/share/man/man1/rrdthreads.1.gz
/usr/share/man/man1/rrdtool.1.gz
/usr/share/man/man1/rrdtune.1.gz
/usr/share/man/man1/rrdtutorial.1.gz
/usr/share/man/man1/rrdupdate.1.gz
/usr/share/man/man1/rrdxport.1.gz
/usr/share/rrdtool

3.6 Check the command-line tool

[root@node1 ~]# rrdtool -h
RRDtool 1.3.8  Copyright 1997-2009 by Tobias Oetiker <tobi@oetiker.ch>
               Compiled Aug 21 2010 10:57:18
Usage: rrdtool [options] command command_options
Valid commands: create, update, updatev, graph, graphv, dump, restore,
                last, lastupdate, first, info, fetch, tune, resize, xport
RRDtool is distributed under the Terms of the GNU General Public License Version 2. (www.gnu.org/copyleft/gpl.html)
For more information read the RRD manpages
4. rrdtool command reference

4.1 Creating an RRD database: create

Syntax:

rrdtool create filename [--start|-b start time] [--step|-s step]
  DS:ds-name:DST:dst arguments   # the values finally obtained are PDPs; when updating, supply the values in the order the DSes were defined, separated by colons
  RRA:CF:cf arguments            # the values finally obtained are CDPs; these are what graphs are drawn from

Parameter details:

DS: declares a Data Source — the variable name the results are stored under. You can think of it as declaring a data variable, e.g. the variable corresponding to the port you are monitoring; this name is needed again when graphing.
DST: the type of the DS. There are five: COUNTER, GAUGE, DERIVE, ABSOLUTE and COMPUTE. NIC traffic is a counter, so for that it should be COUNTER.
RRA: specifies how the data is stored. An RRA can be seen as a table holding the consolidated results for one interval. The RRA defines how updated data gets recorded. For example, with a fresh value every 5 minutes, an hour gives 12 values and a day gives 288 — far too much data to keep it all, so there has to be a way to consolidate it, and that is what RRAs are for.
PDP: Primary Data Point. Normally RRDtool receives one value per interval; from the value handed over by the script, RRDtool computes another value (e.g. an average), and that is the PDP — generally meaning "x per second". Note that this value is not necessarily equal to the value RRDtool received, except for GAUGE; the examples below make this clear.
CF: Consolidation Function. One of AVERAGE, MAX, MIN, LAST: take the average, maximum, minimum or most recent of several PDPs. Its exact role becomes clear with the update operation.
CDP: Consolidation Data Point. RRDtool consolidates (computes) one CDP from several PDPs — that is, the result of applying the CF above. The CDPs are what is written into the RRA, and what graphs are drawn from.

(0) filename: ends in .rrd by default, but you can name it as you like.
(1) --start|-b start time: the time of the first value to be accepted into the RRD, in seconds since 1970-01-01 00:00:00 UTC. RRDtool will not accept any value at or before the given time. The default is now-10s; if an update supplies a time at or before --start, RRDtool rejects it. The --start option is optional. To set --start to one day ago: --start $(date -d '1 days ago' +%s). Note the value of --start must be a timestamp.
(2) --step|-s step: the base interval at which data will be fed into the RRD. The default is 300 seconds.
(3) DS:ds-name:DST:dst arguments — this is where the basic attributes of the RRD data are defined. A single RRD database can accept input from several data sources; the DS option specifies some basic attributes for each data source to be stored in the RRD: ds-name names the data field, DST defines the type of the data source, and the dst arguments depend on that type.

Example: DS:mysql:COUNTER:600:0:100000000

A DS (Data Source) expression has six fields:
- DS marks this as a DS expression
- ds-name names the data field
- DST defines the data source's type
- heartbeat: the validity window; here 600. Say you want the value for 12:00 — values within 300 s either side (11:55-12:05), averaged or taken as max/min, count as a valid 12:00 value.
- min: the smallest value allowed to be stored; here 0.
- max: the largest value allowed; here 100000000. To leave the bounds unrestricted, write the fifth and sixth fields as "U:U" (U for unknown).

DST defines the data source's type, and the subsequent arguments of the data source entry depend on it. For GAUGE, COUNTER, DERIVE and ABSOLUTE, the data source format is: DS:ds-name:GAUGE | COUNTER | DERIVE | ABSOLUTE:heartbeat:min:max. Choosing the DST is critical: with the wrong DST, even if your script collects correct values, what lands in RRDtool is wrong — never mind whether the resulting graph means anything.

- GAUGE: unlike the other three, it has no notion of "averaging"; RRDtool stores the received value into the RRA as-is.
- COUNTER: must be increasing, except on counter wrap-around, in which case RRDtool corrects the received value automatically. Network interface traffic and packet counts are of this type.
- DERIVE: similar to COUNTER, but it may increase, decrease, or alternate between the two.
- ABSOLUTE: special — it assumes the previous interval's value was 0 and computes the average from that.
- COMPUTE: special — it takes no input; it is defined as an expression that references other DSes and computes its value automatically. For example: DS:eth0_bytes:COUNTER:600:0:U DS:eth0_bits:COMPUTE:eth0_bytes,8, — each time eth0_bytes receives a value, eth0_bits is computed automatically as eth0_bytes multiplied by 8. A COMPUTE DS has one restriction: it may only reference DSes in its own RRD, not those of other RRDs. COMPUTE DSes only exist in newer versions of RRDtool; the same effect can be achieved with CDEF, e.g. CDEF:eth0_bits=eth0_bytes,8,

References:
http://my.oschina.net/u/1458120/blog/208857
http://www.360doc.com/content/10/1205/16/4947005_75240524.shtml
http://blog.liuts.com/post/215/
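Tying the syntax together: a minimal sketch (invented for illustration) that creates an RRD for NIC traffic, feeds it one sample, and reads the data back:

# one COUNTER DS (600 s heartbeat, unbounded) and one RRA that averages
# 1 PDP at a time, keeping 288 rows -- one day at the default 300 s step
rrdtool create eth0.rrd --start now-10s --step 300 \
  DS:eth0_bytes:COUNTER:600:0:U \
  RRA:AVERAGE:0.5:1:288

# feed one sample for "now"; values follow the DS definition order, colon-separated
rrdtool update eth0.rrd N:123456789

# read back the consolidated (AVERAGE) data points
rrdtool fetch eth0.rrd AVERAGE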
I just received an email from Zeppelin's creator Moon soo Lee to the Zeppelin user group: Zeppelin has graduated from the incubator and become an Apache top-level project, so from now on "incubator" can be dropped when introducing Zeppelin. Official ASF announcement: https://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces92

The original email:

Congratulations and thank you everyone who made it possible!

---------- Forwarded message ----------
From: Sally Khudairi <sk@apache.org>
Date: Wed, May 25, 2016 at 3:04 AM
Subject: The Apache Software Foundation Announces Apache® Zeppelin™ as a Top-Level Project
To: Apache Announce List <announce@apache.org>

this announcement is available online at https://s.apache.org/YqqP

Open Source Big Data analytics and visualization tool for distributed, interactive, and collaborative systems using Apache Flink, Apache Hadoop, Apache Spark, and more.

Forest Hill, MD –25 May 2016– The Apache Software Foundation (ASF), the all-volunteer developers, stewards, and incubators of more than 350 Open Source projects and initiatives, announced today that Apache® Zeppelin™ has graduated from the Apache Incubator to become a Top-Level Project (TLP), signifying that the project's community and products have been well-governed under the ASF's meritocratic process and principles.

Apache Zeppelin is a modern, web-based notebook that enables interactive data analytics. Notebooks help developers, data scientists, and related users to handle data efficiently without worrying about command lines and cluster details.

"The Zeppelin community is pleased to graduate from the Apache Incubator," said Lee Moon Soo, Vice President of Apache Zeppelin. "With 118 worldwide contributors and widespread adoption in numerous commercial products, we are proud to officially be a part of the Apache Big Data ecosystem."

Zeppelin's collaborative data analytics and visualization capabilities makes data exploration, visualization, sharing, and collaboration easy over distributed, general-purpose data processing systems that use Apache Flink, Apache Hadoop, and Apache Spark, among other Big Data platforms.

Apache Zeppelin is:

- Multi-purpose --features data ingestion, exploration, analysis, visualization, and collaboration;
- Robust --supports 20+ more backend systems, including Apache Spark, Apache Flink, Apache Hive, Python, R, and any JDBC (Java Database Connectivity);
- Easy to deploy --built on top of modern Web technologies (provides built-in Apache Spark integration, eliminating the need to build a separate module, plugin, or library);
- Easy to use --with built-in visualizations and dynamic forms;
- Flexible --allows users to mix different languages, exchange data between backends, adjust the layout;
- Extensible --with pluggable architecture for interpreters, notebook storages, authentication, and visualizations (in progress); and
- Advanced --allows interaction between custom visualizations and cluster resources

"With Apache Zeppelin, a wide range of users can make beautiful data-driven, interactive, and collaborative documents with SQL, Scala, and more," added Soo.

Apache Zeppelin is in use at an array of organizations and solutions, including Amazon Web Services, Hortonworks, JuJu, and Twitter, among others.

"Congratulations to Apache Zeppelin community on graduation," said Tim Hall, Vice President of Product Management at Hortonworks. "Several members of our team have been working over the past year in the Zeppelin community to make it enterprise ready. We are excited to be associated with this community and look forward to helping our customers get the best insights out of their data with Apache Zeppelin."

"Apache Zeppelin is becoming an important tool at Twitter for creating and sharing interactive data analytics and visualizations," said Prasad Wagle, Technical Lead in the Data Platform team at Twitter. "Since it integrates seamlessly with all the popular data analytics engines, it is very easy to create and share reports and dashboards. With its extensible architecture and a vibrant Open Source community, I am looking forward to Apache Zeppelin advancing the state of the art in data analytics and visualization."

"Apache Zeppelin is the major user-facing piece of Memcore's in-memory data processing Cloud offering. Building a technology stack might be quite exciting engineering challenge, however, if users can't visualize and work with the data conveniently, it is as good as not having the data at all. Apache Zeppelin enables efficient user acquisition by anyone trying to build new products or service offerings in the Big- and Fast- Data markets, making innovations, collaboration, and development easier for anyone," said Dr. Konstantin Boudnik, Founder and CEO of Memcore.io. "I am very excited to see Apache Zeppelin graduating as an ASF Top Level Project. This shows that more people are joining the community, bringing the project to a new level, and adding more integration points with existing data analytics and transactional software systems. This directly benefits the community at-large."

Apache Zeppelin originated in 2013 at NFLabs as Peloton, a commercial data analytics product. Since entering the Apache Incubator in December 2014, the project has had three releases, and twice participated in Google Summer of Code under the Apache umbrella.

"It was an honor to help with the incubation of Zeppelin," said Ted Dunning, Vice President of the Apache Incubator. "I have been very impressed with the Zeppelin community and the software they have built. I see Apache Zeppelin being adopted all over the place where people need to apply a notebook style to a wide variety of kinds of computing."

Catch Apache Zeppelin in action during Berlin Buzzwords, 7 June 2016: https://s.apache.org/mV8E

Availability and Oversight
Apache Zeppelin software is released under the Apache License v2.0 and is overseen by a self-selected team of active contributors to the project. A Project Management Committee (PMC) guides the Project's day-to-day operations, including community development and product releases. For downloads, documentation, and ways to become involved with Apache Zeppelin, visit http://zeppelin.apache.org/ and https://twitter.com/ApacheZeppelin

About the Apache Incubator
The Apache Incubator is the entry path for projects and codebases wishing to become part of the efforts at The Apache Software Foundation. All code donations from external organizations and existing external projects wishing to join the ASF enter through the Incubator to: 1) ensure all donations are in accordance with the ASF legal standards; and 2) develop new communities that adhere to our guiding principles. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.
For more information, visit http://incubator.apache.org/

About The Apache Software Foundation (ASF)
Established in 1999, the all-volunteer Foundation oversees more than 350 leading Open Source projects, including Apache HTTP Server --the world's most popular Web server software. Through the ASF's meritocratic process known as "The Apache Way," more than 550 individual Members and 5,300 Committers successfully collaborate to develop freely available enterprise-grade software, benefiting millions of users worldwide: thousands of software solutions are distributed under the Apache License; and the community actively participates in ASF mailing lists, mentoring initiatives, and ApacheCon, the Foundation's official user conference, trainings, and expo. The ASF is a US 501(c)(3) charitable organization, funded by individual donations and corporate sponsors including Alibaba Cloud Computing, ARM, Bloomberg, Budget Direct, Cerner, Cloudera, Comcast, Confluent, Facebook, Google, Hortonworks, HP, Huawei, IBM, InMotion Hosting, iSigma, LeaseWeb, Microsoft, OPDi, PhoenixNAP, Pivotal, Private Internet Access, Produban, Red Hat, Serenata Flowers, WANdisco, and Yahoo. For more information, visit http://www.apache.org/ and https://twitter.com/TheASF

© The Apache Software Foundation. "Apache", "Zeppelin", "Apache Zeppelin", "Ambari", "Apache Ambari", "Flink", "Apache Flink", "Hadoop", "Apache Hadoop", "Hive", "Apache Hive", "Spark", "Apache Spark", and "ApacheCon" are registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. All other brands and trademarks are the property of their respective owners.
Introduction

Zeppelin gives us a powerful interactive notebook that produces very good report data. But for security reasons we rarely share our cluster's Zeppelin address directly: in a Zeppelin notebook a user can query the cluster's data and even run shell commands, and exposing that to external users is very dangerous. Moreover, some cloud platforms (Aliyun E-MapReduce, Amazon AWS, and others) close certain ports for safety, so the Zeppelin service can only be reached from outside the cluster through port forwarding, which makes sharing the cluster's Zeppelin service even harder. So when we want to share our interactive reports and data, we are reduced to taking screenshots into a PDF or slide deck and mailing them around — very cumbersome, and the dynamic tooltips and interactive data of the notebook reports are lost.

With the ZeppelinHub Viewer, none of that hassle is needed anymore. ZeppelinHub is a tool dedicated to sharing Zeppelin notebooks. It has its own account system and can integrate with a Zeppelin service to share and copy notebooks and to manage multi-tenant permissions. The ZeppelinHub Viewer is its simplest notebook-sharing feature: it needs no Zeppelin service of your own, and you don't even have to register or log in to ZeppelinHub to use it.

Generate the report

First, the sharer creates a notebook on the Zeppelin service page and generates the report data. On interactivity and visualization, see: How to use Zeppelin and Hue on an Aliyun E-MapReduce cluster.

Save the notebook as JSON

Save the finished report notebook as JSON data, and name the file note.json — that is the name the ZeppelinHub Viewer expects.

Upload the JSON file to OSS and set it to public-read (or host it on GitHub).

Paste the OSS file's URL into the ZeppelinHub Viewer

Visit https://www.zeppelinhub.com/viewer and paste in the URL of the note.json file. Click "view" to see the notebook's reports and the SQL queries behind them: https://www.zeppelinhub.com/viewer/notebooks/aHR0cDovL2Vtci1zYW1wbGUtcHJvamVjdHMub3NzLWNuLWhhbmd6aG91LmFsaXl1bmNzLmNvbS96ZXBwZWxpbi1ub3RlYm9vay1qc29uL25vdGUuanNvbg

Share that address and others can view the report and notebook — safely.
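Before pasting the URL into the Viewer, it is worth confirming that the object really is publicly readable. A quick check with curl (the bucket and path below are placeholders — substitute the actual location of your note.json):

curl -sI 'http://your-bucket.oss-cn-hangzhou.aliyuncs.com/path/note.json'
# HTTP/1.1 200 OK  -> public-read works; a 403 means the ACL is still private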
Introduction

Aliyun E-MapReduce now supports Apache Zeppelin and Hue, and both are easy to use on an Aliyun E-MapReduce cluster.

Apache Zeppelin provides a web-based notebook, similar to IPython's, for data analysis and visualization. It can plug into different data processing engines behind the scenes — including Spark, Hive and Tajo — and natively supports Scala, Java, shell, markdown and more. Its overall look and usage are the same as Databricks Cloud, having come from that original demo.

Hue is an open-source Apache Hadoop UI. It evolved from Cloudera Desktop and was contributed to the open-source community by Cloudera; it is built on the Python web framework Django. With Hue you can interact with the Hadoop cluster from a web console in the browser to analyze and process data — for example, manipulate data on HDFS or run MapReduce jobs.

Preparation

Create a cluster

When creating the cluster, choose an E-MapReduce main version that supports Zeppelin and Hue; currently that is main version 1.3.0. Remember to enable the public IP when creating the cluster.

Set up passwordless SSH and an SSH tunnel

Once the cluster is created, you need an SSH tunnel to reach its ports 8888 and 8080. Detailed steps: https://help.aliyun.com/document_detail/28187.html

Here is the macOS flow, using Chrome for the port forwarding (assume the master node's public IP is xx.xx.xx.xx):

a) Log in to the master node:

ssh root@xx.xx.xx.xx
(enter the password)

b) Check your local id_rsa.pub (run this locally, not on the remote master):

cat ~/.ssh/id_rsa.pub

c) Append your local id_rsa.pub to ~/.ssh/authorized_keys on the remote master node (run on the master):

mkdir ~/.ssh/
vim ~/.ssh/authorized_keys
(paste in the content seen in step b)

Now ssh root@xx.xx.xx.xx should log in to the master node without a password.

d) Run the port forwarding locally:

ssh -i ~/.ssh/id_rsa -ND 8157 root@xx.xx.xx.xx

e) Start Chrome (from a new local terminal):

/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --proxy-server="socks5://localhost:8157" --host-resolver-rules="MAP * 0.0.0.0 , EXCLUDE localhost" --user-data-dir=/tmp

f) In the newly opened Chrome, visit Zeppelin and Hue:

zeppelin: xx.xx.xx.xx:8080
hue: xx.xx.xx.xx:8888

Walkthrough and examples

Zeppelin

Visit xx.xx.xx.xx:8080 in Chrome and first create a new notebook.

A simple Zeppelin example. Markdown syntax is supported:

%md
## Welcome to Aliyun E-MapReduce. This is a Zeppelin sample.
##### This is a live tutorial, you can run the code yourself. (Shift-Enter to Run)

Scala is supported natively. Load data with Scala:

import org.apache.commons.io.IOUtils
import java.net.URL
import java.nio.charset.Charset

// Zeppelin creates and injects sc (SparkContext) and sqlContext (HiveContext or SqlContext)
// So you don't need create them manually

// load bank data
val bankText = sc.parallelize(
  IOUtils.toString(
    new URL("http://emr-sample-projects.oss-cn-hangzhou.aliyuncs.com/bank.csv"),
    Charset.forName("utf8")).split("\n"))

case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer)

val bank = bankText.map(s => s.split(";")).filter(s => s(0) != "\"age\"").map(
  s => Bank(s(0).toInt,
    s(1).replaceAll("\"", ""),
    s(2).replaceAll("\"", ""),
    s(3).replaceAll("\"", ""),
    s(5).replaceAll("\"", "").toInt)
).toDF()
bank.registerTempTable("bank")

Query and display the results with Spark SQL:

%sql select age, count(1) value from bank where age < 30 group by age order by age

%sql select age, count(1) value from bank where age < ${maxAge=30} group by age order by age

%sql select age, count(1) value from bank where marital="${marital=single,single|divorced|married}" group by age order by age

Running shell in Zeppelin:

%sh
cd /tmp
wget http://emr-sample-projects.oss-cn-hangzhou.aliyuncs.com/bank.csv
ls -la
rm bank.csv
ls -la
su -l hadoop -c "hadoop dfs -ls /"

Running Hive SQL in Zeppelin. Download the spending dataset into HDFS:

%sh
#remove existing copies of dataset from HDFS
su -l hadoop -c "hadoop fs -rm /tmp/expenses.csv"
#fetch the dataset
wget http://emr-sample-projects.oss-cn-hangzhou.aliyuncs.com/healthexpenditurebyareaandsource.csv -O /tmp/expenses.csv
#remove header
sed -i '1d' /tmp/expenses.csv
#remove empty fields
sed -i "s/,,,,,//g" /tmp/expenses.csv
sed -i '/^\s*$/d' /tmp/expenses.csv
#put data into HDFS
su -l hadoop -c "hadoop fs -put /tmp/expenses.csv /tmp"
su -l hadoop -c "hadoop fs -ls -h /tmp/expenses.csv"
rm /tmp/expenses.csv

%hive
drop table if exists `health_table`

Create the Hive table:

%hive
CREATE TABLE `health_table` (
  `year` string,
  `state` string,
  `category` string,
  `funding_src1` string,
  `funding_src2` string,
  `spending` int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TextFile

Load the dataset into the Hive table:

%hive
load data inpath '/tmp/expenses.csv' into table health_table

Count the rows:

%hive
select count(*) from health_table

Spending (in billions) by state:

%hive
select state, sum(spending)/1000 SpendinginBillions from health_table group by state order by SpendinginBillions desc

Spending (in billions) by year:

%hive
select year, sum(spending)/1000 SpendinginBillions from health_table group by year order by SpendinginBillions

Spending (in billions) by category:

%hive
select category, sum(spending)/1000 SpendinginBillions from health_table group by category order by SpendinginBillions desc

Zeppelin notebook JSON

Every Zeppelin notebook can be saved, imported and exported. The three examples above can be imported directly from their JSON configurations: sample, shell demo, hive demo.

Zeppelin figures

Figures in Zeppelin can all be copied and saved.

Hue

Visit xx.xx.xx.xx:8888 to reach Hue. On first login, Hue asks you to set an administrator account and password; choose and keep your Hue admin credentials carefully.

Hive Editor: write and interactively run Hive SQL; the DATABASE panel on the left shows the current databases and tables.

Metastore Manager: view and manage Hive tables, and create tables visually.

File Browser: view and manage files on HDFS.

Example

Here is an end-to-end example: create a table from a file in the Hue metastore UI, then query and display the table's data in the Hive Editor.

First, download the data file to the master node and put it into HDFS under /tmp. You can ssh to the master node, or simply use a Zeppelin shell notebook. There are three ways to get the data file into /tmp on HDFS:

1) Log in to the master node directly:

ssh root@xx.xx.xx.xx
cd /tmp
wget http://emr-sample-projects.oss-cn-hangzhou.aliyuncs.com/us_map.csv
su -l hadoop -c "hadoop fs -put /tmp/us_map.csv /tmp/us_map.csv"
su -l hadoop -c "hadoop fs -ls -h /tmp/us_map.csv"

2) Use a Zeppelin shell notebook:

%sh
cd /tmp
wget http://emr-sample-projects.oss-cn-hangzhou.aliyuncs.com/us_map.csv

%sh
su -l hadoop -c "hadoop fs -put /tmp/us_map.csv /tmp/us_map.csv"

%sh
su -l hadoop -c "hadoop fs -ls -h /tmp/us_map.csv"

3) Upload through Hue's File Browser.

Create the table through Hue's Metastore Manager, then click next all the way through to create table. Refresh the Query Editor page and the new table appears in the DATABASE sidebar on the left.

Run this in the Hive Editor:

select * from us_map
When setting up an nginx reverse proxy, add the following configuration:
location / {
proxy_pass http://127.0.0.1:10678;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header REMOTE-HOST $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
}
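After editing the configuration, nginx can validate and reload it without dropping connections (standard nginx commands; port 10678 above is simply whichever local service is being proxied):

nginx -t          # check the configuration syntax first
nginx -s reload   # then reload gracefully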
Is your cluster on the classic network or in a VPC? A VPC cluster will not have an IP; it uses a user-defined identifier instead. See: https://help.aliyun.com/document_detail/28966.html?spm=5176.doc29005.6.560.A2AfFi
Are you using Alibaba Cloud E-MapReduce? You can discuss this in the EMR customer DingTalk group.
Please open a support ticket.
I'd suggest buying it on the Alibaba Cloud official site.
For EMR-specific questions, you can discuss in the EMR customer DingTalk group — responses are faster there.
The machines are under your account and cannot be transferred.
You can process and prepare your data first, then use Quick BI to analyze and display it.
Just open a support ticket. Is something preventing you from doing that?
Hi — EMR has the concepts of jobs and execution plans. If your intermediate results land on the cluster or on OSS, you can create multiple jobs and set dependencies between them. If the intermediate results don't need to be persisted, write both computation steps in one piece of code and a single job will do. For specific questions, join the EMR customer DingTalk group; someone is usually on duty there to answer customer inquiries.
There are staffed support tickets, and tickets generally get a fairly quick response. You can also file a complaint by phone with a human agent.
The server may not have ping enabled. Failing to ping it does not mean the server has a problem; use the telnet command instead.
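For example, to check whether a specific service port is reachable (host and port here are placeholders):

telnet xx.xx.xx.xx 80   # connects if port 80 is open, even when ICMP/ping is blocked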
If 10 buckets are not enough, you can open a ticket to request more; there is presumably a manual review. But that is no long-term solution — use different folders under a bucket to isolate the data instead; the cap on the number of folders is much higher.
You can use the OSS SDK — it's simple, just a few lines of code: https://help.aliyun.com/document_detail/32008.html?spm=5176.7933691.203620.1.bTiGTp
I wouldn't use it that way. One Alibaba Cloud account can have at most 3 ACCESS_ID/ACCESS_KEY pairs, and you can apply different access control to different buckets.
Pretty fast — a few days and it's done.
Yeah, Alibaba Cloud did finally put up a Maven repository.
Shiro's configuration is quite simple; you'll pick it up with a quick look.
Where is that from? Post a screenshot.
What is your cluster configuration — is it fairly low-spec? After you run Spark in Zeppelin, Zeppelin starts a long-running Spark job to accept further Spark submissions from the notebook. Check the cluster's Hadoop UI to see whether a Spark job has been running the whole time. That ever-running Spark job keeps other MapReduce jobs from getting resources, so they stay in the ACCEPTED state.
You can click "Interpreter", and in the "spark" section click "restart" in the top-right corner. That stops the resident job handling Spark submissions, and your other MapReduce jobs can then get resources and run. Give it a try.
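To confirm from the command line which application is holding the resources, the standard YARN CLI can list what is running (run on the master node):

yarn application -list -appStates RUNNING
# a long-lived Zeppelin Spark application here is the one holding the containers;
# it can also be stopped directly with: yarn application -kill <applicationId>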