Flink Official Documentation: Batch Examples (Part 2)


2017-05-18

Connected Components

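The Connected Components algorithm identifies the connected parts of a larger graph by assigning the same component ID to all vertices in the same connected part. Like PageRank, Connected Components is an iterative algorithm. In each step, every vertex propagates its current component ID to all of its neighbors. A vertex accepts a component ID from a neighbor if that ID is smaller than its own.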

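This implementation uses a delta iteration: vertices whose component ID did not change do not take part in the next step. Because later iterations typically touch only a few outlier vertices, this yields much better performance.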
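To make the step logic concrete, here is a minimal plain-Java sketch of the same idea without any Flink APIs. The class name `ConnectedComponentsSketch`, the method `components`, and the in-memory maps are illustrative assumptions and are not part of the Flink example; the sketch only mirrors the workset/solution-set split described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper class, not part of Flink.
class ConnectedComponentsSketch {

    /** Returns vertex ID -> component ID (the smallest vertex ID reachable from it). */
    static Map<Long, Long> components(List<Long> vertices, List<long[]> edges, int maxIterations) {
        // Undirected adjacency lists: every edge is stored in both directions.
        Map<Long, List<Long>> neighbors = new HashMap<>();
        for (long[] e : edges) {
            neighbors.computeIfAbsent(e[0], k -> new ArrayList<>()).add(e[1]);
            neighbors.computeIfAbsent(e[1], k -> new ArrayList<>()).add(e[0]);
        }

        // Solution set: each vertex starts with its own ID as component ID.
        Map<Long, Long> solution = new HashMap<>();
        for (Long v : vertices) {
            solution.put(v, v);
        }

        // Workset: vertices whose component ID changed in the previous step.
        Set<Long> workset = new HashSet<>(vertices);

        for (int i = 0; i < maxIterations && !workset.isEmpty(); i++) {
            // Each workset vertex proposes its ID to its neighbors; keep the minimum per target.
            Map<Long, Long> candidates = new HashMap<>();
            for (Long v : workset) {
                long id = solution.get(v);
                for (Long n : neighbors.getOrDefault(v, Collections.emptyList())) {
                    candidates.merge(n, id, Math::min);
                }
            }

            // A vertex accepts a candidate only if it is smaller than its current ID.
            Set<Long> changed = new HashSet<>();
            for (Map.Entry<Long, Long> c : candidates.entrySet()) {
                Long current = solution.get(c.getKey());
                if (current != null && c.getValue() < current) {
                    solution.put(c.getKey(), c.getValue());
                    changed.add(c.getKey());
                }
            }
            // Only changed vertices take part in the next step.
            workset = changed;
        }
        return solution;
    }

    public static void main(String[] args) {
        List<Long> vertices = Arrays.asList(1L, 2L, 12L, 42L, 63L);
        List<long[]> edges = Arrays.asList(
                new long[] {1, 2}, new long[] {2, 12}, new long[] {1, 12}, new long[] {42, 63});
        // prints {1=1, 2=1, 12=1, 42=42, 63=42}
        System.out.println(new TreeMap<>(components(vertices, edges, 10)));
    }
}
```

On this small graph the workset shrinks from five vertices to three after the first step and is empty after the second, which is the effect the delta iteration exploits.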

Java

// read vertex and edge data
DataSet<Long> vertices = getVertexDataSet(env);
DataSet<Tuple2<Long, Long>> edges = getEdgeDataSet(env).flatMap(new UndirectEdge());

// assign the initial component IDs (equal to the vertex ID)
DataSet<Tuple2<Long, Long>> verticesWithInitialId = vertices.map(new DuplicateValue<Long>());

// open a delta iteration
DeltaIteration<Tuple2<Long, Long>, Tuple2<Long, Long>> iteration =
        verticesWithInitialId.iterateDelta(verticesWithInitialId, maxIterations, 0);

// apply the step logic:
DataSet<Tuple2<Long, Long>> changes = iteration.getWorkset()
        // join with the edges
        .join(edges).where(0).equalTo(0).with(new NeighborWithComponentIDJoin())
        // select the minimum neighbor component ID
        .groupBy(0).aggregate(Aggregations.MIN, 1)
        // update if the component ID of the candidate is smaller
        .join(iteration.getSolutionSet()).where(0).equalTo(0)
        .flatMap(new ComponentIdFilter());

// close the delta iteration (delta and new workset are identical)
DataSet<Tuple2<Long, Long>> result = iteration.closeWith(changes, changes);

// emit result
result.writeAsCsv(outputPath, "\n", " ");

// User-defined functions

public static final class DuplicateValue<T> implements MapFunction<T, Tuple2<T, T>> {

    @Override
    public Tuple2<T, T> map(T vertex) {
        return new Tuple2<T, T>(vertex, vertex);
    }
}

public static final class UndirectEdge
                    implements FlatMapFunction<Tuple2<Long, Long>, Tuple2<Long, Long>> {
    Tuple2<Long, Long> invertedEdge = new Tuple2<Long, Long>();

    @Override
    public void flatMap(Tuple2<Long, Long> edge, Collector<Tuple2<Long, Long>> out) {
        invertedEdge.f0 = edge.f1;
        invertedEdge.f1 = edge.f0;
        out.collect(edge);
        out.collect(invertedEdge);
    }
}

public static final class NeighborWithComponentIDJoin
                implements JoinFunction<Tuple2<Long, Long>, Tuple2<Long, Long>, Tuple2<Long, Long>> {

    @Override
    public Tuple2<Long, Long> join(Tuple2<Long, Long> vertexWithComponent, Tuple2<Long, Long> edge) {
        return new Tuple2<Long, Long>(edge.f1, vertexWithComponent.f1);
    }
}

public static final class ComponentIdFilter
                    implements FlatMapFunction<Tuple2<Tuple2<Long, Long>, Tuple2<Long, Long>>,
                                            Tuple2<Long, Long>> {

    @Override
    public void flatMap(Tuple2<Tuple2<Long, Long>, Tuple2<Long, Long>> value,
                        Collector<Tuple2<Long, Long>> out) {
        if (value.f0.f1 < value.f1.f1) {
            out.collect(value.f0);
        }
    }
}

Scala

// set up execution environment
val env = ExecutionEnvironment.getExecutionEnvironment

// read vertex and edge data
// assign the initial components (equal to the vertex id)
val vertices = getVerticesDataSet(env).map { id => (id, id) }

// undirected edges by emitting for each input edge the input edges itself and an inverted
// version
val edges = getEdgesDataSet(env).flatMap { edge => Seq(edge, (edge._2, edge._1)) }

// open a delta iteration
val verticesWithComponents = vertices.iterateDelta(vertices, maxIterations, Array(0)) {
  (s, ws) =>

    // apply the step logic: join with the edges
    val allNeighbors = ws.join(edges).where(0).equalTo(0) { (vertex, edge) =>
      (edge._2, vertex._2)
    }

    // select the minimum neighbor
    val minNeighbors = allNeighbors.groupBy(0).min(1)

    // update if the component of the candidate is smaller
    val updatedComponents = minNeighbors.join(s).where(0).equalTo(0) {
      (newVertex, oldVertex, out: Collector[(Long, Long)]) =>
        if (newVertex._2 < oldVertex._2) out.collect(newVertex)
    }

    // delta and new workset are identical
    (updatedComponents, updatedComponents)
}


verticesWithComponents.writeAsCsv(outputPath, "\n", " ")

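The ConnectedComponents program implements the example above. It requires the following parameters to run: --vertices, --edges, --output, --iterations.
Input files are plain text files and must be formatted as follows: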

Vertices are represented by their IDs and separated by newline characters. For example, "1\n2\n12\n42\n63\n" gives five vertices: (1), (2), (12), (42), and (63).

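Edges are represented as pairs of vertex IDs separated by a space character. Edges are separated by newline characters. For example, "1 2\n2 12\n1 12\n42 63\n" gives four undirected links: (1)-(2), (2)-(12), (1)-(12), and (42)-(63).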
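As a rough illustration of these two formats, the following plain-Java sketch parses the sample strings from the text. The class `GraphInputSketch` and its methods are hypothetical names chosen for this sketch; the Flink example reads its input through its own data sources instead.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of Flink.
class GraphInputSketch {

    /** Parses one vertex ID per line; blank lines are skipped. */
    static List<Long> parseVertices(String text) {
        List<Long> ids = new ArrayList<>();
        for (String line : text.split("\n")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                ids.add(Long.parseLong(trimmed));
            }
        }
        return ids;
    }

    /** Parses one edge per line, written as two vertex IDs separated by a space. */
    static List<long[]> parseEdges(String text) {
        List<long[]> edges = new ArrayList<>();
        for (String line : text.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            String[] parts = trimmed.split(" ");
            edges.add(new long[] {Long.parseLong(parts[0]), Long.parseLong(parts[1])});
        }
        return edges;
    }

    public static void main(String[] args) {
        // prints [1, 2, 12, 42, 63]
        System.out.println(parseVertices("1\n2\n12\n42\n63\n"));
        // prints 4
        System.out.println(parseEdges("1 2\n2 12\n1 12\n42 63\n").size());
    }
}
```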

Relational Query

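The Relational Query example assumes two tables, orders and lineitem, as specified by the TPC-H decision support benchmark. TPC-H is a standard benchmark in the database industry. See below for instructions on how to generate the input data.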

The example implements the following SQL query.

SELECT l_orderkey, o_shippriority, sum(l_extendedprice) AS revenue
    FROM orders, lineitem
WHERE l_orderkey = o_orderkey
    AND o_orderstatus = 'F'
    AND YEAR(o_orderdate) > 1993
    AND o_orderpriority LIKE '5%'
GROUP BY l_orderkey, o_shippriority;

The Flink program that implements this query looks as follows.

// get orders data set: (orderkey, orderstatus, orderdate, orderpriority, shippriority)
DataSet<Tuple5<Integer, String, String, String, Integer>> orders = getOrdersDataSet(env);
// get lineitem data set: (orderkey, extendedprice)
DataSet<Tuple2<Integer, Double>> lineitems = getLineitemDataSet(env);

// orders filtered by status, year, and priority: (orderkey, shippriority)
DataSet<Tuple2<Integer, Integer>> ordersFilteredByYear =
        // filter orders
        orders.filter(
            new FilterFunction<Tuple5<Integer, String, String, String, Integer>>() {
                @Override
                public boolean filter(Tuple5<Integer, String, String, String, Integer> t) {
                    // status filter
                    if(!t.f1.equals(STATUS_FILTER)) {
                        return false;
                    // year filter
                    } else if(Integer.parseInt(t.f2.substring(0, 4)) <= YEAR_FILTER) {
                        return false;
                    // order priority filter
                    } else if(!t.f3.startsWith(OPRIO_FILTER)) {
                        return false;
                    }
                    return true;
                }
            })
        // project fields out that are no longer required
        .project(0,4).types(Integer.class, Integer.class);

// join orders with lineitems: (orderkey, shippriority, extendedprice)
DataSet<Tuple3<Integer, Integer, Double>> lineitemsOfOrders =
        ordersFilteredByYear.joinWithHuge(lineitems)
                            .where(0).equalTo(0)
                            .projectFirst(0,1).projectSecond(1)
                            .types(Integer.class, Integer.class, Double.class);

// extendedprice sums: (orderkey, shippriority, sum(extendedprice))
DataSet<Tuple3<Integer, Integer, Double>> priceSums =
        // group by order and sum extendedprice
        lineitemsOfOrders.groupBy(0,1).aggregate(Aggregations.SUM, 2);

// emit result
priceSums.writeAsCsv(outputPath);

(Translator's note: the Scala example is missing.)
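To check the query semantics on a handful of rows, the filter, join, and grouped sum can be sketched in plain Java without Flink. The classes `RelationalQuerySketch`, `Order`, and `LineItem` and the sample rows are made up for illustration; the filter constants follow the SQL query above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, not part of Flink.
class RelationalQuerySketch {

    static final class Order {
        final int orderKey;
        final String status;
        final String date;      // formatted as yyyy-MM-dd
        final String priority;
        final int shipPriority;

        Order(int orderKey, String status, String date, String priority, int shipPriority) {
            this.orderKey = orderKey;
            this.status = status;
            this.date = date;
            this.priority = priority;
            this.shipPriority = shipPriority;
        }
    }

    static final class LineItem {
        final int orderKey;
        final double extendedPrice;

        LineItem(int orderKey, double extendedPrice) {
            this.orderKey = orderKey;
            this.extendedPrice = extendedPrice;
        }
    }

    /** Returns (orderkey, shippriority) -> sum(extendedprice). */
    static Map<List<Integer>, Double> revenue(List<Order> orders, List<LineItem> items) {
        // WHERE o_orderstatus = 'F' AND YEAR(o_orderdate) > 1993 AND o_orderpriority LIKE '5%'
        Map<Integer, Integer> shipPriorityByOrder = new HashMap<>();
        for (Order o : orders) {
            boolean keep = o.status.equals("F")
                    && Integer.parseInt(o.date.substring(0, 4)) > 1993
                    && o.priority.startsWith("5");
            if (keep) {
                shipPriorityByOrder.put(o.orderKey, o.shipPriority);
            }
        }

        // Join on orderkey, then GROUP BY (orderkey, shippriority) and sum extendedprice.
        Map<List<Integer>, Double> sums = new HashMap<>();
        for (LineItem li : items) {
            Integer shipPriority = shipPriorityByOrder.get(li.orderKey);
            if (shipPriority != null) {
                sums.merge(Arrays.asList(li.orderKey, shipPriority), li.extendedPrice, Double::sum);
            }
        }
        return sums;
    }

    public static void main(String[] args) {
        List<Order> orders = Arrays.asList(
                new Order(1, "F", "1994-03-01", "5-LOW", 0),     // passes all three filters
                new Order(2, "O", "1995-01-01", "5-LOW", 0),     // wrong status
                new Order(3, "F", "1993-12-31", "5-LOW", 0),     // year not after 1993
                new Order(4, "F", "1996-01-01", "1-URGENT", 0)); // wrong priority
        List<LineItem> items = Arrays.asList(
                new LineItem(1, 10.0), new LineItem(1, 5.5),
                new LineItem(2, 7.0), new LineItem(3, 1.0), new LineItem(4, 2.0));
        // prints {[1, 0]=15.5}
        System.out.println(revenue(orders, items));
    }
}
```

Filtering the small orders side first and then probing it from the lineitem side follows the same reasoning as the joinWithHuge hint in the Flink program.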

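The Relational Query program implements the query above. It requires the following parameters to run: --orders, --lineitem, --output.
The orders and lineitem files can be generated with the data generator tool (DBGEN) of the TPC-H benchmark suite. Take the following steps to generate input files of arbitrary size for the Flink program: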

1. Download and unpack DBGEN.

2. Make a copy of makefile.suite called Makefile and edit it as follows:

DATABASE = DB2
MACHINE  = LINUX
WORKLOAD = TPCH
CC       = gcc

3. Build DBGEN using make.

4. Generate the lineitem and orders tables using DBGEN. A scale factor (-s) of 1 produces a data set of about 1 GB.

./dbgen -T o -s 1
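As a rough guide to how the generated files map to the five-field orders tuple used in the Flink program, the sketch below picks those fields from one line. It assumes that DBGEN writes pipe-delimited lines with the orders columns in TPC-H schema order (orderkey, custkey, orderstatus, totalprice, orderdate, orderpriority, clerk, shippriority, comment). The class `OrdersLineSketch` and the sample line are made up for illustration.

```java
// Hypothetical helper class, not part of Flink or DBGEN.
class OrdersLineSketch {

    /** Returns {orderkey, orderstatus, orderdate, orderpriority, shippriority} as strings. */
    static String[] selectOrderFields(String line) {
        // Assumed layout: fields separated by '|', with a trailing '|' at the end of the line.
        String[] f = line.split("\\|");
        return new String[] {f[0], f[2], f[4], f[5], f[7]};
    }

    public static void main(String[] args) {
        // Made-up sample line in the assumed orders layout.
        String line = "1|370|F|1000.00|1994-03-01|5-LOW|Clerk#1|0|made-up comment|";
        // prints 1,F,1994-03-01,5-LOW,0
        System.out.println(String.join(",", selectOrderFields(line)));
    }
}
```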

Reposted from 并发编程网 (ifeve.com)
