Flink1.7.2 DataStream Operator 示例

本文涉及的产品
实时计算 Flink 版,5000CU*H 3个月
简介: Flink Operator 示例, operator,window等输入数据,程序源码,输出结果,示例演示

Flink1.7.2 DataStream Operator 示例

源码

map

  • 处理所有元素
  • 输入数据
模板
  • 程序
模板
  • 输出数据
模板

map

  • 处理所有元素
  • 输入数据

    你好
    发送数据
  • 程序

    
    package com.opensourceteams.module.bigdata.flink.example.stream.operator.map
    
    import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
    
    import org.apache.flink.streaming.api.scala._
    
    /**
      * nc -lk 1234  输入数据
      */
    object Run {
    
      def main(args: Array[String]): Unit = {
    
    
        val port = 1234
        val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
        val dataStream = env.socketTextStream("localhost", port, '\n')
    
        val dataStreamMap = dataStream.map(x => x + " 增加的数据")
    
        dataStreamMap.print()
    
    
        if(args == null || args.size ==0){
          env.execute("默认作业")
        }else{
          env.execute(args(0))
        }
    
        println("结束")
    
      }
    
    }
    
  • 输出数据

        1> 你好 增加的数据
        2> 发送数据 增加的数据

flatMap

  • 处理所有元素,并且把每行中的子集合,汇总成一个大集合
  • 输入数据

    a b c
    e f g
  • 程序

    package com.opensourceteams.module.bigdata.flink.example.stream.operator.flatmap
    
    import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
    
    /**
      * nc -lk 1234  输入数据
      */
    object Run {
    
      def main(args: Array[String]): Unit = {
    
    
        val port = 1234
        // get the execution environment
        val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
        val dataStream = env.socketTextStream("localhost", port, '\n')
    
        val dataStream2 = dataStream.flatMap(x => x.split(" "))
    
        dataStream2.print()
    
    
    
        if(args == null || args.size ==0){
          env.execute("默认作业")
        }else{
          env.execute(args(0))
        }
    
        println("结束")
    
      }
    
    
    }
    
  • 输出数据

    a
    b
    c
    e
    f
    g

filter

  • 过滤数据
  • 输入数据

    a b c
    a c
    b b
    d d
  • 程序

    package com.opensourceteams.module.bigdata.flink.example.stream.operator.filter
    
    import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment}
    
    /**
      * nc -lk 1234  输入数据
      */
    object Run {
    
      def main(args: Array[String]): Unit = {
    
    
        val port = 1234
        // get the execution environment
        val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
        val dataStream = env.socketTextStream("localhost", port, '\n')
    
        val dataStreamMap = dataStream.filter( x => (x.contains("a")))
    
        dataStreamMap.print()
    
    
        if(args == null || args.size ==0){
          env.execute("默认作业")
        }else{
          env.execute(args(0))
        }
    
        println("结束")
    
      }
    
    
    
    }
    
    
  • 输出数据

    3> a b c
    4> a c

keyBy

  • 指定某列为key,一般按key分组时用
  • 输入数据

    c a b a
  • 程序

        package com.opensourceteams.module.bigdata.flink.example.stream.operator.sum
        
        import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
        import org.apache.flink.streaming.api.windowing.time.Time
        
        /**
          * nc -lk 1234  输入数据
          */
        object Run {
        
          def main(args: Array[String]): Unit = {
        
        
            val port = 1234
            // get the execution environment
            val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
            val dataStream = env.socketTextStream("localhost", port, '\n')
        
        
            val dataStream2 = dataStream.flatMap(x => x.split(" ")).map((_,1))
              .keyBy(0)
        
              //dataStream.keyBy("someKey") // Key by field "someKey"
              //dataStream.keyBy(0) // Key by the first element of a Tuple
        
              .timeWindow(Time.seconds(2))//每2秒滚动窗口
              .sum(1)
        
        
        
        
        
            dataStream2.print()
        
        
        
            if(args == null || args.size ==0){
              env.execute("默认作业")
            }else{
              env.execute(args(0))
            }
        
            println("结束")
        
          }
        
        
        }
    
    
  • 输出数据,数据输出顺序多线程是不固定的,但也是一样的规则取
  • 默认并行度

        6> (a,2)
        4> (c,1)
        2> (b,1)
  • 并行度为1,就先去重,取第一个元素,再按从最后一个开始,即 c a b a 变为 c a b 然后变成 c b a

     (c,1)
     (b,1)
     (a,2)
     

sum

  • keyBy指定某列为key,一般按key分组时用,sum按key分组后求合
  • 输入数据

    c a b a
  • 程序

        package com.opensourceteams.module.bigdata.flink.example.stream.operator.sum
        
        import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
        import org.apache.flink.streaming.api.windowing.time.Time
        
        /**
          * nc -lk 1234  输入数据
          */
        object Run {
        
          def main(args: Array[String]): Unit = {
        
        
            val port = 1234
            // get the execution environment
            val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
            val dataStream = env.socketTextStream("localhost", port, '\n')
        
        
            val dataStream2 = dataStream.flatMap(x => x.split(" ")).map((_,1))
              .keyBy(0)
        
              //dataStream.keyBy("someKey") // Key by field "someKey"
              //dataStream.keyBy(0) // Key by the first element of a Tuple
        
              .timeWindow(Time.seconds(2))//每2秒滚动窗口
              .sum(1)
        
        
        
        
        
            dataStream2.print()
        
        
        
            if(args == null || args.size ==0){
              env.execute("默认作业")
            }else{
              env.execute(args(0))
            }
        
            println("结束")
        
          }
        
        
        }
    
    
  • 输出数据,数据输出顺序多线程是不固定的,但也是一样的规则取
  • 默认并行度

        6> (a,2)
        4> (c,1)
        2> (b,1)
  • 并行度为1,就先去重,取第一个元素,再按从最后一个开始,即 c a b a 变为 c a b 然后变成 c b a

     (c,1)
     (b,1)
     (a,2)
     

reduce

  • keyBy指定某列为key,一般按key分组时用,对相同的key,元素之间进行的函数运算
  • 输入数据

    a b b c
  • 程序

    package com.opensourceteams.module.bigdata.flink.example.stream.operator.reduce
    
    import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
    import org.apache.flink.streaming.api.windowing.time.Time
    
    /**
      * nc -lk 1234  输入数据
      */
    object Run {
    
      def main(args: Array[String]): Unit = {
    
    
        val port = 1234
        // get the execution environment
        val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
        env.setParallelism(1)  //设置并行度
        val dataStream = env.socketTextStream("localhost", port, '\n')
    
    
        val dataStream2 = dataStream.flatMap(x => x.split(" ")).map((_,2))
          .keyBy(0)
    
          //dataStream.keyBy("someKey") // Key by field "someKey"
          //dataStream.keyBy(0) // Key by the first element of a Tuple
    
          .timeWindow(Time.seconds(2))//每2秒滚动窗口
            .reduce((a,b) =>  (a._1,a._2 * b._2) )
    
    
    
    
    
        dataStream2.print()
    
    
    
    
        println("=======================打印StreamPlanAsJSON=======================\n")
        println("JSON转图在线工具: https://flink.apache.org/visualizer")
        println(env.getStreamGraph.getStreamingPlanAsJSON)
        println("==================================================================\n")
    
        if(args == null || args.size ==0){
          env.execute("默认作业")
        }else{
          env.execute(args(0))
        }
    
        println("结束")
    
      }
    
    
    }
    
  • 输出数据,数据输出顺序多线程是不固定的,但也是一样的规则取
  • 默认并行度
    6> (a,2)
    4> (c,1)
    2> (b,1)
  • 并行度为1,就先去重,取第一个元素,再按从最后一个开始,即 c a b a 变为 c a b 然后变成 c b a

(a,2)
(c,2)
(b,4)
 

fold

  • 按key进行处理,第一个参数,是字符串,放在每次处理的最前面第二个是表达式,第二个表达式有两个参数,第一个参数,就是第一个参数的值,第二个参数,我每次循环key时,迭代的下一个元素
  • 输入数据
a a b c c 
  • 程序
package com.opensourceteams.module.bigdata.flink.example.stream.operator.fold

import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
import org.apache.flink.streaming.api.windowing.time.Time

/**
  * nc -lk 1234  输入数据
  */
object Run {

  def main(args: Array[String]): Unit = {


    val port = 1234
    // get the execution environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)  //设置并行度
    val dataStream = env.socketTextStream("localhost", port, '\n')


    val dataStream2 = dataStream.flatMap(x => x.split(" ")).map((_,1))
      .keyBy(0)


      //dataStream.keyBy("someKey") // Key by field "someKey"
      //dataStream.keyBy(0) // Key by the first element of a Tuple

      .timeWindow(Time.seconds(2))//每2秒滚动窗口
      .fold("开始字符串")((str, i) => { str + "-" + i} )





    dataStream2.print()




    println("=======================打印StreamPlanAsJSON=======================\n")
    println("JSON转图在线工具: https://flink.apache.org/visualizer")
    println(env.getStreamGraph.getStreamingPlanAsJSON)
    println("==================================================================\n")

    if(args == null || args.size ==0){
      env.execute("默认作业")
    }else{
      env.execute(args(0))
    }

    println("结束")

  }


}
  • 输出数据
开始字符串-(a,1)-(a,1)
开始字符串-(c,1)-(c,1)
开始字符串-(b,1)

Aggregations (sum ,max,min)

sum

  • 处理所有元素,相同key进行累加
  • 输入数据
a a c b c
  • 程序
package com.opensourceteams.module.bigdata.flink.example.stream.operator.aggregations.sum

import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
import org.apache.flink.streaming.api.windowing.time.Time

/**
  * nc -lk 1234  输入数据
  */
object Run {

  def main(args: Array[String]): Unit = {


    val port = 1234
    // get the execution environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    //env.setParallelism(1)  //设置并行度,不设置就是默认最高并行度为的cpu ,我的四核8线程,就是最高并行度为8
    val dataStream = env.socketTextStream("localhost", port, '\n')


    val dataStream2 = dataStream.flatMap(x => x.split(" ")).map((_,1))
      .keyBy(0)

      //dataStream.keyBy("someKey") // Key by field "someKey"
      //dataStream.keyBy(0) // Key by the first element of a Tuple

      .timeWindow(Time.seconds(2))//每2秒滚动窗口
      .sum(1)





    dataStream2.print()




    println("=======================打印StreamPlanAsJSON=======================\n")
    println("JSON转图在线工具: https://flink.apache.org/visualizer")
    println(env.getStreamGraph.getStreamingPlanAsJSON)
    println("==================================================================\n")

    if(args == null || args.size ==0){
      env.execute("默认作业")
    }else{
      env.execute(args(0))
    }

    println("结束")

  }


}
  • 输出数据
4> (c,2)
2> (b,1)
6> (a,2)

min(和minBy一样,没发现区别)

  • 处理所有元素,对相同key的元素,进行求最小的值
  • 输入数据
b a b a a b
  • 程序
package com.opensourceteams.module.bigdata.flink.example.datastream.operator.aggregations.sum


import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
import org.apache.flink.streaming.api.windowing.time.Time
//import org.apache.flink.streaming.api.scala._

/**
  * nc -lk 1234  输入数据
  */
object Run {

  def main(args: Array[String]): Unit = {


    val port = 1234
    // get the execution environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    //env.setParallelism(1)  //设置并行度,不设置就是默认最高并行度为的cpu ,我的四核8线程,就是最高并行度为8
    val dataStream = env.socketTextStream("localhost", port, '\n')

    var i  = 0

    val dataStream2 = dataStream.flatMap(x => x.split(" ")).map( x => {
      i = i + 1
      (x,i)
    })
      .keyBy(0)

      .timeWindow(Time.seconds(2))//每2秒滚动窗口
      .min(1)





    dataStream2.print()




    println("=======================打印StreamPlanAsJSON=======================\n")
    println("JSON转图在线工具: https://flink.apache.org/visualizer")
    println(env.getStreamGraph.getStreamingPlanAsJSON)
    println("==================================================================\n")

    if(args == null || args.size ==0){
      env.execute("默认作业")
    }else{
      env.execute(args(0))
    }

    println("结束")

  }


}
  • 输出数据
  • 最终输出数据,顺序取决于线程的调用
2> (b,1)
6> (a,2)
  • 中间输出数据
6> (a,2)
2> (b,1)
6> (a,4)
2> (b,3)
6> (a,5)
2> (b,6)

max(和maxBy一样)

  • 处理所有元素,对相同key的元素,进行求最大的值
  • 输入数据
b a b a a b
  • 程序
package com.opensourceteams.module.bigdata.flink.example.datastream.operator.aggregations.sum


import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
import org.apache.flink.streaming.api.windowing.time.Time
//import org.apache.flink.streaming.api.scala._

/**
  * nc -lk 1234  输入数据
  */
object Run {

  def main(args: Array[String]): Unit = {


    val port = 1234
    // get the execution environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    //env.setParallelism(1)  //设置并行度,不设置就是默认最高并行度为的cpu ,我的四核8线程,就是最高并行度为8
    val dataStream = env.socketTextStream("localhost", port, '\n')

    var i  = 0

    val dataStream2 = dataStream.flatMap(x => x.split(" ")).map( x => {
      i = i + 1
      (x,i)
    })
      .keyBy(0)

      .timeWindow(Time.seconds(2))//每2秒滚动窗口
      .max(1)





    dataStream2.print()




    println("=======================打印StreamPlanAsJSON=======================\n")
    println("JSON转图在线工具: https://flink.apache.org/visualizer")
    println(env.getStreamGraph.getStreamingPlanAsJSON)
    println("==================================================================\n")

    if(args == null || args.size ==0){
      env.execute("默认作业")
    }else{
      env.execute(args(0))
    }

    println("结束")

  }


}
  • 输出数据
  • 最终输出数据,顺序取决于线程的调用

2> (b,6)
6> (a,5)
  • 中间输出数据

6> (a,2)
2> (b,1)
6> (a,4)
2> (b,3)
6> (a,5)
2> (b,6)

Window

window

  • 定义window,并指定分配元素到window的方式
  • 可以在已经分区的KeyedStream上定义Windows。 Windows根据某些特征(例如,在最后5秒内到达的数据)对每个密钥中的数据进行分组。 有关窗口的完整说明,请参见windows。
  • 输入数据
b a b a a b
  • 程序
package com.opensourceteams.module.bigdata.flink.example.datastream.operator.window.window

import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
import org.apache.flink.streaming.api.windowing.assigners.{ TumblingProcessingTimeWindows}
import org.apache.flink.streaming.api.windowing.time.Time

/**
  * nc -lk 1234  输入数据
  */
object Run {

  def main(args: Array[String]): Unit = {


    val port = 1234
    // get the execution environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)  //设置并行度
    val dataStream = env.socketTextStream("localhost", port, '\n')


    val dataStream2 = dataStream.flatMap(x => x.split(" ")).map((_,1))
      .keyBy(0)

      /**
        * 定义window,并指定分配元素到window的方式
        * 可以在已经分区的KeyedStream上定义Windows。 Windows根据某些特征(例如,在最后5秒内到达的数据)对每个密钥中的数据进行分组。 有关窗口的完整说明,请参见windows。
        */
      .window(TumblingProcessingTimeWindows.of(Time.seconds(2)))

      .sum(1)





    dataStream2.print()




    println("=======================打印StreamPlanAsJSON=======================\n")
    println("JSON转图在线工具: https://flink.apache.org/visualizer")
    println(env.getStreamGraph.getStreamingPlanAsJSON)
    println("==================================================================\n")

    if(args == null || args.size ==0){
      env.execute("默认作业")
    }else{
      env.execute(args(0))
    }

    println("结束")

  }


}
  • 输出数据
(b,3)
(a,3)

WindowAll

  • 配合process.ProcessAllWindowFunction函数,该函数的参数elements: Iterable[(String, Int)] 即为当前window的所有元素,可以进行处理,再发给下游sink
  • 输入数据
b a b a a b
  • 程序
package com.opensourceteams.module.bigdata.flink.example.datastream.operator.window.windowAll

import org.apache.flink.streaming.api.scala.function.ProcessAllWindowFunction
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

/**
  * nc -lk 1234  输入数据
  */
object Run {

  def main(args: Array[String]): Unit = {


    val port = 1234
    // get the execution environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    //env.setParallelism(1)  //设置并行度
    val dataStream = env.socketTextStream("localhost", port, '\n')


    dataStream.flatMap(x => x.split(" ")).map((_,1))
      .keyBy(0)


      .windowAll( TumblingProcessingTimeWindows.of(Time.seconds(2)))
        .process(new ProcessAllWindowFunction[(String, Int),(String, Int),TimeWindow] {
          override def process(context: Context, elements: Iterable[(String, Int)], out: Collector[(String, Int)]): Unit = {
            //可以对当前window中的所有元素进行操作,处理后,再发送给Sink
            for(element <- elements) out.collect(element)
          }
        })

        .print()





    println("=======================打印StreamPlanAsJSON=======================\n")
    println("JSON转图在线工具: https://flink.apache.org/visualizer")
    println(env.getStreamGraph.getStreamingPlanAsJSON)
    println("==================================================================\n")

    if(args == null || args.size ==0){
      env.execute("默认作业")
    }else{
      env.execute(args(0))
    }

    println("结束")












  }


}
  • 输出数据

8> (a,1)
7> (b,1)
3> (a,1)
2> (a,1)
1> (b,1)
4> (b,1)

Window.apply

  • 对一批window进行元素处理
  • 输入数据
b a b a a b
  • 程序
package com.opensourceteams.module.bigdata.flink.example.datastream.operator.window.window.apply

import org.apache.flink.api.java.tuple.Tuple
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.{TimeWindow, Window}
import org.apache.flink.util.Collector

/**
  * nc -lk 1234  输入数据
  */
object Run {

  def main(args: Array[String]): Unit = {


    val port = 1234
    // get the execution environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
   // env.setParallelism(1)  //设置并行度
    val dataStream = env.socketTextStream("localhost", port, '\n')


    val dataStream2 = dataStream.flatMap(x => x.split(" ")).map((_,1))
      .keyBy(0)

      /**
        * 定义window,并指定分配元素到window的方式
        * 可以在已经分区的KeyedStream上定义Windows。 Windows根据某些特征(例如,在最后5秒内到达的数据)对每个密钥中的数据进行分组。 有关窗口的完整说明,请参见windows。
        */
      .window(TumblingProcessingTimeWindows.of(Time.seconds(2)))

      /**
        * * @tparam IN The type of the input value.
        * * @tparam OUT The type of the output value.
        * * @tparam KEY The type of the key.
        */
      .apply(new WindowFunction[(String,Int),(String,Int),Tuple,TimeWindow] {
      override def apply(key: Tuple, window: TimeWindow, input: Iterable[(String, Int)], out: Collector[(String, Int)]): Unit ={
        //对window的所有元素进行处理
        for(element <- input) out.collect(element)
      }

    })






    dataStream2.print()




    println("=======================打印StreamPlanAsJSON=======================\n")
    println("JSON转图在线工具: https://flink.apache.org/visualizer")
    println(env.getStreamGraph.getStreamingPlanAsJSON)
    println("==================================================================\n")

    if(args == null || args.size ==0){
      env.execute("默认作业")
    }else{
      env.execute(args(0))
    }

    println("结束")

  }


}

  • 输出数据
2> (a,1)
3> (a,1)
7> (b,1)
4> (b,1)
1> (b,1)
8> (a,1)
    

Window.reduce

  • 处理所有元素,对window中相同key的元素进行,函数表达式计算
  • 输入数据
b a b a a b
  • 程序
package com.opensourceteams.module.bigdata.flink.example.datastream.operator.window.window.reduce

import org.apache.flink.api.java.tuple.Tuple
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

/**
  * nc -lk 1234  输入数据
  */
object Run {

  def main(args: Array[String]): Unit = {


    val port = 1234
    // get the execution environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
   // env.setParallelism(1)  //设置并行度
    val dataStream = env.socketTextStream("localhost", port, '\n')


    val dataStream2 = dataStream.flatMap(x => x.split(" ")).map((_,1))
      .keyBy(0)

      /**
        * 定义window,并指定分配元素到window的方式
        * 可以在已经分区的KeyedStream上定义Windows。 Windows根据某些特征(例如,在最后5秒内到达的数据)对每个密钥中的数据进行分组。 有关窗口的完整说明,请参见windows。
        */
      .window(TumblingProcessingTimeWindows.of(Time.seconds(2)))

      /**
        * * @tparam IN The type of the input value.
        * * @tparam OUT The type of the output value.
        * * @tparam KEY The type of the key.
        */
      .reduce((a,b) => (a._1,a._2 +b._2))








    dataStream2.print()




    println("=======================打印StreamPlanAsJSON=======================\n")
    println("JSON转图在线工具: https://flink.apache.org/visualizer")
    println(env.getStreamGraph.getStreamingPlanAsJSON)
    println("==================================================================\n")

    if(args == null || args.size ==0){
      env.execute("默认作业")
    }else{
      env.execute(args(0))
    }

    println("结束")

  }


}
  • 输出数据
2> (b,3)
6> (a,3)

Window.fold

  • 按key进行处理,第一个参数,是字符串,放在每次处理的最前面第二个是表达式,第二个表达式有两个参数,第一个参数,就是第一个参数的值,第二个参数,我每次循环key时,迭代的下一个元素
  • 输入数据
b a b a a b
  • 程序
package com.opensourceteams.module.bigdata.flink.example.datastream.operator.window.window.fold

import org.apache.flink.api.java.tuple.Tuple
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

/**
  * nc -lk 1234  输入数据
  */
object Run {

  def main(args: Array[String]): Unit = {


    val port = 1234
    // get the execution environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
   // env.setParallelism(1)  //设置并行度
    val dataStream = env.socketTextStream("localhost", port, '\n')


    val dataStream2 = dataStream.flatMap(x => x.split(" ")).map((_,1))
      .keyBy(0)

      /**
        * 定义window,并指定分配元素到window的方式
        * 可以在已经分区的KeyedStream上定义Windows。 Windows根据某些特征(例如,在最后5秒内到达的数据)对每个密钥中的数据进行分组。 有关窗口的完整说明,请参见windows。
        */
      .window(TumblingProcessingTimeWindows.of(Time.seconds(2)))


      /**
        * 按key进行处理,第一个参数,是字符串,放在每次处理的最前面第二个是表达式,第二个表达式有两个参数,第一个参数,就是第一个参数的值,第二个参数,我每次循环key时,迭代的下一个元素
        */
      .fold("字符串开始")((str, i) => { str + "-" + i} )








    dataStream2.print()




    println("=======================打印StreamPlanAsJSON=======================\n")
    println("JSON转图在线工具: https://flink.apache.org/visualizer")
    println(env.getStreamGraph.getStreamingPlanAsJSON)
    println("==================================================================\n")

    if(args == null || args.size ==0){
      env.execute("默认作业")
    }else{
      env.execute(args(0))
    }

    println("结束")

  }


}

  • 输出数据
6> 字符串开始-(a,1)-(a,1)-(a,1)
2> 字符串开始-(b,1)-(b,1)-(b,1)

DataStream union DataStream

  • 对两个DataStream进行合并,合并之后不能再次进行计算
  • 输入数据
a a b
c c a
  • 程序
package com.opensourceteams.module.bigdata.flink.example.datastream.operator.union

import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.scala._

/**
  * nc -lk 1234  输入数据
  */
object Run {

  def main(args: Array[String]): Unit = {


    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

   val dataStream1 = getDataStream(env,1234,"localhost")
   val dataStream2 = getDataStream(env,12345,"localhost")

    /**
      * 只是将两个流的数据,union在一起,之后,不能再进行操作了
      */
    val dataStream3 = dataStream1.union(dataStream2)




    dataStream3.print()




    println("=======================打印StreamPlanAsJSON=======================\n")
    println("JSON转图在线工具: https://flink.apache.org/visualizer")
    println(env.getStreamGraph.getStreamingPlanAsJSON)
    println("==================================================================\n")

    if(args == null || args.size ==0){
      env.execute("默认作业")
    }else{
      env.execute(args(0))
    }

    println("结束")

  }

  def getDataStream(env: StreamExecutionEnvironment,port:Int,host:String):DataStream[(String,Int)]={


    //env.setParallelism(1)  //设置并行度
    val dataStream = env.socketTextStream(host, port, '\n')

    val dataStream2 = dataStream.flatMap(x => x.split(" ")).map((_,1))
      .keyBy(0)

      .timeWindow(Time.seconds(5))//每2秒滚动窗口
      .sum(1)

    dataStream2

  }


}
  • 输出数据
6> (a,2)
4> (c,2)
2> (b,1)
6> (a,1)

DataStream join DataStram

  • join,两上流根据key相等进行连接,然后调用apply函数,进行具体的相同key进行计算
  • 输入数据
a a b
c c c a b
  • 程序
package com.opensourceteams.module.bigdata.flink.example.datastream.operator.join

import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows

/**
  * nc -lk 1234  输入数据
  */
object Run {

  def main(args: Array[String]): Unit = {


    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

   val dataStream1 = getDataStream(env,1234,"localhost")
   val dataStream2 = getDataStream(env,12345,"localhost")

    /**
      * 只是将两个流的数据,union在一起,之后,不能再进行操作了
      */
    val dataStream3 = dataStream1.join(dataStream2)




    dataStream3.where(x => x._1).equalTo(x => x._1)
        .window(TumblingProcessingTimeWindows.of(Time.seconds(2)))
        .apply((a,b) => (a._1,a._2 + b._2) )

        .print()






    println("=======================打印StreamPlanAsJSON=======================\n")
    println("JSON转图在线工具: https://flink.apache.org/visualizer")
    println(env.getStreamGraph.getStreamingPlanAsJSON)
    println("==================================================================\n")

    if(args == null || args.size ==0){
      env.execute("默认作业")
    }else{
      env.execute(args(0))
    }

    println("结束")

  }

  def getDataStream(env: StreamExecutionEnvironment,port:Int,host:String):DataStream[(String,Int)]={


    //env.setParallelism(1)  //设置并行度
    val dataStream = env.socketTextStream(host, port, '\n')

    val dataStream2 = dataStream.flatMap(x => x.split(" ")).map((_,1))
      .keyBy(0)

      .timeWindow(Time.seconds(5))//每2秒滚动窗口
      .sum(1)

    dataStream2

  }


}
  • 输出数据
6> (a,3)
2> (b,2)

DataStream.intervalJoin

  • 两流,根据key,连接,找到两流都有的共同元素,进行函数处理process()
  • 输入数据
c c a
a a b
  • 程序
package com.opensourceteams.module.bigdata.flink.example.datastream.operator.intervaljoin

import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.util.Collector

/**
  * nc -lk 1234  输入数据
  */
object Run {

  def main(args: Array[String]): Unit = {


    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime)


   val dataStream1 = getDataStream(env,1234,"localhost")
   val dataStream2 = getDataStream(env,12345,"localhost")

    val dataStream3 = dataStream1.keyBy(0).intervalJoin(dataStream2.keyBy(0))




    dataStream3.between(Time.seconds(-5), Time.seconds(5))
      //.upperBoundExclusive(true) // optional
      //.lowerBoundExclusive(true) // optional
        .process(new ProcessJoinFunction[(String,Int),(String,Int),String] {
      override def processElement(left: (String, Int), right: (String, Int), ctx: ProcessJoinFunction[(String, Int), (String, Int), String]#Context, out: Collector[String]): Unit = {
        println(left + "," + right)
        out.collect( left + "," + right)
      }
    })





    println("=======================打印StreamPlanAsJSON=======================\n")
    println("JSON转图在线工具: https://flink.apache.org/visualizer")
    println(env.getStreamGraph.getStreamingPlanAsJSON)
    println("==================================================================\n")

    if(args == null || args.size ==0){
      env.execute("默认作业")
    }else{
      env.execute(args(0))
    }

    println("结束")

  }

  def getDataStream(env: StreamExecutionEnvironment,port:Int,host:String):DataStream[(String,Int)]={


    //env.setParallelism(1)  //设置并行度
    val dataStream = env.socketTextStream(host, port, '\n')

    val dataStream2 = dataStream.flatMap(x => x.split(" ")).map((_,1))

    dataStream2

  }


}
  • 输出数据
(a,1),(a,1)
(a,1),(a,1)

DataStream.cogroup

  • 对两个数据流进行处理,按key去重取两个流的并集,再按key分别统计每一个流的元素,进行汇总处理
  • 第一个流的元素为 c,a 第二个流的元素为 a b ,所以一个统计了四个元素,每一个元素在每个流中千是多少,也统计出来了
  • 输入数据
c c a
a a b
  • 程序
模板
  • 输出数据
==============开始
first
[(a,1)]
second
[(a,1), (a,1)]
==============结束
==============开始
first
[(c,1), (c,1)]
second
[]
==============结束
==============开始
first
[]
second
[(b,1)]
==============结束

DataStraem.connect

  • 相当于两个数的数据,都通过 ConnectStream.函数来处理,函数都有两个方法,一个处理流一,一个处理流二
  • 输入数据
c c a
a a b
  • 程序
模板
  • 输出数据
(a,1)
(a,1)
(b,1)
(c,1)
(c,1)
(a,1)

DataStraem.coMap

  • 相当于两个数的数据,都通过 ConnectStream.函数来处理,函数都有两个方法,一个处理流一,一个处理流二
  • 输入数据
c c a
a a b
  • 程序
模板
  • 输出数据
(a,1)
(a,1)
(b,1)
(c,1)
(c,1)
(a,1)

DataStraem.coFlatMap

  • 相当于两个数的数据,都通过 ConnectStream.函数来处理,函数都有两个方法,一个处理流一,一个处理流二
  • 输入数据
c c a
a a b
  • 程序
package com.opensourceteams.module.bigdata.flink.example.datastream.operator.coFlatMap

import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.co.CoMapFunction
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}

/**
  * nc -lk 1234  输入数据
  */
object Run {

  def main(args: Array[String]): Unit = {


    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime)
    env.setParallelism(1) //由于想看数据结构,所以先设为1,这样


   val dataStream1 = getDataStream(env,1234,"localhost")
   val dataStream2 = getDataStream(env,12345,"localhost")

    val dataStream3 = dataStream1.connect(dataStream2)




    dataStream3
        .flatMap(x => x.toString.split(" ") , x => x.toString.split(" "))
        .print()



    println("=======================打印StreamPlanAsJSON=======================\n")
    println("JSON转图在线工具: https://flink.apache.org/visualizer")
    println(env.getStreamGraph.getStreamingPlanAsJSON)
    println("==================================================================\n")

    if(args == null || args.size ==0){
      env.execute("默认作业")
    }else{
      env.execute(args(0))
    }

    println("结束")

  }

  def getDataStream(env: StreamExecutionEnvironment,port:Int,host:String):DataStream[String]={


    //env.setParallelism(1)  //设置并行度
    val dataStream = env.socketTextStream(host, port, '\n')

   // val dataStream2 = dataStream.flatMap(x => x.split(" ")).map((_,1))

    dataStream

  }


}
  • 输出数据
a
a
b
c
c
a

Datastream.assignAscendingTimestamps

  • 指定时间戳
  • 输入数据
c c a
  • 程序
package com.opensourceteams.module.bigdata.flink.example.datastream.operator.assignTimestamps

import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
import org.apache.flink.streaming.api.windowing.time.Time

/**
  * nc -lk 1234  输入数据
  */
object Run {

  def main(args: Array[String]): Unit = {


    val port = 1234
    // get the execution environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)  //设置并行度
    val dataStream = env.socketTextStream("localhost", port, '\n')

    dataStream.assignAscendingTimestamps(x => System.currentTimeMillis())

    val dataStream2 = dataStream.flatMap(x => x.split(" ")).map((_,1))
      .keyBy(0)


      .timeWindow(Time.seconds(2))//每2秒滚动窗口
      .sum(1)





    dataStream2.print()




    println("=======================打印StreamPlanAsJSON=======================\n")
    println("JSON转图在线工具: https://flink.apache.org/visualizer")
    println(env.getStreamGraph.getStreamingPlanAsJSON)
    println("==================================================================\n")

    if(args == null || args.size ==0){
      env.execute("默认作业")
    }else{
      env.execute(args(0))
    }

    println("结束")

  }


}
  • 输出数据
(c,2)
(a,1)
相关实践学习
基于Hologres轻松玩转一站式实时仓库
本场景介绍如何利用阿里云MaxCompute、实时计算Flink和交互式分析服务Hologres开发离线、实时数据融合分析的数据大屏应用。
Linux入门到精通
本套课程是从入门开始的Linux学习课程,适合初学者阅读。由浅入深案例丰富,通俗易懂。主要涉及基础的系统操作以及工作中常用的各种服务软件的应用、部署和优化。即使是零基础的学员,只要能够坚持把所有章节都学完,也一定会受益匪浅。
相关文章
|
7月前
|
SQL API 数据处理
实时计算 Flink版产品使用合集之DataStream方式是否可以实现oracle--&gt;的数据同步
实时计算Flink版作为一种强大的流处理和批处理统一的计算框架,广泛应用于各种需要实时数据处理和分析的场景。实时计算Flink版通常结合SQL接口、DataStreamAPI、以及与上下游数据源和存储系统的丰富连接器,提供了一套全面的解决方案,以应对各种实时计算需求。其低延迟、高吞吐、容错性强的特点,使其成为众多企业和组织实时数据处理首选的技术平台。以下是实时计算Flink版的一些典型使用合集。
|
7月前
|
SQL Java 关系型数据库
Flink DataSet API迁移到DataStream API实战
本文介绍了作者的Flink项目从DataSet API迁移到DataStream API的背景、方法和遇到的问题以及解决方案。
224 3
|
2月前
|
消息中间件 分布式计算 大数据
大数据-121 - Flink Time Watermark 详解 附带示例详解
大数据-121 - Flink Time Watermark 详解 附带示例详解
86 0
|
2月前
|
消息中间件 关系型数据库 MySQL
大数据-117 - Flink DataStream Sink 案例:写出到MySQL、写出到Kafka
大数据-117 - Flink DataStream Sink 案例:写出到MySQL、写出到Kafka
204 0
|
2月前
|
消息中间件 NoSQL Kafka
大数据-116 - Flink DataStream Sink 原理、概念、常见Sink类型 配置与使用 附带案例1:消费Kafka写到Redis
大数据-116 - Flink DataStream Sink 原理、概念、常见Sink类型 配置与使用 附带案例1:消费Kafka写到Redis
200 0
|
2月前
|
SQL 消息中间件 分布式计算
大数据-115 - Flink DataStream Transformation 多个函数方法 FlatMap Window Aggregations Reduce
大数据-115 - Flink DataStream Transformation 多个函数方法 FlatMap Window Aggregations Reduce
43 0
|
4月前
|
消息中间件 传感器 数据处理
"揭秘实时流式计算:低延迟、高吞吐量的数据处理新纪元,Apache Flink示例带你领略实时数据处理的魅力"
【8月更文挑战第10天】实时流式计算即时处理数据流,低延迟捕获、处理并输出数据,适用于金融分析等需即时响应场景。其框架(如Apache Flink)含数据源、处理逻辑及输出目标三部分。例如,Flink可从数据流读取信息,转换后输出。此技术优势包括低延迟、高吞吐量、强容错性及处理逻辑的灵活性。
106 4
|
5月前
|
SQL 关系型数据库 MySQL
实时计算 Flink版产品使用问题之怎么使用DataStream生成结果表
实时计算Flink版作为一种强大的流处理和批处理统一的计算框架,广泛应用于各种需要实时数据处理和分析的场景。实时计算Flink版通常结合SQL接口、DataStream API、以及与上下游数据源和存储系统的丰富连接器,提供了一套全面的解决方案,以应对各种实时计算需求。其低延迟、高吞吐、容错性强的特点,使其成为众多企业和组织实时数据处理首选的技术平台。以下是实时计算Flink版的一些典型使用合集。
|
6月前
|
存储 监控 数据处理
Flink⼤状态作业调优实践指南:Datastream 作业篇
本文整理自俞航翔、陈婧敏、黄鹏程老师所撰写的大状态作业调优实践指南。
56656 5
Flink⼤状态作业调优实践指南:Datastream 作业篇
|
7月前
|
存储 算法 API
Flink DataStream API 批处理能力演进之路
本文由阿里云 Flink 团队郭伟杰老师撰写,旨在向 Flink Batch 社区用户介绍 Flink DataStream API 批处理能力的演进之路。
635 2
Flink DataStream API 批处理能力演进之路