For the About云 log analysis project, how do we filter and clean the logs? This post is based on an article from abroad, summarized and shared here.
We use Spark to analyze website access logs, where the log files contain billions of lines. Let's start by looking at how Spark is used and how it works. I worked with Hadoop a few years ago, and it turns out Spark is just as easy to pick up.
A few things to note up front:
If you already know how to use Spark and want to see how to process access logs with it, this short article shows how to produce a ranking of URL hit counts from an Apache access log file (a sketch of the full pipeline appears at the end of this post).
Spark requires Hadoop to be installed, and the two versions must be compatible. For installation, see the following article:
About云 log analysis project preparation 6: building the Hadoop and Spark clusters
http://www.aboutyun.com/forum.php?mod=viewthread&tid=20620
Start the shell:
./bin/spark-shell
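Without arguments, spark-shell uses whatever master is configured in conf/spark-defaults.conf (or local mode if none is set). To point it at a standalone cluster like the one appearing in the logs below, you can pass the master URL explicitly (the host name here is an assumption, use your master's):

./bin/spark-shell --master spark://master:7077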
You may hit an error:
java.io.FileNotFoundException: File file:/data/spark_data/history/event-log does not exist
The fix is to create the missing directory:
mkdir -p /data/spark_data/history/event-log
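The root cause is usually that event logging is enabled in conf/spark-defaults.conf but the directory it points to was never created. For reference, a sketch of the entries involved (the path matches this cluster's setup; adjust to yours):

spark.eventLog.enabled           true
spark.eventLog.dir               file:///data/spark_data/history/event-log
spark.history.fs.logDirectory    file:///data/spark_data/history/event-log

Either create the directory as above on each node, or point these properties at an HDFS path so every node shares the same location.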
For reference, the full error output:
17/10/08 17:00:23 INFO client.AppClient$ClientEndpoint: Executor updated: app-20171008170022-0000/1 is now RUNNING
17/10/08 17:00:25 ERROR spark.SparkContext: Error initializing SparkContext.
java.io.FileNotFoundException: File file:/data/spark_data/history/event-log does not exist
        ...
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
java.lang.NullPointerException
        ...
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
<console>:16: error: not found: value sqlContext
         import sqlContext.implicits._
                ^
<console>:16: error: not found: value sqlContext
         import sqlContext.sql
                ^
In the Spark shell, read a file into an RDD:
val textFile = sc.textFile("file:///data/spark/README.md")
Note: if you point this at a file you created yourself, the read may fail. sc.textFile is lazy, so the error only surfaces when an action runs; and because a file:// path is resolved locally on each worker, the file must exist at the same path on every node. The error looks like this:
java.io.FileNotFoundException: File file:/data/spark/change.txt does not exist
        ...
Caused by: java.io.FileNotFoundException: File file:/data/spark/change.txt does not exist
        at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
        at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)
        at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:140)
        at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:341)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:766)
        at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:108)
        at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
        at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:240)
        at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:211)
        at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:89)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
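A common way around this is to upload the file to HDFS and read it from there, so every worker sees the same copy. A sketch, assuming the NameNode answers at hdfs://master:8020 (host and port are assumptions, use your cluster's):

hadoop fs -mkdir -p /data/spark
hadoop fs -put /data/spark/change.txt /data/spark/

Then in the shell:

val textFile = sc.textFile("hdfs://master:8020/data/spark/change.txt")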
Now run two simple actions: count returns the number of lines, and first returns the first line.
textFile.count
textFile.first
The output of first looks like this:
scala> textFile.first
17/10/08 18:34:23 INFO spark.SparkContext: Starting job: first at <console>:30
17/10/08 18:34:23 INFO scheduler.DAGScheduler: Got job 1 (first at <console>:30) with 1 output partitions
17/10/08 18:34:23 INFO scheduler.DAGScheduler: Final stage: ResultStage 1 (first at <console>:30)
17/10/08 18:34:23 INFO scheduler.DAGScheduler: Parents of final stage: List()
17/10/08 18:34:23 INFO scheduler.DAGScheduler: Missing parents: List()
17/10/08 18:34:23 INFO scheduler.DAGScheduler: Submitting ResultStage 1 (file:///data/spark/README.md MapPartitionsRDD[1] at textFile at <console>:27), which has no missing parents
17/10/08 18:34:23 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 3.1 KB, free 517.2 MB)
17/10/08 18:34:23 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1843.0 B, free 517.2 MB)
17/10/08 18:34:23 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.1.10:41717 (size: 1843.0 B, free: 517.4 MB)
17/10/08 18:34:23 INFO spark.SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1006
17/10/08 18:34:23 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (file:///data/spark/README.md MapPartitionsRDD[1] at textFile at <console>:27)
17/10/08 18:34:23 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
17/10/08 18:34:23 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, slave2, partition 0,PROCESS_LOCAL, 2128 bytes)
17/10/08 18:34:23 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on slave2:35228 (size: 1843.0 B, free: 517.4 MB)
17/10/08 18:34:23 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 116 ms on slave2 (1/1)
17/10/08 18:34:23 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
17/10/08 18:34:23 INFO scheduler.DAGScheduler: ResultStage 1 (first at <console>:30) finished in 0.117 s
17/10/08 18:34:23 INFO scheduler.DAGScheduler: Job 1 finished: first at <console>:30, took 0.161753 s
res1: String = # Apache Spark
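With the shell working, here is a minimal sketch of the URL hit-count ranking promised at the start. It assumes an Apache common/combined format log at the hypothetical path file:///data/spark/access.log (present on every node, or moved into HDFS as above) and uses a simple regex to pull out the requested URL; lines that do not match the pattern are dropped, which doubles as a crude cleaning step.

// Matches the request field of a common/combined log line,
// e.g. ... "GET /index.html HTTP/1.1" 200 1043 ...
// (assumption: well-formed lines; a real parser would be stricter)
val logPattern = """^\S+ \S+ \S+ \[[^\]]+\] "\S+ (\S+) \S+" \d+ \S+.*""".r

val accessLog = sc.textFile("file:///data/spark/access.log")

// Keep only parsable lines and extract the URL from each.
val urls = accessLog.flatMap {
  case logPattern(url) => Some(url)
  case _               => None
}

// Count hits per URL and sort descending by hit count.
val topUrls = urls.map(url => (url, 1))
                  .reduceByKey(_ + _)
                  .sortBy(_._2, ascending = false)

// Print the ten most requested URLs.
topUrls.take(10).foreach { case (url, count) => println(s"$count\t$url") }

reduceByKey aggregates counts on the workers before shuffling, which is what makes this workable on billions of lines; collecting the raw URLs to the driver first would not be.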