kafka-0.10.0官网翻译(一)入门指南

本文涉及的产品
日志服务 SLS,月写入数据量 50GB 1个月
简介: 1.1 IntroductionKafka is a distributed streaming platform. What exactly does that mean?kafka是一个分布式的流式平台,它到底是什么意思? We think of a streaming platform as...

1.1 Introduction
Kafka is a distributed streaming platform. What exactly does that mean?
kafka是一个分布式的流式平台,它到底是什么意思?


We think of a streaming platform as having three key capabilities:
我们认为流式平台有以下三个主要的功能:
  It let's you publish and subscribe to streams of records. In this respect it is similar to a message queue or enterprise messaging system.
  它能让你推送和订阅流记录。在这个方面它类似于一个消息队列或者企业级的消息系统。
  It let's you store streams of records in a fault-tolerant way.
  它能让你存储流记录以一种容错的方式。
  It let's you process streams of records as they occur.
  它让你处理流记录当流数据来到时。

新浪微博:intsmaze刘洋洋哥


What is Kafka good for?
kafka有什么好处?
  It gets used for two broad classes of application:
  它被用于两大类别的应用程序:
  Building real-time streaming data pipelines that reliably get data between systems or applications
  建立实时的流式数据通道,这个通道能可靠的获取到在系统或应用间的数据
  Building real-time streaming applications that transform or react to the streams of data
  建立实时流媒体应用来转换流数据或对流数据做出反应。


  To understand how Kafka does these things, let's dive in and explore Kafka's capabilities from the bottom up.
  为了明白kafka能怎么做这些事情,让我们从下面开始深入探索kafka的功能:
First a few concepts:
首先看这几个概念:
  Kafka is run as a cluster on one or more servers.
  kafka作为集群运行在一个或多个服务器。
  The Kafka cluster stores streams of records in categories called topics.
  kafka集群存储的流记录以类别划分称为主题。
  Each record consists of a key, a value, and a timestamp.
  每条记录包含一个键,一个值和一个时间戳。
    
  Kafka has four core APIs:
  kafka有四个核心的apis:
  The Producer API allows an application to publish a stream records to one or more Kafka topics.
  生成者api允许一个应用去推送一个流记录到一个或多个kafka主题上。
  The Consumer API allows an application to subscribe to one or more topics and process the stream of records produced to them.
  消费者api允许一个应用去订阅一个或多个主题,对他们产生过程的记录。
  The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming the input streams to output streams.
  这个Streams API允许应用去作为一个流处理器,消费一个来至于一个或多个主题的输入流,生产一个输出流到一个或多个输出流主题,有效地将输入流转换为输出流。
  The Connector API allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to
  Connector API允许建立和允许可重用的生产者或消费者去连接kafka主题到存在的应用或数据系统。例如,关系数据库的连接器可能捕获每一个变化。


  In Kafka the communication between the clients and the servers is done with a simple, high-performance, language agnostic TCP protocol. This protocol is versioned and maintains backwards compatibility with older version. We provide a Java client for Kafka, but clients are available in many languages.
  在kafka的客户和服务器之间的通信是用一个简单的,高性能的,语言无关的TCP协议完成的。这个协议的版本能向后维护来兼容旧版本。我们为kafka提供一个java版本的客户端,其实这个客户端有很多语言版本供选择。


Topics and Logs 主题和日志
  Let's first dive into the core abstraction Kafka provides for a stream of records—the topic.
  我们首先深入kafka核心概念,kafka提供了一连串的记录称为主题。

  A topic is a category or feed name to which records are published. Topics in Kafka are always multi-subscriber; that is, a topic can have zero, one, or many consumers that subscribe to the data written to it.

  主题就是一个类别或者命名哪些记录会被推送走。kafka中的主题总是有多个订阅者。所以,一个主题可以有零个,一个或多个消费者去订阅写到这个主题里面的数据。
  For each topic, the Kafka cluster maintains a partitioned log that looks like this:
  针对每一个主题,这个kafka集群维护一个像下面这样的分区日志:

  Each partition is an ordered, immutable sequence of records that is continually appended to—a structured commit log. The records in the partitions are each assigned a sequential id number called the offset that uniquely identifies each record within the partition.
  每个分区是一个有序,不变的序列的记录,它被不断追加—这种结构化的操作日志。分区的记录都分配了一个连续的id号叫做偏移量。偏移量唯一的标识在分区的每一条记录。

  The Kafka cluster retains all published records—whether or not they have been consumed—using a configurable retention period. For example if the retention policy is set to two days, then for the two days after a record is published, it is available for consumption, after which it will be discarded to free up space. Kafka's performance is effectively constant with respect to data size so storing data for a long time is not a problem.

  kafka集群使用一个可配置的保存期来保存所以已经推送出去的记录,不论他们是否已经被消费掉。例如,如果保存的策略设置为两天,然后记录被推送出去两天后,这个记录可以消费,之后,它将被丢弃来腾出空间。kafka的性能是有效常数对数据大小所以存储数据很长一段时间不是一个问题。

  In fact, the only metadata retained on a per-consumer basis is the offset or position of that consumer in the log. This offset is controlled by the consumer: normally a consumer will advance its offset linearly as it reads records, but, in fact, since the position is controlled by the consumer it can consume records in any order it likes. For example a consumer can reset to an older offset to reprocess data from the past or skip ahead to the most recent record and start consuming from "now".
  事实上,唯一的元数据保留在每个消费者的基础上 偏移量是通过消费者进行控制:通常当消费者读取一个记录后会线性的增加他的偏移量。但是,事实上,自从记录的位移由消费者控制后,消费者可以在任何顺序消费记录。例如,一个消费者可以重新设置偏移量为之前使用的偏移量来重新处理数据或者跳到最近的记录开始消费。
  This combination of features means that Kafka consumers are very cheap—they can come and go without much impact on the cluster or on other consumers. For example, you can use our command line tools to "tail" the contents of any topic without changing what is consumed by any existing consumers.
  kafka的组合特性意味着kafka消费者们是很方便的,他们能够加入或者离开不会影响集群或者其他的消费者。例如,你能够使用我们的命令行工具去追踪任何主题的内容不改变消费被任何存在的消费者。
  The partitions in the log serve several purposes. First, they allow the log to scale beyond a size that will fit on a single server. Each individual partition must fit on the servers that host it, but a topic may have many partitions so it can handle an arbitrary amount of data. Second they act as the unit of parallelism—more on that in a bit.
  日志划分分区有多个目的。第一:他们允许日志的大小可以超过他们部署在一台单机的限制。每个分区的服务器主机上必须适合它。


Distribution
分布
  The partitions of the log are distributed over the servers in the Kafka cluster with each server handling data and requests for a share of the partitions. Each partition is replicated across a configurable number of servers for fault tolerance.
  日志的分区被分布在kafka集群的服务器上,每个服务器处理数据和请求一个共享的分区。每个分区复制在一个可配置的容错服务器数量。

  Each partition has one server which acts as the "leader" and zero or more servers which act as "followers". The leader handles all read and write requests for the partition while the followers passively replicate the leader. If the leader fails, one of the followers will automatically become the new leader. Each server acts as a leader for some of its partitions and a follower for others so load is well balanced within the cluster.
  每个分区都有一个服务器充当“领导者”和零个或多个服务器充当“追随者”。leader处理所有对分区读写请求时followers就会被动复制这个leader的分区。如果这个leader发送故障,这些followers中的一个将自动的成为一个新的leader。Each server acts as a leader for some of its partitions and a follower for others so load is well balanced within the cluster.

Producers
生产者
  Producers publish data to the topics of their choice. The producer is responsible for choosing which record to assign to which partition within the topic. This can be done in a round-robin fashion simply to balance load or it can be done according to some semantic partition function (say based on some key in the record). More on the use of partitioning in a second!
  生产者推送数据到他们选择的主题。生产者负责选择哪个记录分配到指定主题的哪个分区中。通过循环的方式可以简单地来平衡负载记录到分区上或可以根据一些语义分区函数来确定记录到哪个分区上(根据记录的key进行划分)。马上你会看到关于更多的划分使用。

Consumers

消费者
  Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be in separate processes or on separate machines.
  消费者们标识他们自己通过消费组名称,每一条被推送到主题的记录只被交付给订阅该主题的每一个消费组。消费者可以在单独的实例流程或在不同的机器上。

  If all the consumer instances have the same consumer group, then the records will effectively be load balanced over the consumer instances.
  如果所有的消费者实例都在同一个消费组中,那么一条消息将会有效地负载平衡给这些消费者实例。
  If all the consumer instances have different consumer groups, then each record will be broadcast to all the consumer processes.
  如果所有的消费者实例在不同的消费组中,那么每一条消息将会被广播给所有的消费者处理。
  

  A two server Kafka cluster hosting four partitions (P0-P3) with two consumer groups. Consumer group A has two consumer instances and group B has four.
  两个服务器的kafka集群管理四个分区(P0-P3)作用于两个消费者组。消费组A有两个消费者实例,消费组B有四个消费者实例。
  More commonly, however, we have found that topics have a small number of consumer groups, one for each "logical subscriber". Each group is composed of many consumer instances for scalability and fault tolerance. This is nothing more than publish-subscribe semantics where the subscriber is a cluster of consumers instead of a single process.

  更常见的,我们发现主题有一个小数量的消费群体one for each "logical subscriber"。

  The way consumption is implemented in Kafka is by dividing up the partitions in the log over the consumer instances so that each instance is the exclusive consumer of a "fair share" of partitions at any point in time. This process of maintaining membership in the group is handled by the Kafka protocol dynamically. If new instances join the group they will take over some partitions from other members of the group; if an instance dies, its partitions will be distributed to the remaining instances.

  Kafka only provides a total order over records within a partition, not between different partitions in a topic. Per-partition ordering combined with the ability to partition data by key is sufficient for most applications. However, if you require a total order over records this can be achieved with a topic that has only one partition, though this will mean only one consumer process per consumer group.

Guarantees
保证
  At a high-level Kafka gives the following guarantees:
  kafka的高级api可以赋予以下保证:
  Messages sent by a producer to a particular topic partition will be appended in the order they are sent. That is, if a record M1 is sent by the same producer as a record M2, and M1 is sent first, then M1 will have a lower offset than M2 and appear earlier in the log.
  消息被生产者发送到一个特定的主题分区,消息将以发送的顺序追加到这个分区上面。比如,如果M1和M2消息都被同一个消费者发送,M1先发送,M1的偏移量将比M2的小且更早出现在日志上面。
  A consumer instance sees records in the order they are stored in the log.
  一个消费者实例按照记录存储在日志上的顺序读取。
  For a topic with replication factor N, we will tolerate up to N-1 server failures without losing any records committed to the log.
  一个主题的副本数是N,我们可以容忍N-1个服务器发生故障没而不会丢失任何提交到日志中的记录。

More details on these guarantees are given in the design section of the documentation.
关于担保的更多的细节将在文档的设计章节被给出来。

 

作者: intsmaze(刘洋)
老铁,你的--->推荐,--->关注,--->评论--->是我继续写作的动力。
微信公众号号:Apache技术研究院
由于博主能力有限,文中可能存在描述不正确,欢迎指正、补充!
本文版权归作者和博客园共有,欢迎转载,但未经作者同意必须保留此段声明,且在文章页面明显位置给出原文连接,否则保留追究法律责任的权利。
相关文章
|
6月前
|
消息中间件 Java Kafka
kafka入门demo
kafka入门demo
70 0
|
消息中间件 监控 关系型数据库
【Kafka系列】(一)Kafka入门(下)
【Kafka系列】(一)Kafka入门(下)
|
6月前
|
消息中间件 分布式计算 Kafka
SparkStreaming(SparkStreaming概述、入门、Kafka数据源、DStream转换、输出、关闭)
SparkStreaming(SparkStreaming概述、入门、Kafka数据源、DStream转换、输出、关闭)(一)
|
6月前
|
消息中间件 Java Kafka
Kafka【环境搭建 01】kafka_2.12-2.6.0 单机版安装+参数配置及说明+添加到service服务+开机启动配置+验证+chkconfig配置说明(一篇入门kafka)
【2月更文挑战第19天】Kafka【环境搭建 01】kafka_2.12-2.6.0 单机版安装+参数配置及说明+添加到service服务+开机启动配置+验证+chkconfig配置说明(一篇入门kafka)
212 1
|
6月前
|
消息中间件 存储 Kafka
Kafka【基础入门】
Kafka【基础入门】
65 1
|
6月前
|
消息中间件 存储 分布式计算
Apache Kafka-初体验Kafka(01)-入门整体认识kafka
Apache Kafka-初体验Kafka(01)-入门整体认识kafka
76 0
|
6月前
|
消息中间件 算法 Kafka
Kafka入门,这一篇就够了(安装,topic,生产者,消费者)
Kafka入门,这一篇就够了(安装,topic,生产者,消费者)
265 0
|
消息中间件 存储 Kafka
(四)kafka从入门到精通之安装教程
Kafka是一个高性能、低延迟、分布式的分布式数据库,可以在分布式环境中实现数据的实时同步和分发。Zookeeper是一种开源的分布式数据存储系统,它可以在分布式环境中存储和管理数据库中的数据。它的主要作用是实现数据的实时同步和分发,可以用于实现分布式数据库、分布式文件系统、分布式日志系统等。Zookeeper的设计目标是高可用性、高性能、低延迟,它支持多种客户端协议,包括TCP和HTTP,可以方便地与其他分布式系统进行集成。
129 0
|
消息中间件 传感器 Kafka
(三)kafka从入门到精通之使用场景
Kafka 是一种流处理平台,主要用于处理大量数据流,如实时事件、日志文件和传感器数据等。Kafka的目的是实现高吞吐量、低延迟和高可用性的数据处理。Kafka提供了一个高度可扩展的架构,可以轻松地添加和删除节点,并且能够处理数百亿条消息/分区。Kafka的消息可以容错,即使某个节点失败,消息也会在集群中的其他节点上得到处理。总的来说,Kafka 是一个非常强大的数据处理平台,可以用于实时数据处理、日志文件处理、传感器数据处理和流处理等场景。
149 0
|
消息中间件 存储 Java
【Kafka系列】(一)Kafka入门(上)
【Kafka系列】(一)Kafka入门