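cassandra-stress is the load-testing tool that ships with Cassandra. It can simulate a variety of workloads against a specific table schema and measure the read and write performance of a cluster. The tool is very powerful, but there is not much material about it online (in Chinese at least), and the YAML profile in particular is rarely covered. This post gives a brief introduction to the tool, with a focus on the format of the YAML profile.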

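Basic usage

The cassandra-stress tool lives in the tools/bin directory of the Cassandra distribution. Its command line takes the form cassandra-stress <command> [options]. For details on each command and option, run cassandra-stress help <command|option> or consult the Cassandra documentation. The most commonly used commands and options are described below.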

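Common commands

read:
Concurrent read test. Data must first be written with a write test before running this command.
write:
Concurrent write test.
mixed:
Mixed read/write test. The read/write ratio and the data distribution are configurable.
counter_write:
Counter write test.
counter_read:
Counter read test. As with read, a counter write test must be run first.
user:
Runs the test with user-supplied queries. Covered in detail in the YAML profile section below.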

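Common options

cl=
Consistency level to use during the test. Valid values include ONE, QUORUM, LOCAL_QUORUM, EACH_QUORUM, ALL, ANY, and LOCAL_ONE. The default is LOCAL_ONE.

n=
Number of operations to run.

profile=
Path to the YAML profile, covered in detail in the YAML profile section below. This parameter only takes effect with the user command.

-node
Specifies the nodes of the target cluster. Usage: -node [<host>] [file=<files>] [whitelist <whitelist>]
Here host lists the nodes to connect to. For the file and whitelist options, see the help output.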

-rate
Rate-related settings for reads and writes: thread count and throughput limits. Usage:

-rate threads=<threads> [throttle=<throttle>] [fixed=<fixed rate>]
or
-rate [threads>=<min threads>] [threads<=<max threads>] [auto]

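Here, threads is the number of client threads issuing operations concurrently, throttle caps the overall read/write rate (ops/s), and fixed runs the operations at a fixed rate.
In auto mode, the thread count keeps increasing until throughput saturates (the criterion is no improvement in throughput over 3 runs).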
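
The stopping rule for auto mode can be sketched in a few lines of Python. This is only an illustration of the rule as described above, not the tool's actual code: find_saturation, fake_cluster, the doubling step, and the starting thread count are all assumptions made for the example.

```python
def find_saturation(measure, min_threads=4, max_threads=1000, patience=3):
    """Raise the thread count until `patience` consecutive runs fail to beat the best throughput."""
    best_threads, best_rate = None, 0.0
    stale = 0
    threads = min_threads
    while threads <= max_threads and stale < patience:
        rate = measure(threads)  # ops/s observed at this thread count
        if rate > best_rate:
            best_threads, best_rate = threads, rate
            stale = 0
        else:
            stale += 1
        threads *= 2  # assumption: the real step size may differ
    return best_threads, best_rate


def fake_cluster(threads):
    # Hypothetical stand-in for a real run: throughput grows linearly, then caps.
    return min(threads * 1000, 50_000)


print(find_saturation(fake_cluster))  # (64, 50000)
```

With this fake cluster, throughput stops improving at 64 threads, and the search gives up after three more runs that bring no gain.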

-schema
Specifies the replication strategy, compression algorithm, compaction strategy, and so on. Usage:

-schema [replication(<replication>)] [keyspace=<keyspace>] [compaction(<compaction>)] [compression=< compression >]

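-col
Column configuration, for example the distribution of value sizes and the distribution of the number of columns read or written per operation. Usage:

-col names=<names> [slice <slice>] [super=<super>] [comparator=<comparator>] [timestamp=<timestamp>] [size=<size dist>]
or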
-col [n=<count dist>] [slice <slice>] [super=<super>] [comparator=<comparator>] [timestamp=<timestamp>] [size=<size dist>]

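The description above is fairly abstract, so here is an example. The following command writes 1 million rows at consistency level LOCAL_QUORUM, with 32 columns per row, a fixed 2048 bytes per column, 500 client threads, and a replication factor of 3.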

cassandra-stress write n=1000000 cl=LOCAL_QUORUM -rate threads=500 \
    -col "size=fixed(2048)" "n=fixed(32)" -schema "replication(factor=3)" -node localhost
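
Before running a command like this, it can help to estimate how much data it will generate. The arithmetic below counts raw column payload only, assuming every column is exactly 2048 bytes. It ignores keys, per-cell overhead, and compression, so actual disk usage will differ.

```python
rows = 1_000_000
columns_per_row = 32
bytes_per_column = 2048
replication_factor = 3

# Raw payload written by the client, and the total once every replica holds a copy.
payload_bytes = rows * columns_per_row * bytes_per_column
replicated_bytes = payload_bytes * replication_factor

print(f"payload per row: {columns_per_row * bytes_per_column} bytes")
print(f"payload, one copy: {payload_bytes / 1e9:.1f} GB")
print(f"payload, all replicas: {replicated_bytes / 1e9:.1f} GB")
```

That comes to 64 KiB per row, about 65.5 GB for a single copy, and about 196.6 GB across three replicas, which is worth checking against the disk space available on the test cluster.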

YAML profile

We use cqlstress-example.yaml from the tools directory to illustrate the format of the YAML profile.

#
# This is an example YAML profile for cassandra-stress
#
# insert data
# cassandra-stress user profile=/home/jake/stress1.yaml ops(insert=1)
#
# read, using query simple1:
# cassandra-stress profile=/home/jake/stress1.yaml ops(simple1=1)
#
# mixed workload (90/10)
# cassandra-stress user profile=/home/jake/stress1.yaml ops(insert=1,simple1=9)


#
# Keyspace info
#
keyspace: stresscql

#
# The CQL for creating a keyspace (optional if it already exists)
#
keyspace_definition: |
  CREATE KEYSPACE stresscql WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

#
# Table info
#
table: typestest

#
# The CQL for creating a table you wish to stress (optional if it already exists)
#
table_definition: |
  CREATE TABLE typestest (
        name text,
        choice boolean,
        date timestamp,
        address inet,
        dbl double,
        lval bigint,
        ival int,
        uid timeuuid,
        value blob,
        PRIMARY KEY((name,choice), date, address, dbl, lval, ival, uid)
  ) 
    WITH compaction = { 'class':'LeveledCompactionStrategy' }
#    AND compression = { 'sstable_compression' : '' }
#    AND comment='A table of many types to test wide rows'

#
# Optional meta information on the generated columns in the above table
# The min and max only apply to text and blob types
# The distribution field represents the total unique population
# distribution of that column across rows.  Supported types are
# 
#      EXP(min..max)                        An exponential distribution over the range [min..max]
#      EXTREME(min..max,shape)              An extreme value (Weibull) distribution over the range [min..max]
#      GAUSSIAN(min..max,stdvrng)           A gaussian/normal distribution, where mean=(min+max)/2, and stdev is (mean-min)/stdvrng
#      GAUSSIAN(min..max,mean,stdev)        A gaussian/normal distribution, with explicitly defined mean and stdev
#      UNIFORM(min..max)                    A uniform distribution over the range [min, max]
#      FIXED(val)                           A fixed distribution, always returning the same value
#      SEQ(min..max)                        A fixed sequence, returning values in the range min to max sequentially (starting based on seed), wrapping if necessary.
#      Aliases: extr, gauss, normal, norm, weibull
#
#      If preceded by ~, the distribution is inverted
#
# Defaults for all columns are size: uniform(4..8), population: uniform(1..100B), cluster: fixed(1)
#
columnspec:
  - name: name
    size: uniform(1..10)
    population: uniform(1..10)     # the range of unique values to select for the field (default is 100Billion)
  - name: date
    cluster: uniform(20..40)
  - name: lval
    population: gaussian(1..1000)
    cluster: uniform(1..4)

insert:
  partitions: uniform(1..50)       # number of unique partitions to update in a single operation
                                  # if batchcount > 1, multiple batches will be used but all partitions will
                                  # occur in all batches (unless they finish early); only the row counts will vary
  batchtype: LOGGED               # type of batch to use
  select: uniform(1..10)/10       # uniform chance any single generated CQL row will be visited in a partition;
                                  # generated for each partition independently, each time we visit it

#
# A list of queries you wish to run against the schema
#
queries:
   simple1:
      cql: select * from typestest where name = ? and choice = ? LIMIT 100
      fields: samerow             # samerow or multirow (select arguments from the same row, or randomly from all rows in the partition)
   range1:
      cql: select * from typestest where name = ? and choice = ? and date >= ? LIMIT 100
      fields: multirow            # samerow or multirow (select arguments from the same row, or randomly from all rows in the partition)


#
# A list of bulk read queries that analytics tools may perform against the schema
# Each query will sweep an entire token range, page by page.
#
token_range_queries:
  all_columns_tr_query:
    columns: '*'
    page_size: 5000

  value_tr_query:
    columns: value

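The configuration items in the file mean the following:
keyspace/keyspace_definition/table/table_definition
These are mostly self-explanatory: the keyspace and table under test, plus the CQL used to create them (optional if they already exist).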

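columnspec
Configures the columns. It plays a role similar to the -col option on the command line.
name is the column name.
size is the distribution of the column's length. Here, for example, the length of the name field is uniformly distributed: 10% of the rows have length 1, 10% have length 2, and so on.
population is the distribution of the column's values. For example, the values of the lval field follow a Gaussian distribution. Specifying population for the name field is more puzzling. What the code actually does is draw a random seed from the uniform distribution and then use that seed to generate the content of the string.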
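
The following Python sketch illustrates these ideas. It is not the tool's generator: text_from_seed and gaussian_params are helpers made up for this example, the string length is fixed at 5 for simplicity, and the stdvrng value of 3 is an arbitrary choice (the profile's gaussian(1..1000) does not give one). The mean and stdev formula is the one stated in the profile's comments.

```python
import random
from collections import Counter


def gaussian_params(lo, hi, stdvrng):
    """mean and stdev for GAUSSIAN(lo..hi,stdvrng), per the comment in the profile."""
    mean = (lo + hi) / 2
    stdev = (mean - lo) / stdvrng
    return mean, stdev


def text_from_seed(seed, length):
    """Hypothetical generator: the same seed always produces the same string."""
    rng = random.Random(seed)
    return "".join(rng.choice("abcdefghijklmnopqrstuvwxyz") for _ in range(length))


rng = random.Random(42)

# size: uniform(1..10) -> each length from 1 to 10 is about equally likely
sizes = Counter(rng.randint(1, 10) for _ in range(100_000))
for length in sorted(sizes):
    print(length, f"{sizes[length] / 100_000:.1%}")

# population: uniform(1..10) -> only 10 possible seeds, so at most 10 distinct values
values = {text_from_seed(rng.randint(1, 10), 5) for _ in range(1_000)}
print(len(values), "distinct values")

print(gaussian_params(1, 1000, 3))  # (500.5, 166.5)
```

Seen this way, population on a text column controls how many distinct values the column can take, not what the strings look like.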

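insert
Options for write operations.
partitions is the distribution of the number of distinct partitions updated in a single operation.
batchtype is the type of batch to use.
select is the distribution of the fraction of rows within each partition that an operation visits.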
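
To make partitions and select concrete, here is a rough Python simulation of a single insert operation under the settings in the example profile. It is a simplification and not the tool's logic: plan_insert is a made-up helper, every partition is assumed to hold exactly 30 rows, and the selected fraction is applied exactly, whereas the profile describes it as a per-row chance.

```python
import random


def plan_insert(rng, rows_in_partition):
    """Return the number of rows written to each partition by one insert operation."""
    partitions = rng.randint(1, 50)        # partitions: uniform(1..50)
    rows_written = []
    for _ in range(partitions):
        ratio = rng.randint(1, 10) / 10    # select: uniform(1..10)/10, drawn per partition
        rows_written.append(round(ratio * rows_in_partition))
    return rows_written


plan = plan_insert(random.Random(1), rows_in_partition=30)
print(len(plan), "partitions, rows per partition:", plan)
```

One operation can therefore touch anywhere from 1 to 50 partitions and write between 10% and 100% of the rows in each.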

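queries
The list of queries to run. Each entry has a name (simple1 and range1 here), the cql statement to execute, and a fields setting that controls whether the bind arguments are taken from the same row (samerow) or picked at random from all rows in the partition (multirow).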

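token_range_queries
Specifies bulk queries that scan large ranges.
Taking all_columns_tr_query as an example, it is turned into a series of queries of the form select * from typestest where token(name,choice) > ? and token(name,choice) < ?. columns selects the columns to read and page_size sets the page size used while sweeping each range.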
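
The idea of sweeping the ring range by range can be sketched as follows. This is an illustration under stated assumptions, not what the tool does internally: split_ring and range_query are made-up helpers, the token bounds assume the Murmur3Partitioner, and the ring is simply cut into equal pieces. The sketch also makes the upper bound inclusive so that adjacent ranges leave no gap, which is a choice of this example.

```python
MIN_TOKEN = -(2**63)      # assumption: Murmur3Partitioner token bounds
MAX_TOKEN = 2**63 - 1


def split_ring(parts):
    """Cut the full token ring into `parts` contiguous (start, end) ranges."""
    step = (MAX_TOKEN - MIN_TOKEN) // parts
    bounds = [MIN_TOKEN + i * step for i in range(parts)] + [MAX_TOKEN]
    return list(zip(bounds[:-1], bounds[1:]))


def range_query(columns, table, partition_key, start, end):
    """Build one range scan; the upper bound is inclusive so no boundary token is skipped."""
    token = f"token({','.join(partition_key)})"
    return f"select {columns} from {table} where {token} > {start} and {token} <= {end}"


for start, end in split_ring(4):
    print(range_query("*", "typestest", ["name", "choice"], start, end))
```

Each printed statement covers one slice of the ring, and running them all in turn reads the whole table.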

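Run the test with the following command (the ops(...) argument is quoted so that the shell does not interpret the parentheses):
cassandra-stress user profile=tools/cqlstress-example.yaml n=1000000 "ops(insert=3,simple1=1)"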
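
The weights in ops(...) are relative, so insert=3,simple1=1 means roughly three inserts for every simple1 query. Here is a small Python calculation of the expected split, assuming operations are picked in proportion to their weights (expected_ops is a helper made up for this example):

```python
def expected_ops(total, weights):
    """Split `total` operations in proportion to the given weights."""
    weight_sum = sum(weights.values())
    return {name: total * weight / weight_sum for name, weight in weights.items()}


print(expected_ops(1_000_000, {"insert": 3, "simple1": 1}))
# {'insert': 750000.0, 'simple1': 250000.0}
```

The same calculation with insert=1,simple1=9 gives the 90/10 mixed workload mentioned in the comments at the top of the profile.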