A while back I tested seastar's user-space TCP/IP stack. The performance is very strong, so this post sums up the results and the setup.
seastar
seastar is a high-performance I/O framework with a show-off style C++14 implementation.
http://www.seastar-project.org/
Performance on a single 10GbE NIC
1 core   ----> 530k requests/s
2 cores  ----> 1.06M
3 cores  ----> 1.59M
4 cores  ----> 2.00M
5 cores  ----> 2.62M
6 cores  ----> 3.16M
7 cores  ----> 3.65M
8 cores  ----> 4.02M
9 cores  ----> 4.10M
9 cores  ----> 4.09M
16 cores ----> 8.89M
The workload echoes 1 byte per request over 1000 connections; throughput scales almost perfectly linearly with core count.
Setting up the seastar environment
- Copy scylladb.tar to /opt
- Add the library directory to the dynamic linker path
vim /etc/ld.so.conf.d/scylla.x86_64.conf   # add the line: /opt/scylladb/lib64
ldconfig
ldconfig -p
- Put gcc 5.3 on the PATH
export PATH=/opt/scylladb/bin/:$PATH
- Install the dependencies
sudo yum install cryptopp.x86_64 cryptopp-devel.x86_64 -y -b test
sudo yum install -y libaio-devel hwloc-devel numactl-devel libpciaccess-devel cryptopp-devel libxml2-devel xfsprogs-devel gnutls-devel lksctp-tools-devel lz4-devel gcc make protobuf-devel protobuf-compiler libunwind-devel systemtap-sdt-devel
- Configure the build with DPDK support via configure.py
./configure.py --enable-dpdk --disable-xen --with apps/echo/tcp_echo
- Build seastar
ninja-build -j90
- Build and load DPDK's kernel modules
cd seastar/dpdk
./tools/dpdk-setup.sh
modprobe uio
insmod igb_uio.ko
Taking over the NIC
- Pick the NIC that DPDK will take over and remove it from the bonding slave list
echo -eth4 > /sys/class/net/bond0/bonding/slaves
- Use the script shipped with seastar to inspect NIC binding status (DPDK ships an equivalent script)
./scripts/dpdk_nic_bind.py --status

Network devices using DPDK-compatible driver
============================================
<none>

Network devices using kernel driver
===================================
0000:03:00.0 'I350 Gigabit Network Connection' if=eno1 drv=igb unused=igb_uio
0000:03:00.1 'I350 Gigabit Network Connection' if=eno2 drv=igb unused=igb_uio
0000:03:00.2 'I350 Gigabit Network Connection' if=eno3 drv=igb unused=igb_uio
0000:03:00.3 'I350 Gigabit Network Connection' if=eno4 drv=igb unused=igb_uio
0000:04:00.0 '82599EB 10-Gigabit SFI/SFP+ Network Connection' if=eth4 drv=ixgbe unused=igb_uio
0000:04:00.1 '82599EB 10-Gigabit SFI/SFP+ Network Connection' if=eth5 drv=ixgbe unused=igb_uio

eth4 is the 82599 10GbE port, and it is the one I am about to hand over to DPDK.
- Bind the NIC to DPDK
./scripts/dpdk_nic_bind.py --bind=igb_uio eth4
- Verify that the binding succeeded
./scripts/dpdk_nic_bind.py --status

Network devices using DPDK-compatible driver
============================================
0000:04:00.0 '82599EB 10-Gigabit SFI/SFP+ Network Connection' drv=igb_uio unused=

Network devices using kernel driver
===================================
0000:03:00.0 'I350 Gigabit Network Connection' if=eno1 drv=igb unused=igb_uio
0000:03:00.1 'I350 Gigabit Network Connection' if=eno2 drv=igb unused=igb_uio
0000:03:00.2 'I350 Gigabit Network Connection' if=eno3 drv=igb unused=igb_uio
0000:03:00.3 'I350 Gigabit Network Connection' if=eno4 drv=igb unused=igb_uio
0000:04:00.1 '82599EB 10-Gigabit SFI/SFP+ Network Connection' if=eth5 drv=ixgbe unused=igb_uio

Other network devices
=====================
Running seastar's user-space network stack
- Configure hugepages
echo 1024 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
mount -t hugetlbfs nodev /dev/hugepages/
- Run the server side of the test program (a rough sketch of such an echo server, written against seastar's public API, appears at the end of this section)
./build/release/apps/echo/tcp_echo_server --network-stack native --dpdk-pmd --collectd 0 --port 10000 --dhcp 0 --host-ipv4-addr 10.107.139.20 --netmask-ipv4-addr 255.255.252.0 --gw-ipv4-addr 100.81.243.247 --no-handle-interrupt --poll-mode --poll-aio 0 --hugepages /dev/hugepages --memory 30G
--hugepages /dev/hugepages selects where packet memory is allocated from; backing packet buffers with hugepages saves one memory copy.
- Run the client side of the test program
./build/release/apps/echo/tcp_echo_client --network-stack native --dpdk-pmd --dhcp 1 --poll-mode --poll-aio 0 --hugepages /dev/hugepages --memory 30G --smp 16 -s "10.107.139.20:10000" --conn 10
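For reference, here is a minimal sketch of such an echo server written against seastar's public socket API (it closely follows the upstream tutorial's echo example; the actual apps/echo/tcp_echo sources in the tree are the authoritative version, and header paths, namespaces and the accept() return type differ between seastar releases, so treat the exact signatures as assumptions):

```cpp
#include <seastar/core/app-template.hh>
#include <seastar/core/seastar.hh>
#include <seastar/core/future-util.hh>
#include <seastar/net/api.hh>

// Echo every received buffer back until the peer closes the connection.
seastar::future<> handle_connection(seastar::connected_socket s) {
    auto in = s.input();
    auto out = s.output();
    return seastar::do_with(std::move(s), std::move(in), std::move(out),
            [](auto& s, auto& in, auto& out) {
        return seastar::repeat([&in, &out] {
            return in.read().then([&out](seastar::temporary_buffer<char> buf) {
                if (!buf) {   // empty buffer: peer closed the connection
                    return seastar::make_ready_future<seastar::stop_iteration>(
                            seastar::stop_iteration::yes);
                }
                return out.write(std::move(buf))
                        .then([&out] { return out.flush(); })
                        .then([] { return seastar::stop_iteration::no; });
            });
        }).then([&out] { return out.close(); });
    });
}

int main(int argc, char** argv) {
    seastar::app_template app;
    return app.run(argc, argv, [] {
        seastar::listen_options lo;
        lo.reuse_address = true;
        return seastar::do_with(
                seastar::listen(seastar::make_ipv4_address({10000}), lo),
                [](seastar::server_socket& listener) {
            return seastar::keep_doing([&listener] {
                return listener.accept().then([](seastar::accept_result ar) {
                    // Serve each connection in the background on this shard.
                    (void)handle_connection(std::move(ar.connection));
                });
            });
        });
    });
}
```

When seastar is built with DPDK support, the same runtime flags shown above (--network-stack native --dpdk-pmd ...) should be enough to switch such an application onto the user-space stack; no application-code change is needed.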
Intercepting the NIC's packets
- seastar runs one user-space protocol stack per core, and the input to that stack is raw layer-2 frames. How do the layer-2 frames arriving on the NIC reach user space?
seastar supports two ways of obtaining packets:
1) through a tap device plus vhost-net and vrings: the development mode, which does not take over the NIC and is easy to deploy;
2) DPDK takes over the NIC and the stack polls for packets (sketched below).
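To make mode 2 concrete, here is a minimal, seastar-independent sketch of a DPDK poll-mode receive loop (port and queue setup are omitted; the burst size and port/queue ids are arbitrary choices for illustration):

```cpp
#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

// Busy-poll one RX queue forever; rte_eth_rx_burst() never blocks and returns
// raw layer-2 frames straight from the NIC's descriptor ring.
static void poll_port(uint16_t port_id, uint16_t queue_id) {
    struct rte_mbuf* bufs[32];
    for (;;) {
        uint16_t n = rte_eth_rx_burst(port_id, queue_id, bufs, 32);
        for (uint16_t i = 0; i < n; i++) {
            // a real stack would hand bufs[i] (an ethernet frame) to L2 here
            rte_pktmbuf_free(bufs[i]);
        }
    }
}

int main(int argc, char** argv) {
    if (rte_eal_init(argc, argv) < 0) {   // hugepages, PCI scan, lcore setup
        return 1;
    }
    // Assumes port 0 has already been configured and started (not shown).
    poll_port(0, 0);
    return 0;
}
```

This is exactly the shape of dpdk_qp::poll_rx_once() shown later, minus seastar's reactor integration.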
Initialization of the seastar protocol stack
native-stack.cc registers the protocol-stack factory function:
network_stack_registrator nns_registrator{ "native", nns_options(), native_network_stack::create };
Based on the --network-stack command-line option, native_network_stack::create is invoked.
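The registration works through a static registrator object whose constructor runs before main(). An illustration of the pattern (not seastar's actual network_stack_registry types):

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>

struct network_stack {};                     // stand-in for seastar's interface

using stack_factory = std::function<std::unique_ptr<network_stack>()>;

// name -> factory map, filled before main() by registrator constructors
static std::map<std::string, stack_factory>& registry() {
    static std::map<std::string, stack_factory> r;
    return r;
}

struct network_stack_registrator {
    network_stack_registrator(std::string name, stack_factory f) {
        registry()[std::move(name)] = std::move(f);
    }
};

// roughly what native-stack.cc does for "native"
static network_stack_registrator reg{"native",
        [] { return std::make_unique<network_stack>(); }};

// later, --network-stack <name> picks the factory:
// auto stack = registry().at(requested_name)();
```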
Creating the dpdk_net_device
dpdk_device* dev = create_dpdk_net_device(...)
Initializing the dpdk_device
The dpdk_device initialization calls rte_eth_dev_configure().
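A hedged sketch of what such a configure call looks like when you want one RX/TX queue pair per core with hardware RSS enabled; seastar's real rte_eth_conf sets more offload and descriptor parameters, and the ETH_* macro names below follow the older DPDK tree bundled with seastar:

```cpp
#include <cstring>
#include <rte_ethdev.h>

static int configure_port(uint16_t port_id, uint16_t nb_cores) {
    struct rte_eth_conf conf;
    memset(&conf, 0, sizeof(conf));
    conf.rxmode.mq_mode = ETH_MQ_RX_RSS;                       // enable hardware RSS
    conf.rx_adv_conf.rss_conf.rss_hf = ETH_RSS_IP | ETH_RSS_TCP;
    // one RX queue and one TX queue per core (one per dpdk_qp)
    return rte_eth_dev_configure(port_id, nb_cores, nb_cores, &conf);
}
```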
init_local_queue on every CPU
auto qp = sdev->init_local_queue(opts, qid); creates a dpdk::dpdk_qp<false>. Its constructor:
0) constructs the base class qp, which registers poll_tx
1) registers rx_gc
2) registers _tx_buf_factory.gc()
3) init_rx_mbuf_pool
4) rte_eth_rx_queue_setup
5) rte_eth_tx_queue_setup
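Steps 3)-5) map onto plain DPDK calls. A simplified, self-contained sketch (the pool and descriptor sizes are illustrative, not the values seastar computes):

```cpp
#include <cstdio>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

static int setup_queue_pair(uint16_t port_id, uint16_t qid, unsigned socket_id) {
    // 3) init_rx_mbuf_pool: a per-queue packet-buffer pool
    char name[32];
    snprintf(name, sizeof(name), "rx_pool_q%u", (unsigned)qid);
    struct rte_mempool* pool = rte_pktmbuf_pool_create(
        name, 8192 /* mbufs */, 256 /* per-core cache */, 0,
        RTE_MBUF_DEFAULT_BUF_SIZE, socket_id);
    if (pool == NULL) {
        return -1;
    }
    // 4) RX descriptor ring fed from that pool
    if (rte_eth_rx_queue_setup(port_id, qid, 512, socket_id, NULL, pool) < 0) {
        return -1;
    }
    // 5) TX descriptor ring
    return rte_eth_tx_queue_setup(port_id, qid, 512, socket_id, NULL);
}
```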
dpdk_device::init_port_fini on every CPU
rte_eth_dev_start
rte_eth_dev_rss_reta_update
register a timer that calls rte_eth_link_get_nowait to watch link status
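A hedged sketch of what the RETA update achieves: the NIC's redirection table (128 entries on the 82599) is filled round-robin so that RSS spreads flows evenly across the per-core RX queues. Struct and macro names follow the older DPDK bundled with seastar:

```cpp
#include <cstdint>
#include <cstring>
#include <rte_ethdev.h>

static int spread_flows(uint16_t port_id, uint16_t nb_rx_queues) {
    const uint16_t reta_size = 128;                         // 82599 RETA size
    struct rte_eth_rss_reta_entry64 reta[reta_size / RTE_RETA_GROUP_SIZE];
    memset(reta, 0, sizeof(reta));
    for (uint16_t i = 0; i < reta_size; i++) {
        reta[i / RTE_RETA_GROUP_SIZE].mask = UINT64_MAX;    // update every slot
        reta[i / RTE_RETA_GROUP_SIZE].reta[i % RTE_RETA_GROUP_SIZE] =
            i % nb_rx_queues;                               // round-robin over queues
    }
    return rte_eth_dev_rss_reta_update(port_id, reta, reta_size);
}
```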
set_local_queue on every CPU, storing the qp pointer
_queues[engine().cpu_id()] = qp.get();
create_native_stack on every CPU
interface::_rx = device::receive() {
    // register the layer-2 protocol handler on this queue's rx stream
    auto sub = _queues[engine().cpu_id()]->_rx_stream.listen(std::move(next_packet));
    // rx_start() registers dpdk_qp::poll_rx_once() with the reactor's poll_once()
    _queues[engine().cpu_id()]->rx_start();
    return std::move(sub);
}
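The _rx_stream wiring is a plain producer/subscriber pattern: the interface installs a per-packet callback with listen(), and the qp pushes every received frame into it with produce(). An illustration of the idea (not seastar's stream<>/subscription<> types):

```cpp
#include <functional>
#include <utility>
#include <vector>

struct packet { std::vector<char> frame; };                 // stand-in type

class rx_stream {
    std::function<void(packet)> _on_packet;                 // installed by listen()
public:
    void listen(std::function<void(packet)> cb) { _on_packet = std::move(cb); }
    void produce(packet p) { if (_on_packet) _on_packet(std::move(p)); }
};

int main() {
    rx_stream s;
    // interface side: subscribe the L2 handler (interface::dispatch_packet)
    s.listen([](packet p) { /* demultiplex by ethertype into arp / ipv4 ... */ });
    // dpdk_qp side: every frame pulled off the RX ring ends up here
    s.produce(packet{});
    return 0;
}
```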
The receive path: reactor::run() ----> poll_once()
When the interface is initialized, the corresponding qp's poll_rx_once is registered with the current engine's poll_once.
dpdk_qp::poll_rx_once() {
    uint16_t rx_count = rte_eth_rx_burst(_dev->port_idx(), _qid, buf, packet_read_size);
    process_packets(buf, rx_count);
}
void process_packets(struct rte_mbuf** bufs, uint16_t count) {
    for (uint16_t i = 0; i < count; i++) {
        struct rte_mbuf* m = bufs[i];
        // handle VLAN stripping
        // handle rx checksum offload
        // m is converted into a seastar packet p (conversion omitted)
        _dev->l2receive(std::move(*p));
    }
}

void l2receive(packet p) {
    _queues[engine().cpu_id()]->_rx_stream.produce(std::move(p));
}

The callback behind _rx_stream was installed when the interface was constructed: it is interface::dispatch_packet(), which demultiplexes each frame into the layer-3 protocols.
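Putting the receive path together, a condensed, seastar-independent sketch: the reactor keeps a list of pollers and calls each one on every loop iteration, and the dpdk_qp poller is the one that drains the RX ring and feeds the stream above:

```cpp
#include <functional>
#include <utility>
#include <vector>

class reactor {
    std::vector<std::function<bool()>> _pollers;   // dpdk_qp::poll_rx_once lives here
public:
    void register_poller(std::function<bool()> p) { _pollers.push_back(std::move(p)); }
    void run() {
        for (;;) {                                 // reactor::run()
            for (auto& poll : _pollers) {          // poll_once()
                poll();                            // rte_eth_rx_burst + process_packets
            }
        }
    }
};
```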