eBPF in 2021 - Alibaba Cloud Developer Community

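From the start, eBPF took a pragmatic path. Rather than reinventing the wheel, it began by optimizing cBPF (net: filter: rework/optimize internal BPF interpreter's instruction set) and kept evolving from there. After several years of rapid iteration in the community, eBPF was already quite mature by Linux 5.10, and an ecosystem had taken shape, especially in tracing, networking, and security. Well-known projects include Cilium, BCC, and Sysdig, and eBPF Summit 2020, a conference aimed at eBPF developers, took place as planned.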

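Feature Evolution

Since Linux 4.19, the community has brought more than 160 new features and optimizations to eBPF, concentrated in three areas: BPF itself (VM, bytecode, program types, and data structures), networking, and security.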

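Bytecode and Virtual Machine

First come the enhancements to BPF itself, including the BPF virtual machine, bytecode, program types, and basic data structures. Looking at the kernels up to the Linux 5.10 LTS release, the most notable features are: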

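5.1: bpf_spin_lock support;
5.2: the verifier's maximum BPF instruction count is raised to 1 million (bpftool gained a matching feature probe in 5.6);
5.3: bounded loop support;
5.10: initial support for sleepable BPF programs;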

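These features send a clear signal: the community wants to expand what eBPF can do. It should not be limited to simple kprobe tracing programs; the hope is that eBPF can gradually take over some of the kernel's own functionality, always on the condition that safety is preserved.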

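Raising the verifier limit to 1 million instructions makes more complex scenarios possible; the previous limit was 4096. One million instructions means that complex user-space logic can move into the kernel. In the future, parts of the kernel's functionality may be exposed safely through eBPF, so that user space can customize them freely and hot-load them into the kernel without a noticeable performance drop compared with native code. Beyond that, performance-sensitive applications such as nginx might be compiled to eBPF ISA bytecode and run in the kernel for even better performance.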

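Next, the eBPF verifier finally, and cautiously, supports bounded loops, so tricks such as Duff's-device-style unrolling with #pragma unroll are no longer required. bpf_spin_lock in turn makes more complex algorithms feasible, since most common algorithms depend on concurrency safety. Safety nevertheless remains eBPF's first principle: the verifier must never let a user-loaded BPF program panic the kernel. These features are small steps, but they do lower the cost of writing eBPF programs. Before bounded loops, for example, a NAT eBPF program could not iterate over a forwarding map inside the kernel; the map had to be manipulated from user space through syscalls, and the kernel was then told which exact key to use to look up the value.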
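To make the instruction budget concrete, here is a rough back-of-the-envelope model in Python. It is only a sketch: the two limits come from the text above, while the loop overhead and the example numbers are assumptions, and real verifier accounting (state pruning, branching) is considerably more involved.

```python
# Toy model only: it ignores state pruning and branch exploration.
OLD_PROGRAM_LIMIT = 4096          # instruction cap cited in the text
NEW_VERIFIER_LIMIT = 1_000_000    # relaxed cap cited in the text
LOOP_OVERHEAD = 2                 # assumed: one counter update plus one compare-and-jump


def unrolled_size(body_insns: int, iterations: int) -> int:
    """Program size when the loop body is fully unrolled (e.g. #pragma unroll)."""
    return body_insns * iterations


def bounded_loop_cost(body_insns: int, iterations: int) -> int:
    """Instructions walked if the verifier simulates every loop iteration."""
    return (body_insns + LOOP_OVERHEAD) * iterations


def fits(cost: int, limit: int) -> bool:
    """True if the estimated cost stays within the given instruction limit."""
    return cost <= limit


if __name__ == "__main__":
    # Hypothetical workload: scan 1024 map slots, 20 instructions per slot.
    body, iterations = 20, 1024
    print("unrolled fits old limit:", fits(unrolled_size(body, iterations), OLD_PROGRAM_LIMIT))
    print("bounded loop fits new limit:", fits(bounded_loop_cost(body, iterations), NEW_VERIFIER_LIMIT))
```

Under these assumptions the unrolled variant needs 20480 instructions and overflows the old cap, while the bounded loop stays far below the relaxed one.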

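Program Types and Data Structures

The stronger eBPF VM and bytecode, together with new program types, data structures, and helper functions, open up more possibilities in tracing, networking, and security. A few examples: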

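Data structures: queue, stack, mmap-able BPF array maps, global data, ring buffer, and more;
Tracing: kprobe/uprobe support for multiple probes at the same location, and more;
Networking: sk_local_storage, various iterators, struct_ops, sk/skb helpers, and more;
Security: bpf-lsm, contributed by Google;
Others: CO-RE, status monitor, dispatcher, in-kernel BTF type checking of BPF assembly for faster and safer tracing, ISA instruction improvements, better verifier stability, and more;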
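As an illustration of the queue map listed above, the following is a user-space Python model of its push/pop/peek behaviour. It is a sketch, not kernel code: the class name is hypothetical, and the error codes and BPF_EXIST handling reflect my understanding of the kernel's queue/stack map semantics.

```python
import errno
from collections import deque

BPF_ANY = 0
BPF_NOEXIST = 1
BPF_EXIST = 2  # numeric values match the kernel UAPI flags


class QueueMapModel:
    """Hypothetical user-space model of a FIFO queue map with a fixed capacity."""

    def __init__(self, max_entries: int):
        self.max_entries = max_entries
        self._items = deque()

    def push(self, value, flags: int = BPF_ANY) -> int:
        # Only BPF_ANY and BPF_EXIST are meaningful for push.
        if flags & BPF_NOEXIST or flags > BPF_EXIST:
            return -errno.EINVAL
        if len(self._items) >= self.max_entries:
            if not flags & BPF_EXIST:
                return -errno.E2BIG      # full, and the caller did not ask to overwrite
            self._items.popleft()        # drop the oldest element to make room
        self._items.append(value)
        return 0

    def pop(self):
        """Remove and return the oldest element as (error, value)."""
        if not self._items:
            return -errno.ENOENT, None
        return 0, self._items.popleft()

    def peek(self):
        """Return the oldest element without removing it."""
        if not self._items:
            return -errno.ENOENT, None
        return 0, self._items[0]


if __name__ == "__main__":
    q = QueueMapModel(max_entries=2)
    q.push("a")
    q.push("b")
    print(q.push("c"))             # rejected: the queue is full
    print(q.push("c", BPF_EXIST))  # accepted: the oldest entry is evicted
    print(q.pop())
```

A stack map would differ only in removing from the newest end instead of the oldest.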

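Ecosystem

eBPF's momentum depends on the tools and documentation around it. A craftsman who wants to do good work must first sharpen his tools: for developers, being able to validate ideas and build applications quickly with the right tools, libraries, and documentation is essential.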

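Toolchain

A few years ago, iovisor brought BCC, bpftrace, and other tools to eBPF. They became the best practice for eBPF tracing at the time, and many developers built a large number of eBPF tools on top of BCC's Python library, which compiles programs at run time. Because of the limitations of that run-time compilation model, BCC is gradually moving to the community's libbpf-based approach, where BTF + CO-RE + libbpf deliver compile once, run everywhere. libbpf + bpftool also greatly reduce the mental burden of developing eBPF: they enable out-of-tree development, bundle the eBPF object with the user-space binary, hide details such as eBPF map creation, program loading, and kprobe attachment, and expose a skeleton for rapid development.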
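CO-RE relies on the running kernel exposing its own BTF, which kernels built with the relevant option publish under /sys/kernel/btf. A minimal Python check, offered as a sketch (the function name is hypothetical), could look like this:

```python
from pathlib import Path

# Location where kernels that export BTF publish the type info for vmlinux.
DEFAULT_BTF = Path("/sys/kernel/btf/vmlinux")


def kernel_btf_available(path: Path = DEFAULT_BTF) -> bool:
    """Return True if the given BTF file exists and is a regular file."""
    try:
        return path.is_file()
    except OSError:
        # Treat permission or I/O problems as "not available".
        return False


if __name__ == "__main__":
    if kernel_btf_available():
        print("kernel BTF found: CO-RE objects can be relocated against it")
    else:
        print("no kernel BTF: fall back to a run-time compiled tool or external BTF")
```

When the file is present, `bpftool btf dump file /sys/kernel/btf/vmlinux format c` can generate a vmlinux.h header from it for CO-RE programs to include.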

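Influence

The community did not promote eBPF heavily at first, yet it still attracted many developers. As networking subsystem maintainer Miller put it at LSFMM 2019: "there is no "advertising machine" for this technology". In recent years the eBPF community has become more vocal: major conferences have added eBPF sessions, for example The 2019 Linux Storage, Filesystem, and Memory-Management Summit, and eBPF Summit 2020 was held as scheduled. Cilium is the most widely used eBPF product, and several of its core developers are also core developers of the eBPF subsystem. With the help of the Cilium community, the ebpf.io website went live to introduce eBPF's latest features and use cases.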

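Beyond the community, companies such as Sysdig, Google, and Facebook are also promoting eBPF. Google Kubernetes Engine, for example, has announced Cilium as its data plane. eBPF evangelist Brendan Gregg has written a new book, "BPF Performance Tools", on using eBPF for performance tuning and related scenarios. In China, NetEase, Tencent, ByteDance, and Huawei have repeatedly published articles on using eBPF to improve system observability and accelerate cloud-native networking.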

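Outlook

eBPF still has plenty of room to improve. As the instruction limit is relaxed, the complexity and size of eBPF programs will grow many times over, yet there is still no set of reusable libraries that lets users share code and build things quickly. Today's eBPF development mostly resembles writing monitoring scripts: short, quick jobs. As BTF matures, eBPF programs may be able to discover eBPF libraries provided by the kernel through BTF, sparing users from rewriting the same code and improving productivity. eBPF's own debuggability also needs work, especially because eBPF programs are often injected into hot paths, where inefficient code adds overhead.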
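One way to spot an expensive program on a hot path is the per-program run-time statistics listed in the appendix (enabled, for instance, with the kernel.bpf_stats_enabled sysctl). The sketch below computes the average cost per invocation from bpftool-style JSON. The sample input is made up for illustration, and the field names run_time_ns and run_cnt are what I understand `bpftool prog show --json` to emit when statistics are on.

```python
import json


def average_runtime_ns(progs_json: str) -> dict:
    """Map program id -> average nanoseconds per run, skipping programs without stats."""
    averages = {}
    for prog in json.loads(progs_json):
        run_cnt = prog.get("run_cnt", 0)
        if run_cnt:
            averages[prog["id"]] = prog.get("run_time_ns", 0) / run_cnt
    return averages


# Illustrative sample only, not real measurements.
SAMPLE = """
[
  {"id": 7, "name": "trace_open", "run_time_ns": 120000, "run_cnt": 400},
  {"id": 9, "name": "idle_prog"}
]
"""

if __name__ == "__main__":
    for prog_id, avg in average_runtime_ns(SAMPLE).items():
        print(f"prog {prog_id}: {avg:.1f} ns per run")
```

Programs that never ran, or that were loaded while statistics were disabled, are simply left out of the result.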

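Beyond that, eBPF as a bytecode can serve as a platform-independent, efficient, and safe general-purpose bytecode. Kernel developers have, for example, used eBPF to write infrared (IR) drivers for the kernel, aiming to simplify support for hundreds of infrared devices. Another example is a closed-source NVIDIA driver written with eBPF. Outside the kernel, at Rust Conf 2020, developers used the user-space eBPF VM uBPF together with Rust to build smart contracts executed as eBPF bytecode. A few years ago developers also proposed a common eBPF capability that pushes eBPF down into NIC drivers. The OSDI 2020 best paper, "hXDP: Efficient Software Packet Processing on FPGA NICs", likewise proposes moving eBPF bytecode onto FPGA devices, a direction well worth exploring further.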
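The portability of the bytecode is easy to see in a tiny user-space interpreter. The Python sketch below decodes the fixed 8-byte instruction format (opcode, destination and source register nibbles, 16-bit offset, 32-bit immediate, little-endian) and executes only three opcodes. It is an illustration under those assumptions: there are no jumps, no memory access, and no verification.

```python
import struct

BPF_ADD_IMM = 0x07   # BPF_ALU64 | BPF_ADD | BPF_K
BPF_MOV_IMM = 0xB7   # BPF_ALU64 | BPF_MOV | BPF_K
BPF_EXIT = 0x95      # BPF_JMP | BPF_EXIT
MASK64 = (1 << 64) - 1


def decode(code: bytes):
    """Yield (opcode, dst_reg, src_reg, offset, imm) for each 8-byte instruction."""
    for pos in range(0, len(code), 8):
        opcode, regs, offset, imm = struct.unpack_from("<BBhi", code, pos)
        yield opcode, regs & 0x0F, regs >> 4, offset, imm


def run(code: bytes) -> int:
    """Execute a straight-line program and return r0 at exit."""
    regs = [0] * 11  # r0..r10
    for opcode, dst, _src, _offset, imm in decode(code):
        if opcode == BPF_MOV_IMM:
            regs[dst] = imm & MASK64
        elif opcode == BPF_ADD_IMM:
            regs[dst] = (regs[dst] + imm) & MASK64
        elif opcode == BPF_EXIT:
            return regs[0]
        else:
            raise ValueError(f"unsupported opcode 0x{opcode:02x}")
    raise ValueError("program ended without exit")


# mov r0, 40 ; add r0, 2 ; exit
PROGRAM = bytes.fromhex(
    "b700000028000000"
    "0700000002000000"
    "9500000000000000"
)

if __name__ == "__main__":
    print(run(PROGRAM))
```

The same 24 bytes could be handed to any other eBPF runtime, whether the kernel, uBPF, or a hardware offload, which is the property the projects above build on.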

Appendix

Key eBPF features at a glance (since kernel 5.0):

Linux 5.0

queue / stacks maps
socket lookup
per cpu cgroup local storage
flow dissectors

Linux 5.1

add ip encapsulations headers to packets
add jump32 instruction
access tcp_sock bpf_sock, BPF_FUNC_sk_fullsock & BPF_FUNC_tcp_sock
bpftool dump bpf related parameters
skb_shared_info->gso_segs
support __int128 type
per-program stats to monitor eBPF usage
bpf_spin_lock
queue/stack manipulations
avoid unloading xdp prog not attached by samples
RISCV BPF JIT supported

Linux 5.2

kbuild, generate BTF type info for vmlinux and kernel modules
global data support
BPF_CGROUP_SYSCTL hook, called before a sysctl's proc_handler so that access can be authorized
new arguments for bpf_attr for BPF_PROG_TEST_RUN
bpf sk local storage makes bpf networking programming more intuitive
var offset stack access
checking SYN cookies from xdp and tc cls
extend bpf_skb_adjust_room growth to mark inner mac headers
improve verifier scalability
opt-in interface for tracepoints to expose a writeable context
bpf_skb_ecn_set_ce callable from BPF_PROG_TYPE_SCHED_ACT

Linux 5.3

libbpf BTF-to-C dumping support
new way to specify BPF maps
propagating congestion notifications to TCP from cgroup inet skb egress
SO_DETACH_REUSEPORT_BPF to detach BPF prog from reuseport sk
sock_ops_callback can be selectively enabled on a socket
CGROUP_SKB prog to use bpf_skb_cgroup_id
eliminate zero extensions for sub-register write
export bpf_sock for BPF_PROG_TYPE_CGROUP_SOCK_ADDR prog type
wide aligned stores for some fields of bpf_sock_addr
add fq's Earliest Departure Time to HBM
bounded loops
bpf getsockopt and setsockopt hooks

Linux 5.4

addition of multiprobes to kprobe and uprobe events: support more than one probe at the same location
XDP devmap_hash looking up devices by hashed index
BPF ids in procfs for FD to BTF objects
BPF_BTF_GET_NEXT_ID bpf() syscall command to list all BTF objs
BPF_F_TEST_STATE_FREQ stress test
export BTF info through /sys/kernel/btf
implement central part of CO-RE
BPF helper to generate SYN cookies
bpftool add net attach/detach command to attach XDP prog
bpftool work with frozen map
bpftool add support for reporting the effective cgroup progs

Linux 5.5

in-kernel BTF to type check BPF assembly code. allows safer and faster BPF tracing
Introduce BPF trampoline to allow kernel code to call into BPF programs with practically zero overhead
support for memory-mapping BPF array maps
Optimize BPF tail calls for direct jumps
Add probe_read_user, probe_read_kernel and probe_read_user_str, probe_read_kernel_str helpers
bpftool: Allow to read btf as raw data
flow_dissector: add mode to enforce global BPF flow dissector

Linux 5.6

Introduce the BPF dispatcher, a mechanism that avoids indirect calls and so helps avoid the retpoline performance hit
Introduce BPF STRUCT_OPS.
Introduce batch ops that can be added to bpf maps to lookup/lookup_and_delete/update/delete more than 1 element at a time
Emit audit messages upon successful prog load and unload
Program extensions or dynamic re-linking
Introduce static vs global functions and function by function verification, another step toward dynamic re-linking
replacing cgroup-bpf programs attached with BPF_F_ALLOW_MULTI flag so that any program in a list can be updated to a new version without service interruption and order of programs can be preserved
bpftool: implement a new BPF feature probe for the increase of the maximum program size to 1M instructions

Linux 5.7

Extend SOCKMAP to store listening as well as established sockets
Make BPF and PREEMPT_RT co-exist
BPF programs may want to know whether an skb is gso.
Add bpf_sk_assign eBPF helper, it allows assigning a previously-found socket to the skb as the packet is received towards the stack, to cause the stack to guide the packet towards that socket subject to local routing configuration.
Add bpf_sk_storage_get() and bpf_sk_storage_delete() helper to the bpf_tcp_ca's struct_ops
Provide bpf_sk_storage data during inet_diag's dump
Add support for storing UDP sockets in sockmap and sockhash
Adds bpftool struct_ops to support struct_ops features
Introduce bpftool prog profile command, which uses hardware counters to profile BPF programs
Allow per-file SELinux labeling for bpffs
bpf-lsm: A BPF-based Linux Security Module

Linux 5.8

Introduce CAP_PERFMON to secure system performance monitoring and observability
Introduce CAP_BPF to split BPF operations that are allowed under CAP_SYS_ADMIN into combination of CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN and keep some of them under CAP_SYS_ADMIN. The user process has to have: CAP_BPF to create maps and do other sys_bpf() commands, CAP_BPF and CAP_PERFMON to load tracing programs, and CAP_BPF plus CAP_NET_ADMIN to load networking programs
bpftool: Allow probing for CONFIG_HZ from kernel config
Add get{peer,sock}name cgroup attach types to the BPF sock_addr programs in order to enable rewriting sockaddr structs
Add sk_msg and networking helpers to all networking programs with perfmon_capable() capabilities
Implement a new BPF ring buffer
The bpf iterator provides in-kernel aggregation abilities for kernel data. This can greatly improve performance compared to e.g., iterating all process directories under /proc
Introduce a new bpf_link type for attaching to network namespace
Add rx_queue_mapping to bpf_sock
Sharing bpf runtime stats with BPF_ENABLE_STATS
bpf_{g,s}etsockopt for struct bpf_sock_addr
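The capability split described in the CAP_BPF entry above can be summarized as a small lookup table. This is a Python sketch for illustration: the category names are hypothetical labels, and the rule that CAP_SYS_ADMIN still covers everything reflects my understanding of how the kernel keeps backward compatibility.

```python
# Capabilities needed per operation, following the CAP_BPF entry above.
# The dictionary keys are hypothetical labels chosen for this example.
REQUIRED_CAPS = {
    "map_create": {"CAP_BPF"},
    "tracing": {"CAP_BPF", "CAP_PERFMON"},
    "networking": {"CAP_BPF", "CAP_NET_ADMIN"},
}


def allowed(operation: str, held: set) -> bool:
    """Return True if a process holding `held` capabilities may perform `operation`."""
    if "CAP_SYS_ADMIN" in held:
        return True  # the old catch-all capability remains sufficient
    return REQUIRED_CAPS[operation] <= held


if __name__ == "__main__":
    agent_caps = {"CAP_BPF", "CAP_PERFMON"}  # e.g. a tracing agent without full admin rights
    for operation in REQUIRED_CAPS:
        print(operation, allowed(operation, agent_caps))
```

A tracing agent can therefore run with CAP_BPF and CAP_PERFMON only, without being granted network administration or full CAP_SYS_ADMIN.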

Linux 5.9

Add a text poke event to record changes to kernel text (i.e. self-modifying code) in order to support tracers like Intel PT decoding through jump labels, kprobes and ftrace trampolines
Add a new BPF program type named BPF_PROG_TYPE_SK_LOOKUP
XDP link: Following cgroup and netns examples, implement bpf_link support for XDP
Add BPF_CGROUP_INET_SOCK_RELEASE hook. Sometimes it's handy to know when the socket gets freed.
Add support of SO_KEEPALIVE flag and TCP related options to bpf_setsockopt() routine.
Add d_path helper
Add new BPF link operation that allows processes with BPF link FD to force-detach it from respective BPF hook, similarly to how BPF link is auto-detached when such BPF hook (e.g., cgroup, net_device, netns, etc) is removed.
Expose socket storage to BPF_PROG_TYPE_CGROUP_SOCK
Implement bpf iterator for map elements. User can have a bpf program in kernel to run with each map element, do checking, filtering, aggregation, modifying values etc.
Iterator for tcp and udp sockets. This gives great flexibility for users to examine kernel data structure without using e.g. /proc/net
Introduces a new helper bpf_get_task_stack()

Linux 5.10

Sockmap iterator
BPF TCP header options
Introduce minimal support for sleepable progs
Add a kernel module with user mode driver that populates bpffs with two BPF iterators
Add tcp_notsent_lowat bpf setsockopt
BTF support for ksyms
Add support for attaching freplace BPF programs to multiple targets. This is needed to support incremental attachment of multiple XDP programs using the libxdp dispatcher model
Allow updating sockmap / sockhash from BPF
Implement link_query for bpf iterators
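The list above can double as a quick compatibility table. The Python sketch below maps a handful of features from the appendix to the release that introduced them and reports which ones a given kernel release string should have. It is a heuristic only, since distribution kernels often backport features, and the function names are hypothetical.

```python
import platform
import re

# Minimum upstream release per feature, taken from the appendix above.
MIN_KERNEL = {
    "bpf_spin_lock": (5, 1),
    "bounded loops": (5, 3),
    "bpf-lsm": (5, 7),
    "BPF ring buffer": (5, 8),
    "sleepable programs": (5, 10),
}


def parse_release(release: str) -> tuple:
    """Extract (major, minor) from a release string such as '5.4.0-42-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if not match:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def available_features(release: str) -> list:
    """Return the sorted feature names whose minimum release is met."""
    version = parse_release(release)
    return sorted(name for name, minimum in MIN_KERNEL.items() if version >= minimum)


if __name__ == "__main__":
    release = platform.release()
    try:
        print(release, "->", available_features(release))
    except ValueError as err:
        print(err)
```

Probing for a feature directly, for example with bpftool's feature probes, is more reliable than comparing version numbers when backports are involved.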
