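Kubernetes 1.29 ships nftables support behind a feature gate. This post is a short tour of nftables usage, meant to make the nftables rules generated by Kubernetes easier to read later on. A kubeadm configuration file that enables the nftables feature gate is given at the end.
Differences between nftables and iptables
- New syntax: nftables uses a compact syntax inspired by tcpdump.
- Fully configurable tables and chains: iptables ships with a number of predefined tables and chains that are registered even when you do not need them, which costs some performance. nftables has no predefined tables or chains, so you explicitly define each table and the objects it contains (chains, sets, maps, flowtables and stateful objects). You choose the table and chain names as well as the netfilter hook priority.
- A single rule can take multiple actions: in iptables a match can trigger only one action. An nftables rule contains zero or more expressions (which match packets) followed by one or more statements. Each expression tests whether a packet matches a specific payload field or a piece of packet/flow metadata. Expressions are evaluated linearly from left to right: if the first expression matches, the next one is evaluated, and the statements are executed only if all expressions match. Each statement performs an action, such as setting the netfilter mark, counting packets, logging the packet, or rendering a verdict such as accepting or dropping the packet or jumping to another chain. Statements are also executed linearly from left to right, so one rule can take several actions by using several statements (note that a verdict statement ends the rule).
- No built-in counters for chains and rules: counters are optional in nftables and can be enabled where needed.
- Better support for dynamic ruleset updates: the iptables ruleset is monolithic, so changing a single rule affects the ruleset as a whole. The nftables ruleset is represented internally as a linked list, so adding or deleting a rule leaves the other rules untouched, which simplifies maintenance of internal state.
- Simpler IPv4/IPv6 dual-stack administration: the nftables inet family lets a chain handle both IPv4 and IPv6, so scripts that duplicate rules are no longer needed.
- New generic set infrastructure: this infrastructure is tightly integrated into the nftables core and supports advanced configurations such as maps, verdict maps and intervals for performance-oriented packet classification.
- Support for concatenations: since Linux kernel 4.1 you can concatenate several keys and combine them with maps and verdict maps. The idea is to build a tuple and hash its values to get lookups that are close to O(1).
- New protocols without a kernel upgrade: kernel upgrades are time-consuming and daunting, especially when you maintain several firewalls in your network, and distribution kernels usually lag behind the latest release. Because nftables is built around an in-kernel virtual machine, supporting a new protocol usually needs no kernel update, only an update of the userspace nft tool.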
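The left-to-right evaluation of expressions and statements described in the list above can be sketched in a few lines of Python. This is only an illustrative model, not how the kernel implements it; the names Rule, evaluate, counter and accept are made up for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Packet = Dict[str, object]


@dataclass
class Rule:
    # Expressions are predicates over the packet; statements are actions.
    expressions: List[Callable[[Packet], bool]]
    statements: List[Callable[[Packet], Optional[str]]]


def evaluate(rule: Rule, packet: Packet) -> Optional[str]:
    """Return a verdict string, or None if the rule issues no verdict."""
    # Expressions are tested left to right; the first mismatch stops the rule.
    for expr in rule.expressions:
        if not expr(packet):
            return None
    # All expressions matched: run the statements left to right.
    for stmt in rule.statements:
        verdict = stmt(packet)
        if verdict is not None:
            return verdict  # a verdict statement ends the rule
    return None


def counter(stats: Dict[str, int]) -> Callable[[Packet], Optional[str]]:
    """Hypothetical stand-in for the nft counter statement."""
    def run(packet: Packet) -> Optional[str]:
        stats["packets"] += 1
        stats["bytes"] += int(packet["length"])
        return None  # counting is not a verdict, evaluation continues
    return run


def accept(packet: Packet) -> Optional[str]:
    return "accept"


stats = {"packets": 0, "bytes": 0}
# Rough model of: ip saddr 10.1.1.1 tcp dport 22 counter accept
rule = Rule(
    expressions=[lambda p: p["saddr"] == "10.1.1.1", lambda p: p["dport"] == 22],
    statements=[counter(stats), accept],
)
ssh = {"saddr": "10.1.1.1", "dport": 22, "length": 60}
web = {"saddr": "10.1.1.1", "dport": 80, "length": 40}
```

Calling evaluate(rule, ssh) runs both statements, while evaluate(rule, web) stops at the second expression and leaves the counter untouched.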
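Structure of nftables
Like iptables, nftables is organized as table -> chain -> rule. It additionally uses the notion of a family to distinguish packet types.
ADDRESS FAMILIES
The address family determines the type of packets that are processed. The default is ip.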
ip | IPv4 address family. |
ip6 | IPv6 address family. |
inet | Internet (IPv4/IPv6) address family. |
arp | ARP address family, handling IPv4 ARP packets. |
bridge | Bridge address family, handling packets which traverse a bridge device. |
netdev | Netdev address family, handling packets on ingress and egress. |
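The kernel places hooks at different stages of the packet processing path, and each address family has its own set of hooks. The hooks for the IPv4/IPv6/inet address families are:
Hook | Description |
prerouting | All packets entering the system are processed by the prerouting hook. It is invoked before the routing decision and is used for early filtering or for changing packet attributes that affect routing. |
input | Packets delivered to the local system are processed by the input hook. |
forward | Packets forwarded to a different host are processed by the forward hook. |
output | Packets sent by local processes are processed by the output hook. |
postrouting | All packets leaving the system are processed by the postrouting hook. |
ingress | All packets entering the system are processed by this hook. It is invoked before layer 3 protocol handlers, and therefore before the prerouting hook, and can be used for filtering and policing. Within this group, ingress is only available for the inet family (Linux kernel 5.10 and later). |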
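Special syntax
Concatenations
nft uses . as the concatenation operator. The rule below matches either source 1.1.1.1, destination 2.2.2.2 and TCP, or source 1.1.1.1, destination 3.3.3.3 and UDP: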
nft add rule ip filter input ip saddr . ip daddr . ip protocol { 1.1.1.1 . 2.2.2.2 . tcp, 1.1.1.1 . 3.3.3.3 . udp} counter accept
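A concatenation behaves like a lookup with a tuple key. The Python sketch below models the rule above with an ordinary hashed set of tuples; it only illustrates the idea and says nothing about the kernel's actual set backends. The names allowed and matches are invented for the example.

```python
# Each element of the nft set becomes one tuple key.
allowed = {
    ("1.1.1.1", "2.2.2.2", "tcp"),
    ("1.1.1.1", "3.3.3.3", "udp"),
}


def matches(saddr: str, daddr: str, protocol: str) -> bool:
    """Model of: ip saddr . ip daddr . ip protocol { ... }"""
    # The three fields are joined into one tuple and looked up in a single
    # hashed membership test instead of three separate comparisons.
    return (saddr, daddr, protocol) in allowed
```

All fields of the tuple must match the same element: matches("1.1.1.1", "2.2.2.2", "udp") is False even though each value appears somewhere in the set.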
Intervals
An interval is written as value-value and counts as a single argument:
% nft add rule filter input ip daddr 192.168.0.1-192.168.0.250 drop
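To make the interval semantics concrete, here is a small Python sketch that checks whether an address falls inside a value-value range, with both ends treated as inclusive. The helper name in_interval is hypothetical; only the standard ipaddress module is used.

```python
import ipaddress


def in_interval(addr: str, interval: str) -> bool:
    """Return True if addr lies within 'low-high' (both ends inclusive)."""
    low, high = (ipaddress.IPv4Address(part) for part in interval.split("-"))
    return low <= ipaddress.IPv4Address(addr) <= high
```

With the range from the rule above, in_interval("192.168.0.100", "192.168.0.1-192.168.0.250") is True and in_interval("192.168.0.251", "192.168.0.1-192.168.0.250") is False.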
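Common parameters
The handle position parameter
nftables uses a handle as a position ID for adding or deleting entries; it corresponds to the handle argument in the commands. Pass -a to include handles in the output: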
$ nft -a list ruleset
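Comments
A comment is either a single word or several words in double quotes. When adding rules from bash, escape the quotation marks with a backslash, for example \"enable ssh for servers\".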
DATA TYPES
Data types define the types of the values used in expressions.
RULESET
- {list | flush} ruleset [family]
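In these synopses, bold marks keywords and italics mark user-defined fields (the styling follows the nft man page); fields in [] are optional and have a default value.
ruleset refers to all tables, chains and other objects.
Example
nft list ruleset: lists all nftables rules, the counterpart of the iptables-save command.
- A family can also be given, for example nft list ruleset inet.
nft flush ruleset: clears all nftables rules.
Backup and restore
Backup
% echo "flush ruleset" > backup.nft
% nft list ruleset >> backup.nft
Restore
% nft -f backup.nft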
TABLES
{add | create} table [family] table [ {comment comment ;} { flags flags ; }]
{delete | destroy | list | flush} table [family] table
list tables [family]
delete table [family] handle handle
destroy table [family] handle handle
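Tables contain chains, sets and stateful objects, and are identified by their address family and name. The address family must be one of ip, ip6, inet, arp, bridge or netdev; inet is used to create hybrid IPv4/IPv6 tables. If no family is given, ip is used. The difference between add and create is that add does not return an error if the table already exists, while create does. delete and destroy differ in the same way: delete returns an error if the table does not exist, while destroy does not.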
Example
$ nft add table inet my_table
CHAINS
- {add | create} chain [family] table chain [{ type type hook hook [device device] priority priority ; [policy policy ;] [comment comment ;] }]
- {delete | destroy | list | flush} chain [family] table chain
- list chains [family]
- delete chain [family] table handle handle
- destroy chain [family] table handle handle
- rename chain [family] table chain newname
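Chains contain rules. There are two kinds: base chains and regular chains. A base chain is an entry point for packets from the network stack; a regular chain is used as a jump target and to organize rules.
A base chain requires the type, hook and priority parameters: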
Type | Families | Hooks | Description |
filter | all | all | Standard chain type to use in doubt. |
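nat | ip, ip6, inet | prerouting, input, output, postrouting | Chains of this type perform network address translation based on conntrack entries. Only the first packet of a connection traverses this chain; its rules define the details of the conntrack entry that is created. |
route | ip, ip6 | output | If a packet has traversed a chain of this type and is about to be accepted, a new route lookup is performed if relevant fields of the IP header have changed. This makes it possible to implement policy routing selectors in nftables. |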
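priority
The priority parameter takes a signed integer or a standard priority name and determines the order in which chains on the same hook are evaluated. Chains with lower values run first.
For nat type chains, the priority must be greater than -200, because conntrack hooks at that priority and NAT depends on it.
Standard priority names and their family and hook compatibility: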
Name | Value | Families | Hooks |
raw | -300 | ip, ip6, inet | all |
mangle | -150 | ip, ip6, inet | all |
dstnat | -100 | ip, ip6, inet | prerouting |
filter | 0 | ip, ip6, inet, arp, netdev | all |
security | 50 | ip, ip6, inet | all |
srcnat | 100 | ip, ip6, inet | postrouting |
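Basic arithmetic (addition or subtraction) can be combined with a standard priority name to express a relative priority; for example, mangle - 5 stands for -155. Commands such as list print priorities in the same form.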
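The arithmetic on priority names can be illustrated with a small Python helper that resolves a priority string to its numeric value. It is only a sketch built from the ip/ip6/inet table above; the function name resolve_priority and the accepted input format are assumptions of this example, not part of nft.

```python
import re

# Standard priority names for the ip, ip6 and inet families (see the table above).
STANDARD_PRIORITIES = {
    "raw": -300, "mangle": -150, "dstnat": -100,
    "filter": 0, "security": 50, "srcnat": 100,
}

# A name or signed integer, optionally followed by "+ N" or "- N".
_PATTERN = re.compile(r"^\s*([a-z]+|-?\d+)\s*(?:([+-])\s*(\d+))?\s*$")


def resolve_priority(spec: str) -> int:
    """Turn 'mangle - 5', 'filter' or '-301' into a numeric priority."""
    match = _PATTERN.match(spec)
    if match is None:
        raise ValueError(f"cannot parse priority: {spec!r}")
    base, sign, offset = match.groups()
    if base.lstrip("-").isdigit():
        value = int(base)
    elif base in STANDARD_PRIORITIES:
        value = STANDARD_PRIORITIES[base]
    else:
        raise ValueError(f"unknown priority name: {base!r}")
    if sign is not None:
        value += int(offset) if sign == "+" else -int(offset)
    return value
```

For example, resolve_priority("mangle - 5") gives -155, matching the value quoted in the text.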
Standard priority names and hook compatibility for the bridge family:
Name | Value | Hooks |
dstnat | -300 | prerouting |
filter | -200 | all |
out | 100 | output |
srcnat | 300 | postrouting |
policy
The policy is the default verdict applied to packets that reach the end of a base chain without hitting a terminating rule. It is either accept (the default) or drop.
Example
$ nft add chain inet my_table myin { type filter hook input priority 1\; policy accept\; }
The ruleset below shows how priority and policy interact:
table inet filter {
    # This chain is evaluated first due to priority
    chain services {
        type filter hook input priority 0; policy accept;

        # If matched, this rule will prevent any further evaluation
        tcp dport http drop

        # If matched, and despite the accept verdict, the packet proceeds to enter the chain below
        tcp dport ssh accept

        # Likewise for any packets that get this far and hit the default policy
    }

    # This chain is evaluated last due to priority
    chain input {
        type filter hook input priority 1; policy drop;
        # All ingress packets end up being dropped here!
    }
}
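The interaction between priority and verdicts in the ruleset above can be modelled in Python. The sketch assumes a simplified world in which each rule is a (match, verdict) pair and only accept and drop exist; run_chain and run_hook are hypothetical helper names.

```python
from typing import Callable, Dict, List, Tuple

Packet = Dict[str, object]
Rule = Tuple[Callable[[Packet], bool], str]   # (match, verdict)
Chain = Tuple[int, List[Rule], str]           # (priority, rules, policy)


def run_chain(rules: List[Rule], policy: str, packet: Packet) -> str:
    # The first matching rule decides; otherwise the chain policy applies.
    for match, verdict in rules:
        if match(packet):
            return verdict
    return policy


def run_hook(chains: List[Chain], packet: Packet) -> str:
    # Base chains on the same hook run in ascending priority order.
    for _priority, rules, policy in sorted(chains, key=lambda c: c[0]):
        if run_chain(rules, policy, packet) == "drop":
            return "drop"      # drop is final, later chains are skipped
        # accept only ends this chain; the next chain still sees the packet
    return "accept"


services: Chain = (
    0,
    [(lambda p: p["dport"] == 80, "drop"), (lambda p: p["dport"] == 22, "accept")],
    "accept",
)
strict_input: Chain = (1, [], "drop")
open_input: Chain = (1, [], "accept")
```

With strict_input as the second chain every packet ends up dropped, including ssh traffic that services accepted; swapping in open_input lets ssh through while http is still dropped by the first chain.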
RULES
- {add | insert} rule [family] table chain [handle handle | index index] statement ... [comment comment]
- replace rule [family] table chain handle handle statement ... [comment comment]
- {delete | reset} rule [family] table chain handle handle
- destroy rule [family] table chain handle handle
- reset rules [family] [table [chain]]
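If no family is given in a rule command, ip is used. A rule consists of two parts: expressions and statements.
Both add and insert add a rule. add appends the rule to the given chain, or places it after the rule at the given handle or index; insert puts the rule at the beginning of the chain, or before the rule at the given handle or index.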
Example
$ nft add rule ip filter output ip daddr 192.168.0.0/24 accept
# nft -a list ruleset
table inet filter {
    chain input {
        type filter hook input priority filter; policy accept;
        ct state established,related accept # handle 4
        ip saddr 10.1.1.1 tcp dport ssh accept # handle 5
    ...
# delete the rule with handle 5
nft delete rule inet filter input handle 5
SETS
- add set [family] table set { type type | typeof expression ; [flags flags ;] [timeout timeout ;] [gc-interval gc-interval ;] [elements = { element[, ...] } ;] [size size ;] [comment comment ;] [policy policy ;] [auto-merge ;] }
- {delete | destroy | list | flush} set [family] table set
- list sets [family]
- delete set [family] table handle handle
- {add | delete | destroy } element [family] table set { element[, ...] }
Keyword | Description | Type |
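type | Data type of the set elements | String: ipv4_addr, ipv6_addr, ether_addr, inet_proto, inet_service, mark |
typeof | Data type of the set elements | Data type derived from an expression |
flags | Set flags | String: constant, dynamic, interval, timeout. Describes the properties of the set |
timeout | How long an element stays in the set; mandatory if elements are added to the set from the packet path (ruleset) | String: decimal number followed by a unit. Units: d, h, m, s |
gc-interval | Garbage collection interval; only takes effect when a timeout or the timeout flag is set | String: decimal number followed by a unit. Units: d, h, m, s |
elements | Elements contained in the set | The set's data type |
size | Maximum number of elements in the set; mandatory if elements are added to the set from the packet path (ruleset) | unsigned integer (64 bit) |
policy | Set policy | String: performance [default], memory |
auto-merge | Automatically merge adjacent or overlapping set elements (interval sets only) | |
nftables offers two kinds of sets: anonymous sets and named sets.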
Example
The basic usage of sets:
$ nft add set ip filter flags_set { type ipv4_addr\; flags constant, interval\; }
$ nft add set ip filter daddrs { type ipv4_addr\; flags timeout\; elements = { 192.168.1.1 timeout 10s, 192.168.1.2 timeout 30s }\; }
Anonymous sets
An anonymous set is written inline in curly braces with comma-separated elements. It is bound to its rule: when the rule is removed, the set is removed too, and its contents cannot be changed other than by deleting and re-adding the rule. No element type needs to be declared. A few examples:
$ nft add rule filter input ip saddr { 10.0.0.0/8, 192.168.0.0/16 } tcp dport { 22, 443 } accept
$ nft add rule ip6 filter input tcp dport { telnet, http, https } accept
$ nft add rule ip6 filter input icmpv6 type { nd-neighbor-solicit, echo-request, nd-router-advert, nd-neighbor-advert } accept
Named sets
Elements can be added to or removed from a named set at any time. A rule references a named set by prefixing its name with @. The element type of a named set must be declared before elements are added.
$ nft add set ip filter blackhole { type ipv4_addr\; comment \"drop all packets from these hosts\"\; }
$ nft add element ip filter blackhole { 192.168.3.4 }  # add an element
$ nft add element ip filter blackhole { 192.168.1.4, 192.168.1.5 }  # add elements
$ nft add rule ip filter input ip saddr @blackhole drop  # reference the named set
$ nft add rule ip filter output ip daddr != @blackhole accept
MAPS
- add map [family] table map { type type | typeof expression [flags flags ;] [elements = { element[, ...] } ;] [size size ;] [comment comment ;] [policy policy ;] }
- {delete | destroy | list | flush} map [family] table map
- list maps [family]
Keyword | Description | Type |
type | Data type of the map elements | String: ipv4_addr, ipv6_addr, ether_addr, inet_proto, inet_service, mark, counter, quota. counter and quota cannot be used as keys |
typeof | Data type of the map elements | Data type derived from an expression |
flags | Map flags | String, same as set flags |
elements | Elements contained in the map | The map's data type |
size | Maximum number of elements in the map | unsigned integer (64 bit) |
policy | Map policy | String: performance [default], memory |
Anonymous maps
DNAT to 192.168.1.100 if the destination port is 80, and to 192.168.1.101 if it is 8888:
$ nft add rule ip nat prerouting dnat to tcp dport map { 80 : 192.168.1.100, 8888 : 192.168.1.101 }
Named maps
$ nft add map nat porttoip { type inet_service : ipv4_addr\; }
$ nft add element nat porttoip { 80 : 192.168.1.100, 8888 : 192.168.1.101 }
With the rule below, outgoing packets to TCP port 80 are SNATed to source address 192.168.1.100, and outgoing packets to TCP port 8888 to 192.168.1.101:
$ nft add rule ip nat postrouting snat to tcp dport map @porttoip
ELEMENTS
- {add | create | delete | destroy | get } element [family] table set { ELEMENT[, ...] }
ELEMENT := key_expression OPTIONS [: value_expression]
OPTIONS := [timeout TIMESPEC] [expires TIMESPEC] [comment string]
TIMESPEC := [numd][numh][numm][num[s]]
Option | Description |
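timeout | Timeout value for sets/maps that have the timeout flag |
expires | Remaining time of the element, counted down from the timeout; the element is deleted when it reaches 0 |
comment | Per-element comment |
The element commands change the contents of named sets and named maps. key_expression is a value of the set's type; value_expression is used only with maps and gives the data part of the mapping.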
Example
$ nft add table inet myfilter
$ nft add set inet myfilter myset { type ipv4_addr\; flags timeout\; }
$ nft add element inet myfilter myset { 10.0.0.1 timeout 10s }
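The TIMESPEC grammar given above, [numd][numh][numm][num[s]], is easy to misread, so here is a small Python sketch that converts such a string to seconds. It covers only the grammar as written in this section; the function name timespec_to_seconds is made up for the example.

```python
import re

# Optional days, hours and minutes, then optional seconds with an optional "s".
_TIMESPEC = re.compile(r"^(?:(\d+)d)?(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s?)?$")


def timespec_to_seconds(spec: str) -> int:
    """Convert strings such as '10s', '1h30m' or '1d2h3m4s' to seconds."""
    match = _TIMESPEC.match(spec)
    if not spec or match is None:
        raise ValueError(f"invalid TIMESPEC: {spec!r}")
    days, hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds
```

The element timeout used in the example above, 10s, maps to 10, and a bare number such as 30 is read as seconds because the trailing s is optional in the grammar.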
FLOWTABLES
Flowtables accelerate packet forwarding by bypassing the classic forwarding path.
                                         userspace process
                                          ^              |
                                          |              |
                                     _____|____     ____\/___
                                    /          \   /         \
                                    |   input   |  |  output  |
                                    \__________/   \_________/
                                         ^               |
                                         |               |
      _________      __________      ---------     _____\/_____
     /         \    /          \     |Routing |   /            \
  -->  ingress  ---> prerouting ---> |decision|   | postrouting |--> neigh_xmit
     \_________/    \__________/     ----------   \____________/          ^
       |      ^                          |               ^                |
   flowtable  |                     ____\/___            |                |
       |      |                    /         \           |                |
    __\/___   |                    | forward |------------                |
    |-----|   |                    \_________/                            |
    |-----|   |                 'flow offload' rule                       |
    |-----|   |                   adds entry to                           |
    |_____|   |                     flowtable                             |
       |      |                                                           |
      / \     |                                                           |
     /hit\_no_|                                                           |
     \ ? /                                                                |
      \ /                                                                 |
       |__yes_________________fastpath bypass ____________________________|

               Fig.1 Netfilter hooks and flowtable interactions
STATEFUL OBJECTS
A stateful object is referenced as "type name", for example "counter name", "quota name" or "limit name".
COUNTER
- add counter [family] table counter_name [{ [ packets packets bytes bytes ; ] [ comment comment ; ] }]
- delete counter [family] table counter_name
- list counters
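A counter keeps the total number of packets and the total number of bytes it has seen since it was last reset. Counting is not implicit: a counter must be added explicitly to every rule you want to count.
Anonymous counters
An anonymous counter belongs only to the rule it appears in. The anonymous counter below counts all TCP traffic routed to the local host:
table ip counter_demo {
    chain IN {
        type filter hook input priority filter; policy drop;

        ip protocol tcp counter
    }
}
Named counters
$ nft add counter filter http
$ nft add rule filter input tcp dport 80 counter name \"http\"
Using counters in a map:
$ nft add counter filter http
$ nft add counter filter https
$ nft add rule filter input counter name tcp dport map { 80 : \"http\", 443 : \"https\" }
Resetting a counter:
$ nft reset counter filter http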
Quotas
- add quota [family] table name { [over|until] bytes BYTE_UNIT [ used bytes BYTE_UNIT ] ; [ comment comment ; ] }
- BYTE_UNIT := bytes | kbytes | mbytes
- delete quota [family] table name
- list quotas
Keyword | Description | Type |
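quota | Quota limit, used as the quota name | Two arguments: unsigned integer (64 bit) and a string: bytes, kbytes or mbytes. The arguments are prefixed with "over" or "until" |
used | Initial value of used quota | Two arguments: unsigned integer (64 bit) and a string: bytes, kbytes or mbytes |
comment | Per-quota comment | String |
A quota:
- defines a byte threshold;
- sets an initial byte count (0 bytes by default);
- counts the total bytes received, starting from the initial count, until the count reaches or exceeds the threshold.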
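The until and over modes can be pictured with a small Python model. It is a sketch of the behaviour described in the list above, with the threshold treated as reached once the running total is greater than or equal to the limit; the class name Quota and its fields are invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Quota:
    limit_bytes: int
    mode: str = "until"      # "until" or "over"
    used_bytes: int = 0      # initial byte count, 0 by default

    def matches(self, packet_len: int) -> bool:
        """Account for one packet and report whether the rule would match."""
        self.used_bytes += packet_len
        reached = self.used_bytes >= self.limit_bytes
        # "until" matches while the threshold has not been reached,
        # "over" matches only once it has been reached.
        return reached if self.mode == "over" else not reached
```

A rule such as quota until 100 mbytes accept therefore accepts traffic while the threshold has not been reached, and a rule such as quota over 500 mbytes drop starts dropping only after it has.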
Anonymous quotas
The configuration below accepts at most 100 mbytes of traffic to udp/5060:
table inet anon_quota_demo {
    chain IN {
        type filter hook input priority filter; policy drop;

        udp dport 5060 quota until 100 mbytes accept
    }
}
Named quotas
The ruleset below defines two quotas and references them in chain IN:
- accept up to 100 mbytes of traffic to udp/5060 and drop the rest of the traffic to that port;
- accept up to 500 mbytes of traffic to tcp/80 and drop the rest of the traffic to that port;
- accept tcp/443 (https) traffic without limit;
- drop everything else (policy drop).
table inet quota_demo {
    quota q_until_sip { until 100 mbytes used 0 bytes }
    quota q_over_http { over 500 mbytes ; comment "cap http (but not https)" ; }

    chain IN {
        type filter hook input priority filter; policy drop;

        udp dport 5060 quota name "q_until_sip" accept
        tcp dport 80 quota name "q_over_http" drop
        tcp dport { 80, 443 } accept
    }
}
An example of referencing quotas from a map:
$ nft add quota filter user123 { over 20 mbytes }
$ nft add quota filter user124 { over 20 mbytes }
$ nft add rule filter input quota name ip saddr map { 192.168.10.123 : \"user123\", 192.168.10.124 : \"user124\" }
Limits
Limits rate-limit traffic using a token bucket.
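For readers unfamiliar with the term, a token bucket can be sketched in Python as follows. This is a generic textbook version with an explicit clock argument, not the kernel's implementation; TokenBucket and allow are hypothetical names.

```python
class TokenBucket:
    def __init__(self, rate_per_second: float, burst: float) -> None:
        self.rate = rate_per_second   # tokens added per second
        self.capacity = burst         # maximum number of stored tokens
        self.tokens = burst           # the bucket starts full
        self.last = 0.0               # timestamp of the previous call

    def allow(self, now: float, cost: float = 1.0) -> bool:
        """Return True if a packet costing `cost` tokens may pass at time `now`."""
        # Refill for the elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

With TokenBucket(rate_per_second=10, burst=5), five packets arriving at the same instant pass and the sixth is refused; one second later the bucket has refilled to its capacity of five tokens.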
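Anonymous limits
The rule below accepts at most 10 ICMP echo-request packets per second:
$ nft add rule filter input icmp type echo-request limit rate 10/second accept
Named limits
The ruleset below defines two limits and uses them to:
- accept ICMP packets of any type at a rate of up to 400 packets/minute;
- accept traffic to tcp/25 (smtp) at a rate of up to 1024 bytes/second, with bursts of up to 512 bytes;
- drop all other traffic (policy drop).
Note that the smtp limit is written without the over keyword: a limit with over matches packets that exceed the rate, which combined with accept would do the opposite of what is described here.
table inet limit_demo {
    limit lim_400ppm { rate 400/minute ; comment "use to limit incoming icmp" ; }
    limit lim_1kbps { rate 1024 bytes/second burst 512 bytes ; comment "use to limit incoming smtp" ; }

    chain IN {
        type filter hook input priority filter; policy drop;

        meta l4proto icmp limit name "lim_400ppm" accept
        tcp dport 25 limit name "lim_1kbps" accept
    }
}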
Others
- Conntrack helpers (ct helper, Layer 7 ALG)
- Conntrack timeout policies (ct timeout)
- Conntrack expectations (ct expectation)
Data types
EXPRESSIONS
PRIMARY EXPRESSIONS
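Expressions are used to match packets.
Primary expressions are the lowest-order expressions: they represent either a constant or a single datum taken from the packet's payload, its metadata or a stateful module.
Only a subset of the available expressions is listed below.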
META EXPRESSIONS
- meta {length | nfproto | l4proto | protocol | priority}
- [meta] {mark | iif | iifname | iiftype | oif | oifname | oiftype | skuid | skgid | nftrace | rtclassid | ibrname | obrname | pkttype | cpu | iifgroup | oifgroup | cgroup | random | ipsec | iifkind | oifkind | time | hour | day }
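As the synopsis shows, there are many kinds of metadata, each with its own data type; the two synopsis lines separate the keywords that require the meta prefix from those where it is optional.
A meta expression refers to metadata associated with a packet.
There are two forms of meta expression, qualified and unqualified; the only difference is whether the meta keyword is present:
# qualified meta expression
filter output meta oif eth0
filter forward meta iifkind { "tun", "veth" }
# unqualified meta expression
filter output oif eth0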
ROUTING EXPRESSIONS
- rt [ip | ip6] {classid | nexthop | mtu | ipsec}
Keyword | Description | Type |
classid | Routing realm | realm |
nexthop | Routing nexthop | ipv4_addr/ipv6_addr |
mtu | TCP maximum segment size of route | integer (16 bit) |
ipsec | route via ipsec tunnel or transport | boolean |
A routing expression refers to routing data associated with a packet:
# IP family independent rt expression
filter output rt classid 10
# IP family dependent rt expressions
ip filter output rt nexthop 192.168.0.1
ip6 filter output rt nexthop fd00::1
inet filter output rt ip nexthop 192.168.0.1
inet filter output rt ip6 nexthop fd00::1
# outgoing packet will be encapsulated/encrypted by ipsec
filter output rt ipsec exists
NUMGEN EXPRESSION
- numgen {inc | random} mod NUM [ offset NUM ]
Generates a number. The inc and random keywords select the mode of operation: in inc mode the last returned value is simply incremented, while random mode returns a random number. mod applies a modulus to the result, and the optional offset adds a fixed value to it.
numgen is typically used for load balancing:
# round-robin between 192.168.10.100 and 192.168.20.200:
add rule nat prerouting dnat to numgen inc mod 2 map \
    { 0 : 192.168.10.100, 1 : 192.168.20.200 }
# probability-based with odd bias using intervals:
add rule nat prerouting dnat to numgen random mod 10 map \
    { 0-2 : 192.168.10.100, 3-9 : 192.168.20.200 }
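The two numgen modes and the maps applied to them can be mimicked in Python. This is a behavioural sketch only: it assumes the inc counter starts at 0, and the names numgen_inc, numgen_random and weighted_backend are made up for the illustration.

```python
import itertools
import random
from typing import Iterator, Optional


def numgen_inc(mod: int, offset: int = 0) -> Iterator[int]:
    """Model of 'numgen inc mod N offset M': 0, 1, ..., N-1, 0, 1, ..."""
    for i in itertools.count():
        yield i % mod + offset


def numgen_random(mod: int, offset: int = 0,
                  rng: Optional[random.Random] = None) -> int:
    """Model of 'numgen random mod N offset M'."""
    rng = rng or random.Random()
    return rng.randrange(mod) + offset


def weighted_backend(n: int) -> str:
    """Model of the interval map { 0-2 : A, 3-9 : B } from the second rule."""
    return "192.168.10.100" if 0 <= n <= 2 else "192.168.20.200"


# Round-robin: numgen inc mod 2 map { 0 : ..., 1 : ... }
backends = {0: "192.168.10.100", 1: "192.168.20.200"}
gen = numgen_inc(2)
picks = [backends[next(gen)] for _ in range(4)]
```

In the weighted case, three of the ten possible values map to the first backend and seven to the second, so roughly 30% of new connections go to 192.168.10.100.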
HASH EXPRESSIONS
- jhash {ip saddr | ip6 daddr | tcp dport | udp sport | ether saddr} [. ...] mod NUM [ seed NUM ] [ offset NUM ]
- symhash mod NUM [ offset NUM ]
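Generates a number using a hash function. The available hash functions are jhash (Jenkins hash) and symhash (symmetric hash). jhash takes an expression that selects the hash input (packet header fields); several inputs can be concatenated. mod applies a modulus to the result, the optional seed specifies the seed of the hash function, and the optional offset adds a fixed value to the result.
jhash and symhash are typically used for load balancing:
# load balance based on source ip between 2 ip addresses:
add rule nat prerouting dnat to jhash ip saddr mod 2 map \
    { 0 : 192.168.10.100, 1 : 192.168.20.200 }
# symmetric load balancing between 2 ip addresses:
add rule nat prerouting dnat to symhash mod 2 map \
    { 0 : 192.168.10.100, 1 : 192.168.20.200 }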
PAYLOAD EXPRESSIONS
Payload expressions refer to data from the packet payload. The expressions for commonly used protocol headers are listed below.
ETHERNET HEADER EXPRESSION
- ether {daddr | saddr | type}
Keyword | Description | Type |
daddr | Destination MAC address | ether_addr |
saddr | Source MAC address | ether_addr |
type | EtherType | ether_type |
VLAN HEADER EXPRESSION
The vlan expression does not work in the ip, ip6 and inet families unless the VLAN interface is configured with reorder_hdr off.
- vlan {id | dei | pcp | type}
Keyword | Description | Type |
id | VLAN ID (VID) | integer (12 bit) |
dei | Drop Eligible Indicator | integer (1 bit) |
pcp | Priority code point | integer (3 bit) |
type | EtherType | ether_type |
IPV4 HEADER EXPRESSION
- ip {version | hdrlength | dscp | ecn | length | id | frag-off | ttl | protocol | checksum | saddr | daddr }
Keyword | Description | Type |
version | IP header version (4) | integer (4 bit) |
hdrlength | IP header length including options | integer (4 bit) FIXME scaling |
dscp | Differentiated Services Code Point | dscp |
ecn | Explicit Congestion Notification | ecn |
length | Total packet length | integer (16 bit) |
id | IP ID | integer (16 bit) |
frag-off | Fragment offset | integer (16 bit) |
ttl | Time to live | integer (8 bit) |
protocol | Upper layer protocol | inet_proto |
checksum | IP header checksum | integer (16 bit) |
saddr | Source address | ipv4_addr |
daddr | Destination address | ipv4_addr |
ICMP HEADER EXPRESSION
- icmp {type | code | checksum | id | sequence | gateway | mtu}
When used in the inet, bridge or netdev families, this expression creates an implicit dependency on IPv4.
Keyword | Description | Type |
type | ICMP type field | icmp_type |
code | ICMP code field | integer (8 bit) |
checksum | ICMP checksum field | integer (16 bit) |
id | ID of echo request/response | integer (16 bit) |
sequence | sequence number of echo request/response | integer (16 bit) |
gateway | gateway of redirects | integer (32 bit) |
mtu | MTU of path MTU discovery | integer (16 bit) |
TCP HEADER EXPRESSION
- tcp {sport | dport | sequence | ackseq | doff | reserved | flags | window | checksum | urgptr}
Keyword | Description | Type |
sport | Source port | inet_service |
dport | Destination port | inet_service |
sequence | Sequence number | integer (32 bit) |
ackseq | Acknowledgement number | integer (32 bit) |
doff | Data offset | integer (4 bit) FIXME scaling |
reserved | Reserved area | integer (4 bit) |
flags | TCP flags | tcp_flag |
window | Window | integer (16 bit) |
checksum | Checksum | integer (16 bit) |
urgptr | Urgent pointer | integer (16 bit) |
UDP HEADER EXPRESSION
- udp {sport | dport | length | checksum}
Keyword | Description | Type |
sport | Source port | inet_service |
dport | Destination port | inet_service |
length | Total packet length | integer (16 bit) |
checksum | Checksum | integer (16 bit) |
EXTENSION HEADER EXPRESSIONS
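Extension header expressions refer to extension fields of protocol headers, such as IPv6 extension headers, TCP options and IPv4 options.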
CONNTRACK EXPRESSIONS
- ct {state | direction | status | mark | expiration | helper | label | count | id}
- ct [original | reply] {l3proto | protocol | bytes | packets | avgpkt | zone}
- ct {original | reply} {proto-src | proto-dst}
- ct {original | reply} {ip | ip6} {saddr | daddr}
Conntrack expressions match the connection tracking metadata associated with a packet.
STATEMENTS
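Statements represent the actions to perform. They can alter the control flow (return, jump to another chain, accept or drop the packet) or perform actions such as logging or rejecting a packet.
Some of the main statements are listed below.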
VERDICT STATEMENT
- {accept | drop | queue | continue | return}
- {jump | goto} chain
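The verdict statement alters the control flow in the ruleset and issues policy decisions for packets.
accept and drop are absolute verdicts.
accept | Terminate ruleset evaluation and accept the packet. A later base chain or hook can still drop it. |
drop | Terminate ruleset evaluation and drop the packet. The drop takes effect immediately: no further chains or hooks are evaluated, so a later chain cannot accept the packet. |
queue | Terminate ruleset evaluation and queue the packet to userspace. Userspace must provide a drop or accept verdict. In case of accept, processing resumes with the next base chain hook, not with the rule following the queue verdict. |
continue | Continue ruleset evaluation with the next rule. This is the default behaviour when a rule issues no verdict. |
return | Return from the current chain and continue evaluation at the next rule of the calling chain. If issued in a base chain, it is equivalent to the base chain policy. |
jump chain | Continue evaluation at the first rule of chain. The current position in the ruleset is pushed to a call stack, and evaluation resumes there once the new chain has been fully evaluated or a return verdict is issued. If a rule in chain issues an absolute verdict, ruleset evaluation terminates immediately and the corresponding action is taken. |
goto chain | Similar to jump, but the current position is not pushed to the call stack. After the new chain has been evaluated, evaluation therefore continues in the last chain on the call stack rather than in the chain containing the goto statement. |
Example
# process packets from eth0 and the internal network in from_lan
# chain, drop all packets from eth0 with different source addresses.
filter input iif eth0 ip saddr 192.168.0.0/24 jump from_lan
filter input iif eth0 drop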
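The difference between jump and goto comes down to whether the caller's position is saved. The Python sketch below models that with an explicit call stack; it assumes every rule matches the packet and supports only a handful of statements, and evaluate is a hypothetical helper rather than anything nft provides.

```python
from typing import Dict, List, Tuple

Rule = Tuple[str, str]  # (statement, argument); argument is "" when unused


def evaluate(chains: Dict[str, List[Rule]], base: str,
             policy: str) -> Tuple[str, List[str]]:
    """Return (verdict, trace) for a packet that matches every rule."""
    trace: List[str] = []
    stack: List[Tuple[str, int]] = []       # saved (chain, next rule index)
    chain, idx = base, 0
    while True:
        if idx >= len(chains[chain]):        # end of chain: implicit return
            if not stack:
                return policy, trace         # nothing to return to: base chain policy
            chain, idx = stack.pop()
            continue
        stmt, arg = chains[chain][idx]
        trace.append(f"{chain}:{stmt}")
        if stmt in ("accept", "drop"):       # absolute verdicts end evaluation
            return stmt, trace
        if stmt == "jump":
            stack.append((chain, idx + 1))   # remember where to come back to
            chain, idx = arg, 0
        elif stmt == "goto":
            chain, idx = arg, 0              # no push: the caller position is lost
        elif stmt == "return":
            if not stack:
                return policy, trace
            chain, idx = stack.pop()
        else:                                # any non-verdict statement
            idx += 1


with_jump = {"input": [("jump", "from_lan"), ("drop", "")],
             "from_lan": [("counter", "")]}
with_goto = {"input": [("goto", "from_lan"), ("drop", "")],
             "from_lan": [("counter", "")]}
```

With jump, evaluation comes back to the input chain and reaches its drop rule; with goto, the end of from_lan finds an empty call stack, so the base chain policy decides.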
REJECT STATEMENT
- reject [ with REJECT_WITH ]
REJECT_WITH := icmp icmp_code |
icmpv6 icmpv6_code |
icmpx icmpx_code |
tcp reset
Rejects the matched packet and returns an error to the sender. The default error is port-unreachable. reject can only be used in the input, forward and output hooks.
LOG STATEMENT
- log [prefix quoted_string] [level syslog-level] [flags log-flags]
- log group nflog_group [prefix quoted_string] [queue-threshold value] [snaplen size]
- log level audit
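The log statement writes matched packets to the kernel log, which can be read with dmesg(1) or through syslog. Alternatively, the log can be forwarded to userspace by listening on an nflog_group.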
COUNTER STATEMENT
The counter statement sets the hit count of matched packets along with the number of bytes.
- counter packets number bytes number
- counter { packets number | bytes number }
MAP STATEMENT
- expression map { MAP_ELEMENTS }
MAP_ELEMENTS := MAP_ELEMENT [, MAP_ELEMENTS]
MAP_ELEMENT := key : value
The map statement looks up data based on a key, which is usually the result of an expression.
# select DNAT target based on TCP dport (the key is dport):
# connections to port 80 are redirected to 192.168.1.100,
# connections to port 8888 are redirected to 192.168.1.101
nft add rule ip nat prerouting dnat tcp dport map { 80 : 192.168.1.100, 8888 : 192.168.1.101 }
# source address based SNAT (the key is saddr):
# packets from net 192.168.1.0/24 will appear as originating from 10.0.0.1,
# packets from net 192.168.2.0/24 will appear as originating from 10.0.0.2
nft add rule ip nat postrouting snat to ip saddr map { 192.168.1.0/24 : 10.0.0.1, 192.168.2.0/24 : 10.0.0.2 }
VMAP STATEMENT
- expression vmap { VMAP_ELEMENTS }
VMAP_ELEMENTS := VMAP_ELEMENT [, VMAP_ELEMENTS]
VMAP_ELEMENT := key : verdict
The vmap statement works like the map statement, but its values are verdicts.
# jump to different chains depending on layer 4 protocol type:
$ nft add rule ip filter input ip protocol vmap { tcp : jump tcp-chain, udp : jump udp-chain, icmp : jump icmp-chain }
NAT STATEMENTS
- snat [[ip | ip6] [ prefix ] to] ADDR_SPEC [:PORT_SPEC] [FLAGS]
- dnat [[ip | ip6] [ prefix ] to] ADDR_SPEC [:PORT_SPEC] [FLAGS]
- masquerade [to :PORT_SPEC] [FLAGS]
- redirect [to :PORT_SPEC] [FLAGS]
ADDR_SPEC := address | address - address
PORT_SPEC := port | port - port
FLAGS := FLAG [, FLAGS]
FLAG := persistent | random | fully-random
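The nat statements are only valid in chains of type nat.
The snat and masquerade statements modify the source address of a packet. snat is only valid in the postrouting and input chains, and masquerade only in the postrouting chain. The dnat and redirect statements are only valid in the prerouting and output chains and modify the destination address of a packet.
masquerade is a special form of snat that translates the source address to the IP address of the outgoing interface. It is particularly useful on gateways with dynamic IP addresses.
redirect is a special form of dnat that translates the destination address to the local host's address. It is useful when you only want to change the destination port of incoming traffic on different interfaces.
Before kernel 4.18, nat statements required both a prerouting and a postrouting base chain to be present; otherwise netfilter would not see packets on the return path and no reverse translation would take place.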
Expression | Description | Type |
address | The source/destination address to rewrite. A mapping can be given instead, consisting of a list of pairs that combine an arbitrary expression key with an address value. | ipv4_addr, ipv6_addr, e.g. abcd::1234, or a mapping, e.g. meta mark map |
port | The source/destination port to rewrite | port number (16 bit) |
Basic usage
# create a suitable table/chain setup for all further examples
add table nat
add chain nat prerouting { type nat hook prerouting priority dstnat; }
add chain nat postrouting { type nat hook postrouting priority srcnat; }

# translate source addresses of all packets leaving via eth0 to address 1.2.3.4
add rule nat postrouting oif eth0 snat to 1.2.3.4

# redirect all traffic entering via eth0 to destination address 192.168.1.120
add rule nat prerouting iif eth0 dnat to 192.168.1.120

# translate source addresses of all packets leaving via eth0 to whatever
# locally generated packets would use as source to reach the same destination
add rule nat postrouting oif eth0 masquerade

# redirect incoming TCP traffic for port 22 to port 2222
add rule nat prerouting tcp dport 22 redirect to :2222

# inet family:
# handle ip dnat:
add rule inet nat prerouting dnat ip to 10.0.2.99
# handle ip6 dnat:
add rule inet nat prerouting dnat ip6 to fe80::dead
# this masquerades both ipv4 and ipv6:
add rule inet nat postrouting meta oif ppp0 masquerade
Advanced usage
# map prefixes in one network to that of another, e.g. 10.141.11.4 is mangled to 192.168.2.4,
# 10.141.11.5 is mangled to 192.168.2.5 and so on.
add rule nat postrouting snat ip prefix to ip saddr map { 10.141.11.0/24 : 192.168.2.0/24 }

# map a source address, source port combination to a pool of destination addresses and ports
# (placed in prerouting, since dnat is only valid in the prerouting and output chains):
add rule nat prerouting dnat to ip saddr . tcp dport map { 192.168.1.2 . 80 : 10.141.10.2-10.141.10.5 . 8888-8999 }

# The above example generates the following NAT expression:
#
# [ nat dnat ip addr_min reg 1 addr_max reg 10 proto_min reg 9 proto_max reg 11 ]
#
# which expects to obtain the following tuple:
# IP address (min), source port (min), IP address (max), source port (max)
# to be obtained from the map. The given addresses and ports are inclusive.

# This also works with named maps and in combination with both concatenations and ranges:
table ip nat {
    map ipportmap {
        typeof ip saddr : interval ip daddr . tcp dport
        flags interval
        elements = { 192.168.1.2 : 10.141.10.1-10.141.10.3 . 8888-8999, 192.168.2.0/24 : 10.141.11.5-10.141.11.20 . 8888-8999 }
    }

    chain prerouting {
        type nat hook prerouting priority dstnat; policy accept;
        ip protocol tcp dnat ip to ip saddr map @ipportmap
    }
}

# @ipportmap maps network prefixes to a range of hosts and ports.
# The new destination is taken from the range provided by the map element.
# Same for the destination port.

# Note the use of the "interval" keyword in the typeof description.
# This is required so nftables knows that it has to ask for twice the
# amount of storage for each key-value pair in the map.

# ": ipv4_addr . inet_service" would allow associating one address and one port
# with each key. But for this case, for each key, two addresses and two ports
# (The minimum and maximum values for both) have to be stored.
MONITOR
- monitor [new | destroy] MONITOR_OBJECT
monitor trace
MONITOR_OBJECT := tables | chains | sets | rules | elements | ruleset
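The monitor command listens to Netlink events generated by the nf_tables subsystem. These events relate either to the creation and deletion of objects or to packets for which meta nftrace is enabled. Events are printed to stdout in either JSON or native nft format.
Use MONITOR_OBJECT to filter for events related to a specific kind of object.
Use new or destroy to filter for events related to a specific action.
The second form of invocation takes no options and prints only the events generated by packets with nftrace enabled.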
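Enabling nftrace
To enable nftrace, add the following statement to a rule:
meta nftrace set 1
nftrace can also be enabled for specific packets only. The rule below enables it for TCP packets:
ip protocol tcp meta nftrace set 1
Enabling nftrace in a dedicated chain
Using a dedicated chain to enable nftrace is recommended. The commands below trace from the prerouting hook; if a prerouting chain already exists, make sure trace_chain runs before it, that is, give it a lower priority value.
$ nft add chain filter trace_chain { type filter hook prerouting priority -301\; }
$ nft add rule filter trace_chain meta nftrace set 1
When debugging is finished, simply delete trace_chain:
% nft delete chain filter trace_chain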
monitor trace events
Once nftrace is enabled, the generated trace events can be monitored:
# Listen to all events, report in native nft format.
% nft monitor
# Listen to deleted rules, report in JSON format.
% nft -j monitor destroy rules
# Listen to both new and destroyed chains, in native nft format.
% nft monitor chains
# Listen to ruleset events such as table, chain, rule, set, counters and quotas, in native nft format.
% nft monitor ruleset
# Trace incoming packets from host 10.0.0.1.
% nft add rule filter input ip saddr 10.0.0.1 meta nftrace set 1
% nft monitor trace
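Enabling nftables in Kubernetes
In Kubernetes v1.29, nftables is used only when kube-proxy has the NFTablesProxyMode feature gate enabled and its mode set to nftables. Below is an example kubeadm configuration for bringing up a single-node cluster: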
apiVersion: kubeadm.k8s.io/v1beta3
bootstrapTokens:
- groups:
  - system:bootstrappers:kubeadm:default-node-token
  token: abcdef.0123456789abcdef
  ttl: 24h0m0s
  usages:
  - signing
  - authentication
kind: InitConfiguration
nodeRegistration:
  criSocket: unix:///var/run/containerd/containerd.sock
  imagePullPolicy: IfNotPresent
  name: node
  taints: null
---
apiServer:
  timeoutForControlPlane: 4m0s
apiVersion: kubeadm.k8s.io/v1beta3
certificatesDir: /etc/kubernetes/pki
clusterName: kubernetes
controllerManager: {}
etcd:
  local:
    dataDir: /var/lib/etcd
imageRepository: registry.k8s.io
kind: ClusterConfiguration
kubernetesVersion: 1.29.0
networking:
  dnsDomain: cluster.local
  serviceSubnet: 10.96.0.0/12
scheduler: {}
---
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
featureGates:
  NFTablesProxyMode: true
mode: nftables