标签
PostgreSQL , Oracle , checksum , IO错误
背景
开启数据库block checksum可以发现磁盘、存储、IO系统的问题引入一些的物理错误。
PostgreSQL checksum
在PostgreSQL中,默认强制对XLOG(WAL)开启了checksum,因此可以保证从redo buffer写入redo file里面的数据是一致的,读取的时候也会根据每个REDO PAGE的checksum检查REDO PAGE的内容是否正确。(换言之没有参数来关闭WAL的checksum)
而对于DATAFILE,是否开启CHECKSUM则取决于initdb时配置的checksum参数。
initdb
-k, --data-checksums use data page checksums
通过pg_controldata检查当前数据库集群是否开启了checksum。
pg_controldata
Data page checksum version: 0
PostgreSQL 11 允许用户动态的修改checksum的开关,而不是初始化实例时固定:
Oracle checksum
Checksums allow Oracle to detect corruption caused by underlying disks, storage systems, or I/O systems.
If this parameter is set to OFF, DBWn calculates checksums only for the SYSTEM tablespace, but not for user tablespaces. In addition, no log checksum is performed when this parameter is set to OFF.
https://docs.oracle.com/cd/B28359_01/server.111/b28320/initparams046.htm#REFRN10030
DB_BLOCK_CHECKSUM determines whether DBWn and the direct loader will calculate a checksum (a number calculated from all the bytes stored in the block) and store it in the cache header of every data block when writing it to disk. Checksums are verified when a block is read - only if this parameter is TYPICAL or FULL and the last write of the block stored a checksum. In FULL mode, Oracle also verifies the checksum before a change application from update/delete statements and recomputes it after the change is applied. In addition, Oracle gives every log block a checksum before writing it to the current log.
Starting with Oracle Database 11g, most of the log block checksum is done by the generating foreground processes, while the LGWR performs the rest of the work, for better CPU and cache efficiency. Prior to Oracle Database 11g, the LGWR solely performed the log block checksum. When this parameter is set to FULL, the LGWR verifies the checksum of each log block generated by the foreground processes before writing it to disk.
If this parameter is set to OFF, DBWn calculates checksums only for the SYSTEM tablespace, but not for user tablespaces. In addition, no log checksum is performed when this parameter is set to OFF.
Checksums allow Oracle to detect corruption caused by underlying disks, storage systems, or I/O systems. If set to FULL, DB_BLOCK_CHECKSUM also catches in-memory corruptions and stops them from making it to the disk. Turning on this feature in TYPICAL mode causes only an additional 1% to 2% overhead. In the FULL mode it causes 4% to 5% overhead. Oracle recommends that you set DB_BLOCK_CHECKSUM to TYPICAL.
For backward compatibility the use of TRUE (implying TYPICAL) and FALSE (implying OFF) values is preserved.
PostgreSQL checksum 行为
1、开启checksum后,PostgreSQL 从shared buffer把数据write出去,需要计算checksum。
2、开启checksum后,从shared buffer外面(disk, os page cache)读取BLOCK到shared buffer里面,需要计算block的checksum,对比存储在page head里头的checksum是否一致。
3、已经在shared buffer里面的block,变更、读取时并不需要计算checksum。
checksum计算逻辑
1、pg_checksum_page
src/include/storage/checksum_impl.h
/*
* Compute the checksum for a Postgres page. The page must be aligned on a
* 4-byte boundary.
*
* The checksum includes the block number (to detect the case where a page is
* somehow moved to a different location), the page header (excluding the
* checksum itself), and the page data.
*/
uint16
pg_checksum_page(char *page, BlockNumber blkno)
{
PageHeader phdr = (PageHeader) page;
uint16 save_checksum;
uint32 checksum;
/* We only calculate the checksum for properly-initialized pages */
Assert(!PageIsNew(page));
/*
* Save pd_checksum and temporarily set it to zero, so that the checksum
* calculation isn't affected by the old checksum stored on the page.
* Restore it after, because actually updating the checksum is NOT part of
* the API of this function.
*/
save_checksum = phdr->pd_checksum;
phdr->pd_checksum = 0;
checksum = pg_checksum_block(page, BLCKSZ);
phdr->pd_checksum = save_checksum;
/* Mix in the block number to detect transposed pages */
checksum ^= blkno;
/*
* Reduce to a uint16 (to fit in the pd_checksum field) with an offset of
* one. That avoids checksums of zero, which seems like a good idea.
*/
return (checksum % 65535) + 1;
}
2、异步写(bg writer, backend process evict dirty page)时,计算checksum
src/backend/storage/buffer/bufmgr.c
/*
* FlushBuffer
* Physically write out a shared buffer.
*
* NOTE: this actually just passes the buffer contents to the kernel; the
* real write to disk won't happen until the kernel feels like it. This
* is okay from our point of view since we can redo the changes from WAL.
* However, we will need to force the changes to disk via fsync before
* we can checkpoint WAL.
*
* The caller must hold a pin on the buffer and have share-locked the
* buffer contents. (Note: a share-lock does not prevent updates of
* hint bits in the buffer, so the page could change while the write
* is in progress, but we assume that that will not invalidate the data
* written.)
*
* If the caller has an smgr reference for the buffer's relation, pass it
* as the second parameter. If not, pass NULL.
*/
static void
FlushBuffer(BufferDesc *buf, SMgrRelation reln)
/*
* Update page checksum if desired. Since we have only shared lock on the
* buffer, other processes might be updating hint bits in it, so we must
* copy the page to private storage if we do checksumming.
*/
bufToWrite = PageSetChecksumCopy((Page) bufBlock, buf->tag.blockNum);
3、将数据从shared buffer外围读入shared buffer时,校验checksum
src/backend/storage/page/bufpage.c
/*
* PageIsVerified
* Check that the page header and checksum (if any) appear valid.
*
* This is called when a page has just been read in from disk. The idea is
* to cheaply detect trashed pages before we go nuts following bogus item
* pointers, testing invalid transaction identifiers, etc.
*
* It turns out to be necessary to allow zeroed pages here too. Even though
* this routine is *not* called when deliberately adding a page to a relation,
* there are scenarios in which a zeroed page might be found in a table.
* (Example: a backend extends a relation, then crashes before it can write
* any WAL entry about the new page. The kernel will already have the
* zeroed page in the file, and it will stay that way after restart.) So we
* allow zeroed pages here, and are careful that the page access macros
* treat such a page as empty and without free space. Eventually, VACUUM
* will clean up such a page and make it usable.
*/
bool
PageIsVerified(Page page, BlockNumber blkno)
{
PageHeader p = (PageHeader) page;
size_t *pagebytes;
int i;
bool checksum_failure = false;
bool header_sane = false;
bool all_zeroes = false;
uint16 checksum = 0;
/*
* Don't verify page data unless the page passes basic non-zero test
*/
if (!PageIsNew(page))
{
if (DataChecksumsEnabled())
{
checksum = pg_checksum_page((char *) page, blkno);
if (checksum != p->pd_checksum)
checksum_failure = true;
}
...................
/*
* Set checksum for a page in shared buffers.
*
* If checksums are disabled, or if the page is not initialized, just return
* the input. Otherwise, we must make a copy of the page before calculating
* the checksum, to prevent concurrent modifications (e.g. setting hint bits)
* from making the final checksum invalid. It doesn't matter if we include or
* exclude hints during the copy, as long as we write a valid page and
* associated checksum.
*
* Returns a pointer to the block-sized data that needs to be written. Uses
* statically-allocated memory, so the caller must immediately write the
* returned page and not refer to it again.
*/
char *
PageSetChecksumCopy(Page page, BlockNumber blkno)
{
static char *pageCopy = NULL;
/* If we don't need a checksum, just return the passed-in data */
if (PageIsNew(page) || !DataChecksumsEnabled())
return (char *) page;
/*
* We allocate the copy space once and use it over on each subsequent
* call. The point of palloc'ing here, rather than having a static char
* array, is first to ensure adequate alignment for the checksumming code
* and second to avoid wasting space in processes that never call this.
*/
if (pageCopy == NULL)
pageCopy = MemoryContextAlloc(TopMemoryContext, BLCKSZ);
memcpy(pageCopy, (char *) page, BLCKSZ);
((PageHeader) pageCopy)->pd_checksum = pg_checksum_page(pageCopy, blkno);
return pageCopy;
}
/*
* Set checksum for a page in private memory.
*
* This must only be used when we know that no other process can be modifying
* the page buffer.
*/
void
PageSetChecksumInplace(Page page, BlockNumber blkno)
{
/* If we don't need a checksum, just return */
if (PageIsNew(page) || !DataChecksumsEnabled())
return;
((PageHeader) page)->pd_checksum = pg_checksum_page((char *) page, blkno);
}
性能测试
对比开启checksum, 关闭checksum的性能。
从代码我们已经了解什么时候会需要计算checksum,所以设计一个这个的CASE,数据大量写出,使得bgwrite writeout dirty page频繁。
用到如下CASE
《HTAP数据库 PostgreSQL 场景与性能测试之 42 - (OLTP+OLAP) unlogged table 不含索引多表批量写入》
测试结果
checksum=0
transaction type: ./test.sql
scaling factor: 1
query mode: prepared
number of clients: 56
number of threads: 56
duration: 120 s
number of transactions actually processed: 885490
latency average = 7.588 ms
latency stddev = 10.896 ms
tps = 7376.390493 (including connections establishing)
tps = 7377.158206 (excluding connections establishing)
script statistics:
- statement latencies in milliseconds:
0.002 \set sid random(1,1024)
7.586 select ins_sensor(:sid, 1000);
checksum=1
transaction type: ./test.sql
scaling factor: 1
query mode: prepared
number of clients: 56
number of threads: 56
duration: 120 s
number of transactions actually processed: 867269
latency average = 7.748 ms
latency stddev = 20.287 ms
tps = 7225.742548 (including connections establishing)
tps = 7226.431737 (excluding connections establishing)
script statistics:
- statement latencies in milliseconds:
0.002 \set sid random(1,1024)
7.746 select ins_sensor(:sid, 1000);
开启checksum时,观察perf,可以看到pg_checksum_page占用了少量overhead,体现在性能差异上也非常的小。
0.64% postgres [.] pg_checksum_page
小结
在bgwriter, backend process write dirty page时,需要计算checksum。(耗费CPU)
在从shared buffer外面读入page时,需要校验checksum。(耗费CPU)
当shared buffer较少,同时产生脏页较快,bg writer 或者backend process 刷脏页较频繁时,可能会使得计算checksum引入一定的CPU消耗。实测极端写出的情况下,pg_checksum_page引入了0.64%左右的开销。
因此开启checksum,实际引入的开销并不大。
参考
1、https://www.postgresql.org/docs/11/static/runtime-config-preset.html
data_checksums (boolean)
Reports whether data checksums are enabled for this cluster. See data checksums for more information.
2、https://www.postgresql.org/docs/11/static/runtime-config-developer.html
ignore_checksum_failure (boolean)
Only has effect if data checksums are enabled.
Detection of a checksum failure during a read normally causes PostgreSQL to report an error, aborting the current transaction. Setting ignore_checksum_failure to on causes the system to ignore the failure (but still report a warning), and continue processing. This behavior may cause crashes, propagate or hide corruption, or other serious problems. However, it may allow you to get past the error and retrieve undamaged tuples that might still be present in the table if the block header is still sane. If the header is corrupt an error will be reported even if this option is enabled. The default setting is off, and it can only be changed by a superuser.
3、https://www.postgresql.org/docs/11/static/pageinspect.html#id-1.11.7.31.4
page_header(page bytea) returns record
page_header shows fields that are common to all PostgreSQL heap and index pages.
A page image obtained with get_raw_page should be passed as argument. For example:
test=# SELECT * FROM page_header(get_raw_page('pg_class', 0));
lsn | checksum | flags | lower | upper | special | pagesize | version | prune_xid
-----------+----------+--------+-------+-------+---------+----------+---------+-----------
0/24A1B50 | 0 | 1 | 232 | 368 | 8192 | 8192 | 4 | 0
The returned columns correspond to the fields in the PageHeaderData struct. See src/include/storage/bufpage.h for details.
The checksum field is the checksum stored in the page, which might be incorrect if the page is somehow corrupted. If data checksums are not enabled for this instance, then the value stored is meaningless.
page_checksum(page bytea, blkno int4) returns smallint
page_checksum computes the checksum for the page, as if it was located at the given block.
A page image obtained with get_raw_page should be passed as argument. For example:
test=# SELECT page_checksum(get_raw_page('pg_class', 0), 0);
page_checksum
---------------
13443
Note that the checksum depends on the block number, so matching block numbers should be passed (except when doing esoteric debugging).
The checksum computed with this function can be compared with the checksum result field of the function page_header. If data checksums are enabled for this instance, then the two values should be equal.
4、https://www.postgresql.org/docs/11/static/pgverifychecksums.html
pg_verify_checksums
pg_verify_checksums — verify data checksums in an offline PostgreSQL database cluster
Synopsis
pg_verify_checksums [option] [[-D] datadir]
Description
pg_verify_checksums verifies data checksums in a PostgreSQL cluster.
5、https://docs.oracle.com/cd/B28359_01/server.111/b28320/initparams046.htm#REFRN10030
DB_BLOCK_CHECKSUM determines whether DBWn and the direct loader will calculate a checksum (a number calculated from all the bytes stored in the block) and store it in the cache header of every data block when writing it to disk. Checksums are verified when a block is read - only if this parameter is TYPICAL or FULL and the last write of the block stored a checksum. In FULL mode, Oracle also verifies the checksum before a change application from update/delete statements and recomputes it after the change is applied. In addition, Oracle gives every log block a checksum before writing it to the current log.
Starting with Oracle Database 11g, most of the log block checksum is done by the generating foreground processes, while the LGWR performs the rest of the work, for better CPU and cache efficiency. Prior to Oracle Database 11g, the LGWR solely performed the log block checksum. When this parameter is set to FULL, the LGWR verifies the checksum of each log block generated by the foreground processes before writing it to disk.
If this parameter is set to OFF, DBWn calculates checksums only for the SYSTEM tablespace, but not for user tablespaces. In addition, no log checksum is performed when this parameter is set to OFF.
Checksums allow Oracle to detect corruption caused by underlying disks, storage systems, or I/O systems. If set to FULL, DB_BLOCK_CHECKSUM also catches in-memory corruptions and stops them from making it to the disk. Turning on this feature in TYPICAL mode causes only an additional 1% to 2% overhead. In the FULL mode it causes 4% to 5% overhead. Oracle recommends that you set DB_BLOCK_CHECKSUM to TYPICAL.
For backward compatibility the use of TRUE (implying TYPICAL) and FALSE (implying OFF) values is preserved.