How InnoDB performs a checkpoint

简介: 原文链接 InnoDB’s checkpoint algorithm is not well documented. It is too complex to explain in even a long blog post, because to under...


InnoDB’s checkpoint algorithm is not well documented. It is too complex to explain in even a long blog post, because to understand checkpoints, you need to understand a lot of other things that InnoDB does. I hope that explaining how InnoDB does checkpoints in high-level terms, with simplifications, will be helpful. A lot of the simplifications are because I do not want to explain the complexities of how the simple rules can be tweaked for optimization purposes, while not violating the ACID guarantees they enforce.

A bit of background: Gray and Reuter’s classic text on transaction processing introduced two types of checkpoints beginning on page 605. There is a sharp checkpoint, and there is a fuzzy checkpoint.

A sharp checkpoint is accomplished by flushing all modified pages for committed transactions to disk, and writing down the log sequence number (LSN) of the most recent committed transaction. Modified pages for uncommitted transactions should not be flushed – that would violate the rule of write-ahead logging. (This is a deliberate and gross over-simplification; I will not draw attention to further simplifications I make.) Upon recovery, log REDO can start from the LSN at which the checkpoint took place. A sharp checkpoint is called “sharp” because everything that is flushed to disk for the checkpoint is consistent as of a single point in time – the checkpoint LSN.

A fuzzy checkpoint is more complex. It flushes pages as time passes, until it has flushed all pages that a sharp checkpoint would have done. It completes by writing down two LSNs: when the checkpoint started and when it ended. But the pages it flushed might not all be consistent with each other as of a single point in time, which is why it’s called “fuzzy.” A page that got flushed early might have been modified since then, and a page that got flushed late might have a newer LSN than the starting LSN. A fuzzy checkpoint can conceptually be converted into a sharp checkpoint by performing REDO from the starting LSN to the ending LSN. Upon recovery, then, REDO can begin from the LSN at which the checkpoint started.

It is often said that InnoDB does fuzzy checkpointing. The truth is, it does both types. When the database shuts down, it performs a sharp checkpoint. During normal operation, it performs fuzzy checkpoints. And InnoDB’s implementation of fuzzy checkpoints is not exactly the same as that described in Gray & Reuter.

Here is where the weeds get deep: I will try to explain some of the subtleties that let InnoDB provide uniform quality of service by performing checkpoints almost constantly, instead of checkpoints being significant events that occur periodically. It can be said that InnoDB actually never does a checkpoint during normal operation. Instead, the state of the database on disk is a constantly advancing fuzzy checkpoint. The advances are performed by regular flushing of dirty pages as a normal part of the database’s operation. The details are far too many and complex to write here, in part because they have changed significantly as new versions have been released, but I will try to sketch the outline.

InnoDB maintains a large buffer pool in memory with many database pages, and doesn’t write modifications to disk immediately. Instead, it keeps dirty pages in memory, hoping that they will be modified many times before they are written to disk. This is called write combining, and is a performance optimization. InnoDB keeps track of the pages in the buffer pool through several lists: the free list notes which pages are available to be used, the LRU list notes which pages have been used least recently, and the flush list contains all of the dirty pages in LSN order, least-recently-modified first.

These lists are all important, but for the simplified explanation here, I will focus on the flush list. InnoDB has limited space in the buffer pool, and if there aren’t any free spots to store a page that InnoDB needs to read from disk, it must flush and free a dirty page to make room. This is slow, so InnoDB tries to avoid the need for this by flushing dirty pages continually, keeping a reserve of clean pages that can be replaced without having to be flushed. It flushes the oldest-modified pages from the flush list on a regular basis, trying to keep from hitting certain high-water marks. It chooses the pages based on their physical locations on disk and their LSN (which is their modification time).

In addition to avoiding the high-water marks, InnoDB must avoid a very important low-water mark as well. The transaction logs (aka REDO logs, WAL logs) in InnoDB are fixed-size, and are written in a circular fashion. But spaces in the logs cannot be overwritten if they contain records of changes to a dirty page that hasn’t been flushed yet. If that happened and the server crashed, all records of those changes would be lost. Therefore, InnoDB has a limited amount of time to write out a given page’s modifications, because the ongoing transaction logging is hungry for space in the logs. The size of the logs imposes the limit. If the log writing activity wraps around in a circle and bumps into its own tail, it will cause a very bad server stall while InnoDB scrambles to free up some room in the logs. This is why InnoDB generally chooses to flush in oldest-modification order: the oldest-modified pages are the furthest behind in the logs, and will be bumped into first. The oldest unflushed dirty page’s LSN is the low-water mark in the transaction logs, and InnoDB tries to raise that low-water mark to keep as much room available in the transaction logs as it can. Making the logs larger reduces the urgency of freeing up log space and and permits various performance optimizations to do the flushing more efficiently.

And now, with that simplified explanation in place, we can understand how InnoDB actually makes a fuzzy checkpoint. When InnoDB flushes dirty pages to disk, it finds the oldest dirty page’s LSN and treats that as the checkpoint low-water mark. It then writes this to the transaction log header. You can see this in the functions log_checkpoint_margin() and log_checkpoint(). Therefore, every time InnoDB flushes dirty pages from the head of the flush list, it is actually making a checkpoint by advancing the oldest LSN in the system. And that is how continual fuzzy checkpointing is implemented without ever “doing a checkpoint” as a separate event. If there is a server crash, then recovery simply proceeds from the oldest LSN onwards.

When InnoDB shuts down, it does some additional work. First, it stops all updates to data; then it flushes all dirty buffers to disk; then it writes the current LSN to the transaction logs. This is the sharp checkpoint. Additionally, it writes the LSN to the first page of each data file in the database, which serves as a signal that they have been checkpointed up to that LSN. This permits further optimizations during recovery and when opening these data files.

There is a lot more to study if you want to learn how it’s really done in detail; there are many fine points to the process. This is one area where the usually excellent manual is a bit lacking. Some of the best resources are as follows:

Gray and Reuter’s book
Mark Callaghan’s note on fuzzy checkpoints
Peter Zaitsev’s post on why the flushing algorithm in older InnoDB used to cause spikes
Mark Callaghan’s slides from Percona Performance Conference 2009
A related topic that is equally as complex is how InnoDB flushes dirty pages at the right speed to keep up with the database’s workload. Too fast and the server does too much work; too slow and it gets behind and hurries to catch up, causing spikes of furious flushing activity and degraded quality of service. Percona Server has arguably the most advanced and effective algorithms for this, in the XtraDB storage engine (a variant of InnoDB). Percona Server calls it “adaptive checkpointing.” InnoDB followed suit by implementing something similar, but harder to tune correctly. InnoDB calls it “adaptive flushing,” which is a more accurate name. Much (and I do mean much!) has been written about this. I know that Vadim has done hundreds of benchmarks to analyze how flushing and checkpointing works, some of them many hours long to study long-term performance characteristics. I will point you to a couple of pages that I think are the most succinct summaries of the implementation and how it performs:

Dimitri Kravtchuk’s blog post about adaptive flushing and the innodb_io_capacity variable
Vadim’s benchmarks of Percona Server and MySQL 5.5.8, showing how to tune so that “adaptive flushing” works well
Percona Server documentation for adaptive checkpointing
My own blog post about balancing dirty page flushing and write combining
If these types of topics interest you, you should attend Percona Live in San Francisco in a couple of weeks. Both Peter Zaitsev and Mark Callaghan will be speaking there on topics such as InnoDB internals, along with a variety of other speakers; there is a several-hour class on InnoDB internals.

I’m Baron Schwartz, the founder and CEO of VividCortex. I am the author of High Performance MySQL and many open-source tools for performance analysis, monitoring, and system administration. I contribute to various database communities such as Oracle, PostgreSQL, Redis and MongoDB.

目录
相关文章
|
SQL NoSQL AliSQL
MySQL RENAME hang问题分析与修复
问题现象:alter 过程 rename 文件一直失败,导致 crash。[Warning] InnoDB: Cannot rename file ./tradesupplymember/memberpropertyinfo_0024.ibd (space id 78102), retried 1000 times. There are either pending IOs or flushes 
696 0
MySQL RENAME hang问题分析与修复
Grafana实现图表双Y坐标轴展示
Grafana实现图表双Y坐标轴展示
|
存储 缓存 关系型数据库
MySQL 查询优化方法
在数据库应用中,高效的查询性能至关重要。本文探讨了常用的 MySQL 查询优化方法,包括索引优化(选择合适的索引字段、复合索引、定期维护索引)、查询语句优化(避免全表扫描、限制返回行数、避免使用不必要的函数)、表结构优化(选择合适的数据类型、分区表、定期清理无用数据)及数据库配置优化(调整缓存大小、优化存储引擎参数)。通过这些方法,可以显著提高 MySQL 的查询性能,为应用程序提供更好的用户体验。
1005 4
|
存储 关系型数据库 分布式数据库
内附原文|详解SIGMOD’24最佳论文:PolarDB破解多主架构经典难题
在今年的SIGMOD会议上,阿里云瑶池数据库团队的论文《PolarDB-MP: A Multi-Primary Cloud-Native Database via Disaggregated Shared Memory》获得了Industry Track Best Paper Award,这是中国企业独立完成的成果首次摘得SIGMOD最高奖。PolarDB-MP是基于分布式共享内存的多主云原生数据库,本文将介绍这篇论文的具体细节。
内附原文|详解SIGMOD’24最佳论文:PolarDB破解多主架构经典难题
|
数据采集 XML 前端开发
Python爬虫实战:利用代理IP爬取百度翻译
Python 爬虫实战:利用代理 IP 爬取百度翻译
756 2
|
自然语言处理 监控 关系型数据库
mysql造数据占用临时表空间
【5月更文挑战第20天】MySQL在处理复杂查询时可能使用临时表,可能导致性能下降。临时表用于排序、分组和连接操作。常见问题包括内存限制、未优化的查询、数据类型不当和临时表清理。避免过度占用的策略包括优化查询、调整系统参数、优化数据类型和事务管理。使用并行查询、分区表和监控工具也能帮助管理临时表空间。通过智能问答工具如通义灵码,可实时续写SQL和获取优化建议。注意监控`Created_tmp_tables`和`Created_tmp_disk_tables`以了解临时表使用状况。
714 5
|
存储 SQL 关系型数据库
MySQL之深入InnoDB存储引擎——Checkpoint机制
一、引入 由于页的操作首先都是在缓冲池中完成的,那么如果一条DML语句改变了页中的记录,那么此时页就是脏的,即缓冲池中页的版本要比磁盘的新。那么数据库需要将新版本的页刷新到磁盘。倘若每次一个页发生变化就刷新,那么开销会很大,若热点数据集中在某几个页中,那么数据库的性能将变得非常差。
|
SQL 数据库
【SQL】已解决:SQL分组去重并合并相同数据
【SQL】已解决:SQL分组去重并合并相同数据
294 1
|
设计模式 数据可视化 程序员
设计模式学习心得之前置知识 UML图看法与六大原则(上)
设计模式学习心得之前置知识 UML图看法与六大原则(上)
142 0
|
监控 Linux
深入了解Linux的pmap命令:进程内存映射的利器
`pmap`是Linux下分析进程内存映射的工具,显示内存区域、权限、大小等信息。通过`/proc/[pid]/maps`获取数据,特点包括详细、实时和灵活。参数如`-x`显示扩展信息,`-d`显示设备。示例:`pmap -x 1234`查看进程1234的映射。注意权限、实时性和准确性。结合其他工具定期监控,排查内存问题。