问题现象:
alter 过程 rename 文件一直失败,导致 crash。
[Warning] InnoDB: Cannot rename file ./tradesupplymember/memberpropertyinfo_0024.ibd (space id 78102), retried 1000 times. There are either pending IOs or flushes or the file is being extended.
InnoDB: Assertion failure in thread 140414061098752 in file handler0alter.cc line 7763
官方bug:
https://bugs.mysql.com/bug.php?id=88511
问过赵伟,无法复现,只是猜测。在commit ddl的阶段,dml无法获取mdl锁。
系统当时的状态是真实hang住了。因为当前crash线程持有了dict_sys->mutex!
pt-pmp规约堆栈可以看到大量线程被卡住。
找到原因之后给官方提了一个新的bug:
https://bugs.mysql.com/bug.php?id=108087
影响的版本5.6, 5.7
错误日志:
[Warning] InnoDB: A long semaphore wait:
--Thread 139837716776704 has waited at ha_innodb.cc line 15484 for 488 seconds the semaphore:
Mutex at 0x7f320178a718, Mutex DICT_SYS created dict0dict.cc:1220, lock var 1
[Warning] InnoDB: A long semaphore wait:
--Thread 139837805946624 has waited at ibuf0ibuf.cc line 4494 for 489 seconds the semaphore:
X-lock on RW-latch at 0x7f30d9ba3760 created in file buf0buf.cc line 1461
a writer (thread id 139837600429824) has reserved it in mode SX
number of readers 0, waiters flag 1, lock_word: 10000000
Last time read locked in file row0sel.cc line 3526
Last time write locked in file buf/buf0flu.cc line 1213
2022-06-20T12:33:25.608401+08:00 0 [Warning] InnoDB: A long semaphore wait: --这里非常可疑!
--Thread 139837501830912 has waited at buf0buf.cc line 3992 for 382 seconds the semaphore:
S-lock on RW-latch at 0x7f2f94670290 created in file buf0buf.cc line 1461
a writer (thread id 139837501830912) has reserved it in mode exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time read locked in file row0merge.cc line 1976
Last time write locked in file buf/buf0buf.cc line 5178
[Warning] InnoDB: Cannot rename file ./tradesupplymember/memberpropertyinfo_0024.ibd (space id 78102), retried 1000 times. There are either pending IOs or flushes or the file is being extended.
InnoDB: Assertion failure in thread 140414061098752 in file handler0alter.cc line 7763
/u01/mysql57_20210630/bin/mysqld(my_print_stacktrace+0x2c)[0xf4cdbc]
/u01/mysql57_20210630/bin/mysqld(handle_fatal_signal+0x4b6)[0x7edce6]
/lib64/libpthread.so.0(+0xf620)[0x7f32030f1620]
/lib64/libc.so.6(gsignal+0x37)[0x7f3201ef2277]
/lib64/libc.so.6(abort+0x148)[0x7f3201ef3968]
/u01/mysql57_20210630/bin/mysqld[0x7b5607]
/u01/mysql57_20210630/bin/mysqld(_ZN11ha_innobase26commit_inplace_alter_tableEP5TABLEP18Alter_inplace_infob+0x1747)[0xfb5b47]
/u01/mysql57_20210630/bin/mysqld(_Z17mysql_alter_tableP3THDPKcS2_P24st_ha_create_informationP10TABLE_LISTP10Alter_info+0x356e)[0xd5142e]
/u01/mysql57_20210630/bin/mysqld(_ZN19Sql_cmd_alter_table7executeEP3THD+0x5a0)[0xe7f980]
/u01/mysql57_20210630/bin/mysqld(_Z21mysql_execute_commandP3THDb+0x8b2)[0xce2d02]
/u01/mysql57_20210630/bin/mysqld(_Z11mysql_parseP3THDP12Parser_state+0x3ad)[0xce765d]
/u01/mysql57_20210630/bin/mysqld(_Z16dispatch_commandP3THDPK8COM_DATA19enum_server_command+0xb5f)[0xce826f]
/u01/mysql57_20210630/bin/mysqld(_Z21do_threadpool_commandP12connection_tP8COM_DATA19enum_server_command+0x6e)[0x13504ce]
/u01/mysql57_20210630/bin/mysqld(_Z31threadpool_process_request_unixP3THDP12connection_tPi+0x157)[0x13506f7]
/u01/mysql57_20210630/bin/mysqld[0x1350cac]
/u01/mysql57_20210630/bin/mysqld[0x13516a5]
/u01/mysql57_20210630/bin/mysqld(pfs_spawn_thread+0x1b1)[0x12c3321]
/lib64/libpthread.so.0(+0x7e25)[0x7f32030e9e25]
/lib64/libc.so.6(clone+0x6d)[0x7f3201fb9f1d]
数据分析:
mysql> select count(*) from memberpropertyinfo_0024;
+----------+
| count(*) |
+----------+
| 8297851 |
+----------+
1 row in set (1.06 sec)
-rw-r----- 1 mysql mysql 4,978,638,848 Jul 27 12:14 /home/mysql/data3125/dbs3125/tradesupplymember/memberpropertyinfo_0024.ibd
core dump分析:
执行语句:
alter table memberpropertyinfo_0029 engine=innodb
对应代码:RDS_57: 7f1a8727c92
thread apply all bt full
在stack trace中搜索71538的线程。
dict_sys->mutex占用,整个rename操作都在里面,非常不合理。
8404 /* Latch the InnoDB data dictionary exclusively so that no deadlocks
8405 or lock waits can happen in it during the data dictionary operation. */
8406 row_mysql_lock_data_dictionary(trx);
1. rename在等待什么?thread#1:等待penging io完成,其中之一为thread#92正在完成io。
gdb:
打印ctx->old_table:
m_name = 0x7f2e35a24118 "tradesupplymember/memberpropertyinfo_0029"},
dir_path_of_temp_table = 0x0, data_dir_path = 0x0, tablespace = {m_name = 0x0},
space = 71538, flags = 33, flags2 = 80, ibd_file_missing = 0, cached = 1,
to_be_dropped = 1, n_def = 37, n_cols = 37, n_t_cols = 37, n_t_def = 37, n_v_def = 0,
n_v_cols = 0, can_be_evicted = 1, corrupted = 0, drop_aborted = 0,
thread#120: 打印space的具体信息。thread#120也被thread#1阻塞,情况与thread#114类似。
(gdb) p *space->chain.start
$4 = {space = 0x7f2e2705cfd8,
name = 0x7f2efd099b68 "./tradesupplymember/memberpropertyinfo_0029.ibd", is_open = true,
handle = {m_file = 157, m_psi = 0x0}, sync_event = 0x7f2e26b1e558, is_raw_disk = false,
size = 567808, flush_size = 567808, init_size = 4, max_size = 18446744073709551614,
n_pending = 19, n_pending_flushes = 0, being_extended = false,
modification_counter = 3184883449, flush_counter = 3184883449, chain = {prev = 0x0,
next = 0x0}, LRU = {prev = 0x0, next = 0x0}, punch_hole = true, block_size = 4096,
atomic_write = false, magic_n = 89389}
code:
3441 if (node->n_pending > 0
3442 || node->n_pending_flushes > 0
3443 || node->being_extended) {
3444 /* There are pending i/o's or flushes or the file is
3445 currently being extended, sleep for a while and
3446 retry */
3447 sleep = true;
stack:
#5 0x00000000007b5607 in ut_dbg_assertion_failed (
expr=expr@entry=0x178e20a "error == DB_SUCCESS",
file=file@entry=0x178e082 "handler/handler0alter.cc", line=line@entry=7763)
at ut/ut0dbg.cc:75
#6 0x0000000000fb5b47 in commit_cache_rebuild (ctx=0x7f2e26ac18a0)
at handler/handler0alter.cc:7763
#7 ha_innobase::commit_inplace_alter_table (this=0x7f2e26cbb030,
altered_table=<optimized out>, ha_alter_info=<optimized out>, commit=<optimized out>)
at handler/handler0alter.cc:8682
#8 0x0000000000d5142e in mysql_inplace_alter_table (target_mdl_request=0x7f2e73af0ea0,
alter_ctx=0x7f2e73af19f0, inplace_supported=HA_ALTER_INPLACE_NO_LOCK_AFTER_PREPARE,
ha_alter_info=0x7f2e73af1460, altered_table=0x7f2e26bdf820, table=0x7f2e26b17a20,
table_list=0x7f2e1a196278, thd=0x7f2e1a2bd000) at sql_table.cc:7890
#9 mysql_alter_table (thd=thd@entry=0x7f2e1a2bd000, new_db=<optimized out>,
new_name=<optimized out>, create_info=create_info@entry=0x7f2e73af25c0,
table_list=table_list@entry=0x7f2e1a196278, alter_info=alter_info@entry=0x7f2e73af2510)
at sql_table.cc:10210
#10 0x0000000000e7f980 in Sql_cmd_alter_table::execute (this=<optimized out>,
thd=0x7f2e1a2bd000) at sql_alter.cc:340
#11 0x0000000000ce2d02 in mysql_execute_command (thd=thd@entry=0x7f2e1a2bd000,
first_level=first_level@entry=true) at sql_parse.cc:5320
#12 0x0000000000ce765d in mysql_parse (thd=thd@entry=0x7f2e1a2bd000,
parser_state=parser_state@entry=0x7f2e73af3c90) at sql_parse.cc:6122
#13 0x0000000000ce826f in dispatch_command (thd=thd@entry=0x7f2e1a2bd000,
com_data=com_data@entry=0x7f31f81e2630, command=command@entry=COM_QUERY)
at sql_parse.cc:1647
#14 0x00000000013504ce in do_threadpool_command (connection=connection@entry=0x7f31f81e25e0,
com_data=com_data@entry=0x7f31f81e2630, command=COM_QUERY) at threadpool_unix.cc:2553
#15 0x00000000013506f7 in threadpool_process_request_unix (thd=0x7f2e1a2bd000,
connection=connection@entry=0x7f31f81e25e0, skip_io=skip_io@entry=0x7f2e73af4440)
at threadpool_unix.cc:1913
#16 0x0000000001350cac in handle_event (connection=connection@entry=0x7f31f81e25e0,
thdstack=thdstack@entry=0x7f2e73af45d0, this_thread=0x7f2e26a80020)
at threadpool_unix.cc:1989
2. io为何没有完成?thread#92:等待page(196609)的读锁,而锁被thread#114占有。
gdb:
请求io的页
(gdb) p page_id -ibuf_bitmap page
$15 = (const page_id_t &) @0x7f2e86bfab00: { m_space = 71538, m_page_no = 196609,
m_fold = 1998854003}
已完成io的页
(gdb) p bpage->id
$2 = {m_space = 71538, m_page_no = 207774, m_fold = 1998865168}
code:
4488 page_t* bitmap_page;
4489 ulint bitmap_bits;
4490
4491 ibuf_mtr_start(&mtr);
4492
4493 bitmap_page = ibuf_bitmap_get_map_page(
4494 page_id, *page_size, &mtr);
4495
4496 bitmap_bits = ibuf_bitmap_page_get_bits(
4497 bitmap_page, page_id, *page_size,
4498 IBUF_BITMAP_BUFFERED, &mtr);
4499
4500 ibuf_mtr_commit(&mtr);
4501
4502 if (!bitmap_bits) {
4503 /* No inserts buffered for this page */
4504
4505 fil_space_release(space);
4506 return;
4507 }
stack:
#6 0x0000000001157e65 in pfs_rw_lock_x_lock_func (lock=lock@entry=0x7f30d9ba3760,
file_name=file_name@entry=0x17907e0 "ibuf/ibuf0ibuf.cc", line=line@entry=4494, pass=0)
at include/sync0rw.ic:719
#7 0x000000000115e24a in buf_page_get_gen (page_id=..., page_size=...,
rw_latch=rw_latch@entry=2, guess=<optimized out>, guess@entry=0x0, mode=mode@entry=10,
file=<optimized out>, line=4494, mtr=0x7f2e86bfbf00,
dirty_with_no_latch=dirty_with_no_latch@entry=false) at buf/buf0buf.cc:4610
#8 0x0000000000fc5bc3 in ibuf_bitmap_get_map_page_func (page_id=..., page_size=...,
file=<optimized out>, line=<optimized out>, mtr=<optimized out>) at ibuf/ibuf0ibuf.cc:818
#9 0x0000000000fd0b7e in ibuf_merge_or_delete_for_page (block=block@entry=0x7f31e2055f30,
page_id=..., page_size=page_size@entry=0x7f31e2055f40,
update_ibuf_bitmap=update_ibuf_bitmap@entry=1) at ibuf/ibuf0ibuf.cc:4494
#10 0x0000000001160e8f in buf_page_io_complete (bpage=0x7f31e2055f30, evict=evict@entry=false
#11 0x00000000011d292d in fil_aio_wait (segment=segment@entry=4) at fil/fil0fil.cc:5951
#12 0x00000000010bedd8 in io_handler_thread (arg=<optimized out>) at srv/srv0start.cc:321
#13 0x00007f32030e9e25 in start_thread () from /lib64/libpthread.so.0
#14 0x00007f3201fb9f1d in clone () from /lib64/libc.so.6
3. 批量刷脏为何不能完成?thread#114:一直在等待thread#1的rename完成(在rename完成之前不会发起新的io),同时持有thread#92的锁。
gdb:
double write为120页的batch:
(gdb) p buf_dblwr->buf_block_arr[56].id
$74 = { m_space = 71538, m_page_no = 196609, m_fold = 1998854003}
code:
1054 space = fil_space_get_by_id(space_id);
1055
1056 if (space != NULL && space->stop_ios) {
1057 /* We are going to do a rename file and want to stop
1058 new i/o's for a while. */
1059
1060 if (count2 > 20000) {
1061 ib::warn() << "Tablespace " << space->name
1062 << " has i/o ops stopped for a long"
1063 " time " << count2;
1064 }
1065
1066 mutex_exit(&fil_system->mutex);
stack:
#0 0x00007f32030f0e8d in nanosleep () from /lib64/libpthread.so.0
#1 0x0000000001014320 in os_thread_sleep (tm=tm@entry=20000) at os/os0thread.cc:287
#2 0x00000000011daa57 in fil_mutex_enter_and_prepare_for_io (space_id=71538)
at fil/fil0fil.cc:1079
#3 0x00000000011dae5a in fil_io (type=..., sync=<optimized out>, page_id=..., page_size=...,
byte_offset=byte_offset@entry=0, len=16384, buf=0x7f3117aa8000, message=0x7f311609d260)
at fil/fil0fil.cc:5645
#4 0x0000000001167019 in buf_dblwr_write_block_to_datafile (bpage=0x7f311609d260,
sync=sync@entry=false) at buf/buf0dblwr.cc:931
#5 0x0000000001169c44 in buf_dblwr_flush_buffered_writes () at buf/buf0dblwr.cc:1077
#6 0x000000000116a1e4 in buf_dblwr_add_to_batch (bpage=bpage@entry=0x7f31af1dd930)
at buf/buf0dblwr.cc:1158
#7 0x0000000001176537 in buf_flush_write_block_low (sync=false, flush_type=BUF_FLUSH_LIST,
bpage=0x7f31af1dd930) at buf/buf0flu.cc:1114
#8 buf_flush_page (buf_pool=<optimized out>, bpage=0x7f31af1dd930,
flush_type=<optimized out>, sync=<optimized out>) at buf/buf0flu.cc:1245
#9 0x0000000001177227 in buf_flush_try_neighbors (page_id=...,
flush_type=flush_type@entry=BUF_FLUSH_LIST, n_flushed=n_flushed@entry=220,
n_to_flush=n_to_flush@entry=402) at buf/buf0flu.cc:1469
#10 0x000000000117850c in buf_flush_page_and_try_neighbors (flush_type=BUF_FLUSH_LIST,
count=<synthetic pointer>, n_to_flush=<optimized out>, bpage=<optimized out>)
at buf/buf0flu.cc:1541
#11 buf_do_flush_list_batch (buf_pool=buf_pool@entry=0x7f31f34a3958, min_n=min_n@entry=402,
lsn_limit=lsn_limit@entry=18446744073709551615) at buf/buf0flu.cc:1795
#12 0x000000000117a841 in buf_flush_batch (lsn_limit=18446744073709551615, min_n=402,
flush_type=BUF_FLUSH_LIST, buf_pool=<optimized out>) at buf/buf0flu.cc:1873
#13 buf_flush_do_batch (type=BUF_FLUSH_LIST, n_processed=0x7f30a724fd30,
lsn_limit=18446744073709551615, min_n=402, buf_pool=<optimized out>) at buf/buf0flu.cc:2033
#14 pc_flush_slot () at buf/buf0flu.cc:2891
#15 0x000000000117c005 in buf_flush_page_cleaner_coordinator (arg=<optimized out>)
at buf/buf0flu.cc:3316
#16 0x00007f32030e9e25 in start_thread () from /lib64/libpthread.so.0
#17 0x00007f3201fb9f1d in clone () from /lib64/libc.so.6
Note:这里同时也会卡主其他page cleaner线程,因为double write buffer被占满了。
#0 0x00007f32030ed9b5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x0000000001013b61 in wait (this=<optimized out>) at os/os0event.cc:179
#2 wait_low (reset_sig_count=<optimized out>, this=0x7f2efc721ab8) at os/os0event.cc:366
#3 os_event_wait_low (event=0x7f2efc721ab8, reset_sig_count=reset_sig_count@entry=42437697) at os/os0event.cc:611
#4 0x0000000001169d96 in buf_dblwr_add_to_batch (bpage=bpage@entry=0x7f2f8c0d4e60) at buf/buf0dblwr.cc:1114
#5 0x0000000001176537 in buf_flush_write_block_low (sync=false, flush_type=BUF_FLUSH_LIST, bpage=0x7f2f8c0d4e60) at buf/buf0flu.cc:1114
#6 buf_flush_page (buf_pool=<optimized out>, bpage=0x7f2f8c0d4e60, flush_type=<optimized out>, sync=<optimized out>) at buf/buf0flu.cc:1245
#7 0x0000000001177227 in buf_flush_try_neighbors (page_id=..., flush_type=flush_type@entry=BUF_FLUSH_LIST, n_flushed=n_flushed@entry=269, n_to_flush=n_to_flush@entry=402) at buf/buf0flu.cc:1469
#8 0x000000000117850c in buf_flush_page_and_try_neighbors (flush_type=BUF_FLUSH_LIST, count=<synthetic pointer>, n_to_flush=<optimized out>, bpage=<optimized out>) at buf/buf0flu.cc:1541
#9 buf_do_flush_list_batch (buf_pool=buf_pool@entry=0x7f31f3545118, min_n=min_n@entry=402, lsn_limit=lsn_limit@entry=18446744073709551615) at buf/buf0flu.cc:1795
#10 0x000000000117a841 in buf_flush_batch (lsn_limit=18446744073709551615, min_n=402, flush_type=BUF_FLUSH_LIST, buf_pool=<optimized out>) at buf/buf0flu.cc:1873
#11 buf_flush_do_batch (type=BUF_FLUSH_LIST, n_processed=0x7f30a724fe98, lsn_limit=18446744073709551615, min_n=402, buf_pool=<optimized out>) at buf/buf0flu.cc:2033
#12 pc_flush_slot () at buf/buf0flu.cc:2891
#13 0x000000000117b94d in buf_flush_page_cleaner_worker (arg=<optimized out>) at buf/buf0flu.cc:3522
#14 0x00007f32030e9e25 in start_thread () from /lib64/libpthread.so.0
#15 0x00007f3201fb9f1d in clone () from /lib64/libc.so.6
Note:如果不是读取ibuf对应当前space的页面,也会有问题。因为batch中可能持有其他页的锁。
4. Thread#92的读IO是谁发起的?Thread#120
Thread#120主动做change buffer的merge,通过异步读取页面实现。
Threadf#92中页面{m_space = 71538, m_page_no = 207774, m_fold = 1998865168}应为Thread#120发起。当然,也不排除由预读发起。
stack:
#0 0x00007f32030f0e8d in nanosleep () from /lib64/libpthread.so.0
#1 0x0000000001014320 in os_thread_sleep (tm=tm@entry=20000) at os/os0thread.cc:287
#2 0x00000000011daa57 in fil_mutex_enter_and_prepare_for_io (space_id=71538)
at fil/fil0fil.cc:1079
#3 0x00000000011dae5a in fil_io (type=..., sync=sync@entry=false, page_id=...,
page_size=..., byte_offset=byte_offset@entry=0, len=16384, buf=0x7f2fea3ac000,
message=message@entry=0x7f2fe348b090) at fil/fil0fil.cc:5645
#4 0x000000000118b7d5 in buf_read_page_low (err=0x7f2e771f9efc, sync=<optimized out>,
type=0, mode=<optimized out>, page_id=..., page_size=..., unzip=true)
at buf/buf0rea.cc:203
#5 0x000000000118da6d in buf_read_ibuf_merge_pages (sync=sync@entry=false,
space_ids=space_ids@entry=0x7f2e771f9fb0, page_nos=page_nos@entry=0x7f2e771f9f70,
n_stored=1) at buf/buf0rea.cc:847
#6 0x0000000000fcd780 in ibuf_merge_pages (n_pages=n_pages@entry=0x7f2e771fa600,
sync=sync@entry=false) at ibuf/ibuf0ibuf.cc:2560
#7 0x0000000000fcfa2b in ibuf_merge (sync=false, sync=false, n_pages=0x7f2e771fa600)
at ibuf/ibuf0ibuf.cc:2664
#8 ibuf_merge_in_background (full=full@entry=false) at ibuf/ibuf0ibuf.cc:2729
#9 0x00000000010bc059 in srv_master_do_active_tasks () at srv/srv0srv.cc:2209
救急解法:关闭change buffer!!!
| innodb_change_buffering | all |
MariaDB解法:
https://jira.mariadb.org/browse/MDEV-16124
https://github.com/MariaDB/server/commit/55abcfa7b70968246a1a26a8839013ebb8f5c506
Marko的修改利用操作系统本身支持在rename过程中进行io操作,去除了stop_ios逻辑。属于最优雅的解法,不建议在GA版本中采用。另外如果采用,要注意xtrbackup也要同步更新。
MDEV-16124 fil_rename_tablespace() times out and crashes server
during table-rebuilding ALTER TABLE
InnoDB insisted on closing the file handle before renaming a file.
Renaming a file should never be a problem on POSIX systems. Also on
Windows it should work if the file was opened in FILE_SHARE_DELETE
mode.
fil_space_t::stop_ios: Remove. We no longer need to stop file access
during rename operations.
fil_mutex_enter_and_prepare_for_io(): Remove the wait for stop_ios.
fil_rename_tablespace(): Remove the retry logic; do not close the
file handle. Remove the unused fault injection that was added along
with the DATA DIRECTORY functionality (MySQL WL#5980).
os_file_create_simple_func(), os_file_create_func(),
os_file_create_simple_no_error_handling_func(): Include FILE_SHARE_DELETE
in the share_mode. (We will still prevent multiple InnoDB instances
from using the same files by not setting FILE_SHARE_WRITE.)
AliSQL解法:
MySQL-8.0上不存在该问题,通过代码分析发现sunny在重构代码时把这个bug修复了(偶然因素,官方并没有意识到这个bug存在)。
https://github.com/mysql/mysql-server/commit/e592f604488d3caf50b0765e9d4fa0378423fd6c
WL#8619 - InnoDB: Provide offline database portability without ISL files
Summary of changes.
1. Introduce --innodb-directories := "dir1;...;dirN". This is renamed
from --innodb-scan-directories which was introduced in WL#9499.
Currently this is not dynamic.
2. Revert to 5.6 behaviour where we scan directories and find .ibd files.
If you move tablespaces to a new location, add the path to
innodb-data-directories so that InnoDB can "discover" the files on
startup.
3. Tablespaces can only be created under known directories. This
eliminates the need for .isl files.
4. Upgrade redo log format, not backward compatible
5. Change redo log type IDs
6. Code cleanup
7. Shard the fil system data structures for performance
8. Remove WL#9499 files that were used to track open tablespaces.
RB#16842 Approved by: Jimmy and Kevin.
注意看第7条!
AliSQL只需要参考这个修改即可。
官方在后续的开发中,将stop_ios改为了prevent_file_open。
https://github.com/mysql/mysql-server/commit/b184bd30f94df30a8bf178fc327590c5865d33bc
commit b184bd30f94df30a8bf178fc327590c5865d33bc
Author: Marcin Babij <marcin.babij@oracle.com>
Date: Thu Sep 23 21:04:35 2021 +0200
WL#14591 InnoDB: Make system variable `innodb_open_files` dynamic