之前发生过一起连续drop压缩表,最后长时间等待信号量crash,线上alert log里的报错是:
OS WAIT ARRAY INFO: reservation count 36647199, signal count 34050225 --Thread 1331538240 has waited at row0purge.c line 680 for 950.00 seconds the semaphore: S-lock on RW-latch at 0xe60b60 '&dict_operation_lock' a writer (thread id 1383319872) has reserved it in mode exclusive number of readers 0, waiters flag 1, lock_word: 0 Last time read locked in file row0purge.c line 680 Last time write locked in file /home/jiyuan/rpmbuild/BUILD/tb-mysql-5.5.18/storage/innobase/row/row0mysql.c line 3213
row_drop_table_for_mysql { …… /* Serialize data dictionary operations with dictionary mutex: no deadlocks can occur then in these operations */ if (trx->dict_operation_lock_mode != RW_X_LATCH) { /* Prevent foreign key checks etc. while we are dropping the table */ row_mysql_lock_data_dictionary(trx); locked_dictionary = TRUE; } …… }
Index: tests/db/common.lua =================================================================== --- tests/db/common.lua (revision 2462) +++ tests/db/common.lua (revision 2467) @@ -18,9 +18,15 @@ end i = table_id + + local comp_str=" " + if mysql_use_compress ~= 0 then + comp_str=" ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE="..mysql_use_compress + end - print("Creating table 'sbtest" .. i .. "'...") - if (db_driver == "mysql") then + print("Creating table 'sbtest" .. i) + + if (db_driver == "mysql") then query = [[ CREATE TABLE sbtest]] .. i .. [[ ( id INTEGER UNSIGNED NOT NULL ]] .. ((oltp_auto_inc and "AUTO_INCREMENT") or "") .. [[, @@ -28,7 +34,7 @@ c CHAR(120) DEFAULT '' NOT NULL, pad CHAR(60) DEFAULT '' NOT NULL, ]] .. index_name .. [[ (id) - ) /*! ENGINE = ]] .. mysql_table_engine .. " MAX_ROWS = " .. myisam_max_rows .. " */" + ) /*! ENGINE = ]] .. mysql_table_engine ..comp_str.. " MAX_ROWS = " .. myisam_max_rows .. " */" elseif (db_driver == "drizzle") then query = [[ Index: tests/db/select_random_points.lua =================================================================== --- tests/db/select_random_points.lua (revision 2462) +++ tests/db/select_random_points.lua (revision 2467) @@ -13,6 +13,11 @@ db_connect() print("Creating table 'sbtest'...") + + local comp_str=" " + if mysql_use_compress ~= 0 then + comp_str=" ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE="..mysql_use_compress + end if (db_driver == "mysql") then query = [[ @@ -22,7 +27,7 @@ c CHAR(120) DEFAULT '' NOT NULL, pad CHAR(60) DEFAULT '' NOT NULL, PRIMARY KEY (id) - ) /*! ENGINE = ]] .. mysql_table_engine .. " MAX_ROWS = " .. myisam_max_rows .. " */" + ) /*! ENGINE = ]] .. mysql_table_engine ..comp_str.. " MAX_ROWS = " .. myisam_max_rows .. " */" elseif (db_driver == "oracle") then query = [[ Index: drivers/mysql/drv_mysql.c =================================================================== --- drivers/mysql/drv_mysql.c (revision 2462) +++ drivers/mysql/drv_mysql.c (revision 2467) @@ -75,7 +75,8 @@ {"mysql-ssl", "use SSL connections, if available in the client library", SB_ARG_TYPE_FLAG, "off"}, {"myisam-max-rows", "max-rows parameter for MyISAM tables", SB_ARG_TYPE_INT, "1000000"}, {"mysql-debug", "dump all client library calls", SB_ARG_TYPE_FLAG, "off"}, - + {"mysql-use-compress", "whether to create a compress table or not, and this value equal to key_block_size", + SB_ARG_TYPE_INT , "0"}, {NULL, NULL, SB_ARG_TYPE_NULL, NULL} };
–Thread 140431724488448 has waited at buf0flu.c line 1454 for 303.00 seconds the semaphore:S-lock on RW-latch at 0x2bce158 ‘&buf_pool->page_hash_latch’a writer (thread id 140431697413888) has reserved it in mode exclusivenumber of readers 0, waiters flag 1, lock_word: 0Last time read locked in file buf0buf.c line 2482
#0 0x00000000008a8c2d in buf_buddy_free_low ()#1 0x0000000000815234 in buf_LRU_block_remove_hashed_page ()#2 0x0000000000815fb3 in buf_LRU_free_block ()#3 0x0000000000817143 in buf_LRU_search_and_free_block ()#4 0x00000000008187bc in buf_LRU_get_free_block ()#5 0x000000000080bf0e in buf_page_get_gen ()#6 0x00000000007fdc58 in btr_search_drop_page_hash_when_freed ()#7 0x000000000084c489 in fseg_free_extent ()#8 0x000000000084d284 in fseg_free_step ()#9 0x00000000007ec4a9 in btr_free_but_not_root ()#10 0x000000000081d50e in dict_drop_index_tree ()#11 0x00000000007b6035 in row_upd_clust_step ()#12 0x00000000007b69de in row_upd_step ()#13 0x0000000000899b88 in que_run_threads ()#14 0x000000000089a286 in que_eval_sql ()#15 0x00000000007a5ce9 in row_drop_table_for_mysql ()#16 0x00000000007918ae in ha_innobase::delete_table(char const*) ()
Thread 3 (Thread 0x7f2531467700 (LWP 24352)):#0 0x00000000008a8c2d in buf_buddy_free_low ()#1 0x00000000008150f4 in buf_LRU_block_remove_hashed_page ()#2 0x0000000000817def in buf_LRU_invalidate_tablespace ()#3 0x0000000000847cab in fil_delete_tablespace ()#4 0x00000000007a606d in row_drop_table_for_mysql ()#5 0x00000000007918ae in ha_innobase::delete_table(char const*) ()#6 0x0000000000694c0e in ha_delete_table(THD*, handlerton*, char const*, char const*, char const*, bool) ()#7 0x0000000000602e55 in mysql_rm_table_no_locks(THD*, TABLE_LIST*, bool, bool, bool, bool) ()#8 0x00000000006034ad in mysql_rm_table(THD*, TABLE_LIST*, char, char) ()#9 0x00000000005abd29 in mysql_execute_command(THD*) ()#10 0x00000000005afaa8 in mysql_parse(THD*, char*, unsigned int, Parser_state*) ()#11 0x00000000005b12c7 in dispatch_command(enum_server_command, THD*, char*, unsigned int) ()#12 0x00000000006355ff in do_handle_one_connection(THD*) ()#13 0x0000000000635700 in handle_one_connection ()#14 0x0000000000900da6 in pfs_spawn_thread ()#15 0x0000003888c077e1 in start_thread () from /lib64/libpthread.so.0#16 0x00000038888e68ed in clone () from /lib64/libc.so.6
Thread 41 (Thread 0x7fb8d04dc700 (LWP 2554)):#0 0x0000003888c0b44c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0#1 0x0000000000884811 in os_event_wait_low ()#2 0x00000000007c42cf in sync_array_wait_event ()#3 0x00000000007c54a3 in mutex_spin_wait ()#4 0x0000000000803b0e in pfs_mutex_enter_func ()#5 0x0000000000807775 in buf_print_io ()#6 0x00000000007bf89d in srv_printf_innodb_monitor.clone.3 ()#7 0x00000000007bfef0 in srv_monitor_thread ()#8 0x0000003888c077e1 in start_thread () from /lib64/libpthread.so.0#9 0x00000038888e68ed in clone () from /lib64/libc.so.6
#0 0x0000003888c0b44c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0#1 0x0000000000884811 in os_event_wait_low ()#2 0x00000000007c42cf in sync_array_wait_event ()#3 0x00000000007c4510 in rw_lock_s_lock_spin ()#4 0x000000000080fcac in buf_flush_page_and_try_neighbors ()#5 0x0000000000810dbf in buf_flush_batch ()#6 0x0000000000811540 in buf_flush_list ()#7 0x00000000007bca2e in srv_master_thread ()#8 0x0000003888c077e1 in start_thread () from /lib64/libpthread.so.0#9 0x00000038888e68ed in clone () from /lib64/libc.so.6
May be related : http://bugs.mysql.com/bug.php?id=59727
>>读取记录中存储的索引root_page_no、space id、是否是压缩表(zip_size = fil_space_get_zip_size(space))>>释放所有除了root page外的所有pagebtr_free_but_not_root(space, zip_size, root_page_no);>>>释放leaf page segment
finished = fseg_free_step(root + PAGE_HEADER + PAGE_BTR_SEG_LEAF,
&mtr);
函数fseg_free_step可以被重复调用来释放叶子page,这样可以避免过大的mtr.
从函数参数可以看出来,root + PAGE_HEADER + PAGE_BTR_SEG_LEAF指向了LEAF SEGMENT(每个索引有两个segment,一个管理叶子节点,一个管理非叶子节点)
–>读取一个extend descriptor
descr = xdes_get_descriptor(space, zip_size, header_page, mtr);
读取该space的0号page(其中记录了space header,在该page的FSP_HEADER_OFFSET处记录)。
然后再根据space header获取xdes_t。这里的header_page为3, 也就是root节点。实际上btree的root节点存储了两个SEGMENT的segment header.
inode = fseg_inode_try_get(header, space, zip_size, mtr);//读取该segment对应的segmeng inode
descr = fseg_get_first_extent(inode, space, zip_size, mtr);
获取该segment的第一个extent。extend list是按照FSEG_FULL->FSEG_NOT_FULL->FSEG_FREE来进行排序
//TODO:对文件系统部分进行系统分析
–>如果descr不为NULL
fseg_free_extent //释放一个extend
依次遍历该extend上的page,并从page hash index中删除,调用如下函数进行
btr_search_drop_page_hash_when_freed(
space, zip_size, first_page_in_extent + i);
|–>先获取对应的block(buf_page_get_gen)
|–>再调用btr_search_drop_page_hash_index去移除adaptive hash index中对应该page的记录
性能瓶颈1:发生在buf_page_get_gen,函数的参数mode为BUF_PEEK_IF_IN_POOL,表示如果page不在bp中,则返回NULL。但如果page在bp中,并且是压缩表时,则继续下面的逻辑,会从bp中获取一个block(block = buf_LRU_get_free_block(buf_pool)),这里可能是耗时的地方然后从inode上删除对应的extent节点
最后调用fsp_free_extent将该extend加到空闲链表上。
–>最后返回false到上层,下次会继续循环调用该函数来做释放
–>释放碎片页及inode等,这些都是很快的操作,不会成为瓶颈
>>>如果上一步调用函数fseg_free_step返回值为false,表明该segment的释放还没完成,goto leaf_loop继续调用fseg_free_step
>>>释放非叶子节点segement(通过PAGE_BTR_SEG_TOP来指定)
finished = fseg_free_step_not_header(
root + PAGE_HEADER + PAGE_BTR_SEG_TOP, &mtr);
流程和fseg_free_step类似,但不释放对应的inode
>>释放root pagebtr_free_root(space, zip_size, root_page_no, mtr); //释放btree的根节点>>将对应的索引在SYS_INDEXS中的记录DICT_SYS_INDEXES_PAGE_NO_FIELD置空page_rec_write_field(rec, DICT_SYS_INDEXES_PAGE_NO_FIELD,FIL_NULL, mtr);
>>检查是否有pending io/pending ibuf merge,有的话,则循环等待>>如果开启了innodb_lazy_drop_table,调用函数buf_LRU_mark_space_was_deleted(id)>>>扫描bp的LRU,找到对应的page,然后设置bpage->space_was_being_deleted = TRUE>>>>>如果没有设置,则调用buf_LRU_invalidate_tablespace
>>>buf_LRU_drop_page_hash_for_tablespace //删除所有跟该tablespace id 相关的page hash index记录
扫描buf_pool->LRU,采用批量清理的方法,首先临时分配一个1024个元素的page_arr,每当读到一个和当前drop table相关的page时,就加入到数组中,如果数组满了,则调用buf_LRU_drop_page_hash_batch删除对应page hash index记录
疑问:这个之前在dict_drop_index_tree 已经做过一次删除page hash index entries,为何这里还要再做??
>>>buf_LRU_invalidate_tablespace_buf_pool_instance
从backtrace来看,这里也是耗费时间主要的地方,会再次扫描buf_pool->LRU,读取和space id吻合的page
–>如果page被Pin住(buf_page_get_io_fix(bpage) != BUF_IO_NONE或者bpage->buf_fix_count > 0),读取下一个page
–>将page从LRU移除
buf_LRU_block_remove_hashed_page(bpage, TRUE)
|–>buf_LRU_remove_block(bpage); //将bpage从LRU上移除
|–>读取page hash中对应的page(buf_page_hash_get_low),做必要的检查后,从page_hash中将当前page移除。
|–>对于普通page,只需要进行memset.如果是压缩表,还需要调用buf_buddy_free来释放压缩页内存
性能瓶颈2:从观察到的现象看,这里buf_buddy_free被长时间频繁调用到