Oracle Gateway for SQL Server时2PC分布式事务异常处理 ora-01017 ora-02063
现象描述:
(1).数据库通过调用透明网络实现分布式事务,但透明网关停用后,失败的分布式事务并未清理。
(2).数据库 alert 日志
Wed Jun 08 08:40:34 2022
Errors in file /data01/u01/app/oracle/diag/rdbms/orcls3/orcl/trace/orcl_reco_4243.trc:
ORA-01017: invalid username/password; logon denied
[Oracle][ODBC SQL Server Wire Protocol driver][SQL Server]用户 'RECOVER' 登录失败。 {28000,NativeErr = 18456}
ORA-02063: preceding 2 lines from YDH
Wed Jun 08 08:41:37 2022
Errors in file /data01/u01/app/oracle/diag/rdbms/orcls3/orcl/trace/orcl_reco_4243.trc:
ORA-01017: invalid username/password; logon denied
[Oracle][ODBC SQL Server Wire Protocol driver][SQL Server]用户 'RECOVER' 登录失败。 {28000,NativeErr = 18456}
ORA-02063: preceding 2 lines from YDH
(3).数据库 RECO 进程 trc 日志orcl_reco_4243.trc
Trace file /data01/u01/app/oracle/diag/rdbms/orcls3/orcl/trace/orcl_reco_4243.trc
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
ORACLE_HOME = /data01/u01/app/oracle/product/11.2.0/dbhome_1
System name: Linux
Node name: host2548
Release: 3.10.0-1160.49.1.el7.x86_64
Version: #1 SMP Tue Nov 9 16:09:48 UTC 2021
Machine: x86_64
Instance name: orcl
Redo thread mounted by this instance: 1
Oracle process number: 66
Unix process pid: 4243, image: oracle@host2548 (RECO)
*** 2022-06-08 00:17:02.699
*** SESSION ID:(144.1) 2022-06-08 00:17:02.699
*** CLIENT ID:() 2022-06-08 00:17:02.699
*** SERVICE NAME:(SYS$BACKGROUND) 2022-06-08 00:17:02.699
*** MODULE NAME:() 2022-06-08 00:17:02.699
*** ACTION NAME:() 2022-06-08 00:17:02.699
ERROR, tran=10.10.367997, session#=1, ose=0:
ORA-01017: invalid username/password; logon denied
[Oracle][ODBC SQL Server Wire Protocol driver][SQL Server]用户 'RECOVER' 登录失败。 {28000,NativeErr = 18456}
ORA-02063: preceding 2 lines from YDHUB8
故障原因:
Oracle 数据库中,RECO 进程用于自动地解决分布式事务发生错误的情况。一个节点上的 RECO 进程将会自动连接至存在 in-doubt 分布式事务的数据库上。当 RECO 进程建立了数据库连接后,它将会自动解决处于 in-dount 状态的分布式事务,并将解决后的事务从 pending transation 表中删除。
(In a distributed database, the RECO) automatically resolves failures in distributed transactions. The RECO process of a node automatically connects to other databases involved in an in-doubt distributed transaction. When RECO reestablishes a connection between the databases, it automatically resolves all in-doubt transactions, removing from each database's pending transaction table any rows that correspond to the resolved transactions.)
故障场景中,数据库通过调用透明网络实现分布式事务,但透明网关停用后,失败的分布式事务并未清理。
例如:分布式事务于 PREPARE PHASE 阶段出现异常。
**则在本地端查询 dba_2pc_pending 可得类似如下结果:
set line 1000
col GLOBAL_TRAN_ID for a30
col TRAN_COMMENT for a30
col OS_TERMINAL for a20
col host for a20
col os_user for a20
select * from dba_2pc_pending;
LOCAL_TRAN_ID GLOBAL_TRAN_ID STATE MIXED ADVICE TRAN_COMMENT FAIL_TIME FORCE_TIME RETRY_TIME OS_USER OS_TERMINAL HOST DB_USER COMMIT#
---------------------- ------------------------------ ---------------- ----- ------ ------------------------------ --------- ---------- ---------- -------------------- -------------------- -------------------- ------------------------------ ----------------
1.30.308101 orcl.1683138c.1.30.308101 collecting no 24-3月-20 08-6月-22 orclService host2252 WORKGROUP\host2252 WH 6810389850898
已选择 1 行。
**在远端查询 可得类似如下结果:
SQL> select local_tran_id,state from dba_2pc_pending;
no rows selected
即表示本地数据库要求其他点做好 commit 或者 rollback 准备,现在正在"收集"其他点的数据库的返回信息,但是此时出现了错误,远端数据库未知状态(in doubt)。
而 RECO 进程不断自动处理无法解决的分布式事务,因此在数据库 alert.log 日志中不断报错。
故障处理步骤:
为防止 RECO 进程不断自动处理无法解决的分布式事务,需要将本地端的全局协调者(Global Coordinator)的 pending transation 清除掉。
依照上述例子的场景,分布式事务于 PREPARE PHASE 阶段出现异常。因此,需要以 SYS 用户登录本地端数据库,执行如下清理命令。
SQL> execute DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('local_tran_id'); --其中,local_tran_id 为本地端的事务 ID。
select 'exec dbms_transaction.purge_lost_db_entry('||''''||local_tran_id||''''||');commit;' "Purge"
from dba_2pc_pending;
Purge
---------------------------------------------------------------------------
exec dbms_transaction.purge_lost_db_entry('1.30.308101');commit;
已选择 1 行。
参考:
How to Setup DG4MSQL to Use Distributed Transactions (Doc ID 227011.1)
If the transaction is not set up correctly the following error encounters:
ORA-02054: transaction x.x.xx in-doubt
In some rare cases where distributed transaction might fail and the logs in the log tables (dba_2pc_pending, sys.pending_trans$ ,sys.pending_sessions$ , sys.pending_sub_sessions$) are not deleted, they must be wiped out manually. To do so, log in as SYS user and do:
select * from dba_2pc_pending;
Check out for the LOCAL_TRAN_ID of the transaction you want to get rid of and supply it in the package:
DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('')
To remove those pending transactions from the DataDictionary tables, execute this package:
exec dbms_transaction.purge_lost_db_entry('<local_tran_id>');
followed by a commit - for example:
select local_tran_id from dba_2pc_pending;
=> returns for example 2.1.8263
to purge it, login as sysdba and execute:
alter system disable distributed recovery;
commit;
commit force '2.1.8263';
=> might fail, then continue with:
exec dbms_transaction.purge_lost_db_entry('2.1.8263');
commit;
alter system enable distributed recovery;
commit;