一套SUNOS上的2节点10.2.0.2 RAC系统日前出现ORA-00600: internal error code, arguments: [keltnfy-ldmInit], [46], [1], [], [], [], [], []内部错误,错误发生时系统操作人员误使用hostname命令修改了1号主机的主机名,之后陆续出现以上ora-00600错误,同时操作系统日志显示RAC CSS进程意外终止,具体日志如下:
================== OS Message===================== Jan 10 11:15:10 cupd25k-a root: [ID 702911 user.error] Cluster Ready Services completed waiting on dependencies. Jan 10 11:15:16 cupd25k-a root: [ID 702911 user.error] Duplicate Oracle CLSMON found. Killing and restarting it. Jan 10 11:15:16 cupd25k-a root: [ID 702911 user.error] Oracle CSS daemon failed to start up. Check CRS logs for diagnostics. Jan 10 11:15:16 cupd25k-a root: [ID 702911 user.error] Oracle CLSMON terminated with unexpected status 137. Respawning /* 这里的Duplicate Oracle CLSMON found 因该指的是OCLSMON进程, "In Oracle 10.2.0.2 and above there is an additional process called OCLSOMON which monitors the CSS daemon for hangs or scheduling issues and can reboot a node if there is a perceived hang. OCLSOMON is spawned in init.cssd and runs as the Oracle user." oclsmon进程在10.2.0.2以后版本被引入,用以监视css进程, 若发生hang或操作系统调度问题时该进程可能会reboot节点, oclsmon进程会被init.cssd脚本spawned. */ ==================oclsmon.log====================== 2011-01-10 11:15:11.376 unspecified member number is (1) Member 1 group OCLSMON_ in use. Is oclsmon already up? 2011-01-10 11:15:11.479 Internal Error Information: Category: 8 Operation: skgxnreg: the member number is i Location: skgxnreg_7 Other: Dep: 1 2011-01-10 11:15:11.737 unspecified member number is (1) Member 1 group OCLSMON_ in use. Is oclsmon already up? 2011-01-10 11:15:11.751 Internal Error Information: Category: 8 Operation: skgxnreg: the member number is i Location: skgxnreg_7 Other: Dep: 1 2011-01-10 11:15:12.006 unspecified member number is (1) Member 1 group OCLSMON_ in use. Is oclsmon already up? 2011-01-10 11:15:12.023 Internal Error Information: Category: 8 Operation: skgxnreg: the member number is i Location: skgxnreg_7 Other: Dep: 1 2011-01-10 11:15:12.278 unspecified member number is (1) Member 1 group OCLSMON_ in use. Is oclsmon already up? 2011-01-10 11:15:12.293 Internal Error Information: Category: 8 Operation: skgxnreg: the member number is i Location: skgxnreg_7 Other: Dep: 1 /* skgxn是Oracle Clusterware用以监视skgxn事件(即第三方CLUSTERWARE相关的事宜,他们应该有用sun的cluster); 似乎是修改hostname导致了Oracle CSS出现了fatal error,并启动了一个以上的OCLSMON进程(Duplicate Oracle CLSMON found), 最后"Oracle CSS daemon failed to start up. Check CRS logs for diagnostics", 在Oracle instance启动的情况下25k-a节点的CSS进程意外终止, 可能导致该节点上的所有实例的LMD(global Enqueue Service daemon)、LMON无法正常工作而导致实例hang住。*/ ==========================alert.log==================== Errors in file /oracle/oracle/admin/BOCPCS/udump/bocpcs1_ora_12320.trc: ORA-00600: internal error code, arguments: [keltnfy-ldmInit], [46], [1], [], [], [], [], [] =========================part of trace file=============== *** 2011-01-10 11:11:02.957 ksedmp: internal or fatal error ORA-00600: internal error code, arguments: [keltnfy-ldmInit], [46], [1], [], [], [], [], [] Current SQL information unavailable - no session. ----- Call Stack Trace ----- calling call entry argument values in hex location type point (? means dubious value) -------------------- -------- -------------------- ---------------------------- ksedmp()+716 CALL ksedst() FFFFFFFF7FFF9D40 ? 000000000 ? 0FFFFFFFF ? FFFFFFFF7FFF8EE8 ? FFFFFFFF7FFFA640 ? 000000008 ? kgerinv()+200 PTR_CALL 0000000000000000 000000002 ? 10638A1CC ? 000000001 ? 000000000 ? 10638A000 ? 10638A1CC ? kgeasnmierr()+28 CALL kgerinv() 106384B98 ? 000000000 ? 105D3B940 ? 000000002 ? FFFFFFFF7FFFDFF0 ? 000001430 ? keltnfy()+784 CALL kgeasnmierr() 106384B98 ? 1064DCBF0 ? 105D3B940 ? 000000002 ? 000000000 ? 00000002E ? kscnfy()+552 PTR_CALL 0000000000000000 10639B498 ? 38001E7A8 ? 1055AC5D0 ? 10639B498 ? 000102C00 ? 10638A1C0 ? ksucrp()+2436 CALL kscnfy() 000008000 ? 000808214 ? 100C4C220 ? 1055C6680 ? 00000000F ? 000000001 ? opiino()+2056 CALL ksucrp() 000106387 ? 380007608 ? 000000000 ? 000380000 ? 000106000 ? 106387618 ? opiodr()+1488 PTR_CALL 0000000000000000 10555A000 ? FFFFFFFF7FFFF1C8 ? 00010555A ? 000106000 ? 105C83000 ? 000000001 ? opidrv()+828 CALL opiodr() 106391000 ? 000000000 ? 106390DD8 ? 106390000 ? 106391BD0 ? 000106000 ? sou2o()+80 CALL opidrv() 106394358 ? 000000001 ? 00000003C ? 000000000 ? 00000003C ? 000106000 ? opimai_real()+124 CALL sou2o() FFFFFFFF7FFFF788 ? 00000003C ? 000000004 ? FFFFFFFF7FFFF7B0 ? 105C82000 ? 000105C82 ? main()+152 CALL opimai_real() 000000002 ? FFFFFFFF7FFFF888 ? 103F1BBCC ? 10632DB10 ? 002411E44 ? 000014400 ? _start()+380 CALL main() 000000002 ? 000000008 ? 000000000 ? FFFFFFFF7FFFF898 ? FFFFFFFF7FFFF9A8 ? FFFFFFFF7C700200 ? /* 可以看到以上trace文件指出了no session, 在服务进程启动阶段遭遇了该keltnfy-ldmInit内部错误*/ metalink文档Startup Database Produces Ora-00600: [Keltnfy-Ldminit] [ID 336447.1] 介绍了该内部错误一般由主机上的不当网络配置引起,很显然使用hostname命令修改了一个无法解析的 主机名时可能引发该ORA-00600[keltnfy-ldmInit]内部错误。 Applies to: Oracle Server - Enterprise Edition - Version: 10.2.0.1 to 10.2.0.3 - Release: 10.2 to 10.2 Information in this document applies to any platform. ***Checked for relevance on 09-Jun-2010*** Symptoms An startup nomount on Oracle 10g Release 2 database produces the following exception in alert log Starting up ORACLE RDBMS Version: 10.2.0.1.0. Errors in file /opt/oracle/10.2/admin/ORCL/udump/ORCL_ora_535.trc: ORA-00600: internal error code, arguments: [keltnfy-ldmInit], [46], [1], [], [], [], [], [] USER: terminating instance due to error 600 Instance terminated by USER, pid = 535 Cause The problem is related to getting host information. In this case, ldmInit()/sldmInit() is failing with error 46 : LDMERR_HOST_NOT_FOUND The following exception may also occur : LDMERR_SOSD_INIT OSD init failed to be specific in these OSD failures LDMERR_BAD_ADDR bad address when system call gethostname failed LDMERR_HOST_NOT_FOUND gethostbyname system call fails LDMERR_NO_SUPPORT when specific address type is not supported Development has fixed two bugs so far regarding this issue Bug:5438154 - Abstract: ORA-600[KELTNFY-LDMINIT] STARTING THE DB Release Notes: ldmInit returned LDMERR_HOST_NOT_FOUND for the machine huge alias list/address list Workaround: reduce the alais list of the machine Bug:5486074 - Abstract: ORA-600 [KELTNFY-LDMINIT] WHEN DNS IS NOT AVAILABLE Release Notes: Internal error is raised by the Server Generated Alert subsystem when it can not determine Host Name or Network Address. This can be caused by DNS server being unaavilable. Solution The fix for 5486074 will not fix any underlying error from gethostbyname(), it just change the internal error to a warning message : "Warning: keltnfy call to ldmInit failed with error 46" You will still need to fix the network config issue. These are the check you can do verify the host information Check permission on /etc/hosts $ ls -l /etc/hosts -rw-r--r-- 2 root root 194 Oct 17 2006 /etc/hosts Check if /etc/hosts file is correctly configured ( all of this on one line ). Check the hostname: $ hostname $ ping `hostname` Make sure you are able to ping the hostname Check if /etc/nodename is correctly configured If you have DNS setup, ping is not a tool to diagnose DNS problem. A better tool to use is nslookup, dnsquery, or dig. $ nslookup $ nslookup $ nslookup The forward and reverse lookup should succeed and return consistent address/info. Check nsswitch.conf $ more nsswitch.conf hosts: files dns Make sure host lookup is also done through the /etc/hosts file and not just dns. It is recommended that FILES come first before DNS. Also, check the resolv.conf. This makes sure that the DNS is working properly.显然在生产主机上使用hostname命令是危险的,因为你很难保证你在打字的时候不会因为同事的一下拍击而输错,有人说在生产环境中rm命令因该被禁用,那么这种特殊待遇对hostname命令也适用,我们可以用什么来代替hostname查看主机名呢?选择可以有非常多,这里我推荐一种:
/* uname -n完全可以满足你的需要! */That's great!
本文转自maclean_007 51CTO博客,原文链接:
http://blog.51cto.com/maclean/1277691