How to Quickly Detect and Recover a Faulty Server

Original article: http://devo.ps/blog/2013/03/06/troubleshooting-5minutes-on-a-yet-unknown-box.html

First 5 Minutes Troubleshooting A Server

Back when our team was dealing with operations, optimization and scalability at our previous company, we had our fair share of troubleshooting poorly performing applications and infrastructures of various sizes, often large (think CNN or the World Bank). Tight deadlines, “exotic” technical stacks and lack of information usually made for memorable experiences.

The cause of the issues was rarely obvious: here are a few things we usually got started with.

Get some context

Don’t rush onto the servers just yet; first figure out how much is already known about the server and the specifics of the issue. You don’t want to waste your time (trouble)shooting in the dark.

A few “must haves”:

  • What exactly are the symptoms of the issue? Unresponsiveness? Errors?
  • When did the problem start being noticed?
  • Is it reproducible?
  • Any pattern (e.g. happens every hour)?
  • What were the latest changes on the platform (code, servers, stack)?
  • Does it affect a specific user segment (logged in, logged out, geographically located…)?
  • Is there any documentation for the architecture (physical and logical)?
  • Is there a monitoring platform? Munin, Zabbix, Nagios, New Relic… Anything will do.
  • Any (centralized) logs? Loggly, Airbrake, Graylog…

The last two are the most convenient sources of information, but don’t expect too much: they’re also the ones most often painfully absent. Tough luck; make a note to get this corrected and move on.

Who’s there?

$ w
$ last

Not critical, but you’d rather not be troubleshooting a platform others are playing with. One cook in the kitchen is enough.

What was previously done?

$ history

Always a good thing to look at, combined with the knowledge of who was on the box earlier on. Be responsible by all means: being admin shouldn’t allow you to break anyone’s privacy.

A quick mental note for later: you may want to set the environment variable HISTTIMEFORMAT to keep track of when those commands were run. Nothing is more frustrating than investigating an outdated list of commands…
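
For example, a minimal way to do that in bash (the format string below is just a suggestion):

$ export HISTTIMEFORMAT='%F %T '    # put this in ~/.bashrc or /etc/profile so it sticks
$ history | tail -5                 # entries now show the date and time they were run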

What is running?

$ pstree -a
$ ps aux

While ps aux tends to be pretty verbose, pstree -a gives you a nice condensed view of what is running and who called what.

Listening services

$ netstat -ntlp
$ netstat -nulp
$ netstat -nxlp

I tend to prefer running them separately, mainly because I don’t like looking at all the services at the same time. netstat -nalp will do too, though. Even then, I’d omit the numeric option (I find resolved names more readable than raw IPs).

Identify the running services and whether they’re expected to be running or not. Look for the various listening ports. You can always match the PID of the process with the output of ps aux; this can be quite useful especially when you end up with 2 or 3 Java or Erlang processes running concurrently.
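
For example, tying a listening port back to its process (the port and PID below are placeholders for illustration):

$ netstat -ntlp | grep ':3306'    # note the PID/Program name column
$ ps -fp 1234                     # full command line and parent of that PID (1234 is made up)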

We usually prefer to have more or less specialized boxes, with a low number of services running on each one of them. If you see three dozen listening ports you should probably make a mental note to investigate further and see what can be cleaned up or reorganized.

CPU and RAM

$ free -m
$ uptime
$ top
$ htop

This should answer a few questions:

  • Any free RAM? Is it swapping?
  • Is there still some CPU left? How many CPU cores are available on the server? Is one of them overloaded?
  • What is causing the most load on the box? What is the load average?
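
To put quick numbers on those questions (assuming a standard GNU/Linux userland with procps and coreutils):

$ nproc       # core count to compare the load average against
$ uptime      # 1, 5 and 15 minute load averages
$ free -m | awk '/Swap/ {print $3 " MB of swap in use"}'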

Hardware

$ lspci
$ dmidecode
$ ethtool

There are still a lot of bare-metal servers out there; this should help with:

  • Identifying the RAID card (with BBU?), the CPU, the available memory slots. This may give you some hints on potential issues and/or performance improvements.
  • Is your NIC properly set up? Are you running at half-duplex? At 10Mbps? Any TX/RX errors? (See the example after this list.)
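
For instance, to check link speed, duplex and error counters (eth0 is a placeholder for your interface name; counter names vary by driver):

$ ethtool eth0 | grep -E 'Speed|Duplex'    # expect something like 1000Mb/s, Full
$ ethtool -S eth0 | grep -i err            # per-NIC error/drop counters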

IO Performance

$ iostat -kx 2
$ vmstat 2 10
$ mpstat 2 10
$ dstat --top-io --top-bio

Very useful commands to analyze the overall performance of your backend:

  • Checking the disk usage: does the box have a filesystem/disk at 100% usage?
  • Is the swap currently in use (si/so)?
  • What is using the CPU: system? User? Stolen (VM)?
  • dstat is my all-time favorite. What is using the IO? Is MySQL sucking up the resources? Is it your PHP processes?
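
A short sampling run and the columns worth watching (sysstat and dstat assumed installed):

$ iostat -kx 2 5                  # await = per-request latency, %util = device saturation
$ vmstat 2 5                      # si/so = swap activity, wa = CPU time waiting on IO
$ dstat --top-io --top-bio 2 10   # which processes generate the most IO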

Mount points and filesystems

$ mount
$ cat /etc/fstab
$ vgs
$ pvs
$ lvs
$ df -h
$ lsof +D / /* beware not to kill your box */

  • How many filesystems are mounted?
  • Is there a dedicated filesystem for some of the services? (MySQL by any chance..?)
  • What are the filesystem mount options: noatime? default? Have some filesystems been re-mounted read-only?
  • Do you have any disk space left?
  • Are there any big (deleted) files whose space hasn’t been released yet? (See the example after this list.)
  • Do you have room to extend a partition if disk space is an issue?
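
For the “deleted but still open” case, for example (lsof’s +L1 lists open files whose link count has dropped to zero):

$ df -h        # overall usage per filesystem
$ lsof +L1     # files already unlinked but still held open, with their sizes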

Kernel, interrupts and network usage

$ sysctl -a | grep ...
$ cat /proc/interrupts
$ cat /proc/net/ip_conntrack /* may take some time on busy servers */
$ netstat
$ ss -s

  • Are your IRQs properly balanced across the CPUs? Or is one of the cores overloaded by network interrupts, the RAID card, …?
  • What is swappiness set to? 60 is good enough for workstations, but when it comes to servers this is generally a bad idea: you do not want your server to swap… ever. Otherwise your swapping process will be blocked while data is read from or written to the disk.
  • Is conntrack_max set to a high enough number to handle your traffic?
  • How long do you maintain TCP connections in the various states (TIME_WAIT, …)?
  • netstat can be a bit slow displaying all the existing connections; you may want to use ss instead to get a summary.

Have a look at Linux TCP tuning for some more pointers on how to tune your network stack.
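
For instance, the values mentioned above can be read directly (key names differ across kernel versions, e.g. ip_conntrack vs nf_conntrack):

$ sysctl vm.swappiness
$ sysctl net.netfilter.nf_conntrack_max    # net.ipv4.netfilter.ip_conntrack_max on older kernels
$ sysctl net.ipv4.tcp_fin_timeout
$ ss -s                                    # per-state socket summary, much faster than netstat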

System logs and kernel messages

$ dmesg
$ less /var/log/messages
$ less /var/log/secure
$ less /var/log/auth

  • Look for any error or warning messages: is it complaining about the number of connections in your conntrack table being too high?
  • Do you see any hardware error, or filesystem error?
  • Can you correlate the time from those events with the information provided beforehand?
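
A quick first pass could look like this (dmesg -T needs a reasonably recent util-linux; journalctl only applies to systemd boxes):

$ dmesg -T | grep -iE 'error|fail|warn' | tail -50
$ journalctl -p err --since today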

Cronjobs

$ ls /etc/cron* + cat
$ for user in $(cat /etc/passwd | cut -f1 -d:); do crontab -l -u $user; done

  • Is there any cron job that is running too often?
  • Is there some user’s cron that is “hidden” from common eyes?
  • Was there a backup of some sort running at the time of the issue?

Application logs

There is a lot to analyze here, but it’s unlikely you’ll have time to be exhaustive at first. Focus on the obvious ones, for example in the case of a LAMP stack:

  • Apache & Nginx; chase down access and error logs, look for 5xx errors, look for possible limit_zone errors (see the example after this list).
  • MySQL; look for errors in mysql.log, traces of corrupted tables, or an InnoDB repair process in progress. Look at the slow logs and determine whether there are disk/index/query issues.
  • PHP-FPM; if you have php-slow logs on, dig in and try to find errors (php, mysql, memcache, …). If not, turn it on.
  • Varnish; in varnishlog and varnishstat, check your hit/miss ratio. Are you missing some rules in your config that let end-users hit your backend instead?
  • HA-Proxy; what is your backend status? Are your health-checks successful? Do you hit your max queue size on the frontend or your backends?
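
As an example for the web tier, counting 5xx responses per status code (the log path and the combined log format, where the status code is field 9, are assumptions):

$ awk '$9 ~ /^5/ {c[$9]++} END {for (s in c) print s, c[s]}' /var/log/nginx/access.log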

Conclusion

After these first 5 minutes (give or take 10 minutes) you should have a better understanding of:

  • What is running.
  • Whether the issue seems to be related to IO/hardware/networking or configuration (bad code, kernel tuning, …).
  • Whether there’s a pattern you recognize: for example a bad use of the DB indexes, or too many apache workers.

You may even have found the actual root cause. If not, you should be in a good place to start digging further, with the knowledge that you’ve covered the obvious.


Reposted from 拖鞋崽’s 51CTO blog; original link: http://blog.51cto.com/1992mrwang/1178760
