How to Quickly Implement Nginx-based Website Monitoring





In this article, we walk through a scenario involving a rapidly growing business whose application provides users with e-commerce data statistics web services. The application adopts the common distributed Nginx + app architecture, and to tackle its performance issues and bugs, monitoring must be set up for the application services to improve the quality of application operation. This article discusses the ins and outs of navigating this difficult problem.

It All Begins with Application Service Monitoring

A small internet startup James works with has been running its application on a domestic Cloud A. The application adopts the common distributed Nginx + app architecture to provide users with e-commerce data statistics web services. The application is performing well except for occasional bugs and performance issues.


Recently, James's manager assigned James the task of setting up monitoring for the application services to improve the quality of application operation. James's manager has primarily three requirements:

  1. First, with application service monitoring as the starting point, the system should be able to:
    a) Generate real-time statistics on the number of calls to each type of service;
    b) Based on point a), generate real-time statistics on the number of occurrences of each response value for each service, such as 200, 404 and 500;
    c) Based on point b), issue real-time alarms if the number of calls with a given response value exceeds the set limit;
  2. The system should provide a query feature for historical data and return, for all services, the call statistics of any response value during any period;
  3. Going forward, the system should allow quick extension to the company's other customized monitoring needs, such as statistics on interface response time and on user features.

In the manager's words, "The solution should be as versatile, quick, good and cost-effective as possible, and ideally the monitoring platform should be set up on Cloud A. The data must not be placed on a third-party cloud, mainly to save on public traffic costs and to simplify preparation for future big data analysis."

Technical Options

After receiving the brief, James begins with technology selection. He has three feasible options to choose from: the traditional OLAP-style approach, search engines, and the real-time computing approach.

After analyzing the status quo and the available techniques, he reached the following findings:

1. The Traditional OLAP-style Approach

The company's traffic is not small: the average QPS during daytime peak hours can exceed 100, and the business is still growing rapidly. For this reason, it is inappropriate to write call records that arrive hundreds of times per second directly into a database for real-time queries. The cost is too high, and the architecture does not scale well.

2. Search Engines

Cloud A provides search engine services, whose error statistics can essentially meet the business requirements. However, there are two uncertainties. On the one hand, search engine pricing and storage costs are high (search engines require additional index storage), and there is no guarantee on query response time for the various types of aggregate queries, such as interface response time statistics. On the other hand, real-time alarming would require writing code that endlessly polls APIs for the various call error counts, and its performance and cost are uncertain.

3. Real-time Computing Approach

Based on a real-time computing architecture, this approach processes all online logs and aggregates them in memory in real time along the service, response value (error type), and time dimensions, and then persists the results to storage. On the one hand, real-time calculation can be highly efficient: the aggregated results are far smaller than the raw data, which reduces the persistence cost while keeping processing real-time. On the other hand, alarm policies can be validated in memory in real time, minimizing the performance overhead of alarming.
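To make the aggregation idea concrete, here is a minimal Python sketch, not the pipeline James builds. It assumes each log record has already been reduced to a hypothetical `(service, status, timestamp)` tuple, and it counts calls per service, response value, and minute in memory. The service paths and timestamps are made-up sample data.

```python
from collections import Counter
from datetime import datetime


def aggregate(events):
    """Count calls per (service, status, minute) in memory."""
    counts = Counter()
    for service, status, ts in events:
        # Truncate the timestamp to the minute to form the time dimension.
        minute = ts.replace(second=0, microsecond=0)
        counts[(service, status, minute)] += 1
    return counts


# Hypothetical sample events: (service, response value, time of the call).
events = [
    ("/api/orders", 200, datetime(2024, 1, 1, 10, 0, 5)),
    ("/api/orders", 200, datetime(2024, 1, 1, 10, 0, 40)),
    ("/api/orders", 500, datetime(2024, 1, 1, 10, 0, 59)),
    ("/api/stats", 404, datetime(2024, 1, 1, 10, 1, 2)),
]
counts = aggregate(events)
```

Four raw records collapse into three aggregated rows here. With hundreds of calls per second the reduction is far larger, which is why persisting only the aggregates is cheaper than storing every call.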

Considering the factors mentioned above, the real-time computing-based architecture seems to be the best fit for the current needs of the company. Once the decision is final, James begins to explore the architectural design deeply.

Architecture Design

Having determined that the most suitable technology is one built around real-time computing, James begins to design the architecture. After thorough research and consulting various technical websites, he concludes that the following components are indispensable for building a reliable website monitoring solution.

Data Channel

The data channel is responsible for pulling data from Nginx and sending it to the computing engine. It also takes on the tasks of accumulating data and supporting data recalculation.

Computing Engine

The computing engine implements the real-time aggregation logic over the Nginx service logs in the error code and time dimensions. It is also responsible for the alarm logic.


Storage

This is where the Nginx monitoring results are stored. Although the results have a simple table structure, they are queried along a variety of dimensions, so an OLAP-like storage type is preferable.

Demonstration Portal

The demonstration portal provides rapid analysis and presentation of the Nginx monitoring results across all dimensions.

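Together, these components form the architecture for this scenario: Nginx logs flow through the data channel to the computing engine, the aggregated results are persisted to storage, and the demonstration portal presents them.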


Fortunately, Cloud A provides ready-made products for the first three components, so James does not need to build them one by one, which lowers the entry threshold.

  • For the data channel, James selects a Kafka-like data channel on Alibaba Cloud. This service offers good performance and message accumulation, among other features, while keeping data access reasonably simple.
  • To keep things simple, James chooses a Spark Streaming-based computing engine component that allows programmers to write SQL statements directly for real-time calculations. As a result, James does not need to write the stream computing program himself.
  • For storage, James goes with an HBase-like cloud storage product, because there are no demanding transaction requirements but there is a high need for capacity.
  • No ready-made product exists for the demonstration portal, so James scratches his head and decides to dig back into the programming technology he learned some time ago. He writes a simple query portal based on open-source demonstration frameworks.

With approval from his manager on the budget, James begins to activate a variety of products for development testing. The target duration for accomplishing this mission is one month.

Long Journey of Development

The long development journey began with a simple activation process. In less than half a day, the Kafka, Storm, and HBase tenants and clusters were ready. Unfortunately, as the saying goes, developers spend 80% of a project's time on the final 20% of pitfalls. One month elapsed with less than 70% of the project's features completed. James encountered the following pitfalls and recorded them in his technical blog:

Integration Troubleshooting Costs

The integrated components include the data channel, the real-time computing layer, and the background storage. The data push logic and the alarm query logic are both embedded in the code, so if any component suffers even a slight error, the entire pipeline is blocked. The debugging cost is also very high.

Log Cleaning

During development, every time the content of an Nginx log changes, the API push logic on each server has to be adjusted to match. The change process is lengthy and error-prone.
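One way to limit this churn is to keep the log parsing in a single place. The Python sketch below is an illustration and not part of James's system. It assumes the logs use Nginx's default `combined` access log format; the `parse_line` helper and the derived `path` field are hypothetical names introduced here.

```python
import re

# Pattern for Nginx's default "combined" access log format:
# $remote_addr - $remote_user [$time_local] "$request" $status
# $body_bytes_sent "$http_referer" "$http_user_agent"
LOG_PATTERN = re.compile(
    r'(?P<remote_addr>\S+) - (?P<remote_user>\S+) \[(?P<time_local>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<body_bytes_sent>\d+) '
    r'"(?P<http_referer>[^"]*)" "(?P<http_user_agent>[^"]*)"'
)


def parse_line(line):
    """Parse one access log line into a dict, or return None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    fields["body_bytes_sent"] = int(fields["body_bytes_sent"])
    # The request field looks like "METHOD path protocol"; keep the path as the service.
    parts = fields["request"].split()
    fields["path"] = parts[1] if len(parts) >= 2 else ""
    return fields


# Made-up sample line using a documentation IP address.
sample = (
    '203.0.113.7 - - [01/Jan/2024:10:00:05 +0800] '
    '"GET /api/orders?id=1 HTTP/1.1" 500 162 "-" "curl/8.0"'
)
record = parse_line(sample)
```

If the `log_format` directive changes, only `LOG_PATTERN` needs to be updated, rather than the push logic on every server.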

Persistence Table Design

Appropriate table and database design for the monitoring items is essential to avoid index hotspots. Database writes must also be idempotent, so that results stay correct when instability in the real-time computing layer leads to repeated calculation. This is a major challenge for the table structure design.
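The two concerns above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the table design James adopted. A plain dict stands in for the HBase-like table, and `NUM_BUCKETS`, `row_key`, and `upsert` are hypothetical names. The key is salted with a hash-derived prefix to spread writes, and each write overwrites the aggregated value instead of incrementing it, so a replayed calculation leaves the table unchanged.

```python
import hashlib

NUM_BUCKETS = 16  # hypothetical number of salt buckets


def row_key(service, status, minute):
    """Build a deterministic, salted row key for one aggregated result."""
    natural = f"{service}|{status}|{minute}"
    digest = hashlib.md5(natural.encode("utf-8")).hexdigest()
    salt = int(digest, 16) % NUM_BUCKETS  # spreads adjacent minutes across buckets
    return f"{salt:02d}|{natural}"


def upsert(table, service, status, minute, count):
    """Overwrite the stored count so that repeated writes are idempotent."""
    table[row_key(service, status, minute)] = count


table = {}  # stand-in for the real table
upsert(table, "/api/orders", 500, "202401011000", 3)
upsert(table, "/api/orders", 500, "202401011000", 3)  # replay after a failure
```

Salting on the full key trades away cheap time-range scans: a range query has to compute the key for each minute it needs. Whether that trade-off is acceptable depends on the query patterns.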

Delayed Data Consolidation

If the application delays the sending of Nginx log data, there is no guarantee that the real-time computing engine will calculate the delayed data accurately (for example, data delayed by one hour) and merge the results into the previously stored results.
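The merging problem can be illustrated with a small Python sketch. It assumes that results are keyed by the event time recorded in the log rather than by arrival time, and that each batch is delivered exactly once; both are assumptions, and the function names are hypothetical.

```python
from collections import Counter


def bucket_by_event_minute(events):
    """Group events by (service, status, event minute) using the time in the log."""
    batch = Counter()
    for service, status, event_time in events:
        # event_time is "YYYY-MM-DD HH:MM:SS"; the first 16 characters are the minute.
        batch[(service, status, event_time[:16])] += 1
    return batch


def merge_batch(results, batch):
    """Add a batch's counts to the stored results for the same event minute."""
    for key, n in batch.items():
        results[key] += n
    return results


# Results already stored for 10:00.
results = Counter({("/api/orders", 500, "2024-01-01 10:00"): 5})

# A batch arriving around 11:00 that contains two records delayed by one hour.
late_batch = bucket_by_event_minute([
    ("/api/orders", 500, "2024-01-01 10:00:12"),
    ("/api/orders", 500, "2024-01-01 10:00:48"),
    ("/api/orders", 500, "2024-01-01 11:00:03"),
])
merge_batch(results, late_batch)
```

The delayed records land in the 10:00 bucket instead of inflating the 11:00 one. Note that incrementing is not idempotent, so this sketch would double count if a batch were replayed.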

Alarm Issuing

Alarming requires scheduled tasks that traverse and query all results every minute. For example, to find the services whose 500 call errors account for more than 5% of their total calls, the task has to traverse the call results multiple times and run queries for every service. The challenge is to avoid missing any service's error check while keeping the queries efficient.
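The 5% rule can be expressed as a single pass over the aggregated counts for one minute. The Python sketch below is only an illustration: it assumes the counts are available as a dict keyed by `(service, status, minute)`, and `services_over_threshold` is a hypothetical helper name.

```python
def services_over_threshold(counts, minute, threshold=0.05):
    """Return the services whose share of 500 responses in `minute` exceeds `threshold`."""
    totals, errors = {}, {}
    for (service, status, m), n in counts.items():
        if m != minute:
            continue
        totals[service] = totals.get(service, 0) + n
        if status == 500:
            errors[service] = errors.get(service, 0) + n
    return sorted(
        service
        for service, total in totals.items()
        if total > 0 and errors.get(service, 0) / total > threshold
    )


# Made-up aggregated counts for one minute.
counts = {
    ("/api/orders", 200, "10:00"): 90,
    ("/api/orders", 500, "10:00"): 10,  # 10% of calls failed
    ("/api/stats", 200, "10:00"): 99,
    ("/api/stats", 500, "10:00"): 1,    # 1% of calls failed
}
alarms = services_over_threshold(counts, "10:00")
```

Because every service present in the minute is visited in the same pass, no service is skipped, which addresses the coverage concern at least for data that fits in memory.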

Alarm Accuracy

Sometimes, because of log delay, the normal logs from some servers for the last minute have not yet been collected, so the proportion of 500 call errors for a service temporarily appears to exceed 5%. Should the system trigger an alarm in this case? If it does, there may be false alerts. If it does not, how should such a case be handled?

How to Calculate UV and TopN

Taking UV as an example, querying UV for an arbitrary time span conventionally requires storing the full IP access information for each unit of time (such as each minute) in the database. This is unacceptable in terms of storage utilization. Is there a better solution?
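The following Python sketch illustrates why the conventional approach needs the full IP sets: per-minute UV counts cannot simply be added up, because the same visitor may appear in several minutes. The data layout and the `uv` helper are assumptions made for illustration, and the IP addresses are documentation addresses.

```python
def uv(ip_sets_by_minute, start, end):
    """Count unique visitor IPs between the `start` and `end` minutes, inclusive."""
    seen = set()
    for minute, ips in ip_sets_by_minute.items():
        if start <= minute <= end:
            seen |= ips  # the union removes visitors repeated across minutes
    return len(seen)


# Full IP set stored for every minute (the storage-heavy conventional approach).
ip_sets_by_minute = {
    "10:00": {"203.0.113.1", "203.0.113.2"},
    "10:01": {"203.0.113.2", "203.0.113.3"},
    "10:02": {"203.0.113.1"},
}

span_uv = uv(ip_sets_by_minute, "10:00", "10:01")
naive_sum = sum(len(ip_sets_by_minute[m]) for m in ("10:00", "10:01"))
```

Here the true UV for 10:00 to 10:01 is 3, while adding the per-minute counts gives 4. Storing only the counts is therefore not enough, and storing every IP per minute is what drives up the storage cost.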

Diagnostic Methods for Error Scenarios

For call errors that return 500, the business expects to be able to query the detailed call input parameters and other details, by time or by called service, for the moments when the 500 errors occurred. The scenario resembles log search. Requirements like this cannot be met through real-time aggregation computing and storage, so another way has to be found.

The issues mentioned above are add-on challenges to the various problems discussed earlier in this article.

Two months have gone by with James trying to find solutions to these challenges, and with the project only half done, he begins to get anxious.


This article walked through a scenario involving a rapidly growing business whose application provides users with e-commerce data statistics web services. The application adopts the common distributed Nginx + app architecture, and to tackle its performance issues and bugs, monitoring had to be set up for the application services to improve the quality of application operation. Of the options explored, the real-time computing approach turned out to be the best fit: it performs real-time aggregation, persists the much smaller aggregated results efficiently, and reduces alarm overhead. We discussed the application architecture design, its components, the challenges encountered, and the essential solution components for businesses, provided by Alibaba Cloud.

