Using an Active Web Page Probing Tool

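Introduction:
Our project is built on IBatis, and the SQL behind every query contains many conditional branches.
After the last round of SQL tuning, one of those branches started throwing errors, but as a DBA I cannot realistically check every branch by hand.
So I decided to crawl every page with a spider and actively probe the application for failures.
This has two benefits:
1. It actively checks whether pages are broken (500 and 404 errors).
2. It surfaces slow pages, which can also help locate slow SQL. (A slow page can also be caused by insufficient server resources leading to network timeouts.)
Prerequisite:
The site must be a public-facing internet site where most pages can be browsed without logging in.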
First, create the table:
CREATE SEQUENCE seq_probe_id INCREMENT BY 1 START WITH 1 NOMAXVALUE NOCYCLE CACHE 2000;
  create table probe(
  id int primary key,
  host varchar(40) not null,
  path varchar(500) not null,
  state int not null,
  taskTime int not null,
  type varchar(10) not null,
  createtime date default sysdate not null
  ) ;
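Here host is the domain name, path is the page's relative path, state is the HTTP status code, taskTime is the time taken to fetch the page in milliseconds, and type is the resource type (html, htm, jpg, and so on).
Program structure
The program has three main stages, connected by three queues in a producer-consumer pattern.
1. Connect. Take a target from the connect queue, fetch the page over a Socket, and put it on the parse queue.
2. Parse. Take a page from the parse queue, use a regular expression to extract the valid links it contains, and put those links back on the connect queue. Then put the parsed page on the persistence queue.
3. Persist. Write the contents of the persistence queue to the database so they can be queried.
The three stages run in parallel, and each stage can itself run concurrently.
In practice, though, parsing and persistence can each run on a single thread.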
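The three-queue layout described above can be sketched in isolation as follows. This is a minimal illustration, not the program below: the class name PipelineSketch, the STOP sentinel, and the string-based stage bodies are all assumptions made for the example. It also leaves out the feedback step in which the parse stage puts newly found links back on the connect queue.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.UnaryOperator;

// Hypothetical sketch of a three-stage producer-consumer pipeline.
class PipelineSketch {
    // Sentinel that tells a stage to stop (an assumption of this sketch;
    // the real program simply loops forever).
    static final String STOP = "__STOP__";

    // Generic stage: take from 'in', apply 'step', put the result on 'out' (if any).
    static void pump(BlockingQueue<String> in, BlockingQueue<String> out, UnaryOperator<String> step) {
        try {
            while (true) {
                String item = in.take();
                if (STOP.equals(item)) {
                    if (out != null) out.put(STOP); // pass the stop signal downstream
                    return;
                }
                String result = step.apply(item);
                if (out != null) out.put(result);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    static List<String> run(List<String> paths) {
        BlockingQueue<String> connectQueue = new LinkedBlockingQueue<>();
        BlockingQueue<String> parseQueue = new LinkedBlockingQueue<>();
        BlockingQueue<String> persistQueue = new LinkedBlockingQueue<>();
        List<String> stored = new CopyOnWriteArrayList<>(); // stands in for the database

        ExecutorService pool = Executors.newFixedThreadPool(3);
        pool.submit(() -> pump(connectQueue, parseQueue, p -> "fetched " + p));          // stage 1: connect
        pool.submit(() -> pump(parseQueue, persistQueue, p -> p + " parsed"));           // stage 2: parse
        pool.submit(() -> pump(persistQueue, null, p -> { stored.add(p); return p; }));  // stage 3: persist
        try {
            for (String p : paths) connectQueue.put(p);
            connectQueue.put(STOP);
            pool.shutdown();
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("pipeline did not finish");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return stored;
    }

    public static void main(String[] args) {
        System.out.println(run(java.util.Arrays.asList("/a.html", "/b.html")));
    }
}
```

Each stage only blocks on its own input queue, so a slow stage shows up as a growing queue in front of it. The real program reports exactly that in its status line.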
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.net.InetAddress;
import java.net.Socket;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentSkipListSet;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Probe {
private static final BlockingQueue<Task> CONNECTLIST = new LinkedBlockingQueue<Task>();
private static final BlockingQueue<Task> PARSELIST = new LinkedBlockingQueue<Task>();
private static final BlockingQueue<Task> PERSISTENCELIST = new LinkedBlockingQueue<Task>();
private static ExecutorService CONNECTTHREADPOOL;
private static ExecutorService PARSETHREADPOOL;
private static ExecutorService PERSISTENCETHREADPOOL;
private static final List<String> DOMAINLIST = new CopyOnWriteArrayList<>();
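static {
// Every ConnectHandler loops forever, so each one permanently occupies a pool thread.
CONNECTTHREADPOOL = Executors.newFixedThreadPool(200);
PARSETHREADPOOL = Executors.newSingleThreadExecutor();
PERSISTENCETHREADPOOL = Executors.newFixedThreadPool(1);
// Whitelisted domain; "example.com" is a placeholder for the site to crawl.
DOMAINLIST.add("example.com");
}
public static void main(String args[]) throws Exception {
long start = System.currentTimeMillis();
// Seed page; use the same placeholder domain as the whitelist.
CONNECTLIST.put(new Task("example.com", 80, "/static/index.html"));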
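// Submit one handler per pool thread (200). Handlers never return, so any
// extra submissions would wait in the executor's queue forever.
for (int i = 0; i < 200; i++) {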
CONNECTTHREADPOOL.submit(new ConnectHandler(CONNECTLIST, PARSELIST));
}
PARSETHREADPOOL.submit(new ParseHandler(CONNECTLIST, PARSELIST, PERSISTENCELIST, DOMAINLIST));
PERSISTENCETHREADPOOL.submit(new PersistenceHandler(PERSISTENCELIST));
while (true) {
Thread.sleep(1000);
long end = System.currentTimeMillis();
float interval = ((end - start) / 1000);
int connectTotal = ConnectHandler.GETCOUNT();
int parseTotal = ParseHandler.GETCOUNT();
int persistenceTotal = PersistenceHandler.GETCOUNT();
int connectps = Math.round(connectTotal / interval);
int parseps = Math.round(parseTotal / interval);
int persistenceps = Math.round(persistenceTotal / interval);
System.out.print("\rconnected: " + connectTotal + " \tconnect/s: " + connectps + "\tconnect queue: " + CONNECTLIST.size()
+ " \tparsed: " + parseTotal + " \tparse/s: " + parseps + "\tparse queue: " + PARSELIST.size() + " \tpersisted: "
+ persistenceTotal + " \tpersist/s: " + persistenceps + "\tpersist queue: " + PERSISTENCELIST.size());
}
}
}
class Task {
public Task() {
}
public void init(String host, int port, String path) {
this.setCurrentPath(path);
this.host = host;
this.port = port;
}
public Task(String host, int port, String path) {
init(host, port, path);
}
private String host;
private int port;
private String currentPath;
private long taskTime;
private String type;
private String content;
private int state;
public int getState() {
return state;
}
public void setState(int state) {
this.state = state;
}
public String getCurrentPath() {
return currentPath;
}
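public void setCurrentPath(String currentPath) {
this.currentPath = currentPath;
// Derive the type from the file extension, ignoring any query string.
int queryStart = currentPath.indexOf('?');
String pathOnly = queryStart != -1 ? currentPath.substring(0, queryStart) : currentPath;
int dot = pathOnly.lastIndexOf('.');
// No extension in the last path segment: fall back to "unknown", which still fits type varchar(10) not null.
this.type = dot > pathOnly.lastIndexOf('/') ? pathOnly.substring(dot + 1) : "unknown";
}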
public long getTaskTime() {
return taskTime;
}
public void setTaskTime(long taskTime) {
this.taskTime = taskTime;
}
public String getType() {
return type;
}
public void setType(String type) {
this.type = type;
}
public String getHost() {
return host;
}
public int getPort() {
return port;
}
public String getContent() {
return content;
}
public void setContent(String content) {
this.content = content;
}
}
class ParseHandler implements Runnable {
private static Set<String> SET = new ConcurrentSkipListSet<String>();
public static int GETCOUNT() {
return COUNT.get();
}
private static final AtomicInteger COUNT = new AtomicInteger();
private BlockingQueue<Task> connectlist;
private BlockingQueue<Task> parselist;
private BlockingQueue<Task> persistencelist;
List<String> domainlist;
private interface Filter {
void doFilter(Task fatherTask, Task newTask, String path, Filter chain);
}
private class FilterChain implements Filter {
private List<Filter> list = new ArrayList<Filter>();
{
addFilter(new TwoLevel());
addFilter(new OneLevel());
addFilter(new FullPath());
addFilter(new Root());
addFilter(new Default());
}
private void addFilter(Filter filter) {
list.add(filter);
}
private Iterator<Filter> it = list.iterator();
@Override
public void doFilter(Task fatherTask, Task newTask, String path, Filter chain) {
if (it.hasNext()) {
it.next().doFilter(fatherTask, newTask, path, chain);
}
}
}
private class TwoLevel implements Filter {
@Override
public void doFilter(Task fatherTask, Task newTask, String path, Filter chain) {
if (path.startsWith("../../")) {
// Drop the file name and two directories from the parent path, then append the rest of the link.
String prefix = getPrefix(fatherTask.getCurrentPath(), 3);
String rest = path.substring("../../".length());
newTask.init(fatherTask.getHost(), fatherTask.getPort(), (prefix.endsWith("/") ? prefix : prefix + "/") + rest);
} else {
chain.doFilter(fatherTask, newTask, path, chain);
}
}
}
private class OneLevel implements Filter {
@Override
public void doFilter(Task fatherTask, Task newTask, String path, Filter chain) {
if (path.startsWith("../")) {
// Drop the file name and one directory from the parent path, then append the rest of the link.
String prefix = getPrefix(fatherTask.getCurrentPath(), 2);
String rest = path.substring("../".length());
newTask.init(fatherTask.getHost(), fatherTask.getPort(), (prefix.endsWith("/") ? prefix : prefix + "/") + rest);
} else {
chain.doFilter(fatherTask, newTask, path, chain);
}
}
}
private class FullPath implements Filter {
@Override
public void doFilter(Task fatherTask, Task newTask, String path, Filter chain) {
if (path.startsWith("http://")) {
Iterator<String> it = domainlist.iterator();
boolean flag = false;
while (it.hasNext()) {
String domain = it.next();
if (path.startsWith("http://" + domain + "/")) {
newTask.init(domain, fatherTask.getPort(), path.replace("http://" + domain + "/", "/"));
flag = true;
break;
}
}
if (!flag) {
newTask = null;
}
} else {
chain.doFilter(fatherTask, newTask, path, chain);
}
}
}
private class Root implements Filter {
@Override
public void doFilter(Task fatherTask, Task newTask, String path, Filter chain) {
if (path.startsWith("/")) {
newTask.init(fatherTask.getHost(), fatherTask.getPort(), path);
} else {
chain.doFilter(fatherTask, newTask, path, chain);
}
}
}
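private class Default implements Filter {
@Override
public void doFilter(Task fatherTask, Task newTask, String path, Filter chain) {
// Plain relative link: resolve it against the parent page's directory.
String prefix = getPrefix(fatherTask.getCurrentPath(), 1);
// getPrefix returns "/" for the root, so avoid producing a double slash.
newTask.init(fatherTask.getHost(), fatherTask.getPort(), (prefix.endsWith("/") ? prefix : prefix + "/") + path);
}
}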
public ParseHandler(BlockingQueue<Task> connectlist, BlockingQueue<Task> parselist,
BlockingQueue<Task> persistencelist, List<String> domainlist) {
this.connectlist = connectlist;
this.parselist = parselist;
this.persistencelist = persistencelist;
this.domainlist = domainlist;
}
private Pattern pattern = Pattern.compile("\"[^\"]+\\.htm[^\"]*\"");
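private void handler() {
try {
Task task = parselist.take();
parseTaskState(task);
if (200 == task.getState()) {
Matcher matcher = pattern.matcher(task.getContent());
while (matcher.find()) {
// Strip the surrounding quotes captured by the pattern.
String path = matcher.group();
path = path.substring(1, path.length() - 1);
// Reject links with whitespace or parentheses. A colon is accepted only in
// absolute http:// links, so the FullPath filter can check them against the whitelist.
boolean colonAllowed = !path.contains(":") || path.startsWith("http://");
if (!path.contains(" ") && !path.contains("\t") && !path.contains("(") && !path.contains(")")
&& colonAllowed) {
// add() returns false if this link has already been seen.
if (SET.add(path)) {
createNewTask(task, path);
}
}
}
}
task.setContent(null);
persistencelist.put(task);
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
private void parseTaskState(Task task) {
// A normal response starts with a status line such as "HTTP/1.1 200 OK", so the code sits at offsets 9-11.
// The request is sent as HTTP/1.0, so accept either minor version in the reply.
if (task.getContent().startsWith("HTTP/1.")) {
task.setState(Integer.parseInt(task.getContent().substring(9, 12)));
} else {
// Site-specific fallback kept from the original; adjust the offsets for the target server.
task.setState(Integer.parseInt(task.getContent().substring(19, 22)));
}
}
/**
* Builds a task for a link found on a page and queues it for fetching.
*
* @param fatherTask the page on which the link was found
* @param path the link as it appears in the page
* @throws Exception if queuing the new task is interrupted
*/
private void createNewTask(Task fatherTask, String path) throws Exception {
Task newTask = new Task();
FilterChain filterchain = new FilterChain();
filterchain.doFilter(fatherTask, newTask, path, filterchain);
// A filter that rejects the link (FullPath for a non-whitelisted domain) leaves the task
// uninitialized, so its host stays null. Setting the parameter to null inside the filter
// does not change this local reference, so check the host instead.
if (newTask.getHost() != null) {
connectlist.put(newTask);
}
}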
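private String getPrefix(String s, int count) {
// Remove the last 'count' path segments from s.
String prefix = s;
while (count > 0) {
int slash = prefix.lastIndexOf("/");
if (slash < 0) {
// The link climbs above the site root; stop at the root instead of throwing.
prefix = "";
break;
}
prefix = prefix.substring(0, slash);
count--;
}
return "".equals(prefix) ? "/" : prefix;
}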
@Override
public void run() {
while (true) {
this.handler();
COUNT.addAndGet(1);
}
}
}
class ConnectHandler implements Runnable {
public static int GETCOUNT() {
return COUNT.get();
}
private static final AtomicInteger COUNT = new AtomicInteger();
private BlockingQueue<Task> connectlist;
private BlockingQueue<Task> parselist;
public ConnectHandler(BlockingQueue<Task> connectlist, BlockingQueue<Task> parselist) {
this.connectlist = connectlist;
this.parselist = parselist;
}
private void handler() {
try {
Task task = connectlist.take();
long start = System.currentTimeMillis();
getHtml(task);
long end = System.currentTimeMillis();
task.setTaskTime(end - start);
parselist.put(task);
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
private void getHtml(Task task) throws Exception {
StringBuilder sb = new StringBuilder(2048);
InetAddress addr = InetAddress.getByName(task.getHost());
// Open a socket; try-with-resources closes it (and its streams) even if the request fails.
try (Socket socket = new Socket(addr, task.getPort())) {
// Send a minimal HTTP/1.0 request by hand: request line, headers, then a blank line.
BufferedWriter wr = new BufferedWriter(new OutputStreamWriter(socket.getOutputStream(), "UTF-8"));
wr.write("GET " + task.getCurrentPath() + " HTTP/1.0\r\n");
wr.write("Host: " + task.getHost() + "\r\n");
wr.write("Accept: */*\r\n");
wr.write("\r\n");
wr.flush();
// Read the whole response (status line, headers, body) until the server closes the connection.
// Line terminators are dropped, so the content is one long string.
BufferedReader rd = new BufferedReader(new InputStreamReader(socket.getInputStream()));
String line;
while ((line = rd.readLine()) != null) {
sb.append(line);
}
}
task.setContent(sb.toString());
}
@Override
public void run() {
while (true) {
this.handler();
COUNT.addAndGet(1);
}
}
}
class PersistenceHandler implements Runnable {
static {
try {
Class.forName("oracle.jdbc.OracleDriver");
} catch (ClassNotFoundException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
public static int GETCOUNT() {
return COUNT.get();
}
private static final AtomicInteger COUNT = new AtomicInteger();
private BlockingQueue<Task> persistencelist;
public PersistenceHandler(BlockingQueue<Task> persistencelist) {
this.persistencelist = persistencelist;
try {
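// The thin-driver URL needs an '@' before the host: jdbc:oracle:thin:@host:port:sid
conn = DriverManager.getConnection("jdbc:oracle:thin:@127.0.0.1:1521:orcl", "edmond", "edmond");
// handler() commits explicitly after each insert, so turn auto-commit off.
conn.setAutoCommit(false);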
ps = conn
.prepareStatement("insert into probe(id,host,path,state,tasktime,type) values(seq_probe_id.nextval,?,?,?,?,?)");
} catch (SQLException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
private Connection conn;
private PreparedStatement ps;
@Override
public void run() {
while (true) {
this.handler();
COUNT.addAndGet(1);
}
}
private void handler() {
try {
Task task = persistencelist.take();
ps.setString(1, task.getHost());
ps.setString(2, task.getCurrentPath());
ps.setInt(3, task.getState());
ps.setLong(4, task.getTaskTime());
ps.setString(5, task.getType());
ps.executeUpdate();
conn.commit();
} catch (InterruptedException e) {
e.printStackTrace();
} catch (SQLException e) {
e.printStackTrace();
}
}
}
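ParseHandler uses the chain-of-responsibility pattern:
TwoLevel handles links that start with ../../ (../../sucai/sucai.htm)
OneLevel handles links that start with ../ (../sucai/sucai.htm)
FullPath handles absolute URLs (http://<domain>/sucai/sucai.htm)
Root handles links that start with / (/sucai/sucai.htm)
Default handles ordinary relative links (sucai.htm)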
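For comparison, the same five link forms can be resolved with the JDK's java.net.URI, which applies the standard relative-reference rules. This is only a sketch of an alternative, not what the program above does: the class LinkResolver and the host example.com are assumptions for illustration, and it expects the parent page's full URL rather than separate host, port, and path values.

```java
import java.net.URI;

// Hypothetical helper, not part of the program in this article.
class LinkResolver {
    /**
     * Resolves a link found on pageUrl into an absolute URL.
     * URI.create throws IllegalArgumentException for input with illegal
     * characters such as spaces, so callers should filter those first.
     */
    static String resolve(String pageUrl, String link) {
        return URI.create(pageUrl).resolve(link).toString();
    }

    public static void main(String[] args) {
        String page = "http://example.com/a/b/page.htm";
        String[] links = {
            "../../sucai/sucai.htm",              // TwoLevel case
            "../sucai/sucai.htm",                 // OneLevel case
            "http://example.com/sucai/sucai.htm", // FullPath case
            "/sucai/sucai.htm",                   // Root case
            "sucai.htm"                           // Default case
        };
        for (String link : links) {
            System.out.println(link + " -> " + resolve(page, link));
        }
    }
}
```

The hand-written filter chain keeps each case explicit and easy to tune for one site. A library resolver covers more edge cases but gives less control over which links are accepted.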
The FullPath filter in ParseHandler needs a whitelist.
This keeps the crawler within a fixed set of domains.
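A whitelist check of this kind could look roughly like the sketch below. The class DomainWhitelist and the sample hosts are assumptions for illustration. Unlike the prefix comparison in FullPath, this version compares the parsed host name, so it also accepts URLs that carry an explicit port.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of the program in this article.
class DomainWhitelist {
    private final Set<String> allowed = new HashSet<>();

    DomainWhitelist(String... hosts) {
        for (String h : hosts) {
            allowed.add(h.toLowerCase(Locale.ROOT));
        }
    }

    /** Returns true only if the URL is absolute and its host is exactly one of the allowed hosts. */
    boolean allows(String url) {
        try {
            String host = URI.create(url).getHost();
            return host != null && allowed.contains(host.toLowerCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            return false; // malformed URL: treat as not allowed
        }
    }

    public static void main(String[] args) {
        DomainWhitelist whitelist = new DomainWhitelist("example.com");
        System.out.println(whitelist.allows("http://example.com/sucai/sucai.htm"));
        System.out.println(whitelist.allows("http://other.example.org/index.htm"));
    }
}
```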
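The status-code parsing in ParseHandler.parseTaskState may need adjusting for the actual environment.
For example, for a missing page (404) the server may return an error page rather than the usual HTTP status code.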
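One way to make the status parsing less dependent on fixed offsets is to match the status line with a regular expression and return a sentinel when no status line is present. This is a sketch under assumptions: the class StatusLineParser and the -1 sentinel are illustrative, and it does not detect an error page that is served with a 200 status.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the program in this article.
class StatusLineParser {
    // An HTTP/1.x status line looks like: HTTP/<major>.<minor> <3-digit code> <reason>
    private static final Pattern STATUS_LINE = Pattern.compile("HTTP/\\d\\.\\d (\\d{3})");

    /** Returns the status code, or -1 if the response does not start with a status line. */
    static int parseStatus(String response) {
        if (response == null) {
            return -1;
        }
        Matcher m = STATUS_LINE.matcher(response);
        // lookingAt() anchors the match at the start of the response.
        return m.lookingAt() ? Integer.parseInt(m.group(1)) : -1;
    }

    public static void main(String[] args) {
        System.out.println(parseStatus("HTTP/1.1 200 OK"));
        System.out.println(parseStatus("HTTP/1.0 404 Not Found"));
        System.out.println(parseStatus("<html>error page</html>"));
    }
}
```

Responses that yield -1 can be stored with that state and reviewed separately, rather than being lost to a parsing exception.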
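This first version only implements the basic functionality, and its error handling is incomplete,
so it only works on the specific domains it was tailored to and is not general-purpose. It will be improved step by step.
For the latest content, see the author's GitHub page: http://qaseven.github.io/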