Analyzing a web crawler (part 3)

Summary: The related process

Most of our functions’ code is unchanged. The main exception is the parse content function:


9.png


The main change to our parse detail content function is a try-except statement around the code that connects to the website’s server and hands the response to bs4 (bs4 itself only parses the HTML; it does not open the connection). This lets us tell two failure cases apart: the server returned nothing, or the server closed our connection remotely, which usually means it has recognized us as a bot rather than a person browsing from their own computer. Either way, the crawler now reports the problem instead of hanging silently, including errors that would otherwise never reach the Python console. This saves us, and any future users, from sitting and waiting for a crawl that is never going to produce anything useful.
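As a rough illustration of this kind of error handling, here is a minimal sketch. It is not the code from the screenshot: it swaps in the standard library's urllib so it runs without extra packages, and the function name fetch_html and its opener parameter are hypothetical, added only to make the example easy to try out.

```python
import socket
import urllib.error
import urllib.request


def fetch_html(url, opener=urllib.request.urlopen, timeout=10):
    """Return the page body as text, or None if the request failed.

    `opener` is a hypothetical hook (not part of the original crawler)
    that lets the function be exercised without a real network call.
    """
    try:
        with opener(url, timeout=timeout) as response:
            body = response.read()
    except ConnectionResetError:
        # The server dropped the connection, often a sign of bot detection.
        print(f"{url}: the server closed the connection")
        return None
    except (urllib.error.URLError, socket.timeout) as exc:
        # DNS failures, refused connections, HTTP errors, timeouts.
        print(f"{url}: request failed: {exc}")
        return None

    if not body:
        # The request succeeded but the server sent nothing back.
        print(f"{url}: the server returned an empty response")
        return None

    return body.decode("utf-8", errors="replace")
```

The returned text could then be passed to bs4 for parsing; a None result tells the caller to skip the page or stop the crawl.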


There is also a minor change to the main function, which calls all the smaller functions in our crawler’s class:


10.png

You may not see the change immediately, because we only added two new lines of code. We call enumerate() on an object called “r”, which contains another object called “content” that holds a list of all the contents of the story we want to parse, and we print the index that enumerate() yields on each pass. This shows us what the Python interpreter is cycling through after we hit the green run button in PyCharm.
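The idea can be tried in isolation with a small sketch. The shape of “r” below (a dictionary with a “content” list) and the helper name numbered_lines are assumptions made for illustration; the real object in the crawler may look different.

```python
# Hypothetical stand-in for the object the crawler calls "r".
r = {"content": ["First paragraph.", "Second paragraph.", "Third paragraph."]}


def numbered_lines(items):
    """Pair each item with its position, as enumerate() yields them."""
    return [f"{index}: {item}" for index, item in enumerate(items)]


# Print the index next to each piece of content so progress is visible.
for line in numbered_lines(r["content"]):
    print(line)
```

Note that enumerate() starts counting at 0 unless a different start value is passed as its second argument.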


The third and final change to our crawler is a custom DelayWait class, defined in a separate Python file, which we call from the parse_detail_content() function.

11.png


We use a DelayWait object called “DELAYWAIT”, which we define in the first few lines of our Python file. The DelayWait class itself lives in a separate Python file called “DelayWait.py”, which contains the following code:


12.png


The class relies on “netloc”, which is not a function but the network-location (host) part of a URL as returned by Python’s urlparse, to keep track of when we last sent a request to each site. Before each new request, it measures the time elapsed since the previous request to the same host and subtracts that from the delay we configured; if anything remains, the crawler sleeps for that long. This keeps our requests evenly spaced, so the server is less likely to flag our IP for a suspicious amount of network traffic.
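A per-host delay of this kind can be sketched as follows. This is a reconstruction under assumptions, not the contents of DelayWait.py: the method name wait, the default delay of 2 seconds, and the clock and sleep parameters are all hypothetical and exist here only so the logic can be followed and tested.

```python
import time
from urllib.parse import urlparse


class DelayWait:
    """Space out requests to the same host by at least `delay` seconds."""

    def __init__(self, delay=2.0, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay
        self.clock = clock          # hypothetical hook: source of "now"
        self.sleep = sleep          # hypothetical hook: how to pause
        self.last_request = {}      # host (netloc) -> time of last request

    def wait(self, url):
        # netloc is the host part of the URL, e.g. "example.com".
        netloc = urlparse(url).netloc
        last = self.last_request.get(netloc)

        if last is not None:
            # Configured delay minus the time already elapsed.
            remaining = self.delay - (self.clock() - last)
            if remaining > 0:
                self.sleep(remaining)

        # Record this request so the next call can measure from it.
        self.last_request[netloc] = self.clock()


# A module-level instance, similar in spirit to the article's DELAYWAIT.
DELAYWAIT = DelayWait(delay=2.0)
```

Calling DELAYWAIT.wait(url) just before each request is enough: the first request to a host goes out immediately, and later ones to that host are held back only for whatever part of the delay has not yet elapsed.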


