Using XPATH and HTML Cleaner to parse HTML / XML



太阳火神的美丽人生 (http://blog.csdn.net/opengl_es)

This article follows the "Attribution - NonCommercial - ShareAlike" Creative Commons license.

Please keep this sentence when reposting: 太阳火神的美丽人生 - this blog focuses on agile development and on research into mobile and IoT devices: iOS, Android, Html5, Arduino, pcDuino. Otherwise, articles from this blog may not be reposted or re-reposted. Thank you for your cooperation.



Using XPATH and HTML Cleaner to parse HTML / XML

JANUARY 5, 2010
tags:   android,   examples,   HTML,   parse,   scraping,   XML,   XPATH

Hey everyone,

So something that I’ve found to be extremely useful (especially in web related applications) is the ability to retrieve HTML from websites and parse their HTML for data or whatever you may be looking for (in my case it is almost always data).


I actually use this technique to do the real time stock/option imports for my Black-Scholes/Implied Volatility applications, so if you’re looking for an example on how to retrieve and parse HTML and run “queries” over it using, say, XPATH, then this post is for you.

Now, before we begin, in order to do this you will have to reference an external JAR in your project's build path. The JAR that I use comes from HtmlCleaner, which even gives you an example of how to use it (see the HtmlCleaner Example on their site), but in addition to that I'll show you an example of how I use it.

import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

import javax.xml.parsers.ParserConfigurationException;

import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.htmlcleaner.XPatherException;
import org.xml.sax.SAXException;

public class OptionScraper {

    // EXAMPLE XPATH QUERIES IN THE FORM OF STRINGS - WILL BE USED LATER
    private static final String NAME_XPATH = "//div[@class='yfi_quote']/div[@class='hd']/h2";

    private static final String TIME_XPATH = "//table[@id='time_table']/tbody/tr/td[@class='yfnc_tabledata1']";

    private static final String PRICE_XPATH = "//table[@id='price_table']//tr//span";

    // TAGNODE OBJECT, ITS USE WILL COME IN LATER
    private static TagNode node;

    // A METHOD THAT HELPS ME RETRIEVE THE STOCK OPTION'S DATA BASED OFF THE NAME (I.E. GOUAA IS ONE OF GOOGLE'S STOCK OPTIONS)
    public static Option getOptionFromName(String name) throws XPatherException, ParserConfigurationException, SAXException, IOException {

        // THE URL WHOSE HTML I WANT TO RETRIEVE AND PARSE
        String option_url = "http://finance.yahoo.com/q?s=" + name.toUpperCase();

        // THE OPTION OBJECT (THE AUTHOR'S OWN DATA CLASS) THAT COLLECTS THE SCRAPED VALUES;
        // A NAME-TAKING CONSTRUCTOR IS ASSUMED HERE
        Option o = new Option(name);

        // THIS IS WHERE THE HTMLCLEANER COMES IN, I INITIALIZE IT HERE
        HtmlCleaner cleaner = new HtmlCleaner();
        CleanerProperties props = cleaner.getProperties();
        props.setAllowHtmlInsideAttributes(true);
        props.setAllowMultiWordAttributes(true);
        props.setRecognizeUnicodeChars(true);
        props.setOmitComments(true);

        // OPEN A CONNECTION TO THE DESIRED URL
        URL url = new URL(option_url);
        URLConnection conn = url.openConnection();

        // USE THE CLEANER TO "CLEAN" THE HTML AND RETURN IT AS A TAGNODE OBJECT
        node = cleaner.clean(new InputStreamReader(conn.getInputStream()));

        // ONCE THE HTML IS CLEANED, THEN YOU CAN RUN YOUR XPATH EXPRESSIONS ON THE NODE,
        // WHICH WILL THEN RETURN AN ARRAY OF TAGNODE OBJECTS (THESE ARE RETURNED AS OBJECTS BUT GET CASTED BELOW)
        Object[] info_nodes = node.evaluateXPath(NAME_XPATH);
        Object[] time_nodes = node.evaluateXPath(TIME_XPATH);
        Object[] price_nodes = node.evaluateXPath(PRICE_XPATH);

        // HERE I JUST DO A SIMPLE CHECK TO MAKE SURE THAT MY XPATH WAS CORRECT AND THAT AN ACTUAL NODE(S) WAS RETURNED
        if (info_nodes.length > 0) {
            // CASTED TO A TAGNODE
            TagNode info_node = (TagNode) info_nodes[0];
            // HOW TO RETRIEVE THE CONTENTS AS A STRING
            String info = info_node.getChildren().iterator().next().toString().trim();

            // SOME METHOD THAT PROCESSES THE STRING OF INFORMATION (IN MY CASE, THIS WAS THE STOCK QUOTE, ETC)
            processInfoNode(o, info);
        }

        if (time_nodes.length > 0) {
            TagNode time_node = (TagNode) time_nodes[0];
            String date = time_node.getChildren().iterator().next().toString().trim();

            // DATE RETURNED IN 15-JAN-10 FORMAT, SO THIS IS SOME METHOD I WROTE TO JUST PARSE THAT STRING INTO THE FORMAT THAT I USE
            processDateNode(o, date);
        }

        if (price_nodes.length > 0) {
            TagNode price_node = (TagNode) price_nodes[0];
            double price = Double.parseDouble(price_node.getChildren().iterator().next().toString().trim());
            o.setPremium(price);
        }

        return o;
    }
}
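
For completeness, here's a minimal sketch of how a method like this might be called. The Option class (along with processInfoNode and processDateNode) is the author's own helper code and isn't shown in the post, so the getPremium() accessor below is an assumption for illustration only:

public class OptionScraperDemo {

    public static void main(String[] args) throws Exception {
        // ON ANDROID THIS SHOULD RUN OFF THE UI THREAD, SINCE IT DOES NETWORK I/O
        Option o = OptionScraper.getOptionFromName("GOUAA");

        // getPremium() IS ASSUMED TO EXIST ON THE AUTHOR'S Option CLASS (ONLY setPremium()
        // APPEARS IN THE POST), SO THIS LINE IS FOR ILLUSTRATION ONLY
        System.out.println("Premium: " + o.getPremium());
    }
}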

So that's it! Once you include the JAR in your build path, everything else is pretty easy! It's a great tool to use. It does require some knowledge of XPATH, but XPATH isn't too hard to pick up and is useful to know, so if you aren't familiar with it, take a look at the link.
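
To give a feel for the syntax, here's a reading of the first expression from the example above, piece by piece (this only restates the query that is already in the code):

// A READING OF THE NAME_XPATH QUERY USED ABOVE:
//   //div[@class='yfi_quote']  -> ANY <div> ANYWHERE IN THE DOCUMENT WITH class="yfi_quote"
//   /div[@class='hd']          -> A DIRECT CHILD <div> OF THAT NODE WITH class="hd"
//   /h2                        -> A DIRECT CHILD <h2> OF THAT <div>
private static final String NAME_XPATH = "//div[@class='yfi_quote']/div[@class='hd']/h2";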

Now, a warning to everyone. It's documented that HtmlCleaner's XPATH support is not complete, in the sense that only "basic" XPATH is recognized. What's excluded? For instance, you can't use any of the "axes" operators (e.g. parent, ancestor, following, following-sibling, etc.), but in my experience everything else is fair game. Yes, it sucks, and many times it can make your life a little bit harder, but usually it just requires you to be a tad more clever with your XPATH expressions before you can pull the desired information.
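
Since the axis operators are off the table, one workaround is to do the "axis" step in Java instead: use a basic XPATH query to land on a nearby node, then walk the TagNode tree with methods like getParent(). A minimal sketch, where the XPATH string and the 'price' id are hypothetical and for illustration only:

import org.htmlcleaner.TagNode;
import org.htmlcleaner.XPatherException;

public class XPathWorkaround {

    // INSTEAD OF "//span[@id='price']/parent::td" (AXES AREN'T SUPPORTED), QUERY FOR THE
    // CHILD NODE WITH BASIC XPATH AND WALK UP TO ITS PARENT IN JAVA
    public static String parentCellText(TagNode root) throws XPatherException {
        Object[] spans = root.evaluateXPath("//span[@id='price']");
        if (spans.length == 0) {
            return null;
        }
        TagNode span = (TagNode) spans[0];
        TagNode parentCell = span.getParent(); // THE EQUIVALENT OF THE parent:: AXIS, DONE IN CODE
        return parentCell.getText().toString().trim();
    }
}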

And of course, this technique works for XML documents as well!
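
As a quick illustration of the XML case, the same clean-then-query flow applies, since HtmlCleaner accepts well-formed XML as input too. A minimal sketch, where the feed URL and the //item/title path are hypothetical:

import java.io.InputStreamReader;
import java.net.URL;

import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

public class XmlFeedExample {

    public static void main(String[] args) throws Exception {
        // THE FEED URL AND THE //item/title PATH ARE HYPOTHETICAL, FOR ILLUSTRATION ONLY
        URL feed = new URL("http://example.com/feed.xml");

        HtmlCleaner cleaner = new HtmlCleaner();
        TagNode root = cleaner.clean(new InputStreamReader(feed.openStream()));

        // RUN A BASIC XPATH QUERY OVER THE CLEANED DOCUMENT, JUST LIKE WITH HTML
        Object[] titles = root.evaluateXPath("//item/title");
        for (Object t : titles) {
            TagNode title = (TagNode) t;
            System.out.println(title.getText().toString().trim());
        }
    }
}

For strictly well-formed XML you could also reach for a standard XML parser with javax.xml.xpath, but reusing HtmlCleaner keeps a single code path for both HTML and XML.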

Hope this was helpful to everyone. Let me know if you’re confused anywhere.

- jwei


