XML document processing in Java using XPath and XSLT


Discover how XPath and XSLT can significantly reduce the complexity of your Java code when handling XML documents

The Extensible Markup Language (XML) is certainly one of the hottest technologies at the moment. While the concept of markup languages is not new, XML seems especially attractive to Java and Internet programmers. The Java API for XML Parsing (JAXP; see Resources), having recently been defined through the Java Community Process, promises to provide a common interface for accessing XML documents. The W3C has defined the so-called Document Object Model (DOM), which provides a standard interface for working with an XML document in a tree hierarchy, whereas the Simple API for XML (SAX) lets a program parse an XML document sequentially, based on an event handling model. Both of these standards (SAX being a de facto standard) complement the JAXP. Together, these three APIs provide sufficient support for dealing with XML documents in Java, and numerous books on the market describe their use.


This article introduces a way to handle XML documents that goes beyond the standard Java APIs for manipulating XML. We'll see that in many cases XPath and XSLT provide simpler, more elegant ways of solving application problems. In some simple samples, we will compare a pure Java/XML solution with one that utilizes XPath and/or XSLT.

Both XSLT and XPath are part of the Extensible Stylesheet Language (XSL) specification (see Resources). XSL consists of three parts: the XSL language specification itself, XSL Transformations (XSLT), and XML Path Language (XPath). XSL is a language for transforming XML documents; it includes a definition -- Formatting Objects -- of how XML documents can be formatted for presentation. XSLT specifies a vocabulary for transforming one XML document into another. You can consider XSLT to be XSL minus Formatting Objects. The XPath language addresses specific parts of XML documents and is intended to be used from within an XSLT stylesheet.

For the purposes of this article, it is assumed that you are familiar with the basics of XML and XSLT, as well as the DOM APIs. (For information and tutorials on these topics, see Resources.)

Note: This article's code samples were compiled and tested with the Apache Xerces XML parser and the Apache Xalan XSL processor (see Resources).

The problem

Many articles and papers that deal with XML state that it is the perfect vehicle to accomplish a good design practice in Web programming: the Model-View-Controller pattern (MVC), or, in simpler terms, the separation of application data from presentation data. If the application data is formatted in XML, it can easily be bound -- typically in a servlet or JavaServer Page -- to, say, HTML templates by using an XSL stylesheet.

But XML can do much more than merely help with model-view separation for an application's frontend. We currently observe more and more widespread use of components (for example, components developed using the EJB standard) that can be used to assemble applications, thus enhancing developer productivity. Component reusability can be improved by formatting the data that components deal with in a standard way. Indeed, we can expect to see more and more published components that use XML to describe their interfaces.

Because XML-formatted data is language-neutral, it becomes usable in cases where the client of a given application service is not known, or when it must not have any dependencies on the server. For example, in B2B environments, it may not be acceptable for two parties to have dependencies on concrete Java object interfaces for their data exchange. New technologies like the Simple Object Access Protocol (SOAP) (see Resources) address these requirements.

All of these cases have one thing in common: data is stored in XML documents and needs to be manipulated by an application. For example, an application that uses various components from different vendors will most likely have to change the structure of the (XML) data to make it fit the need of the application or adhere to a given standard.

Code written using the Java APIs mentioned above would certainly do this. Moreover, there are more and more tools available with which you can turn an XML document into a JavaBean and vice versa, which makes it easier to handle the data from within a Java program. However, in many cases, the application, or at least a part of it, merely processes one or more XML documents as input and converts them into a different XML format as output. Using stylesheets in those cases is a viable alternative, as we will see later in this article.

Use XPath to locate nodes in an XML document

As stated above, the XPath language is used to locate certain parts of an XML document. As such, it's meant to be used by an XSLT stylesheet, but nothing keeps us from using it in our Java program in order to avoid lengthy iteration over a DOM element hierarchy. Indeed, we can let the XSLT/XPath processor do the work for us. Let's take a look at how this works.

Let us assume that we have an application scenario in which a source XML document is presented to the user (possibly after being processed by a stylesheet). The user makes updates to the data and, to save network bandwidth, sends only the updated records back to the application. The application looks for the XML fragment in the source document that needs to be updated and replaces it with the new data.

We will create a little sample that will help you understand the various options. For this example, we assume that the application deals with address records in an addressbook. A sample addressbook document looks like this:

<addressbook>
   <address>
      <addressee>John Smith</addressee>
      <streetaddress>250 18th Ave SE</streetaddress>
      <city>Rochester</city>
      <state>MN</state>
      <postalCode>55902</postalCode>
   </address>
   <address>
      <addressee>Bill Morris</addressee>
      <streetaddress>1234 Center Lane NW</streetaddress>
      <city>St. Paul</city>
      <state>MN</state>
      <postalCode>55123</postalCode>
   </address>
</addressbook>

The application (possibly, though not necessarily, a servlet) keeps an instance of the addressbook in memory as a DOM Document object. When the user changes an address, the application's frontend sends it only the updated <address> element.

The <addressee> element is used to uniquely identify an address; it serves as the primary key. This would not make a lot of sense for a real application, but we do it here to keep things simple.
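To make the scenario concrete, here is a minimal sketch of the replacement step itself, assuming the outdated <address> node has already been located. The class name AddressReplacer and its parse() and replaceAddress() helpers are hypothetical names used for illustration only; the DOM calls (importNode and replaceChild) are part of the standard org.w3c.dom API. The updated element normally arrives in a separate Document, so it has to be imported into the source document before it can be attached.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class AddressReplacer {

    /** Parses an XML string into a DOM Document (hypothetical helper for this sketch). */
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    /**
     * Replaces oldAddress, which must belong to source, with a copy of updated.
     * Returns the node that is now part of the source tree.
     */
    static Node replaceAddress(Document source, Node oldAddress, Element updated) {
        // Deep-copy the updated element so that it is owned by the source document.
        Node imported = source.importNode(updated, true);
        oldAddress.getParentNode().replaceChild(imported, oldAddress);
        return imported;
    }

    public static void main(String[] args) throws Exception {
        Document book = parse(
            "<addressbook><address><addressee>John Smith</addressee>"
            + "<city>Rochester</city></address></addressbook>");
        // Hypothetical update sent by the frontend.
        Document update = parse(
            "<address><addressee>John Smith</addressee><city>Minneapolis</city></address>");

        // For illustration only: take the first <address>. A real lookup
        // would match on the <addressee> value.
        Node old = book.getElementsByTagName("address").item(0);
        replaceAddress(book, old, update.getDocumentElement());

        System.out.println(book.getElementsByTagName("city").item(0).getTextContent());
    }
}
```

The sketch deliberately leaves out the lookup by addressee, which is the harder part of the problem.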

We now need to write some Java code that will help us identify the <address> element in the source tree that needs to be replaced with the updated element. The findAddress() method below shows how that can be accomplished. Please note that, to keep the sample short, we've left out the appropriate error handling.

public Node findAddress(String name, Document source) {
   Element root = source.getDocumentElement();
   NodeList nl = root.getChildNodes();
   // iterate over all address nodes and find the one that has the correct addressee
   for (int i=0;i<nl.getLength(); i++) {
      Node n = nl.item(i);
      if ((n.getNodeType() == Node.ELEMENT_NODE) && 
          (((Element)n).getTagName().equals("address"))) {
         // we have an address node, now we need to find the 
         // 'addressee' child
         Node addressee = ((Element)n).getElementsByTagName("addressee").item(0);
         // there is the addressee, now get the text node and compare
         Node child = addressee.getChildNodes().item(0);
         // walk the children; an empty <addressee/> has none, so test before use
         while (child != null) {
            if ((child.getNodeType()==Node.TEXT_NODE) &&
                (((Text)child).getData().equals(name))) {
               return n;
            }
            child = child.getNextSibling();
         }
      }
   }
   return null;
}

The code above could most likely be optimized, but it is obvious that iterating over the DOM tree can be tedious and error prone. Now let's look at how the target node can be located by using a simple XPath statement. The statement could look like this:

//address[child::addressee[text() = 'John Smith']]

We can now rewrite our previous method. This time, we use the XPath statement to find the desired node:

public Node findAddress(String name, Document source) throws Exception {
   // need to recreate a few helper objects
   XMLParserLiaison xpathSupport = new XMLParserLiaisonDefault();
   XPathProcessor xpathParser = new XPathProcessorImpl(xpathSupport);
   PrefixResolver prefixResolver = new PrefixResolverDefault(source.getDocumentElement());
   // create the XPath and initialize it
   XPath xp = new XPath();
   String xpString = "//address[child::addressee[text() = '"+name+"']]";
   xpathParser.initXPath(xp, xpString, prefixResolver);
   // now execute the XPath select statement
   XObject list = xp.execute(xpathSupport, source.getDocumentElement(), prefixResolver);
   // return the resulting node
   return list.nodeset().item(0);
}

The above code may not look a lot better than the previous try, but most of this method's contents could be encapsulated in a helper class. The only part that changes over and over is the actual XPath expression and the target node.

This lets us create an XPathHelper class, which looks like this:

import org.w3c.dom.*;
import org.xml.sax.*;
import org.apache.xalan.xpath.*;
import org.apache.xalan.xpath.xml.*;
public class XPathHelper {
   XMLParserLiaison xpathSupport = null;
   XPathProcessor xpathParser = null;
   PrefixResolver prefixResolver = null;
   XPathHelper() {
      xpathSupport = new XMLParserLiaisonDefault();
      xpathParser = new XPathProcessorImpl(xpathSupport);
   }
   public NodeList processXPath(String xpath, Node target) throws SAXException {
      prefixResolver = new PrefixResolverDefault(target);
      // create the XPath and initialize it
      XPath xp = new XPath();
      xpathParser.initXPath(xp, xpath, prefixResolver);
      // now execute the XPath select statement
      XObject list = xp.execute(xpathSupport, target, prefixResolver);
      // return the resulting node
      return list.nodeset();
   }
}

After creating the helper class, we can rewrite our finder method again, which is now very short:

public Node findAddress(String name, Document source) throws Exception {
   XPathHelper xpathHelper = new XPathHelper();
   NodeList nl = xpathHelper.processXPath(
        "//address[child::addressee[text() = '"+name+"']]", 
        source.getDocumentElement());
   return nl.item(0);
}

The helper class can now be used whenever a node or a set of nodes needs to be located in a given XML document. The actual XPath statement could even be loaded from an external source, so that changes could be made on the fly if the source document structure changes. In this case, no recompile is necessary.
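The Xalan classes used above are specific to that processor. As a point of comparison, the same lookup can be sketched with the standard javax.xml.xpath API that later became part of the Java platform; this is a swapped-in technique, not the code this article was tested with. The class name StandardXPathFinder, the FIND_ADDRESS constant, and the $name variable are assumptions made for this sketch. Binding the addressee as an XPath variable, rather than concatenating it into the expression, means a name containing a quote (such as O'Brien) does not break the expression, and the expression string itself could be read from an external file.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class StandardXPathFinder {

    // In a real application this string could come from a properties file;
    // it is a constant here to keep the sketch self-contained.
    static final String FIND_ADDRESS = "//address[addressee = $name]";

    /** Parses an XML string into a DOM Document (hypothetical helper). */
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    /** Returns the first node selected by the expression, or null if none matches. */
    static Node findAddress(String expression, String name, Document source)
            throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Resolve the $name variable to the requested addressee.
        xpath.setXPathVariableResolver(qname ->
            "name".equals(qname.getLocalPart()) ? name : null);
        return (Node) xpath.evaluate(expression, source, XPathConstants.NODE);
    }

    public static void main(String[] args) throws Exception {
        Document book = parse(
            "<addressbook>"
            + "<address><addressee>John Smith</addressee><city>Rochester</city></address>"
            + "<address><addressee>Bill Morris</addressee><city>St. Paul</city></address>"
            + "</addressbook>");

        Node found = findAddress(FIND_ADDRESS, "Bill Morris", book);
        if (found != null) {
            Element address = (Element) found;
            System.out.println(address.getElementsByTagName("city").item(0).getTextContent());
        }
    }
}
```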

Process XML documents with XSL stylesheets

In some cases, it makes sense to outsource the entire handling of an XML document to an external XSL stylesheet, a process in some respects similar to the use of XPath as described in the previous section. With XSL stylesheets, you can create an output document by selecting nodes from the input document and merging their content with stylesheet content, based on pattern rules.

If an application changes the structure and content of an XML document to produce a new document, it may be better and easier to let a stylesheet handle the work rather than writing a Java program that does the same job. The stylesheet is most likely stored in an external file, allowing you to change it on the fly, without the need to recompile.

For example, we could accomplish the processing for the addressbook sample by creating a stylesheet that merges the cached version of the addressbook with the updated one, thus creating a new document with the updates in it.

Here is a sample of such a stylesheet:

<?xml version='1.0'?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
   <xsl:output method="xml"/>
<xsl:variable name="doc-file">http://mymachine.com/changed.xml</xsl:variable>
<!-- copy everything that has no other pattern defined -->
<xsl:template match="* | @*">
   <xsl:copy><xsl:copy-of select="@*"/><xsl:apply-templates/></xsl:copy>
</xsl:template>
<!-- check for every <address> element if an updated one exists -->
<xsl:template match="//address">
   <xsl:param name="addresseeName">
      <xsl:value-of select="addressee"/>
   </xsl:param>
   <xsl:choose>
      <xsl:when test="document($doc-file)//addressee[text()=$addresseeName]">
         <xsl:copy-of select="document($doc-file)//address[child::addressee[text()=$addresseeName]]"/>
      </xsl:when>
      <xsl:otherwise>
         <!-- no update exists: keep the original <address> element, wrapper included -->
         <xsl:copy-of select="."/>
      </xsl:otherwise>
   </xsl:choose>
</xsl:template>
</xsl:stylesheet>

Note that the above stylesheet takes the updated data out of a file called changed.xml. A real application would obviously not want to store the changed data in a file before processing it. One solution is to add a special attribute to the <address> element, indicating whether or not it has been updated. Then the application could simply append the updated data to the source document and define a different stylesheet that detects updated records and replaces the outdated ones.

All the application has to do now is create an XSLTProcessor object and let it do the work:

import org.apache.xalan.xslt.*;
   ...
   XSLTProcessor processor = XSLTProcessorFactory.getProcessor();
   processor.process(new XSLTInputSource(sourceDoc.getDocumentElement()),
                     new XSLTInputSource("http://mymachine.com/updateAddress.xsl"),
                     new XSLTResultTarget(newDoc.getDocumentElement()));
   sourceDoc = newDoc;
   ...
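
The variant mentioned earlier, in which updated records are appended to the source document and flagged with an attribute, could be sketched as follows. This is only an illustration under stated assumptions: it uses the standard javax.xml.transform API rather than the Xalan-specific classes shown above, the attribute name updated and the class name UpdatedAddressMerger are invented for the example, and the stylesheet is embedded as a string only to keep the sketch self-contained. The stylesheet drops every unflagged <address> for which a flagged one with the same <addressee> exists, and strips the flag from the records it keeps.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class UpdatedAddressMerger {

    // Hypothetical stylesheet: updated="true" marks the appended records.
    static final String STYLESHEET =
        "<xsl:stylesheet xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\" version=\"1.0\">"
        // copy everything that has no other template defined
        + "<xsl:template match=\"@*|node()\">"
        + "<xsl:copy><xsl:apply-templates select=\"@*|node()\"/></xsl:copy>"
        + "</xsl:template>"
        // keep an address if it is an update, or if no update exists for it
        + "<xsl:template match=\"address\">"
        + "<xsl:if test=\"@updated='true' or "
        + "not(addressee = ../address[@updated='true']/addressee)\">"
        + "<xsl:copy>"
        + "<xsl:apply-templates select=\"@*[name() != 'updated'] | node()\"/>"
        + "</xsl:copy>"
        + "</xsl:if>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Parses an XML string into a DOM Document (hypothetical helper). */
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    /** Runs the stylesheet over sourceDoc and returns the result as a new Document. */
    static Document merge(Document sourceDoc) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        DOMResult result = new DOMResult();
        transformer.transform(new DOMSource(sourceDoc), result);
        return (Document) result.getNode();
    }

    public static void main(String[] args) throws Exception {
        // The third record is a made-up update appended by the application.
        Document sourceDoc = parse(
            "<addressbook>"
            + "<address><addressee>John Smith</addressee>"
            + "<streetaddress>250 18th Ave SE</streetaddress></address>"
            + "<address><addressee>Bill Morris</addressee>"
            + "<streetaddress>1234 Center Lane NW</streetaddress></address>"
            + "<address updated=\"true\"><addressee>John Smith</addressee>"
            + "<streetaddress>1 New Street</streetaddress></address>"
            + "</addressbook>");

        sourceDoc = merge(sourceDoc);
        System.out.println(sourceDoc.getElementsByTagName("address").getLength());
    }
}
```

Note that in this sketch an updated record ends up where it was appended, at the end of the addressbook, rather than in the position of the record it replaces.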

Conclusion

To many of us Java programmers, XML is a relatively new technology that we need to master. This article shows that the manual parsing and processing of an XML document is only one option, and that we may be able to make use of XPath expressions and XSL stylesheets to avoid a lot of parsing and iterating, thus reducing the amount of code that we need to write. Moreover, under this system the information about how the data is processed is stored externally and can be changed without recompiling the application. The mechanisms described here can be used for the creation of presentation data for a Web application, but can also be applied in all cases in which XML data needs to be processed.

Learn more about this topic

  • Recent XML articles in JavaWorld
  • XML help
  • Other valuable XML-related resources








