Detailed Comparison of XML Parsing Performance Using XPath in Java

I recently worked on a small project that involved parsing XML files. This article summarizes what I learned while using the technology.

1 Four methods for parsing XML files

There are four classic ways to parse XML files in Java. The two basic ones are SAX and DOM: SAX parses the document as a stream of events, while DOM parses it into a tree structure. JDOM appeared later to reduce the amount of code that DOM and SAX require; it follows the 20-80 principle (the Pareto principle), covering the most common needs with far less code. Under normal circumstances JDOM is enough for simple tasks such as parsing and creating documents, but underneath it still relies on SAX (most commonly), DOM, or Xalan documents. The fourth is DOM4J, an open-source Java XML API with excellent performance, powerful features, and an easy-to-use interface. More and more Java software uses DOM4J to read and write XML; notably, even Sun's JAXM uses DOM4J.
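To make the difference between the two basic approaches concrete, here is a minimal sketch that uses only the parsers bundled with the JDK. The class name SaxVsDomSketch, its methods, and the inline sample document are assumptions made for illustration; they are not part of the project described in this article.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class SaxVsDomSketch {
  // Hypothetical sample document used only for this illustration.
  static final String XML =
      "<Table><Row><Cell>a</Cell></Row><Row><Cell>b</Cell></Row></Table>";

  // SAX: the parser pushes events to a handler; no document tree is kept in memory.
  static int countWithSax(String xml, final String name) {
    final int[] count = {0};
    DefaultHandler handler = new DefaultHandler() {
      @Override
      public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName.equals(name)) {
          count[0]++;
        }
      }
    };
    try {
      SAXParserFactory.newInstance().newSAXParser()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
    } catch (Exception e) {
      throw new IllegalStateException("SAX parsing failed", e);
    }
    return count[0];
  }

  // DOM: the whole document is loaded into a tree, which is then queried.
  static int countWithDom(String xml, String name) {
    try {
      Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
      return doc.getElementsByTagName(name).getLength();
    } catch (Exception e) {
      throw new IllegalStateException("DOM parsing failed", e);
    }
  }

  public static void main(String[] args) {
    System.out.println("SAX rows: " + countWithSax(XML, "Row"));
    System.out.println("DOM rows: " + countWithDom(XML, "Row"));
  }
}
```

Both methods count the same elements; the SAX version never holds more than the current event, while the DOM version keeps the full tree available for later queries such as XPath.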

2 Brief introduction to XPath

XPath is a language for finding information in XML documents by navigating through their elements and attributes. It is a major element of the W3C XSLT standard, and XQuery and XPointer are both built on XPath expressions, so understanding XPath is the foundation of many advanced XML applications. XPath resembles SQL for databases or jQuery selectors for web pages: it lets developers easily pick out the parts of a document they need. DOM4J supports XPath as well.
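As a quick illustration of what XPath expressions look like in practice, the sketch below evaluates three of them with the javax.xml.xpath API that ships with the JDK. The class XPathBasicsSketch and the small library document are made up for this example.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class XPathBasicsSketch {
  // Hypothetical sample document used only for this illustration.
  static final String XML =
      "<library>"
      + "<book year=\"2001\"><title>A</title></book>"
      + "<book year=\"2010\"><title>B</title></book>"
      + "</library>";

  // Evaluates an XPath expression against the XML text and returns its string value.
  static String select(String xml, String expression) {
    try {
      XPath xpath = XPathFactory.newInstance().newXPath();
      return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
    } catch (XPathExpressionException e) {
      throw new IllegalArgumentException("Bad XPath or XML: " + expression, e);
    }
  }

  public static void main(String[] args) {
    // Select an element by its position among its siblings.
    System.out.println(select(XML, "/library/book[2]/title"));
    // Filter elements on an attribute value.
    System.out.println(select(XML, "//book[@year='2001']/title"));
    // Call an XPath function.
    System.out.println(select(XML, "count(//book)"));
  }
}
```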

3 Using XPath with DOM4J

To parse XML documents with XPath in DOM4J, first add two JAR packages to the project:

dom4j-1.6.1.jar: the DOM4J package, available at http://sourceforge.net/projects/dom4j/;

jaxen-xx.xx.jar: the XPath engine that DOM4J relies on. Omitting this package usually causes an exception (java.lang.NoClassDefFoundError: org/jaxen/JaxenException). It is available at http://www.jaxen.org/releases.html.
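If it is unclear whether the Jaxen JAR actually made it onto the classpath, a small check like the following can help before any XPath call is made. This is only a sketch: the class JaxenCheckSketch and its method are hypothetical helpers, and the only name taken from the library is the class mentioned in the error message.

```java
class JaxenCheckSketch {
  // Returns true if the named class can be loaded from the current classpath.
  static boolean isOnClasspath(String className) {
    try {
      // 'false' means: locate the class without running its static initializers.
      Class.forName(className, false, JaxenCheckSketch.class.getClassLoader());
      return true;
    } catch (ClassNotFoundException | LinkageError e) {
      return false;
    }
  }

  public static void main(String[] args) {
    // org.jaxen.JaxenException is the class named in the NoClassDefFoundError.
    if (isOnClasspath("org.jaxen.JaxenException")) {
      System.out.println("Jaxen found on the classpath.");
    } else {
      System.err.println("Jaxen is missing: DOM4J XPath calls are likely to fail.");
    }
  }
}
```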

3.1 Interference from namespaces

When processing XML files converted from Excel or other formats, XPath queries often return no results. This is usually caused by namespaces. For example, for the XML file below, a simple search with XPath = "//Workbook/Worksheet/Table/Row[1]/Cell[1]/Data[1]" finds no matches, because the document declares a default namespace (xmlns="urn:schemas-microsoft-com:office:spreadsheet").

<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet" xmlns:html="http://www.w3.org/TR/REC-html40">
 <Worksheet ss:Name="Sheet1">
  <Table ss:ExpandedColumnCount="81" ss:ExpandedRowCount="687" x:FullColumns="1" x:FullRows="1" ss:DefaultColumnWidth="52.5" ss:DefaultRowHeight="15.5625">
   <Row ss:AutoFitHeight="0">
     <Cell>
     <Data ss:Type="String">敲代码的耗子</Data>
     </Cell> 
   </Row>
   <Row ss:AutoFitHeight="0">
     <Cell>
     <Data ss:Type="String">Sunny</Data>
     </Cell> 
   </Row>
  </Table>
 </Worksheet>
</Workbook>
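The effect is not specific to DOM4J and can be reproduced with the JDK's own DOM and XPath classes. The sketch below parses a trimmed-down copy of the workbook with namespace awareness switched on and counts the matches for two expressions; the class NamespaceInterferenceSketch and the shortened document are assumptions made for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NamespaceInterferenceSketch {
  // Trimmed-down version of the workbook (illustrative only).
  static final String XML =
      "<Workbook xmlns=\"urn:schemas-microsoft-com:office:spreadsheet\""
      + " xmlns:ss=\"urn:schemas-microsoft-com:office:spreadsheet\">"
      + "<Worksheet ss:Name=\"Sheet1\"><Table>"
      + "<Row><Cell><Data ss:Type=\"String\">Sunny</Data></Cell></Row>"
      + "</Table></Worksheet></Workbook>";

  // Parses the XML with namespace awareness on and counts the nodes an XPath selects.
  static int countMatches(String xml, String expression) {
    try {
      DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
      dbf.setNamespaceAware(true);
      Document doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
      NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
          .evaluate(expression, doc, XPathConstants.NODESET);
      return nodes.getLength();
    } catch (Exception e) {
      throw new IllegalStateException("Could not evaluate " + expression, e);
    }
  }

  public static void main(String[] args) {
    // Unprefixed names only match elements that are in no namespace.
    String plain = "//Workbook/Worksheet/Table/Row[1]/Cell[1]/Data[1]";
    // local-name() ignores the namespace and compares the bare element name.
    String byLocalName = "//*[local-name()='Row'][1]/*[local-name()='Cell'][1]"
        + "/*[local-name()='Data'][1]";
    System.out.println("plain path matches:   " + countMatches(XML, plain));
    System.out.println("local-name() matches: " + countMatches(XML, byLocalName));
  }
}
```

The plain path finds nothing because every element in the file belongs to the spreadsheet namespace, while the unprefixed names in the expression refer to elements without one.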

3.2 XPath parsing for XML files with namespaces

The first method (read1() function): uses the XPath functions local-name() and namespace-uri() to specify the node name and namespace to match. The XPath expressions are more cumbersome to write.

The second method (read2() function): sets the namespace on the XPath object with the setNamespaceURIs() function.

The third method (read3() function): sets the namespace on the DocumentFactory with the setXPathNamespaceURIs() function. The XPath expressions in the second and third methods are relatively simple to write.

The fourth method (read4() function): the same as the third, but with a different XPath expression (see the program for details). It mainly tests whether differences in the XPath expression, chiefly how complete the path is, affect search efficiency. (The first four methods all parse the XML file using DOM4J combined with XPath.)

The fifth method (read5() function): uses DOM combined with XPath to parse the XML file, mainly to test the performance difference.

Nothing can illustrate a problem better than code! Let's jump into the code!

package XPath;
import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.dom4j.Document;
import org.dom4j.DocumentException;
import org.dom4j.Element;
import org.dom4j.XPath;
import org.dom4j.io.SAXReader;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;
/**
 * DOM4J DOM XML XPath
 * @author hao
 */
public class TestDom4jXpath {
  public static void main(String[] args) {
    read1();
    read2();
    read3();
    read4(); // same approach as read3(), but with a different XPath expression
    read5();
  }
  public static void read1() {
    /*
     * use local-name() and namespace-uri() in XPath
     */
    try {
      long startTime=System.currentTimeMillis();
      SAXReader reader = new SAXReader();
      // Class-loader resource names use '/' as the separator on every platform
      InputStream in = TestDom4jXpath.class.getClassLoader().getResourceAsStream("XPath/XXX.xml");
      Document doc = reader.read(in);
      /* Fully qualified alternative:
      String xpath ="//*[local-name()='Workbook' and namespace-uri()='urn:schemas-microsoft-com:office:spreadsheet']"
          + "/*[local-name()='Worksheet']"
          + "/*[local-name()='Table']"
          + "/*[local-name()='Row'][4]"
          + "/*[local-name()='Cell'][3]"
          + "/*[local-name()='Data'][1]";*/
      String xpath ="//*[local-name()='Row'][4]/*[local-name()='Cell'][3]/*[local-name()='Data'][1]";
      System.err.println("=====use local-name() and namespace-uri() in XPath====");
      System.err.println("XPath: " + xpath);
      // List<?> compiles against both dom4j 1.6.1 (raw List) and 2.x (List<Node>)
      List<?> list = doc.selectNodes(xpath);
      for(Object o:list){
        Element e = (Element) o;
        String show=e.getStringValue();
        System.out.println("show = " + show);
      }
      // Report the elapsed time once, after all matches have been printed
      long endTime=System.currentTimeMillis();
      System.out.println("Program run time: "+(endTime-startTime)+"ms");
    } catch (DocumentException e) {
      e.printStackTrace();
    }
  }
  public static void read2() {
    /*
     * set xpath namespace(setNamespaceURIs)
     */
    try {
      long startTime=System.currentTimeMillis();
      // Bind the prefix "Workbook" to the spreadsheet namespace
      Map<String, String> map = new HashMap<String, String>();
      map.put("Workbook","urn:schemas-microsoft-com:office:spreadsheet");
      SAXReader reader = new SAXReader();
      // Class-loader resource names use '/' as the separator on every platform
      InputStream in = TestDom4jXpath.class.getClassLoader().getResourceAsStream("XPath/XXX.xml");
      Document doc = reader.read(in);
      String xpath ="//Workbook:Row[4]/Workbook:Cell[3]/Workbook:Data[1]";
      System.err.println("=====use setNamespaceURIs() to set xpath namespace====");
      System.err.println("XPath: " + xpath);
      XPath x = doc.createXPath(xpath);
      x.setNamespaceURIs(map);
      // List<?> compiles against both dom4j 1.6.1 (raw List) and 2.x (List<Node>)
      List<?> list = x.selectNodes(doc);
      for(Object o:list){
        Element e = (Element) o;
        String show=e.getStringValue();
        System.out.println("show = " + show);
      }
      // Report the elapsed time once, after all matches have been printed
      long endTime=System.currentTimeMillis();
      System.out.println("Program run time: "+(endTime-startTime)+"ms");
    } catch (DocumentException e) {
      e.printStackTrace();
    }
  }
  public static void read3() {
    /*
     * set DocumentFactory() namespace(setXPathNamespaceURIs)
     */
    try {
      long startTime=System.currentTimeMillis();
      // Bind the prefix "Workbook" to the spreadsheet namespace
      Map<String, String> map = new HashMap<String, String>();
      map.put("Workbook","urn:schemas-microsoft-com:office:spreadsheet");
      SAXReader reader = new SAXReader();
      // Class-loader resource names use '/' as the separator on every platform
      InputStream in = TestDom4jXpath.class.getClassLoader().getResourceAsStream("XPath/XXX.xml");
      reader.getDocumentFactory().setXPathNamespaceURIs(map);
      Document doc = reader.read(in);
      String xpath ="//Workbook:Row[4]/Workbook:Cell[3]/Workbook:Data[1]";
      System.err.println("=====use setXPathNamespaceURIs() to set DocumentFactory() namespace====");
      System.err.println("XPath: " + xpath);
      // List<?> compiles against both dom4j 1.6.1 (raw List) and 2.x (List<Node>)
      List<?> list = doc.selectNodes(xpath);
      for(Object o:list){
        Element e = (Element) o;
        String show=e.getStringValue();
        System.out.println("show = " + show);
      }
      // Report the elapsed time once, after all matches have been printed
      long endTime=System.currentTimeMillis();
      System.out.println("Program run time: "+(endTime-startTime)+"ms");
    } catch (DocumentException e) {
      e.printStackTrace();
    }
  }
  public static void read4() {
    /*
     * Same approach as read3(), but with a different XPath expression
     */
    try {
      long startTime=System.currentTimeMillis();
      // Bind the prefix "Workbook" to the spreadsheet namespace
      Map<String, String> map = new HashMap<String, String>();
      map.put("Workbook","urn:schemas-microsoft-com:office:spreadsheet");
      SAXReader reader = new SAXReader();
      // Class-loader resource names use '/' as the separator on every platform
      InputStream in = TestDom4jXpath.class.getClassLoader().getResourceAsStream("XPath/XXX.xml");
      reader.getDocumentFactory().setXPathNamespaceURIs(map);
      Document doc = reader.read(in);
      String xpath ="//Workbook:Worksheet/Workbook:Table/Workbook:Row[4]/Workbook:Cell[3]/Workbook:Data[1]";
      System.err.println("=====use setXPathNamespaceURIs() to set DocumentFactory() namespace====");
      System.err.println("XPath: " + xpath);
      // List<?> compiles against both dom4j 1.6.1 (raw List) and 2.x (List<Node>)
      List<?> list = doc.selectNodes(xpath);
      for(Object o:list){
        Element e = (Element) o;
        String show=e.getStringValue();
        System.out.println("show = " + show);
      }
      // Report the elapsed time once, after all matches have been printed
      long endTime=System.currentTimeMillis();
      System.out.println("Program run time: "+(endTime-startTime)+"ms");
    } catch (DocumentException e) {
      e.printStackTrace();
    }
  }
  public static void read5() {
    /*
     * DOM and XPath
     */
    try {
      long startTime=System.currentTimeMillis();
      DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
      // Ignore namespaces so that unprefixed element names can be used in the XPath
      dbf.setNamespaceAware(false);
      DocumentBuilder builder = dbf.newDocumentBuilder();
      // Class-loader resource names use '/' as the separator on every platform
      InputStream in = TestDom4jXpath.class.getClassLoader().getResourceAsStream("XPath/XXX.xml");
      org.w3c.dom.Document doc = builder.parse(in);
      XPathFactory factory = XPathFactory.newInstance();
      javax.xml.xpath.XPath x = factory.newXPath();
      // Select the first Data element in the third cell of the fourth row
      String xpath = "//Workbook/Worksheet/Table/Row[4]/Cell[3]/Data[1]";
      System.err.println("=====Dom XPath====");
      System.err.println("XPath: " + xpath);
      XPathExpression expr = x.compile(xpath);
      // NODESET (not NODE) is the result type that can be cast to NodeList
      NodeList nodes = (NodeList)expr.evaluate(doc, XPathConstants.NODESET);
      for(int i = 0; i<nodes.getLength();i++) {
        // getNodeValue() is null for element nodes; getTextContent() returns their text
        System.out.println("show = " + nodes.item(i).getTextContent());
      }
      // Report the elapsed time once, after all matches have been printed
      long endTime=System.currentTimeMillis();
      System.out.println("Program run time: "+(endTime-startTime)+"ms");
    } catch(XPathExpressionException e) {
      e.printStackTrace();
    } catch(ParserConfigurationException e) {
      e.printStackTrace();
    } catch(SAXException e) {
      e.printStackTrace();
    } catch(IOException e) {
      e.printStackTrace();
    }
  }
}
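For comparison with read2() and read3(), the JDK's javax.xml.xpath API has no prefix map of its own; prefixes are bound through a NamespaceContext instead. The sketch below shows one way this might look. It is a JDK analogue of the DOM4J approach, not DOM4J code, and the class PrefixBindingSketch and its shortened sample document are assumptions made for illustration.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class PrefixBindingSketch {
  static final String SS = "urn:schemas-microsoft-com:office:spreadsheet";

  // Shortened sample workbook (illustrative only).
  static final String XML = "<Workbook xmlns=\"" + SS + "\"><Worksheet><Table>"
      + "<Row><Cell><Data>Sunny</Data></Cell></Row></Table></Worksheet></Workbook>";

  // Binds the single prefix "Workbook" to the spreadsheet namespace.
  static final NamespaceContext CONTEXT = new NamespaceContext() {
    @Override
    public String getNamespaceURI(String prefix) {
      return "Workbook".equals(prefix) ? SS : XMLConstants.NULL_NS_URI;
    }

    @Override
    public String getPrefix(String uri) {
      return SS.equals(uri) ? "Workbook" : null;
    }

    @Override
    public Iterator<String> getPrefixes(String uri) {
      String prefix = getPrefix(uri);
      return (prefix == null
          ? Collections.<String>emptyList()
          : Collections.singletonList(prefix)).iterator();
    }
  };

  // Evaluates the expression on a namespace-aware DOM; returns "" when nothing matches.
  static String select(String xml, String expression) {
    try {
      DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
      dbf.setNamespaceAware(true);
      Document doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
      XPath xpath = XPathFactory.newInstance().newXPath();
      xpath.setNamespaceContext(CONTEXT);
      return xpath.evaluate(expression, doc);
    } catch (Exception e) {
      throw new IllegalStateException("Could not evaluate " + expression, e);
    }
  }

  public static void main(String[] args) {
    System.out.println(select(XML, "//Workbook:Row[1]/Workbook:Cell[1]/Workbook:Data[1]"));
  }
}
```

As in read2() and read3(), the prefix name is arbitrary; only the namespace URI it maps to has to match the document.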

3.3 Performance comparison of different methods

To compare the parsing performance of these methods, 10 rounds of tests were run on an XML file (XXX.xml) with more than 70,000 lines and a size of about 6 (presumably MB). The results are as follows:

Figure 1 XPath Performance Comparison

Method Name	Average Run Time	XPath Expression
read1()	1663ms	//*[local-name()='Row'][4]/*[local-name()='Cell'][3]/*[local-name()='Data'][1]
read2()	2184ms	//Workbook:Row[4]/Workbook:Cell[3]/Workbook:Data[1]
read3()	601ms	//Workbook:Row[4]/Workbook:Cell[3]/Workbook:Data[1]
read4()	472ms	//Workbook:Worksheet/Workbook:Table/Workbook:Row[4]/Workbook:Cell[3]/Workbook:Data[1]
read5()	1094ms	//Workbook/Worksheet/Table/Row[4]/Cell[3]/Data[1]

Table 1 Average Performance Statistics
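The article does not show how the rounds were averaged. One possible way, sketched here with a hypothetical AverageTimingSketch helper that is not part of the original program, is to run each method a few times untimed and then average the timed rounds. The warm-up runs are an assumption on my part, added so that class loading and JIT compilation weigh less on the first measured round.

```java
class AverageTimingSketch {
  // Runs the task 'warmups' times untimed, then 'rounds' times timed,
  // and returns the mean duration of the timed rounds in milliseconds.
  static double averageMillis(Runnable task, int warmups, int rounds) {
    if (rounds <= 0) {
      throw new IllegalArgumentException("rounds must be positive");
    }
    for (int i = 0; i < warmups; i++) {
      task.run();
    }
    long totalNanos = 0;
    for (int i = 0; i < rounds; i++) {
      long start = System.nanoTime();
      task.run();
      totalNanos += System.nanoTime() - start;
    }
    return totalNanos / 1_000_000.0 / rounds;
  }

  public static void main(String[] args) {
    // Placeholder workload; in practice this would be one of the read methods.
    Runnable placeholder = () -> {
      StringBuilder sb = new StringBuilder();
      for (int i = 0; i < 100_000; i++) {
        sb.append(i);
      }
    };
    System.out.println("average: " + averageMillis(placeholder, 2, 10) + " ms");
  }
}
```

For example, averageMillis(TestDom4jXpath::read4, 1, 10) would time read4() over 10 rounds, assuming the test class is on the classpath. Each read method also re-parses the file on every call, so the figure covers parsing plus XPath evaluation together.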

From the above performance comparison, it can be known that:

1. read4() has the shortest run time; that is, calling DOM4J with a full-path XPath expression (starting from the root node) parses the XML file fastest;

2. The XPath expression used with the DOM parsing method is the simplest (it can be written as //Row[4]/Cell[3]/Data[1]), because namespace handling can be disabled in DOM with the setNamespaceAware(false) method.
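The second point can be checked in isolation with a small experiment. The sketch below evaluates the same short expression against a namespace-unaware and a namespace-aware DOM; the class NamespaceAwareFlagSketch and its two-row sample document are assumptions made for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class NamespaceAwareFlagSketch {
  // Two-row workbook in the spreadsheet namespace (illustrative sample only).
  static final String XML =
      "<Workbook xmlns=\"urn:schemas-microsoft-com:office:spreadsheet\"><Worksheet><Table>"
      + "<Row><Cell><Data>First</Data></Cell></Row>"
      + "<Row><Cell><Data>Sunny</Data></Cell></Row>"
      + "</Table></Worksheet></Workbook>";

  // Returns the string value of the first match, or "" when nothing matches.
  static String select(String xml, String expression, boolean namespaceAware) {
    try {
      DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
      dbf.setNamespaceAware(namespaceAware);
      Document doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
      return XPathFactory.newInstance().newXPath().evaluate(expression, doc);
    } catch (Exception e) {
      throw new IllegalStateException("Could not evaluate " + expression, e);
    }
  }

  public static void main(String[] args) {
    String path = "//Row[2]/Cell[1]/Data[1]";
    System.out.println("namespace-unaware: '" + select(XML, path, false) + "'");
    System.out.println("namespace-aware:   '" + select(XML, path, true) + "'");
  }
}
```

The short path only works while namespace awareness is off; with it on, the same expression needs prefixes or local-name() as in the DOM4J methods.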

That's all for this article. I hope it is helpful for your study.
