Vipan Singla e-mail: vipan@vipan.com

XML and XPath Usage

Most common DOM interfaces:
- Node: The base datatype of the DOM.
- Element: The vast majority of the objects you'll deal with are "Elements".
- Attr: Represents an attribute of an "Element".
- Text: The actual content of an "Element" or "Attr".
- Document: Represents the entire XML document. A "Document" object is often referred to as a DOM tree.

Common DOM methods
- Document.getDocumentElement(): Returns the root "element" of the document, i.e. the top-level tag. It is different from the "root" itself, which is just a "/". The root element resides below the "/", alongside other nodes such as comments and processing instructions.
- Node.getFirstChild() and Node.getLastChild(): Return the first or last child of a given Node.
- Node.getNextSibling() and Node.getPreviousSibling(): Return the next or previous node (element, text, comment or whatever) at the same level as the node itself in the document tree.
- Element.getAttribute(attrName): For a given Element, returns the value of the attribute with the requested name as a String. For example, to read the value of the attribute named id, use getAttribute("id"). If you want the Attr object itself, use Element.getAttributeNode("id").
- Document.getElementsByTagName("tag_name"): Retrieves all of the <tag_name> elements in the document. This method saves the trouble of writing code to traverse the entire tree. Or, you can use XPath. See below.

All Seven Kinds of Nodes
- The root
- Elements
- Text
- Attributes
- Namespaces
- Processing instructions
- Comments
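The DOM methods listed earlier can be tried on a small in-memory document. The sketch below is only an illustration: the class name DomWalkSketch and the sample XML are made up for this example, and the document is parsed from a string through an InputSource rather than from a file. Note that the DOM reports its own node-type constants (ELEMENT_NODE, TEXT_NODE, COMMENT_NODE and so on), which correspond only loosely to the seven XPath node kinds above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DomWalkSketch {
    // Hypothetical sample document, not taken from the article.
    static final String SAMPLE =
        "<doc id=\"d1\"><!-- note --><para>one</para>text<para>two</para></doc>";

    // Parse XML held in a String; checked exceptions are wrapped for brevity.
    static Document parse(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    // Walk the direct children with getFirstChild()/getNextSibling()
    // and record what kind of DOM node each one is.
    static String describeChildren(Element parent) {
        StringBuilder sb = new StringBuilder();
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            switch (n.getNodeType()) {
                case Node.ELEMENT_NODE:
                    sb.append("element:").append(n.getNodeName());
                    break;
                case Node.TEXT_NODE:
                    sb.append("text:").append(n.getNodeValue());
                    break;
                case Node.COMMENT_NODE:
                    sb.append("comment");
                    break;
                default:
                    sb.append("other");
                    break;
            }
            sb.append(';');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Element root = parse(SAMPLE).getDocumentElement();
        System.out.println("root element: " + root.getNodeName());
        System.out.println("id attribute: " + root.getAttribute("id"));
        System.out.println("para count:   "
            + root.getElementsByTagName("para").getLength());
        System.out.println("children:     " + describeChildren(root));
    }
}
```

The loop in describeChildren is the "walking the tree" style that XPath lets you avoid.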
XPath Abbreviated Syntax Examples
- In all cases below, the "context node" is the node you want to start searching from in a pre-parsed "document" object. You must be holding a reference to the context node. Remember, a Document object is a type of Node. For example, in:

    NodeIterator nl = XPathAPI.selectNodeIterator(node, "para");

  the argument node is the context node you want to start searching from. You may obtain the "Document" object using:

    DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
    DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
    Document doc = docBuilder.parse(new File("C:\\some_dir\\some_file.xml"));

  The parse method can also take an "InputStream", a URI string or an XML "InputSource" object. After you get the "Document" object, you should merge all adjacent "Text" nodes into one "Text" node and drop empty ones using:

    doc.getDocumentElement().normalize();

  Otherwise, the text of an element may be split across several "Text" nodes and you are going to have a tough time reaching the useful textual content within an element. Note that normalize() does not remove whitespace-only "Text" nodes that sit between elements.
- para selects the "para" element children of the context node
- * selects all element children of the context node
- text() selects all text node children of the context node
- @name selects the "name" attribute of the context node
- @* selects all the attributes of the context node
- para[1] selects the first "para" child of the context node
- para[last()] selects the last "para" child of the context node
- */para selects all para grandchildren of the context node
- /doc/chapter[5]/section[2] selects the second section of the fifth chapter of the doc
- chapter//para selects the "para" element descendants of the "chapter" element children of the context node
- //para selects all the para descendants of the "document root" and thus selects all "para" elements in the same document as the context node
- //olist/item selects all the "item" elements in the same document as the context node that have an "olist" parent
- . selects the context node itself
- .//para selects the "para" element descendants of the context node
- .. selects the parent of the context node
- ../@lang selects the "lang" attribute of the parent of the context node
- para[@type="warning"] selects all "para" children of the context node that have a "type" attribute with value "warning"
- para[@type="warning"][5] selects the fifth "para" child of the context node that has a "type" attribute with value "warning"
- para[5][@type="warning"] selects the fifth "para" child of the context node if that child has a "type" attribute with value "warning"
- chapter[title="Introduction"] selects the "chapter" children of the context node that have one or more "title" children with string-value equal to "Introduction" (use this to match a particular element which contains the text value you desire)
- chapter[title] selects the "chapter" children of the context node that have one or more "title" children
- employee[@secretary and @assistant] selects all the "employee" children of the context node that have both a "secretary" attribute and an "assistant" attribute
- The default axis is "child". For example, a location path div/para is short for child::div/child::para.
- The abbreviation for attribute:: is @. For example, a location path para[@type="warning"] is short for child::para[attribute::type="warning"].
- // is short for /descendant-or-self::node()/. For example, //para is short for /descendant-or-self::node()/child::para. Here, even a "para" element that is a document element will be selected since the document element node is a child of the root node.
- The location path //para[1] does not mean the same as the location path /descendant::para[1]. The latter selects the first descendant para element; the former selects all descendant para elements that are the first para children of their parents.
- A location step of . is short for self::node(). This is particularly useful in conjunction with //. For example, the location path .//para is short for self::node()/descendant-or-self::node()/child::para and so will select all para descendant elements of the context node.
- Similarly, a location step of .. is short for parent::node(). For example, ../title is short for parent::node()/child::title and so will select the title children of the parent of the context node.

Demonstration Example of Using XML XPath in a Java program
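Before the full Xalan-based walkthrough, the abbreviations described above can be checked with a compact sketch. It rests on several assumptions: it uses the JDK's standard javax.xml.xpath API rather than the Xalan XPathAPI class used in the rest of this page, and the class name AbbreviationSketch, the helper count and the sample XML are invented for illustration. It counts the nodes selected by each abbreviated path and by its unabbreviated counterpart, starting from the "doc" element as the context node.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class AbbreviationSketch {
    // Hypothetical sample document, not taken from the article.
    static final String SAMPLE =
        "<doc>"
        + "<chapter><para type=\"warning\">a</para><para>b</para></chapter>"
        + "<chapter><para>c</para><para type=\"warning\">d</para></chapter>"
        + "</doc>";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    // Number of nodes the location path selects, starting from the context node.
    static int count(Node context, String path) {
        try {
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(path, context, XPathConstants.NODESET);
            return hits.getLength();
        } catch (Exception e) {
            throw new IllegalArgumentException("bad XPath: " + path, e);
        }
    }

    public static void main(String[] args) {
        Node root = parse(SAMPLE).getDocumentElement(); // context node: <doc>
        String[] paths = {
            "chapter/para",
            "child::chapter/child::para",
            ".//para",
            "self::node()/descendant-or-self::node()/child::para",
            "//para[1]",
            "/descendant::para[1]",
            "chapter/para[@type='warning'][1]",
            "chapter/para[1][@type='warning']"
        };
        for (String path : paths) {
            System.out.println(path + " -> " + count(root, path) + " node(s)");
        }
    }
}
```

The last four paths are the interesting ones: //para[1] and /descendant::para[1] select different numbers of nodes, and swapping the order of the two predicates changes which "para" elements qualify.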
- Save this code in XPathDemo.java file:

    import java.io.*;
    import javax.xml.parsers.*;
    import org.xml.sax.*;
    import org.w3c.dom.*;
    import org.w3c.dom.traversal.*;
    import javax.xml.transform.*;
    import javax.xml.transform.dom.*;
    import javax.xml.transform.stream.*;
    import org.apache.xpath.*;

    /**
     * This class demonstrates how to use Java to parse an XML file and get
     * any element's content or attribute's value WITHOUT "walking the tree".
     * It uses XPath to achieve this goal. Also shown is a trivial usage of
     * an XML transform to print the parsed XML file to console.
     *
     * Some of the program snippets are by http://xml.apache.org.
     */
    public class XPathDemo {
        public static void main(String[] args) {
            if (args.length < 2) {
                System.out.println("Usage: ");
                System.out.println(
                    "java -classpath xerces.jar;.;xalan.jar "
                    + " XPathDemo your-file.xml your-xpath-string");
                return;
            }
            try {
                /****************************************************************
                 * How to turn an XML file into a document object in Java
                 ****************************************************************/
                System.out.println("Parsing XML file " + args[0] + " ...");
                DocumentBuilderFactory docBuilderFactory =
                    DocumentBuilderFactory.newInstance();
                DocumentBuilder docBuilder =
                    docBuilderFactory.newDocumentBuilder();
                // Parse the XML file and build the Document object in RAM
                Document doc = docBuilder.parse(new File(args[0]));
                // Normalize text representation.
                // Collapses adjacent text nodes into one node.
                doc.getDocumentElement().normalize();

                /****************************************************************
                 * How to use xpath to extract info from document object in Java
                 ****************************************************************/
                String xpath = args[1];
                System.out.println("\nQuerying DOM using xpath string:" + xpath);
                // Evaluates the xpath string; toString() gives the string
                // value of the first node that meets its criteria
                String str = XPathAPI.eval(doc, xpath).toString();
                System.out.println("=>" + str + "<=\n");

                /****************************************************************
                 * How to get root element of the document object
                 ****************************************************************/
                Node root = doc.getDocumentElement();
                System.out.println("\nRoot element of the doc is =>"
                    + root.getNodeName() + "<=");

                /****************************************************************
                 * How to print the parsed xml file right back to system out
                 ****************************************************************/
                String xpathString = args[1];
                // Set up an identity transformer to use as serializer.
                // This one can write input to output stream
                Transformer serializer =
                    TransformerFactory.newInstance().newTransformer();
                serializer.setOutputProperty(
                    OutputKeys.OMIT_XML_DECLARATION, "yes");
                // Use the simple XPath API to select a nodeIterator.
                System.out.println("\nPrinting subtree under xpath =>"
                    + xpathString + "<=");
                NodeIterator nl = XPathAPI.selectNodeIterator(doc, xpathString);
                Node n;
                while ((n = nl.nextNode()) != null) {
                    // Serialize the found nodes to System.out
                    serializer.transform(
                        new DOMSource(n), new StreamResult(System.out));
                }
            } catch (SAXParseException err) {
                String msg = "** SAXParseException"
                    + ", line " + err.getLineNumber()
                    + ", uri " + err.getSystemId() + "\n"
                    + " " + err.getMessage();
                System.out.println(msg);
                // print stack trace
                Exception x = err.getException();
                ((x == null) ? err : x).printStackTrace();
            } catch (SAXException e) {
                String msg = "SAXException";
                System.out.println(msg);
                Exception x = e.getException();
                ((x == null) ? e : x).printStackTrace();
            } catch (Exception e) {
                e.printStackTrace();
            } catch (Throwable t) {
                t.printStackTrace();
                String msg = "Some other exception while getting XML";
                System.out.println(msg);
            }
        }
    }

- Download Xalan from http://xml.apache.org, extract/unzip the downloaded file, find the xerces.jar and xalan.jar files and copy them into the same directory where you saved the above code in the XPathDemo.java file (just to make the demonstration easier).
The download is about 7MB although the two files you need are about 2MB combined. The rest is documentation and the full Java source of Xalan!
- Compile XPathDemo.java using:

    javac -classpath xerces.jar;.;xalan.jar XPathDemo.java

  (The ";" classpath separator shown here and below is the Windows one; on Unix-like systems use ":" instead.)
- Get or create any XML file. Here is a simple example. Save it as, say, example.xml file in the same directory as the above files (just to make the demonstration easier).

    <demo-xpath>
      <database-access db-name="db1">
        Here is to xpath!
        <username>scott</username>
        <password>tiger</password>
        May be some text here.
        Some more text here.
      </database-access>
      Last text line!
    </demo-xpath>

- Now, you have Apache's XML parser in xerces.jar, XPath API in xalan.jar, your Java program in XPathDemo.class and a sample XML file example.xml. You can try to run your Java program and pass it the XML file name and any XPath string. And see what the program gives you! Some generic XPath strings to try are . for current node (in this Java program, same as the root node) and / for root node.
- Run XPathDemo using these commands one by one as examples:

    java -classpath xerces.jar;.;xalan.jar XPathDemo example.xml /
    java -classpath xerces.jar;.;xalan.jar XPathDemo example.xml .
    java -classpath xerces.jar;.;xalan.jar XPathDemo example.xml /demo-xpath
    java -classpath xerces.jar;.;xalan.jar XPathDemo example.xml //@db-name
    java -classpath xerces.jar;.;xalan.jar XPathDemo example.xml //username

These runs will demonstrate different ways to use XPath to get the content of an element or the value of an attribute.
- If you specify a non-existent element or attribute, the toString() method of the XObject obtained from the XPathAPI.eval(...) method returns an empty string rather than throwing a NullPointerException, by design. Actually, a subclass of XObject, XNull, is returned whose toString() method has been programmed to return an empty string. See Xalan's javadoc.

Core Functions
Each function in the function library is specified using a function prototype, which gives the return type, the name of the function, and the type of the arguments. If an argument type is followed by a question mark, then the argument is optional; otherwise, the argument is required.
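To make the prototype notation concrete, here is a small sketch that calls a few of the node-set functions listed below, once with and once without their optional argument. It is a sketch under stated assumptions: it uses the JDK's standard javax.xml.xpath API instead of Xalan's XPathAPI, and the class name FunctionPrototypeSketch, the helper eval and the namespaced sample XML are hypothetical. The parser is made namespace-aware so that name(), local-name() and namespace-uri() have something to report.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class FunctionPrototypeSketch {
    // Hypothetical sample document with a namespace prefix.
    static final String SAMPLE =
        "<p:list xmlns:p=\"urn:example\">"
        + "<p:item>ab</p:item><p:item>cde</p:item>"
        + "</p:list>";

    // Parse with namespace awareness and return the document element.
    static Node parseRoot(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    // Evaluate an expression against a context node and return its string form.
    static String eval(Node context, String expr) {
        try {
            return XPathFactory.newInstance().newXPath().evaluate(expr, context);
        } catch (Exception e) {
            throw new IllegalArgumentException("bad XPath: " + expr, e);
        }
    }

    public static void main(String[] args) {
        Node root = parseRoot(SAMPLE); // context node: <p:list>
        String[] exprs = {
            "local-name()",    // optional argument omitted: uses the context node
            "local-name(*)",   // argument given: first child element in document order
            "name()",          // qualified name, prefix included
            "namespace-uri()", // namespace URI of the context node's name
            "count(*)",        // required argument: a node-set
            "*[position()=1]", // string-value of the first child element
            "*[last()]"        // string-value of the last child element
        };
        for (String expr : exprs) {
            System.out.println(expr + " -> " + eval(root, expr));
        }
    }
}
```

A "?" in a prototype therefore just means that the call is also legal with that argument left out, in which case the context node is used.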
Node-Set Functions
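- number last(): The context size, i.e. the number of nodes in the node-set currently being processed. Used in a predicate such as para[last()], it picks out the last node.
- number position(): The position of the context node within the node-set currently being processed, counting from 1.
- number count(node-set): Number of nodes in the node-set.
- node-set id(object): id("foo") selects the element with unique ID "foo" and id("foo")/child::para[position()=5] selects the fifth "para" child of the element with unique ID "foo".
- string local-name(node-set?): Local part of the expanded-name of the node in the argument node-set that is first in document order. If the argument node-set is empty or the first node has no expanded-name, an empty string is returned. If the argument is omitted, it defaults to a node-set with the context node as its only member.
- string namespace-uri(node-set?): Namespace URI of the expanded-name of the node in the argument node-set that is first in document order, or an empty string if there is none. If the argument is omitted, it defaults to the context node.
- string name(node-set?): Qualified name of the node in the argument node-set that is first in document order, including any namespace prefix (for example "xsl:template"). If the argument is omitted, it defaults to the context node.

String Functions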
- string string(object?): Converts an object to a string as follows:
  - A node-set is converted to a string by returning the string-value of the node in the node-set that is first in document order. If the node-set is empty, an empty string is returned.
  - A number is converted to a string as follows:
    - NaN is converted to the string NaN
    - positive zero is converted to the string 0
    - negative zero is converted to the string 0
    - positive infinity is converted to the string Infinity
    - negative infinity is converted to the string -Infinity
    - if the number is an integer, the number is represented in decimal form as a Number with no decimal point and no leading zeros, preceded by a minus sign (-) if the number is negative
    - otherwise, the number is represented in decimal form as a Number including a decimal point with at least one digit before the decimal point and at least one digit after the decimal point, preceded by a minus sign (-) if the number is negative.
  - The boolean false value is converted to the string false. The boolean true value is converted to the string true.
  - An object of a type other than the four basic types is converted to a string in a way that is dependent on that type.
If the argument is omitted, it defaults to a node-set with the context node as its only member.
NOTE: The string function is not intended for converting numbers into strings for presentation to users. The format-number function and xsl:number element in [XSLT] provide this functionality.
- string concat(string, string, string*): Concatenates its arguments.
- boolean starts-with("string1", "string2"): Checks if "string1" starts with "string2".
- boolean contains("string1", "string2"): Checks if "string1" contains "string2".
- string substring-before("string1", "string2"): Returns the part of "string1" that precedes the first occurrence of "string2", or an empty string if "string1" does not contain "string2".
- string substring-after("string1", "string2"): Returns the part of "string1" that follows the first occurrence of "string2", or an empty string if "string1" does not contain "string2".
- string substring(string, number1, number2?): Substring starting at position number1. If present, number2 is the length of the substring (not an end position); otherwise the substring runs to the end of the string. For example, substring("12345", 2, 3) returns "234".
More precisely, each character in the string (see [3.6 Strings]) is considered to have a numeric position: the position of the first character is 1, the position of the second character is 2 and so on. This differs from Java and ECMAScript, in which the String.substring method treats the position of the first character as 0.
The returned substring contains those characters for which the position of the character is greater than or equal to the rounded value of the second argument and, if the third argument is specified, less than the sum of the rounded value of the second argument and the rounded value of the third argument; the comparisons and addition used for the above follow the standard IEEE 754 rules; rounding is done as if by a call to the round function. The following examples illustrate various unusual cases:

    substring("12345", 1.5, 2.6) returns "234"
    substring("12345", 0, 3) returns "12"
    substring("12345", 0 div 0, 3) returns ""
    substring("12345", 1, 0 div 0) returns ""
    substring("12345", -42, 1 div 0) returns "12345"
    substring("12345", -1 div 0, 1 div 0) returns ""

- number string-length(string?): Number of characters in the string. If no argument, returns length of string-value of context node.
- string normalize-space(string?): Removes leading and trailing whitespace and replaces sequences of whitespace characters with a single space. If no argument, operates on the string-value of the context node.
- string translate(string, string1, string2): In "string", replaces occurrences of characters in "string1" with the character at the corresponding position in "string2". For example, translate("bar","abc","ABC") returns the string BAr. If there is a character in the second argument string with no character at a corresponding position in the third argument string (because the second argument string is longer than the third argument string), then occurrences of that character in the first argument string are removed. For example, translate("--aaa--","abc-","ABC") returns "AAA". If a character occurs more than once in the second argument string, then the first occurrence determines the replacement character. If the third argument string is longer than the second argument string, then excess characters are ignored. Generally used for case-conversion.

Boolean Functions
- boolean boolean(object): Converts object to a boolean as follows:
  - a number is true if and only if it is neither positive or negative zero nor NaN
  - a node-set is true if and only if it is non-empty
  - a string is true if and only if its length is non-zero
  - an object of a type other than the four basic types is converted to a boolean in a way that is dependent on that type
- boolean not(boolean): Reverses the argument.
- boolean true(): Returns true.
- boolean false(): Returns false.
- boolean lang(string): Returns true if the language of the context node, as specified by xml:lang attributes, is the same as or a sub-language of the language named by the argument; the comparison ignores case.

Number Functions
- number number(object?): Converts object to a number as follows:
  - a string that consists of optional whitespace followed by an optional minus sign followed by a Number followed by optional whitespace is converted to a number that is nearest to the mathematical value represented by the string; any other string is converted to NaN
  - boolean true is converted to 1; boolean false is converted to 0
  - a node-set is first converted to a string as if by a call to the string function and then converted in the same way as a string argument
  - an object of a type other than the four basic types is converted to a number in a way that is dependent on that type
If the argument is omitted, it defaults to a node-set with the context node as its only member.
- number sum(node-set): Sum total of all nodes in node-set after converting their string-values to numbers.
- number floor(number): The largest integer that is not greater than the argument.
- number ceiling(number): The smallest integer that is not less than the argument.
- number round(number): The round function returns the number that is closest to the argument and that is an integer. If there are two such numbers, then the one that is closest to positive infinity is returned. If the argument is NaN, then NaN is returned. If the argument is positive infinity, then positive infinity is returned. If the argument is negative infinity, then negative infinity is returned. If the argument is positive zero, then positive zero is returned. If the argument is negative zero, then negative zero is returned. If the argument is less than zero, but greater than or equal to -0.5, then negative zero is returned.
NOTE: For these last two cases, the result of calling the round function is not the same as the result of adding 0.5 and then calling the floor function.
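The string, boolean and number functions above need no input document, so they are easy to try in isolation. The sketch below is an illustration under assumptions: it uses the JDK's standard javax.xml.xpath API rather than Xalan's XPathAPI, evaluates each expression against an empty Document as a throwaway context, and the class name CoreFunctionSketch and helper eval are made up for this example. The expressions mostly reuse the examples already quoted in the descriptions above.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

class CoreFunctionSketch {
    // Evaluate an expression that does not depend on any document content
    // and return its result converted to a string by XPath's own rules.
    static String eval(String expr) {
        try {
            Document empty = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            return XPathFactory.newInstance().newXPath().evaluate(expr, empty);
        } catch (Exception e) {
            throw new IllegalArgumentException("bad XPath: " + expr, e);
        }
    }

    public static void main(String[] args) {
        String[] exprs = {
            // String functions: positions count from 1, third argument is a length
            "substring('12345', 2, 3)",
            "substring('12345', 1.5, 2.6)",
            "substring('12345', 0, 3)",
            "substring-before('1999/04/01', '/')",
            "substring-after('1999/04/01', '/')",
            "translate('bar', 'abc', 'ABC')",
            "translate('--aaa--', 'abc-', 'ABC')",
            "normalize-space('  a   b ')",
            "string-length('hello')",
            // Number functions
            "round(2.5)",
            "round(-2.5)",
            "floor(-1.5)",
            "ceiling(1.2)",
            "number('abc')",
            "1 div 0",
            // Boolean functions: any non-empty string is true
            "boolean('false')",
            "not(boolean(0))"
        };
        for (String expr : exprs) {
            System.out.println(expr + " -> [" + eval(expr) + "]");
        }
    }
}
```

Two results are worth a second look when you run it: boolean('false') is true because the string is non-empty, and round(-2.5) rounds towards positive infinity.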
Data Model
- XPath operates on an XML document as a tree. For all seven types of node, there is a way of determining a string-value for a node of that type. For some types of node, the string-value is part of the node; for other types of node, the string-value is computed from the string-value of descendant nodes.
NOTE: For element nodes and root nodes, the string-value of a node is not the same as the string returned by the DOM nodeValue method (see [DOM]).
- There is an ordering, document order, defined on all the nodes in the document corresponding to the order in which the first character of the XML representation of each node occurs in the XML representation of the document after expansion of general entities. Thus, the root node will be the first node. Element nodes occur before their children. Thus, document order orders element nodes in order of the occurrence of their start-tag in the XML (after expansion of entities). The attribute nodes and namespace nodes of an element occur before the children of the element. The namespace nodes are defined to occur before the attribute nodes. The relative order of namespace nodes is implementation-dependent. The relative order of attribute nodes is implementation-dependent. Reverse document order is the reverse of document order.
- Root nodes and element nodes have an ordered list of child nodes.
- Nodes never share children.
- Every node other than the root node has exactly one parent, which is either an element node or the root node. A root node or an element node is the parent of each of its child nodes. The descendants of a node are the children of the node and the descendants of the children of the node.
- Root Node is the root of the tree. A root node does not occur except as the root of the tree. The element node for the document element is a child of the root node. The root node also has as children processing instruction and comment nodes for processing instructions and comments that occur in the prolog and after the end of the document element.
The string-value of the root node is the concatenation of the string-values of all text node descendants of the root node in document order.
- The children of an element node are the element nodes, comment nodes, processing instruction nodes and text nodes for its content. Entity references to both internal and external entities are expanded. Character references are resolved.
The string-value of an element node is the concatenation of the string-values of all text node descendants of the element node in document order.
- Each element node has an associated set of attribute nodes; the element is the parent of each of these attribute nodes; however, an attribute node is not a child of its parent element.
NOTE: This is different from the DOM, which does not treat the element bearing an attribute as the parent of the attribute.
- Elements never share attribute nodes.
- The = operator tests whether two nodes have the same value, not whether they are the same node. Thus attributes of two different elements may compare as equal using =, even though they are not the same node.
- An attribute node has a string-value, which is the normalized attribute value. An attribute whose normalized value is an empty string still results in an attribute node; its string-value is a zero-length string.
- There is a comment node for every comment, except for any comment that occurs within the document type declaration.
The string-value of a comment is the content of the comment, not including the opening <!-- or the closing -->.
- A text node never has an immediately following or preceding sibling that is another text node. The string-value of a text node is the character data. A text node always has at least one character of data.
- A CDATA section is treated as if the <![CDATA[ and ]]> were removed and every occurrence of < and & were replaced by &lt; and &amp; respectively.
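A few of the data-model points above (string-value versus the DOM nodeValue, the element being the parent of its attributes without the attributes being its children, and = comparing values rather than node identity) can be observed side by side. This is a sketch under assumptions: it uses the JDK's standard javax.xml.xpath API instead of Xalan's XPathAPI, and the class name DataModelSketch, its helpers and the sample XML are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class DataModelSketch {
    // Hypothetical sample: two different elements carry an x="1" attribute.
    static final String SAMPLE = "<r><a x=\"1\">hi <b x=\"1\">there</b></a></r>";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    // Evaluate an expression against the document and return its string form.
    static String eval(Document doc, String expr) {
        try {
            return XPathFactory.newInstance().newXPath().evaluate(expr, doc);
        } catch (Exception e) {
            throw new IllegalArgumentException("bad XPath: " + expr, e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        Element a = (Element) doc.getDocumentElement().getFirstChild();
        Attr x = a.getAttributeNode("x");

        // XPath string-value of an element: all descendant text, concatenated.
        System.out.println("XPath string(/r/a):  " + eval(doc, "string(/r/a)"));
        // DOM nodeValue of the same element is null.
        System.out.println("DOM a.getNodeValue(): " + a.getNodeValue());

        // '=' compares values, so attributes of different elements can be equal.
        System.out.println("/r/a/@x = /r/a/b/@x: " + eval(doc, "/r/a/@x = /r/a/b/@x"));

        // In XPath the element is the attribute's parent ...
        System.out.println("name(/r/a/@x/..):    " + eval(doc, "name(/r/a/@x/..)"));
        // ... but the attribute is not among the element's children.
        System.out.println("count(/r/a/node()):  " + eval(doc, "count(/r/a/node())"));
        // The DOM instead reports no parent and exposes an owner element.
        System.out.println("DOM x.getParentNode(): " + x.getParentNode());
        System.out.println("DOM x.getOwnerElement(): " + x.getOwnerElement().getNodeName());
    }
}
```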
© Vipan Singla 2000