September 20, 2015

Groovy by example: XML / HTML transformation (part 1 of 2)



In this blog post, I present a complete example of using Groovy for XML / HTML transformation, because I think that the simpler examples available on the web don’t show best practices and common pitfalls as clearly as a more in-depth example does.

Note: This article does not cover XML namespace awareness. Please refer to other online resources if you are interested in this topic.

By example!

As an example use case, we will actually transform an HTML <table>.

We will flip it by 90 degrees, i.e. turn its columns into rows and vice versa, which I think is both an interesting and a generally handy example. I’ve randomly chosen the OPEC members table from Wikipedia for this.

We will transform this:

into this:
(Without surrounding elements.)

Note that the entire source code of this example is available at the accompanying GitHub repository. You may want to open its code in a separate browser window for quick reference.

Let me start off with some Groovy XML processing basics.

Parsing XML

Parsing XML is nicely explained in Groovy’s official documentation, hence I will not repeat it here.

It’s important to note that there are two facilities for XML parsing, groovy.util.XmlParser and groovy.util.XmlSlurper, and that their APIs for accessing information on parsed XML nodes differ. This is nicely shown by this code example by mrhaki. For the example code, I will use XmlParser, which fits HTML parsing better than XmlSlurper because it has better support for text child nodes. We also don’t need the lazy evaluation feature provided by XmlSlurper in this example.

Note that the example will not work with XmlSlurper!
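
To make the point about text child nodes concrete, here is a minimal sketch of how XmlParser exposes mixed content. The sample markup and variable names are made up for illustration, and the snippet relies on the default import of groovy.util, so on Groovy 4 and later you would probably need an explicit import of groovy.xml.XmlParser instead.

```groovy
// Hypothetical sample input: a paragraph with mixed content (text and elements)
def xml = '<p>Hello<b>big</b>world</p>'

// XmlParser builds a tree of Node objects; text children are plain Strings
def p = new XmlParser().parseText(xml)
def kids = p.children()

assert kids.size() == 3
assert kids[0] == 'Hello'        // leading text, a String
assert kids[1].name() == 'b'     // the <b> element, a Node
assert kids[1].text() == 'big'
assert kids[2] == 'world'        // trailing text, a String
```

XmlSlurper, by contrast, wraps its results in GPathResult objects, so code written against this list-of-children view does not carry over directly.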

Parsing HTML

XmlParser expects well-formed XML, thus we must transform the (potentially not well-formed) HTML into well-formed XML first. This issue is addressed by this Stack Overflow question.

Bottom line: You have to use an additional third-party parser. I’ve always used the TagSoup parser for this, and it has always done a good job.
<dependency>
    <groupId>org.ccil.cowan.tagsoup</groupId>
    <artifactId>tagsoup</artifactId>
    <version>1.2.1</version>
</dependency>
Note: The example code may not work with any other HTML sanitizer.

The complete code to parse HTML text into a Groovy node structure is then:
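import org.ccil.cowan.tagsoup.Parser
// TagSoup's Parser is a SAX XMLReader; XmlParser delegates the actual parsing to it
Parser parser = new Parser()
def html = new XmlParser(parser).parseText(HTML)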
It’s important to note that in the case of “sanitized” HTML, the content is implicitly wrapped in an <html><body> structure, hence the root node returned by the XmlParser is the <html> node.
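
The following sketch illustrates that wrapping for a bare HTML fragment. It assumes that Grape can download TagSoup at runtime (hence the @Grab annotation, which is not part of the original example), and the input fragment is invented for the purpose.

```groovy
@Grab('org.ccil.cowan.tagsoup:tagsoup:1.2.1')
import org.ccil.cowan.tagsoup.Parser

// Hypothetical input: a bare fragment with unclosed <td> tags and no <html> wrapper
def fragment = '<table><tr><td>A<td>B</tr></table>'

def html = new XmlParser(new Parser()).parseText(fragment)

// TagSoup reports elements in the XHTML namespace, so name() is a QName here
assert html.name().localPart == 'html'

// Look for <body> among the root's element children (text children are Strings)
def body = html.children().find { it instanceof Node && it.name().localPart == 'body' }
assert body != null
```

Reading localPart rather than comparing name() to a String keeps the check independent of the namespace that the sanitizer adds.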

Writing XML

In Groovy, you can build XML (like many other tree-like structures) from scratch with so-called builders. Groovy ships with two XML builder implementations: groovy.xml.MarkupBuilder and groovy.xml.StreamingMarkupBuilder; they are both briefly covered by the official documentation. The latter has better support for namespaces, but as we ignore them in this article, we will use MarkupBuilder here. Note that, annoyingly, they differ in their API, which is again nicely illustrated on mrhaki’s blog.

Note that the example will not work with StreamingMarkupBuilder!
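
As a rough illustration of how the two APIs differ, the sketch below builds the same small structure with both builders. The element names and the text are invented for the example.

```groovy
import groovy.xml.MarkupBuilder
import groovy.xml.StreamingMarkupBuilder

// MarkupBuilder writes to the given Writer while the structure is being built
def writer = new StringWriter()
new MarkupBuilder(writer).html {
    body {
        div(id: 'myDiv', 'My Text')
    }
}
def written = writer.toString()

// StreamingMarkupBuilder binds a closure first; the markup is only produced
// when the bound result is written out or converted to a String
def streamed = new StreamingMarkupBuilder().bind {
    html {
        body {
            div(id: 'myDiv', 'My Text')
        }
    }
}.toString()

assert written.contains("<div id='myDiv'>My Text</div>")
assert streamed.contains("<div id='myDiv'>My Text</div>")
```

Both calls produce the same elements, but the way you obtain the output differs, which is why code written for one builder cannot simply be pointed at the other.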

Many Groovy builders, MarkupBuilder among them, extend BuilderSupport, so we can use the same techniques and best practices whether a builder produces an XML structure or any other nested structure. I have already discussed many of these best practices in an earlier article on custom builders.

So, to build basic HTML, you build the structure literally:
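import groovy.xml.MarkupBuilder

private static String buildSomeHtml() {
    // MarkupBuilder writes its output into the writer it is given
    Writer writer = new StringWriter()
    MarkupBuilder xmlBuilder = new MarkupBuilder(writer)

    xmlBuilder.html {
        body {
            div(id: 'myDiv') {
                // mkp.yield() writes (escaped) inline text into the current element
                mkp.yield('My Text')
            }
        }
    }

    return writer.toString()
}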
A few notes about working with MarkupBuilder:
  • To print the output, initialize the builder with a StringWriter and call its toString() method after building the structure. This will in fact pretty-print the resulting XML. (Note that the printing API for StreamingMarkupBuilder works completely differently!)
  • MarkupBuilder has an implicit reference, mkp, to a MarkupBuilderHelper, which you can use e.g. to insert XML comments and inline text.
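
As a small illustration of the mkp helper mentioned above, the following sketch inserts a comment and some inline text. The element name and the texts are invented for the example; comment() and yield() are methods of MarkupBuilderHelper.

```groovy
import groovy.xml.MarkupBuilder

def writer = new StringWriter()
def xmlBuilder = new MarkupBuilder(writer)

xmlBuilder.p {
    // Emits an XML comment inside <p>
    mkp.comment('generated')
    // Emits inline text; special characters such as & are escaped
    mkp.yield('Tom & Jerry')
}

def out = writer.toString()
assert out.contains('<!-- generated -->')
assert out.contains('Tom &amp; Jerry')
```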

Manipulating XML

Both XmlParser and XmlSlurper provide methods to manipulate existing XML directly, on the fly, much as you would with a DOM API. This is covered by the official documentation.

However, the code might get messy quite easily if you try to restructure major parts of the original document. You can then literally get lost in your document structure!

For our example, where we really want to flip an entire table structure over, I will present a different approach here that combines XML parsing and writing.

Transforming XML

This is realized here as the combination of parsing the original document into a node structure and then building a new node structure from these original parts.

This is really straightforward. For instance, let’s assume that we want to build an HTML table with the same attributes as an existing HTML table node.

Remember that the general syntax for building a tree node is (in pseudocode):
builder.nodeName(attributesMap) {
    child1()
    child2(…) {
        grandChild1()
        …
    }
    "${dynamicChildName}"()
    …
    childN()
}
In the following simple example, let’s assume xmlBuilder is our instance of MarkupBuilder, and html is the root node of an HTML document parsed with an XmlParser:
xmlBuilder.table(html.body.table[0].attributes())
This will build an HTML table with the attributes copied from the previously parsed original HTML table. Because we can use arbitrary Groovy code within the builder, we can manipulate the original nodes in whatever fashion we wish before using them in the new builder.
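
Here is a minimal, self-contained sketch of that idea. It assumes input that is already well-formed, so no HTML sanitizer is involved, and it adds a made-up id attribute to show one possible manipulation. On Groovy 4 and later, XmlParser would probably have to be imported from groovy.xml.

```groovy
import groovy.xml.MarkupBuilder

// Hypothetical, already well-formed input
def html = new XmlParser().parseText(
    '<html><body><table class="wikitable" border="1"><tr><td>x</td></tr></table></body></html>')

def original = html.body.table[0]

def writer = new StringWriter()
def xmlBuilder = new MarkupBuilder(writer)

// attributes() returns a Map; Map + Map yields a new Map,
// so the parsed node itself is left untouched
xmlBuilder.table(original.attributes() + [id: 'flipped'])

def out = writer.toString()
assert out.startsWith('<table')
assert out.contains("class='wikitable'")
assert out.contains("border='1'")
assert out.contains("id='flipped'")
assert original.attributes().size() == 2
```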

However, adding a node and all of its child nodes to a new builder is a bit tricky. This calls for a dynamic, recursive solution. Let’s build a general-purpose function that performs a deep identity transformation from node to node!

A dynamic XML builder

This method will dynamically insert the node provided, including its complete sub-structure, into the builder provided:
protected static copy(node, builder) {
    if (node in String) {
        builder.mkp.yield(node)
    }
    else {
        builder."${node.name().localPart}"(node.attributes()) {
            node.children().each { child ->
                copy(child, builder)
            }
        }
    }
}
Here’s how it works:
  • The node can be a simple String. This is the case for HTML inner texts. In this case, simply yield the text.
  • Else, add a node with the name dynamically built from the original node’s local name, i.e. the name without the namespace part.
  • Copy its attributes.
  • For all its children, call this function recursively.
This really is the identity transformation of a node tree: when called with the originally parsed node tree, it simply rebuilds that same tree in the builder, which is admittedly useless as such.
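
To try the function out without TagSoup, the sketch below repeats its logic as a script method and feeds it a small, made-up document. It makes one assumption beyond the original: when the input carries no namespace, XmlParser yields plain String node names, so the name is only unwrapped via localPart when it is not a String.

```groovy
import groovy.xml.MarkupBuilder

// Same logic as copy() above, plus tolerance for plain String node names
// (an addition made for this sketch only)
def copy(node, builder) {
    if (node instanceof String) {
        builder.mkp.yield(node)
    } else {
        def name = node.name()
        def localName = name instanceof String ? name : name.localPart
        builder."${localName}"(node.attributes()) {
            node.children().each { child -> copy(child, builder) }
        }
    }
}

// Hypothetical well-formed input without namespaces
def source = new XmlParser().parseText('<div id="a"><p class="x">Hi<b>there</b></p></div>')

def writer = new StringWriter()
copy(source, new MarkupBuilder(writer))
def out = writer.toString()

assert out.contains("<div id='a'>")
assert out.contains("<p class='x'>")
assert out.contains('Hi')
assert out.contains('there')
```

The output carries the same elements, attributes and texts as the input, only re-serialized by MarkupBuilder.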

Still, it will serve as a key ingredient of our example tree transformation function which is explained on the next page.
