May 10, 2015

Writing a code generator – with Java Regex, Groovy or ANTLR? (part 1 of 2)





In this article I’d like to introduce the topic of code generation in a very practical, quick tutorial. Having discussed the general ideas and advantages of code generation in a previous blog post, I now want to show you how to kick-start code generation in a project, and especially how to choose the right tool in a JVM-based environment: a general-purpose programming language, a DSL or a full-fledged parser?


When it comes to applying the Generate Everything strategy (automate code / document / file generation for all non-trivial, non-creative software project artifacts), it ultimately boils down to choosing the right tool for the job. Most developers seem to naturally favor tools which just barely fulfill the requirements of the concrete task at hand. However, I think it is worth considering a single general-purpose tool instead. It might be overpowered for a certain task, but as long as it’s comprehensive and scales down to the current requirements, you’re probably on the safe side:
  • If the current requirements one day develop further, you can still stick with the same tool.
  • There’s potentially only one single tool for that kind of task in your entire development pipeline which eases knowledge transfer.
  • The danger of having to adapt the tool or come up with home-brew extensions is reduced; you can just use the arsenal the almighty tool provides.
I will illustrate this here with a small code generation example.

Let’s assume we’re in a Java project where we have lots of conversion code between entities. Because Java is statically typed and because Java’s grammar doesn’t support property accessors, this typically results in lots of getter/setter boilerplate code:
myCustomer.setName(remoteCustomer.getName());
myCustomer.setAddress(remoteCustomer.getAddress());
myCustomer.getAddress().setStreet(remoteCustomer.getAddress().getStreet());
myCustomer.getAddress().setCity(remoteCustomer.getAddress().getCity());
This code is hard to read and hard to maintain. We’ll thus write a simple code generator which translates the following simple syntax into the Java code above:
myCustomer.name = remoteCustomer.name
myCustomer.address = remoteCustomer.address
myCustomer.address.street = remoteCustomer.address.street
myCustomer.address.city = remoteCustomer.address.city
(This is actually the Groovy equivalent of the Java code above; you could say that we’re about to write a mini Groovy-to-Java translator.)

When restricted to Java or Java-related technologies, which tool will likely do the job most efficiently (in terms of development costs)? Three choices come to my mind, with ascending (apparent) complexity:
  • Java String Regex parsing
  • Building a DSL, e.g. with Groovy
  • Building a full-fledged parser, e.g. with ANTLR4.
We’ll now try out and discuss these three tools in detail.

Using Java String Regex parsing

At first glance, this problem seems rather trivial; any other technique than using Regex parsing looks like overkill. So let’s go for that. I will write the code in Groovy to keep the source code short and concise.
class RegexTranslator {
    public static String translate(String input) {
        List<Assignment> assignments = input.split("\n")
            .findAll {it != ""}.collect { String assignment ->
            translateAssignment(assignment)
        }
        return Assignment.toString(assignments)
    }
    
    private static Assignment translateAssignment(String assignment) {
        def expressions = (assignment =~ /(.*?) = (.*)/)[0]
        String leftExpression = expressions[1]
        String rightExpression = expressions[2]
        Expression left = translateExpression(leftExpression)
        Expression right = translateExpression(rightExpression)
        return new Assignment(left, right)
    }
    
    private static Expression translateExpression(String expression) {
        List properties = expression.split(/\./).collect { new Property(it) }
        return new Expression(properties)
    }
}
Here’s a simple translator which uses Java Regex to parse an input String and translate it into an internal data structure, which is then used to print its content in the desired format (i.e. assignments with getter / setter calls). For this discussion, we’re not actually interested in that last step of translation, which is done in the “model” classes Assignment, Expression and Property, so I leave that code out here. Please check out the source code of these classes in the GitHub repository if you are interested.
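To give a rough idea of what that omitted last step might look like, here is a minimal sketch in Java. It is not the code from the GitHub repository: the class names are merely reused for readability, Expression takes a dotted String instead of a list of properties, and the helpers capitalized(), getterChain(), toJava() and translate() are hypothetical. It also assumes that the left-hand side always has at least two segments.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for the article's model classes; the actual
// implementation lives in the GitHub repository and may look different.
class ModelSketch {

    /** One segment of a dotted path, e.g. "address". */
    static final class Property {
        final String name;
        Property(String name) { this.name = name; }

        /** "address" becomes "Address", used to build getAddress / setAddress. */
        String capitalized() {
            return Character.toUpperCase(name.charAt(0)) + name.substring(1);
        }
    }

    /** A dotted path such as myCustomer.address.street. */
    static final class Expression {
        final List<Property> properties = new ArrayList<>();

        Expression(String dottedPath) {
            // split() takes a regex, so the dot has to be escaped
            for (String segment : dottedPath.trim().split("\\.")) {
                properties.add(new Property(segment));
            }
        }

        /** Root variable plus one getter call for each of the next `getters` segments. */
        String getterChain(int getters) {
            StringBuilder java = new StringBuilder(properties.get(0).name);
            for (int i = 1; i <= getters; i++) {
                java.append(".get").append(properties.get(i).capitalized()).append("()");
            }
            return java.toString();
        }
    }

    /** left = right, rendered as a setter call on the left and getter calls on the right. */
    static final class Assignment {
        final Expression left;
        final Expression right;
        Assignment(Expression left, Expression right) { this.left = left; this.right = right; }

        String toJava() {
            int last = left.properties.size() - 1;
            String target = left.getterChain(last - 1);  // everything before the last segment
            String setter = ".set" + left.properties.get(last).capitalized();
            String value = right.getterChain(right.properties.size() - 1);
            return target + setter + "(" + value + ");";
        }
    }

    /** Convenience helper (an assumption of this sketch): renders one assignment. */
    static String translate(String left, String right) {
        return new Assignment(new Expression(left), new Expression(right)).toJava();
    }

    public static void main(String[] args) {
        System.out.println(translate("myCustomer.name", "remoteCustomer.name"));
        System.out.println(translate("myCustomer.address.street", "remoteCustomer.address.street"));
    }
}
```

The point of such a model is that the output format is decided in one place, independently of how the input was parsed.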

Let’s examine that source code. It seems as if using Regex parsing solves the problem accurately and efficiently with 20-odd lines of code. However, I would assert that this code is hard to read and to maintain: the actual parsing is scattered all over the source code because it happens imperatively. Note that apart from the obvious Regex matching in translateAssignment (line 11 of the listing), there are more “hidden” uses of Regex all over the place, namely the calls to String#split(String).
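Why the calls to String#split(String) count as “hidden” Regex becomes clear as soon as the delimiter is a Regex meta character. The following small Java sketch (the class SplitIsRegex and the helper splitPath are made up for this illustration) shows what happens when the dot is not escaped:

```java
import java.util.Arrays;

class SplitIsRegex {

    /** Splits a dotted path; the dot is escaped because split() expects a regex. */
    static String[] splitPath(String path) {
        return path.split("\\.");
    }

    public static void main(String[] args) {
        String path = "myCustomer.address.street";

        // An unescaped "." matches every character, so only empty strings
        // are left over, and split() drops trailing empty strings.
        System.out.println(path.split(".").length);

        // With the escaped dot, the three path segments are returned.
        System.out.println(Arrays.toString(splitPath(path)));
    }
}
```

The translator above does escape the dot correctly; the sketch only shows that every such call is a small piece of parsing logic in its own right.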

One should prefer to define the parser's grammar in a declarative way instead, just like a template, in order to separate grammar structure and interpretation.

Let’s now assume that in the meantime, we discovered that the requirements need to be updated. Actually, the generated Java source code would lack an important feature, possibly leading to runtime errors: It needs null checks when copying sub-structures, as in this snippet:
myCustomer.setName(remoteCustomer.getName());
if (remoteCustomer.getAddress() != null) {
    myCustomer.setAddress(remoteCustomer.getAddress());
    myCustomer.getAddress().setStreet(remoteCustomer.getAddress().getStreet());
    myCustomer.getAddress().setCity(remoteCustomer.getAddress().getCity());
}
We will thus extend our grammar to offer the possibility for null-checks:
myCustomer.name = remoteCustomer.name
notNull(remoteCustomer.address) {
    myCustomer.address = remoteCustomer.address
    myCustomer.address.street = remoteCustomer.address.street
    myCustomer.address.city = remoteCustomer.address.city
}
Now the disadvantages of using plain Java Regex parsing become obvious. One would have to hook in somewhere in the parsing code, potentially breaking existing logic; also it seems clear that parsing more sophisticated structures like code blocks just can’t be done efficiently with pure Java Regex.
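The problem can be made visible with a few lines of Java. The sketch below reuses the shape of the assignment pattern from the translator and applies it line by line to the extended syntax; the class LineRegexLimits and the method isAssignment are hypothetical names chosen for this illustration.

```java
import java.util.regex.Pattern;

class LineRegexLimits {

    // Same shape as the assignment pattern used in the Regex translator.
    static final Pattern ASSIGNMENT = Pattern.compile("(.*?) = (.*)");

    /** True if the (trimmed) line looks like "left = right". */
    static boolean isAssignment(String line) {
        return ASSIGNMENT.matcher(line.trim()).matches();
    }

    public static void main(String[] args) {
        String input = String.join("\n",
                "myCustomer.name = remoteCustomer.name",
                "notNull(remoteCustomer.address) {",
                "    myCustomer.address = remoteCustomer.address",
                "}");

        // The block header and the closing brace are not assignments, and a
        // line-by-line match cannot tell which lines belong inside the block.
        for (String line : input.split("\n")) {
            System.out.println(isAssignment(line) + "  " + line);
        }
    }
}
```

Supporting the block would mean adding state that tracks the opening and closing lines across iterations, which is exactly the kind of hand-written parsing logic we would like to avoid.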

Let’s recap the Java Regex approach here:
  • Bad: The parsing is happening imperatively; there’s no separation between grammar and its interpretation. Grammar and application logic are tightly coupled.
  • Bad: Parsing structures more complex than single code lines gets intricate quickly. Parsing nested structures is hard or even impossible, depending on the Regex implementation.
I will not even try to update the Regex parser for this extended requirement. Instead, I will move on to the next candidate technique.

Building a DSL with Groovy

As one might still hesitate to build a full-fledged grammar to solve this problem, the next logical step would be to build some kind of “mini language”, i.e. a domain-specific language (DSL). Actually, Regex itself is an example of a DSL, designed to express regular expressions (what a surprise). When building our own DSL, we really embed requirement-specific syntax into the syntax of an existing general-purpose programming language. Here, I again choose Groovy as the “host language”.
class GroovyTranslator {
    private List<Assignment> assignments = []
    
    public static String translate(String input) {
        GroovyTranslator instance = new GroovyTranslator()
        
        // create a new expression upon getter property access (.)
        Expression.metaClass.propertyMissing << { String name ->
            delegate.properties << new Property(name)
            return delegate
        }
        
        // create a new assignment upon setter property access (=)
        // note that getter has higher precedence than setter property access
        Expression.metaClass.propertyMissing << { String name, value ->
            delegate.properties << new Property(name)
            instance.assignments << new Assignment(delegate, value)
        }
        
        Eval.me("instance", instance, """
import ch.codebulb.codegenerationcompared.groovy.*
import ch.codebulb.codegenerationcompared.structure.*
""" + input + """
public propertyMissing(String name) {
    // create a new property for unknown variable references
    // and pack it into a new expression
    new Expression([new Property(name)])
}
public notNull(Expression condition, Closure closure) {
    // pure delegate
    instance.notNull(condition, closure)
}
""")
        return Assignment.toString(nestAssignments(instance.assignments))
    }
    
    /**
     * Nests assignments in a flat list into blocks of assignments
     * where "start" and "end" block markers are present.
     */
    private static List<Assignment> nestAssignments(List<Assignment> assignments) {
        AssignmentBlock currentBlock
        return assignments.collect {
            // "start" marker block
            if (it in AssignmentBlock && it.condition != null) {
                currentBlock = it
                return null
            }
            
            if (currentBlock == null) {
                return it // a simple assignment, not part of a block
            }
            else {
                // "end" marker block
                if (it in AssignmentBlock) {
                    assert null == it.condition
                    def ret = currentBlock
                    currentBlock = null
                    return ret
                }
                // an assignment within a block
                else {
                    currentBlock.assignments << it
                    return null
                }
            }
        }.findAll {it != null}
    }

    protected notNull(Expression condition, Closure closure) {
        // the actual block which also serves as the "start" marker block
        assignments << new AssignmentBlock(condition)
        // create all the block's child assignments
        closure()
        // "end" marker block
        assignments << new AssignmentBlock()
    }
}
This is the complete source code. Note that the existing internal representation structures did not change; however, a new structure called AssignmentBlock has been introduced, which binds multiple Assignments to a “not null” condition, as demanded by the new requirement. The updated requirements naturally make the code more complex; with the Regex solution, the increase would potentially have been even greater.
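As a rough illustration of what such a block has to produce in the end, here is a minimal Java sketch which wraps already rendered statements in a null check. It is not the AssignmentBlock class from the repository; the class NullCheckBlockSketch and its render method are assumptions made for this example, and the statements are passed in as plain Strings to keep it short.

```java
import java.util.Arrays;
import java.util.List;

class NullCheckBlockSketch {

    /** Wraps already rendered Java statements in a null check on `condition`. */
    static String render(String condition, List<String> statements) {
        StringBuilder java = new StringBuilder("if (" + condition + " != null) {\n");
        for (String statement : statements) {
            java.append("    ").append(statement).append("\n");
        }
        return java.append("}").toString();
    }

    public static void main(String[] args) {
        System.out.println(render(
                "remoteCustomer.getAddress()",
                Arrays.asList(
                        "myCustomer.setAddress(remoteCustomer.getAddress());",
                        "myCustomer.getAddress().setCity(remoteCustomer.getAddress().getCity());")));
    }
}
```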

This code works and fulfils the requirements; exactly how it works is actually not that relevant to this discussion. Instead, let’s make some general observations:

First of all, as mentioned above, it works, which is great. Using the power of the Groovy language (especially its capabilities of runtime evaluation, dynamic typing, and meta-programming), we’re able to design a grammar which obviously diverges quite remarkably from what Java code looks like, and we’re still able to let the Groovy compiler parse it for us instead of having to write String parsing on our own. Note that there is not a single (implicit or explicit) Regex functionality invocation anywhere in the code!

Still, we were forced to include some advanced or even nasty meta-programming to really do the trick. The very same techniques that make it possible to write the code at all also diminish its maintainability: runtime evaluation, dynamic typing, and meta-programming all make the code hard to read and hard to maintain. If you change a single line, everything might break, and debugging would be really hard.

And even though decoupling from the actual parsing process is higher now, one is instead forced to think in the host language’s structure. The grammar specification is more declarative now, but not declarative in terms of the grammar itself, but in terms of the host language’s internal structure.

Let’s recap what we learnt about DSLs here:
  • Good: Using a powerful host language, DSL syntax is not restricted to the host language’s syntactic features whilst it still can be parsed by that host language. This means: no manual String parsing.
  • Good: Because the host language’s capabilities are still retained even in the custom DSL, one could use the host language’s features (i.e. write valid Groovy / Java code) inside the DSL code (a feature unused in this example, but which might come in handy in other scenarios).
  • Bad: Needs a profound knowledge of the host language’s DSL capabilities.
  • Bad: Doesn’t provide separation between grammatical structure and interpretation either. Code becomes highly dependent on the internal mechanisms of the host language.
So, unfortunately, even building a DSL seems to not always be the best option when it comes to “ease of development / maintainability” considerations.
