May 10, 2015

Writing a code generator – with Java Regex, Groovy or ANTLR? (part 2 of 2)


In this article I’d like to introduce the topic of code generation in a very practical quick tutorial and to to show how to kick-start code generation in a project, and especially how to choose the right tool in a JVM-based environment: A general purpose programming language, a DSL or a full-fletched parser? Make sure to read part 1 first.

Building a Parser using ANTRL4

Reluctantly, we now consider involving a full-fletched parsing framework to do the job. I choose ANTLR4 here, the most recent release of a very straightforward Java-based parser.

The process of parsing a custom grammar with ANTLR4 is twofold:
  • First, you need to describe your grammar.
  • Then you invoke ANTLR source code generation on that grammar which generates the base class of a listener which can then be extended to react upon grammar construct recognition.
Thus let’s start with the grammar. The grammar is described in a file called PropertyAssignment.g4:
grammar PropertyAssignment;

file: instruction+ ;

instruction
 : assignmentBlock
 | assignment
 ;

assignmentBlock: 'notNull(' condition ') {' assignment+ '}' ;
condition: expression ;
assignment: expression ' = ' expression ;
expression: property ('.' property)* ;
property: TEXT ;

TEXT: [A-Za-z]+ ;
WS: [\t\r\n]+ -> skip ;
This is how you specify your grammar in ANTLR syntax which you can think of as a somewhat enhanced variety of Regex syntax. Even without knowing the details, with basic Regex knowledge one is able to grasp the grammar rules denoted here: A file consists of one or more instructions, an instruction consists of either an assignmentBlock or an assignment, and so on.

Grammars specified in ANTLR4 are easy to read because unlike pure Regex, all its parts are implicitly labeled. When compared on a functional level, it’s superior to Regex because it allows repetition and nesting of matches. Also note the -> skip directive which lets us declaratively specify which parts of input we simply want to ignore whenever they’re met.

By using ANTLR, we’re finally able to specify our grammar in a declarative way.

Step 2 is to let ANTLR generate the parser for that grammar. This needs just a single command line instruction (presuming ANTLR is set up properly):
antlr4 PropertyAssignment.g4
ANTLR generated the PropertyAssignmentBaseListener class, the listener methods of which will be invoked upon enter / exit of one of the grammar nodes. By extending this class, you can execute arbitrary code in the listener. Of course we will use these listener methods to build our internal structure, once again. Because ANTLR is based on Java, we can just as well implement the listener in Groovy:
class AntlrTranslatorListener extends PropertyAssignmentBaseListener {
    ParseTreeProperty assignments = new ParseTreeProperty();
    
    @Override
    public void exitProperty(@NotNull PropertyAssignmentParser.PropertyContext ctx) {
        assignments[ctx] = new Property(ctx.TEXT().text)
    }
    
    @Override
    public void exitExpression(@NotNull PropertyAssignmentParser.ExpressionContext ctx) {
        assignments[ctx] = new Expression(ctx.property().collect {assignments[it]})
    }
    
    @Override
    public void exitAssignment(@NotNull PropertyAssignmentParser.AssignmentContext ctx) {
        assignments[ctx] = new Assignment(assignments[ctx.expression(0)], assignments[ctx.expression(1)])
    }
    
    @Override
    public void exitCondition(@NotNull PropertyAssignmentParser.ConditionContext ctx) {
        exitExpression(ctx.expression())
    }
    
    @Override
    public void exitAssignmentBlock(@NotNull PropertyAssignmentParser.AssignmentBlockContext ctx) {
        assignments[ctx] = new AssignmentBlock(
            // because we skip annotating "condition", we get its child directly
            assignments[ctx.condition().expression()],
            ctx.assignment().collect {assignments[it]})
    }
    
    @Override
    public void exitFile(@NotNull PropertyAssignmentParser.FileContext ctx) {
        assignments[ctx] = ctx.instruction()
            // because we skip annotating "instruction", we get its children directly
            .collect { it.assignment() ?: it.assignmentBlock() }
            .collect {assignments[it]}
    }
}
As mentioned before, all you have to do is react upon discovery of syntax nodes. This code actually reacts on the “exit” events of a node. As you can see, the listener method’s names correspond to the associated grammar rule. All we do inside the listener methods is constructing the corresponding internal representation of the node and store it in a key-value map.

So we do e.g. on exitProperty. (You can think of the ctx object as representing the current tree node.) When we climb up the tree to nodes which consist of child nodes, we access the map to retrieve the internal representation we previously constructed from the child nodes. For example, in exitExpression, we build a new Expression as the internal representation of that node, and that constructor takes a list of Propertys; thus, for every property of the node, we retrieve its internal representation which we put into the map earlier.

Because we can write the listener in plain Java, we can really do whatever we want inside listener methods. Using some little meta-programming, this seems to work even more smoothly in Groovy.

And that code is all it takes to solve this problem using ANTLR! Of course we later need to invoke that listener on our input String, which just involves some codes of boilerplate code. Please refer to the project’s source code if you’re interested.

By using ANTLR, we’re finally able to decouple the grammar definition from its interpretation.

As you may have guessed already, this example covered but some of the basic functionality ANTLR provides. There are several advanced parsing features such as controlling the parser’s execution flow or altering grammar rules on the fly.

One more very interesting feature is its error handling. Using ANTLR, you basically got grammar verification out of the box. ANTLR naturally detects syntax errors in the input and reports errors intelligibly. Just imagine how hard it would be to implement reaction to illegal input using the Regex or Groovy DSL approach! ANTLR4 goes even one step further and tries to guess what you might have meant when providing erroneous input. If, in our example grammar, you would make a tiny error such as removing the closing bracket } of a “notNull” block, ANTLR will print out a warning during the parsing process, and will then just pretend the bracket was there, thus resulting in correct output!

Without going too much into the details of ANTLR, let’s recap here:
  • Good: You can describe highly readable grammar declaratively without any restriction.
  • Good: Grammar declaration and interpretation is clearly separated; an interpreter really is just an event listener which can be implemented in your preferred JVM language.
  • Good: Built-in grammar error recognition and recovery.
  • Fairly good: It’s possible to include dynamic grammar processing; however, this needs advanced knowledge of ANTLR.

Conclusion

Here, I’d like to can make a final comparison and assessment on the three possibilities of code generation discussed above:

It’s obvious that ANTLR offers the most powerful, straightforward and concise way of specifying and implementing a parser. It promotes the good engineering practice of separation of concerns, namely by clearly separating grammar definition and interpretation. Despite its power, the learning curve for its basic usage is really flat; especially when compared with, say Groovy DSL meta-programming.

It’s really just another very handy tool you want to keep on your shelf, and which will most likely scale to any requirements you will encounter on code generation terrain. It’s simple enough for use in a small project as it was done here, but it’s also a reliable companion for the most sophisticated parsing tasks you will likely ever encounter.

However, in some situations, you will probably still have recourse to alternative tools:
  • For very small parsing tasks (basically one-liners) of course everything else than Regex is likely overkill.
  • If you have good Groovy skills and you need a simple way to implement a really dynamic Grammar, or if you need to really mix your own language with pure Java/Groovy statements, it will be easier to do so in Groovy DSL.
I encourage you to download and study the source code of this small demo project. Take a look at the ANTRL website and try to build up your own mini-grammar.

Whether you agree or disagree with this article, please share your own experience with any of these (or other) means of code generation in the comments below.


Pages: 1 2

No comments:

Post a Comment