March 29, 2015

Software artifact management with code generation


In this post, I’d like to share my opinion and the “best practices” I have learnt from experience when it comes to managing software project artifacts, along with strategies for generating one artifact type from another.

Throughout the software development process, artifacts of different types are created. Some become part of the final product, while others serve informational or organizational purposes during the process only; they typically don’t get delivered as part of the product.

Creating an artifact is always time-consuming. Depending on the artifact type and the amount of information it stores, this time can vary considerably. Thus one should aim to only create a minimum of artifacts and to use whatever artifact type is most suited to store the information one wants to keep.

As with other processes in software engineering, automation can and should be used to increase productivity when generating artifacts. Automating artifact generation both increases productivity and decreases probability of error in the process. The simplest strategy clearly is to “generate everything”. But if we take a closer look at the differences between software development artifacts, we can come up with some more detailed best practices for artifact generation.

I will specifically address here the two biggest challenges of artifact generation: its implementation and artifact synchronization.

Know your artifacts

Artifacts can be distinguished by type and function.

Rough categories can be used to distinguish the various types: text documents (e.g. Word documents), source code (e.g. Java source code files), structured information (e.g. XML files), and compiled files (e.g. JAR archives). We shall come back to these different types later.

More important is the distinction by function into the two categories of “human artifacts” and “machine artifacts”.
  • A human artifact is actively worked on by humans. Its creation involves a creative thinking process which cannot be automated by a machine.
  • A machine artifact is basically of no use for humans, but for technical reasons, its existence is required by the technology / framework the project is built upon.

Opportunities for automated artifact generation

There are a couple of variations on how human and machine artifacts can be connected with each other:

No auto-generation

A human artifact is also a machine artifact and vice versa. This is the typical situation of “no automated artifact generation at all”: an artifact is created (and, optionally, worked with) by humans and interpreted by the machine.

This is quite an unlucky situation for both humans and machines: unless very trivially encoded, artifacts are typically hard to read for either humans or machines, and there are only a few simple formats which might satisfy both. Nevertheless, this is quite a common situation. The most prominent examples include XML, HTML, CSS, and even source code (as it is interpreted by the IDE for syntax highlighting and the like).

This is typically a great opportunity to separate that artifact into one which is created by humans, and another one, generated from the first, which is then interpreted by the machine, thus making both parties happy, as shown in the next section.

Human-to-machine-generation

This is the most straightforward application of artifact generation: An artifact is created by humans, and then converted into a format which fits the machine’s needs to interpret it. The most typical example here is source code (as it’s compiled to bytecode / machine code by the compiler).

Note that in this situation, the human artifact stays the actively maintained one. It is thus safe to simply re-create and overwrite the machine artifact generated from a former version of the human artifact.

Machine-to-human-generation

This is an anti-pattern you want to avoid. Typically, extracting information from a machine artifact into human-readable form is hard, and automating it can be a challenge as well. A typical example is a hex dump you have to search through for information, or a raw error log.

However, if you are in this situation, you really should try to automate it unless reading the file format is trivial. Typically, there are existing “viewers” available for common machine artifacts. If there aren’t, creating a viewer or a machine-to-human generator once will in the long run be more efficient than manually interpreting the file every time. Here you can take advantage of the fact that machine files follow clearly defined grammar rules, which you can use to build your generator.
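
As a minimal sketch of such a generator (assuming a hypothetical line-oriented log format in which each error line ends with an error code), a few lines of Groovy suffice:

    // Minimal machine-to-human generator sketch (hypothetical log format):
    // collects all ERROR lines from a raw server log and prints a readable
    // summary, grouped by the error code at the end of each line.
    def errorCounts = new File('server.log').readLines()
            .findAll { it.contains('ERROR') }
            .countBy { line -> line.split(/\s+/)[-1] }

    errorCounts.each { code, count ->
        println "${count} x ${code}"
    }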

Circles

This is what you want to avoid at all costs. It is typically the outcome of a poor artifact management strategy. This situation usually starts out with a human artifact being converted into a machine artifact, but then the latter either is edited directly by humans (and thus turned into a human artifact) or is used as a base to generate other human artifacts. In this situation, synchronization is no longer assured, which can become a considerable source of errors.

As a typical example, take this (real-world) situation:
  1. A web service is described in a Word document after negotiations with third party software vendors, and handed over to the engineers to implement it. They implement the web service, based on the information in the Word document (a non-automated human-to-machine-generation), the outcome of which is a web service descriptor (e.g. a WSDL file).
  2. However, as implementation proceeds, slight changes are directly incorporated into the WSDL file. After that, the team is forced to extract the updated service contract from the WSDL file into a human-readable Word file (a non-automated machine-to-human-generation) in order to inform the third party software vendors about the final implementation of the web service.
In this situation, information is brought back from the generated artifact into the originally generating artifact (or even into a third one). Without one clear “primary” artifact, changes are hard to synchronize, and introducing automation is typically hard, if not impossible, when it involves artifact types which clearly target a human audience, such as Word files.

Here, the better strategy is to break up the circle and instead transform it into a chain:
  1. The Word file description of the web service is actually turned into a WSDL file (which typically happens manually). After this point, the original Word document is abandoned and not updated any further.
  2. As the WSDL file is then edited directly by the engineers, it actually becomes another semi-human / semi-machine artifact. It is now the “primary” artifact in the process.
  3. As the WSDL is tolerably readable by the engineers, but not by the third party software vendors, it must be turned into a “truly” human-readable format. Note that even though this format must be agreed on by all the parties involved, it need not match the original Word file format. In this situation, I would suggest choosing a format for which generation out of the WSDL file can be automated, e.g. using a WSDL-to-HTML generator (an example follows further below). Note that here, the WSDL file stays the “primary” artifact, and all subsequent updates happen there, with the possibility to re-generate the “human-readable” (e.g. HTML) outcome at any time.

Even better would be a chain such as this:

Here, the Word file is turned into another human/engineers-readable file from which both the machine file (WSDL) and the final human-readable service description (HTML) can be generated.

However, this will typically require a custom-made file generator which will be discussed below.

Choosing the right file type

When it comes to choosing the right file type for storing information, the choice is by nature much freer for “human” files, as the type of “machine” files will be preset by the technology / framework in use; there is thus only an indirect choice through picking the right technology. Hence, do not underestimate this impact during technology assessment!

Keep it simple

For human files, on the other hand, we can really choose the file type which best suits our needs:
  • It has the simplest format.
  • It has the least overhead, i.e. we can provide the actual information with a minimum of meta-data.
  • If there’s the need to convert from / to machine artifacts, or to use it both as a human and a machine artifact, conversion should be straightforward and easy to automate.
Note that the first two points could actually be outweighed by a more “chatty” file format with great tool support which automates all the boilerplate meta-data generation. Keep in mind, though, that this will make you dependent on that tool. If you don’t have it, or if it is complicated to apply in a certain situation (e.g. when browsing plain request / response log messages in a server console), you might still face the full complexity of the original file format.

As a rule of thumb, however, simplicity is key: The file really should only include the amount of meta-data which is absolutely critical for the job.

That’s why I would prefer e.g. a Java .properties file over a JSON file, and a JSON file over an XML file.
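
To illustrate with a made-up configuration entry: in .properties, a single line suffices:

    mail.server.host=smtp.example.com

The same information in JSON already comes with some structural noise:

    { "mail": { "server": { "host": "smtp.example.com" } } }

And in XML, it is wrapped in markup that easily outweighs the payload:

    <mail>
      <server>
        <host>smtp.example.com</host>
      </server>
    </mail>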

In certain situations, the best strategy might even be to build your own file format. This brings two major advantages:
  • You can keep the format as simple and as concise as you need it.
  • You can convert it into any other format of your choice, as long as you create the generator yourself.
Using fairly modern tools, creating a file converter / code generator for your own “domain specific language” (DSL) is quite simple. For example, the Groovy programming language provides a wide range of tools extending the Java language for DSL creation and code generation. I will probably illustrate modern code generation practices in a future blog post…
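
As a minimal sketch of what such a DSL might look like (all names made up), Groovy’s closure delegation is already enough to get started:

    // Minimal Groovy DSL sketch (hypothetical "service description" format):
    // the closure is evaluated against a spec object, turning a concise
    // description into a data structure any generator can consume.
    class ServiceSpec {
        String name
        Map<String, String> operations = [:]
        void operation(String op, String returnType) { operations[op] = returnType }
    }

    ServiceSpec service(Closure body) {
        def spec = new ServiceSpec()
        body.delegate = spec
        body.resolveStrategy = Closure.DELEGATE_FIRST
        body()
        return spec
    }

    def spec = service {
        name = 'OrderService'                        // property set on the delegate
        operation 'placeOrder', 'OrderConfirmation'  // method call on the delegate
    }

    assert spec.operations['placeOrder'] == 'OrderConfirmation'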

Thus it’s really your choice, based on weighing effort against benefit. Take for example a quite complicated XML format such as WSDL: the actual information makes up maybe less than 25% of such a file; the rest is duplication and boilerplate noise. If you had a custom tool that allowed you to just input the information and have the rest generated for you, productivity would increase drastically, and the probability of errors would be lowered.

Keep it the same

Another factor to keep in mind is that you want to work with as few different file formats as possible, in particular when it comes to the most complex ones. In an ideal world, you would even have only one file format, and generate the rest out of it. Of course, this is not realistic.

A good strategy is to identify a small set of “primary file formats” you want to work with, and to try to generate all the other formats from them. These should typically be the simplest file formats, or the ones your team is most familiar with.

If we take the WSDL file type as the example again, in a Java project, a possible strategy would be to use a Java-to-WSDL-generator: As all your engineers are fluent in the Java language, why introduce another language instead of letting your engineers stick to what they know best?
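
A minimal sketch of this approach (a hypothetical service; JAX-WS is one common way to do this in a Java project): the contract is kept as a plain, annotated Java interface, and the WSDL is derived from it at build or deployment time.

    // Hypothetical JAX-WS service contract: the annotated interface is the
    // primary artifact; the WSDL is generated from it (e.g. by the wsgen tool
    // or by the application server at deployment time).
    import javax.jws.WebMethod;
    import javax.jws.WebService;

    @WebService(targetNamespace = "http://example.com/orders")
    public interface OrderService {
        @WebMethod
        String placeOrder(String orderId);
    }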

Choosing the right generator

Of course, the choice of an automated generator / converter depends on the file type chosen, thus the same rules apply as presented in the last section:
  • Keep it simple: Consider writing your own generator if any existing one leaves you with a considerable amount of boilerplate code.
  • Keep it the same: A generator should translate from a very simple, familiar file format to a more complex one, never the other way around. Thus write your source file in the language you are familiar with, and translate it to whatever “foreign language” the machine needs.
The key here is to use the right tool for the job. You don’t want to introduce a generator which is more intricate to work with than the actual file it would generate! Thus, some requirements for any code generator / converter are:
  • It introduces the least amount of complexity to your tool chain.
  • It scales with your requirements, e.g. it is easily extendable to support more output options.
Again, in certain situations, it may be your best bet to opt for a custom file generator which you build yourself.

Let’s take the WSDL-to-HTML example again. Yes, there are quite a few tools which provide WSDL-to-HTML conversion (or, more generally speaking, XML-to-HTML conversion, e.g. based on XSLT), but most of them are rather complex to use, offer limited customization, require license payments and / or are closed source. In this scenario, you may as well opt for a custom solution, considering that both WSDL / XML parsing and HTML generation can be done quite easily with some modern programming languages, giving you full control over the actual conversion process. Again, see Groovy’s support for XML / HTML processing as an example.
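
As a minimal sketch (file names made up; WSDL namespace handling and any styling omitted for brevity), the core of such a converter fits into a few lines of Groovy:

    // Minimal WSDL-to-HTML converter sketch: XmlSlurper parses the WSDL,
    // MarkupBuilder renders an HTML overview of the service's operations.
    import groovy.xml.MarkupBuilder

    def definitions = new XmlSlurper().parse(new File('OrderService.wsdl'))

    new File('OrderService.html').withWriter { writer ->
        new MarkupBuilder(writer).html {
            body {
                h1 "Service: ${definitions.@name}"
                ul {
                    definitions.portType.operation.each { op ->
                        li op.@name.toString()
                    }
                }
            }
        }
    }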

Considering aspects such as ease of development and maintainability, I would thus prefer e.g. a Groovy-based solution over XSLT.

Enabling Continuous Delivery

Integrating your (custom) generator into your project’s build tool chain will even allow you to significantly speed up and unify your deployment process, which is, amongst other automation techniques, a key ingredient of Continuous Delivery best practices.
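
For instance (task and path names made up; this assumes a Gradle build with the java plugin applied), hooking the converter from above into the build ensures the HTML documentation is re-generated on every build instead of being updated by hand:

    // Hypothetical Gradle snippet: runs the custom WSDL-to-HTML generator on
    // every build, so the generated documentation can never go stale.
    // (Assumes the converter was adapted to read input / output paths from
    // its command line arguments.)
    task generateServiceDocs(type: JavaExec) {
        main = 'WsdlToHtml'
        classpath = sourceSets.main.runtimeClasspath
        args 'src/main/wsdl/OrderService.wsdl', "$buildDir/docs/OrderService.html"
    }

    build.dependsOn generateServiceDocs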

Conclusion

Writing complex code artifacts in unfamiliar code formats is an exhausting, error-prone task. Clever engineers will do what they are best at: write code to automate file conversion. As in other software engineering disciplines such as testing, automation done right will increase productivity and robustness when it comes to project artifact management.

Investing in a custom-made file generator may well be worthwhile, further increasing everyday productivity, especially as the cost of creating custom DSL code generators is quite low with modern technology and tools.

Last but not least, incorporating code generation in your deployment pipeline, and thus cutting out manual translation steps, is key when you want to move towards Continuous Delivery.

As always, please share your thoughts on this topic in the comments section below! Do you agree with my pro-“code generation” opinion? I would be particularly interested in hearing your most memorable experience when relying on or skipping code generation.
