Domain Specific Languages
This page discusses Domain Specific Languages, providing categories and information about how DSLs are engineered (or Life After Parsing). It shows how DSLs are key to enabling and extremely effective class of program generation technology based on program transformation, tracing the origins back to a 1970s tool called Draco. A modern code generation engine, the DMS Software Reengineering Toolkit, uses the Draco concepts to process many DSLs effectively. Surprisingly, DSLs are fundamental to implementing this class of generators; DSLs are used to support DSLs!
A typical procedural computer language is general purpose: it can solve arbitrary problems ("Turing capable") but is basically clumsy for almost everything (ask any programmer). A domain-specific language or DSL is a notation system (and corresponding semantics) that allows the succint description of a problem or a solution in a narrow problem ("domain") area. The notation takes advantage of the existence of a community of philsophers and engineers who have struggled to characterize a problem area and found such a notation that is effective, and solutions that can be produced for instances of the notation. Such notations may be text-like (mathematics, Ascii text, Unicode) or graphical (varieties of geometric shapes linked by various kinds of arcs), often annotated with text (from another DSL!)
A Zoo of DSLs
There are many existing DSLs, and people design new ones quite frequently. Here are some examples that existed well before computers:
- George Boole's algebra for writing down logic equations
- Leibniz' Calculus (he and Newton disagreed on the notation and Newton eventually won)
- Electronic circuit diagrams, with wires, batteries, resistors, vacuum tubes and even transistors
- Subway maps
Computers made it possible and useful to design many other DSLs:
- SQL, used to specify relational models of data and complex queries over that data, in an implementation dependent way
- Prolog, a language for solving problems using goal-directed backward chaining search and unification
- Relay ladder logic, a (grapical) notation derived from relay circuits, used to run factories
- Verilog, a language for specifying digital circuits
- Labview, a graphical dataflow langauge for implementing data acquisition and device control systems
- Drools, a language for stating problems with constraints, and finding consistent solutions
- Regular expressions, for describing a wide variety of sequences of characters
- HMTL, the presentation language of the World Wide Web
Engineers invent new notations frequently:
- PSL, the Property Specification Language, used to define conditions over sequences of events in digital systems
- Attribute grammars, for defining computations over abstract syntax trees
- XAML, Microsoft's approach to specifying rich user interfaces
Our particular perspective treats "typical computer languages" as just another kind of special purpose domain: general purpose computing. From this perspective, these languages are just notational systems with semantics with associated solutions ("compilers"), and viewed from afar, look just like all the other kinds of DSLs.
- COBOL, a language designed for business purposes, that by an large runs the corporations of the world
- Algol, one of early, cleanly designed procedural languages allowing call-by-name and recursion
- LISP, an early dynamic language, with the ability to manipulate its own structures, still favored by many AI programmer
- C, the language at the bottom of most extant operating systems and many embedded devices
- C++, used to build large complex applications and a favorite in the stock trading business
- Java, a failed attempt at an embedded language that succeded as run-anywhere enterprise application language
- C#, Microsoft's attempt to overrun Java and succeed at world domination
- Ruby, a hot new dynamic language used for building complex web applications
Somewhere, somehow, these DSLs get invented. Breifly, DSLs are motivated by the observation by a small set of individuals, who see a problem area, recognize that problems or solutions in that area all have some kind of commonality, and find a way to write down the commonality and the variability; a DSL is born. This process is necessarily messy; it is the scientific process applied to language design and usually involves a lot of guess- and re-work. Most of us are lucky enough to get to see and use the final product without the intermediate pain. This process is called domain analysis. You can read more about how to approach this is a systematic manner in Guillermo Arango's papers, including Domain Analysis Concepts and Research Directions.
For a DSL to become useful, ultimately it needs to acquire a stable notation, semantics, and in this modern day, tools to analyze the DSL, or to generate solutions based on the problem statement the DSL poses.
With these, all that is left is building machinery that can accepts instances for the formal language, and somehow process that instance into a solution, taking advantage knowledge of the semantics of the language. The classical approach for this is to build a compiler.
Now we get to a technology trap: building compilers is hard. We are stuck: we have a great way to write down problems... and it doesn't do us any good without the compiler.
Generic Program Transformation Engines
The term domain analysis came from Dr. James Neighbors in the late 1970s. He wanted to build sets of cooperating DSLs and faced the compiler problem. Fortunately, program transformation technology was being developed at the time, as a kind of generalized compiler. What program transformations do is transforms structures (think "graphs") incrementally into other structures; if you apply enough transforms, you can cause arbitrarily complex changes to the result (think optimization and code generation). So if you can represent your DSL as a structure, then you can use a program transformation system to manipulate it. Neighbors married metacompiler technology (in particular parser-generator generators) with program transformation systems to enable his Draco system to accept a variety of DSLs, and generate complete application systems from this. The structures he happened to use are Abstract Syntax Trees because string-parsers can produce these easily, but the idea applies equally well to graphical (e.g, graph-based) languages.
Draco is hardly the first application code generator system constructed. People have wanted these since time began. But Draco was one whose ideas are extremely well founded in software engineering and computer science techology, and whose elements acted with incredible synergy. Better, the Draco elements were reusable across many domains, and thus solves much of the compiler problem by amortizing the cost of building the compiler pieces. Sure, the pieces were still hard to build, but used across hundreds of domains implementations those pieces suddenly look cheap, and more importantly, available for use in building the tools for the next domain.
Too bad Draco never went commercial.
DMS Software Reengineering Toolkit
This author happened to be in Neighbor's research group, and caught the DSL bug. The DMS Software Reengineering Toolkit was inspired by Draco, and consequently has as its philsophical core the same concepts as Draco, with several technology twists fundamental to its commercial success:
- Arbitrary structures to represent DSLs (both ASTs and [hyper]graphs) are supported
- Extremely strong (context-free GLR) parsing technology to make it easy to define arbitrary DSLs and generate the ASTs accurately and easily. Once a grammar is defined to DMS, it can parse the DSL.
- Classic compiler flow analysis machinery, generalized to support the needs of arbitrary DSLs, making it possible to reason deeply about a particular DSL instance ("program"). The problem this solves is one of facts-at-a-distance; a piece of information from one part of a large DSL instance may be arbitrarily far away from where that fact is needed process another part of the DSL to provide for good code generation.
- Source pattern matching, and source-to-source transformations. Given the working grammar, DMS can directly manipulate the structures. (For an example of this, see Algebra defined as a DMS Domain).
- A direct confrontation with the problem of scale, through DMS's implementation in a parallel computer language. Any successful computer language, will eventually produce large, complex artifacts, that must be processed in the context of the rest of the complex artifacts that make up the rest of the software system. For this to work, the tool must be able to handle many domains concurrently, reason about the set as a whole, and bring enough computer cycles so that answers are produced in reasonable length of time. The underlying parallel language allows DMS to harness the available manycore computers that are now becoming widespread, to the benefit of the engineer and timely delivery to his end users.
The claim is that DMS is ideal for building DSLs. That claim is put to the test in DMS itself, which is largely bootstrapped using DMS and a set of DSLs that comprise the the core of DMS. That list of DSLs include:
- DMSLexical, a language for describe regular expressions over sequences of Unicode text, including a "subtract" operation over regular expressions
- DMSStringGrammar, which is a remarkably simply and clean EBNF notation for describing context free grammars and certain semantic properties such as associativity and commutativity of operators. Proven on very tough languages such as C++.
- DMSPrettyPrinter, a Latex-like language for assembling boxes of text into larger boxes, used to regenate source text from ASTs
- DMSAttributeGrammar, enabling the specification of computations over values passed around an abstract syntax tree, and compiling down to partial-order parallel programs in PARLANSE
- PARLANSE, a task-parallel programming language with extremely fine grain (fast context switch) tasks, and safe exception handling in the face of parallelism, to enable the construction of very large, complex program analysis tools that DMS is capable of expressing.
That claim is also tested in DMS's daily application to a wide variety of DSLs (both specific and general purpose as outlined above). You can get a sense of the variety of DSLs that DMS can process by looking at the set of off-the-shelf language modules available for DMS.
For a discussion of how DMS's machinery can be concretely used to produce code, see this discussion on Code Generation.