Life After Parsing™:
Got My Grammar... uh, now what?
It is astonishing how often people think that the key to building a tool to process a computer (or domain-specific) language is to get a parser.
Its true... in the same sense that playing poker is all about putting the ante into the pot. If you believe this, the other poker players are going to eat you alive. Yes, you have to ante. No, that's not where the game is played.
Only the most trivial computer languages are easy to analyze and transform. The compiler community has been working on this problem for 50 years, and one ignores what they have learned at one's peril. They tell us following technologies are needed to realistically process computer language(s):
- Lexing and Parsing, ... of course. But there are some really nasty problems here, including preprocessors.
- Representation Capture: A way to represent and capture program structure
- Symbol Tables: A mapping between identifiers used in the program, and their meaning (and scope).
- Inference Methods: A way to understand the meaning and consequences of the written code:
- Simple Fact Collection: Inferences that are easy to extract from the program structure.
- Control Flow Analysis: Data about the order in which activities occur in a program.
- Data Flow Analysis: Data about how information flows from one part of the program to another.
- Symbolic Reasoning: Method to compute complex consequences of semantics and various information flows.
- Pattern Matching and Transformation Capability: Means to find program points that look like places of interest, as well as means to convert one program representation instance into others, to provide optimizations or translation to another representation
- Pretty-Printing: If the representation is modified, one needs to be able to regenerate valid source code from it.
- Multi-module and multi-lingual Integration: Inclusion of other programs in the same or another notation to allow one to process systems of code.
Parsing is hard ... but just a distraction
Tools like YACC, JavaCC, ANTLR, and variants are all about parsing and a little bit about AST building. It is a big effort to choose the "right" (lexer and) parser generator, find/create a grammar, wrestle into a form acceptable by a parser generator (lookaheads? ambiguities? context sensitivities?), deciding on how to build a tree, ... Then there's the often surprisingly big struggle to make the parser match what the compilers really accept as opposed to what everbody knows to be true without any further thought (Hey, C source code doesn't have any complications, its a simple language, right? Can't be that hard...oops, what's a trigraph?). Putting together a working parser is enough work so that people tend to focus on just getting that right, paying zero attention to what is needed after that. And that makes them lose sight of the full problem.
But success are parsing and tree building is such a small part of the problem that a victory really means very little. Climbing 11,000 feet to the foothills of Everest requires only basic technology: some camping gear and good hiking boots. Any determined clod can get this equipment and climb the foothills. To climb to the peak it is clear that one needs a completely different set of very sophisticated technology (oxygen tanks are pretty different than hiking boots) and a much different climbing process to finish the climb to Everest's peak. Those that attempt the remaining climb with just the basic equipment simply die. Those that have just a parser never really get a useful tool.
Our DMS Software Reengineering Toolkit is that extra technology. Semantic Designs has spent over 17 years building that technology, and proving it on real languages such as COBOL, Java and C++, so that would-be tool builders don't have to reinvent the infrastructure, and can get on with building the tool they really wanted. Along the way, we found that the parsing process could be significantly enhanced.
A Higher Climb
Let us consider these additional mechanisms further. First, what drives the need for this? Our programming languages, like our computers, used to be smaller and simpler: FORTRAN66, COBOL, RPG, BASIC, K&R C, ran on small machines, and were all written in ASCII text in a small number of files. Lots of complications have occured since then:
- Big, complex languges: C99, Enterprise COBOL, Ada2005, Java7, C#4, C++11, Fortran2005, HTML5 each in a variety of dialects... even C isn't small (check out GCC and Microsoft's variants), and Perl is an amazing mess. Language committees can't resist adding new goodies; there tends to be a "me too" race (Java and C# seem to co-evolve furiously). And companies providing compilers have a vested interest in adding features to create customer lock-ins.
- Values in a wide variety of forms and precision: IEEE Floating point, decimal numbers, infinite precision rationals, ...
- Preprocessors, with include files and macros.
- Locale-specific character sets, Unicode in several flavors, EBCDIC
- Embeddings of language fragments into other languages (JavaScript and PHP in HTML, SQL in COBOL, ...
- Generics, reflection, and metaprogramming are used more frequently to consruct software systems
Lets examine the necessary tool-supporting mechanism in a bit more detail. What are they, and why are they needed? (This discussion parallels the opening):
- Lexing and Parsing (still necessary!)
- Lexical Capture: It is classic to perform
lexical
analysis of the source code, e.g., to break up the stream of
input characters into tokens that represent the conceptual
elements of a program: identifiers, numbers, keywords, comments,
strings, even whitespace. That idea still generally holds. It is
also classic to build a hand-coded lexer or use Flex to accomplish
this, and process the ASCII characters in files that made up all the
languages we knew and loved, and in fact still dominate Western
programming. But for highly evolved legacy languages, and modern
languages, you need to process possibly many character representations
(Katakana in comments in C, anyone?) including Unicode, convert
numbers to floating point without losing precision, capture numbers
with possibly arbitrarily large precision, handle keywords that might
be identifiers (PL/1 is famous: every keyword can be an identifier!),
etc.
DMS provides libraries and tools to handle all of these issues. The DMS lexer handles full Unicode, has support for opening nested streams (to handle include files), etc, and can easily build multi-mode lexers to handle the different lexing needs in the possibly different parts of the language. DMS's lexers also capture comments, which are extremely useful in program analysis and fundamental for program transformation; users will reject perfectly transformed code that has lost their comments. Similarly, DMS will capture lexical details, such as radix of numbers and string quote style; programmers reject modified code in which these are wrong, too. We don't know of other production lexer generators that have any or even most of these properties out of the box. (If your foundations aren't right, how are you going to construct a large building?) - Parsing: It is surprising that the choice of parser is often
made without regard to the language being parsed, often some parser
generator whose major property is that is convenient to download (e.g, Bison, ANTLR, ...).
If your language requires a lot of lookahead, an LL(1) parser is going
to be a very painful choice; if your language has ambiguities,
a parser generator that can only produce one parse will be trouble
(consider parsing C++). But when parsing real languages,
it is hard to know what issues the parser faces until you are done
building the parser, because you don't know what the grammar will be
until it can basically parse everything. And the manuals lie about what is legal. What this
suggests is you should simply use a very strong parsing technology to
avoid as much of this pain as you can.
DMS uses a GLR parser, which means it handles arbitrary lookahead and ambiguous grammars with aplomb. Arbitrary semantic reduction conditions allow DMS parsers to eliminate many ambiguous choices while parsing. You write a grammar; DMS's GLR parser can parse it. So in general parsing is hard; with DMS, parsing is a lot easier.
DMS parsers can be configured to collect preprocessor definitions and conditionals without expanding them. This makes it possible to reason about the code as seen by the programmer, with all the preprocessor directives in place. No other tool known to us can do this.
- Lexical Capture: It is classic to perform
lexical
analysis of the source code, e.g., to break up the stream of
input characters into tokens that represent the conceptual
elements of a program: identifiers, numbers, keywords, comments,
strings, even whitespace. That idea still generally holds. It is
also classic to build a hand-coded lexer or use Flex to accomplish
this, and process the ASCII characters in files that made up all the
languages we knew and loved, and in fact still dominate Western
programming. But for highly evolved legacy languages, and modern
languages, you need to process possibly many character representations
(Katakana in comments in C, anyone?) including Unicode, convert
numbers to floating point without losing precision, capture numbers
with possibly arbitrarily large precision, handle keywords that might
be identifiers (PL/1 is famous: every keyword can be an identifier!),
etc.
- Representation Capture: Real tools require more than just "parsing"; the structure of the program must be captured to enable further processing. Usually an Abstract Syntax Tree (AST) is needed. Typical parser generators force the grammar engineer to specify not only the grammar, but also to explicitly specify how to procedurally build the AST as well. This essentially (more than) doubles the work, and it makes maintaining the grammar hard because any grammar changes require the AST building code as well. In contrast, DMS automatically defines a builds a syntax tree, either a concrete tree (mainly for debugging, containing all the language tokens), or what amounts to an AST (for production, in which non-value carrying terminals are eliminated, useless unary productions are removed, and lists are formed). Changes to the grammar automatically cause the C/AST to change correspondingly. And the tree is always correct. It includes precise source line information (file/line/column) as well as comments passed to it by the lexer, attached to appropriate tree nodes, and it can include preprocessor directives.
- Symbol Tables: A mapping between identifiers used in the
program, and their meaning (and scope). It is only the grammar which
is context-free in most programming languages; the meaning of
identifiers is usually context ("scope") and semantics
dependent. Generally
to analyze, reason about, or transform programs one must know each
identifier instance represents, where and how that identifier is
defined.
This is traditionally done with a
symbol table,
that maps code regions to sets of identifiers and thier definitions.
For real languages a symbol table is often relatively complex because
different code regions implicitly
designate different sets of identifiers (e.g., local scopes, lexical
scopes, namespaces, parameter names, object slots, etc.). So the
symbol table must model the relationship of the various "scopes" to
enable symbols to be correctly found. Such a symbol table
has to be constructed from the program representation; the grammar
gives no clue about this.
DMS provides a symbol table structures with local mappings ("symbol spaces") of identifiers to definitions to model scopes. These provide multiple inheritance support, and are have proven strong enough to handle C++ and the many other languages that DMS can presently process. DMS has a library of routines for symbol-definition insert and fast hashed symbol lookup with overload resolution that can automatically navigate the symbol "spaces" and find the proper definition. DMS provides means to help build symbol tables (see "attribute grammars", below). No parser generator system known to us provides symbol table structure support or help in building symbol tables; the language engineer typically has to write all of this machinery herself. - Inference Methods: A way to understand the meaning and consequences of the written code, e.g., program analysis. Every program analysis or modification task starts with understanding the code, which requires collections of various kinds of facts about the code structures and how code elements interact. We call such determination of facts "inference". There are inferences that are easy to extract from the program structure, e.g., the AST. One can programmatically traverse an AST with recursive function calls, and almost any parser generator will provide programmatic access to the nodes in an AST, but this is inconvenient for a large or complex grammar. DMS of course provides such programmatic access, but more importantly provides a variety of other inference mechanisms. (Parser generators never provide any capability for this; the language engineer has to build it all by herself):
- Simple Fact Collection: One way to collect information is attribute grammars, which enable easy computation of arbitrary properties of syntax instances as annotations on the grammar rules themselves (a domain-specific language). This makes it very easy to collect complexity metrics. A more interesting example is collection of symbol table data. Since scopes for identifier names tend to follow the program structure, and this structure is defined by the grammar, there are almost always grammar rules which define lexical scopes. So attribute grammars are a nearly ideal way to collect the symbol table information. Coupled with the symbol-insertion capability of the symbol table library, attribute grammars make it relatively easy to build symbol tables from parse trees.
- Control Flow Analysis: To reason about most (especially procedural) languages, one needs to know the order and under what conditions actions occur. This requires control flow analysis. DMS implements this using a language-engineer specified attribute grammar analyzer that understands the control flow semantics of the language syntax, and a library that constructs a control flow graph graph from information collected by that analyzer. The resulting control flow information encodes a "flowchart" of the code in a function or method, including blocks of actions, conditional branches and exception branches, operations that occur in unspecified ("parallel"/fork-join) order, and provides information such as dominators and control dependences. Additional support allows one to structure this control flow graph into regions, even if the original program is full of unstructured blocks (e.g., using uncontrolled "goto"). DMS also provides tools to build call graphs (even in the face of indirect pointers, as found in C, and implicitly in OO programs as "objects", which are really just pointers), using data flow anlaysis and points-to information, see below).
- Data Flow Analysis: How information flows through the program is fundamental to understanding. What is needed is knowledge about how data flows from one point in the code to another. DMS provides a variety of support for computing data flow information, using built-in iterative solvers operating over the (possibly parallel) control flow graph, operating on arbitrary domains (including bit vectors), using identifier access and update information as determined by another language-engineer specified attribute grammar. Using these solvers, DMS can determine reaching and use-def chains, and consequently Points-to analysis. DMS has been used to compute global points to analysis for very large code bases.
- Symbolic Reasoning: More sophisticated analyzers are useful to reason about code, including method to compute complex consequences of semantics and various information flows. At present, DMS provides abstract interpretation to compute symbolic range information for values of variables in terms of other program variables.
- Pattern Matching and Transformation Capability:It is useful to be able to match arbitrary program fragments, or to generate arbitrary program fragments. DMS does with with surface-sytax patterns, in which one writes code fragements with placeholders. DMS can match such patterns against ASTs it has parsed, binding placeholders to corresponding subtrees, and/or it can generate ASTs that instantiate the patterns with pre-supplied subtrees, allowing one to build arbitrarily complicated trees. One can also write optimizations or transforms on code, using rewrite rules consisting of pattern pairs, one for matching and one for replacement when a match is found. Since rewrite rules can have one language in the match-pattern, and another in the replace-pattern, one can write language-to-language transformations.
- Pretty-Printing: If code transformations are applied to improve the code, one needs to be able to regenerate valid source code from the modified representations. DMS provides this ability in the form of prettyprinting rules, which specify how the AST should be regenerated in terms of its original layout ("fidelity printing") or in terms of nested, indented text boxes ("pretty printing"). These capabilities can be used to show smaller subtrees, also. The prettyprinter regenerates accurate numerical (especially floating point or infinite precision) values, lexical formatting information, (string quotes, radix, character escape codes, etc.), as well as comments, preprocessor macro declarations, includes, ifdefs and macro calls.
- Multi-module and multi-lingual Integration: Real sofware systems are not coded entirely in one language (Java? What, no SQL or HTML?). As a practical matter to process large systems of code, a good tool must be able to parse all the sublanguages and tie information together across the elements. DMS can host multiple language front ends/symbol tables/flow analyzers/prettyprinters simultaneously.
Bottom line: A parser simply isn't enough to do serious work. DMS is designed to be enough.
Bonus: Pretested Language FrontEnds
DMS has many predefined language front ends which have been developed and honed over the last decade. These front ends include grammars, prettyprinters, and to varying degrees, symbol tables, flow analysis, etc. For real languages such as COBOL, Java, C# and C++, you are much better off getting one (or more) of SD's tested language front ends than trying to define the grammar let alone the full front end yourself, or attempting to bend another that hasn't been used for serious tasks.
Double Bonus: Faster answers through Parallel foundations
Programs are getting bigger (millions of lines of code), and people (programmers and their managers) want answers fast. One way to get faster answers is to use parallelism.
All of the DMS machinery is implemented in a parallel programming language, PARLANSE which is designed to handle multicore irregular ("task") parallel computation. The DMS implementation provides for thread-safe access and update to all the structures to enable multiple analyses and transformations to occur. This allows DMS to scalably process very large systems of code, using the multiple CPUs available in virtually every desktop workstation that programmers have. We have not seen any other parser generator, program analysis, or program transformation tool that provides this capability.
It should be obvious that defining a full set of tools for processing languages is difficult. Defining such a set with parallel foundations is something a language engineer would not consider. But it is available built into DMS.
