
Details - Scanner, Parser, Generator, and their Data Structures

The following figure shows the components that are part of the #recognize! library, and the data structures which are shared between those components. The arrows in the figure indicate the data flow.

C# scanner, C# parser, and C# generator, and the involved data structures

Consider the following C# source code snippet (which is intentionally badly formatted):

Console.  WriteLine  (
"Hello, World!") ;

The scanner is responsible for lexical analysis. It splits the sequence of input characters into a sequence of tokens:

The token list is the result of the C# scanner.
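To make the scanning step concrete, here is a minimal sketch of a toy scanner that splits the snippet above into tokens. It is not the #recognize! API: the names TokenKind, Token, and ToyScanner are hypothetical, and the sketch assumes only identifiers, simple string literals without escape sequences, and single-character punctuators. It also drops whitespace, which the real scanner may or may not do.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical token kinds for this sketch; not the #recognize! API.
public enum TokenKind { Identifier, StringLiteral, Punctuator }

public sealed class Token
{
    public TokenKind Kind { get; }
    public string Text { get; }
    public Token(TokenKind kind, string text) { Kind = kind; Text = text; }
}

public static class ToyScanner
{
    // Splits the input into identifiers, simple string literals, and
    // single-character punctuators. Whitespace is skipped.
    public static List<Token> Scan(string source)
    {
        var tokens = new List<Token>();
        int i = 0;
        while (i < source.Length)
        {
            char c = source[i];
            if (char.IsWhiteSpace(c)) { i++; continue; }

            int start = i;
            if (char.IsLetter(c) || c == '_')
            {
                while (i < source.Length &&
                       (char.IsLetterOrDigit(source[i]) || source[i] == '_')) i++;
                tokens.Add(new Token(TokenKind.Identifier, source.Substring(start, i - start)));
            }
            else if (c == '"')
            {
                i++; // consume the opening quote
                while (i < source.Length && source[i] != '"') i++;
                if (i < source.Length) i++; // consume the closing quote, if present
                tokens.Add(new Token(TokenKind.StringLiteral, source.Substring(start, i - start)));
            }
            else
            {
                i++;
                tokens.Add(new Token(TokenKind.Punctuator, c.ToString()));
            }
        }
        return tokens;
    }
}

public static class Program
{
    public static void Main()
    {
        string source = "Console.  WriteLine  (\n\"Hello, World!\") ;";
        foreach (Token t in ToyScanner.Scan(source))
            Console.WriteLine($"{t.Kind}: {t.Text}");
    }
}
```

For the example input, this sketch yields seven tokens: two identifiers, one string literal, and four punctuators.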

The parser performs syntactic analysis. The result of this process is a parse tree or, in #recognize! terms, a code model. For the above example, the corresponding code model snippet looks as follows:

The code model is the result of the C# parse process.

As you can see, the code model tree is tightly connected to the tokens of the token list created by the scanner. All relationships are bidirectional, which makes tree navigation easy. For traversal, #recognize! implements the visitor pattern and includes some easy-to-use visitor classes.
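The combination of bidirectional links and visitor-based traversal can be sketched as follows. This is a generic illustration of the pattern, not the #recognize! class hierarchy: Node, InvocationNode, NameNode, LiteralNode, IVisitor, and NameCollector are all hypothetical names chosen for this example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical visitor interface with one overload per node type.
public interface IVisitor
{
    void Visit(InvocationNode node);
    void Visit(NameNode node);
    void Visit(LiteralNode node);
}

public abstract class Node
{
    private readonly List<Node> children = new List<Node>();

    // Bidirectional link: each child knows its parent, each parent its children.
    public Node Parent { get; private set; }
    public IReadOnlyList<Node> Children => children;

    public void Add(Node child)
    {
        child.Parent = this;
        children.Add(child);
    }

    public abstract void Accept(IVisitor visitor);
}

public sealed class InvocationNode : Node
{
    public override void Accept(IVisitor visitor) => visitor.Visit(this);
}

public sealed class NameNode : Node
{
    public string Name { get; }
    public NameNode(string name) { Name = name; }
    public override void Accept(IVisitor visitor) => visitor.Visit(this);
}

public sealed class LiteralNode : Node
{
    public string Text { get; }
    public LiteralNode(string text) { Text = text; }
    public override void Accept(IVisitor visitor) => visitor.Visit(this);
}

// A small visitor that collects every name it encounters in the tree.
public sealed class NameCollector : IVisitor
{
    public List<string> Names { get; } = new List<string>();

    public void Visit(InvocationNode node) => VisitChildren(node);
    public void Visit(NameNode node) { Names.Add(node.Name); VisitChildren(node); }
    public void Visit(LiteralNode node) { /* leaf node: nothing to collect */ }

    private void VisitChildren(Node node)
    {
        foreach (Node child in node.Children) child.Accept(this);
    }
}

public static class Program
{
    public static void Main()
    {
        var call = new InvocationNode();
        call.Add(new NameNode("WriteLine"));
        call.Add(new LiteralNode("\"Hello, World!\""));

        var collector = new NameCollector();
        call.Accept(collector);
        Console.WriteLine(string.Join(", ", collector.Names));
    }
}
```

Because every node dispatches to its own Visit overload, a new analysis only needs a new visitor class; the node classes stay unchanged.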

Most node types in the code model tree correspond directly to a grammar rule in the C# Language Specification. More than 200 different node types exist to cover all aspects of the C# language. Furthermore, your code can alter every property of a code model node, add new nodes to the model, and remove existing nodes from it. This way, you can change a code model created by the parser in any way you like, or even build a completely new code model from scratch.

Modifying a code model only makes sense if C# code can be generated from it again. Of course, this is also a main feature of #recognize!. If you apply the generator to the code model snippet from our example above, the output will look like this:

Console.WriteLine("Hello, World!");

The formatting options for spacing, indentation, and line breaks are configurable to a great extent; the output above is simply what the default settings produce. Under some circumstances, the original formatting (remember the badly formatted C# code at the beginning) can even be preserved during generation.
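The modify-then-generate round trip described above can be illustrated with a deliberately tiny model. This is a sketch under strong assumptions: InvocationStatement and its Generate method are hypothetical and model only a single call statement, whereas the real code model has more than 200 node types and configurable formatting.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical miniature model of a statement such as Console.WriteLine("...");
public sealed class InvocationStatement
{
    public string Target { get; set; }
    public string Method { get; set; }
    public List<string> Arguments { get; } = new List<string>();

    // Emits canonical text, regardless of how the original source was formatted.
    public string Generate()
    {
        return Target + "." + Method + "(" + string.Join(", ", Arguments) + ");";
    }
}

public static class Program
{
    public static void Main()
    {
        var statement = new InvocationStatement { Target = "Console", Method = "WriteLine" };
        statement.Arguments.Add("\"Hello, World!\"");
        Console.WriteLine(statement.Generate()); // Console.WriteLine("Hello, World!");

        // Alter a property and add a node, then generate again.
        statement.Method = "Write";
        statement.Arguments.Add("42");
        Console.WriteLine(statement.Generate()); // Console.Write("Hello, World!", 42);
    }
}
```

The point of the sketch is that the generated text depends only on the model, so any change to the model shows up in the next generated output.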

Last but not least, #recognize! provides sophisticated error and warning reporting for erroneous input. The scanner and parser can raise more than 100 different messages, which are compatible with Microsoft's csc and Mono's mcs compilers regarding message codes, warning levels, and message text. #recognize! currently comes with two localized versions: English and German.
