grammar describes the specification grammar and the parse tree and other data structures
%grammar definition ...
A definition is
name : body
A body is one of:
( item ... [ %prec tok-name ]
[ | item ... [ %prec tok-name ] ] ... )
lex-token [ lex-action { C-code } ]
balance lex-token lex-token lex-token
name
CTYPE [ = C-expression ; ]
list-structure body
An item is one of:
definition
name
lex-token [ lex-action { C-code } ]
balance lex-token lex-token lex-token
name +|* [ separated-by item ]
[ item ... ]
The grammar section of a source description file describes the syntax of the specification files (also called the source or input) read by an SDTool. Simultaneously, the grammar section defines the parse tree data structures, and other auxiliary structures.
The grammar section consists of a sequence of definitions.
There are six types of definitions; in the order in which
they are listed in the SYNOPSIS section, they are:
grammar rules
lex-tokens
balanced lex-tokens
renamings
attributes
list-structures
A grammar rule consists of a sequence of one or more
alternatives; each alternative is made up of a sequence of
one or more items. There are six kinds of items; in the
order in which they are listed in the SYNOPSIS section, they
are:
definitions
names
lex-tokens
balanced lex-tokens
sequence items
optional items
The names must be defined elsewhere; forward references are
allowed. Sequence items consist of a name followed by an +
or *, optionally followed by a separated-by clause.
Optional items consist of a sequence of one or more items
within square brackets.
Names
Names are made up of lower case letters, digits and/or the underscore character. The first character of a name must be a letter, and there must be at least one more letter in the name. Every name, with the exception of the predefined name error, must be defined exactly once in the grammar section. Names do not need to be defined before they are used. There are many reserved words that cannot be used as names (see RESERVED(1) ).
Definitions
Except for the first definition, which must be a grammar rule definition that defines the start symbol of the specification grammar, definitions may occur in any order. In addition, definitions may be listed sequentially or they may be nested within grammar rule definitions. Each definition causes ivy*meta SDTB to create identifiers that are associated with the parse tree that the SDTool will build from the specification. These identifiers provide the C interface that gives access to the parse trees and other data structures.
In general, each definition causes the creation of three new identifiers, one for a new C data type, one for a new selector (also called a navigation) macro, and one for a type index (also called a discriminant) value. These identifiers are all declared and defined in the generated global.h header file, making them all accessible from within the C escape portions of an SDTool description (e.g., the middlecode section), and from all custom C code files that include global.h (see FILES(2) ).
For example, the lex-token definition:
iden : <[a-z]+>
causes three identifiers to be created:
IDEN The C type for the parse tree nodes that will
hold the iden tokens.
iden The selector macro that will select a parse tree node of type IDEN from its parent node in a parse tree or list-structure.
Iden The type index value that identifies any node of type IDEN (accessible via the type macro).
Every new definition (except renaming and attribute definitions) creates a new C type for the parse tree or list-structure nodes that correspond to the definition. The names of the C types are taken from the grammar name by converting all letters to upper case. The C types created from grammar rule definitions are new structure types. The C types created for tokens are aliases for the predefined TOKEN type (see TOKEN(5) ).
Every new definition also creates a selector macro that may be used to select the corresponding component from a higher level node (see example below). The selector macros may be used as both l-values (on the left side of assignment operators) or as r-values (on the right side of assignment operators). The names of the selector macros are exactly the same as the grammar names.
Finally, every definition creates a type index value; type index values are used to identify individual parse tree and list-structure nodes, and to discriminate among grammar rule alternatives that appear in parse trees. The names of the type index values are taken from the grammar names by converting the first character, which must be a lower case letter, to upper case.
Consider the following grammar rule definition: expr: ( id | left:expr oper right:expr )
This grammar rule definition contains two nested renaming
definitions for the names left and right. It also contains
two names, id and oper, that must be defined elsewhere. The
rule also contains two applied occurrences of the name expr.
This rule causes the creation of a new C data type, EXPR,
which is used to hold parse tree nodes for the nonterminal
expr. Each node of type EXPR will have the type index value
Expr. The two nested renaming definitions cause the
creation of the selector macros left and right; the
definitions of id and oper (not shown) cause the creation of
the selector macros id and oper. These four selector macros
can be applied to nodes of type EXPR to access the
corresponding nodes below the EXPR node in a parse tree.
Note that the two renaming definitions provide unique
selector macros for the two sub-EXPR nodes, but do not cause
the creation of new types.
Grammar Rules
A grammar rule defines all the alternatives for a nonterminal in the specification grammar. The body of a grammar rule consists of a sequence of one or more alternatives, and each alternative is a sequence of one or more items. If there is more than one alternative, they are separated by vertical bar characters. The entire body is enclosed by parentheses.
Each alternative defines a production rule of a context-free grammar. ivy*meta SDTB translates grammar rules in a straightforward manner into yacc rules; any restrictions on yacc grammars also apply to ivy*meta SDTB rules. The class of grammars accepted by yacc are LALR(1) grammars. The yacc section of the source description file can be used to specify token precedence and associativity (see YACC(2) ), and the prec clause can be used in grammar rule definitions to set the precedence of an alternative (see the yacc documentation for complete details about rule and token precedence and about token associativity).
There are two restrictions that apply to the names that
occur in grammar rules:
The names of all named items within an individual grammar
rule alternative must be unique. This ensures that each
named component of a nonterminal parse tree node can be
selected by a unique selector macro.
The names of the first named items in every alternative
within an individual grammar rule must be unique. This
ensures that the alternatives that appear below
nonterminal nodes can be discriminated.
Each grammar rule definition creates a new C type for the parse tree nodes of the nonterminal defined by the rule. Logically, the new C type can be considered to be a union of records, where each component of the union corresponds to one of the alternatives. Each parse tree node of the new C type contains a discriminant value, accessible via the tag macro, that specifies which grammar rule alternative a node represents. This discriminant value is the type index value of the first named item in the alternative. Consider the rule:
expr: ( num:<[0-9]+>
| left:expr op:<[-+*/]> right:expr
| «%» digit:<[0-9]>
)
This grammar rule, which has three alternatives, defines the new C type EXPR. If x is any C expression of type EXPR, then the C expression tag(x) returns the discriminant value for the alternative below x in the parse tree. The value returned by tag(x) will be either Num, Left or Digit.
Each named item in a grammar rule can be selected directly by using its selector macro. For example, the C expression right(x) selects the right sub-expression node in the second alternative. This expression only makes sense if tag(x) returns Left.
Sequence items and optional items in parse trees are
accessed in the same way as any other kind of item. When
accessing a sequence item, the first node in the list is
returned (if the sequence is empty then NULL is returned);
other nodes in the sequence can be accessed via the the next
function (see LIST(5)
and TRAVERSE(5)
). Consider the rule:
cmd: ( id arg* [ filename ] )
This grammar rule defines the new C type CMD. If x is a C
expression of this type, then the macros id, arg and
filename can all be applied to x. The value of id(x) will
be the ID node below x in the parse tree. The value of
arg(x) will be the first ARG node below x in the parse tree;
if the sequence is empty then arg(x) will return NULL.
Subsequent ARG nodes can be obtained with the next function;
for example, next(arg(x)) returns the second ARG in the
sequence. The value of filename(x) will be the FILENAME
node below x in the parse tree; if there is no such node
then filename(x) will return NULL.
Lex-tokens
A lex-token defines the structure of a token that will be
read from a specification. The language used to define
lex-tokens is the same as that used by the lex program.
Only two types of lex-tokens, those that are delimited by
quotes, and those that are delimited by curly brackets, can
be used directly in a grammar section. All other lex-tokens
must be delimited with angle brackets, < and >, and all
embedded occurrences of > must be escaped with a backslash
character. Some example lex-tokens are:
For more details on the lex-token C data structure see TOKEN(5) . Also see LEX(2) for information on using lex definitions.
Consider the rule:
func: (name:<[A-Z]+> «(» arg:<[a-z]> «)» )
This grammar rule contains four lex-tokens, two as the
bodies of lex-token definitions and two as literal, lextoken
items. The two definitions cause the C data types
NAME and ARG to be created. Every parse tree node of type
NAME will have the type index value of Name, and every parse
tree node of type ARG will have the type index value of Arg.
The grammar rule itself creates the C type FUNC. The two
lex-token definitions also cause the creation of two new
selector macros, name and arg, that may be applied to any
node of type FUNC to select the NAME and ARG nodes below it
in a parse tree.
Unlike other items, the order in which lex-tokens appear in
a grammar section can be significant. This is because
ivy*meta SDTB uses the lex-tokens to generate a lex file,
and the order of the lex-tokens in the lex file is used by
lex in disambiguating overlapping tokens. For example,
consider the following grammar rule:
word: ( beg:<begin|BEGIN> | id:<[a-z]+> )
In this example, the word nonterminal can either be an id
token, defined as a sequence of lower case letters, or one
of two keywords begin and BEGIN. An ambiguity exists here,
because the input ``begin'' matches both lex-tokens. This
ambiguity is resolved by lex because lex chooses the lextoken
that matches the longest string in the specification,
and if there are equal length matches, lex chooses the lextoken
that was defined first. In the above example, the
input ``begin'' is recognized as a beg keyword.
There is one predefined lex-token named error. This token may be used to implement syntax error recovery (see the yacc documentation).
Lex-actions
A lex-action may be associated with any lex-token. A lexaction
immediately follows a lex-token, and consists of the
keyword lex-action followed by C code enclosed in curly
brackets. The lex-actions are inserted into the generated
lex file; they are executed directly after the parse tree
node for the token is created. For example:
declared_id:<[a-zA-Z]+>
lex-action { if(!lookup(yytext))
return T_UNDECLARED_ID; }
undeclared_id:<[A-Za-z]+>
In this example, each input string that matches the
declared_id lex-token is checked in a symbol table; if it is
not there then the generated lexer returns an undeclared_id
rather than a declared_id. Note the use of the yacc-lex
interface name T_UNDECLARED_ID. The yacc-lex interface name
is the identifier that is used by the generated lexer to
return a token code to the generated parser. The name is
formed by prepending T_ to the name of the token in all
upper case.
Follow Graphs
Before ivy*meta SDTB generates the lex.l file, it first builds the token follow graph of the grammar, a graph that specifies which tokens can follow which other tokens in a valid specification. This graph allows ivy*meta SDTB to remove many of the token ambiguities caused by overlapping tokens. The follow graph is used to compute lex start conditions that are automatically included in the generated lex file; the lexer will only match tokens that can validly follow the previously matched token. This default behavior can be overridden by using the red section (see RED(2) ).
Token Merging
Token merging is an action that takes place while ivy*meta SDTB is building the token follow graph. Initially every token definition and literal token in the grammar section is treated as a separate token. If ivy*meta discovers that two lex-tokens with identical structures can follow the same token, then the two lex-tokens are merged, simplifying the follow graph. Token merging reduces the number of overlapping token ambiguities.
Balanced Lex-tokens
A balanced lex-token defines a token that is delimited by,
and may contain, balanced delimiter characters. Balanced
tokens are typically used to recognize, as tokens, strings
of characters, possibly quite long, that are too complex to
recognize as simple tokens. For example, the following
balanced lex-token will match parenthesized expressions that
may contain nested, parenthesized subexpressions:
balance «(» <[^()]> «)"
The first lex-token following the balance keyword specifies
the opening delimiter, the second lex-token specifies the
non-delimiter part of the token, and the last lex-token
specifies the closing delimiter. The second lex-token must
exclude the opening and closing delimiters.
As a more complex balanced lex-token, consider the problem
of describing a sequence of C statements within curly
brackets as a token. In this example curly brackets may be
deeply nested, and any curly brackets that occur in
comments, quoted strings, or character constants should not
be treated as delimiters. The following balanced lex-token
solves the problem:
balance «{» <({comment}|{cquote}|{squote}|{other})> «}"
where the lex-definitions would be provided in a lexdefinitions
section:
%lex-definitions
comment \/\*([^*]|\*+[^*/])*\*+\/
cquote \'([^'\\]|\\(.|\n))*\'
squote \"([^"\\]|\\(.|\n))*\"
other [^{}"']
A simpler implementation of a C escape token, one that doesn't handle curly brackets embedded in quotes or comments, is given in AGPUTS(5) .
Each balanced lex-token definition creates a new C type, a new selector macro, and a new type index value, exactly as normal lex-token definitions do.
Renamings
A renaming definition has the form:
name: name
Renaming definitions cause new selector macros and new type
index values to be created. Renaming definitions provide a
simple means for adhering to the two restrictions that apply
to named items in grammar rules (see the Grammar Rule
section for an example).
Attributes
An attribute definition has the form:
name : CTYPE [ = C-expression ; ]
An attribute does not describe a grammatical component of a
specification language; that is, an attribute plays no
syntactic role. When an attribute definition or attribute
name is used as an item in a grammar rule, an extra
component is added to the C type created for the rule. The
new component may be used to store any type of auxiliary
data. Attributes are commonly used to implement attribute
grammars. In the definition, name is the name of the
attribute; CTYPE must be in all upper case and is the C type
of the attribute's value; and C-expression is an optional C
expression that may be used to initialize attribute values
as the parse tree is being built. Because the CTYPE must be
in upper case, ivy*meta SDTB automatically provides INT and
STRING as pre-defined C types (also included are BOOLEAN,
CHAR, FLOAT, DOUBLE, SHORT, LONG and UNSIGNED). Any of the
generated C node types can also be used as an attribute
CTYPE; additional CTYPE definitions may be placed in a
header section that precedes the grammar section.
Each attribute definition causes a new selector macro to be created that accesses the attribute value. The selector macro can be used as an l-value (on the left side of an assignment operator) or as an r-value (on the right side of an assignment).
Because attributes play no syntactic role they are not
members of any specific grammar rule alternative (but see
the NOTES section). For example, the val attribute in the
following grammar rule:
expr: (num:<[0-9]+>
| id:<[a-z]+>
| left:expr op:<[-+*/]> expr
val:INT
)
is accessible (via the val macro) in every parse tree node
of type EXPR, regardless of alternative.
In the following example, all identifiers in the input are
given a unique number.
id: ( token:<[a-zA-Z]+> id_no:INT = ++count; )
An attribute name may not be renamed.
Sequence Items
When used as an item in a grammar rule, a name may be followed by a +, which means a list of one or more specification objects, or it can be followed by an *, which means a list of zero or more objects. If those text objects are separated by commas, semicolons, or any other items, this may be specified with a separated-by clause. See LIST(5) for further details on lists. See the Grammar Rule section for an example.
Optional Items
Optional items are the items that appear within square brackets in grammar rules. They represent text objects that may or may not appear in a specification. Optional items that do not appear in the specification are represented by NULL. See the Grammar Rule section for an example.
List-structures
A list-structure is a C type that describes nodes that make
up lists. A list-structure has no relationship to the
specification grammar (although components of a liststructure
node may refer to parse tree nodes). Such lists
may be used to store any type of auxiliary data. A typical
list-structure definition has the form:
name: list-structure ( item... )
The items in the body are generally attribute names or
definitions, or names of nonterminals or tokens. From this
definition ivy*meta SDTB creates a new C type. Each node of
this new type has a component for each of the items in the
definition, and the values of those components are
accessible through selector macros. In addition, the runtime
support that is provided for lists of parse tree nodes
is also available for list-structures (see LIST(5)
).
The body may also be structured as a grammar rule with alternatives, in which case the list-structure represents a union.
List-structure nodes must be created specifically by using the alloc function (see ALLOC(5) ). (Parse tree nodes are created automatically by the SDTool's parser.)
The following example shows a ivy*meta SDTB grammar section
that describes the syntax of the ivy*meta SDTB grammar
section:
%grammar
gram: ( <^%grammar> defn+ )
defn: ( name «:» [ «list-structure» ] item )
alt: ( item+ [ «%prec» tnam ] )
name: <[a-z][a-z0-9_]*>
ltok: <\"([^"\\]|"\\".)*\"|"{"[^}]*"}"|"<"([^\>\\]|"\\".)*\>>
tnam: <T_[A-Z0-9]+|'.'|[a-z0-9]+>
A renaming definition does in fact create a new C type that is an alias for the type of the name being renamed. However, no parse tree node will ever be created with this type (that is, no parse tree node will ever be assigned the type index value that is created by the renaming). Using such types is error prone, does not lead to expected behavior, and should be avoided.
An attribute definition does in fact create a new C type that is an alias for the C type of its values. This type name may be used with impunity.
An attribute definition creates a field in a nonterminal node type that may be accessed in every parse tree node of the type, regardless of the rule alternative below the node. However, if an initialization expression is included in the attribute definition, the expression will be executed only when the alternative that includes the attribute is recognized.
A grammar rule may in fact be used as an item in a grammar rule. This causes ivy*meta SDTB to create an anonymous nonterminal, but no navigation macro can be provided that accesses the corresponding anonymous part of a parse tree (although the various traversal mechanisms will traverse the anonymous part unimpeded). Although allowed, there seems little justification in using this feature.
FILES(2)
HEADER(2)
MIDDLECODE(2)
RED(2)
YACC(2)
LEX(2)
ALLOC(5)
Lex A Lexical Analyzer Generator, by M.E. Lesk and E. Schmidt Yacc Yet Another Compiler-Compiler, by S.C. Johnson
Long tokens can cause buffer overflow (to increase limits see LEX(2) ).