GRAMMAR(2)

Name
Synopsis
Description
Example
Notes
See Also
Documentation
Warning

Name

grammar describes the specification grammar and the parse tree and other data structures

Synopsis

%grammar definition ...
A definition is
name : body

A body is one of:
( item ... [ %prec tok-name ]
[ | item ... [ %prec tok-name ] ] ... ) lex-token [ lex-action { C-code } ] balance lex-token lex-token lex-token name
CTYPE [ = C-expression ; ]
list-structure body

An item is one of:
definition
name
lex-token [ lex-action { C-code } ] balance lex-token lex-token lex-token name +|* [ separated-by item ]
[ item ... ]

Description

The grammar section of a source description file describes the syntax of the specification files (also called the source or input) read by an SDTool. Simultaneously, the grammar section defines the parse tree data structures, and other auxiliary structures.

The grammar section consists of a sequence of definitions. There are six types of definitions; in the order in which they are listed in the SYNOPSIS section, they are: grammar rules
lex-tokens
balanced lex-tokens
renamings
attributes
list-structures

A grammar rule consists of a sequence of one or more alternatives; each alternative is made up of a sequence of one or more items. There are six kinds of items; in the order in which they are listed in the SYNOPSIS section, they are:
definitions
names lex-tokens
balanced lex-tokens
sequence items
optional items
The names must be defined elsewhere; forward references are allowed. Sequence items consist of a name followed by an + or *, optionally followed by a separated-by clause. Optional items consist of a sequence of one or more items within square brackets.

Names

Names are made up of lower case letters, digits and/or the underscore character. The first character of a name must be a letter, and there must be at least one more letter in the name. Every name, with the exception of the predefined name error, must be defined exactly once in the grammar section. Names do not need to be defined before they are used. There are many reserved words that cannot be used as names (see RESERVED(1) ).

Definitions

Except for the first definition, which must be a grammar rule definition that defines the start symbol of the specification grammar, definitions may occur in any order. In addition, definitions may be listed sequentially or they may be nested within grammar rule definitions. Each definition causes ivy*meta SDTB to create identifiers that are associated with the parse tree that the SDTool will build from the specification. These identifiers provide the C interface that gives access to the parse trees and other data structures.

In general, each definition causes the creation of three new identifiers, one for a new C data type, one for a new selector (also called a navigation) macro, and one for a type index (also called a discriminant) value. These identifiers are all declared and defined in the generated global.h header file, making them all accessible from within the C escape portions of an SDTool description (e.g., the middlecode section), and from all custom C code files that include global.h (see FILES(2) ).

For example, the lex-token definition:
iden : <[a-z]+>
causes three identifiers to be created:

IDEN The C type for the parse tree nodes that will hold the iden tokens.

iden The selector macro that will select a parse tree node of type IDEN from its parent node in a parse tree or list-structure.

Iden The type index value that identifies any node of type IDEN (accessible via the type macro).

Every new definition (except renaming and attribute definitions) creates a new C type for the parse tree or list-structure nodes that correspond to the definition. The names of the C types are taken from the grammar name by converting all letters to upper case. The C types created from grammar rule definitions are new structure types. The C types created for tokens are aliases for the predefined TOKEN type (see TOKEN(5) ).

Every new definition also creates a selector macro that may be used to select the corresponding component from a higher level node (see example below). The selector macros may be used as both l-values (on the left side of assignment operators) or as r-values (on the right side of assignment operators). The names of the selector macros are exactly the same as the grammar names.

Finally, every definition creates a type index value; type index values are used to identify individual parse tree and list-structure nodes, and to discriminate among grammar rule alternatives that appear in parse trees. The names of the type index values are taken from the grammar names by converting the first character, which must be a lower case letter, to upper case.

Consider the following grammar rule definition: expr: ( id | left:expr oper right:expr )

This grammar rule definition contains two nested renaming definitions for the names left and right. It also contains two names, id and oper, that must be defined elsewhere. The rule also contains two applied occurrences of the name expr. This rule causes the creation of a new C data type, EXPR, which is used to hold parse tree nodes for the nonterminal expr. Each node of type EXPR will have the type index value Expr. The two nested renaming definitions cause the creation of the selector macros left and right; the definitions of id and oper (not shown) cause the creation of the selector macros id and oper. These four selector macros can be applied to nodes of type EXPR to access the corresponding nodes below the EXPR node in a parse tree. Note that the two renaming definitions provide unique selector macros for the two sub-EXPR nodes, but do not cause the creation of new types.

Grammar Rules

A grammar rule defines all the alternatives for a nonterminal in the specification grammar. The body of a grammar rule consists of a sequence of one or more alternatives, and each alternative is a sequence of one or more items. If there is more than one alternative, they are separated by vertical bar characters. The entire body is enclosed by parentheses.

Each alternative defines a production rule of a context-free grammar. ivy*meta SDTB translates grammar rules in a straightforward manner into yacc rules; any restrictions on yacc grammars also apply to ivy*meta SDTB rules. The class of grammars accepted by yacc are LALR(1) grammars. The yacc section of the source description file can be used to specify token precedence and associativity (see YACC(2) ), and the prec clause can be used in grammar rule definitions to set the precedence of an alternative (see the yacc documentation for complete details about rule and token precedence and about token associativity).

There are two restrictions that apply to the names that occur in grammar rules:
The names of all named items within an individual grammar rule alternative must be unique. This ensures that each named component of a nonterminal parse tree node can be selected by a unique selector macro.
The names of the first named items in every alternative within an individual grammar rule must be unique. This ensures that the alternatives that appear below nonterminal nodes can be discriminated.

Each grammar rule definition creates a new C type for the parse tree nodes of the nonterminal defined by the rule. Logically, the new C type can be considered to be a union of records, where each component of the union corresponds to one of the alternatives. Each parse tree node of the new C type contains a discriminant value, accessible via the tag macro, that specifies which grammar rule alternative a node represents. This discriminant value is the type index value of the first named item in the alternative. Consider the rule:

expr: ( num:<[0-9]+>
| left:expr op:<[-+*/]> right:expr | «%» digit:<[0-9]>
)

This grammar rule, which has three alternatives, defines the new C type EXPR. If x is any C expression of type EXPR, then the C expression tag(x) returns the discriminant value for the alternative below x in the parse tree. The value returned by tag(x) will be either Num, Left or Digit.

Each named item in a grammar rule can be selected directly by using its selector macro. For example, the C expression right(x) selects the right sub-expression node in the second alternative. This expression only makes sense if tag(x) returns Left.

Sequence items and optional items in parse trees are accessed in the same way as any other kind of item. When accessing a sequence item, the first node in the list is returned (if the sequence is empty then NULL is returned); other nodes in the sequence can be accessed via the the next function (see LIST(5) and TRAVERSE(5) ). Consider the rule: cmd: ( id arg* [ filename ] )
This grammar rule defines the new C type CMD. If x is a C expression of this type, then the macros id, arg and filename can all be applied to x. The value of id(x) will be the ID node below x in the parse tree. The value of arg(x) will be the first ARG node below x in the parse tree; if the sequence is empty then arg(x) will return NULL. Subsequent ARG nodes can be obtained with the next function; for example, next(arg(x)) returns the second ARG in the sequence. The value of filename(x) will be the FILENAME node below x in the parse tree; if there is no such node then filename(x) will return NULL.

Lex-tokens

A lex-token defines the structure of a token that will be read from a specification. The language used to define lex-tokens is the same as that used by the lex program. Only two types of lex-tokens, those that are delimited by quotes, and those that are delimited by curly brackets, can be used directly in a grammar section. All other lex-tokens must be delimited with angle brackets, < and >, and all embedded occurrences of > must be escaped with a backslash character. Some example lex-tokens are:

«abc»: The literal token abc.
<[abc]>: An a, or b, or c.
<[0-9]>: A single digit between 0 and 9 inclusive.
<[^0-9]>: Any non-numeric character.
<[a-z]+>: A sequence of one or more lower case letters.
<[I-N]*>: A sequence of zero or more upper case letters between I and N.
«\n»: A newline character.
«\t»: A tab character.
«\ «: A blank character (A blank follows the backslash).
«\\»: A backslash character.
<[+-]?>: An optional sign character.
<ab|cd>: Either ab or cd.
<[a-z]{2,5}>: Two to five lower case letters.
{ID}: The token defined as ID in a lex-definitions section.

For more details on the lex-token C data structure see TOKEN(5) . Also see LEX(2) for information on using lex definitions.

Consider the rule:
func: (name:<[A-Z]+> «(» arg:<[a-z]> «)» ) This grammar rule contains four lex-tokens, two as the bodies of lex-token definitions and two as literal, lextoken items. The two definitions cause the C data types NAME and ARG to be created. Every parse tree node of type NAME will have the type index value of Name, and every parse tree node of type ARG will have the type index value of Arg. The grammar rule itself creates the C type FUNC. The two lex-token definitions also cause the creation of two new selector macros, name and arg, that may be applied to any node of type FUNC to select the NAME and ARG nodes below it in a parse tree.

Unlike other items, the order in which lex-tokens appear in a grammar section can be significant. This is because ivy*meta SDTB uses the lex-tokens to generate a lex file, and the order of the lex-tokens in the lex file is used by lex in disambiguating overlapping tokens. For example, consider the following grammar rule:
word: ( beg:<begin|BEGIN> | id:<[a-z]+> ) In this example, the word nonterminal can either be an id token, defined as a sequence of lower case letters, or one of two keywords begin and BEGIN. An ambiguity exists here, because the input ``begin'' matches both lex-tokens. This ambiguity is resolved by lex because lex chooses the lextoken that matches the longest string in the specification, and if there are equal length matches, lex chooses the lextoken that was defined first. In the above example, the input ``begin'' is recognized as a beg keyword.

There is one predefined lex-token named error. This token may be used to implement syntax error recovery (see the yacc documentation).

Lex-actions

A lex-action may be associated with any lex-token. A lexaction immediately follows a lex-token, and consists of the keyword lex-action followed by C code enclosed in curly brackets. The lex-actions are inserted into the generated lex file; they are executed directly after the parse tree node for the token is created. For example: declared_id:<[a-zA-Z]+>
lex-action { if(!lookup(yytext)) return T_UNDECLARED_ID; } undeclared_id:<[A-Za-z]+>
In this example, each input string that matches the declared_id lex-token is checked in a symbol table; if it is not there then the generated lexer returns an undeclared_id rather than a declared_id. Note the use of the yacc-lex interface name T_UNDECLARED_ID. The yacc-lex interface name is the identifier that is used by the generated lexer to return a token code to the generated parser. The name is formed by prepending T_ to the name of the token in all upper case.

Follow Graphs

Before ivy*meta SDTB generates the lex.l file, it first builds the token follow graph of the grammar, a graph that specifies which tokens can follow which other tokens in a valid specification. This graph allows ivy*meta SDTB to remove many of the token ambiguities caused by overlapping tokens. The follow graph is used to compute lex start conditions that are automatically included in the generated lex file; the lexer will only match tokens that can validly follow the previously matched token. This default behavior can be overridden by using the red section (see RED(2) ).

Token Merging

Token merging is an action that takes place while ivy*meta SDTB is building the token follow graph. Initially every token definition and literal token in the grammar section is treated as a separate token. If ivy*meta discovers that two lex-tokens with identical structures can follow the same token, then the two lex-tokens are merged, simplifying the follow graph. Token merging reduces the number of overlapping token ambiguities.

Balanced Lex-tokens

A balanced lex-token defines a token that is delimited by, and may contain, balanced delimiter characters. Balanced tokens are typically used to recognize, as tokens, strings of characters, possibly quite long, that are too complex to recognize as simple tokens. For example, the following balanced lex-token will match parenthesized expressions that may contain nested, parenthesized subexpressions: balance «(» <[^()]> «)"
The first lex-token following the balance keyword specifies the opening delimiter, the second lex-token specifies the non-delimiter part of the token, and the last lex-token specifies the closing delimiter. The second lex-token must exclude the opening and closing delimiters.

As a more complex balanced lex-token, consider the problem of describing a sequence of C statements within curly brackets as a token. In this example curly brackets may be deeply nested, and any curly brackets that occur in comments, quoted strings, or character constants should not be treated as delimiters. The following balanced lex-token solves the problem:
balance «{» <({comment}|{cquote}|{squote}|{other})> «}" where the lex-definitions would be provided in a lexdefinitions section:
%lex-definitions
comment \/\*([^*]|\*+[^*/])*\*+\/
cquote \'([^'\\]|\\(.|\n))*\'
squote \"([^"\\]|\\(.|\n))*\"
other [^{}"']

A simpler implementation of a C escape token, one that doesn't handle curly brackets embedded in quotes or comments, is given in AGPUTS(5) .

Each balanced lex-token definition creates a new C type, a new selector macro, and a new type index value, exactly as normal lex-token definitions do.

Renamings

A renaming definition has the form:
name: name
Renaming definitions cause new selector macros and new type index values to be created. Renaming definitions provide a simple means for adhering to the two restrictions that apply to named items in grammar rules (see the Grammar Rule section for an example).

Attributes

An attribute definition has the form:
name : CTYPE [ = C-expression ; ]
An attribute does not describe a grammatical component of a specification language; that is, an attribute plays no syntactic role. When an attribute definition or attribute name is used as an item in a grammar rule, an extra component is added to the C type created for the rule. The new component may be used to store any type of auxiliary data. Attributes are commonly used to implement attribute grammars. In the definition, name is the name of the attribute; CTYPE must be in all upper case and is the C type of the attribute's value; and C-expression is an optional C expression that may be used to initialize attribute values as the parse tree is being built. Because the CTYPE must be in upper case, ivy*meta SDTB automatically provides INT and STRING as pre-defined C types (also included are BOOLEAN, CHAR, FLOAT, DOUBLE, SHORT, LONG and UNSIGNED). Any of the generated C node types can also be used as an attribute CTYPE; additional CTYPE definitions may be placed in a header section that precedes the grammar section.

Each attribute definition causes a new selector macro to be created that accesses the attribute value. The selector macro can be used as an l-value (on the left side of an assignment operator) or as an r-value (on the right side of an assignment).

Because attributes play no syntactic role they are not members of any specific grammar rule alternative (but see the NOTES section). For example, the val attribute in the following grammar rule:
expr: (num:<[0-9]+>
| id:<[a-z]+>
| left:expr op:<[-+*/]> expr
val:INT
)
is accessible (via the val macro) in every parse tree node of type EXPR, regardless of alternative.

In the following example, all identifiers in the input are given a unique number.
id: ( token:<[a-zA-Z]+> id_no:INT = ++count; )

An attribute name may not be renamed.

Sequence Items

When used as an item in a grammar rule, a name may be followed by a +, which means a list of one or more specification objects, or it can be followed by an *, which means a list of zero or more objects. If those text objects are separated by commas, semicolons, or any other items, this may be specified with a separated-by clause. See LIST(5) for further details on lists. See the Grammar Rule section for an example.

Optional Items

Optional items are the items that appear within square brackets in grammar rules. They represent text objects that may or may not appear in a specification. Optional items that do not appear in the specification are represented by NULL. See the Grammar Rule section for an example.

List-structures

A list-structure is a C type that describes nodes that make up lists. A list-structure has no relationship to the specification grammar (although components of a liststructure node may refer to parse tree nodes). Such lists may be used to store any type of auxiliary data. A typical list-structure definition has the form:
name: list-structure ( item... )
The items in the body are generally attribute names or definitions, or names of nonterminals or tokens. From this definition ivy*meta SDTB creates a new C type. Each node of this new type has a component for each of the items in the definition, and the values of those components are accessible through selector macros. In addition, the runtime support that is provided for lists of parse tree nodes is also available for list-structures (see LIST(5) ).

The body may also be structured as a grammar rule with alternatives, in which case the list-structure represents a union.

List-structure nodes must be created specifically by using the alloc function (see ALLOC(5) ). (Parse tree nodes are created automatically by the SDTool's parser.)

Example

The following example shows a ivy*meta SDTB grammar section that describes the syntax of the ivy*meta SDTB grammar section:
%grammar
gram: ( <^%grammar> defn+ )

defn: ( name «:» [ «list-structure» ] item )

item: ( name: [ <"+"|"*"> [ «separated-by» item ] ] | defn | ltok [ «lex-action» balance «{» <[^{}]> «}» ] | btok: «balance» opn:ltok ltok clos:ltok | attr: <[A-Z][A-Z0-9_]*> [ «=» <[^;\n]*> «;» ] | rule: «(» alt+ separated-by «|» «)" | optn: «[» item+ «]" )

alt: ( item+ [ «%prec» tnam ] )

name: <[a-z][a-z0-9_]*>
ltok: <\"([^"\\]|"\\".)*\"|"{"[^}]*"}"|"<"([^\>\\]|"\\".)*\>> tnam: <T_[A-Z0-9]+|'.'|[a-z0-9]+>

Notes

A renaming definition does in fact create a new C type that is an alias for the type of the name being renamed. However, no parse tree node will ever be created with this type (that is, no parse tree node will ever be assigned the type index value that is created by the renaming). Using such types is error prone, does not lead to expected behavior, and should be avoided.

An attribute definition does in fact create a new C type that is an alias for the C type of its values. This type name may be used with impunity.

An attribute definition creates a field in a nonterminal node type that may be accessed in every parse tree node of the type, regardless of the rule alternative below the node. However, if an initialization expression is included in the attribute definition, the expression will be executed only when the alternative that includes the attribute is recognized.

A grammar rule may in fact be used as an item in a grammar rule. This causes ivy*meta SDTB to create an anonymous nonterminal, but no navigation macro can be provided that accesses the corresponding anonymous part of a parse tree (although the various traversal mechanisms will traverse the anonymous part unimpeded). Although allowed, there seems little justification in using this feature.

Documentation

Lex A Lexical Analyzer Generator, by M.E. Lesk and E. Schmidt Yacc Yet Another Compiler-Compiler, by S.C. Johnson

Warning

Long tokens can cause buffer overflow (to increase limits see LEX(2) ).