
ECS 120 Theory of Computation
Regular expressions and introduction to context-free grammars
Julian Panetta
University of California, Davis

Declarative Models of Computation

  • Our first model of computation, the DFA, is like an imperative programming language.
    • It defines a language by precisely defining every step that should be followed to accept/reject a string.
    • It is trivial to translate the DFA into an imperative program, e.g., using a switch statement, lookup table, or goto instructions.
  • In contrast, the next models of computation that we will study are declarative.
    • They describe the strings in a language without giving specific processing steps.
    • The first, regular expressions, are patterns describing a set of strings.
    • The second, context-free grammars, give rules to generate strings that should be accepted;
      they turn out to be more powerful than regular expressions.
    • The third, non-deterministic finite automata, are like DFAs but are less imperative since they do not define specific steps to follow.
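To make the "imperative" reading of a DFA concrete, here is a minimal Python sketch of the lookup-table translation. The particular DFA (two states, accepting binary strings that end in 1) and the names `TRANSITIONS`, `START`, `ACCEPTING`, and `dfa_accepts` are assumptions chosen for illustration, not taken from the slides.

```python
# Hypothetical DFA over {0, 1} that accepts exactly the strings ending in 1.
# "q0": no input yet, or the last symbol read was 0.
# "q1": the last symbol read was 1.
TRANSITIONS = {
    ("q0", "0"): "q0",
    ("q0", "1"): "q1",
    ("q1", "0"): "q0",
    ("q1", "1"): "q1",
}
START = "q0"
ACCEPTING = {"q1"}


def dfa_accepts(w: str) -> bool:
    """Follow one table lookup per input symbol, then check the final state."""
    state = START
    for symbol in w:
        state = TRANSITIONS[(state, symbol)]
    return state in ACCEPTING


print(dfa_accepts("0101"))  # True
print(dfa_accepts("10"))    # False
```

Every step the machine takes is spelled out by the table, which is what distinguishes this style from the declarative models that follow.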

Regular Expressions

  • Extremely useful for searching and manipulating text:
    • Advanced search/replace features in text editors
    • Major reason for the popularity of the Perl programming language (featuring built-in regex support)
    • The sed, awk, and grep POSIX commands
      (the name grep comes from g/re/p, a command in ed that runs a global regular expression search and prints the matched lines)
  • Used to tokenize input strings in the
    front-end of a compiler.
  • Turn out to be equivalent to DFAs in terms of the languages they can describe.
    • We’ll eventually see how to convert a regular expression into a DFA and vice versa.

Regular Expressions

Just like an arithmetic expression \[(5 + 3) \times 4\] is a string built from numbers and operators that evaluates to a numerical value,
a regular expression (regex) is a string built from symbols and special operators like \[ (0 \cup 1) 0^* \] whose value is a language (a set of strings).

Most regex tools replace \(\cup\) with |, e.g., (0|1)0*. Use \(\cup\) when handwriting (e.g., on an exam) to avoid confusing | and \(1\).

  • Intuitively: regexes are patterns that match certain strings and not others.
  • We can understand the expression \((0 \cup 1) 0^*\) as shorthand for building the language it decides.
    • Symbols \(0\) and \(1\) are shorthand for singletons \(\{0\}\) and \(\{1\}\).
    • So \((0 \cup 1)\) is really the language \(\{0, 1\}\), and \(0^*\) is \(\{0\}^*\).
    • \((0 \cup 1) 0^*\) is thus the concatenation of these two languages \[ L((0 \cup 1) 0^*) = \{0, 1\} \circ \{0\}^* = \{0, 1, 00, 10, \ldots \} \]
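The concatenation above can be sketched with ordinary Python sets. Since \(\{0\}^*\) is infinite, the sketch only builds a finite slice of it; the cutoff `MAX_ZEROS` and the variable names are assumptions for illustration.

```python
MAX_ZEROS = 3  # assumed cutoff; the real {0}* is infinite

first = {"0", "1"}                                   # L(0 ∪ 1)
zero_star = {"0" * k for k in range(MAX_ZEROS + 1)}  # finite slice of {0}*

# Concatenation of languages: every x from the first set followed by every y from the second.
language = {x + y for x in first for y in zero_star}

print(sorted(language, key=lambda w: (len(w), w)))
# ['0', '1', '00', '10', '000', '100', '0000', '1000']
```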

Basic Examples

  • The regular expression \[ (0 \cup 1)^* \quad \quad \fragment{L((0 \cup 1)^*) = \{0, 1\}^*} \] defines/matches all strings over \(\{0, 1\}\).
  • If \(\Sigma\) is any alphabet, then
    • \(\Sigma\) is a regex that matches all strings of length 1 over \(\Sigma\).
    • \(\Sigma^*\) matches all strings over \(\Sigma\).
    • \(\Sigma^* 1\) matches all strings over \(\Sigma\) that end with a 1.
    • \((0 \Sigma^*) \cup (\Sigma^* 1)\) matches all strings that
      start with a 0 or end with a 1
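These patterns can be tried out with Python's `re` module, as a rough sketch. It assumes \(\Sigma = \{0, 1\}\), written as the character class `[01]`, and uses `|` in place of \(\cup\). The helper `matches` is a hypothetical name; it uses `fullmatch` because the formal notion of matching applies to the whole string, not a substring.

```python
import re

SIGMA = "[01]"  # character class playing the role of Σ = {0, 1}

ends_with_1 = re.compile(f"{SIGMA}*1")
starts_0_or_ends_1 = re.compile(f"(0{SIGMA}*)|({SIGMA}*1)")


def matches(pattern: re.Pattern, w: str) -> bool:
    """True if the entire string w is matched by the pattern."""
    return pattern.fullmatch(w) is not None


print(matches(ends_with_1, "0101"))       # True
print(matches(ends_with_1, "10"))         # False
print(matches(starts_0_or_ends_1, "00"))  # True  (starts with 0)
print(matches(starts_0_or_ends_1, "10"))  # False (starts with 1, ends with 0)
```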

Operator Precedence

  • * has highest precedence,
    followed by concatenation, then \(\cup\)
    (analogous to exponentiation, multiplication, and addition).
  • Use parentheses to override this order. \[ 0 \cup 1 \Sigma^* \quad \text{ vs } \quad (0 \cup 1) \Sigma^* \]
  • We could write \((0 \Sigma^*) \cup (\Sigma^* 1)\) equivalently as \(0 \Sigma^* \cup \Sigma^* 1\)
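Python's `re` syntax follows the same precedence order, so the effect of the parentheses can be observed directly. This is an illustrative sketch assuming \(\Sigma = \{0, 1\}\) (written `[01]`); the variable names are hypothetical.

```python
import re

no_parens = re.compile("0|1[01]*")      # 0 ∪ 1Σ*: the string 0, or any string starting with 1
with_parens = re.compile("(0|1)[01]*")  # (0 ∪ 1)Σ*: any nonempty string

for w in ["0", "10", "00", "011"]:
    print(w, bool(no_parens.fullmatch(w)), bool(with_parens.fullmatch(w)))
# 0 True True
# 10 True True
# 00 False True
# 011 False True
```

The strings 00 and 011 separate the two expressions: without parentheses, the \(\Sigma^*\) attaches only to the 1.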

Formal Definition of Regular Expressions

We now define the syntax (valid expressions) and semantics (languages they decide) of regexes.

Let \(\Delta = \{ \cup, (, ), *, \emptyset, \emptystring \}\) be the regex control alphabet.

Let \(\Sigma\), the input alphabet, be an alphabet such that \(\Sigma \cap \Delta = \emptyset\).

\(R \in (\Sigma \cup \Delta)^*\) is a regular expression deciding language \(L(R) \subseteq \Sigma^*\) if one of the following holds (this is an inductive definition!):

  1. (base) \(R = a\) for some \(a \in \Sigma\). Then \(L(R) = \{a\}\).
  2. (base) \(R = \emptystring\). Then \(L(R) = \{\emptystring\}\).
  3. (base) \(R = \emptyset\). Then \(L(R) = \{\}\).
  4. (inductive) \(R = (R_1) \cup (R_2)\) for \(R_1, R_2\) regexes. Then \(L(R) = L(R_1) \cup L(R_2)\).
  5. (inductive) \(R = (R_1)(R_2)\) for \(R_1, R_2\) regexes. Then \(L(R) = L(R_1) \circ L(R_2)\).
  6. (inductive) \(R = (R_1)^*\) for \(R_1\) a regex. Then \(L(R) = L(R_1)^*\).

Parentheses can be omitted, in which case the precedence rules apply.

A language \(A\) is regex-decidable if there exists a regex \(R\) such that \(L(R) = A\). In Chapter 6 we prove that a language is regex-decidable if and only if it is DFA-decidable.
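Because the definition is inductive, it translates almost line for line into a recursive function. The sketch below makes two simplifying assumptions that are not part of the definition: the regex is given as a nested tuple (an already-parsed syntax tree) rather than as a string, and only the strings of length at most `n` are computed, since \(L(R)\) may be infinite. The tuple tags and the function name `language` are hypothetical.

```python
def language(r, n):
    """Return the set of strings in L(r) of length at most n.

    r is a nested tuple mirroring the six cases of the definition:
      ("sym", a), ("eps",), ("empty",),
      ("union", r1, r2), ("concat", r1, r2), ("star", r1)
    """
    kind = r[0]
    if kind == "sym":      # case 1: R = a
        return {r[1]} if n >= 1 else set()
    if kind == "eps":      # case 2: R = ε
        return {""}
    if kind == "empty":    # case 3: R = ∅
        return set()
    if kind == "union":    # case 4: L(R1) ∪ L(R2)
        return language(r[1], n) | language(r[2], n)
    if kind == "concat":   # case 5: L(R1) ∘ L(R2), truncated to length n
        left, right = language(r[1], n), language(r[2], n)
        return {x + y for x in left for y in right if len(x + y) <= n}
    if kind == "star":     # case 6: L(R1)*, built by repeated concatenation
        base = language(r[1], n)
        result = {""}
        frontier = {""}
        while frontier:
            frontier = {x + y for x in frontier for y in base
                        if len(x + y) <= n} - result
            result |= frontier
        return result
    raise ValueError(f"unknown regex node: {kind!r}")


# (0 ∪ 1) 0*
R = ("concat", ("union", ("sym", "0"), ("sym", "1")), ("star", ("sym", "0")))
print(sorted(language(R, 2), key=lambda w: (len(w), w)))  # ['0', '1', '00', '10']
```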

Some Conventions and Convenient Notation

  • Sometimes we don’t distinguish between \(R\) and \(L(R)\).
  • Remember that \(R^*\) allows 0 repetitions of \(R\) (matching the empty string).
    • What if we want to match 1 or more repetitions of \(R\)? \(R R^*\)
    • We can use the shorthand \(R^+\) for \(RR^*\).
    • \(R^+ \cup \emptystring = \fragment{R^*}\)
  • We use \(R^k\) to denote \(R\) repeated exactly \(k\) times.
    • \(R^0 = \emptystring\)
    • \(R^1 = R\)
    • \(R^k = \underbrace{R\ldots R}_k\) where \(k\) is a constant, e.g., \(R^3 = RRR\)
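Most regex tools have direct counterparts for this shorthand; in Python's `re` syntax, `+` plays the role of \(R^+\) and `{k}` the role of \(R^k\). The sketch below spot-checks a few strings for the hypothetical choice \(R = 01\); it illustrates the notation and is not a proof of the identities.

```python
import re

R = "(01)"  # hypothetical example regex R = 01


def matches(pattern: str, w: str) -> bool:
    """True if the entire string w is matched by the pattern."""
    return re.fullmatch(pattern, w) is not None


checks = {
    "R* matches the empty string": matches(f"{R}*", ""),
    "R+ rejects the empty string": not matches(f"{R}+", ""),
    "R+ and RR* both match 0101": matches(f"{R}+", "0101") and matches(f"{R}{R}*", "0101"),
    "R^3 matches RRR": matches(f"{R}{{3}}", "010101"),
    "R^3 rejects RR": not matches(f"{R}{{3}}", "0101"),
}

for claim, holds in checks.items():
    print(claim, holds)
```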

Examples

In the following examples, we assume that \(\Sigma = \{0, 1\}\).

  • \(0^* 1 0^* = \fragment{\setbuild{w \in \Sigma^*}{w \text{ contains a single 1}}}\)
  • \(\Sigma^* 1 \Sigma^* = \fragment{\setbuild{w \in \Sigma^*}{w \text{ has at least one 1}}}\)
  • \(\setbuild{w \in \Sigma^*}{w \text{ has at least two 1's}} = \fragment{\Sigma^* 1 \Sigma^* 1 \Sigma^*}\)
  • \(\setbuild{w \in \Sigma^*}{w \text{ has exactly two 1's}} = \fragment{0^* 1 0^* 1 0^*}\)
  • \(\setbuild{w \in \Sigma^*}{w \text{ contains the substring 001}} = \fragment{\Sigma^* 001 \Sigma^*}\)
  • \(\setbuild{w \in \Sigma^*}{\text{ every 0 in $w$ is followed by at least one 1}} = \fragment{1^* (01^+)^*}\)
  • \(\setbuild{w \in \Sigma^*}{|w| \text{ is even}} = \fragment{(\Sigma\Sigma)^*}\)
  • \(\setbuild{w \in \Sigma^*}{w \text{ starts and ends with the same symbol}} = \fragment{0\Sigma^*0\ \cup\ 1\Sigma^*1\ \cup\ 0\ \cup\ 1}\)
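One way to gain confidence in answers like these is a brute-force comparison on all short strings. The sketch below pairs each regex (in Python `re` syntax, with `[01]` for \(\Sigma\)) with a predicate stating the intended property, and looks for disagreements up to an assumed length bound of 8. The helper names are hypothetical, and the last predicate assumes the empty string does not count as starting and ending with the same symbol, in line with the regex above. Passing the check is evidence, not a proof.

```python
import re
from itertools import product

S = "[01]"  # Σ = {0, 1}

# (regex in Python syntax, predicate stating the intended property)
EXAMPLES = [
    ("0*10*", lambda w: w.count("1") == 1),
    (f"{S}*1{S}*", lambda w: "1" in w),
    (f"{S}*1{S}*1{S}*", lambda w: w.count("1") >= 2),
    ("0*10*10*", lambda w: w.count("1") == 2),
    (f"{S}*001{S}*", lambda w: "001" in w),
    ("1*(01+)*", lambda w: "00" not in w and not w.endswith("0")),
    (f"({S}{S})*", lambda w: len(w) % 2 == 0),
    (f"0{S}*0|1{S}*1|0|1", lambda w: len(w) >= 1 and w[0] == w[-1]),
]


def counterexamples(pattern, predicate, max_len=8):
    """Return the binary strings up to max_len on which regex and predicate disagree."""
    bad = []
    for n in range(max_len + 1):
        for letters in product("01", repeat=n):
            w = "".join(letters)
            if (re.fullmatch(pattern, w) is not None) != predicate(w):
                bad.append(w)
    return bad


for pattern, predicate in EXAMPLES:
    print(pattern, counterexamples(pattern, predicate))
```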

Algebra of Regular Expressions

  • Regular expressions satisfy the “distributive law”: \[ A(B \cup C) = AB \cup AC \] (Think of \(\cup\) as analogous to addition and concatenation as multiplication.)
  • So FOIL works: \(L((0 \cup \varepsilon) (1 \cup \varepsilon)) = \{01, 0, 1, \varepsilon\}\)
  • \(\emptyset\) is the identity for \(\cup\) (like 0 is the identity for +) and annihilates concatenation (like multiplying by 0): \[ R \cup \emptyset = R, \quad \quad R \emptyset = \emptyset \]
  • \(\emptystring\) is the identity for concatenation (like 1 is the identity for \(\times\)): \[ R \emptystring = \emptystring R = R \]
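As a worked instance, the FOIL example can be expanded step by step. The first step distributes from the right, \((A \cup B)C = AC \cup BC\), which holds for languages just as the left-hand version does; the last step uses \(R\varepsilon = \varepsilon R = R\).

\[\begin{align*} (0 \cup \varepsilon)(1 \cup \varepsilon) &= 0(1 \cup \varepsilon) \cup \varepsilon(1 \cup \varepsilon) \\ &= 01 \cup 0\varepsilon \cup \varepsilon 1 \cup \varepsilon\varepsilon \\ &= 01 \cup 0 \cup 1 \cup \varepsilon \end{align*}\]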

Matching a Numerical Constant

Suppose you’re writing a compiler for a programming language

  • In the first step, you need to break down the input string into tokens that could represent variable names, keywords, operators, and constants.

  • This is generally done by a lexer defined by regexes matching each token type.

  • Let’s try to do this for numerical constants with an optional fractional part and sign.

    • Let \(P = 1 \cup 2 \cup \ldots \cup 9\) and \(D = 0 \cup P\).
    • Could be an integer: either 0 or a nonempty string of digits not starting with 0: \(I = 0 \cup \fragment{P D^*}\)
    • Or a decimal number with a nonempty integer part: \(\fragment{I . D^*}\)
    • Or a decimal number with no integer part and nonempty fractional part: \(\fragment{. D^+}\)
    • The leading sign part could be \(-\) or missing: \(\fragment{(- \cup \emptystring)}\)
  • Putting everything together, we get \[ (- \cup \emptystring)(I \cup I . D^* \cup . D^+) \hspace{10em} \]

    Technically 203 is an int literal, not a float literal, since it has no decimal point. How could we modify the regex to accept only float literals?
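The combined expression carries over to Python's `re` syntax almost symbol for symbol. This is a sketch under a few notational assumptions: character classes stand in for the unions \(P\) and \(D\), `-?` stands in for \((- \cup \varepsilon)\), and the literal decimal point is escaped as `\.`. The names `NUMBER` and `is_number` are hypothetical.

```python
import re

P = "[1-9]"              # P = 1 ∪ 2 ∪ ... ∪ 9
D = "[0-9]"              # D = 0 ∪ P
I = f"(?:0|{P}{D}*)"     # I = 0 ∪ P D*

# (- ∪ ε)(I ∪ I . D* ∪ . D+)
NUMBER = re.compile(rf"-?(?:{I}|{I}\.{D}*|\.{D}+)")


def is_number(s: str) -> bool:
    """True if the whole string s is a numerical constant under the regex above."""
    return NUMBER.fullmatch(s) is not None


for s in ["203", "-3.14", "5.", ".5", "007", ".", "-"]:
    print(repr(s), is_number(s))
# '203' True
# '-3.14' True
# '5.' True
# '.5' True
# '007' False
# '.' False
# '-' False
```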

Context-Free Grammars

  • Our second declarative model of computation is the context-free grammar (CFG).
  • It’s more powerful than regular expressions.
    • Any regular expression can be expressed as a CFG.
    • Not all CFGs can be expressed as regular expressions.
  • CFGs are useful to describe the syntax of programming languages.
    • They are very powerful for defining domain-specific languages (DSLs), simple interpreted languages, or structured file formats.
    • Tools can automatically turn a CFG into a parser, which converts a sequence of tokens into a parse tree.
    • More complicated languages like C and C++ are unfortunately context sensitive and so cannot be fully specified by CFGs.

Context-Free Grammars

  • Simple example: \[\begin{align*} A &\to 0 A 1 \\ A &\to B \\ B &\to\ ! \end{align*}\]
  • A CFG has:
    • Productions (substitution rules), one per line,
      each consisting of a variable, an arrow, and a replacement string.
    • A variable or non-terminal symbol appearing before each arrow
      that is to be replaced by the production.
      (\(A\) and \(B\) in this example)
    • Terminals (the alphabet over which it generates strings):
      symbols in the replacement string that are not variables.
      (\(0\), \(1\), and \(!\) in this example)
    • A start variable for kicking off string generation;
      by convention the variable on the left-hand side of the first rule, unless specified otherwise.
      (\(A\) in this example)

Why is this “context-free”?

  • The left-hand side of each rule is a single non-terminal symbol.
  • A context-sensitive grammar would allow the left-hand side to be a string of symbols as in: \[ A B \to A \string{xy},\] requiring information about the context in which the rule is applied.

Example CFG

\[\begin{align*} A &\to 0 A 1 \\ A &\to B \\ B &\to\ ! \end{align*}\]

  • Let’s use this grammar to generate a string.
    1. Start by writing the start variable \(A\).
    2. Find a variable in the current string and
      replace it with the right-hand side of one of its production rules.
    3. Repeat step 2 until no variables are left.
  • Example 1: \[ A \fragment{\yields 0 A 1} \fragment{\yields 0 0 A 1 1 } \fragment{\yields 0 0 B 1 1 } \fragment{\yields 00!11 } \hspace{8em} \] The grammar generates \(\string{00!11}\) through this derivation.
  • Example 2: \[ A \fragment{\yields 0 A 1} \fragment{\yields 0 B 1} \fragment{\yields 0 ! 1} \hspace{8em} \]
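The generation procedure can be sketched in a few lines of Python. The grammar is the one above; the representation (a dictionary from each variable to its list of replacement strings) and the function `derive`, which always rewrites the leftmost variable and is told which rule to use at each step, are assumptions made for illustration.

```python
# Grammar from the example: A -> 0A1, A -> B, B -> !
RULES = {
    "A": ["0A1", "B"],
    "B": ["!"],
}


def derive(choices, start="A"):
    """Apply one production per entry of `choices` to the leftmost variable.

    choices[i] is the index into RULES[variable] of the rule used at step i.
    Returns every intermediate string, beginning with the start variable.
    Raises StopIteration if a step is requested when no variable is left.
    """
    current = start
    steps = [current]
    for choice in choices:
        # Step 2 of the procedure: find a variable and replace it.
        pos = next(i for i, c in enumerate(current) if c in RULES)
        variable = current[pos]
        current = current[:pos] + RULES[variable][choice] + current[pos + 1:]
        steps.append(current)
    return steps


print(" => ".join(derive([0, 0, 1, 0])))  # A => 0A1 => 00A11 => 00B11 => 00!11
print(" => ".join(derive([0, 1, 0])))     # A => 0A1 => 0B1 => 0!1
```

The two calls reproduce Example 1 and Example 2; different choice sequences yield different strings of the language.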

Parse tree for Example 1

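[Figure: parse tree for \(\string{00!11}\). The root \(A\) has children \(0\), \(A\), \(1\); the middle \(A\) again has children \(0\), \(A\), \(1\); the innermost \(A\) has the single child \(B\), whose only child is \(!\).]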