ECS 120 Theory of Computation

Regular expressions and introduction to context-free grammars

Julian Panetta

University of California, Davis

Declarative Models of Computation

Our first model of computation, the DFA, is like an imperative programming language.
- It defines a language by precisely defining every step that should be followed to accept/reject a string.
- It is trivial to translate a DFA into an imperative program, e.g., using a switch statement, lookup table, or goto instructions.
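As a minimal sketch of the lookup-table translation mentioned above, the following Python runs a small DFA. The DFA itself (states `q0` and `q1`, accepting binary strings that end in 1) is a hypothetical example chosen for illustration, not one taken from the lecture.

```python
# Hypothetical DFA (illustration only): accepts binary strings ending in 1.
# The transition function is stored as a lookup table keyed by (state, symbol).
TRANSITIONS = {
    ("q0", "0"): "q0",
    ("q0", "1"): "q1",
    ("q1", "0"): "q0",
    ("q1", "1"): "q1",
}
START = "q0"
ACCEPTING = {"q1"}


def dfa_accepts(w: str) -> bool:
    """Follow exactly one transition per input symbol, then check the final state."""
    state = START
    for symbol in w:
        state = TRANSITIONS[(state, symbol)]
    return state in ACCEPTING


if __name__ == "__main__":
    for w in ["", "1", "10", "0101"]:
        print(repr(w), dfa_accepts(w))
```

Every step the program takes is dictated by the table, which is what makes the DFA feel imperative.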
In contrast, the next models of computation that we will study are declarative.
- They describe the strings in a language without giving specific processing steps.
- The first, regular expressions, are patterns describing a set of strings.
- The second, context-free grammars, give rules to generate strings that should be accepted;
  they turn out to be more powerful than regular expressions.
- The third, non-deterministic finite automata, are like DFAs but are less imperative since they do not define specific steps to follow.

Regular Expressions

Extremely useful for searching and manipulating text:
- Advanced search/replace features in text editors
- Major reason for the popularity of the Perl programming language (featuring built-in regex support)
- The sed, awk, and grep POSIX commands
  (the name grep comes from g/re/p, a command in ed to run a global regular expression search and print the matched lines)
Used to tokenize input strings in the
front-end of a compiler.
Turn out to be equivalent to DFAs in terms of the languages they can describe.
- We’ll eventually see how to convert a regular expression into a DFA and vice versa.


Regular Expressions

Just like an arithmetic expression \[(5 + 3) \times 4\] is a string built from numbers and operators that evaluates to a numerical value,
a regular expression (regex) is a string built from symbols and special operators like \[ (0 \cup 1) 0^* \] whose value is a language (a set of strings).

Most regex tools replace $\cup$ with |, e.g., (0|1)0*. Use $\cup$ when handwriting (e.g., on an exam) to avoid confusing | and $1$.

Intuitively: regexes are patterns that match certain strings and not others.
We can understand this expression as shorthand for building the language it decides.
- Symbols $0$ and $1$ are shorthand for singletons $\{0\}$ and $\{1\}$.
- So $(0 \cup 1)$ is really the language $\{0, 1\}$, and $0^*$ is $\{0\}^*$.
- $(0 \cup 1) 0^*$ is thus the concatenation of these two languages \[ L((0 \cup 1) 0^*) = \{0, 1\} \circ \{0\}^* = \{0, 1, 00, 10, \ldots \} \]
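The language of $(0 \cup 1) 0^*$ can be explored with a regex tool. The sketch below uses Python's `re` module, writing $\cup$ as `|`, and uses `fullmatch` because a string belongs to the language only if the whole string matches. The helper name `matched_strings` is made up for this example.

```python
import re
from itertools import product

# (0 ∪ 1)0* written in Python's regex syntax.
PATTERN = re.compile(r"(0|1)0*")


def matched_strings(max_len: int) -> list[str]:
    """List every binary string of length <= max_len that the whole pattern matches."""
    out = []
    for n in range(max_len + 1):
        for symbols in product("01", repeat=n):
            w = "".join(symbols)
            if PATTERN.fullmatch(w):
                out.append(w)
    return out


if __name__ == "__main__":
    print(matched_strings(3))  # ['0', '1', '00', '10', '000', '100']
```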

Basic Examples

The regular expression \[ (0 \cup 1)^* \quad \quad \fragment{L((0 \cup 1)^*) = \{0, 1\}^*} \] defines/matches all strings over $\{0, 1\}$.
If $\Sigma$ is any alphabet, then
- $\Sigma$ is a regex that matches all strings of length 1 over $\Sigma$.
- $\Sigma^*$ matches all strings over $\Sigma$.
- $\Sigma^* 1$ matches all strings over $\Sigma$ that end with a 1.
- $(0 \Sigma^*) \cup (\Sigma^* 1)$ matches all strings that
  start with a 0 or end with a 1

Operator Precedence

* has highest precedence,
followed by concatenation, then $\cup$
(analogous to exponentiation, multiplication, and addition).
Use parentheses to override this order. \[ 0 \cup 1 \Sigma^* \quad \text{ vs } \quad (0 \cup 1) \Sigma^* \]
We could write $(0 \Sigma^*) \cup (\Sigma^* 1)$ equivalently as $0 \Sigma^* \cup \Sigma^* 1$
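Python's `re` module follows the same precedence (star, then concatenation, then alternation), so the two expressions compared above can be tried directly. This is only a sketch: `[01]` stands in for $\Sigma$ under the assumption that $\Sigma = \{0, 1\}$, and the names are illustrative.

```python
import re

SIGMA = "[01]"  # assumes the alphabet is {0, 1}

# 0 ∪ 1Σ*  : either the single string "0", or any string starting with 1.
without_parens = re.compile(rf"0|1{SIGMA}*")
# (0 ∪ 1)Σ*: any nonempty string.
with_parens = re.compile(rf"(0|1){SIGMA}*")


def matches(pattern: re.Pattern, w: str) -> bool:
    """True if the entire string w is matched by the pattern."""
    return pattern.fullmatch(w) is not None


if __name__ == "__main__":
    for w in ["0", "00", "101"]:
        print(w, matches(without_parens, w), matches(with_parens, w))
```

The string `00` separates the two: it is matched only by the parenthesized version.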

Formal Definition of Regular Expressions

We now define the syntax (valid expressions) and semantics (languages they decide) of regexes.

Let $\Delta = \{ \cup, (, ), *, \emptyset, \emptystring \}$ be the regex control alphabet.

Let $\Sigma$, the input alphabet, be an alphabet such that $\Sigma \cap \Delta = \emptyset$.

$R \in (\Sigma \cup \Delta)^*$ is a regular expression deciding language $L(R) \subseteq \Sigma^*$ if one of the following holds:

This is an inductive definition!

(base) $R = a$ for some $a \in \Sigma$. Then $L(R) = \{a\}$.
(base) $R = \emptystring$. Then $L(R) = \{\emptystring\}$.
(base) $R = \emptyset$. Then $L(R) = \{\}$.
(inductive) $R = (R_1) \cup (R_2)$ for $R_1, R_2$ regexes. Then $L(R) = L(R_1) \cup L(R_2)$.
(inductive) $R = (R_1)(R_2)$ for $R_1, R_2$ regexes. Then $L(R) = L(R_1) \circ L(R_2)$.
(inductive) $R = (R_1)^*$ for $R_1$ a regex. Then $L(R) = L(R_1)^*$.

Parentheses can be omitted, in which case the precedence rules apply.
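Because the definition is inductive, $L(R)$ can be computed by recursion on the structure of $R$. The sketch below is one possible encoding, not part of the formal definition: the class names (`Sym`, `Eps`, `Empty`, `Union`, `Concat`, `Star`) and the function `lang` are invented here, and since $L(R)$ may be infinite, `lang(r, n)` returns only the strings of length at most `n`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sym:      # base case: R = a for a symbol a in Sigma
    a: str

@dataclass(frozen=True)
class Eps:      # base case: R = empty string
    pass

@dataclass(frozen=True)
class Empty:    # base case: R = empty set
    pass

@dataclass(frozen=True)
class Union:    # inductive case: (R1) ∪ (R2)
    r1: object
    r2: object

@dataclass(frozen=True)
class Concat:   # inductive case: (R1)(R2)
    r1: object
    r2: object

@dataclass(frozen=True)
class Star:     # inductive case: (R1)*
    r: object


def lang(r, n: int) -> set[str]:
    """Return the strings of L(r) that have length at most n."""
    if isinstance(r, Sym):
        return {r.a} if n >= 1 else set()
    if isinstance(r, Eps):
        return {""}
    if isinstance(r, Empty):
        return set()
    if isinstance(r, Union):
        return lang(r.r1, n) | lang(r.r2, n)
    if isinstance(r, Concat):
        return {x + y for x in lang(r.r1, n) for y in lang(r.r2, n)
                if len(x + y) <= n}
    if isinstance(r, Star):
        base = lang(r.r, n)
        result, frontier = {""}, {""}
        while frontier:  # keep appending pieces until no new short strings appear
            frontier = {x + y for x in frontier for y in base
                        if len(x + y) <= n} - result
            result |= frontier
        return result
    raise TypeError(f"not a regex node: {r!r}")


if __name__ == "__main__":
    # (0 ∪ 1)0*
    R = Concat(Union(Sym("0"), Sym("1")), Star(Sym("0")))
    print(sorted(lang(R, 2), key=lambda w: (len(w), w)))  # ['0', '1', '00', '10']
```

Using a syntax tree sidesteps parsing and parentheses, so each branch of `lang` lines up with one clause of the definition.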

A language $A$ is regex-decidable if there exists a regex $R$ such that $L(R) = A$. In Chapter 6 we prove that a language is regex-decidable if and only if it is DFA-decidable.

Some Conventions and Convenient Notation

Sometimes we don’t distinguish between $R$ and $L(R)$.
Remember that $R^*$ allows 0 repetitions of $R$ (matching the empty string).
- What if we want to match 1 or more repetitions of $R$? $R R^*$
- We can use the shorthand $R^+$ for $RR^*$.
- $R^+ \cup \emptystring = \fragment{R^*}$
We use $R^k$ to denote $R$ repeated exactly $k$ times.
- $R^0 = \emptystring$
- $R^1 = R$
- $R^k = \underbrace{R\ldots R}_k$ where $k$ is a constant, e.g., $R^3 = RRR$
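Most regex tools support these shorthands directly. In Python's `re` syntax, $R^+$ is written `R+` and $R^k$ is written `R{k}`. The sketch below checks the identities above by brute force on short strings; the choice $R = 01$ and the helper `language` are assumptions made for this example, and agreement up to a length bound is evidence rather than a proof.

```python
import re
from itertools import product


def language(pattern: str, max_len: int, alphabet: str = "01") -> set[str]:
    """All strings over `alphabet` of length <= max_len fully matched by `pattern`."""
    regex = re.compile(pattern)
    return {
        "".join(symbols)
        for n in range(max_len + 1)
        for symbols in product(alphabet, repeat=n)
        if regex.fullmatch("".join(symbols))
    }


R = "(?:01)"  # example regex R = 01, wrapped in a non-capturing group

plus_is_r_rstar = language(f"{R}+", 6) == language(f"{R}{R}*", 6)       # R^+ = R R^*
plus_or_eps_is_star = language(f"(?:{R}+|)", 6) == language(f"{R}*", 6)  # R^+ ∪ ε = R^*
r_cubed = language(f"{R}{{3}}", 6)                                       # R^3 = RRR
r_zero = language(f"{R}{{0}}", 6)                                        # R^0 = ε

if __name__ == "__main__":
    print(plus_is_r_rstar, plus_or_eps_is_star, r_cubed, r_zero)
```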

Examples

In the following examples, we assume that $\Sigma = \{0, 1\}$.

$0^* 1 0^* = \fragment{\setbuild{w \in \Sigma^*}{w \text{ contains a single 1}}}$
$\Sigma^* 1 \Sigma^* = \fragment{\setbuild{w \in \Sigma^*}{w \text{ has at least one 1}}}$
$\setbuild{w \in \Sigma^*}{w \text{ has at least two 1's}} = \fragment{\Sigma^* 1 \Sigma^* 1 \Sigma^*}$
$\setbuild{w \in \Sigma^*}{w \text{ has exactly two 1's}} = \fragment{0^* 1 0^* 1 0^*}$
$\setbuild{w \in \Sigma^*}{w \text{ contains the substring 001}} = \fragment{\Sigma^* 001 \Sigma^*}$
$\setbuild{w \in \Sigma^*}{\text{ every 0 in $w$ is followed by at least one 1}} = \fragment{1^* (01^+)^*}$
$\setbuild{w \in \Sigma^*}{|w| \text{ is even}} = \fragment{(\Sigma\Sigma)^*}$
$\setbuild{w \in \Sigma^*}{w \text{ starts and ends with the same symbol}} = \fragment{0\Sigma^*0\ \cup\ 1\Sigma^*1\ \cup\ 0\ \cup\ 1}$
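One way to gain confidence in examples like these is to compare each regex against a direct test of the stated property on all short strings. The sketch below does this in Python with `[01]` standing in for $\Sigma$. The predicates are my own readings of the English descriptions: in particular, "followed by" is read as "immediately followed by", and the last language is assumed to exclude the empty string, as the regex on the slide does.

```python
import re
from itertools import product

S = "[01]"  # Sigma = {0, 1}

# (regex, predicate describing the intended language)
CASES = [
    (r"0*10*", lambda w: w.count("1") == 1),
    (rf"{S}*1{S}*", lambda w: w.count("1") >= 1),
    (rf"{S}*1{S}*1{S}*", lambda w: w.count("1") >= 2),
    (r"0*10*10*", lambda w: w.count("1") == 2),
    (rf"{S}*001{S}*", lambda w: "001" in w),
    # every 0 is immediately followed by a 1
    (r"1*(01+)*", lambda w: all(w[i + 1:i + 2] == "1"
                                for i, c in enumerate(w) if c == "0")),
    (rf"({S}{S})*", lambda w: len(w) % 2 == 0),
    # nonempty, first symbol equals last symbol
    (rf"0{S}*0|1{S}*1|0|1", lambda w: len(w) >= 1 and w[0] == w[-1]),
]


def mismatches(max_len: int) -> list[tuple[str, str]]:
    """Return (regex, string) pairs where the regex and the predicate disagree."""
    bad = []
    for pattern, predicate in CASES:
        regex = re.compile(pattern)
        for n in range(max_len + 1):
            for symbols in product("01", repeat=n):
                w = "".join(symbols)
                if (regex.fullmatch(w) is not None) != predicate(w):
                    bad.append((pattern, w))
    return bad


if __name__ == "__main__":
    print(mismatches(8))  # an empty list means no disagreement up to length 8
```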

Algebra of Regular Expressions

Regular expressions satisfy the “distributive law”: \[ A(B \cup C) = AB \cup AC \] (Think of $\cup$ as analogous to addition and concatenation as multiplication.)
So FOIL works: $L((0 \cup \varepsilon) (1 \cup \varepsilon)) = \{01, 0, 1, \varepsilon\}$
$\emptyset$ is the identity for $\cup$ (like 0 is the identity for +) and annihilates concatenation (like multiplying by 0): \[ R \cup \emptyset = R, \quad \quad R \emptyset = \emptyset \]
$\emptystring$ is the identity for concatenation (like 1 is the identity for $\times$): \[ R \emptystring = R \]
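These laws are statements about languages, so they can be spot-checked on small finite languages represented as Python sets of strings. In this sketch the helper `concat` and the particular sets `A`, `B`, `C`, `R` are illustrative choices; the empty Python string `""` plays the role of the empty string and `set()` plays the role of $\emptyset$.

```python
def concat(X: set[str], Y: set[str]) -> set[str]:
    """Language concatenation: every x followed by every y."""
    return {x + y for x in X for y in Y}


A = {"0", ""}      # L(0 ∪ ε)
B = {"1", ""}      # L(1 ∪ ε)
C = {"11"}         # an arbitrary third language
R = {"0", "10"}    # an arbitrary language

foil = concat(A, B)                                            # {"01", "0", "1", ""}
distributes = concat(A, B | C) == concat(A, B) | concat(A, C)  # A(B ∪ C) = AB ∪ AC
union_identity = (R | set()) == R                              # R ∪ ∅ = R
concat_annihilator = concat(R, set()) == set()                 # R∅ = ∅
concat_identity = concat(R, {""}) == R                         # Rε = R

if __name__ == "__main__":
    print(foil, distributes, union_identity, concat_annihilator, concat_identity)
```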

Matching a Numerical Constant

Suppose you’re writing a compiler for a programming language.

In the first step, you need to break down the input string into tokens that could represent variable names, keywords, operators, and constants.
This is generally done by a lexer defined by regexes matching each token type.
Let’s try to do this for numerical constants with an optional fractional part and sign.
- Let $P = 1 \cup 2 \cup \ldots \cup 9$ and $D = 0 \cup P$.
- Could be an integer: either 0 or a nonempty string of digits not starting with 0: $I = 0 \cup \fragment{P D^*}$
- Or a decimal number with a nonempty integer part: $\fragment{I . D^*}$
- Or a decimal number with no integer part and nonempty fractional part: $\fragment{. D^+}$
- The leading sign part could be $-$ or missing: $\fragment{(- \cup \emptystring)}$
Putting everything together, we get \[ (- \cup \emptystring)(I \cup I . D^* \cup . D^+) \]
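The same construction can be written in a practical regex syntax. The sketch below is one possible transcription into Python's `re` module, not the lexer of any particular compiler: the decimal point must be escaped as `\.` (an unescaped `.` matches any character), the optional sign $(- \cup \emptystring)$ becomes `-?`, and the function name `is_number` is invented here.

```python
import re

P = "[1-9]"               # P = 1 ∪ 2 ∪ ... ∪ 9
D = "[0-9]"               # D = 0 ∪ P
I = f"(?:0|{P}{D}*)"      # I = 0 ∪ P D*

# (- ∪ ε)(I ∪ I . D* ∪ . D+)
NUMBER = re.compile(rf"-?(?:{I}|{I}\.{D}*|\.{D}+)")


def is_number(s: str) -> bool:
    """True if the whole string s is a numerical constant per the regex above."""
    return NUMBER.fullmatch(s) is not None


if __name__ == "__main__":
    for s in ["203", "-0.5", ".5", "3.", "007", "-", "."]:
        print(repr(s), is_number(s))
```

Note that `3.` is accepted (empty fractional part after a nonempty integer part) while `007`, a lone `-`, and a lone `.` are rejected.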

Technically 203 is an int literal, not a float literal, since it has no decimal point. How could we modify the regex to accept only float literals?

Context-Free Grammars

Our second declarative model of computation is the context-free grammar (CFG).
It’s more powerful than regular expressions.
- Any regular expression can be expressed as a CFG.
- Not all CFGs can be expressed as regular expressions.
CFGs are useful to describe the syntax of programming languages.
- They are very powerful for defining domain-specific languages (DSLs), simple interpreted languages, or structured file formats.
- Tools can automatically turn a CFG into a parser, which converts a sequence of tokens into a parse tree.
- More complicated languages like C and C++ are unfortunately context sensitive and so cannot be fully specified by CFGs.

Context-Free Grammars

Simple example: \[\begin{align*} A &\to 0 A 1 \\ A &\to B \\ B &\to\ ! \end{align*}\]
A CFG has:
- Productions or substitution rules in each line
  indicated by a symbol, an arrow, and a replacement string.
- A variable or non-terminal symbol appearing before each arrow
  that is to be replaced by the production.
  ($A$ and $B$ in this example)
- Terminals (the alphabet over which it generates strings):
  symbols in the replacement string that are not variables.
  ($0$, $1$, and $!$ in this example)
- A start variable for kicking off string generation;
  the symbol at the upper-left corner, unless specified otherwise.
  ($A$ in this example)

Why is this “context-free”?

The left-hand side of each rule is a single non-terminal symbol.
A context-sensitive grammar would allow the left-hand side to be a string of symbols as in: \[ A B \to A \string{xy},\] requiring information about the context in which the rule is applied.

Example CFG

\[\begin{align*} A &\to 0 A 1 \\ A &\to B \\ B &\to\ ! \end{align*}\]

Let’s use this grammar to generate a string.
1. Start by writing the start variable $A$.
2. Find a variable in the current string and
  replace it with an applicable production rule.
3. Repeat step 2 until no variables are left.
Example 1: \[ A \fragment{\yields 0 A 1} \fragment{\yields 0 0 A 1 1 } \fragment{\yields 0 0 B 1 1 } \fragment{\yields 00!11 } \] The grammar generates $\string{00!11}$ through this derivation.
Example 2: \[ A \fragment{\yields 0 A 1} \fragment{\yields 0 B 1} \fragment{\yields 0 ! 1} \]
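The generation procedure above can be automated. The following Python sketch explores every derivation of this grammar breadth-first, always replacing the leftmost variable, and collects the strings that contain no variables. The names `RULES` and `generate` are invented for this example, and the length cutoff relies on the assumption (true for this grammar) that no production makes the string shorter.

```python
from collections import deque

# The example grammar: A -> 0A1 | B,  B -> !
RULES = {"A": ["0A1", "B"], "B": ["!"]}
START = "A"


def generate(max_len: int) -> list[str]:
    """Return all generated strings of length <= max_len, shortest first."""
    seen = {START}
    queue = deque([START])
    results = set()
    while queue:
        form = queue.popleft()
        # Step 2: find the leftmost variable in the current string, if any.
        idx = next((i for i, c in enumerate(form) if c in RULES), None)
        if idx is None:
            results.add(form)  # Step 3: no variables left, so this is a generated string
            continue
        for replacement in RULES[form[idx]]:
            new_form = form[:idx] + replacement + form[idx + 1:]
            # Prune: strings never shrink in this grammar, so longer forms cannot help.
            if len(new_form) <= max_len and new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return sorted(results, key=lambda w: (len(w), w))


if __name__ == "__main__":
    print(generate(5))  # ['!', '0!1', '00!11']
```

The output includes both strings derived in Examples 1 and 2, along with `!`, which comes from the derivation that applies $A \to B$ immediately.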

Parse tree for Example 1
