switch statement, lookup table, or goto instructions.
Perl programming language (featuring built-in regex support)
sed, awk, and grep POSIX commands (grep comes from g/re/p, a command in ed to run a global regular expression search and print the matched lines)
Just like an arithmetic expression \[(5 + 3) \times 4\] is a string built from numbers and operators that evaluates to a numerical value, a regular expression (regex) is a string built from symbols and special operators like \[(0 \cup 1) 0^*\] whose value is a language (a set of strings).
Most regex tools replace \(\cup\) with |, e.g., (0|1)0*. Use \(\cup\) when handwriting (e.g., on an exam) to avoid confusing | and \(1\).
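To make the notation concrete, here is a minimal sketch using Python's standard `re` module, which writes union as |. The helper name `in_language` is only illustrative. Note that `fullmatch` is used so that the whole string must belong to the language, matching the set-of-strings view above.

```python
import re

# (0 ∪ 1)0* written in the | notation used by most regex tools.
pattern = re.compile(r"(0|1)0*")

def in_language(s: str) -> bool:
    """Return True if the entire string s is in L((0 ∪ 1)0*)."""
    return pattern.fullmatch(s) is not None

candidates = ["0", "1", "10", "000", "", "01", "110"]
accepted = [s for s in candidates if in_language(s)]
print(accepted)  # ['0', '1', '10', '000']
```

The empty string is rejected because the first symbol is mandatory, and 01 is rejected because only 0s may follow the first symbol.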
Operator Precedence
* has highest precedence, followed by concatenation, then \(\cup\).
We now define the syntax (valid expressions) and semantics (languages they decide) of regexes.
Let \(\Delta = \{ \cup, (, ), *, \emptyset, \emptystring \}\) be the regex control alphabet.
Let \(\Sigma\), the input alphabet, be an alphabet such that \(\Sigma \cap \Delta = \emptyset\).
\(R \in (\Sigma \cup \Delta)^*\) is a regular expression deciding language \(L(R) \subseteq \Sigma^*\) if one of the following holds:
This is an inductive definition!
Parentheses can be omitted, in which case the precedence rules apply.
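As a small illustration of omitting parentheses, the sketch below uses Python's `re` module, which follows the usual convention that star binds tightest, then concatenation, then union. The helper `matches` and the sample strings are illustrative choices, not part of the definition.

```python
import re

def matches(regex: str, s: str) -> bool:
    """Return True if the entire string s matches regex."""
    return re.fullmatch(regex, s) is not None

implicit = "0|10*"      # no parentheses: read as 0 ∪ (1(0*))
explicit = "0|(1(0*))"  # the same expression, fully parenthesized
grouped = "(0|1)0*"     # a different expression: (0 ∪ 1)0*

samples = ["0", "1", "10", "100", "00"]

# The implicit and explicit forms agree on every sample.
assert all(matches(implicit, s) == matches(explicit, s) for s in samples)

# 00 separates the two readings: it is in L((0 ∪ 1)0*) but not in L(0 ∪ 10*).
print(matches(implicit, "00"), matches(grouped, "00"))  # False True
```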
A language \(A\) is regex-decidable if there exists a regex \(R\) such that \(L(R) = A\). In Chapter 6 we prove that a language is regex-decidable if and only if it is DFA-decidable.
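The inductive definition can be sketched as a recursive data type with one constructor per case. This assumes the usual six cases (\(\emptyset\), the empty string, a single symbol of \(\Sigma\), union, concatenation, and star); the class and function names below are hypothetical. The function `ends` computes, by recursion on the structure of the regex, every position where a match starting at position `i` can end.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Empty:            # the regex ∅, deciding the empty language
    pass

@dataclass(frozen=True)
class Eps:              # the regex for the empty string
    pass

@dataclass(frozen=True)
class Sym:              # a single symbol of the input alphabet
    char: str

@dataclass(frozen=True)
class Union:            # (R1 ∪ R2)
    left: object
    right: object

@dataclass(frozen=True)
class Concat:           # (R1 R2)
    left: object
    right: object

@dataclass(frozen=True)
class Star:             # (R1)*
    inner: object

def ends(r, s: str, i: int) -> set:
    """Return all positions j such that s[i:j] is in L(r)."""
    if isinstance(r, Empty):
        return set()
    if isinstance(r, Eps):
        return {i}
    if isinstance(r, Sym):
        return {i + 1} if s[i:i + 1] == r.char else set()
    if isinstance(r, Union):
        return ends(r.left, s, i) | ends(r.right, s, i)
    if isinstance(r, Concat):
        return {k for j in ends(r.left, s, i) for k in ends(r.right, s, j)}
    if isinstance(r, Star):
        reached, frontier = {i}, {i}   # zero repetitions always match
        while frontier:                # add one more repetition per round
            new = {k for j in frontier for k in ends(r.inner, s, j)} - reached
            reached |= new
            frontier = new
        return reached
    raise TypeError(f"not a regex: {r!r}")

def in_language(r, s: str) -> bool:
    """Return True if s is in L(r)."""
    return len(s) in ends(r, s, 0)

# (0 ∪ 1)0*
R = Concat(Union(Sym("0"), Sym("1")), Star(Sym("0")))
print(in_language(R, "100"), in_language(R, "01"))  # True False
```

The star case terminates because positions never exceed the length of the string and the set of reached positions only grows. This is a direct reading of the definition, not an efficient matcher.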
In the following examples, we assume that \(\Sigma = \{0, 1\}\).
Suppose you’re writing a compiler for a programming language.
In the first step, you need to break down the input string into tokens that could represent variable names, keywords, operators, and constants.
This is generally done by a lexer defined by regexes matching each token type.
Let’s try to do this for numerical constants with an optional fractional part and sign.
Putting everything together, we get \[ (- \cup \emptystring)(I \cup I . D^* \cup . D^+) \]
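The combined expression can be tried out in Python's regex syntax. This is a sketch under stated assumptions: \(D\) is taken to be a single decimal digit and \(I\) to be one or more digits, and the names `D`, `I`, `NUMBER`, and `is_number` are illustrative. The literal dot is escaped as `\.` because an unescaped dot matches any character in most regex tools.

```python
import re

D = "[0-9]"   # assumption: D is one decimal digit
I = f"{D}+"   # assumption: I is one or more digits

# (- ∪ ε)(I ∪ I.D* ∪ .D+), with ε expressed by making the sign optional
NUMBER = re.compile(rf"-?({I}|{I}\.{D}*|\.{D}+)")

def is_number(s: str) -> bool:
    """Return True if the entire string s is a numerical constant."""
    return NUMBER.fullmatch(s) is not None

for s in ["203", "-3.14", "5.", ".5", "-.5", ".", "-", "1.2.3"]:
    print(repr(s), is_number(s))
```

Under these assumptions the first five strings are accepted and the last three are rejected: a lone dot has no digits, a lone sign has no number, and 1.2.3 has two decimal points.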
Technically 203 is an int literal, not a float literal, since it has no decimal point. How would you modify the regex to accept only float literals?
Why is this “context-free”?
\[\begin{align*} A &\to 0 A 1 \\ A &\to B \\ B &\to\ ! \end{align*}\]
Parse tree for Example 1
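The grammar above can be explored with a short sketch. The function `derive` carries out a derivation by string substitution: it applies \(A \to 0A1\) a chosen number of times, then \(A \to B\), then \(B \to\ !\). The function `generated_by_A` checks membership by recursion that mirrors the rules. Both function names are hypothetical.

```python
def derive(n: int) -> str:
    """Apply A -> 0A1 n times, then A -> B, then B -> !."""
    sentential = "A"
    for _ in range(n):
        sentential = sentential.replace("A", "0A1")
    sentential = sentential.replace("A", "B")
    return sentential.replace("B", "!")

def generated_by_A(s: str) -> bool:
    """Return True if s can be derived from A."""
    if s == "!":                                   # A -> B, B -> !
        return True
    if len(s) >= 3 and s[0] == "0" and s[-1] == "1":
        return generated_by_A(s[1:-1])             # A -> 0A1
    return False

print([derive(n) for n in range(3)])  # ['!', '0!1', '00!11']
print(generated_by_A("00!11"), generated_by_A("0!11"))  # True False
```

Each recursive call peels off one matching 0 and 1, so the derived strings have equally many 0s before the ! as 1s after it.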