ECS 120 Theory of Computation

Nonregular Languages

Julian Panetta

University of California, Davis

Recall: Regular Languages

We have shown that the following classes of languages are equivalent:
- DFA-decidable
- NFA-decidable
- RG-decidable
- Regex-decidable
We call this class of languages the regular languages.
Are all languages regular?
- To prove a language is regular we can exhibit a DFA, NFA, regex, or RG that decides it.
- How can we prove a language is not regular?

Why we need rigor

Let $\#(a, b)$ denote the number of occurrences of $a \in \Sigma^*$ as a substring in $b \in \Sigma^*$.

Claim: the following language is not regular: \[ L_2 = \setbuild{w \in \binary^*}{\#(0, w) = \#(1, w)} \] because deciding it requires counting the number of occurrences of symbols $0$ and $1$,
needing an unbounded number of states.

Correct
Incorrect

Why we need rigor

Let $\#(a, b)$ denote the number of occurrences of $a \in \Sigma^*$ as a substring in $b \in \Sigma^*$.

Claim: the following language is not regular: \[ L_3 = \setbuild{w \in \binary^*}{\#(01, w) = \#(10, w)} \] because deciding it requires counting the number of occurrences of substrings $01$ and $10$,
needing an unbounded number of states.

Correct
Incorrect

Compares the number of “rising edges” ($\textcolor{purple}01$) and “falling edges” ($\textcolor{orange}10$).
$000\underbrace{\textcolor{purple}01}_{\textcolor{purple}1}1\underbrace{\textcolor{orange}10}_{\textcolor{orange}1}0\underbrace{\textcolor{purple}01}_{\textcolor{purple}2}111\underbrace{\textcolor{orange}10}_{\textcolor{orange}2}0\underbrace{\textcolor{purple}01}_{\textcolor{purple}3}111\underbrace{\textcolor{orange}10}_{\textcolor{orange}3}00$ vs. $\underbrace{\textcolor{orange}10}_{\textcolor{orange}1}00\underbrace{\textcolor{purple}01}_{\textcolor{purple}1}1\underbrace{\textcolor{orange}10}_{\textcolor{orange}2}0\underbrace{\textcolor{purple}01}_{\textcolor{purple}2}111\underbrace{\textcolor{orange}10}_{\textcolor{orange}3}0\underbrace{\textcolor{purple}01}_{\textcolor{purple}3}111\underbrace{\textcolor{orange}10}_{\textcolor{orange}4}00$
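Rising and falling edges must alternate, so the two counts are equal exactly when $w$ starts and ends with the same symbol. Hence $L_3$ is the regular language $\{\emptystring\} \cup \setbuild{w \in \binary^+}{w[1] = w[|w|]}$.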
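The first-symbol/last-symbol characterization of $L_3$ can be sanity-checked by brute force. The following is a minimal sketch, not a proof: the helper names (`in_L3`, `same_ends`, `all_binary_strings`) and the length bound of 10 are assumptions made for illustration.

```python
from itertools import product

def in_L3(w: str) -> bool:
    # "01" and "10" cannot overlap with themselves, so str.count
    # (which counts non-overlapping matches) gives the exact counts.
    return w.count("01") == w.count("10")

def same_ends(w: str) -> bool:
    # The empty string is handled separately: it has no first or last symbol.
    return w == "" or w[0] == w[-1]

def all_binary_strings(max_len: int):
    # Yield every string over {0, 1} of length 0..max_len.
    for n in range(max_len + 1):
        for symbols in product("01", repeat=n):
            yield "".join(symbols)

# Collect every string on which the two descriptions disagree.
mismatches = [w for w in all_binary_strings(10) if in_L3(w) != same_ends(w)]
print(mismatches)  # prints []
```

An empty list only confirms agreement up to the chosen bound; the alternation argument is what establishes it for all strings.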

Two rigorous but ad-hoc proofs

Let’s show (for real this time) that the following language is not regular (DFA-decidable): \[ L_1 = \setbuild{0^n 1^n}{n \in \mathbb{N}} = \{\emptystring, 01, 0011, 000111, \ldots\} \]

Suppose for the sake of contradiction that there is a DFA $D = (Q, \Sigma, \delta, s, F)$ deciding $L_1$.
Consider a string $0^n 1^n \in L_1$ with $n \ge |Q|$.
- The DFA must accept this string with a computation sequence: \[ r_0 \transition{0} r_1 \transition{0} \cdots \transition{0} r_{n} \transition{1} r_{n + 1} \transition{1} \cdots \transition{1} r_{2n} \] where $r_0 = s$ and $r_{2n} \in F$.
- Since $n + 1 > |Q|$ states appear in $(r_0, r_1, \ldots, r_n)$, at least one must repeat (Pigeonhole Principle), say $r_i = r_j$ for some $0 \le i < j \le n$: \[ r_0 \cdots \overbrace{\underbrace{r_i \transition{0} \cdots \transition{0} r_j}_{r_i = r_j}}^{\text{substring $y=0^{j-i}$ follows a cycle}} \cdots r_n \transition{1} \cdots \transition{1} r_{2n} \]
- The nonempty cycle $r_i \transition{0} \cdots \transition{0} r_{j}$ can be traversed $k$ more times (identifying $r_j$ with $r_i$), yielding a new computation sequence that accepts the string $x=0^{n + (j - i) k} 1^n$ for any $k \in \mathbb{N}^+$.
- But since $j > i$ and $k \ge 1$, we have $n+(j-i)k \ne n$ and so $x \notin L_1$. This contradicts $L(D) = L_1$, because $D$ accepts $x$.

This is the idea behind the Pumping Lemma (optional sections 7.7-7.9).
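The pigeonhole-and-pump argument above can be traced on a concrete machine. The sketch below uses a hypothetical 3-state DFA (it tracks the number of 0s minus the number of 1s modulo 3) chosen purely for illustration; `run` and `pump` are assumed helper names, and a single extra traversal of the cycle ($k = 1$) is used.

```python
def run(delta, state, w):
    """Return the state reached from `state` after reading the string w."""
    for symbol in w:
        state = delta[(state, symbol)]
    return state

def pump(num_states, delta, start):
    """Return (0^n 1^n, 0^(n+(j-i)) 1^n), where r_i = r_j is a repeated state."""
    n = num_states
    # States r_0, ..., r_n visited while reading 0^n.
    visited = [start]
    for _ in range(n):
        visited.append(delta[(visited[-1], "0")])
    # Pigeonhole: n + 1 visited states, only n distinct ones available.
    first_seen = {}
    for j, r in enumerate(visited):
        if r in first_seen:
            i = first_seen[r]
            break
        first_seen[r] = j
    return "0" * n + "1" * n, "0" * (n + (j - i)) + "1" * n

# Hypothetical DFA: state = (#0s - #1s) mod 3, accepting state 0.
delta = {(q, "0"): (q + 1) % 3 for q in range(3)}
delta.update({(q, "1"): (q - 1) % 3 for q in range(3)})
start, accepting = 0, {0}

original, pumped = pump(3, delta, start)
print(original, run(delta, start, original) in accepting)  # 000111 True
print(pumped, run(delta, start, pumped) in accepting)      # 000000111 True
```

This DFA accepts $000111 \in L_1$ but also the pumped string $000000111 \notin L_1$, so it does not decide $L_1$. The proof shows the same failure occurs for every DFA.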

Two Rigorous but Ad-hoc Proofs

Let’s show (a different way) that the following language is not regular (DFA-decidable): \[ L_1 = \setbuild{0^n 1^n}{n \in \mathbb{N}} = \{\emptystring, 01, 0011, 000111, \ldots\} \]

Suppose again we have a DFA $D = (Q, \Sigma, \delta, s, F)$ deciding $L_1$.
Consider processing two different strings $x, y \in \Sigma^*$ on separate copies of $D$.
Now suppose that by feeding the suffix string $z \in \Sigma^*$ into both copies we find that: \[ xz \in L_1 \quad \text{but} \quad yz \notin L_1. \]
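Then running $D$ on $x$ must have reached a different state than running it on $y$! Otherwise both copies would read $z$ starting from the same state, end in the same state, and give the same answer on $xz$ and $yz$.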
Let’s apply this logic to the strings $0^i$ and $0^j$ for any $i \ne j$:
- Taking $z = 1^i$, we find: \[ 0^i 1^i \in L_1 \quad \text{but} \quad 0^j 1^i \notin L_1. \]
- Therefore $D$ must be in a different state after processing $0^i$ than after processing $0^j$.
- In other words, reading $0^i$ must put $D$ into a distinct state for each $i \in \mathbb{N}$.
- This means $D$ must have an infinite number of states, which is disallowed for DFAs!

This is the idea behind the Myhill-Nerode Theorem (sections 7.2-7.5).
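To see the distinct-states argument fail for a finite machine, the sketch below takes a hypothetical 4-state "saturating counter" DFA, finds two prefixes $0^i$ and $0^j$ that land in the same state, and feeds both the suffix $z = 1^i$. The DFA and the helper names `run` and `colliding_prefixes` are assumptions for illustration only.

```python
def run(delta, state, w):
    """Return the state reached from `state` after reading the string w."""
    for symbol in w:
        state = delta[(state, symbol)]
    return state

def colliding_prefixes(num_states, delta, start):
    """Find i < j such that 0^i and 0^j lead to the same state."""
    seen = {}  # state -> number of 0s read when it was first reached
    state = start
    for j in range(num_states + 1):
        if state in seen:
            return seen[state], j
        seen[state] = j
        state = delta[(state, "0")]
    raise AssertionError("unreachable: pigeonhole guarantees a collision")

# Hypothetical DFA: counts up on 0 (capped at 3), down on 1 (floored at 0).
delta = {}
for q in range(4):
    delta[(q, "0")] = min(q + 1, 3)
    delta[(q, "1")] = max(q - 1, 0)
start, accepting = 0, {0}

i, j = colliding_prefixes(4, delta, start)
z = "1" * i
accepts_x = run(delta, start, "0" * i + z) in accepting
accepts_y = run(delta, start, "0" * j + z) in accepting
print(i, j, accepts_x, accepts_y)  # 3 4 True True
```

Because $0^3$ and $0^4$ reach the same state, the machine gives the same answer on $0^3 1^3 \in L_1$ and $0^4 1^3 \notin L_1$, so it misclassifies one of them.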

Separating Extensions and $L$-equivalence

First, some definitions:

Given a language $L \subseteq \Sigma^*$,
the strings $x, y \in \Sigma^*$ are called $L$-separable or $L$-distinguishable
if there exists a string $z \in \Sigma^*$ such that:

\[ xz \in L \iff yz \notin L, \]

that is, exactly one of $xz$ and $yz$ is in $L$.

This $z$ is called a separating (or distinguishing) extension for $x, y$.

If $x$ and $y$ are not $L$-separable,
then we say they are $L$-equivalent, denoted $x \sim_L y$.

In other words, $x \sim_L y$ means that for all $z \in \Sigma^*$: \[ xz \in L \iff yz \in L \]

Note that $\sim_L$ is an equivalence relation on $\Sigma^*$, so it partitions $\Sigma^*$ into equivalence classes of strings that are indistinguishable by $L$.

Separating Extensions and $L$-equivalence: Examples

\[ L_1 = \setbuild{0^n 1^n}{n \in \mathbb{N}} = \{\emptystring, 01, 0011, 000111, \ldots\} \]

Do strings $00$ and $000$ have a separating extension with respect to $L_1$?
- $z = 11$ is one: $0011 \in L_1$ but $00011 \notin L_1$.
- $z = 111$ is another: $00111 \notin L_1$ but $000111 \in L_1$.
- Thus $00 \not \sim_{L_1} 000$.
Do strings $010$ and $0010$ have a separating extension with respect to $L_1$?
- No, because no string of the form $0^+1^+0^+$ is a prefix of any string in $L_1$, i.e.: \[ (\forall z \in \Sigma^*) \quad 010z \notin L_1 \quad \text{and} \quad 0010z \notin L_1 \]
- Thus $010 \sim_{L_1} 0010$.
How many equivalence classes are there for $\sim_{L_1}$?
- $\emptystring, 0, 00, 000, \ldots$ are all mutually $L_1$-inequivalent. \[ x=0^i \text{ and } y=0^k \text{ have the separating extension } z = 1^i \text{ for all } 0 \le i < k \]
- Therefore there are an infinite number of equivalence classes.
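The examples above can be reproduced with a bounded search for separating extensions. This is a minimal sketch: `in_L1`, `separating_extension`, and the bound `max_len` are assumed names, and a result of `None` only means no separating extension exists up to that length, which by itself does not prove $L$-equivalence.

```python
from itertools import product

def in_L1(w: str) -> bool:
    # w is in L_1 iff it is 0^n 1^n for n = |w| / 2.
    n = len(w) // 2
    return w == "0" * n + "1" * n

def separating_extension(x, y, in_L, max_len=8):
    """Return a shortest z with exactly one of xz, yz in L, or None if none is found."""
    for length in range(max_len + 1):
        for symbols in product("01", repeat=length):
            z = "".join(symbols)
            if in_L(x + z) != in_L(y + z):
                return z
    return None

print(separating_extension("00", "000", in_L1))    # 11
print(separating_extension("010", "0010", in_L1))  # None
```

For $010$ and $0010$ the search finds nothing, matching the argument that no extension of either string lies in $L_1$.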

Myhill-Nerode Theorem

Theorem (Myhill-Nerode):
A language $L$ is regular if and only if $\sim_L$ defines a finite number of equivalence classes.

Furthermore, the number of equivalence classes equals the number of states in the
minimal DFA deciding $L$.

To prove a language is nonregular we need only one direction of this theorem:

Corollary:
If a language $L$ has an infinite number of equivalence classes with respect to $\sim_L$,
then $L$ is nonregular.

Equivalent corollary:
If for language $L$ we can construct an infinite set $S$ of pairwise $L$-inequivalent strings,
then $L$ is nonregular.
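The finite-versus-infinite dichotomy can be explored empirically by grouping short strings according to how they behave under all short extensions. The sketch below is an approximation under stated assumptions: `approx_class_count` and its two bounds are hypothetical names, and a bounded experiment can only suggest, never prove, how many classes $\sim_L$ has. It uses $L_1 = \{0^n 1^n\}$ and the language $L_3$ of strings with equally many occurrences of $01$ and $10$.

```python
from itertools import product

def strings_up_to(max_len):
    # Yield every string over {0, 1} of length 0..max_len.
    for n in range(max_len + 1):
        for symbols in product("01", repeat=n):
            yield "".join(symbols)

def in_L1(w):
    n = len(w) // 2
    return w == "0" * n + "1" * n

def in_L3(w):
    return w.count("01") == w.count("10")

def approx_class_count(in_L, prefix_len, ext_len):
    """Count distinct membership patterns of prefixes over all short extensions."""
    extensions = list(strings_up_to(ext_len))
    signatures = {
        tuple(in_L(x + z) for z in extensions)
        for x in strings_up_to(prefix_len)
    }
    return len(signatures)

for bound in (2, 4, 6):
    print(bound,
          approx_class_count(in_L3, bound, bound),
          approx_class_count(in_L1, bound, bound))
```

For $L_3$ the count stays at 5 as the bound grows, consistent with a finite number of classes, while for $L_1$ it keeps increasing with the bound.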

Myhill-Nerode Theorem: Examples

\[ L_1 = \setbuild{0^n 1^n}{n \in \mathbb{N}} = \{\emptystring, 01, 0011, 000111, \ldots\} \]

Claim: $L_1$ is nonregular.

Proof:

Define $S = \setbuild{0^n}{n \in \mathbb{N}} = \{\emptystring, 0, 00, 000, \ldots\}$.
Each pair of strings $0^i, 0^k \in S$ with $i \ne k$ has a separating extension $1^i$: \[ 0^i 1^i \in L_1 \quad \text{but} \quad 0^k 1^i \notin L_1, \] meaning $0^i \not \sim_{L_1} 0^k$.
Therefore, $L_1$ is nonregular by the Myhill-Nerode theorem.

Myhill-Nerode Theorem: Examples

\[ L_2 = \setbuild{w \in \binary^*}{\#(0, w) = \#(1, w)} \]

Claim: $L_2$ is nonregular.

Proof:

Define $S = \setbuild{0^n}{n \in \mathbb{N}} = \{\emptystring, 0, 00, 000, \ldots\}$.
Each pair of strings $0^i, 0^k \in S$ with $i \ne k$ has a separating extension $1^i$: \[ 0^i 1^i \in L_2 \quad \text{but} \quad 0^k 1^i \notin L_2, \] meaning $0^i \not \sim_{L_2} 0^k$.
Therefore, $L_2$ is nonregular by the Myhill-Nerode theorem.

The exact same proof as for $L_1$ applies to $L_2$!
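The claim that $z = 1^i$ separates $0^i$ from $0^k$ in both languages can be spot-checked mechanically. This is a sketch with assumed helper names (`in_L1`, `in_L2`, `separated_by`) and an arbitrary bound of 30; it tests finitely many pairs, whereas the proof covers all of them.

```python
def in_L1(w):
    n = len(w) // 2
    return w == "0" * n + "1" * n

def in_L2(w):
    return w.count("0") == w.count("1")

def separated_by(x, y, z, in_L):
    # True when exactly one of xz and yz belongs to the language.
    return in_L(x + z) != in_L(y + z)

bound = 30
pairs_ok = all(
    separated_by("0" * i, "0" * k, "1" * i, in_L)
    for in_L in (in_L1, in_L2)
    for i in range(bound)
    for k in range(bound)
    if i != k
)
print(pairs_ok)  # True
```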

Myhill-Nerode Theorem: Examples

\[ L_2 = \setbuild{w \in \binary^*}{\#(0, w) = \#(1, w)} \]

Claim: $L_2$ is nonregular.

Alternate proof by closure:

Assume for the sake of contradiction that $L_2$ is regular.
Note that the language $L(0^* 1^*)$ of the regex $0^* 1^*$ is regular,
and $L_2 \cap L(0^* 1^*) = \setbuild{0^n 1^n}{n \in \mathbb{N}} = L_1$.
By closure of DFA-decidable languages under intersection,
$L_1$ must also be regular.
But this contradicts our proof that $L_1$ is nonregular, so $L_2$ must actually be nonregular!
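The intersection identity used in the closure proof can be checked on all short strings. The sketch below relies on Python's standard `re.fullmatch` for the regex $0^*1^*$; the helper names and the length bound of 12 are assumptions for illustration.

```python
import re
from itertools import product

def in_L1(w):
    n = len(w) // 2
    return w == "0" * n + "1" * n

def in_L2(w):
    return w.count("0") == w.count("1")

def matches_0s_then_1s(w):
    # Membership in the language of the regex 0*1*.
    return re.fullmatch(r"0*1*", w) is not None

def strings_up_to(max_len):
    for n in range(max_len + 1):
        for symbols in product("01", repeat=n):
            yield "".join(symbols)

# Strings on which "in L_2 and in L(0*1*)" disagrees with "in L_1".
disagreements = [
    w for w in strings_up_to(12)
    if (in_L2(w) and matches_0s_then_1s(w)) != in_L1(w)
]
print(disagreements)  # prints []
```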
