ECS 120 Theory of Computation

Nonregular Languages

Julian Panetta

University of California, Davis

Recall: Regular Languages

We have shown that the following classes of languages are equivalent, and we call them regular:
- DFA-decidable
- NFA-decidable
- RG-decidable
- regex-decidable
Are all languages regular?
- To prove a language is regular we can exhibit a DFA, NFA, regex, or RG that decides it.
- How can we prove a language is not regular?

Why we need rigor

Let $\#(a, b)$ denote the number of occurrences of $a \in \Sigma^*$ as a substring in $b \in \Sigma^*$.

Claim: the following language is not regular: \[ L_2 = \setbuild{w \in \binary^*}{\#(0, w) = \#(1, w)} \] because deciding it requires counting the number of occurrences of symbols $0$ and $1$,
needing an unbounded number of states.

Correct
Incorrect

Why we need rigor

Let $\#(a, b)$ denote the number of occurrences of $a \in \Sigma^*$ as a substring in $b \in \Sigma^*$.

Claim: the following language is not regular: \[ L_3 = \setbuild{w \in \binary^*}{\#(01, w) = \#(10, w)} \] because deciding it requires counting the number of occurrences of substrings $01$ and $10$,
needing an unbounded number of states.

Correct
Incorrect

Compare the number of “rising edges” ($\textcolor{purple}01$) and “falling edges” ($\textcolor{red}10$).
$000\underbrace{\textcolor{purple}01}_{\textcolor{purple}1}1\underbrace{\textcolor{red}10}_{\textcolor{red}1}0\underbrace{\textcolor{purple}01}_{\textcolor{purple}2}111\underbrace{\textcolor{red}10}_{\textcolor{red}2}0\underbrace{\textcolor{purple}01}_{\textcolor{purple}3}111\underbrace{\textcolor{red}10}_{\textcolor{red}3}00$ $\quad$ vs. $\quad$ $\underbrace{\textcolor{red}10}_{\textcolor{red}1}00\underbrace{\textcolor{purple}01}_{\textcolor{purple}1}1\underbrace{\textcolor{red}10}_{\textcolor{red}2}0\underbrace{\textcolor{purple}01}_{\textcolor{purple}2}111\underbrace{\textcolor{red}10}_{\textcolor{red}3}0\underbrace{\textcolor{purple}01}_{\textcolor{purple}3}111\underbrace{\textcolor{red}10}_{\textcolor{red}4}00$
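Rising and falling edges alternate, so the two counts differ by at most one and agree exactly when $w$ starts and ends with the same symbol.
Hence $L_3$ is regular: $L_3 = \{\emptystring\} \cup \setbuild{w \in \binary^*}{|w| \ge 1 \text{ and } w[1] = w[|w|]}$.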
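The claimed characterization of $L_3$ can be sanity-checked by brute force. The sketch below is only a bounded check (all binary strings up to length 12), not a proof; the helper names `count_occurrences`, `in_L3`, and `first_equals_last` are illustrative and not from the slides.

```python
from itertools import product

def count_occurrences(a: str, b: str) -> int:
    """Number of (possibly overlapping) occurrences of a as a substring of b."""
    return sum(1 for i in range(len(b) - len(a) + 1) if b[i:i + len(a)] == a)

def in_L3(w: str) -> bool:
    """Membership in L_3: equally many occurrences of 01 and 10."""
    return count_occurrences("01", w) == count_occurrences("10", w)

def first_equals_last(w: str) -> bool:
    # The empty string has no edges of either kind, so it is treated as a member.
    return w == "" or w[0] == w[-1]

def all_binary_strings(max_len: int):
    """Yield every binary string of length 0..max_len."""
    for n in range(max_len + 1):
        for symbols in product("01", repeat=n):
            yield "".join(symbols)

# Strings on which the two descriptions disagree (none are expected).
mismatches = [w for w in all_binary_strings(12) if in_L3(w) != first_equals_last(w)]
print(mismatches)  # []
```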

A rigorous (ad-hoc) proof

Let’s show (for real this time) that the following language is not regular (DFA-decidable): \[ L_1 = \setbuild{0^n 1^n}{n \in \mathbb{N}} = \{\emptystring, 01, 0011, 000111, 00001111, \dots\} \]

Suppose for the sake of contradiction that there is a DFA $D = (Q, \Sigma, \delta, s, F)$ deciding $L_1$.
Consider processing two different strings $x, y \in \Sigma^*$ on separate copies of $D$.
Now suppose that by feeding the suffix string $z \in \Sigma^*$ into both copies we find that: \[ xz \in L_1 \quad \text{but} \quad yz \notin L_1. \]
Then running $D$ on $x$ must have reached a different state than running it on $y$!
Let’s apply this logic to the strings $x=0^i$ and $y=0^j$ for any $i \ne j$:
- Taking $z = 1^i$, we find: \[ xz=0^i 1^i \in L_1 \quad \text{but} \quad yz=0^j 1^i \notin L_1. \]
- Therefore $D$ must be in a different state after processing $0^i$ than after processing $0^j$.
- In other words, reading $0^i$ must put $D$ into a distinct state for each of the infinitely many $i \in \mathbb{N}$.
- But $D$ has only a finite number of states, a contradiction.

This is the idea behind the Myhill-Nerode Theorem: Sections 7.2-7.5

Separating extensions and $L$-equivalence

First, some definitions:

Given a language $L \subseteq \Sigma^*$,
the strings $x, y \in \Sigma^*$ are called $L$-separable or $L$-distinguishable if there is a string $z \in \Sigma^*$ such that:

\[ xz \in L \iff yz \notin L \]

Exactly one of $xz$ or $yz$ is in $L$.
($xz\in L\ $ XOR $\ yz\in L$)

This $z$ is called a separating (or distinguishing) extension for $x$ and $y$.

If $x$ and $y$ are not $L$-separable,
then we say they are $L$-equivalent, denoted $x \sim_L y$.

In other words, $x \sim_L y$ means that for all $z \in \Sigma^*$: \[ xz \in L \iff yz \in L \]

Both $xz$ and $yz$ are in $L$, or both are not.

Note that $\sim_L$ is an equivalence relation on $\Sigma^*$, so it partitions $\Sigma^*$ into equivalence classes of strings that are indistinguishable by $L$.

Separating extensions and $L$-equivalence: Examples

\[ L_1 = \setbuild{0^n 1^n}{n \in \mathbb{N}} = \{\emptystring, 01, 0011, 000111, \ldots\} \]

Do strings $00$ and $000$ have a separating extension with respect to $L_1$?
- $z = 11$ is one: $\quad\ \qquad 00 11\ \ \in L_1 \ \ \ \ \, \text{but} \quad 00011\ \ \notin L_1$.
- $z = 111$ is another: $\quad 00 111 \notin L_1 \quad \text{but} \quad 000111 \in L_1$.
- Thus $00 \not \sim_{L_1} 000$.
Do strings $010$ and $0010$ have a separating extension with respect to $L_1$?
- No, because no string of the form $0^+1^+0^+$ is a prefix of any string in $L_1$, i.e.: \[ (\forall z \in \Sigma^*) \quad 010z \notin L_1 \quad \text{and} \quad 0010z \notin L_1 \]
- Thus $010 \sim_{L_1} 0010$.

How many equivalence classes are there for $\sim_{L_1}$?
- $\emptystring, 0, 00, 000, \ldots$ are pairwise $L_1$-inequivalent: for all $i \ne k$, the strings $x=0^i$ and $y=0^k$ have the separating extension $z = 1^i$.
- Therefore there are an infinite number of equivalence classes.
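The two questions above can be explored with a bounded search for a separating extension. This is a sketch under assumptions: `find_separating_extension` and `in_L1` are hypothetical helper names, and a result of `None` only means no extension of length at most `max_len` exists, which is evidence for (not a proof of) $L_1$-equivalence.

```python
from itertools import product

def in_L1(w: str) -> bool:
    """Membership in L_1 = {0^n 1^n : n in N}."""
    n = len(w) // 2
    return w == "0" * n + "1" * n

def find_separating_extension(x, y, in_L, max_len=8):
    """Return the first z (shortest first) with exactly one of xz, yz in L, else None."""
    for n in range(max_len + 1):
        for symbols in product("01", repeat=n):
            z = "".join(symbols)
            if in_L(x + z) != in_L(y + z):
                return z
    return None

print(find_separating_extension("00", "000", in_L1))    # 11
print(find_separating_extension("010", "0010", in_L1))  # None
```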

Myhill-Nerode Theorem

Theorem (Myhill-Nerode):
A language $L$ is regular if and only if $\sim_L$ defines a finite number of equivalence classes.

Furthermore, the number of equivalence classes equals the number of states in the
minimal DFA deciding $L$.

To prove a language is nonregular we need only one direction of this theorem:

Corollary:
If a language $L$ has an infinite number of equivalence classes with respect to $\sim_L$,
then $L$ is nonregular.

Equivalent corollary:
If for language $L$ we can construct an infinite set $S$ of pairwise $L$-inequivalent strings,
then $L$ is nonregular.

Myhill-Nerode Theorem: Examples

\[ L_1 = \setbuild{0^n 1^n}{n \in \mathbb{N}} = \{\emptystring, 01, 0011, 000111, \ldots\} \]

Claim: $L_1$ is nonregular.

Proof:

Define $S = \setbuild{0^n}{n \in \mathbb{N}} = \{\emptystring, 0, 00, 000, \ldots\}$.
Each pair of strings $0^i, 0^k \in S$ with $i \ne k$ has a separating extension $1^i$: \[ 0^i 1^i \in L_1 \quad \text{but} \quad 0^k 1^i \notin L_1, \] meaning $0^i \not \sim_{L_1} 0^k$.
Therefore, $L_1$ is nonregular by the Myhill-Nerode theorem.

Myhill-Nerode Theorem: Examples

\[ L_2 = \setbuild{w \in \binary^*}{\#(0, w) = \#(1, w)} \]

Claim: $L_2$ is nonregular.

Proof:

Define $S = \setbuild{0^n}{n \in \mathbb{N}} = \{\emptystring, 0, 00, 000, \ldots\}$.
Each pair of strings $0^i, 0^k \in S$ with $i \ne k$ has a separating extension $1^i$: \[ 0^i 1^i \in L_2 \quad \text{but} \quad 0^k 1^i \notin L_2, \] meaning $0^i \not \sim_{L_2} 0^k$.
Therefore, $L_2$ is nonregular by the Myhill-Nerode theorem.

The exact same proof used for $L_1$ applies to $L_2$!
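As a quick check that one choice of $S$ and one family of extensions serves both languages, the sketch below tests the extension $1^i$ on every pair $0^i, 0^k$ with $i \ne k$ below a bound. The function names and the bound of 30 are assumptions made for illustration; a finite check does not replace the proof.

```python
def in_L1(w: str) -> bool:
    """Membership in L_1 = {0^n 1^n : n in N}."""
    n = len(w) // 2
    return w == "0" * n + "1" * n

def in_L2(w: str) -> bool:
    """Membership in L_2: equally many 0s and 1s."""
    return w.count("0") == w.count("1")

def pairwise_separated(in_L, bound=30) -> bool:
    """Check that z = 1^i puts 0^i z in L and 0^k z outside L for all i != k < bound."""
    for i in range(bound):
        for k in range(bound):
            if i == k:
                continue
            z = "1" * i
            if not in_L("0" * i + z) or in_L("0" * k + z):
                return False
    return True

print(pairwise_separated(in_L1), pairwise_separated(in_L2))  # True True
```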

Myhill-Nerode Theorem: Examples

\[ L_2 = \setbuild{w \in \binary^*}{\#(0, w) = \#(1, w)} \]

Claim: $L_2$ is nonregular.

Alternate proof using closure properties:

Note that $L(0^* 1^*)$ is regular,
and $L_2 \cap L(0^* 1^*) = \fragment{\setbuild{0^n 1^n}{n \in \mathbb{N}}= L_1}$.
By closure of DFA-decidable languages under intersection, if $L_2$ were regular, $L_1$ would also be regular.
But $L_1$ is nonregular, so $L_2$ must be nonregular also!
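The intersection identity used in this argument can be checked on all short strings. The sketch below uses Python's `re.fullmatch` for the regex $0^* 1^*$; the membership helpers are illustrative names, and the length bound of 12 is an arbitrary assumption.

```python
import re
from itertools import product

def in_L1(w: str) -> bool:
    """Membership in L_1 = {0^n 1^n : n in N}."""
    n = len(w) // 2
    return w == "0" * n + "1" * n

def in_L2(w: str) -> bool:
    """Membership in L_2: equally many 0s and 1s."""
    return w.count("0") == w.count("1")

def in_zeros_then_ones(w: str) -> bool:
    """Membership in L(0* 1*)."""
    return re.fullmatch(r"0*1*", w) is not None

strings = ["".join(t) for n in range(13) for t in product("01", repeat=n)]

# L_2 intersected with L(0* 1*) should coincide with L_1 on every tested string.
agree = all((in_L2(w) and in_zeros_then_ones(w)) == in_L1(w) for w in strings)
print(agree)  # True
```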

More examples applying the Myhill-Nerode Theorem

Claim: the following language is not regular: \[ L_4 = \setbuild{w w}{w \in \binary^*} \]

Proof:

Let $S = \setbuild{0^n 1}{n \in \mathbb{N}} = \{1, 01, 001, 0001, \ldots\}$.
Any pair $0^i 1, 0^k 1 \in S$ with $i \ne k$ is $L_4$-inequivalent $\quad (0^i 1 \not \sim_{L_4} 0^k 1)$.
Separating extension? $z = 0^i 1$

\[ 0^i 1 z \fragment{= 0^i 1 0^i 1} \fragment{\in L_4 \quad \text{but} \quad 0^k 1 z =} \fragment{0^k 1 0^i 1} \fragment{\notin L_4\quad\text{since } i \ne k} \]
Therefore every element of $S$ is in a different equivalence class.
Since $|S| = \infty$, $L_4$ is nonregular.
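The separating extension $z = 0^i 1$ can be checked mechanically for small $i$ and $k$. This is a minimal sketch with hypothetical helper names (`in_L4`, `separated`) and an arbitrary bound of 25.

```python
def in_L4(w: str) -> bool:
    """Membership in L_4 = {ww : w in {0,1}*}."""
    half, remainder = divmod(len(w), 2)
    return remainder == 0 and w[:half] == w[half:]

def separated(i: int, k: int) -> bool:
    """True if z = 0^i 1 puts 0^i 1 z in L_4 and 0^k 1 z outside L_4."""
    z = "0" * i + "1"
    return in_L4("0" * i + "1" + z) and not in_L4("0" * k + "1" + z)

all_separated = all(separated(i, k) for i in range(25) for k in range(25) if i != k)
print(all_separated)  # True
```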

More examples applying the Myhill-Nerode Theorem

Claim: the following language is not regular: \[ L_5 = \setbuild{0^i 1^j}{i > j} \]

Proof:

Let $S = \setbuild{0^n}{n \in \mathbb{N}} = \{\emptystring, 0, 00, 000, \ldots\}$.
Any pair $0^i, 0^k \in S$ with $i > k$ is $L_5$-inequivalent $\quad (0^i \not \sim_{L_5} 0^k)$.
Separating extension?
- $1^k$
- $1^i$
- $1^{k + 1}$
- $1^{i - 1}$
- $01^{k + 1}$
\[ 0^i 1^k \in L_5 \quad \text{but} \quad 0^k 1^k \notin L_5 \]
Since $|S| = \infty$, $L_5$ is nonregular.
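Each candidate extension listed above can be tested against all pairs $0^i, 0^k$ with $i > k$ below a bound. The sketch below is one way to do that; the dictionary keys are just labels for the five candidates, the helper names are hypothetical, and a candidate reported as `True` has only been checked up to the bound. Besides $1^k$, which the slide uses, the check suggests that $1^{i-1}$ and $01^{k+1}$ also separate every tested pair.

```python
def in_L5(w: str) -> bool:
    """Membership in L_5 = {0^i 1^j : i > j}."""
    i = len(w) - len(w.lstrip("0"))  # number of leading 0s
    j = len(w) - i
    return w == "0" * i + "1" * j and i > j

# Each candidate maps a pair (i, k) with i > k to an extension z.
candidates = {
    "1^k": lambda i, k: "1" * k,
    "1^i": lambda i, k: "1" * i,
    "1^(k+1)": lambda i, k: "1" * (k + 1),
    "1^(i-1)": lambda i, k: "1" * (i - 1),
    "0 1^(k+1)": lambda i, k: "0" + "1" * (k + 1),
}

def separates_all(make_z, bound=20) -> bool:
    """True if make_z(i, k) separates 0^i from 0^k for every k < i < bound."""
    return all(
        in_L5("0" * i + make_z(i, k)) != in_L5("0" * k + make_z(i, k))
        for i in range(bound)
        for k in range(i)
    )

results = {name: separates_all(make_z) for name, make_z in candidates.items()}
print(results)
```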

More examples applying the Myhill-Nerode Theorem

Claim: the following language is not regular: \[ L_6 = \setbuild{0^i 1^j 0^k}{i \cdot j = k} = \{\emptystring, 010, 00100, 01100, \ldots\} \]

Proof:

Let $S = \setbuild{0^n}{n \in \mathbb{N}}$.
Consider any pair $0^i, 0^k \in S$ with $i \ne k$. Separating extension?
- $0^i$
- $10^i$
- $10^k$
- $110^{2i}$
- $110^{i + k}$
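For example, $z = 10^i$ works: \[ 0^i 1 0^i \in L_6 \quad \text{but} \quad 0^k 1 0^i \notin L_6 \quad \text{since } k \cdot 1 \ne i. \]
Since $|S| = \infty$, $L_6$ is nonregular.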
Alternative choices for $S$: \[ S = \setbuild{0^n 1^n}{n \in \mathbb{N}}, \quad \quad S = \setbuild{0^n 1}{n \in \mathbb{N}}, \quad \quad \cdots \]
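The candidate extensions for $S = \setbuild{0^n}{n \in \mathbb{N}}$ can be compared with a bounded check. In the sketch below, `in_L6` parses a string as $0^i 1^j 0^k$ with a regular expression whose first group greedily takes all leading 0s (for $j = 0$ this is the only split that can satisfy $i \cdot j = k$). All names and the bound of 12 are assumptions made for illustration.

```python
import re

def in_L6(w: str) -> bool:
    """Membership in L_6 = {0^i 1^j 0^k : i * j = k}."""
    m = re.fullmatch(r"(0*)(1*)(0*)", w)
    if m is None:
        return False
    i, j, k = (len(group) for group in m.groups())
    return i * j == k

# Each candidate maps a pair (i, k) with i != k to an extension z.
candidates = {
    "0^i": lambda i, k: "0" * i,
    "1 0^i": lambda i, k: "1" + "0" * i,
    "1 0^k": lambda i, k: "1" + "0" * k,
    "11 0^(2i)": lambda i, k: "11" + "0" * (2 * i),
    "11 0^(i+k)": lambda i, k: "11" + "0" * (i + k),
}

def separates_all(make_z, bound=12) -> bool:
    """True if make_z(i, k) separates 0^i from 0^k for every i != k below bound."""
    return all(
        in_L6("0" * i + make_z(i, k)) != in_L6("0" * k + make_z(i, k))
        for i in range(bound)
        for k in range(bound)
        if i != k
    )

results = {name: separates_all(make_z) for name, make_z in candidates.items()}
print(results)
```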

More examples applying the Myhill-Nerode Theorem

Claim: the following language is not regular: \[ L_7 = \setbuild{1^{n^2}}{n \in \mathbb{N}} = \{\emptystring, 1, 1111, 111111111, \ldots\} \]

Proof:

Let $S = L_7$.
Consider any pair $1^{i^2}, 1^{k^2} \in S$ with $i > k$.
Separating extension that works for all such pairs?
- $\emptystring$
- $1$
- $1^k$
- $1^{(k + 1)^2 - k^2}$
- $1^{(i + 1)^2 - i^2}$
The gap between adjacent perfect squares is $(k + 1)^2 - k^2 = 2k + 1$, which strictly increases with $k$. \[ 0,\; 1,\; 4,\; 9,\; 16,\; 25,\; 36,\; 49,\; 64,\; \ldots \]

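Since $i>k$, adding $2k + 1$ to $i^2$ is not enough to reach $(i + 1)^2$, so $z = 1^{2k + 1}$ is a separating extension: \[ 1^{k^2} z = 1^{(k + 1)^2} \in L_7 \quad \text{but} \quad 1^{i^2} z = 1^{i^2 + 2k + 1} \notin L_7 \quad \text{since } i^2 < i^2 + 2k + 1 < (i + 1)^2. \]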

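In contrast, $z = 1^{2i + 1}$ does not always work: adding $2i + 1$ to $k^2$ could reach $(k + c)^2$ for some $c > 1$.
Example: $k = 0, i = 4, c = 3$ gives $0 + 9 = 3^2$, so both extended strings are in $L_7$.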
Since $|S| = \infty$, $L_7$ is nonregular.
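The five candidate extensions can be compared on all pairs $1^{i^2}, 1^{k^2}$ with $i > k$ below a bound. The sketch below uses `math.isqrt` to test for perfect squares; the labels, helper names, and the bound of 30 are illustrative assumptions, and the check is finite evidence rather than a proof.

```python
import math

def in_L7(w: str) -> bool:
    """Membership in L_7 = {1^(n^2) : n in N}."""
    n = len(w)
    return set(w) <= {"1"} and math.isqrt(n) ** 2 == n

# Each candidate maps a pair (i, k) with i > k to an extension z.
candidates = {
    "empty": lambda i, k: "",
    "1": lambda i, k: "1",
    "1^k": lambda i, k: "1" * k,
    "1^(2k+1)": lambda i, k: "1" * (2 * k + 1),
    "1^(2i+1)": lambda i, k: "1" * (2 * i + 1),
}

def separates_all(make_z, bound=30) -> bool:
    """True if make_z(i, k) separates 1^(i^2) from 1^(k^2) for every k < i < bound."""
    return all(
        in_L7("1" * (i * i) + make_z(i, k)) != in_L7("1" * (k * k) + make_z(i, k))
        for i in range(bound)
        for k in range(i)
    )

results = {name: separates_all(make_z) for name, make_z in candidates.items()}
print(results)
```

Only the gap-sized extension $1^{2k+1}$ passes for every tested pair; the pair $i = 4$, $k = 0$ is one of the cases that rules out $1^{2i+1}$.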

More examples applying the Myhill-Nerode Theorem

Claim: the following language is not regular: \[ L_8 = \setbuild{w \in \binary^*}{w = \reverse{w} \; (w \text{ is a palindrome})} \]

Proof:

Let $S = \setbuild{0^n 1}{n \in \mathbb{N}}$.
Consider any pair $0^i 1, 0^k 1 \in S$ with $i \ne k$.
These have the separating extension $z = 0^i$ (as well as $z = 0^k$): \[ 0^i 1 0^i \in L_8 \quad \text{but} \quad 0^k 1 0^i \notin L_8. \]
Since $|S| = \infty$, $L_8$ is nonregular.
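Both extensions mentioned in this proof can be verified for small exponents. A minimal sketch, with hypothetical helper names and an arbitrary bound of 25:

```python
def in_L8(w: str) -> bool:
    """Membership in L_8: w is a palindrome."""
    return w == w[::-1]

def separates(z: str, x: str, y: str) -> bool:
    """True if exactly one of xz, yz is a palindrome."""
    return in_L8(x + z) != in_L8(y + z)

both_work = all(
    separates("0" * i, "0" * i + "1", "0" * k + "1")
    and separates("0" * k, "0" * i + "1", "0" * k + "1")
    for i in range(25)
    for k in range(25)
    if i != k
)
print(both_work)  # True
```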

Proof of the Myhill-Nerode Corollary

Corollary (one direction) of the Myhill-Nerode Theorem:
If a language $L$ has an infinite number of equivalence classes with respect to $\sim_L$,
then $L$ is nonregular.

Proof: (Contrapositive: $L$ regular $\implies$ $\sim_L$ defines a finite number of equivalence classes.)

Let $L$ be regular, and let $D = (Q, \Sigma, \delta, s, F)$ be a DFA deciding $L$.
Denote by $\reachedState(x)$ the state reached by $D$ after processing a string $x \in \Sigma^*$.
- $\reachedState(x)$ is sometimes called the “extended transition function.”
- It is defined recursively as:
  - Base case: $\; \reachedState(\emptystring) = \fragment{s}$
  - For all $a \in \Sigma$ and $x \in \Sigma^*$: $\; \reachedState(x a) = \fragment{\delta(\reachedState(x), a)}$
For any $x, y \in \Sigma^*$ such that $\reachedState(x) = \reachedState(y)$, we have $\reachedState(xz) = \reachedState(yz)$ for all $z \in \Sigma^*$, so $D$ accepts $xz$ if and only if it accepts $yz$; that is, $x \sim_L y$.
Thus for each state $q \in Q$, all strings $x \in \Sigma^*$ with $\reachedState(x) = q$ are in the same equivalence class.
This means the number of equivalence classes defined by $\sim_L$ is at most $|Q|$ (hence finite).
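The recursive definition of the extended transition function and the counting argument can be illustrated on a small DFA. The DFA below (binary strings with an even number of 0s) is a hypothetical example, not one from the slides. The `signature` helper approximates a string's $\sim_L$ class by its acceptance behavior on all extensions up to a fixed length, which is an assumption of this sketch.

```python
from itertools import product

# Hypothetical example DFA: accepts binary strings with an even number of 0s.
Q = {"even", "odd"}
delta = {
    ("even", "0"): "odd", ("even", "1"): "even",
    ("odd", "0"): "even", ("odd", "1"): "odd",
}
s, F = "even", {"even"}

def reached_state(x: str) -> str:
    """Extended transition function: the state reached from s after reading x."""
    q = s
    for a in x:
        q = delta[(q, a)]
    return q

def accepts(x: str) -> bool:
    return reached_state(x) in F

def strings(max_len: int):
    """All binary strings of length 0..max_len."""
    return ["".join(t) for n in range(max_len + 1) for t in product("01", repeat=n)]

def signature(x: str, max_len: int = 4):
    """Acceptance of xz for every z up to max_len (a bounded stand-in for the class of x)."""
    return tuple(accepts(x + z) for z in strings(max_len))

sig = {x: signature(x) for x in strings(5)}

# Strings reaching the same state must show identical behavior under every extension.
same_state_same_class = all(
    sig[x] == sig[y] for x in sig for y in sig if reached_state(x) == reached_state(y)
)
num_classes = len(set(sig.values()))
print(same_state_same_class, num_classes, len(Q))  # True 2 2
```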

Optional: Another rigorous (ad-hoc) proof

Let’s show (another way) that the following language is not regular (DFA-decidable): \[ L_1 = \setbuild{0^n 1^n}{n \in \mathbb{N}} = \{\emptystring, 01, 0011, 000111, 00001111, \dots\} \]

Suppose for the sake of contradiction that there is a DFA $D = (Q, \Sigma, \delta, s, F)$ deciding $L_1$.
Consider a string $0^n 1^n \in L_1$ with $n \ge |Q|$.
- The DFA must accept this string with a computation sequence: \[ s = r_0 \fragment{\transition{0} r_1} \fragment{\transition{0} r_2} \fragment{\transition{0} \cdots \transition{0} r_{n}} \fragment{\transition{1} r_{n + 1}} \fragment{\transition{1} \cdots \transition{1} r_{2n} \in F} \qquad \qquad \]
- Since $n + 1 > |Q|$ states appear in $(r_0, r_1, \ldots, r_n)$, at least one must repeat (Pigeonhole Principle): \[ r_0 \cdots \overbrace{\underbrace{r_i \transition{0} \cdots \transition{0} r_j}_{r_i = r_j}}^{\text{substring $y=0^{j-i}$ follows a cycle}} \cdots r_n \transition{1} \cdots \transition{1} r_{2n} \qquad\qquad \]
- The nonempty sequence $r_i \transition{0} \cdots \transition{0} r_{j}$ can be repeated $k$ more times (overlapping $r_i, r_j$), obtaining a new computation sequence that also reaches state $r_{2n}$, i.e., $D$ accepts the string $x=0^{n + (j - i) k} 1^n$ for any $k \in \mathbb{N}^+$.
- But since $n+(j-i)k \ne n$, we have $x \notin L_1$, contradicting $L(D) = L_1$ since $D$ accepts $x$.
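The pigeonhole step can be seen concretely on any DFA that accepts some $0^n 1^n$ with $n \ge |Q|$. The sketch below uses a hypothetical three-state DFA deciding $L(0^* 1^*)$ (it does not decide $L_1$; no DFA does) and shows that repeating the cycle found among $r_0, \ldots, r_n$ leads to the same final state. The names `run` and `find_cycle` are illustrative.

```python
# Hypothetical example DFA deciding L(0* 1*): "a" reads 0s, "b" reads 1s, "dead" rejects.
Q = {"a", "b", "dead"}
delta = {
    ("a", "0"): "a", ("a", "1"): "b",
    ("b", "0"): "dead", ("b", "1"): "b",
    ("dead", "0"): "dead", ("dead", "1"): "dead",
}
s, F = "a", {"a", "b"}

def run(w: str):
    """Return the computation sequence r_0, ..., r_|w| of the DFA on w."""
    states = [s]
    for a in w:
        states.append(delta[(states[-1], a)])
    return states

def find_cycle(states):
    """Return the first (i, j) with i < j and states[i] == states[j], or None."""
    seen = {}
    for j, q in enumerate(states):
        if q in seen:
            return seen[q], j
        seen[q] = j
    return None

n = len(Q)                      # n >= |Q| forces a repeat among r_0..r_n
r = run("0" * n + "1" * n)
i, j = find_cycle(r[: n + 1])   # pigeonhole: n + 1 states, only |Q| distinct values

# Repeating the cycle k more times yields 0^(n + (j - i) k) 1^n, which ends in the same state.
pumped = ["0" * (n + (j - i) * k) + "1" * n for k in range(1, 4)]
same_final_state = [run(x)[-1] == r[-1] for x in pumped]
print((i, j), same_final_state)  # (0, 1) [True, True, True]
```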

This is the idea behind the Pumping Lemma: Optional sections 7.7-7.9

Optional: The Pumping Lemma

Another common way to prove a language is nonregular is to apply the pumping lemma.
Underlying idea:
- Any sufficiently long string $w \in L$ will cause a DFA to revisit a state.
- The substring read between the two visits can be deleted or repeated to obtain $w' \in L$.

Pumping Lemma
If $L$ is regular, then there is a pumping length $p \in \mathbb{N}$ such that for all $w \in L$ with $|w| \geq p$, there is a decomposition into three substrings $w = x y z$ where:

1. $x y^i z \in L$ for all $i \geq 0$.
2. $|y| > 0$.
3. $|x y| \leq p$.

To prove a language is nonregular using the pumping lemma, apply the contrapositive:
If for every $p \in \mathbb{N}$ there is a string $w \in L$ with $|w| \ge p$ such that every decomposition $w = x y z$ with $|y| > 0$ and $|x y| \leq p$ has $x y^i z \notin L$ for some $i \geq 0$, then $L$ is nonregular.

Optional: Applying the Pumping Lemma

Claim: the following language is not regular: \[ L_1 = \setbuild{0^n 1^n}{n \in \mathbb{N}} \]

Proof via the Pumping Lemma:

Given pumping length $p$, consider the string $w = 0^p 1^p \in L_1$ with $|w| = 2p \geq p$.
The only possible decompositions $w = x y z$ with $|y| > 0$ and $|x y| \leq p$
must have $y = 0^k$ for some $k \geq 1$.
Pumping (deleting or repeating) $y$ changes the number of 0s, making the string no longer in the language: \[ x y^i z = 0^{p + (i - 1)k} 1^p \notin L_1 \quad \text{when } i \ne 1 \]
Therefore $L_1$ is nonregular by the pumping lemma.
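The case analysis in this proof can be replayed by enumerating every allowed decomposition of $w = 0^p 1^p$ and checking that pumping with $i = 0$ or $i = 2$ leaves $L_1$. This is a sketch for small $p$ only; `decompositions` and `every_split_pumps_out` are hypothetical helper names, and trying just the exponents 0 and 2 is a choice made for this sketch.

```python
def in_L1(w: str) -> bool:
    """Membership in L_1 = {0^n 1^n : n in N}."""
    n = len(w) // 2
    return w == "0" * n + "1" * n

def decompositions(w: str, p: int):
    """Yield all (x, y, z) with w = xyz, |y| > 0, and |xy| <= p."""
    for end in range(1, min(p, len(w)) + 1):   # end = |xy|
        for start in range(end):               # start = |x|
            yield w[:start], w[start:end], w[end:]

def every_split_pumps_out(w: str, p: int, in_L, exponents=(0, 2)) -> bool:
    """True if each allowed decomposition has some i in exponents with x y^i z not in L."""
    return all(
        any(not in_L(x + y * i + z) for i in exponents)
        for x, y, z in decompositions(w, p)
    )

results = [every_split_pumps_out("0" * p + "1" * p, p, in_L1) for p in range(1, 15)]
print(all(results))  # True
```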

Optional: Applying the Pumping Lemma

Claim: the following language is not regular: \[ L_4 = \setbuild{w w}{w \in \binary^*} \]

Proof via the Pumping Lemma:

Assume that $L_4$ is regular, and let $p$ be the pumping length guaranteed by the pumping lemma.
Consider the string $w = 0^p 1 0^p 1 \in L_4$ with $|w| > p$.
The pumping lemma guarantees we can decompose $w$ into three substrings $w = x y z$ where:
1. $x y^i z \in L_4$ for all $i \geq 0$.
2. $|y| > 0$.
3. $|x y| \leq p$.
The last two conditions imply that $y = 0^k$ for some $k \geq 1$.
But the pumped string $x y^i z$ is: \[ x y^i z = 0^{p + (i - 1)k} 1 0^p 1 \notin L_4 \quad \text{when } i \ne 1 \] contradicting the first condition.
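The same enumeration idea applies to $w = 0^p 1 0^p 1$: every decomposition with $|y| > 0$ and $|xy| \le p$ should pump out of $L_4$. The sketch below checks this for small $p$ using the exponents 0 and 2; the helper names are hypothetical and the bound on $p$ is arbitrary.

```python
def in_L4(w: str) -> bool:
    """Membership in L_4 = {ww : w in {0,1}*}."""
    half, remainder = divmod(len(w), 2)
    return remainder == 0 and w[:half] == w[half:]

def decompositions(w: str, p: int):
    """Yield all (x, y, z) with w = xyz, |y| > 0, and |xy| <= p."""
    for end in range(1, min(p, len(w)) + 1):   # end = |xy|
        for start in range(end):               # start = |x|
            yield w[:start], w[start:end], w[end:]

def every_split_pumps_out(w: str, p: int, exponents=(0, 2)) -> bool:
    """True if each allowed decomposition has some i in exponents with x y^i z not in L_4."""
    return all(
        any(not in_L4(x + y * i + z) for i in exponents)
        for x, y, z in decompositions(w, p)
    )

results = [every_split_pumps_out("0" * p + "1" + "0" * p + "1", p) for p in range(1, 13)]
print(all(results))  # True
```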
