
ECS 120 Theory of Computation
Nonregular Languages
Julian Panetta
University of California, Davis

Recall: Regular Languages

  • We have shown that the following classes of languages are equivalent, and we call them regular:

    • DFA-decidable
    • NFA-decidable
    • RG-decidable
    • regex-decidable
  • Are all languages regular?

    • To prove a language is regular we can exhibit a DFA, NFA, regex, or RG that decides it.
    • How can we prove a language is not regular?

Why we need rigor

Let \(\#(a, b)\) denote the number of occurrences of \(a \in \Sigma^*\) as a substring in \(b \in \Sigma^*\).

Claim: the following language is not regular: \[ L_2 = \setbuild{w \in \binary^*}{\#(0, w) = \#(1, w)} \] because deciding it requires counting the number of occurrences of symbols \(0\) and \(1\),
needing an unbounded number of states.

  • Correct
  • Incorrect

Why we need rigor

Let \(\#(a, b)\) denote the number of occurrences of \(a \in \Sigma^*\) as a substring in \(b \in \Sigma^*\).

Claim: the following language is not regular: \[ L_3 = \setbuild{w \in \binary^*}{\#(01, w) = \#(10, w)} \] because deciding it requires counting the number of occurrences of substrings \(01\) and \(10\),
needing an unbounded number of states.

  • Correct
  • Incorrect
  • Compare the number of “rising edges” (\(\textcolor{purple}01\)) and “falling edges” (\(\textcolor{red}10\)).
  • \(000\underbrace{\textcolor{purple}01}_{\textcolor{purple}1}1\underbrace{\textcolor{red}10}_{\textcolor{red}1}0\underbrace{\textcolor{purple}01}_{\textcolor{purple}2}111\underbrace{\textcolor{red}10}_{\textcolor{red}2}0\underbrace{\textcolor{purple}01}_{\textcolor{purple}3}111\underbrace{\textcolor{red}10}_{\textcolor{red}3}00\) \(\quad\) vs. \(\quad\) \(\underbrace{\textcolor{red}10}_{\textcolor{red}1}00\underbrace{\textcolor{purple}01}_{\textcolor{purple}1}1\underbrace{\textcolor{red}10}_{\textcolor{red}2}0\underbrace{\textcolor{purple}01}_{\textcolor{purple}2}111\underbrace{\textcolor{red}10}_{\textcolor{red}3}0\underbrace{\textcolor{purple}01}_{\textcolor{purple}3}111\underbrace{\textcolor{red}10}_{\textcolor{red}4}00\)
  • \(L_3\) is in fact regular: it equals \(\{\emptystring\} \cup \setbuild{w \in \binary^*}{w[1] = w[|w|]}\), the strings whose first and last symbols agree (plus the empty string).
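The first-symbol/last-symbol characterization can be sanity-checked by brute force. The sketch below compares the two conditions on every binary string up to a small length bound; the helper names (`count_occurrences`, `first_equals_last`, `binary_strings`) are illustrative and not part of the course material, and a bounded check like this is evidence rather than a proof.

```python
from itertools import product


def count_occurrences(a: str, b: str) -> int:
    """Number of (possibly overlapping) occurrences of a as a substring of b."""
    return sum(1 for i in range(len(b) - len(a) + 1) if b[i:i + len(a)] == a)


def in_L3(w: str) -> bool:
    """Membership in L_3: as many occurrences of 01 as of 10."""
    return count_occurrences("01", w) == count_occurrences("10", w)


def first_equals_last(w: str) -> bool:
    # The empty string has no first or last symbol; it is in L_3 (0 = 0),
    # so this sketch treats it as satisfying the condition.
    return w == "" or w[0] == w[-1]


def binary_strings(max_len: int):
    """Yield every binary string of length 0..max_len."""
    for n in range(max_len + 1):
        for symbols in product("01", repeat=n):
            yield "".join(symbols)


# Strings on which the two conditions disagree (none are expected).
mismatches = [w for w in binary_strings(12) if in_L3(w) != first_equals_last(w)]
print(mismatches)
```

Intuitively, each rising edge must be followed by a falling edge before the next rising edge can occur, so the two counts never differ by more than one, and they are equal exactly when the string ends on the symbol it started with.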

A rigorous (ad-hoc) proof

Let’s show (for real this time) that the following language is not regular (DFA-decidable): \[ L_1 = \setbuild{0^n 1^n}{n \in \mathbb{N}} = \{\emptystring, 01, 0011, 000111, 00001111, \dots\} \]

  • Suppose for the sake of contradiction that there is a DFA \(D = (Q, \Sigma, \delta, s, F)\) deciding \(L_1\).
  • Consider processing two different strings \(x, y \in \Sigma^*\) on separate copies of \(D\).
  • Now suppose that by feeding the suffix string \(z \in \Sigma^*\) into both copies we find that: \[ xz \in L_1 \quad \text{but} \quad yz \notin L_1. \]
  • Then running \(D\) on \(x\) must have reached a different state than running it on \(y\)!
  • Let’s apply this logic to the strings \(x=0^i\) and \(y=0^j\) for any \(i \ne j\):
    • Taking \(z = 1^i\), we find: \[ xz=0^i 1^i \in L_1 \quad \text{but} \quad yz=0^j 1^i \notin L_1. \]
    • Therefore \(D\) must be in a different state after processing \(0^i\) than after processing \(0^j\).
    • In other words, reading \(0^i\) must put \(D\) into a distinct state for each of the infinitely many \(i \in \mathbb{N}\).
    • But \(D\) has only a finite number of states, a contradiction.

This is the idea behind the Myhill-Nerode Theorem: Sections 7.2-7.5

Separating extensions and \(L\)-equivalence

First, some definitions:

Given a language \(L \subseteq \Sigma^*\),
the strings \(x, y \in \Sigma^*\) are called \(L\)-separable or \(L\)-distinguishable if there is a string \(z \in \Sigma^*\) such that:

\[ xz \in L \iff yz \notin L \]

Exactly one of \(xz\) or \(yz\) is in \(L\).
(\(xz\in L\ \) XOR \(\ yz\in L\))

This \(z\) is called a separating (or distinguishing) extension for \(x\) and \(y\).

If \(x\) and \(y\) are not \(L\)-separable,
then we say they are \(L\)-equivalent, denoted \(x \sim_L y\).

In other words, \(x \sim_L y\) means that for all \(z \in \Sigma^*\): \[ xz \in L \iff yz \in L \]

Both \(xz\) and \(yz\) are in \(L\), or both are not.

Note that \(\sim_L\) is an equivalence relation on \(\Sigma^*\), so it partitions \(\Sigma^*\) into equivalence classes of strings that are indistinguishable by \(L\).
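The definition of a separating extension translates directly into a bounded search. The sketch below assumes the language is given as a membership predicate; the function name `find_separating_extension` and the `max_len` cutoff are illustrative choices. A returned string is a genuine separating extension, while `None` only means that no extension up to the length bound was found.

```python
from itertools import product
from typing import Callable, Optional


def find_separating_extension(in_L: Callable[[str], bool], x: str, y: str,
                              alphabet: str = "01",
                              max_len: int = 6) -> Optional[str]:
    """Return a shortest z with |z| <= max_len such that exactly one of
    xz, yz is in L, or None if no such z exists within the bound."""
    for n in range(max_len + 1):
        for symbols in product(alphabet, repeat=n):
            z = "".join(symbols)
            if in_L(x + z) != in_L(y + z):  # exactly one of xz, yz is in L
                return z
    return None


def in_L1(w: str) -> bool:
    """Membership in L_1 = {0^n 1^n : n in N}."""
    n = len(w) // 2
    return w == "0" * n + "1" * n


print(find_separating_extension(in_L1, "0", "00"))    # 1
print(find_separating_extension(in_L1, "10", "110"))  # None
```

Here \(z = 1\) separates \(0\) from \(00\) because \(01 \in L_1\) but \(001 \notin L_1\), whereas no extension of \(10\) or \(110\) can ever land in \(L_1\).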

Separating extensions and \(L\)-equivalence: Examples

\[ L_1 = \setbuild{0^n 1^n}{n \in \mathbb{N}} = \{\emptystring, 01, 0011, 000111, \ldots\} \]

Do \(00\) and \(000\) have a separating extension with respect to \(L_1\)? Mark all that apply.

  • \(\emptystring\)
  • \(1\)
  • \(11\)
  • \(111\)
  • \(1111\)
  • \(0\)
  • \(00\)
  • \(000\)
  • \(0000\)
  • They have no separating extension; \(00 \sim_{L_1} 000\).

Separating extensions and \(L\)-equivalence: Examples

\[ L_1 = \setbuild{0^n 1^n}{n \in \mathbb{N}} = \{\emptystring, 01, 0011, 000111, \ldots\} \]

Do strings \(00\) and \(000\) have a separating extension with respect to \(L_1\)?

  • \(z = 11\) is one: \(0011 \in L_1 \quad \text{but} \quad 00011 \notin L_1\).
  • \(z = 111\) is another: \(00111 \notin L_1 \quad \text{but} \quad 000111 \in L_1\).
  • Thus \(00 \not \sim_{L_1} 000\).

Do \(010\) and \(0010\) have a separating extension with respect to \(L_1\)? Mark all that apply.

  • \(\emptystring\)
  • \(11\)
  • \(111\)
  • \(00\)
  • \(000\)
  • \(010\)
  • \(0010\)
  • \(101\)
  • \(1101\)
  • They have no separating extension; \(010 \sim_{L_1} 0010\).

Separating extensions and \(L\)-equivalence: Examples

\[ L_1 = \setbuild{0^n 1^n}{n \in \mathbb{N}} = \{\emptystring, 01, 0011, 000111, \ldots\} \]

  • Do strings \(00\) and \(000\) have a separating extension with respect to \(L_1\)?
    • \(z = 11\) is one: \(0011 \in L_1 \quad \text{but} \quad 00011 \notin L_1\).
    • \(z = 111\) is another: \(00111 \notin L_1 \quad \text{but} \quad 000111 \in L_1\).
    • Thus \(00 \not \sim_{L_1} 000\).
  • Do strings \(010\) and \(0010\) have a separating extension with respect to \(L_1\)?
    • No, because no string of the form \(0^+1^+0^+\) is a prefix of any string in \(L_1\), i.e.: \[ (\forall z \in \Sigma^*) \quad 010z \notin L_1 \quad \text{and} \quad 0010z \notin L_1 \]
    • Thus \(010 \sim_{L_1} 0010\).
  • How many equivalence classes are there for \(\sim_{L_1}\)?
    • \(\emptystring, 0, 00, 000, \ldots\) are pairwise \(L_1\)-inequivalent: for all \(i \ne k\), the strings \(x = 0^i\) and \(y = 0^k\) have separating extension \(z = 1^i\).
    • Therefore there are an infinite number of equivalence classes.
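The answers on this slide can be replayed mechanically. The following sketch tests each candidate extension from the two polls and then checks, for a finite range of exponents only, that \(1^i\) separates \(0^i\) from \(0^k\). The helper `separates` and the candidate lists are written out here for illustration.

```python
def in_L1(w: str) -> bool:
    """Membership in L_1 = {0^n 1^n : n in N}."""
    n = len(w) // 2
    return w == "0" * n + "1" * n


def separates(z: str, x: str, y: str) -> bool:
    """True when exactly one of xz, yz belongs to L_1."""
    return in_L1(x + z) != in_L1(y + z)


# Candidates from the first poll (pair 00, 000).
first_poll = ["", "1", "11", "111", "1111", "0", "00", "000", "0000"]
separating_00_000 = [z for z in first_poll if separates(z, "00", "000")]
print(separating_00_000)  # ['11', '111']

# Candidates from the second poll (pair 010, 0010).
second_poll = ["", "11", "111", "00", "000", "010", "0010", "101", "1101"]
separating_010_0010 = [z for z in second_poll if separates(z, "010", "0010")]
print(separating_010_0010)  # []

# Finite spot check: 0^i and 0^k are separated by 1^i whenever i != k.
S = ["0" * n for n in range(8)]
all_separated = all(
    separates("1" * i, S[i], S[k])
    for i in range(len(S)) for k in range(len(S)) if i != k
)
print(all_separated)  # True
```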

Myhill-Nerode Theorem

Theorem (Myhill-Nerode):
A language \(L\) is regular if and only if \(\sim_L\) defines a finite number of equivalence classes.

Furthermore, the number of equivalence classes equals the number of states in the
minimal DFA deciding \(L\).

To prove a language is nonregular we need only one direction of this theorem:

Corollary:
If a language \(L\) has an infinite number of equivalence classes with respect to \(\sim_L\),
then \(L\) is nonregular.

Equivalent corollary:
If for language \(L\) we can construct an infinite set \(S\) of pairwise \(L\)-inequivalent strings,
then \(L\) is nonregular.

Myhill-Nerode Theorem: Examples

\[ L_1 = \setbuild{0^n 1^n}{n \in \mathbb{N}} = \{\emptystring, 01, 0011, 000111, \ldots\} \]

Claim: \(L_1\) is nonregular.

Proof:

  • Define \(S = \setbuild{0^n}{n \in \mathbb{N}} = \{\emptystring, 0, 00, 000, \ldots\}\).
  • Each pair of strings \(0^i, 0^k \in S\) with \(i \ne k\) has a separating extension \(1^i\): \[ 0^i 1^i \in L_1 \quad \text{but} \quad 0^k 1^i \notin L_1, \] meaning \(0^i \not \sim_{L_1} 0^k\).
  • Therefore, \(L_1\) is nonregular by the Myhill-Nerode theorem.

Myhill-Nerode Theorem: Examples

\[ L_2 = \setbuild{w \in \binary^*}{\#(0, w) = \#(1, w)} \]

Claim: \(L_2\) is nonregular.

Proof:

  • Define \(S = \setbuild{0^n}{n \in \mathbb{N}} = \{\emptystring, 0, 00, 000, \ldots\}\).
  • Each pair of strings \(0^i, 0^k \in S\) with \(i \ne k\) has a separating extension \(1^i\): \[ 0^i 1^i \in L_2 \quad \text{but} \quad 0^k 1^i \notin L_2, \] meaning \(0^i \not \sim_{L_2} 0^k\).
  • Therefore, \(L_2\) is nonregular by the Myhill-Nerode theorem.

The exact same proof applies for \(L_2\)!

Myhill-Nerode Theorem: Examples

\[ L_2 = \setbuild{w \in \binary^*}{\#(0, w) = \#(1, w)} \]

Claim: \(L_2\) is nonregular.

Alternate proof using closure properties:

  • Note that \(L(0^* 1^*)\) is regular,
    and \(L_2 \cap L(0^* 1^*) = \setbuild{0^n 1^n}{n \in \mathbb{N}} = L_1\).
  • By closure of DFA-decidable languages under intersection, if \(L_2\) were regular, \(L_1\) would also be regular.
  • But \(L_1\) is nonregular, so \(L_2\) must be nonregular also!
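The identity \(L_2 \cap L(0^* 1^*) = L_1\) used in this proof can be spot-checked on all short strings. This is a minimal sketch using Python's `re` module for the regular language; the names `ZEROS_THEN_ONES` and `binary_strings` are illustrative, and the check covers only strings up to length 12.

```python
import re
from itertools import product

ZEROS_THEN_ONES = re.compile(r"0*1*")


def in_L2(w: str) -> bool:
    """Membership in L_2: equal numbers of 0s and 1s."""
    return w.count("0") == w.count("1")


def in_L1(w: str) -> bool:
    """Membership in L_1 = {0^n 1^n : n in N}."""
    n = len(w) // 2
    return w == "0" * n + "1" * n


def binary_strings(max_len: int):
    """Yield every binary string of length 0..max_len."""
    for n in range(max_len + 1):
        for symbols in product("01", repeat=n):
            yield "".join(symbols)


# Strings where "in L_2 and in L(0*1*)" disagrees with "in L_1".
disagreements = [
    w for w in binary_strings(12)
    if (in_L2(w) and ZEROS_THEN_ONES.fullmatch(w) is not None) != in_L1(w)
]
print(disagreements)  # []
```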

More examples applying the Myhill-Nerode Theorem

Claim: the following language is not regular: \[ L_4 = \setbuild{w w}{w \in \binary^*} \]

Proof:

  • Let \(S = \setbuild{0^n 1}{n \in \mathbb{N}} = \{1, 01, 001, 0001, \ldots\}\).

  • Any pair \(0^i 1, 0^k 1 \in S\) with \(i \ne k\) is \(L_4\)-inequivalent \(\quad (0^i 1 \not \sim_{L_4} 0^k 1)\).
    Separating extension? \(z = 0^i 1\)

    \[ 0^i 1 z = 0^i 1 0^i 1 \in L_4 \quad \text{but} \quad 0^k 1 z = 0^k 1 0^i 1 \notin L_4 \quad \text{since } i \ne k \]

  • Therefore every element of \(S\) is in a different equivalence class.

  • Since \(|S| = \infty\), \(L_4\) is nonregular.
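As a finite spot check of the separating extension \(z = 0^i 1\), the sketch below tests every pair of exponents below 20. The function name `separated_by_own_copy` is made up for this example.

```python
def in_L4(s: str) -> bool:
    """Membership in L_4 = {ww : w in {0,1}*}."""
    half, remainder = divmod(len(s), 2)
    return remainder == 0 and s[:half] == s[half:]


def separated_by_own_copy(i: int, k: int) -> bool:
    """Check that z = 0^i 1 puts 0^i 1 z in L_4 but leaves 0^k 1 z outside."""
    x, y = "0" * i + "1", "0" * k + "1"
    z = x
    return in_L4(x + z) and not in_L4(y + z)


all_pairs_separated = all(
    separated_by_own_copy(i, k)
    for i in range(20) for k in range(20) if i != k
)
print(all_pairs_separated)  # True
```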

More examples applying the Myhill-Nerode Theorem

Claim: the following language is not regular: \[ L_5 = \setbuild{0^i 1^j}{i > j} \]

Proof:

  • Let \(S = \setbuild{0^n}{n \in \mathbb{N}} = \{\epsilon, 0, 00, 000, \ldots\}\).

  • Any pair \(0^i, 0^k \in S\) with \(i > k\) is \(L_5\)-inequivalent \(\quad (0^i \not \sim_{L_5} 0^k)\).
    Separating extension?

    • \(1^k\)
    • \(1^i\)
    • \(1^{k + 1}\)
    • \(1^{i - 1}\)
    • \(01^{k + 1}\)

    Taking \(z = 1^k\): \[ 0^i 1^k \in L_5 \quad \text{but} \quad 0^k 1^k \notin L_5 \]

  • Since \(|S| = \infty\), \(L_5\) is nonregular.
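Each candidate extension listed above can be tested against many pairs \(i > k\) at once. The sketch below does so for all exponents below 15; the dictionary of candidates mirrors the poll options, the labels are informal, and a candidate marked `True` has only been confirmed on that finite range.

```python
import re


def in_L5(w: str) -> bool:
    """Membership in L_5 = {0^i 1^j : i > j}."""
    m = re.fullmatch(r"(0*)(1*)", w)
    return m is not None and len(m.group(1)) > len(m.group(2))


# Each candidate maps a pair (i, k) with i > k to an extension z.
candidates = {
    "1^k": lambda i, k: "1" * k,
    "1^i": lambda i, k: "1" * i,
    "1^(k+1)": lambda i, k: "1" * (k + 1),
    "1^(i-1)": lambda i, k: "1" * (i - 1),
    "0 1^(k+1)": lambda i, k: "0" + "1" * (k + 1),
}


def works_for_all_pairs(make_z, bound: int = 15) -> bool:
    """True if z separates 0^i from 0^k for every 0 <= k < i < bound."""
    for i in range(bound):
        for k in range(i):
            z = make_z(i, k)
            if in_L5("0" * i + z) == in_L5("0" * k + z):
                return False
    return True


results = {name: works_for_all_pairs(make_z) for name, make_z in candidates.items()}
for name, ok in results.items():
    print(name, ok)
```

On this range, \(1^k\), \(1^{i-1}\), and \(01^{k+1}\) separate every pair, while \(1^i\) never does and \(1^{k+1}\) fails when \(i = k + 1\).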

More examples applying the Myhill-Nerode Theorem

Claim: the following language is not regular: \[ L_6 = \setbuild{0^i 1^j 0^k}{i \cdot j = k} = \{\epsilon, 010, 00100, 01100, \ldots\} \]

Proof:

  • Let \(S = \setbuild{0^n}{n \in \mathbb{N}}\).

  • Consider any pair \(0^i, 0^k \in S\) with \(i \ne k\). Separating extension?

    • \(0^i\)
    • \(10^i\)
    • \(10^k\)
    • \(110^{2i}\)
    • \(110^{i + k}\)
  • Since \(|S| = \infty\), \(L_6\) is nonregular.

  • Alternative choices for \(S\): \[ S = \setbuild{0^n 1^n}{n \in \mathbb{N}}, \quad \quad S = \setbuild{0^n 1}{n \in \mathbb{N}}, \quad \quad \cdots \]
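The poll options for \(L_6\) can be checked the same way. This sketch tests each candidate on every pair \(i \ne k\) with exponents below 12; the labels and the helper `works_for_all_pairs` are illustrative, and the results hold only for that finite range.

```python
import re


def in_L6(w: str) -> bool:
    """Membership in L_6 = {0^i 1^j 0^k : i * j = k}."""
    m = re.fullmatch(r"(0*)(1*)(0*)", w)
    if m is None:
        return False
    # An all-zeros string can be split several ways; the greedy match puts
    # every 0 in the first block, giving j = k = 0, which is a valid witness.
    i, j, k = (len(group) for group in m.groups())
    return i * j == k


# Each candidate maps a pair (i, k) with i != k to an extension z.
candidates = {
    "0^i": lambda i, k: "0" * i,
    "1 0^i": lambda i, k: "1" + "0" * i,
    "1 0^k": lambda i, k: "1" + "0" * k,
    "11 0^(2i)": lambda i, k: "11" + "0" * (2 * i),
    "11 0^(i+k)": lambda i, k: "11" + "0" * (i + k),
}


def works_for_all_pairs(make_z, bound: int = 12) -> bool:
    """True if z separates 0^i from 0^k for every i != k below bound."""
    for i in range(bound):
        for k in range(bound):
            if i == k:
                continue
            z = make_z(i, k)
            if in_L6("0" * i + z) == in_L6("0" * k + z):
                return False
    return True


results = {name: works_for_all_pairs(make_z) for name, make_z in candidates.items()}
for name, ok in results.items():
    print(name, ok)
```

The extension \(0^i\) fails because both \(0^{2i}\) and \(0^{k+i}\) are in \(L_6\) (take \(j = 0\)), and \(110^{i+k}\) fails because neither extended string is in \(L_6\) when \(i \ne k\).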

More examples applying the Myhill-Nerode Theorem

Claim: the following language is not regular: \[ L_7 = \setbuild{1^{n^2}}{n \in \mathbb{N}} = \{\epsilon, 1, 1111, 111111111, \ldots\} \]

Proof:

  • Let \(S = L_7\).

  • Consider any pair \(1^{i^2}, 1^{k^2} \in S\) with \(i > k\).

  • Separating extension that works for all such pairs?

    • \(\emptystring\)
    • \(1\)
    • \(1^k\)
    • \(1^{(k + 1)^2 - k^2}\)
    • \(1^{(i + 1)^2 - i^2}\)

    The gap between adjacent perfect squares is \((k + 1)^2 - k^2 = 2k + 1\), which strictly increases with \(k\). \[ 0,\; 1,\; 4,\; 9,\; 16,\; 25,\; 36,\; 49,\; 64,\; \ldots \]

    Since \(i>k\), adding \(2k + 1\) to \(i^2\) is not enough to reach \((i + 1)^2\).

    Adding \(2i + 1\) to \(k^2\) could reach \((k + c)^2\) for \(c > 1\).
    Example: \(k = 0, i = 4, c = 3\)

  • Since \(|S| = \infty\), \(L_7\) is nonregular.
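The gap argument can be checked numerically. The sketch below tests each poll option on all pairs \(i > k\) with \(i < 30\); string lengths are built from lambdas named after the options, which is purely an illustrative encoding.

```python
from math import isqrt


def in_L7(w: str) -> bool:
    """Membership in L_7 = {1^(n^2) : n in N}."""
    n = len(w)
    return set(w) <= {"1"} and isqrt(n) ** 2 == n


# Each candidate maps a pair (i, k) with i > k to the length of z = 1^length.
candidates = {
    "empty": lambda i, k: 0,
    "1": lambda i, k: 1,
    "1^k": lambda i, k: k,
    "1^((k+1)^2 - k^2)": lambda i, k: 2 * k + 1,
    "1^((i+1)^2 - i^2)": lambda i, k: 2 * i + 1,
}


def works_for_all_pairs(z_len, bound: int = 30) -> bool:
    """True if z separates 1^(i^2) from 1^(k^2) for every 0 <= k < i < bound."""
    for i in range(bound):
        for k in range(i):
            z = "1" * z_len(i, k)
            if in_L7("1" * (i * i) + z) == in_L7("1" * (k * k) + z):
                return False
    return True


results = {name: works_for_all_pairs(z_len) for name, z_len in candidates.items()}
for name, ok in results.items():
    print(name, ok)

# The counterexample from the slide: k = 0, i = 4 gives 0 + 9 = 3^2.
print(in_L7("1" * (0 + 2 * 4 + 1)))  # True
```

Only \(1^{2k+1}\) passes on this range: it lifts \(k^2\) to \((k+1)^2\) while leaving \(i^2 + 2k + 1\) strictly between \(i^2\) and \((i+1)^2\).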

More examples applying the Myhill-Nerode Theorem

Claim: the following language is not regular: \[ L_8 = \setbuild{w \in \binary^*}{w = \reverse{w} \; (w \text{ is a palindrome})} \]

Proof:

  • Let \(S = \setbuild{0^n 1}{n \in \mathbb{N}}\).
  • Consider any pair \(0^i 1, 0^k 1 \in S\) with \(i \ne k\).
  • These have the separating extension \(z = 0^i\) (as well as \(z = 0^k\)).
  • Since \(|S| = \infty\), \(L_8\) is nonregular.
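A short finite check of both separating extensions named in the proof, \(z = 0^i\) and \(z = 0^k\), assuming exponents below 20 (the function name `pair_is_separated` is hypothetical):

```python
def in_L8(w: str) -> bool:
    """Membership in L_8: w reads the same forwards and backwards."""
    return w == w[::-1]


def pair_is_separated(i: int, k: int) -> bool:
    """Check that both 0^i and 0^k separate 0^i 1 from 0^k 1."""
    x, y = "0" * i + "1", "0" * k + "1"
    return all(in_L8(x + z) != in_L8(y + z) for z in ("0" * i, "0" * k))


all_pairs_separated = all(
    pair_is_separated(i, k) for i in range(20) for k in range(20) if i != k
)
print(all_pairs_separated)  # True
```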

Proof of the Myhill-Nerode Corollary

Corollary (one direction) of the Myhill-Nerode Theorem:
If a language \(L\) has an infinite number of equivalence classes with respect to \(\sim_L\),
then \(L\) is nonregular.

Proof: (Contrapositive: \(L\) regular \(\implies\) \(\sim_L\) defines a finite number of equivalence classes.)

  • Let \(L\) be regular, and let \(D = (Q, \Sigma, \delta, s, F)\) be a DFA deciding \(L\).
  • Denote by \(\reachedState(x)\) the state reached by \(D\) after processing a string \(x \in \Sigma^*\).
    • \(\reachedState(x)\) is sometimes called the “extended transition function.”
    • It is defined recursively as:
      • Base case: \(\; \reachedState(\emptystring) = s\)
      • For all \(a \in \Sigma\) and \(x \in \Sigma^*\): \(\; \reachedState(x a) = \delta(\reachedState(x), a)\)
  • For any \(x, y \in \Sigma^*\) such that \(\reachedState(x) = \reachedState(y)\), we have \(x \sim_L y\): since \(D\) is deterministic, \(\reachedState(xz) = \reachedState(yz)\) for all \(z \in \Sigma^*\), so \(D\) accepts \(xz\) if and only if it accepts \(yz\).
  • Thus for each state \(q \in Q\), all strings \(x \in \Sigma^*\) with \(\reachedState(x) = q\) are in the same equivalence class.
  • This means the number of equivalence classes defined by \(\sim_L\) is at most \(|Q|\) (hence finite).
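To make the proof concrete, here is a small sketch of \(\reachedState\) for one specific DFA. The DFA below is a hypothetical five-state machine (state names `s`, `a0`, `b0`, `a1`, `b1` are invented for this example) that accepts the empty string and every binary string whose first and last symbols agree. The loop in `reached_state` is the recursive definition unrolled, and the final check confirms, on short strings only, that strings reaching the same state are never separated.

```python
from itertools import product

# Hypothetical DFA: a0/b0 remember "first symbol 0, last symbol 0/1",
# a1/b1 remember "first symbol 1, last symbol 1/0".
START = "s"
ACCEPT = {"s", "a0", "a1"}
DELTA = {
    ("s", "0"): "a0", ("s", "1"): "a1",
    ("a0", "0"): "a0", ("a0", "1"): "b0",
    ("b0", "0"): "a0", ("b0", "1"): "b0",
    ("a1", "1"): "a1", ("a1", "0"): "b1",
    ("b1", "1"): "a1", ("b1", "0"): "b1",
}


def reached_state(x: str) -> str:
    """State reached after reading x: start at s, apply delta per symbol."""
    state = START
    for symbol in x:
        state = DELTA[(state, symbol)]
    return state


def accepts(x: str) -> bool:
    return reached_state(x) in ACCEPT


def binary_strings(max_len: int):
    for n in range(max_len + 1):
        for symbols in product("01", repeat=n):
            yield "".join(symbols)


def same_state_implies_equivalent(max_len: int = 5, ext_len: int = 4) -> bool:
    """Bounded check: if x and y reach the same state, no short z separates them."""
    strings = list(binary_strings(max_len))
    extensions = list(binary_strings(ext_len))
    for x in strings:
        for y in strings:
            if reached_state(x) == reached_state(y):
                if any(accepts(x + z) != accepts(y + z) for z in extensions):
                    return False
    return True


states_reached = {reached_state(w) for w in binary_strings(5)}
print(len(states_reached))              # 5, so at most |Q| = 5 classes
print(same_state_implies_equivalent())  # True
```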

Optional: Another rigorous (ad-hoc) proof

Let’s show (another way) that the following language is not regular (DFA-decidable): \[ L_1 = \setbuild{0^n 1^n}{n \in \mathbb{N}} = \{\emptystring, 01, 0011, 000111, 00001111, \dots\} \]

  • Suppose for the sake of contradiction that there is a DFA \(D = (Q, \Sigma, \delta, s, F)\) deciding \(L_1\).
  • Consider a string \(0^n 1^n \in L_1\) with \(n \ge |Q|\).
    • The DFA must accept this string with a computation sequence: \[ s = r_0 \transition{0} r_1 \transition{0} r_2 \transition{0} \cdots \transition{0} r_{n} \transition{1} r_{n + 1} \transition{1} \cdots \transition{1} r_{2n} \in F \]

    • Since \(n + 1 > |Q|\) states appear in \((r_0, r_1, \ldots, r_n)\), at least one must repeat (Pigeonhole Principle): \[ r_0 \cdots \overbrace{\underbrace{r_i \transition{0} \cdots \transition{0} r_j}_{r_i = r_j}}^{\text{substring $y=0^{j-i}$ follows a cycle}} \cdots r_n \transition{1} \cdots \transition{1} r_{2n} \qquad\qquad \]

      [Figure: data/images/nonregular/pumping_vis.svg]
    • The nonempty sequence \(r_i \transition{0} \cdots \transition{0} r_{j}\) can be repeated \(k\) more times (overlapping \(r_i, r_j\)), giving a new computation sequence that also reaches state \(r_{2n}\). That is, \(D\) accepts the string \(x=0^{n + (j - i) k} 1^n\) for any \(k \in \mathbb{N}^+\).

    • But since \(n+(j-i)k \ne n\), we have \(x \notin L_1\), contradicting \(L(D) = L_1\) since \(D\) accepts \(x\).
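The pigeonhole step can be watched in action on any concrete DFA that accepts \(0^n 1^n\). The sketch below uses a hypothetical three-state DFA for \(L(0^* 1^*)\) (state names `q0`, `q1`, `dead` are invented here); it is not a DFA for \(L_1\), which is exactly the point, since no such DFA exists. The code finds a repeated state while reading \(0^n\) and confirms that the pumped string is accepted too.

```python
def run_trace(delta, start, w):
    """Return the state sequence r_0, r_1, ..., r_|w| visited while reading w."""
    trace = [start]
    for symbol in w:
        trace.append(delta[(trace[-1], symbol)])
    return trace


def find_repeat(trace):
    """Return indices (i, j) with i < j and trace[i] == trace[j], or None."""
    first_seen = {}
    for index, state in enumerate(trace):
        if state in first_seen:
            return first_seen[state], index
        first_seen[state] = index
    return None


# Hypothetical DFA for L(0*1*), with |Q| = 3.
DELTA = {
    ("q0", "0"): "q0", ("q0", "1"): "q1",
    ("q1", "1"): "q1", ("q1", "0"): "dead",
    ("dead", "0"): "dead", ("dead", "1"): "dead",
}
START, ACCEPT = "q0", {"q0", "q1"}

n = 3  # chosen so that n >= |Q|
zeros_trace = run_trace(DELTA, START, "0" * n)  # r_0, ..., r_n
i, j = find_repeat(zeros_trace)                 # guaranteed by pigeonhole

k = 2  # repeat the cycle two more times
pumped = "0" * (n + (j - i) * k) + "1" * n
accepted = run_trace(DELTA, START, pumped)[-1] in ACCEPT
print(i, j, pumped, accepted)  # 0 1 00000111 True
```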

This is the idea behind the Pumping Lemma: Optional sections 7.7-7.9

Optional: The Pumping Lemma

  • Another common way to prove a language is nonregular is to apply the pumping lemma.
  • Underlying idea:
    • Any sufficiently long string \(w \in L\) will cause a DFA to revisit a state.
    • The substring read between the two visits can be deleted or repeated to obtain \(w' \in L\).

Pumping Lemma
If \(L\) is regular, then there is a pumping length \(p \in \mathbb{N}\) such that for all \(w \in L\) with \(|w| \geq p\), there is a decomposition into three substrings \(w = x y z\) where:

  1. \(x y^i z \in L\) for all \(i \geq 0\).
  2. \(|y| > 0\).
  3. \(|x y| \leq p\).
[Figure: data/images/nonregular/pumping_vis.svg]

To prove a language is nonregular using the pumping lemma, apply the contrapositive:
For every candidate pumping length \(p\), exhibit a string \(w \in L\) with \(|w| \ge p\) such that every decomposition \(w = x y z\) with \(|y| > 0\) and \(|x y| \leq p\) has \(x y^i z \notin L\) for some \(i \geq 0\).

Optional: Applying the Pumping Lemma

Claim: the following language is not regular: \[ L_1 = \setbuild{0^n 1^n}{n \in \mathbb{N}} \]

Proof via the Pumping Lemma:

  • Given pumping length \(p\), consider the string \(w = 0^p 1^p \in L_1\) with \(|w| = 2p \ge p\).
  • The only possible decompositions \(w = x y z\) with \(|y| > 0\) and \(|x y| \leq p\)
    must have \(y = 0^k\) for some \(k \geq 1\).
  • Pumping (deleting or repeating) \(y\) changes the number of 0s, making the string no longer in the language: \[ x y^i z = 0^{p + (i - 1)k} 1^p \notin L_1 \quad \text{when } i \ne 1 \]
  • Therefore \(L_1\) is nonregular by the pumping lemma.
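The "for all decompositions" quantifier in this proof can be explored by enumeration. The sketch below lists every decomposition \(w = xyz\) of \(0^p 1^p\) allowed by conditions 2 and 3 and confirms that pumping with \(i = 2\) or \(i = 0\) always leaves \(L_1\). It covers only \(p < 12\), and the generator name `decompositions` is illustrative.

```python
def in_L1(w: str) -> bool:
    """Membership in L_1 = {0^n 1^n : n in N}."""
    n = len(w) // 2
    return w == "0" * n + "1" * n


def decompositions(w: str, p: int):
    """Yield every (x, y, z) with w = xyz, |y| > 0, and |xy| <= p."""
    for xy_len in range(1, min(p, len(w)) + 1):
        for x_len in range(xy_len):
            yield w[:x_len], w[x_len:xy_len], w[xy_len:]


def every_decomposition_fails(p: int, i: int) -> bool:
    """True if x y^i z is outside L_1 for every allowed decomposition of 0^p 1^p."""
    w = "0" * p + "1" * p
    return all(not in_L1(x + y * i + z) for x, y, z in decompositions(w, p))


pumping_up_fails = all(every_decomposition_fails(p, 2) for p in range(1, 12))
pumping_down_fails = all(every_decomposition_fails(p, 0) for p in range(1, 12))
print(pumping_up_fails, pumping_down_fails)  # True True
```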

Optional: Applying the Pumping Lemma

Claim: the following language is not regular: \[ L_4 = \setbuild{w w}{w \in \binary^*} \]

Proof via the Pumping Lemma:

  • Assume that \(L_4\) is regular, and let \(p\) be the pumping length guaranteed by the pumping lemma.
  • Consider the string \(w = 0^p 1 0^p 1 \in L_4\) with \(|w| > p\).
  • The pumping lemma guarantees we can decompose \(w\) into three substrings \(w = x y z\) where:
    1. \(x y^i z \in L_4\) for all \(i \geq 0\).
    2. \(|y| > 0\).
    3. \(|x y| \leq p\).
  • The last two conditions imply that \(y = 0^k\) for some \(k \geq 1\).
  • But the pumped string \(x y^i z\) is: \[ x y^i z = 0^{p + (i - 1)k} 1 0^p 1 \notin L_4 \quad \text{when } i \ne 1 \] contradicting the first condition.
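The same enumeration idea applies to \(w = 0^p 1 0^p 1\). This is a bounded sketch for \(p < 12\) with illustrative helper names, checking both pumping up (\(i = 2\)) and pumping down (\(i = 0\)).

```python
def in_L4(s: str) -> bool:
    """Membership in L_4 = {ww : w in {0,1}*}."""
    half, remainder = divmod(len(s), 2)
    return remainder == 0 and s[:half] == s[half:]


def decompositions(w: str, p: int):
    """Yield every (x, y, z) with w = xyz, |y| > 0, and |xy| <= p."""
    for xy_len in range(1, min(p, len(w)) + 1):
        for x_len in range(xy_len):
            yield w[:x_len], w[x_len:xy_len], w[xy_len:]


def every_decomposition_fails(p: int, i: int) -> bool:
    """True if x y^i z is outside L_4 for every allowed decomposition of 0^p 1 0^p 1."""
    w = "0" * p + "1" + "0" * p + "1"
    return all(not in_L4(x + y * i + z) for x, y, z in decompositions(w, p))


l4_pumping_fails = all(
    every_decomposition_fails(p, i) for p in range(1, 12) for i in (0, 2)
)
print(l4_pumping_fails)  # True
```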