ECS 120 Theory of Computation
Nonregular Languages
Julian Panetta
University of California, Davis

Recall: Regular Languages

  • We have shown that the following classes of languages are equivalent:

    • DFA-decidable
    • NFA-decidable
    • RG-decidable
    • Regex-decidable
  • We call this class of languages the regular languages.

  • Are all languages regular?

    • To prove a language is regular we can exhibit a DFA, NFA, regex, or RG that decides it.
    • How can we prove a language is not regular?

Why we need rigor

Let \(\#(a, b)\) denote the number of occurrences of \(a \in \Sigma^*\) as a substring in \(b \in \Sigma^*\).

Claim: the following language is not regular: \[ L_2 = \setbuild{w \in \binary^*}{\#(0, w) = \#(1, w)} \] because deciding it requires counting the number of occurrences of symbols \(0\) and \(1\),
needing an unbounded number of states.

  • Correct
  • Incorrect

Why we need rigor

Let \(\#(a, b)\) denote the number of occurrences of \(a \in \Sigma^*\) as a substring in \(b \in \Sigma^*\).

Claim: the following language is not regular: \[ L_3 = \setbuild{w \in \binary^*}{\#(01, w) = \#(10, w)} \] because deciding it requires counting the number of occurrences of substrings \(01\) and \(10\),
needing an unbounded number of states.

  • Correct
  • Incorrect
  • Compares the number of “rising edges” (\(\textcolor{purple}01\)) and “falling edges” (\(\textcolor{orange}10\)).
  • \(000\underbrace{\textcolor{purple}01}_{\textcolor{purple}1}1\underbrace{\textcolor{orange}10}_{\textcolor{orange}1}0\underbrace{\textcolor{purple}01}_{\textcolor{purple}2}111\underbrace{\textcolor{orange}10}_{\textcolor{orange}2}0\underbrace{\textcolor{purple}01}_{\textcolor{purple}3}111\underbrace{\textcolor{orange}10}_{\textcolor{orange}3}00\) vs. \(\underbrace{\textcolor{orange}10}_{\textcolor{orange}1}00\underbrace{\textcolor{purple}01}_{\textcolor{purple}1}1\underbrace{\textcolor{orange}10}_{\textcolor{orange}2}0\underbrace{\textcolor{purple}01}_{\textcolor{purple}2}111\underbrace{\textcolor{orange}10}_{\textcolor{orange}3}0\underbrace{\textcolor{purple}01}_{\textcolor{purple}3}111\underbrace{\textcolor{orange}10}_{\textcolor{orange}4}00\)
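  • Equal to the regular language \(\setbuild{w \in \binary^*}{w = \emptystring \text{ or } w[1] = w[|w|]}\): rising and falling edges alternate, so the two counts match exactly when \(w\) is empty or starts and ends with the same symbol.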
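The characterization of \(L_3\) above can be checked by brute force on short strings. The following is a minimal Python sketch, not a proof; the helper names (`count_occurrences`, `in_L3`, `same_first_and_last`, `check_up_to`) are hypothetical and chosen only for illustration.

```python
from itertools import product


def count_occurrences(pattern: str, text: str) -> int:
    """Count (possibly overlapping) occurrences of pattern as a substring of text."""
    return sum(
        text.startswith(pattern, i)
        for i in range(len(text) - len(pattern) + 1)
    )


def in_L3(w: str) -> bool:
    """Membership in L_3, directly from the definition #(01, w) = #(10, w)."""
    return count_occurrences("01", w) == count_occurrences("10", w)


def same_first_and_last(w: str) -> bool:
    """The claimed characterization; the empty string counts as a member."""
    return w == "" or w[0] == w[-1]


def check_up_to(max_len: int) -> bool:
    """Compare both predicates on every binary string of length <= max_len."""
    return all(
        in_L3("".join(bits)) == same_first_and_last("".join(bits))
        for n in range(max_len + 1)
        for bits in product("01", repeat=n)
    )


print(check_up_to(12))  # True
```

Agreement on all strings up to a fixed length is only evidence; the rigorous argument is that edges alternate between rising and falling.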

Two Rigorous but Ad-hoc Proofs

Let’s show (for real this time) that the following language is not regular (DFA-decidable): \[ L_1 = \setbuild{0^n 1^n}{n \in \mathbb{N}} = \{\emptystring, 01, 0011, 000111, \ldots\} \]

  • Suppose for the sake of contradiction that there is a DFA \(D = (Q, \Sigma, \delta, s, F)\) deciding \(L_1\).
  • Consider a string \(0^n 1^n \in L_1\) with \(n \ge |Q|\).
    • The DFA must accept this string with a computation sequence: \[ r_0 \fragment{\transition{0} r_1} \fragment{\transition{0} \cdots} \fragment{\transition{0} r_{n}} \fragment{\transition{1} r_{n + 1}} \fragment{\transition{1} \cdots \transition{1} r_{2n}} \] where \(r_0 = s\) and \(r_{2n} \in F\).

    • Since \(n + 1 > |Q|\) states appear in \((r_0, r_1, \ldots, r_n)\), at least one must repeat (Pigeonhole Principle): \[ r_0 \cdots \overbrace{\underbrace{r_i \transition{0} \cdots \transition{0} r_j}_{r_i = r_j}}^{\text{substring $y=0^{j-i}$ follows a cycle}} \cdots r_n \transition{1} \cdots \transition{1} r_{2n} \qquad\qquad \]

      (Figure: data/images/nonregular/pumping_vis.svg)
    • The cycle \(r_i \transition{0} \cdots \transition{0} r_{j}\) is nonempty (\(i < j\)), so it can be repeated \(k\) more times (overlapping \(r_i, r_j\)), giving a new computation sequence that accepts the string \(x=0^{n + (j - i) k} 1^n\) for any \(k \in \N^+\).

    • But since \(j - i \ge 1\) and \(k \ge 1\), we have \(n+(j-i)k \ne n\), so \(x \notin L_1\). This contradicts \(L(D) = L_1\), since \(D\) accepts \(x\).

This is the idea behind the Pumping Lemma (optional sections 7.7-7.9).
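The pumping step in the proof above can be sketched in Python. This is an illustration under stated assumptions: the DFA is encoded as a dictionary `delta` from (state, symbol) pairs to states, and the example machine (which decides \(0^* 1^*\)) is a hypothetical stand-in for a failed attempt at deciding \(L_1\). The function names `run` and `pump` are also made up for this sketch.

```python
def run(delta, state, w):
    """Return the states r_0, r_1, ..., r_|w| visited while reading w."""
    trace = [state]
    for symbol in w:
        state = delta[(state, symbol)]
        trace.append(state)
    return trace


def pump(delta, start, accepting, num_states):
    """Find a repeated state among r_0..r_n and pump the cycle once (k = 1)."""
    n = num_states
    trace = run(delta, start, "0" * n)  # n + 1 states, so one must repeat
    first_seen = {}
    for j, r in enumerate(trace):
        if r in first_seen:
            i = first_seen[r]
            break
        first_seen[r] = j
    else:
        raise ValueError("no repeat found: num_states is smaller than |Q|")

    def accepts(w):
        return run(delta, start, w)[-1] in accepting

    original = "0" * n + "1" * n
    pumped = "0" * (n + (j - i)) + "1" * n
    return original, pumped, accepts(original), accepts(pumped)


# Hypothetical example DFA: it decides 0*1*, not L_1.
delta = {
    ("s", "0"): "s", ("s", "1"): "t",
    ("t", "0"): "dead", ("t", "1"): "t",
    ("dead", "0"): "dead", ("dead", "1"): "dead",
}
print(pump(delta, "s", {"s", "t"}, 3))
# ('000111', '0000111', True, True)
```

The last two entries always agree, because the pumped computation ends in the same state as the original one. Since only the first string is in \(L_1\), the DFA is wrong on at least one of them.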

Two Rigorous but Ad-hoc Proofs

Let’s show (a different way) that the following language is not regular (DFA-decidable): \[ L_1 = \setbuild{0^n 1^n}{n \in \mathbb{N}} = \{\emptystring, 01, 0011, 000111, \ldots\} \]

  • Suppose again we have a DFA \(D = (Q, \Sigma, \delta, s, F)\) deciding \(L_1\).
  • Consider processing two different strings \(x, y \in \Sigma^*\) on separate copies of \(D\).
  • Now suppose that by feeding the suffix string \(z \in \Sigma^*\) into both copies we find that: \[ xz \in L_1 \quad \text{but} \quad yz \notin L_1. \]
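  • Then running \(D\) on \(x\) must have reached a different state than running it on \(y\)! If both copies were in the same state, reading \(z\) would take them to the same final state, so \(D\) would accept both \(xz\) and \(yz\) or reject both.
  • Let’s apply this logic to the strings \(0^i\) and \(0^j\) for any \(i \ne j\):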
    • Taking \(z = 1^i\), we find: \[ 0^i 1^i \in L_1 \quad \text{but} \quad 0^j 1^i \notin L_1. \]
    • Therefore \(D\) must be in a different state after processing \(0^i\) than after processing \(0^j\).
    • In other words, reading \(0^i\) must put \(D\) into a distinct state for each \(i \in \mathbb{N}\).
    • This means \(D\) must have an infinite number of states, which is disallowed for DFAs!

This is the idea behind the Myhill-Nerode Theorem (sections 7.2-7.5).
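The argument above can be turned into a small procedure that, for any concrete DFA, finds a string on which it disagrees with \(L_1\). The Python sketch below is illustrative only: the dictionary encoding of the DFA, the example machine (which decides \(0^* 1^*\)), and the names `final_state`, `in_L1`, and `find_misclassified` are all assumptions made for this example.

```python
def final_state(delta, start, w):
    """Return the state the DFA is in after reading w."""
    state = start
    for symbol in w:
        state = delta[(state, symbol)]
    return state


def in_L1(w):
    """Membership in L_1 = { 0^n 1^n : n >= 0 }."""
    n = len(w) // 2
    return w == "0" * n + "1" * n


def find_misclassified(delta, start, accepting, num_states):
    """Find 0^i and 0^j reaching the same state, then test the suffix z = 1^i."""
    seen = {}  # state -> i such that 0^i reaches that state
    for j in range(num_states + 1):
        q = final_state(delta, start, "0" * j)
        if q in seen:
            i = seen[q]
            # Both strings end in the same state, but only the first is in L_1.
            for w in ("0" * i + "1" * i, "0" * j + "1" * i):
                if (final_state(delta, start, w) in accepting) != in_L1(w):
                    return w
        seen[q] = j
    raise ValueError("no collision found: num_states is smaller than |Q|")


# Hypothetical example DFA: it decides 0*1*, not L_1.
delta = {
    ("s", "0"): "s", ("s", "1"): "t",
    ("t", "0"): "dead", ("t", "1"): "t",
    ("dead", "0"): "dead", ("dead", "1"): "dead",
}
print(find_misclassified(delta, "s", {"s", "t"}, 3))  # 0
```

Here \(\emptystring\) and \(0\) reach the same state, so with \(z = \emptystring\) the machine gives the same answer for both and wrongly accepts \(0\).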

Separating Extensions and \(L\)-equivalence

First, some definitions:

Given a language \(L \subseteq \Sigma^*\),
the strings \(x, y \in \Sigma^*\) are called \(L\)-separable or \(L\)-distinguishable
if there exists a string \(z \in \Sigma^*\) such that:

\[ xz \in L \iff yz \notin L \]

That is, exactly one of \(xz\) and \(yz\) is in \(L\).

This \(z\) is called a separating (or distinguishing) extension for \(x, y\).

If \(x\) and \(y\) are not \(L\)-separable,
then we say they are \(L\)-equivalent, denoted \(x \sim_L y\).

In other words, \(x \sim_L y\) means that for all \(z \in \Sigma^*\): \[ xz \in L \iff yz \in L \]

Note that \(\sim_L\) is an equivalence relation on \(\Sigma^*\), so it partitions \(\Sigma^*\) into equivalence classes of strings that are indistinguishable by \(L\).
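A short sketch of why \(\sim_L\) is an equivalence relation, using only the definition above (the third string is named \(v\) here purely for illustration):

  • Reflexive: \(xz \in L \iff xz \in L\) for every \(z \in \Sigma^*\), so \(x \sim_L x\).
  • Symmetric: \(\iff\) is symmetric, so \(x \sim_L y\) implies \(y \sim_L x\).
  • Transitive: if \(x \sim_L y\) and \(y \sim_L v\), then for every \(z \in \Sigma^*\): \[ xz \in L \iff yz \in L \iff vz \in L, \] so \(x \sim_L v\).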

Separating Extensions and \(L\)-equivalence: Examples

\[ L_1 = \setbuild{0^n 1^n}{n \in \mathbb{N}} = \{\emptystring, 01, 0011, 000111, \ldots\} \]

  • Do strings \(00\) and \(000\) have a separating extension with respect to \(L_1\)?
    • \(z = 11\) is one: \(\quad\ \qquad 00 11\ \ \in L_1 \ \ \ \ \, \text{but} \quad 00011\ \ \notin L_1\).
    • \(z = 111\) is another: \(\quad 00 111 \notin L_1 \quad \text{but} \quad 000111 \in L_1\).
    • Thus \(00 \not \sim_{L_1} 000\).
  • Do strings \(010\) and \(0010\) have a separating extension with respect to \(L_1\)?
    • No, because no string of the form \(0^+1^+0^+\) is a prefix of any string in \(L_1\), i.e.: \[ (\forall z \in \Sigma^*) \quad 010z \notin L_1 \quad \text{and} \quad 0010z \notin L_1 \]
    • Thus \(010 \sim_{L_1} 0010\).
  • How many equivalence classes are there for \(\sim_{L_1}\)?
    • \(\emptystring, 0, 00, 000, \ldots\) are all mutually \(L_1\)-inequivalent. \[ x=0^i \text{ and } y=0^k \text{ have the separating extension } z = 1^i \text{ for all } 0 \le i < k \]
    • Therefore there are an infinite number of equivalence classes.
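The first two examples above can be reproduced with a bounded search for a separating extension. The Python sketch below is a minimal illustration; `separating_extension` and `in_L1` are hypothetical helper names, and the search only considers extensions up to a chosen length, so a result of `None` means "none found within the bound" and is not a proof of \(L\)-equivalence.

```python
from itertools import product


def in_L1(w):
    """Membership in L_1 = { 0^n 1^n : n >= 0 }."""
    n = len(w) // 2
    return w == "0" * n + "1" * n


def separating_extension(x, y, in_L, max_len, alphabet="01"):
    """Return a shortest z with |z| <= max_len such that exactly one of
    x + z and y + z is in L, or None if no such z exists within the bound."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            z = "".join(chars)
            if in_L(x + z) != in_L(y + z):
                return z
    return None


print(separating_extension("00", "000", in_L1, 6))    # 11
print(separating_extension("010", "0010", in_L1, 6))  # None
```

For \(010\) and \(0010\), the slide's argument shows that no extension of any length works, which is what the bounded search cannot establish on its own.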

Myhill-Nerode Theorem

Theorem (Myhill-Nerode):
A language \(L\) is regular if and only if \(\sim_L\) defines a finite number of equivalence classes.

Furthermore, the number of equivalence classes equals the number of states in the
minimal DFA deciding \(L\).

To prove a language is nonregular we need only one direction of this theorem:

Corollary:
If a language \(L\) has an infinite number of equivalence classes with respect to \(\sim_L\),
then \(L\) is nonregular.

Equivalent corollary:
If for language \(L\) we can construct a set \(S\) of pairwise \(L\)-inequivalent strings with \(|S| = \infty\),
then \(L\) is nonregular.
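To get a feel for the theorem, one can estimate the number of equivalence classes by grouping short strings according to which short extensions lead into the language. The Python sketch below does this under clearly limited assumptions: it only looks at prefixes and extensions up to fixed lengths, so the count it reports is a lower bound on the true number of classes. The names `words`, `count_signatures`, `in_L1`, and `in_L3` are hypothetical.

```python
from itertools import product


def words(max_len, alphabet="01"):
    """Yield every string over the alphabet of length at most max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)


def count_signatures(in_L, prefix_len, ext_len):
    """Count distinct membership patterns of x + z over all short extensions z.

    Strings with different patterns are certainly L-inequivalent, so the
    result is a lower bound on the number of equivalence classes."""
    extensions = list(words(ext_len))
    return len({
        tuple(in_L(x + z) for z in extensions)
        for x in words(prefix_len)
    })


def in_L1(w):
    n = len(w) // 2
    return w == "0" * n + "1" * n


def in_L3(w):
    # str.count is safe here: "01" and "10" cannot overlap themselves.
    return w.count("01") == w.count("10")


print(count_signatures(in_L3, 4, 2))  # 5
print([count_signatures(in_L1, m, m) for m in range(1, 6)])
```

For the regular language \(L_3\) the count settles at 5, consistent with a five-state DFA. For \(L_1\) the count keeps growing as the bound \(m\) grows, as expected for a language with infinitely many classes.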

Myhill-Nerode Theorem: Examples

\[ L_1 = \setbuild{0^n 1^n}{n \in \mathbb{N}} = \{\emptystring, 01, 0011, 000111, \ldots\} \]

Claim: \(L_1\) is nonregular.

Proof:

  • Define \(S = \setbuild{0^n}{n \in \mathbb{N}} = \{\emptystring, 0, 00, 000, \ldots\}\).
  • Each pair of strings \(0^i, 0^k \in S\) with \(i \ne k\) has a separating extension \(1^i\): \[ 0^i 1^i \in L_1 \quad \text{but} \quad 0^k 1^i \notin L_1, \] meaning \(0^i \not \sim_{L_1} 0^k\).
  • Therefore, \(L_1\) is nonregular by the Myhill-Nerode theorem.

Myhill-Nerode Theorem: Examples

\[ L_2 = \setbuild{w \in \binary^*}{\#(0, w) = \#(1, w)} \]

Claim: \(L_2\) is nonregular.

Proof:

  • Define \(S = \setbuild{0^n}{n \in \mathbb{N}} = \{\emptystring, 0, 00, 000, \ldots\}\).
  • Each pair of strings \(0^i, 0^k \in S\) with \(i \ne k\) has a separating extension \(1^i\): \[ 0^i 1^i \in L_2 \quad \text{but} \quad 0^k 1^i \notin L_2, \] meaning \(0^i \not \sim_{L_2} 0^k\).
  • Therefore, \(L_2\) is nonregular by the Myhill-Nerode theorem.

The exact same proof used for \(L_1\) applies to \(L_2\)!
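That the same set \(S\) and the same separating extensions work for both languages can be sanity-checked on a finite portion of \(S\). This Python sketch is only a spot check under an arbitrary bound; the function names are hypothetical.

```python
def in_L1(w):
    """Membership in L_1 = { 0^n 1^n : n >= 0 }."""
    n = len(w) // 2
    return w == "0" * n + "1" * n


def in_L2(w):
    """Membership in L_2: equally many 0s and 1s."""
    return w.count("0") == w.count("1")


def pairwise_separated(in_L, bound):
    """Check that z = 1^i separates 0^i from 0^k for all i != k below bound."""
    return all(
        in_L("0" * i + "1" * i) and not in_L("0" * k + "1" * i)
        for i in range(bound)
        for k in range(bound)
        if i != k
    )


print(pairwise_separated(in_L1, 30), pairwise_separated(in_L2, 30))  # True True
```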

Myhill-Nerode Theorem: Examples

\[ L_2 = \setbuild{w \in \binary^*}{\#(0, w) = \#(1, w)} \]

Claim: \(L_2\) is nonregular.

Alternate proof by closure:

  • Assume for the sake of contradiction that \(L_2\) is regular.
  • Note that the language described by the regex \(0^* 1^*\), namely \(R = \setbuild{0^i 1^j}{i, j \in \mathbb{N}}\), is regular,
    and \(L_2 \cap R = \fragment{\setbuild{0^n 1^n}{n \in \mathbb{N}}} \fragment{= L_1}\).
  • By closure of DFA-decidable languages under intersection,
    \(L_1\) must also be regular.
  • But this contradicts our proof that \(L_1\) is nonregular, so \(L_2\) must actually be nonregular!
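The key identity in the closure proof, that intersecting \(L_2\) with the strings of the form \(0^* 1^*\) gives exactly \(L_1\), can be spot-checked on all short binary strings. The Python sketch below uses the standard `re` module for the regex; the function names and the length bound are assumptions made for illustration.

```python
import re
from itertools import product

ZEROS_THEN_ONES = re.compile(r"0*1*")


def in_L1(w):
    """Membership in L_1 = { 0^n 1^n : n >= 0 }."""
    n = len(w) // 2
    return w == "0" * n + "1" * n


def in_L2(w):
    """Membership in L_2: equally many 0s and 1s."""
    return w.count("0") == w.count("1")


def first_disagreement(max_len):
    """Return a string where (L_2 intersect 0*1*) and L_1 differ, or None."""
    for n in range(max_len + 1):
        for chars in product("01", repeat=n):
            w = "".join(chars)
            in_intersection = in_L2(w) and ZEROS_THEN_ONES.fullmatch(w) is not None
            if in_intersection != in_L1(w):
                return w
    return None


print(first_disagreement(12))  # None
```

A result of `None` means no counterexample exists up to the chosen length; the identity itself follows directly from the two definitions.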