Neural Networks, Finite State Machines, and Binary Encoding by Affine Hyperplanes
Ghkkhg
23:17
I'll focus today on the relationship between neural networks and finite state machines (aka finite state automata, or FSAs for short). Why? Interweb searches have informed me that the ability of a neural network to "learn" how to model finite state machines is one reason people first got excited about neural networks. Even better, FSAs are reasonably robust models for computation. If you build one big enough, you have yourself something that looks (to first order) like a sentient being. Not a very smart sentient being, but it'll do as a start...
Before I begin, I want to mention a couple of people with whom I've been chatting semiregularly about related stuff: Jesse Johnson and Tony Licata. They've both got a lot of other interesting things going on in their mathematical lives, but I've really benefitted from their turning their attention in this direction on occasion.
I will also link to a number of sources I found when I googled around about this. I am 100% sure I am missing some of the important references, since I am not an expert in this area. OTOH, if you're looking for other important references, they are very likely to be among the ones cited in the sources below. The point here is that none of these ideas are new to the world. They are just new to me:
 A logical calculus of the ideas immanent in nervous activity (1943, McCullough, Pitts)
 Computation: Finite and Infinite Machines (1967, Minsky)
 Recurrent Neural Networks and Finite Automata (1996, Siegelmann)
 Finite State Automata and Simple Recurrent Networks (1998, Cleeremans, ServanSchreiber, McClelland)
 Fool's Gold: Extracting Finite State Machines From Recurrent Network Dynamics (mid90s, Kolen)
Finite State Automata
OK, what's a finite state automaton? Informally, it's a collection of (finitely many) states and (finitely many) rules for transitioning between them.
More precisely/formally, the core structure of a finite state automaton is a triple
is a set of finitely many states,S={s1,…sN} is a set of finitely many transitions, andT={t1,…,tM} is a function that assigns to a pairμ:S×T→S the result of applying transition(si,tj)tj to statesi .
(The finite state automata one encounters in nature may have a bit more structure, e.g. a distinguished initial state and maybe an output function or something like that. But the core is what's described above.)
One common and natural way to describe a finite state automaton with
N
M
 The vertices of the graph are labeled by the states,
s1 .,…,sN  For each
i∈{1,…,N} there are directed edges with tails on the vertexMsi .  These
edges are labeledMt1 , and the head of edge,…,tMtj with tail onsi is the state .μ(si,tj)
Let's come up with an example, so it's not so boring abstract. I'll describe an FSA modeling the finitely many emotional states of an extremely simple (but...let's face it...sort of interesting) human named Charlotte with extremely simple (but also...let's face it...sort of interesting) reactions to time spent with her two friends Alice and Bob.
The set of Charlotte's emotional states is
Poor Charlotte. She needs some new friends.
OK, we have an example. But what can a finite state automaton do? Well, I'm gonna be honest...nothing too terribly exciting beyond telling you where you'll go next if you are at state
How? Well, given an ordered list of transitions, it can calculate the ordered list of states it passes through via that ordered list of transitions. This provides a means of modeling a decision problem whose inputs are finitelength strings in a finite alphabet. Simply designate some subset of the states as "Accept" and the remainder as "Fail" states. Now output a "1" if the end state is an accept state and a "0" if the end state is a fail state.
Going back to the example of Charlotte's FSA, someone (her therapist?) might choose to designate "Sad" as the only fail state, and the other 3 states as accept states. It's now not so hard to see that given a finite length string describing Charlotte's interactions with Alice and Bob, the FSA will output a
So, e.g., the string
BTW, while we're on the subject: finite state automata appear to play a prominent role in the theoretical study of how humans use and interpret language. And I should also note that this study overlaps with some areas of great interest to theoretical mathematicians. For example, the group theorists among you may notice that Poor Charlotte's FSA is precisely the one we'd use to answer the question of whether a word representing an element of the group
Neural Networks as Functions: Basic Structure
The meaning of the words "neural network" depends very much on who is uttering them, and in what context.
When I say those words here, I will mean a function
F:
Rn
→
Rn′
F=
F
…∘
F
∘
F
F
:
Rnk
→
Rnk+1
I'll unpack that slowly.
Let me start with the kind of drawing of a neural network one often sees in the news, since it will organize us:
We represent a neural network by a bunch of nodes (neurons) and lines (connections). The nodes are arranged in columns called layers. The number of hidden layers (those layers between the input layer on the left and the output layer on the right) tells us how many smoothed binary encoding functions we need to compose to describe the function, and the number of nodes in each layer tells us the dimensions of the vector spaces we encounter along the way.
In the picture I drew above, the neural network is a function
F:
R3
→
R5
F=
F
∘
F
∘
F
,
where
 F:R3→R2
 F:R2→R4
.F:R4→R5
Now that we understand the basic structure of a neural network, let's focus in on these functions
F
Let me start by saying what I mean by a binary encoding function by affine hyperplanes, then explain why this is the right picture to have in mind to start understanding neural networks.
Binary Encoding by Affine Hyperplanes
First, when I say an affine hyperplane in a vector space
∈
Rn
If you've recently taken a course in linear algebra (or if you've read one of my earlier posts on this blog, e.g. this one), you know that an affine hyperplane is always (nonuniquely) describable as the set of solutions of a single generic linear equation. That is, fix a nonzero vector
∈
Rn
∈
Rn
⋅
x⃗
=b
One important point is that any affine hyperplane in
∈
Rn
,b∈R
∈
Rn
⋅
x⃗
=b
H
(
a⃗
,b)
:={
x⃗
∈
Rn

a⃗
⋅
x⃗
<b}
H
(
a⃗
,b)
:={
x⃗
∈
Rn

a⃗
⋅
x⃗
>b}
Here's an example of what I mean, for the hyperplane in
∈
R2
(1,1)⋅
x⃗
=1
Notice that we're repeating (in slightly different language) some things we said in earlier posts. Namely, an affine hyperplane
H
θ:
Rn
∖H→{0,1}
which assigns a
H
H
Now, suppose you have
H
,…,
H
(
a⃗
,
bi
)
Θ:
Rn
∖(
H
∪…∪
H
)→{0,1
}m
from the complement of the union of the hyperplanes to the set of binary
0 ifa⃗ , and⋅x⃗<bi1 ifa⃗ .⋅x⃗>bi
Moreover, the value this function assigns to a vector
∈
Rn
∈
Rn
Here's an example of a binary encoding by 5 affine hyperplanes in
Phew! OK, that took longer than expected to explain. But it's a simple idea, right?
Smoothing a Binary Encoding by Affine Hyperplanes
The maps between layers of a neural network are really smoothed versions of this binary encoding function. Rather than having a step function along the hyperplanes for each component of the function (
you have a sigmoid function
1+
e−x
as one passes from one side of a hyperplane to the other.
Indeed, this is precisely what the function between adjacent layers of a neural network is. Each component of the function
F
:
Rnk
→
Rnk+1
σ(
a⃗
⋅
x⃗
−b)
∈
Rnk
,b∈R
So that's what each layer of a neural network is: a smoothed version of a binary encoding function by affine hyperplanes. This is a (maybe) different way of understanding stuff we discussed before. Woot!
A (Recurrent) Neural Network for Poor Charlotte's FSA
Now suppose we are trying to model a finite state automaton with
N
M
,…,
s
N
,…,
t
M
F:(
Rk
×
Rℓ
)→
Rℓ is a composition of smoothed binary encoding functionsF .F(si,tj)=μ(si,tj)
Note that we're abusing notation in the second line above. The inputs to the function
F
×
Rℓ
Now notice that if we're given an ordered list of transitions, then these can be fed into the neural network above in sequence. Moreover, the output to the neural network at one stage becomes part of the input to the neural network at the next stage. People call this type of gadget a recurrent neural network.
Example! Example!
OK. Now we're ready to describe a recurrent neural network model for Poor Charlotte's FSA. Recall that Poor Charlotte's
Let's encode the states (in the same order) in
Then the binaryencoded version of Poor Charlotte's FSA is the function
F:
R2
⊕
R2
→
R2
Here's a diagram of the encoded FSA, viewed as a directed labeled graph:
Now I claim that we can find
,
a2
,
a3
,
a4
,b∈R
+
a3
+
a4
+
a3
+
a4
+
a2
+
a3
+
a2
+
a4
and
,
c2
,
c3
,
c4
,d∈R
+
c3
+
c4
+
c3
+
c4
+
c2
+
c3
+
c2
+
c4
d
d
d
d
d
d
d
d
This is something you can check by hand in this case by, e.g., trying to plot numbers satisfying the above inequalities on a number line and seeing whether you succeed or fail. Here are two plots that look like they work for these two collections of inequalities:
It's not a general method, true. But it at least gets us started thinking about how this might go in general. Farkas' Lemma (sometimes called Gale's Theorem) seems like the right result to use to understand this linear separability question in general, so maybe I'll say more about this later if I have the inclination. I learned of it (though not quite yet about it) from this awesome set of lecture notes online. And, of course, this tells us nothing about how to nonlinearly separate a binary encoding. This would necessarily require a deeper recurrent neural network, i.e. one with more layers.
A topic for another time.
No comments :
Post a Comment