Peter M. Asaro
5-10-98
CS346, Pattern Recognition
& Machine Learning
Prof. Roth
 

Temporal Pattern Recognition with Neural Networks

 

Neural networks are well known for their abilities to recognize patterns. That is, they have been shown to be quite successful in correctly classifying complex patterns. This success depends in large part upon on carefully establishing the classification problem to be solved by the neural network and the appropriate use of data in training it. The aim of this project was to sketch out the problem of classifying temporal patterns, and to compare several neural network approaches to this problem. The general problem of temporal pattern recognition is central to much of economic forecasting, systems modeling, or any domain in which time-series data is collected and analyzed with the purpose of making enhanced predictions about the likely outcomes of observed trends or identifying recurrent patterns. In many such applied domains, the problem can often be constrained to a much more specific one for which mathematical tools already exist to address it, such as linear regression methods for time-series analysis in economics. This project starts with time-series data from a domain for which such tools and methods have not yet been fully developed-human gesture recognition. It is hoped that by studying these data sets, it would shed light on the computational basis of recognizing patterns in time and what applications neural networks might offer to this problem.

The paper consists of three main sections. The first section reviews some of the recent literature on using neural networks for recognizing temporal patterns. Several different architectures are suggested in the literature, and these were implemented and trained on the gesture data. The second section formulates the problem and explains the process by which the neural networks were designed to address it. The final section presents the results and an analysis of the different architectures. It concludes with a discussion of some of the possible extensions of this work.
 
 

I. Unfolding Spatial Recognition in Time

 

Much of the success of neural networks-specifically, supervised backpropagation neural networks-has been built on spatial recognition tasks. In a spatial recognition task, the input layer of the neural network is fed a spatially organized activation pattern, represented as a numerically-valued array, and the network is trained to match its output to a desired spatially organized array. Some of the spatial recognition tasks which neural networks have done particularly well on are hand-written character recognition tasks, facial recognition tasks, and even following a road while driving a car (Mitchell 1997, Hinton 1987). In each of these tasks, the input is conceived of as a two dimensional array which represents a visual field, but is implemented as a one dimensional array. Adding color and luminosity information further adds to the dimensionality in a sense, but even 32-bit pixels can ultimately be represented as one dimensional real-valued or binary strings. What essentially distinguishes the spatial recognition task is that, regardless of how the inputs are internally structured, there is a single static (during propagation) array which represents the entire input to the network. Temporal organization differs from spatial organization in many ways, yet the basic approaches to temporal pattern recognition mostly rely on translating temporal relationships into spatial relationships-folding temporal events into spatial features.

The application of neural networks to the problem of recognizing temporal patterns is not new. The main body of recent research on the matter comes from the areas of dynamic signal processing and speech recognition as well as economics. The field has been centered around various forms of hidden Markov Model (HMM) representations of speech recognition (Lang et al. 1990), stochastic process models for time-series analysis (Drossu 1996, Eisenstein et al. 1995), and Finite State Automata (FSA) simulation (Cleeremans et al. 1995). The first two approaches basically model temporal organization as a probabilistic network of different events connected in such a fashion that certain events render other events more or less likely, as represented by changes in the activation state of the network. Cleeremans et al.'s work on FSA's resulted in a model they call "graded state automata" which they show to be a very close probabilistic approximation to an FSA. All of these representational approaches are stochastic state models in that there is always a determinable, probabilistic activation state of the model based on the inputs to the system. The differences between them are certainly significant, but the similarities are perhaps more interesting.

There are three basic types of temporal patterns which these approaches attempt to capture: Delayed Features, Sequences, and Predictions. Delayed feature recognition is a simple variation on spatial pattern recognition with the difference being that the features of the pattern must be recognized within a time stream. E.g., recognizing when the visual input counts as an approaching bend in the road is the recognition of a static and temporally isolate "bend" pattern but one which must be recognized from a real-time stream of input patterns. Delayed features are a kind of pattern in time rather than a genuine temporal pattern. To observe the general pattern of something like "direction of motion" it is necessary to observe an object at more than one time and encode both its position at each time and the relative times or order in which each observation occurred. Similarly, in order to recognize the intended meaning of the word "ball" in the sentences: "The Countess threw the ball." and "The pitcher threw the ball." (from Cleeremans et al. 1995), it is necessary to understand the context in which the word appears. This is a clear demonstration of the need to determine distant and complex temporal relations in natural language processing. Contextual interpretation thus requires more than proximate temporal relations, it also requires the recognition of specific sequences of events in time. Sequences are the main focus of research in the natural language approaches to temporal pattern recognition. Whether the task is to recognize a phoneme (Lang et al. 1990) or a grammar (Cleeremans et al. 1995), linguistic and speech recognition is largely construed as a problem of recognizing temporally embedded patterns of words or utterances.

The classification of both delayed features and sequences involves the recognition of a target pattern within a set of observations. Prediction differs from these tasks in that it is necessary to predict the value of a set of features rather than a classification. Prediction can sometimes be reduced to the other tasks. Predicting from a series of weather conditions that tomorrow's weather will be "partly cloudy" involves matching a temporal pattern to a classification, whereas predicting the temperature, humidity and barometric pressure requires a more complex function, or at least a function with a greater range of outputs. In general, prediction involves projecting the future states of multiple variables based on the historical states of those variables. This problem has been widely studied by economists, social and natural scientists attempting to build models which will provide useful indications of the complex behaviors of systems like markets, populations, atmospheres and ecologies. Neural networks have been used to simulate various traditional approaches to time-series prediction, such as curve-fitting, linear regression, and even autoregressive moving average (ARMA) stationary stochastic models (Drossu 1996). The traditional manual application of these techniques, the building of time-series functions for specific phenomena, is difficult and time-consuming. Neural networks have shown reasonable success in approximating these functions and their parameters, and while they themselves can require the tuning of many parameters, they promise to greatly enhance the speed with which such analysis may be conducted.

There have been three basic architectural design approaches for temporal pattern recognition in neural networks: Streaming inputs, Buffered series, and Recurrent. Each of these acknowledge that the fundamental difficulty lies in the fact that neural networks have a static input layer. Since they process very quickly once trained, however, they can be fed a streaming series of inputs in real-time, as in the car-driving ALVINN neural network (Pomerleau 1993), provided that they are only expected to recognize patterns that are bounded within a single input vector. Streaming input architectures rely on the speed of spatial recognition in neural nets to pick out a static pattern from a stream of unrelated inputs. For temporal patterns that can be bounded in a finite length of time, it is possible to expand the number of inputs in the network to accommodate the longest possible series as a single input to the network-to create a buffer which holds multiple inputs. By buffering inputs, a neural network can be trained to recognize relationships between inputs occurring at different times. This technique has been used to recognize bounded-length spectrograms of speech elements on the order of milliseconds, but does not scale well to lengthier series as the number of connections, and therefore training time, grows exponentially with the size of the buffer. A combination of these two approaches is the Sliding Window architecture which feeds a segment of the time-series, in order, into the input layer at time1, then shifts the segment, removing the oldest time unit from one end and adding the new unit for time2 to the other end. The result is that the neural network is trained over the ordered series by seeing its progression through a bounded window, which allows the recognition of temporal relationships without requiring prohibitive numbers of inputs. One drawback to this technique is that it only offers a limited scope for recognizing temporal patterns and ignores patterns which extend beyond the length of the observation window. Thus the window must be at least as long as the longest recognizable pattern.

The most sophisticated methods for temporal pattern recognition employ various forms of Recurrent architectures. One such method implements a large set of internal nodes which act as a kind of stack that stores the input history and adds this to newly incoming inputs at the hidden layer of the network. Like the buffered series technique, this results in very large networks and exponential training times as the size of the internal stack is increased to accommodate longer histories. Another approach is to add a fully connected recurrent layer whose state represents the activation states of the network at the previous time step. Backpropagation training in such a network can be very inefficient, however, and it is not always clear how to propagate errors back through time. A more elegant solution is to insert a singly connected recurrent layer which "remembers" a partially processed input history. Mozer (1995) calls this a Context layer and derives a modified update rule for backpropagation with the context layer. He argues that while other recurrent methods simply store a single previous state, a context layer stores a robust dynamic representation of the past by efficiently focusing error propagation to the past states encoded in that layer.

There are, of course, many modifications and variations of these basic approaches discussed in the literature. There are also many temporal relationships not addressed by any of these models, including tempo or rate recognition, rate-independent pattern recognition, and change-of-rate recognition. For this project, just the three basic architectures were selected and applied to the problem of gesture recognition. We now turn to a discussion of the project.
 
 
 
 

II. Tracking Points Moving Through Space

 

Before discussing the implementation of the neural networks themselves, it is necessary to consider the nature of the data itself and the framing of the problem. The data consisted of three time-series which represented three separate instances of a human performing an identical sequence of bodily gestures or set of motions. That is, the human attempted to perform the same set of motions in each of the three sequences. The data itself was recorded from two electro-magnetic tracking devices attached to the performer's head and wrist. The devices are typically used to track the head and wand/mouse positions in the CAVE immersive virtual environment (The CAVE is a 10x10x10 room in which users stand, with four projection surfaces-right, left, front, floor-that are rapidly updated in response to changes in the user's position, head orientation, and other controls. The tracking device is the Ascension Flock of Birds. See http://www.evl.uic.edu/EVL/VR/systems.html for a full description.). The tracking devices supply the 3-dimensional Euclidean coordinate position as well as azimuth, pitch, and roll information in the form of a set of four 3-dimensional vectors-one for the CAVE-relative XYZ coordinates and three ortho-normal vectors for orientation. Thus, there are 24 real-valued parameters describing the state of each of the two devices at each time step. The sequences themselves were each approximately 15 seconds in length, and positions were recorded at slightly longer than 0.01 second intervals resulting in 1250-1300 time-steps in each sequence. There was some variation in the time intervals both across and within the sequences, most likely due to fluctuations in the processing load on the machines used for recording. Because of this, I decided to include the time stamp itself as the 25th feature of the input vector. An additional consideration was that each sequence did not begin at exactly the same absolute position, though the tracking device coordinates were "zeroed out" to relative positions just before recording began.

Given the constraints of the data and the intentions of this project, several simplifying assumptions had to be made in order to interpret the data in terms of a temporal pattern recognition problem. The principle difficulty lay in identifying the classifications or target instances that a neural network ought to try to match. Unlike the natural language tasks of other temporal pattern recognition networks, there does not exist a grammar or catalog of elements to attempt to recognize. Rather than attempt to devise such a grammar, I elected to view the gesture time-series data as a forecasting problem instead of as a semantic classification problem. As a forecasting problem, the neural network would be expected to output the next (immediate future) state description vector in response to the current state-description, or a series of state descriptions. As in economic or weather forecasting, the state variables change continuously and thus guessing at or near the previous values or based on short extrapolations should produce reasonable approximations to the next state. Thus a simple neural net should be able to further limit its extrapolation to the range observed in training data. The availability of past states should further indicate relevant trends and trajectories, and so allowing access to historical inputs should increase performance. With this framing of the problem, I designed three neural network architectures to compare their performance on this task.

The three architectures implemented correspond to the three discussed in the previous section. The first is a static single-input which receives a single time-step as input as is trained to output the succeeding time-step values. Its only available information regarding time is the time stamp node of the input vector. It was thus configured as a fully connected 25x10x24 network (input X hidden X output, the time stamp was not included in the target vector) initialized to random weights and trained using a standard backpropagation algorithm. The second architecture was a buffered window network which took ten consecutive inputs and was trained to output the vector of the eleventh member of the sequence. It was implemented as an otherwise standard backpropagation network of 250x10x24 nodes. The final architecture implemented was a recurrent architecture in which a context layer was added between the input and hidden layers. The context layer added its own current activation value to the summation of weighted input values during feed-forward operation. The weights between the input and context layer, as well as the weights between the context layer and the hidden layer, were subject to the normal backpropagation rule, but the context layer's recurrent input was neither weighted nor updated with respect to errors-errors were propagated past the context layer to the input layer. The recurrent network was thus a 25x25x10x24, fully connected network with the second layer feeding back on itself (i.e., singly connected to itself).

The neural networks themselves were written in C++. The basic backpropagation algorithm code was taken from Mitchell's example for a face recognition program ( http://www.cs.cmu.edu/~tom/book.html ). The code had to be converted from Unix C to PC C++ and was heavily modified to handle the recurrent layers and networks. The data was supplied by the NCSA's Audio Development Group. It was collected using SGI machines and had to be converted to adjust for differing byte-code lengths. The networks described above and analyzed in the next section were trained and run on Pentium machines and required several hours to process all of the cases.
 
 
 
 
 

III. Testing and Analysis

 

My expectations of the networks performance was that the networks which encoded more temporal information would do better. Thus, that the buffered window and recurrent networks would outperform the single-input network. Training the single-input network was considerably faster, however. For all three networks, training consisted of presenting an entire sequence in order and propagating the error through the network after each input in the sequence. In order to estimate the effects due to unique variations in each of the three sequences, each of the three architectures was tested on different combinations of the three sequences: 1 alone, 1 and 2, 1 and 3, and all three together to judge the optimal potential across the sequences. Training was done for 2,000 and 5,000 epochs for each configuration-thus for three sequences and 5,000 epochs the network was actually trained for 15,000 iterations. For all three networks, the learning rate was set to .05, and the momentum was also set to .05. For short test runs (less than 100 epochs), a learning rate of .2 and momentum of .15 appeared optimal, but I felt these would be too steep for training over thousands of epochs and used the smaller values.

The performance tests consisted of running the network feed-forward over all three of the sequences and recording the absolute value of the difference between the predicted and target values for each of the 24 parameters. As an estimate of overall accuracy of a network over a sequence, these differences were summed over all the features for the entire sequence and divided by the number of time-steps in the sequence to give a mean disparity measure between predicted and actual values. This rather course metric was a useful comparative measure as each network achieved output error rates of less than 3% and many of them 1% or less.

The following charts indicate the error measures for test sequences 1, 2 and 3. Each chart represents six networks trained on the same series, three for 2000 epochs and three for 5000 epochs.
 
 

 
 
 

As can be seen above, the third sequence can be predicted much more readily than the other two regardless of the training set. This may have to do with recording procedures, but it should be investigated whether there were any other difference in the performance procedure. Another general trend to be observed is that all the networks did better at 2000 epochs than at 5000 epochs. This was no doubt the result of over training, and this is verified by the fact that there was improved performance when testing on a single sequence which was used for training at 5000 epochs.

The comparisons between the architectures is not surprising, though I had thought that the recurrent architecture would do better than it did. Overall, the recurrent network did not even do as well as the single-input non-recurrent network. Thus it seems that the context layer introduced noise which interfered with training. The single-input network did quite well and was generally less susceptible to overtraining than the other two. By a narrow margin, the best performer was the 10-input buffered window network.

The actual output error rate for the non-recurrent networks was consistently below 2% in the final stages of training, while the recurrent network hovered from 15-20% error. The poor showing of the recurrent network was probably due to the fact that the recurrent input layer connections were neither weighted nor adjusted during training. It would be good to implement Mozer's focused backpropagation algorithm and compare its results. I'm sure it would resolve this problem.

As for the differences between the single-input and buffered-input architectures, performance was close considering the severe differences in training time for each. An interesting variation to explore for the buffered-input architecture would be to sample historical points at different intervals, rather than just the ten immediately preceding points. This would indicate whether longer trends are more relevant to prediction than short-term trends.
 
 
 

IV. Further Work

 

This work provides some helpful insights into the nature of the problem of temporal gesture recognition based on data from two tracking devices. Much work still needs to be done before a "gesture" can be distinguished and recognized. The network architectures built in this project could be of use in a hybrid system seeking to achieve robust recognition of specific gestures. In such a system, different networks could be trained on different gestures, allowed to make predictions in parallel, and considered as experts by some higher level choice function which would take the best predictor as indicating the correct classification. In order to do this, however, more domain work must be done to further specify the elements and grammar of gesture. Such work is necessary in order to select the appropriate targets for training, but also to identify how the classifications ought to be structured.

The problem of temporal pattern recognition is a very complex one. Certainly better general frameworks could be developed which would allow more robust representations of temporal relationships. Meanwhile, further applied work must proceed by identifying the specific constraints offered by a domain problem and taking advantage of these in system design.
 
 
 

V. References and Code

 

Cleeremans, A., D. Servan-Schreiber & J. L. McClelland (1995). Graded State Machines: The Representation of Temporal Contingencies in Feedback Networks. In Backpropagation: Theory, Architecture, and Applications, Y. Chauvin & D. Rumelhart (Eds.). Lawrence Erlbaum Associates: Hillside, NJ. pp. 273-312.
 

Drossu, R. & Z. Obradovic (1996). Rapid Design of Neural Networks for Time Series Prediction, IEEE Computational Science & Engineering, 3,2: 78-89.
 

Eisenstein, E., I. Kanter, D. A. Kessler, & W. Kinzel (1995). Generation and Prediction of Time Series by a Neural Network, Physical Review Letters, 74,1: 6-9
 

Hinton, G. E. (1989). Connectionist Learning Procedures, Artificial Intelligence, 40: 185-234.
 

Lang, K. J., A. H. Waibel, & G. E. Hinton (1990). A Time-Delay Neural Network Architecture for Isolated Word Recognition. Neural Networks, 3: 33-43.
 

Mitchell, T. M. (1997). Machine Learning. McGraw-Hill: Boston, MA.
 

Mozer, M. (1995). A Focused Backpropagation Algorithm for Temporal Pattern Recognition. In Backpropagation: Theory, Architecture, and Applications, Y. Chauvin & D. Rumelhart (Eds.). Lawrence Erlbaum Associates: Hillside, NJ. pp. 137-169.
 

Pomerleau, D. A. (1993). Knowledge-based Training of Artificial Neural Networks for Autonomous Robot Driving. In Robot Learning, J. Connell & S. Mahavan (Eds.), Kluwer Academic Publishers: Boston, MA, pp. 19-43.
 
 

The code can be found at:

glearn.cpp
backprop.cpp
backprop.h
load.cpp
load.h
sensordata.h