CMP -- United Business Media

Intelligent Enterprise

Better Insight for Business Decisions




May 15, 2000, Volume 3 - Number 8

More for Less

New, more complex neural net design yields better prediction results

There stood MIT professor Marvin Minsky, the great-grandfather of artificial intelligence, admitting he had been wrong for 20 years. At a conference in Boston just this fall, he told the amazed attendees that the methods he had been evangelizing all those years may just not be up to the task of capturing human intelligence. This was a momentous event. Advances in the cognitive sciences and AI almost always directly precede advances in business intelligence technology — data mining, query methods, and so on. And here we had one of the central figures in the history of AI pronouncing a paradigm shift.

“I should have been paying attention to Professor Grossberg all along,” he said. Stephen Grossberg had given a lecture on real-time feedback in visual cortex networks, had been teaching at Boston University (on just the other side of the Charles River from Minsky’s MIT) for a quarter of a century, and had been mathematically modeling the human brain in comparative obscurity all that time — but with far more success.

Minsky for years insisted that reasoning must be symbolic, but for him “symbolic” simply meant operations on strings of text. The concept that thinking consists solely of strings of text is terribly restrictive, just as literal string searches of text bases yield frustratingly inadequate results. In my articles “The Solution Engine” and “Artifacts of the Future” (see Resources), I indicated that a lot of meaning must occur outside the surface verbal realm. I would guess that, at minimum, half of all modern linguists are certain a lot more goes on in the mind than playing with strings of text. Language is but the froth on the waves of experience, said Paul Ricoeur.

With all the excitement, I decided to try solving some common business information processing problems with structures comparable to the smart, layered, hierarchical architectures that Grossberg indicated the visual cortex uses. In fact, I had long been working on a different but related paradigm: multiple organized units in feedback systems with stabilizing learning laws. Like many others, I had watched the voices of the “symbolic” school drown out a lot of good thinking.

In particular here, I discuss using these methods to derive value from a common but often problematic kind of survey: the kind with many questions but lousy response quality. On many Internet surveys, for example, respondents fill in only a few answers. The methods I outline seem encouraging in the struggle with similar problems I sketched out in previous articles. Those problems boil down to analyzing limited information in a highly unstructured and perhaps ambiguous environment.

For this survey data problem, I emulated the biological solution to disambiguating multiple meanings: using multiple recognizers operating in concert. For example, recognizing the context and the personality of the source helps people decide which of a word’s many possible meanings is most likely the one intended.

I experimented in the DSS Lab with a neural network of neural networks, a hierarchy of learners taught to do specialized tasks. Think of it as a group of students learning Marketing 101 before 606, who then get together to solve some very interesting and useful problems and then feed back their mature knowledge to a new generation of students. The resultant whole grows steadily smarter. Being able to figure out more from less information is one measure of intelligence.

The marketing data with which I experimented was from a consumer survey that asked many questions. But several questions often went unanswered, and the data was full of errors. The goal was to come up with the best possible forecast of the respondents’ lifestyle characteristics, given this problematic data.

In pursuit of this goal, I created a model that somewhat mimicked the neural anatomy of the human cerebral cortex. People in organizations that use neural nets get excited about their very simple three- or four-layer neural nets. But scrutiny reveals that their constructions can disambiguate only one or, at most, a few things well. (And they are far simpler than even a single biological neuron, with its thousands of synapses, each with potentially varying values.)

Each of these commercial, three-layer, silicon neural nets can be trained to act as little more than a single keyhole diagram of the type I mentioned in “Learning in Time” (see Resources). Also, they can be hard to program because no deterministic way exists to figure out the proper connection weights. The number of options quickly outstrips what even today’s computers can try. Even with solution methods using genetic algorithms, you are never certain to converge on a good, let alone optimal, solution.

So here, once again, I introduce the idea of my zero algorithm, which offers a deterministic way of finding optimal neural-net solutions and even offers a way to change its knowledge systematically to be in sync with a changing world.

With these layered arrays of neural nets, you can see a lot of the mysteries hidden in the data. The process they use is analogous to the way people can extrapolate from a limited amount of information — say, for example, unpacking information stripped to fit in the limited linear channel of speech. (See “Artifacts of the Future” for examples.)

Figure 1 shows the basic architecture I used, where the nodes can be any of a number of keyhole-producing units, decision trees, neural nets, optimizers — or a zero-algorithm neural net, which is a hybrid of all the former.



FIGURE 1 Overall architecture of the net of neural nets.


The bottom represents the basic input data set; each link represents one column of data. The first-row units are comparable to the dot recognizers in human vision. (See “Multiplicity of Mind,” Resources.) However, here they were not hard-wired, and each was taught to forecast one of the given variables based on all the others. This is how the system starts to fill in the missing information.
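To make the first-row idea concrete, here is a minimal sketch of "one forecaster per column, each fed by all the other columns." It is only an illustration under loud assumptions: the units below are simple nearest-neighbor averagers standing in for whatever learner sits at each node (they are neither commercial neural nets nor the zero algorithm), answers are assumed numeric, and the names `MISSING`, `distance`, `forecast_column`, and `first_layer` are hypothetical.

```python
MISSING = None  # hypothetical marker for an unanswered survey question


def distance(a, b, skip):
    """Mean squared gap over columns both rows answered, ignoring column `skip`."""
    gaps = [(x - y) ** 2 for i, (x, y) in enumerate(zip(a, b))
            if i != skip and x is not MISSING and y is not MISSING]
    return sum(gaps) / len(gaps) if gaps else float("inf")


def forecast_column(rows, target, k=2):
    """Forecast column `target` for every row, using only the other columns."""
    forecasts = []
    for i, row in enumerate(rows):
        # Teachers: other respondents who actually answered the target question,
        # ranked by how similar their remaining answers are to this row's.
        neighbors = sorted(
            (distance(row, other, target), other[target])
            for j, other in enumerate(rows)
            if j != i and other[target] is not MISSING
        )[:k]
        usable = [value for d, value in neighbors if d != float("inf")]
        forecasts.append(sum(usable) / len(usable) if usable else MISSING)
    return forecasts


def first_layer(rows, k=2):
    """Run one forecaster per column; the result has the same shape as `rows`."""
    columns = [forecast_column(rows, t, k) for t in range(len(rows[0]))]
    return [list(r) for r in zip(*columns)]
```

Calling `first_layer(rows)` on a table with gaps returns a forecast for every cell, including the ones the respondent left blank, which is the sense in which this layer starts to fill in the missing information.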

The first few layers of the human vision system, which similarly have first access to real-world data, were trained over the course of millions of years of evolution and are now the hard-wired first step of vision processing. Obviously, the more advanced cognitive skills people have to learn during the course of their lives are not hard-wired.

What is interesting about the neural net of nets is that one part of one layer gets programmed at a time. This progressive process is like the students I mentioned before who master the 101 subject before moving on to second-level classes, or similar to the brain adding layer after layer over the course of evolution. Global solutions in many circumstances are, for all practical purposes, impossible for the neural net to compute. After all, even people have to learn first things first. Although many forms of neural nets on chips already exist (see Resources) that would have allowed me to do each layer’s computations rapidly in parallel, for this study I used a machine with four standard Intel Xeon processors with software simulating the parallel neural nets.

After the first layer calculations completed, I moved on to layer two, or “Marketing 202” and related classes. In this neural net architecture, the second layer integrated both the raw data and the forecast values from the first layer.

What I find conceptually interesting is that the second layer ended up using forecast variables far more often than the raw data. The forecast versions of variables, created using the context information of all the other variables, seemed to hit the mark about the survey subject more often than the values the individuals actually typed in. I infer that, on at least this data set, the first layer’s first lessons yielded categories and ways of conceptualizing its world that effectively resolved the confusion in the raw data. Its projections were far more stable and reliable than the limited initial information was. The neural net seemed to find essential concepts for understanding the customers even when it was forecasting components not directly related to the final goal.

Thus, with its Marketing 101 lessons, it learned the basic concepts. By combining the hints contained in the whole data set, it “connected the dots” — filled in a lot of missing information, just as the human vision system must and as people do on many levels while thinking.

Then it moved on to Marketing 202. The second layer used the first-layer categories to again refine expected values for each of the variables.

The third and final layer allowed all the distinctions obtained so far, as well as the raw data, as potential input. It used this input to forecast the lifestyle categories.
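The wiring between the three layers can be sketched in a few lines. This is only a sketch of the data flow described above, assuming each layer is a callable that has already been trained; `stack_inputs` and `run_layers` are hypothetical names, and the training itself is not shown.

```python
def stack_inputs(raw_row, *forecast_rows):
    """Concatenate one respondent's raw answers with earlier layers' forecasts."""
    stacked = list(raw_row)
    for forecasts in forecast_rows:
        stacked.extend(forecasts)
    return stacked


def run_layers(raw_rows, layers):
    """Feed each layer the raw data plus the output of every earlier layer.

    `layers` is a list of callables; each takes a list of input rows and
    returns a list of output rows (one per respondent).
    """
    outputs = []
    for layer in layers:
        inputs = [
            stack_inputs(raw, *(earlier[i] for earlier in outputs))
            for i, raw in enumerate(raw_rows)
        ]
        outputs.append(layer(inputs))
    return outputs
```

With three layers, the first sees only the raw columns, the second sees the raw columns plus the first layer's forecasts, and the third sees the raw columns plus everything the first two produced. Because the raw columns stay available at every level, each layer remains free to prefer the forecasts over what respondents typed in.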

Because I had access to outside commercial sources that had already categorized these customers based on a far more extensive base of information, I had a way to test the results. Assuming that these bases had more or less correctly characterized these individuals, I could assess the accuracy of different forecast approaches by comparing them to the bases’ numbers.
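The scoring itself is simple arithmetic. As a minimal sketch, assuming the outside categorizations arrive as one label per respondent and that respondents with no outside label are simply left out of the score (my assumption for illustration, as is the name `accuracy`):

```python
def accuracy(forecasts, reference):
    """Share of respondents whose forecast category matches the reference one.

    Respondents without a reference label (None) are excluded from the score.
    """
    scored = [(f, r) for f, r in zip(forecasts, reference) if r is not None]
    if not scored:
        raise ValueError("no reference labels to score against")
    hits = sum(1 for f, r in scored if f == r)
    return hits / len(scored)
```

Running the same function over each approach's forecasts puts them all on one scale, which is what makes the percentages comparable.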

A lone neural net unit got only 19 percent right. The three-layer system using commercially available neural units got 38 percent right. When I added feedback from added units on layer three back to layer one, accuracy rose to 46 percent.

When using zero-algorithm neural nets, the success rate hit 66 percent, which is amazing, considering that nearly 89 percent of the lifestyle data was missing in both the training and testing data sets — offering a comparatively small sample to train on. The rest of the data was estimated to have a 78 percent rate of error or missing values. The experiment certainly exemplifies getting more from less.

But there is a practical limitation to consider. The number of links (or columns of data) you use can get very large. Again, I tried something inspired by the architecture of the human cerebral cortex — which does not link everything to everything. I performed some delicate brain surgery, hacking away at the links, first cutting those that I measured to have the least information content.
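One way to rank links for cutting can be sketched as follows. How information content was measured is not spelled out here, so this sketch swaps in a standard stand-in: the mutual information, in bits, between each discrete input column and the target. The function names are hypothetical, and missing values are assumed to have been handled beforehand.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information in bits between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


def pruning_order(columns, target):
    """Names of the input columns, least informative about `target` first."""
    scores = {name: mutual_information(values, target)
              for name, values in columns.items()}
    return sorted(scores, key=scores.get)


def prune(columns, target, fraction):
    """Cut the given fraction of links, starting with the least informative."""
    order = pruning_order(columns, target)
    cut = set(order[:int(len(order) * fraction)])
    return {name: values for name, values in columns.items() if name not in cut}
```

Retraining and rescoring after each call to `prune` with a growing `fraction` would trace out a curve of accuracy against the percent of links cut.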

I found that the systems stayed reasonably smart for quite a while. This kind of distributed architecture seemed more stable under the knife than you might expect. Figure 2 shows the accuracy rate in relation to the percent of links cut. As you might expect, accuracy decayed from 66 percent back down to zero, but the decay was surprisingly slow. This system organizes the given information so well, it seems, that even a substantially smaller subset of variables can do a respectable job of prediction.



FIGURE 2 This system was so smart that even under brain surgery it still could determine who was who.


The strength of this process comes from two main components: a distributed hierarchical architecture and the zero algorithm training method. I will have to leave to a future discussion what makes the zero algorithm tick, but for now I would like to mention some of its valuable properties:

• It offers a deterministic and finite solution method

• It can find optimal solutions, not just adequate ones

• It can be set to limit the chance of finding noise rather than information

• It can automatically remove low-information variables

• It is stable and can be set to revise its conclusions appropriately as new information arises

• It can be set to give different credibility weights to different inputs

• It can be set to generalize from a very limited number of examples (see “Learning in Time”)

• It does not require perfect memory or data, thus facilitating the use of very small yet potentially unstable memory chips (such as the next-generation quantum chip might be)

The zero algorithm also often finds the exact solution to the problem of combining training rules or indicators, a problem I set up in Figure 2 of “Learning in Time.”

I am beginning to see this kind of method being successfully applied in numerous areas, including forecasts of human behavior, and I expect the range of applications to rapidly grow in the very short term. New application areas could easily include neurological, genetic, and biotech research, intelligent systems, cipher decoding, missile defense systems, and numerous business applications — to name a few.

I also think that, to fully unpack the complex topological relations inherent in language, text bases, the markets, and more, we will need recognizers organized in relational arrays like the ones I described, as they are in all the sensory modalities.

Perhaps in my lifetime we will even see one of these systems truly understand an English sentence, something Minsky thought would be a cinch some 20 years ago.

Barry Grushkin (bgrushkin@dsslab.com) is the senior lab researcher at the DSS Lab (www.dsslab.com) in Cambridge, Mass.

The analysis described in this article was performed on a terabyte EMC Corp. Clariion Fibre Channel Disk Array and three 450MHz Xeon Dell/Intel four-way servers.

RESOURCES

BioComp Systems old-fashioned silicon neural nets: www.biocompsystems.com

Commercial neural nets on chips: neuralnets.web.cern.ch/NeuralNets/nnwInHepHard.html

Grossberg, Stephen, How Does The Cerebral Cortex Work?: neuralnets.web.cern.ch/NeuralNets/nnwInHepHard.html

Minsky, Marvin: www.ai.mit.edu/people/minsky/minsky.html

Siemens old-fashioned three-layer silicon neural nets: www.senn.sni.de/senn/neunet/neunet_e.htm

Previously published Decision Support articles referenced here can be found at intelligententerprise.com/ports/search_decision.jhtml:

“The Solution Engine” Dec. 21, 1999
“Artifacts of the Future” Feb. 9, 2000
“Learning in Time” March 20, 2000
“Multiplicity of Mind” April 28, 2000






