Intelligent Enterprise

Better Insight for Business Decisions



December 21, 1999, Volume 2 - Number 18


The Solution Engine


“Search engines” keep us searching, but we can do better with technology we have now

What do we really want: search engines… or solution engines? People go to the Net to solve problems, not to read long lists. How can we pull structured answers from unstructured sources and put two and two together? Those sources include the Internet, intranets, text databases, and email — not to mention the massive and growing amounts of other media now available with the advent of audio mining. From our first utterances, we use language to get our needs met. We should not confuse the methods or machinery of communication with its purpose. A study of children in 30 different countries showed that no matter how different or complex the grammatical machinery of each native language, the children first used whatever structures were required and available to get the basics done: “I am hungry.” “I need a hug.” The databases we use for communication are systems we consciously design; we should therefore judge our design of a system by its ability to facilitate getting work done.

Human recall is a good model for what we seek. We are quite good at pulling needed facts from the seemingly unstructured mass of memories in our heads. If you only knew the mess of memories I am drawing upon to write this article!

A very common type of problem is the kind for which people at first don’t even know the right questions to ask. Technology has not yet addressed this large class. In the process of outlining general solution methods for some of these problems, I will review existing technology, discuss some of the difficulties and complexities of dealing with language and text, and conclude by suggesting how current technology could, in a step or two, deal with some of the more complex issues the initial problem brings up.

Knowledge is Empowerment

This is a true story. A year ago my friend Bev adopted a beautiful baby. After six months the little girl started having epileptic seizures. The experts at the major university medical center were stumped. Bev, an unusually aggressive Web researcher, took matters into her own hands. After many hours of investigation she found that the child’s prophylactic drug for tuberculosis could produce a vitamin B6 deficiency. With more work, she found that a B6 deficiency can cause epilepsy. With a new treatment, the child has been healthy ever since!

Bev had only partial information, symptoms, and personal histories to go on. That was not enough to know the exact key words that would pull the articles central to solving her problem to the top of a ranked list. Often we have some facts but do not know how they fit into a larger schema of knowledge, which is one reason why live customer help is so much more useful than menu-driven methods.

To solve problems like these in an automated fashion requires some connection making — a chain of connected articles (a path through article space, like a path through Internet nodes to find a target email site), with some of the component articles not even necessarily containing any of the search words. In Bev’s case, an article with technical names for the TB drug and B6 turned out to be one of the key links, with no mention of TB or epilepsy.

Key-Word Search Development

We are all familiar with standard key-word Boolean searches, such as “Find me texts with both the words ‘business’ AND ‘intelligence.’” Net search engine providers keep site or article key words in in-house databases for quick, indexed, targeted searches. But this method requires exact matching and misses hits unless our indexers have added an adequate set of relevant word variations or descriptors paired with sites.
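
The exact-match limitation is easy to see in a toy inverted index. The following is a minimal sketch, not a description of how any particular search engine works; the function names `build_index` and `search_and` and the three sample texts are invented for illustration.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            # Strip surrounding punctuation so "intelligence." matches "intelligence".
            index[word.strip(".,;:!?\"'()")].add(doc_id)
    return index

def search_and(index, *terms):
    """Return ids of documents that contain every term (exact word match only)."""
    sets = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*sets) if sets else set()

# Hypothetical sample texts.
docs = {
    "a": "Business intelligence tools for analysts.",
    "b": "Military intelligence report.",
    "c": "Small businesses need better insight.",
}
index = build_index(docs)
hits = search_and(index, "business", "intelligence")
print(hits)
```

Only document "a" is returned. Document "c" is missed for the query word "business" because "businesses" is a different string, which is exactly the gap that hand-added word variations are meant to fill.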

Straight word search benefits highly when teams of researchers can add associated terms, cull out spam and errors, and add topical categorizations. Human input greatly improves roughshod automated methods — a motif you will encounter again a little later.

The next level of technology takes advantage of a separate database of clusters or related word groups to expand the search. Then the search engine can recognize “OLAP” and “online analytical processing” as the same thing. The nature and use of these word clusters raise many complex and sophisticated questions. Understanding these issues, I expect, will be essential in producing the next breakthrough technologies.
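
As a rough sketch of cluster-based expansion, the lookup below treats every variant in a word group as equivalent to the query. The `CLUSTERS` table and the `matches` function are hypothetical, and the plain substring test is a simplification (it would also match "olap" inside a longer word).

```python
# Hypothetical cluster table: each entry lists phrases treated as equivalent.
CLUSTERS = {
    "olap": ["olap", "online analytical processing"],
}

def matches(text, query, clusters=CLUSTERS):
    """True if the text contains the query or any variant from the query's cluster."""
    text = text.lower()
    query = query.lower()
    for variants in clusters.values():
        if query in variants:
            # The query belongs to this cluster, so any variant counts as a hit.
            return any(variant in text for variant in variants)
    # No cluster knows the query: fall back to a plain match.
    return query in text

print(matches("A guide to online analytical processing", "OLAP"))
print(matches("Relational database tuning", "OLAP"))
```

The first call finds a hit even though the acronym never appears in the text; the second finds none.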

The earliest grouping methods used simple thesaurus lists for their comparative word sets. The next approaches added thesaurus-based measures of word closeness. In this way, the closer a found word was to a sought word, the higher the article containing it ranked.

Next, statistical methods of word association were used. Words can be considered close if they tend to appear in the same articles, or even better, in groups of articles constructed by researchers or librarians, or by the statistical methods I discuss later. This approach facilitates key-word matching with even foreign-language variants.
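
One simple way to put a number on "tend to appear in the same articles" is the overlap between the sets of articles that contain each word. The sketch below uses the Jaccard ratio for that overlap; this is one illustrative choice of statistic, not necessarily the one any product uses, and the sample articles are invented.

```python
def doc_sets(docs):
    """Map each word to the set of article ids in which it appears."""
    sets = {}
    for doc_id, text in docs.items():
        for word in set(text.lower().split()):
            sets.setdefault(word, set()).add(doc_id)
    return sets

def closeness(sets, w1, w2):
    """Jaccard overlap of the two words' article sets: 0.0 (never together) to 1.0."""
    a, b = sets.get(w1, set()), sets.get(w2, set())
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical sample articles.
docs = {
    1: "seizure epilepsy treatment",
    2: "epilepsy seizure children",
    3: "stock market report",
    4: "market seizure of assets",
}
sets = doc_sets(docs)
print(closeness(sets, "seizure", "epilepsy"))
print(closeness(sets, "seizure", "market"))
```

Here "seizure" comes out much closer to "epilepsy" than to "market", even though it co-occurs with both.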

The Concept of Concept

Furthermore, if measures of word importance are added (which generally means rarity), the closeness and importance of related words, along with their density and prominence of placement in articles, give the system ways to rank hits. Ranking hits is critical when you start getting millions of them.
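
A minimal ranking sketch along these lines weights each query word by its rarity and multiplies by how often it occurs in the article. The rarity weight used here, the logarithm of (number of articles / number of articles containing the word), is a common stand-in chosen for illustration; prominence of placement is left out, and the function name `rank` and the sample texts are hypothetical.

```python
import math

def rank(docs, query_terms):
    """Order article ids by the summed (count in article) x (rarity weight) of query terms."""
    n = len(docs)
    tokens = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    scores = {}
    for doc_id, words in tokens.items():
        score = 0.0
        for term in query_terms:
            # Number of articles containing the term; rarer terms get larger weights.
            df = sum(1 for other in tokens.values() if term in other)
            if df:
                score += words.count(term) * math.log(n / df)
        scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical sample articles.
docs = {
    "a": "the rare epilepsy case",
    "b": "the common case",
    "c": "the case of the report",
}
print(rank(docs, ["the", "epilepsy"]))
```

Because "the" appears in every article its weight is zero, so the single occurrence of the rare word "epilepsy" is enough to put article "a" first.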

When constructed word clusters started to look meaningful, some developers started to call them “concepts.” The label is useful, but it can conceal more than it reveals. That the term “concept” is also used by some software makers to refer to groupings of documents or summary elements within texts evinces how mercurial the concept of “concept” is. Perhaps we should heed Heraclitus on the difficulty of understanding: “The lord whose oracle is at Delphi neither speaks nor conceals, but gives signs.” It’s time for a disclaimer.

We must remember that statistics on word distribution can at best be a very crude proxy for content or meaning. Words, articles, or phrases in sets, fuzzy sets, lists, clusters, or even words in sentences are not concepts. Only people have concepts. Concepts are part of an active, contextually interacting, and evolving human social domain. It is too easy to anthropomorphize and thus project the complexity of our thinking into far simpler entities. Meaning in text arises in use, with a wide range of nonlinguistic mental models and modules and assumptions in play. I overheard this conversation in a bar: “You know that guy. Oh him. I hate when that happens. Ouch. You really had to be there.” Text is often only a mnemonic for events, and context exterior to the text often fills in a great deal.

After hundreds of millions of years, animals evolved with enough precise spatial and motor skills to jump freely in the trees. Language, an evolutionary late comer, depends on many of these kinds of nonverbal mental skills. When we teach how to swing a bat, how we explain the process depends on how we conceptualize experiences of our body in action.

Why do grammar checkers and text summarizers often seem inadequate? Key meaning components can be exterior to the text. Consider: “She placed the secret locket {among or between?} their fingers.” Which word is correct depends on what is in our heads. It depends on whether we are imagining the fingers in groups or pairs. Some recent work is moving in the needed direction. HNC Software Inc.’s work on Cortronic Neural Networks attempts to include some contextual information in neural nets by developing a special language for coding spatial and auditory relations.

Even intratextually, word meaning varies by context: “California” as an adjective can have quite differing meanings when placed before either “maki,” “bagels,” or “girls” — avocado and crab, nuts and raisins, or blond and tan, respectively. To the extent we can make the situational context part of our organizing or searching process, we can “advantage” the chance for useful and meaningful results. For example, our word grouping should have “AMA” equivalent to “American Medical Association” in a medical context but “American Management Association” in a business context.
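
The "AMA" example can be sketched as a lookup keyed on both the term and the situational context. The table and the `expand` function below are hypothetical names used only for illustration.

```python
# Hypothetical table: the same abbreviation expands differently per context.
EXPANSIONS = {
    ("ama", "medical"): "american medical association",
    ("ama", "business"): "american management association",
}

def expand(term, context):
    """Return the context-specific expansion, or the term unchanged if none is known."""
    return EXPANSIONS.get((term.lower(), context), term)

print(expand("AMA", "medical"))
print(expand("AMA", "business"))
```

A term with no entry for the given context is passed through unchanged rather than guessed at.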

In his book Women, Fire, and Dangerous Things (University of Chicago Press, 1990), linguist George Lakoff maps out some of the widely varying yet partially overlapping uses of such simple terms as “there” and demonstrates that words have very complex, interacting, culturally dependent topologies. That the geometry of word meaning is very complex can also be seen in Ludwig Wittgenstein’s theory of family resemblance. He shows, for example, that the word “chair” can denote objects with complexly variable elements (dimensions) of similarity and difference. This view is quite different from the hub-and-spoke radial image that might be invoked to compare real things with the pure ideal types of which Plato spoke. Neural nets can deal with complex topologies, but the right form and the right way of programming them have remained elusive. The Haiku system now in development uses genetic algorithms to let you view some of these kinds of high-dimensional, complex, geometric relations. (See Figure 1.)



FIGURE 1 Genetic algorithms allow capturing complex, high-dimensional interrelations in the Haiku system from The University of Birmingham.


The Technology at Hand

Let us return to the review of current text technologies. Another idea that has been used is to find distances between articles based on the overlap of important words or related words. By clustering on them, you can sometimes get useful groupings of texts.

In the simplest models, each document can be seen as a vector in word space, the vector components being word counts in the document. In turn, each word can be seen as a vector in document space, the vector components being the number of times that word appears in each of the basis vector documents. Document or word clusters or groupings can be seen as regions in these spaces.
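
The two views described above are the rows and columns of the same count table. A minimal sketch follows, with cosine similarity added as one common way to compare two vectors; the similarity measure, the function names, and the sample documents are assumptions made for illustration.

```python
import math
from collections import Counter

def vectorize(docs):
    """Return the sorted vocabulary and one word-count vector per document."""
    vocab = sorted({w for text in docs.values() for w in text.lower().split()})
    vectors = {}
    for doc_id, text in docs.items():
        counts = Counter(text.lower().split())
        vectors[doc_id] = [counts[w] for w in vocab]
    return vocab, vectors

def word_vector(vocab, vectors, word):
    """The same table read column-wise: counts of one word across all documents."""
    i = vocab.index(word)
    return [vec[i] for vec in vectors.values()]

def cosine(u, v):
    """Cosine of the angle between two count vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical sample documents.
docs = {"d1": "olap cube olap", "d2": "olap cube", "d3": "epilepsy drug"}
vocab, vectors = vectorize(docs)
print(vocab)
print(vectors["d1"], word_vector(vocab, vectors, "olap"))
print(cosine(vectors["d1"], vectors["d2"]), cosine(vectors["d1"], vectors["d3"]))
```

Documents d1 and d2 point in nearly the same direction in word space, while d1 and d3 share no words and have a similarity of zero.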



FIGURE 2 Topical document map from Conceptual Dimensions.


Once you have ways of determining text distances, as these vector spaces offer, you can start making article relationship maps, as in Figure 2, or topographic maps of article cluster densities, as in Figure 3, or many other wonderful images. Check out the Spire system for stunning examples.
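
Given any text distance, grouping can be sketched very simply: link two articles when they are similar enough, then treat each connected set of linked articles as a cluster. The sketch below uses shared-word (Jaccard) overlap as the similarity and an arbitrary threshold of 0.3. Both are assumptions for illustration and are not taken from Spire or any other product named here.

```python
def jaccard(a, b):
    """Share of words two articles have in common, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster(docs, threshold=0.3):
    """Group articles into connected sets where neighbors meet the similarity threshold."""
    words = {doc_id: set(text.lower().split()) for doc_id, text in docs.items()}
    unassigned = set(docs)
    clusters = []
    for start in docs:
        if start not in unassigned:
            continue
        unassigned.discard(start)
        group, frontier = set(), [start]
        while frontier:
            current = frontier.pop()
            group.add(current)
            # Pull in every still-unassigned article close enough to the current one.
            for other in list(unassigned):
                if jaccard(words[current], words[other]) >= threshold:
                    unassigned.discard(other)
                    frontier.append(other)
        clusters.append(group)
    return clusters

# Hypothetical sample articles.
docs = {
    "a": "olap cube query",
    "b": "olap cube design",
    "c": "epilepsy drug therapy",
    "d": "epilepsy drug dosage",
}
print(cluster(docs))
```

The four sample articles fall into two groups, one about OLAP cubes and one about epilepsy drugs. Raising or lowering the threshold changes the groupings, which is where the hands-on work comes in.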



FIGURE 3 Topographic article density map from the Spire system of Pacific Northwest National Laboratory.


Often with some hands-on work, including giving titles and rearranging some assignments, these clusters can make very useful categories. There are many applications: folders that divide search lists into useful subgroupings (check out NorthernLight’s search engine), email questions directed to the right specialist, news stories hot off the wire selected for your interests, people put in contact with others of similar interest, fraud detection, and making massive amounts of corporate information more organized and available.

RESOURCES

Audio Mining, Dragon Systems:

www.dragonsystems.com

Concept Agents, Autonomy:

www.autonomy.com

Concept Maps, Conceptual Dimensions:

www.cdimensions.com

Cortronic Neural Networks, HNC:

www.hnc.com/businessunits/at

Haiku, The University of Birmingham:

www.cs.bham.ac.uk/~anp/haiku

Intelligent Miner for Text, IBM:

www-4.ibm.com/software/data/iminer/fortext

Knowledge Organizer, Verity:

www.verity.com

Search with Folders, NorthernLight:

www.northernlight.com

Spire, Pacific Northwest National Laboratory:

www.multimedia.pnl.gov:2080/infoviz

Ultraseek Server, InfoSeek:

www.software.infoseek.com

Most text-organizing software pre-arranges texts in hierarchies based on pure statistical relations, statistical comparisons with user-supplied examples, or predesigned taxonomies. Afterward, the user can make modifications. As soon as the organization is reasonably well determined, a lot of new texts find their rightful places. The technology, however, should leave open the possibility of many rightful places.

My work on categorizing companies by clustering on similarity of stock movement showed that even slight changes in samples can seed differing yet interpretable, repeatable, and stable groupings. IBM goes with export, big, high-tech, computer, consulting, or research companies, and so forth, depending on what part of the data relations ends up dominating one particular clustering take. This kind of multiplicity is nearly always in the nature of any attempt to summarize complex relationships. I ended up working with fuzzy relations of clustered set variations, which was useful but far from capturing the fullness of complex relations.

To relate all this back to Bev’s problem, a solution engine could offer us chains of articles starting with TB and ending with epilepsy, where each article overlaps with its neighbors in the chain by some “concept” or important word cluster. We’d hope to find TB -> TB drug -> B6 -> Epilepsy with URLs and links to the articles that explicate each of the components. Differing potential paths would perhaps be ranked by their directness, importance levels of overlaps, and number of citations that contain the key elements.

Many of the technological components to create such a solution engine are available. Databases could be explored iteratively in parallel using words associated with articles already encountered until paths of articles linking the parts of the problems are found. Intuitively, we can think of this as a sort of stream of consciousness wherein the concepts in one article trigger thoughts of other potentially valuable articles, until the needed connections are made.
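
The iterative exploration can be sketched as a breadth-first search through article space, where two articles count as linked when they share at least one important term. Everything in the sketch is hypothetical: the article ids, the term sets (with "drug_x" standing in for the technical drug name), and the function name `find_chain`. Ranking of alternative paths is left out.

```python
from collections import deque

def find_chain(articles, start_term, goal_term):
    """Return the shortest chain of articles from one mentioning start_term to one
    mentioning goal_term, where consecutive articles share a term; None if no chain."""
    starts = [a for a, terms in articles.items() if start_term in terms]
    queue = deque([a] for a in starts)
    seen = set(starts)
    while queue:
        path = queue.popleft()
        last = path[-1]
        if goal_term in articles[last]:
            return path
        for other, terms in articles.items():
            # Follow a link only if the two articles overlap in at least one term.
            if other not in seen and articles[last] & terms:
                seen.add(other)
                queue.append(path + [other])
    return None

# Hypothetical articles, each reduced to its set of important terms.
articles = {
    "tb_prophylaxis": {"tb", "drug_x"},
    "drug_pharmacology": {"drug_x", "b6"},   # mentions neither TB nor epilepsy
    "b6_and_seizures": {"b6", "epilepsy"},
    "tb_screening": {"tb", "x-ray"},
}
print(find_chain(articles, "tb", "epilepsy"))
```

The chain found runs through the pharmacology article, which contains neither search word, mirroring the key link in Bev's case.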

This kind of structuring of the unstructured, this organizing of free associations, is possibly something like how humans put together disparate facts. There is certainly plenty of psychological evidence that humans use context clues in recall. Some research says to study for a test in the place where you are going to take it. Research by psycholinguist Elizabeth Bates tells us that speakers of languages with gendered articles (such as el and la in Spanish) use them to speed recognition of the noun that follows: half the potential words are eliminated by their gender type. In fact, many researchers have found that recall is influenced by mental state, or personal context. Bates has even suggested evidence that women respond faster to female-gendered nouns, even though the genders assigned to objects can be totally arbitrary.

In the technology I am suggesting here, making context restrictions could also speed up the searches: the option to consider only articles of a medical nature, for example. Spreading activation methods used in parallel distributed processing could further speed the required calculations — here, the nodes would be articles; the links, word groups or “concept” overlaps; the link strengths, concept importance and relatedness; inputs, the key words; and output, the sets of articles that bubble up first.
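
A toy version of spreading activation, under stated assumptions: articles matching the key words start with activation 1.0, each step passes a decayed share of activation along weighted links, and the totals decide which articles bubble up. The link weights, the decay of 0.5, the two-step limit, and the article ids are all invented for illustration.

```python
def spread(links, seeds, decay=0.5, steps=2):
    """Propagate activation from seed articles along weighted links.

    links: {article: {neighbor: link strength}}; seeds: {article: starting activation}.
    Returns the total activation accumulated by each article reached.
    """
    activation = dict(seeds)
    frontier = dict(seeds)
    for _ in range(steps):
        incoming = {}
        for node, energy in frontier.items():
            for neighbor, weight in links.get(node, {}).items():
                incoming[neighbor] = incoming.get(neighbor, 0.0) + energy * weight * decay
        for node, energy in incoming.items():
            activation[node] = activation.get(node, 0.0) + energy
        # Only the newly received activation is passed on in the next step.
        frontier = incoming
    return activation

# Hypothetical article graph; weights stand for "concept" overlap strength.
links = {
    "tb": {"drug": 0.9, "screening": 0.2},
    "drug": {"tb": 0.9, "b6": 0.8},
    "b6": {"drug": 0.8, "epilepsy": 0.9},
    "epilepsy": {"b6": 0.9},
    "screening": {"tb": 0.2},
}
activation = spread(links, {"tb": 1.0, "epilepsy": 1.0})
for article in sorted(activation, key=activation.get, reverse=True):
    print(article, round(activation[article], 4))
```

The two bridging articles ("drug" and "b6") end up well above the weakly linked "screening" article, even though none of the three matched a key word directly.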

One last note: Specialized language games could create even more speedy and accurate solutions. Games of the kind Wittgenstein suggested, in which carefully constructed formal properties follow meaningful properties, could become part of XML conventions. As an example, we could specifically code at the footer of Web sites statements that use verbs that basically show mathematical transitivity. Examples would include owns, causes, implies, correlates, and so on. Thus summary statements added to articles, of the form “TB drug x correlates with B6 deficiency” and “B6 deficiency correlates with epilepsy,” could be used to rapidly suggest the connection, “TB drug x correlates with epilepsy.”
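
A minimal sketch of this kind of chaining represents each summary statement as a (subject, verb, object) triple and repeatedly joins statements that share a verb. The triple format and the function name `infer` are assumptions made for illustration; no XML convention of this kind is specified here.

```python
def infer(statements):
    """Return new (subject, verb, object) triples implied by chaining same-verb statements."""
    facts = set(statements)
    changed = True
    while changed:
        changed = False
        for (a, verb1, b) in list(facts):
            for (c, verb2, d) in list(facts):
                # Chain "a verb b" with "b verb d" into "a verb d".
                if verb1 == verb2 and b == c and a != d and (a, verb1, d) not in facts:
                    facts.add((a, verb1, d))
                    changed = True
    return facts - set(statements)

statements = [
    ("TB drug x", "correlates with", "B6 deficiency"),
    ("B6 deficiency", "correlates with", "epilepsy"),
]
print(infer(statements))
```

The result is the single suggested link between TB drug x and epilepsy. Since verbs such as "correlates" are only approximately transitive, output of this kind is best read as a list of candidate connections to check, not as established facts.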

Whatever technologies actually take hold, we can be assured that the databases of the future will look a lot different from the way they do today.





Barry Grushkin (bgrushkin@dsslab.com) is a researcher at the DSS Lab (www.dsslab.com) with its founder Erik Thomsen, OLAP Council chairman, in Cambridge, Mass.




