A very common type of problem is the kind for which people at first don't even know the right questions to ask. Technology has not yet addressed this large class. In the process of exploring general solution methods for some of these problems, I will review existing technology, discuss some of the difficulties and complexities of dealing with language and text, and conclude by suggesting how current technology could, in a step or two, deal with some of the more complex issues the initial problem brings up.
Knowledge is Empowerment
This is a true story. A year ago my friend Bev adopted a beautiful baby. After six months the little girl started having epileptic seizures. The experts at the major university medical center were stumped. Bev, an unusually aggressive Web researcher, took matters into her own hands. After many hours of investigation she found that the child's prophylactic drug for tuberculosis could produce a vitamin B6 deficiency. With more work, she found that a B6 deficiency can cause epilepsy. With a new treatment, the child has been healthy ever since!
Bev had only partial information, symptoms, and personal histories to go on: not enough to know the exact key words she should use to pull the articles central to solving her problem to the top of a ranked list. Often we have some facts but do not know how they fit into a larger schema of knowledge, which is one reason why live customer help is so much more useful than menu-driven methods.
To solve problems like these in an automated fashion requires some connection making: a chain of connected articles (a path through article space, like a path through Internet nodes to find a target email site), with some of the component articles not necessarily containing any of the search words. In Bev's case, an article with technical names for the TB drug and B6 turned out to be one of the key links, with no mention of TB or epilepsy.
Key-Word Search Development
We are all familiar with standard key-word Boolean searches, such as "Find me texts with both the words 'business' AND 'intelligence.'" Net search engine providers keep key words for sites or articles in in-house databases for quick, indexed, targeted searches. But this method requires exact matches, and it misses relevant texts unless the indexers have paired sites with an adequate set of relevant word variations or descriptors.
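A minimal sketch of this kind of indexed Boolean AND search may make the exact-match limitation concrete. The function names and the three toy documents are made up for illustration; real engines keep far richer indexes.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search_and(index, *words):
    """Return ids of documents containing every query word (exact match only)."""
    sets = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*sets) if sets else set()

# Toy documents, invented for this example.
docs = {
    "a": "business intelligence tools for analysts",
    "b": "intelligence reports on small business lending",
    "c": "businesses need better intelligence",
}
index = build_index(docs)
hits = search_and(index, "business", "intelligence")
```

Document "c" is missed because "businesses" is not an exact match for "business," which is the gap that added word variations or descriptors are meant to close.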
Straight word search benefits greatly when teams of researchers can add associated terms, cull spam and errors, and add topical categorizations. Human input greatly improves roughshod automated methods, a motif you will encounter again a little later.
The next level of technology takes advantage of a separate database of clusters, or related word groups, to expand the search. Then the search engine can recognize "OLAP" and "online analytical processing" as the same thing. The nature and use of these word clusters lead to a lot of very complex and sophisticated questions. Understanding these issues, I expect, will be essential in producing the next breakthrough technologies.
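As a rough sketch of cluster-based expansion, the lookup below treats every term in a cluster as equivalent when matching text. The `CLUSTERS` table and both function names are hypothetical; a production system would hold such groups in a separate database.

```python
import re

# Hypothetical cluster table: each frozenset groups terms treated as equivalent.
CLUSTERS = [
    frozenset({"olap", "online analytical processing"}),
    frozenset({"tb", "tuberculosis"}),
]

def expand(term):
    """Return the term together with every other member of its cluster."""
    term = term.lower()
    for cluster in CLUSTERS:
        if term in cluster:
            return set(cluster)
    return {term}

def matches(text, term):
    """True if the text contains the term or any cluster variant as whole words."""
    lowered = text.lower()
    return any(
        re.search(r"\b" + re.escape(variant) + r"\b", lowered)
        for variant in expand(term)
    )
```

For example, `matches("Online analytical processing at scale", "OLAP")` is true even though the text never uses the abbreviation. The word-boundary check keeps "TB" from matching inside a word such as "football."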
The earliest grouping methods used simple thesaurus lists for their comparative word sets. The next approaches added thesaurus-based measures of word closeness. In this way, the closer a found word was to a sought word, the higher the article containing it ranked.
Next, statistical methods of word association were used. Words can be considered close if they tend to appear in the same articles or, even better, in the same groups of articles constructed by researchers or librarians, or by the statistical methods I discuss later. This approach facilitates key-word matching even with foreign-language variants.
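One simple way to put a number on "tend to appear in the same articles" is the overlap between the sets of articles containing each word. The sketch below uses the Jaccard ratio, which is my choice for illustration and not necessarily the measure any particular product used; the four toy articles and the placeholder name "drugx" are invented.

```python
def docs_containing(docs, word):
    """Indexes of the documents in which the word appears as a whole token."""
    return {i for i, text in enumerate(docs) if word in text.lower().split()}

def closeness(docs, w1, w2):
    """Jaccard overlap: shared documents divided by documents with either word."""
    d1, d2 = docs_containing(docs, w1), docs_containing(docs, w2)
    union = d1 | d2
    return len(d1 & d2) / len(union) if union else 0.0

# Toy articles, invented for this example.
docs = [
    "drugx depletes vitamin b6",
    "vitamin b6 deficiency and seizures",
    "drugx treats tuberculosis",
    "stock prices fell",
]
always_together = closeness(docs, "vitamin", "b6")
sometimes_together = closeness(docs, "drugx", "b6")
never_together = closeness(docs, "stock", "b6")
```

Words that always appear together score 1.0, words that never share an article score 0.0, and partial overlap falls in between.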
The Concept of Concept
Furthermore, if measures of word importance (which generally means rarity) are added, the closeness and importance of related words, along with their density and prominence of placement in articles, give the system ways to rank hits. Ranking hits is critical when you start getting millions of them.
When constructed word clusters started to look meaningful, some developers started to call them "concepts." The term is useful, but it can conceal more than it reveals. That the term "concept" is also used by some software makers to refer to groupings of documents or summary elements within texts shows how mercurial the concept of concept is. Perhaps we should heed Heraclitus on the difficulty of understanding: "The lord whose oracle is at Delphi neither speaks nor conceals, but gives signs." It's time for a disclaimer.
We must remember that statistics on word distribution can at best be a very crude proxy for content or meaning. Words, articles, or phrases in sets, fuzzy sets, lists, clusters, or even words in sentences are not concepts. Only people have concepts. Concepts are part of an active, contextually interacting, and evolving human social domain. It is too easy to anthropomorphize and thus project the complexity of our thinking onto far simpler entities. Meaning in text arises in use, with a wide range of nonlinguistic mental models, modules, and assumptions in play. I overheard this conversation in a bar: "You know that guy." "Oh, him." "I hate when that happens." "Ouch." You really had to be there. Text is often only a mnemonic for events, and context exterior to the text often fills in a great deal.
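To illustrate ranking by rarity and density, here is a small sketch that weights each query word by how few articles contain it and by how densely it appears in a given article. The logarithmic rarity weight is a common convention that I am assuming here, and the function names and toy texts are made up; prominence of placement is left out for brevity.

```python
import math

def idf(docs, word):
    """Rarity weight: words found in fewer documents score higher."""
    n_with = sum(1 for text in docs if word in text.lower().split())
    return math.log(len(docs) / n_with) if n_with else 0.0

def score(docs, text, query_words):
    """Sum over query words of density in this text times rarity overall."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(words.count(w) / len(words) * idf(docs, w) for w in query_words)

def rank(docs, query_words):
    """Documents ordered from highest to lowest score."""
    return sorted(docs, key=lambda t: score(docs, t, query_words), reverse=True)

# Toy texts, invented for this example.
docs = ["the epilepsy study", "the the the report", "the market"]
ranked = rank(docs, ["the", "epilepsy"])
```

Because "the" appears in every text, its weight is zero, so the article that mentions the rare word "epilepsy" rises to the top no matter how often the others repeat "the."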
After hundreds of millions of years, animals evolved with enough precise spatial and motor skills to jump freely in the trees. Language, an evolutionary latecomer, depends on many of these kinds of nonverbal mental skills. When we teach how to swing a bat, how we explain the process depends on how we conceptualize experiences of our body in action.
Why do grammar checkers and text summarizers often seem inadequate? Key meaning components can be exterior to the text. Consider: "She placed the secret locket {among or between?} their fingers." Which word is correct depends on what is in our heads: whether we are imagining the fingers in groups or in pairs. Some recent work is moving in the needed direction. HNC Software Inc.'s work on Cortronic Neural Networks attempts to include some contextual information in neural nets by developing a special language for coding spatial and auditory relations.
Even intratextually, word meaning varies by context. "California" as an adjective can have quite different meanings when placed before "maki," "bagels," or "girls": avocado and crab, nuts and raisins, or blond and tan, respectively. To the extent we can make the situational context part of our organizing or searching process, we improve the chance of useful and meaningful results. For example, our word grouping should have "AMA" equivalent to "American Medical Association" in a medical context but "American Management Association" in a business context.
In his book Women, Fire, and Dangerous Things (University of Chicago Press, 1990), linguist George Lakoff maps out some of the widely varying yet partially overlapping uses of such simple terms as "there" and demonstrates that words have very complex, interacting, culturally dependent topologies. That the geometry of word meaning is very complex can also be seen in Ludwig Wittgenstein's theory of family resemblance. He shows, for example, that the word "chair" can denote objects with complexly variable elements (dimensions) of similarity and difference. This view is quite different from the hub-and-spoke radial image that might be invoked to compare real things with the pure ideal types of which Plato spoke. Neural nets can deal with complex topologies, but the right form and the right way of programming them have remained elusive. The Haiku system now in development uses genetic algorithms to let you view some of these kinds of high-dimensional, complex, geometric relations. (See Figure 1.)
FIGURE 1 Genetic algorithms allow capturing complex, high-dimensional interrelations in the Haiku system from The University of Birmingham.
The Technology at Hand
Back to the review of current text technologies. Another idea that has been used is to find distances between articles based on their overlap of important words or related words. By clustering on these distances, you can sometimes get useful groupings of texts.
In the simplest models, each document can be seen as a vector in word space, the vector components being word counts in the document. In turn, each word can be seen as a vector in document space, the vector components being the number of times that word appears in each of the basis vector documents. Document or word clusters or groupings can be seen as regions in these spaces.
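The two views described above are simply the rows and columns of one count matrix. The sketch below builds that matrix for three toy documents and compares documents by cosine similarity, which is one common choice of closeness measure that I am assuming here; the helper names are invented for illustration.

```python
import math
from collections import Counter

def doc_vectors(docs):
    """Return the sorted vocabulary and one word-count row per document."""
    vocab = sorted({w for text in docs for w in text.lower().split()})
    matrix = []
    for text in docs:
        counts = Counter(text.lower().split())
        matrix.append([counts[w] for w in vocab])
    return vocab, matrix

def cosine(u, v):
    """Cosine of the angle between two count vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy documents, invented for this example.
docs = ["tb drug dosage", "tb drug b6", "stock market news"]
vocab, matrix = doc_vectors(docs)

# Each word as a vector in document space: one column of the same matrix.
word_vectors = {w: [row[j] for row in matrix] for j, w in enumerate(vocab)}

similar = cosine(matrix[0], matrix[1])
unrelated = cosine(matrix[0], matrix[2])
```

The first two documents share two of their three words and score about 0.67, while the third shares none and scores 0.0. A clustering step would then group documents whose vectors point in similar directions.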
FIGURE 2 Topical document map from Conceptual Dimensions.
Once you have ways of determining text distances, as these vector spaces offer, you can start making article relationship maps, as in Figure 2, or topographic maps of article cluster densities, as in Figure 3, along with many other wonderful images. Check out the Spire system for stunning examples.
FIGURE 3 Topographic article density map from the Spire system of Pacific Northwest National Laboratory.
Often with some hands-on work, including giving titles and rearranging some assignments, these clusters can make very useful categories. There are many applications: folders that divide search lists into useful subgroupings (check out NorthernLight's search engine), email questions directed to the right specialist, news stories hot off the wire selected for your interests, people put in contact with others of similar interest, fraud detection, and making massive amounts of corporate information more organized and available.
RESOURCES
Audio Mining, Dragon Systems
Concept Agents, Autonomy
Concept Maps, Conceptual Dimensions
Cortronic Neural Networks, HNC
Haiku, The University of Birmingham
Intelligent Miner for Text, IBM: www-4.ibm.com/software/data/iminer/fortext
Knowledge Organizer, Verity
Search with Folders, NorthernLight
Spire, Pacific Northwest National Laboratory: www.multimedia.pnl.gov:2080/infoviz
Ultraseek Server, InfoSeek
My work on categorizing companies by clustering on similarity of stock movement showed that even slight changes in samples can seed differing yet interpretable, repeatable, and stable groupings. IBM goes with export, big, high-tech, computer, consulting, or research companies, and so forth, depending on which part of the data relations ends up dominating one particular clustering take. This kind of multiplicity is nearly always in the nature of any attempt to summarize complex relationships. I ended up working with fuzzy relations of clustered set variations, which are useful but far from capturing the fullness of complex relations.
To relate all this back to Bev's problem, a solution engine could offer us chains of articles starting with TB and ending with epilepsy, where each article overlaps with the next by some concept or important word cluster. We'd hope to find TB -> TB drug -> B6 -> Epilepsy, with URLs and links to the articles that explicate each of the components. Differing potential paths would perhaps be ranked by their directness, the importance levels of their overlaps, and the number of citations that contain the key elements.
Many of the technological components to create such a solution engine are available. Databases could be explored iteratively in parallel using words associated with articles already encountered until paths of articles linking the parts of the problems are found. Intuitively, we can think of this as a sort of stream of consciousness wherein the concepts in one article trigger thoughts of other potentially valuable articles, until the needed connections are made.
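A toy version of this exploration can be written as a breadth-first search over articles, where two articles are linked whenever they share a term. This is only a sketch under simple assumptions: the article names and term sets are invented, every shared term counts equally, and the shortest chain is returned without any ranking by importance or citations.

```python
from collections import deque

def find_chain(articles, start_term, goal_term):
    """Shortest chain of articles from one mentioning start_term to one
    mentioning goal_term, where neighbors share at least one term."""
    starts = [name for name, terms in articles.items() if start_term in terms]
    queue = deque([name] for name in starts)
    seen = set(starts)
    while queue:
        path = queue.popleft()
        last = path[-1]
        if goal_term in articles[last]:
            return path
        for other, terms in articles.items():
            if other not in seen and articles[last] & terms:
                seen.add(other)
                queue.append(path + [other])
    return None  # no chain links the two terms

# Hypothetical articles, each reduced to a set of important terms.
articles = {
    "tb-overview": {"tb", "tb drug"},
    "drug-pharmacology": {"tb drug", "b6"},
    "b6-neurology": {"b6", "epilepsy"},
    "olap-primer": {"olap"},
}
chain = find_chain(articles, "tb", "epilepsy")
```

Here the middle article mentions neither "tb" nor "epilepsy," yet it is the link that completes the chain, much like the key article in Bev's search.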
This kind of structuring of the unstructured, this organizing of free associations, is possibly something like how humans put together disparate facts. There is certainly plenty of psychological evidence that humans use context clues in recall. "Study for a test where you are going to take it," says some research. Research by psycholinguist Elizabeth Bates tells us that speakers of languages with gendered articles (such as "el" and "la" in Spanish) use them to speed recognition of the noun that follows. Half the potential words are eliminated by their gender type. In fact, many researchers have found that recall is influenced by mental state or personal context. Bates has even suggested evidence that women respond faster to female-gendered nouns, even though the genders assigned to objects can be totally arbitrary.
In the technology I am suggesting here, making context restrictions could also speed up the searches: the option to consider only articles of a medical nature, for example. Spreading activation methods used in parallel distributed processing could further speed the required calculations. Here, the nodes would be articles; the links, word groups or concept overlaps; the link strengths, concept importance and relatedness; the inputs, the key words; and the output, the sets of articles that bubble up first.
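As a hedged illustration of spreading activation over such a network, the sketch below starts with activation on the articles matched by the key words and passes a decayed share along each link for a few rounds. The `decay` and `steps` parameters, the link strengths, and the node names are all assumptions made for this example.

```python
def spread_activation(links, seeds, decay=0.5, steps=3):
    """links: {node: {neighbor: strength}}; seeds: {node: starting activation}.
    Returns nodes ordered by total activation, highest first."""
    activation = dict(seeds)
    frontier = dict(seeds)
    for _ in range(steps):
        next_frontier = {}
        for node, energy in frontier.items():
            for neighbor, strength in links.get(node, {}).items():
                passed = energy * strength * decay
                next_frontier[neighbor] = next_frontier.get(neighbor, 0.0) + passed
        for node, energy in next_frontier.items():
            activation[node] = activation.get(node, 0.0) + energy
        frontier = next_frontier
    return sorted(activation, key=activation.get, reverse=True)

# Hypothetical article network with made-up link strengths.
links = {
    "A": {"B": 1.0, "C": 0.2},
    "B": {"D": 1.0},
    "C": {},
    "D": {},
}
order = spread_activation(links, {"A": 1.0})
```

Article D, two strong links away from the seed, ends up ahead of article C, which is directly but only weakly linked. That is the sense in which well-connected articles "bubble up first."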
One last note: Specialized language games could create even speedier and more accurate solutions. Games such as those Wittgenstein suggested, in which carefully constructed formal properties follow meaningful properties, could become part of XML conventions. As an example, we could specifically code at the footer of Web sites statements that use verbs that basically show mathematical transitivity. Examples would include "owns," "causes," "implies," "correlates," and so on. Thus, adding summary statements to articles of the form "TB drug x correlates with B6 deficiency" and "B6 deficiency correlates with epilepsy" could be used to rapidly suggest the connection "TB drug x correlates with epilepsy."
Whatever technologies actually take hold, we can be assured that in the future, databases will look a lot different from the way they do today.
Barry Grushkin (bgrushkin@dsslab.com) is a researcher at the DSS Lab (www.dsslab.com) with its founder Erik Thomsen, OLAP Council chairman, in Cambridge, Mass.
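If such summary statements were available as simple subject-verb-object triples, chaining them would take only a few lines. The sketch below assumes the triples have already been extracted (for instance, from a hypothetical XML footer) and treats every listed verb as transitive. The results are suggestions to follow up, not established facts.

```python
def infer(statements):
    """Chain (subject, verb, object) triples that share a verb, treating the
    verb as transitive. Returns only the newly suggested triples."""
    facts = set(statements)
    changed = True
    while changed:
        changed = False
        for a, verb1, b in list(facts):
            for c, verb2, d in list(facts):
                new = (a, verb1, d)
                if verb1 == verb2 and b == c and a != d and new not in facts:
                    facts.add(new)
                    changed = True
    return facts - set(statements)

# Statements of the form described in the text.
statements = {
    ("TB drug x", "correlates with", "B6 deficiency"),
    ("B6 deficiency", "correlates with", "epilepsy"),
}
suggested = infer(statements)
```

With the two statements above, the only suggestion produced is that TB drug x correlates with epilepsy.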