http://www.intelligententerprise.com/01416/feat3_1.jhtml
Matching Patterns
Patterns in historical data are the lifeblood of business intelligence and knowledge
By Girish Keshav Palshikar
Making significant business decisions requires extensive knowledge about your business and your collected data. That knowledge is often expressed in terms of patterns in the data and their relationship with business phenomena, activities, and decisions. After all, as George Santayana said, "Those who cannot remember the past are condemned to repeat it." It would be no exaggeration to say that the detection, measurement, understanding, and effective use of patterns in historical data form a core of business intelligence and knowledge.
More fundamentally, the ability to detect, measure, and analyze patterns is a crucial human ability and the basis of perception, understanding, and learning. For example, a manager may offer incentives such as discounts when stock levels gradually increase while demand rapidly dwindles. As another example, a surveillance manager in a stock exchange normally expects the trading in a particular security to be rather low during a period when its price is high, or very low when the price is relatively steady. The manager may be interested in a system that can define such a pattern (at this level of abstraction, away from the database details) and generate an early warning about exceptions to this normal trading pattern. The manager may then use further data analysis to either confirm or eliminate the possibility that the detected exception constitutes suspicious activity.
Users need tools to express and detect patterns in given databases, particularly because the databases are often enormous and manual inspection of the data is well-nigh impossible. Management information and decision-support systems in such applications often provide a range of services, typically in the form of canned queries, reports, visualizations, and special-purpose computational and analysis facilities (for example, based on time series), so that decision makers can make informed and intelligent decisions consistent with past observations.
The often unstated assumption is that these tools will only support the IT experts in detecting and understanding the hidden patterns. Unfortunately, many of these tools have neither a clear underlying notion of what patterns are nor any special facilities for detecting and measuring them. As a result, the primary responsibility lies with the database users and their skills.
In this article, I explore the notion of a pattern, particularly a temporal pattern, and suggest an approach for matching a known pattern against given databases. This approach derives from the research of my organization, Tata Research Development and Design Centre (TRDDC), and its practical applications. TRDDC is the research and development division of Tata Consultancy Services, a software company in India.
What Is a Pattern?
For the purpose of understanding historical databases, I define a pattern as a significant, high-level structure present in the data. A pattern condenses and summarizes vast amounts of numeric data. Detecting a pattern (and measuring its attributes) in a given source is a significant observation: you see where, when, and how strongly it occurred, its variations from a standard reference pattern, its repetitions, other measurements, and so forth. Clearly, a pattern is a valuable piece of knowledge. Describing and recognizing such patterns contributes to knowledge building and knowledge reuse within an intelligent enterprise.
If a pattern consists of nothing but a group of records, then queries and reports should be able to find it. If a pattern is a group of data values (for example, most guests who stay for more than three nights are men, corporate executives, and use air travel), then clustering algorithms are generally well equipped to detect it. Data mining algorithms often detect such associations of data values as well. But what about patterns that are temporal in nature, like shapes in electrocardiogram (ECG) signals?
Also, most experienced businesspeople know that leopards don't change their spots. That is, the experts often have a repository of well-understood patterns that they wish to compare with the current data. So how does one detect a pattern that is known a priori, rather than discover an unknown one? The point my company's researchers make is that you often need to match a known pattern against given data, rather than automatically detect an unknown pattern.
For example, finance managers often describe the health of a company in approximate symbolic patterns. In the context of accounts receivable (AR), they may say "collection is high," "collection is slow," "collection is not improving," or various combinations like "collections are high, not slow, and improving."
The last one indicates a healthy status, which may be described in more detail as, "The AR turnover is increasing, the days' sales outstanding (DSO) is well below the given norm, and the average collection period (ACP) remains within the credit terms." Here, AR turnover, DSO, and ACP are domain concepts; their values vary over time and can be computed from sales and AR databases. The pattern is clearly composed from smaller subpatterns. As another example, you may look for periods in which "collections are high and slowing down."
In the domain of manufacturing systems, experts often know the possible faults that can occur in the system, and they usually characterize these faults by means of inexact symbolic descriptions: If the temperature steadily increases and no sustained pressure builds up, then the out valve may be leaking. Finally, we at TRDDC ask a more general question: "Is there a way to characterize the patterns more logically, so that they resemble their users' informal verbal descriptions?"
Characteristics of Temporal Patterns
What then are the characteristics of temporal patterns? First and foremost, managers and other decision makers often describe a pattern in terms that are qualitative (nonnumeric or symbolic) in nature. (See the sidebar "Ups and Downs" for specific examples.) A temporal pattern is qualitative in that it typically does not specify actual numeric values, time instants, and intervals but deals with temporal relationships between events. (See the sidebar "Time Factors," page 50.) Symbolic descriptors like high, low, close, rapidly, and so on replace numeric values.
A pattern description deals with domain-specific concepts, rather than database columns. A decision maker's descriptions of temporal patterns are conceptually at a very high level of abstraction, removed from table structure, field types, keys, and so on. Users often use inexact, approximate, probabilistic, or fuzzy terms in describing temporal patterns.
A pattern is approximate in the sense that its instances may occur several times in a given source and they are usually similar but not identical. Looked at another way, a pattern is not black and white; it is present in a graded way (described, for example, as a fuzzy degree of truth by a number between 0 and 1) rather than in a binary Boolean (true or false) fashion. Interestingly, a pattern description is composed using smaller, more primitive patterns; typically, the composition operators are logical and temporal connectives.
Describing the Patterns
Most users of temporal data would benefit from a system that performs inference and reasoning, runs queries, automatically identifies patterns, and states them in an easily understood qualitative temporal pattern language.
But first, how do you formally describe a pattern for comparison with the databases and its detected instances? We decided that the answer was mathematical temporal logic. Assume a finite, linearly ordered sequence of time instants that are not necessarily equally spaced. We also assume that a declarative fuzzy proposition can define each concept in the user's domain. Each such fuzzy proposition can have a truth-value in the range [0, 1]; thus it need not be fully true or fully false. In general, a fuzzy proposition has different truth-values at different instants in time. Thus, each fuzzy proposition defines a time-dependent concept in the user's domain. In the AR example mentioned earlier, some fuzzy propositions that you can define are "collection is high," "collection is low," "collection is improving," and "collection is slow." The values of these fuzzy propositions vary over time and come from the sales and AR databases. You can now use the standard logical connectives, together with temporal connectives, to compose formulas from these propositions.
Here's an example: Let cool, humid, and raining respectively denote the fuzzy propositions that the weather is cool, humid, or raining. Then at a particular instant, the truth-values of cool, humid, and raining may be 0.9, 0.4, and 0.1, meaning that the weather is very cool and somewhat humid, with very little rain at that instant. These truth-values may vary over the instants (note that the instants need not be adjacent; they may be, say, daily, hourly, or even weekly readings). You can detect a pattern in how the truth-values of a proposition such as cool vary over time. However, you can construct more complex and interesting weather patterns using the logical and temporal connectives. Here are some examples of weather patterns described in natural language and also as temporal formulae. I may use special-purpose fuzzy connectives like heavy, fairly, low, high, very, and so on to emphasize the degree of truth.
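The article does not say how connectives such as very and fairly act on a truth-value. A common convention, assumed here for illustration only, is Zadeh's linguistic hedges: very squares the truth-value (making the claim harder to satisfy) and fairly takes its square root (making it easier). The function names below are hypothetical; the truth-values are the ones from the weather example.

```python
import math


def very(x: float) -> float:
    """Concentration hedge (assumed definition): x squared."""
    return x * x


def fairly(x: float) -> float:
    """Dilation hedge (assumed definition): square root of x."""
    return math.sqrt(x)


# Truth-values at one instant, taken from the weather example in the text.
weather = {"cool": 0.9, "humid": 0.4, "raining": 0.1}

very_humid = very(weather["humid"])
fairly_cool = fairly(weather["cool"])

print(f"very humid  = {very_humid:.2f}")
print(f"fairly cool = {fairly_cool:.2f}")
```

Under these assumed definitions, very humid is always at most as true as humid, and fairly cool is always at least as true as cool, which matches the intuitive reading of the two words.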
Detecting the Patterns
Given fuzzy truth-values x and y (as real numbers between 0 and 1), the standard method to define the meaning of the fuzzy logical connectives is as follows:
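The usual Zadeh definitions, assumed here to be the ones intended, take conjunction as the minimum of x and y, disjunction as the maximum, and negation as 1 - x. The sketch below applies them instant by instant to two short series of truth-values. The function names, the four readings, and the example formula "cool and not raining" are all hypothetical.

```python
from typing import List, Sequence


def f_and(x: float, y: float) -> float:
    """Fuzzy conjunction (assumed Zadeh definition): minimum."""
    return min(x, y)


def f_or(x: float, y: float) -> float:
    """Fuzzy disjunction (assumed Zadeh definition): maximum."""
    return max(x, y)


def f_not(x: float) -> float:
    """Fuzzy negation (assumed Zadeh definition): complement."""
    return 1.0 - x


def cool_and_dry(cool: Sequence[float], raining: Sequence[float]) -> List[float]:
    """Truth-value of 'cool and not raining' at each instant."""
    return [f_and(c, f_not(r)) for c, r in zip(cool, raining)]


# Hypothetical truth-values at four successive instants.
cool = [0.9, 0.7, 0.3, 0.1]
raining = [0.1, 0.2, 0.8, 0.9]

pattern = cool_and_dry(cool, raining)
for instant, truth in enumerate(pattern, start=1):
    print(f"instant {instant}: cool and not raining = {truth:.2f}")
```

Because every connective maps values in [0, 1] back into [0, 1], formulas built this way can be nested freely and still yield a valid truth-value at each instant.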
See the sidebar, "Truth and Time." In Table 2, the first column shows the time instants and the next three columns show the values of the three fuzzy propositions (cool, humid, and rain) at each instant. For now, you can ignore the method used to compute the truth-values of the fuzzy propositions at each instant. As an example, though, you can compute the degree of truth of the fuzzy proposition cool from the given temperature (T) using the following formula (when T = 9°C, cool = 1.0; when T = 30°C, cool = 0.0; and when T = 20°C, cool = 0.33):
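One formula consistent with the three values quoted above is a piecewise-linear ramp: fully true at or below a lower temperature, fully false at or above an upper temperature, and falling linearly in between. The breakpoints of 10°C and 25°C in this sketch are assumptions chosen only because they reproduce those three values; the article's own formula may differ.

```python
def cool(temp_c: float, full_below: float = 10.0, zero_above: float = 25.0) -> float:
    """Degree of truth of 'cool' for a temperature in degrees Celsius.

    The breakpoints are hypothetical: 1.0 at or below full_below,
    0.0 at or above zero_above, and a linear ramp in between.
    """
    if temp_c <= full_below:
        return 1.0
    if temp_c >= zero_above:
        return 0.0
    return (zero_above - temp_c) / (zero_above - full_below)


for t in (9, 20, 30):
    print(f"T = {t} C -> cool = {cool(t):.2f}")
```

Keeping the breakpoints as parameters makes it easy to tune the meaning of cool for a different climate or a different user without touching the formulas that use it.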
The truth-values of the formulas fairly cool and very humid are shown in the next two columns. The last column shows the truth-value of a compound formula.
In another example, Figure 1 shows the price and number of shares traded for a specific company (on the y-axis) over time (on the x-axis). A common normal trading pattern, in natural language, is "the volume traded is low when the price is very high or very low," which can be represented as a fuzzy temporal formula.
Logical Patterns
In this article, I presented one approach to defining the meaning of a pattern and described a fuzzy temporal logic in which a formula has a truth-value at each instant in time (computed from the given underlying temporal databases). This logic can describe conceptual, high-level, approximate patterns that characterize time-dependent phenomena in various domains and applications. You can define simple algorithms to extract the time intervals where a given fuzzy temporal formula shows a significant presence. To facilitate the easy expression of more types of expert knowledge, you can add interval-based meta-temporal facilities to describe relationships between various time intervals of interest.
I would like to thank Prof. Mathai Joseph and Dr. Manasse Palshikar for their support.
Girish Keshav Palshikar (girishp@pune.tcs.co.in) is a scientist at Tata Research Development and Design Centre (TRDDC) in Pune, India. TRDDC is the R&D Division of Tata Consultancy Services, India's largest software company. His areas of work include theory and applications of artificial intelligence.