CMP -- United Business Media

Intelligent Enterprise

Better Insight for Business Decisions

January 14, 2002

Symmetry Found

Analytic problem solving and communication are easier with a standard language

Erik Thomsen

One of the main challenges in writing this type of column is that some of the most useful information to convey to the reader deals with the pros and cons of the different approaches to setting up the analytic model in the first place. In order to convey that information, I need a tool-neutral, analytic data definition and manipulation language. Unlike in the relational world, where SQL (itself a slowly evolving language) has for many years played the role of lingua franca, no such language beyond algebra exists as a popularly accepted standard for the world of dimensional analysis. However, I do have such a language and model: the located contents (LC) model, which was published on Dimensional Systems' Web site (now www.dsslab.com) in 1997.

The LC model is a superset of all OLAP languages in the marketplace. It provides for full symmetry between dimensions and measures ("Symmetry Lost," March 30, 1999), has logically grounded procedures for processing missing and meaningless data, and supports both ragged and leveled hierarchies, any kind of orderings, and classic procedural programming constructs. By providing mathematical capabilities for its one primitive structure, the LC model can be used to describe statistics, data mining, and decision optimization. Although I haven't done much to publicize it, most of the major vendors are aware of the LC model and a number of them are gradually incorporating its features.

Thus, to have a vehicle for describing the types of dimension, schema, and formula problem-solving challenges that will form the content of future installments, I will use the LC language as a kind of analytic SQL (until such time as either a popularly accepted standard exists or some significant threshold of readers objects to my use of the LC model with sufficient force).

This column is not about the LC model. I have left out parts of the model that go beyond traditional analyses, and those analyses that are straightforward in LC but nigh impossible with traditional OLAP tools. Rather, I will focus on topics and areas of application for which currently available decision-oriented analytic technology may be successfully used.

The analytic solutions you learn in LC are applicable to the tools you may ultimately use and will help you understand their strengths and limitations against a more general and vendor-neutral background. For example, when thinking about analytic formulas, although the specific syntax and distribution of the formula will vary from tool to tool, the essence of what you're doing does not vary.

Thus, if you know you need to take the ratio of the top and bottom quartiles of a multivariate indicator in order to derive a certain type of decision-oriented data set, you will need to create a multivariate indicator, calculate the quartile values, and take the ratio of the top and bottom quartiles regardless of the tool you happen to be using.
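The three steps named above can be sketched in a tool-neutral way. The following Python is only an illustration, not LC syntax: the store records, the `indicator` function (sales per customer), and the reading of "ratio of the top and bottom quartiles" as the third quartile value divided by the first are all assumptions made for the example.

```python
from statistics import quantiles

# Hypothetical input records: (store, sales, customers). Invented for illustration.
records = [
    ("A", 60.0, 30),
    ("B", 200.0, 50),
    ("C", 100.0, 20),
    ("D", 160.0, 20),
    ("E", 300.0, 30),
]


def indicator(sales, customers):
    """Step 1: a hypothetical multivariate indicator (sales per customer)."""
    return sales / customers


def quartile_ratio(values):
    """Steps 2 and 3: compute the quartile values, then divide Q3 by Q1.

    Assumption: "top and bottom quartiles" means the third and first
    quartile cut points; other readings (such as the mean of each
    quartile's members) are equally possible.
    """
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    return q3 / q1


scores = [indicator(sales, customers) for _store, sales, customers in records]
ratio = quartile_ratio(scores)
print(ratio)
```

Whatever tool is used, the same three steps appear; only the syntax and where each step is computed change.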

In this installment, I present the minimum number of constructs in as informal a way as possible to get the analytic solutions ball rolling.

THINKING SYMMETRICALLY

As it turns out, anything that looks like a dimension can also be used as a measure or attribute, anything that looks like a measure can be used as a dimension or attribute, and anything that looks like an attribute can be used as a dimension or a measure. This symmetry holds because you need only one kind of primitive structure, which I call a type. If it were just a matter of conceptual preference, I wouldn't take the time to introduce a deeper and more unifying view in this column. But hardcoding the distinctions among dimensions, measures, and attributes leads to many problems. It makes it unnecessarily difficult to support multiple schemata sharing the same data set, to create even reasonably sophisticated models such as customer value models, to have models refer to or derive from other models, and to cleanly support data mining and statistics. It also adds a great deal of unnecessary administrative overhead.

Even though OLAP products do not currently support full symmetry, all statistics packages do, and you will have an easier time designing analytic solutions if you think symmetrically. (You should not confuse dimensional symmetry with the so-called generic dimensions popularized by some vendors in the early '90s, which are not the least bit symmetrical.)

At a high level, the LC model is composed of types, methods of structuring types, and methods of connecting certain structured types called schemas to data sets (the combination of which is a model). All data and schemata are ultimately defined in terms of types. What are called dimensions and variables in any OLAP model from Oracle Express, Microsoft Analysis Services, IBM's Metacube, and so on would be treated as individual types in an LC model. Thus, in a typical situation, both the classic dimensions "Time" and "Store" and the classic measures "Sales" and "Cost" would be types. Types delineate the limits of what can be defined, queried, and calculated. Neighboring concepts include primitive object classes, root domains, and basic categories.
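To make the idea of treating "Time," "Store," "Sales," and "Cost" uniformly as types more concrete, here is a minimal Python sketch. It is not LC syntax; the sample rows and the `view` helper are hypothetical. The point is only that the same function accepts any type in either role.

```python
# Hypothetical data set in which every column is simply a type.
rows = [
    {"Time": "1996", "Store": "Boston", "Sales": 100, "Cost": 60},
    {"Time": "1996", "Store": "Denver", "Sales": 80, "Cost": 60},
    {"Time": "1997", "Store": "Boston", "Sales": 100, "Cost": 70},
]


def view(rows, locators, content):
    """Collect the values of one type (`content`) at each combination
    of values of the other types (`locators`)."""
    result = {}
    for row in rows:
        key = tuple(row[name] for name in locators)
        result.setdefault(key, []).append(row[content])
    return result


# The classic use: Store locates, Sales is the content.
sales_by_store = view(rows, ["Store"], "Sales")

# The symmetric use: Sales locates, Store is the content.
stores_by_sales = view(rows, ["Sales"], "Store")

print(sales_by_store)
print(stores_by_sales)
```

Nothing in the data declares "Store" a dimension or "Sales" a measure; the role each type plays is decided by how the query uses it.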

LOCATOR AND CONTENT

Although the LC model has only one primitive structure, it has two primitive use distinctions that I call locator and content. The distinction between locator and content is roughly analogous to that of dimension and measure or subject and predicate. The implication is that the popular distinctions are valid as use distinctions within the context of a query or schema but not in terms of primitive structures or metadata.

A person or compiler can automatically make the distinction between locator and content on an expression-by-expression basis relative to a set of types and a data set. In the same way that you can speak and understand what someone is saying without first labeling words as belonging to a noun or verb phrase, you can define schemata and queries in the LC model without labeling types as locators or contents. This ability is the essence of what is meant by the functional approach.

Even if done automatically, keeping track of which types are being used for which function is important, because the choice of locators and contents affects the processing of empty cells. Empty cells in a locator render the row inapplicable or meaningless (assuming the input data exists in table form). Empty cells in a content have no effect on the rest of the row or the other contents for that location (assuming all the contents are independent). They are treated as missing if the content applies to that location and as meaningless if it does not.
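The empty-cell rules just described can be sketched as follows. This is an illustrative Python reading of those rules, not part of the LC language: the markers `MISSING` and `MEANINGLESS`, the `interpret_row` function, the `applies` rule, and the sample rows (including the "Floor_Space" content and the "Online" store) are all hypothetical.

```python
MISSING = "missing"          # content applies here, but no value was supplied
MEANINGLESS = "meaningless"  # content does not apply to this location


def interpret_row(row, locators, contents, applies):
    """Interpret one table row under a given choice of locators and contents.

    Returns None when any locator cell is empty (the whole row is
    inapplicable); otherwise returns (location, cells).
    """
    if any(row.get(name) is None for name in locators):
        return None
    location = {name: row[name] for name in locators}
    cells = {}
    for name in contents:
        value = row.get(name)
        if value is not None:
            cells[name] = value
        elif applies(name, location):
            cells[name] = MISSING
        else:
            cells[name] = MEANINGLESS
    return location, cells


def applies(content, location):
    """Hypothetical applicability rule: an online store has no floor space."""
    return not (content == "Floor_Space" and location["Store"] == "Online")


locators = ["Time", "Store"]
contents = ["Sales", "Floor_Space"]

boston = {"Time": "1996", "Store": "Boston", "Sales": 100, "Floor_Space": None}
online = {"Time": "1996", "Store": "Online", "Sales": None, "Floor_Space": None}
no_time = {"Time": None, "Store": "Boston", "Sales": 50, "Floor_Space": 10}

for row in (boston, online, no_time):
    print(interpret_row(row, locators, contents, applies))
```

Note that the empty "Sales" cell in the online row does not disturb the treatment of "Floor_Space" in the same row, which reflects the assumption that the contents are independent.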

Because the distinction between locator and content is entirely functional, users don't need to declare types as locators or contents; the distinction is apparent from the form of the query, definition, or calculation. In general, by parsing input tokens (and assuming a textual language) into type names "T_x" (the type named "x") and type instances "i_n" (the instance named "n"), the contents in any expression are those types with unspecified instances. (For example, if "Time" were a type name, then "1996" would be an instance of that type.)
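As a rough illustration of that parsing rule, the Python sketch below splits the tokens of an expression into type names and type instances and then reports as contents those named types for which no instance was given. The whitespace-separated token format, the `catalog` of types, and the function name are assumptions made for the example; they are not LC syntax. The sketch also assumes that instance names are unique across types.

```python
# Hypothetical catalog: type name -> known instances of that type.
# Sales and Cost take numeric instances, which are not enumerated here.
catalog = {
    "Time": {"1996", "1997"},
    "Store": {"Boston", "Denver"},
    "Sales": set(),
    "Cost": set(),
}


def split_locators_and_contents(tokens, catalog):
    """Return (locators, contents) for one expression.

    locators: dict mapping each type to the instance specified for it.
    contents: list of types named in the expression with no instance specified.
    """
    instance_to_type = {
        instance: type_name
        for type_name, instances in catalog.items()
        for instance in instances
    }
    named = []     # types referred to by name, in order of appearance
    locators = {}  # types whose instance was specified
    for token in tokens:
        if token in catalog:
            if token not in named:
                named.append(token)
        elif token in instance_to_type:
            locators[instance_to_type[token]] = token
        else:
            raise ValueError(f"unknown token: {token!r}")
    contents = [name for name in named if name not in locators]
    return locators, contents


locators, contents = split_locators_and_contents(
    "Sales Cost Time 1996 Boston".split(), catalog
)
print(locators)
print(contents)
```

Here "Time" and "Store" end up as locators because the expression pins them to "1996" and "Boston," while "Sales" and "Cost" are contents because their instances are left open.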






