http://www.intelligententerprise.com/010507/feat3_1.jhtml

The Long View

Trying to create an integrated customer record? This healthcare data warehouse holds valuable lessons

By Karen Shockley

When you're faced with obtaining information from terabytes of data, your first thought may be "We need a data warehouse." But, unless you refocus this term to mean, "We need a data warehouse that can solve a particular business problem," you will waste much time and effort (and expensive resources). Ideally, solving this business problem supports the overall strategic objectives of your company. For example, your company may need to determine specific buyer demographics so that it can implement a plan to increase consumer purchasing in its stores. Here the supported business objective is to grow market share. Only by aligning IT objectives with business objectives can a company hope to have a successful data warehousing venture.

In this article, I'll discuss a data warehouse that my team at EDS created to support decisions related to healthcare objectives, but the same techniques can support many types of business decisions. The key is knowing which business objectives are crucial to the success of the enterprise.

In this healthcare example, the crux of the problem was that the design needed to handle a longitudinal view of the patient, as well as provide trends of services to these patients. The silver lining was that other industries could also use this solution to solve their business needs. In the same way that you view a patient record, users could view the overall buying trends of customers over the life of their relationships with a company based on time and demographics at the time of purchase. Additionally, the user could determine the purchasing habits of particular demographic components of the country or of the world. The user could also determine the amounts of money spent by location or by demographics, or the user could determine spending trends and whether that spending was a single purchase or comprised multi-item buying.

Data Access

The problem seemed straightforward: Create a data warehouse that contains healthcare data on several million beneficiaries and be available to both "novice" and "power" users. The goal of the warehouse was to assist the organization in providing patients quick, convenient access to quality healthcare information, with an efficient expenditure of resources. Then, as long as the warehouse was being created, the client wanted to include the following "challenging" requirements:

  • Load disparate data for easy access by the user.
  • Ensure that users can view the data in two ways:
    1. At a strategic level: a hierarchy starting from the top organization levels and ending with data captured for a specific clinic
    2. At an individual patient level.
  • Load three terabytes of historical data.
  • Ingest more than 50,000 files per month and load the data warehouse during down times.

The data warehouse model addresses the first two challenges, and the information architecture address the second two. A star schema model was selected because users could easily adapt it to form business questions, and it can provide answers to those questions relatively quickly. The real challenge was determining the best method to employ the star schema.

The business questions were divided into four relatively straightforward environments; a set of user-defined metrics separated themselves into the areas of resource utilization, the demographics of the population being served, the specific care being provided, and the cost of that care. In a retail instance, you could replace healthcare service with the product being supplied, purchased, tracked, and so forth. And, of course, all the metrics were a product of time.

Facts and Dimensions

The model was created with four facts: resources, financial, healthcare, and population. These facts were the four main business areas for the operations under analysis. They allowed the most opportunity for cost-saving analysis and resource utilization, and provided access to quality healthcare data and the ability to track the medical needs of the entire population. In addition to allowing queries against data in the four specific areas, these facts also share several dimensions, allowing the user to not only obtain information from one star schema (or set of business questions), but also across star schemas (and across business lines). A description of each of the facts and dimensions follows.

Resources. This fact accessed both the time and organization dimensions and contained measurements for the total number of bed days, number of both outpatient and inpatient visits, the number of admissions and dispositions, and the expenses associated with each hospital unit. The resource utilization data was reported monthly, and the healthcare data was received daily. Therefore, the adopted convention was: The resources available on the first day of the month were those applied to the patient care for that month. By creating queries across the resource and healthcare facts, you can determine which level of hospital resources (or retail employees) were required for each product line.

A comparable business solution would track the number of products purchased from each division in the organization, as well as expenses associated with that division.

Financial. This fact was dedicated to claims processing for the healthcare provided. Measurements included the total number of claims; total amount billed, allowed, and paid; copay amounts; the number of services provided; the number of days to process a claim; and the amount of claim adjustments. A business application could store the total number of transactions, charges, checks, and cash.

Healthcare. One row was loaded for each diagnosis or procedure. This row, however, contained the unique number referring to the encounter. You could then obtain either the total number of procedures and diagnoses for the individual, or all of the procedures and diagnoses for a specific encounter. The measures comprised the number of encounters, procedures, bed days, and number of diagnoses or procedures.

On a retail level, you might store data about each item reflected on an invoice. You could then query all the people that have purchased radios or find out all the items purchased at one time by the same individual. You can then join star schemas and determine the number of times a person made purchases at each store, the specific items purchased, and the demographics at the time of each purchase.

Population. One row was loaded for each enrolled month that the person was a beneficiary in this medical plan. This row was identified by the last day of the month for which the data was valid. Measures consisted of the total population, those enrolled in an HMO, and those eligible to receive care outside of the HMO. These measures allowed a longitudinal view of the patient's locations, marital status, and number of dependents. The population fact can then join the encounter fact to determine specific demographics as to when care was received.

An application in the retail world could tie the products purchased by a consumer enrolled in a "frequent buyer" program and then use this information to tie purchases to the demographics of the people as they aged, moved to different parts of the country, and so on. (Figure 1 depicts the architecture used to implement this solution.

Several dimensions were shared among these facts, letting users query across star schemas for relevant questions. The model included the following dimensions (in order clockwise from the resources star schema, see Figure 2,):

  • Time contained a hierarchy for calendar dates as well as for fiscal dates. The data stored at the lowest level was always a specific day so that the data remains the same, even if the parameters for querying time change.
  • Organization - information on organizational resources - was stored in a hierarchy that moved from the geographic areas that defined the healthcare providers down to the specific clinics used by the patients. A comparable retail scenario would be a business that had several stores or factories, with each store or factory responsible for several product lines.
  • Claim type contained information on the types of claims submitted, the claim's approval status, and the dollar amount of any adjustments. Comparable information in the business world could be details of the transaction: payment by credit card, debit card, or cash.
  • Provider included characteristics of the healthcare provider: provider ID, specialty, the practice's country, state, and zip code.
  • Diagnosis and procedure combined these two types of data into one dimension for easier and faster accessibility by the user. A flag in the fact denoted which data type was applicable.
  • Encounter Type contained information about the particular healthcare received: inpatient or outpatient, same day surgery, and whether their primary care physicians saw the patients. Comparable information in the business world could be the products purchased.
  • Demographics included information about the patient that was unlikely to change: identification number, gender, birth date, and so on. These factors change slowly, if at all, and so kept separate from those attributes that change periodically.
  • Person incorporated the beneficiary's social security number (SSN), as well as the system ID. By separating the SSN into its own dimension and using the system ID in most queries, my team and I ensured the required patient confidentiality. One of the requirements was that a limited set of users must be able to link patient records with actual patient files. These users, and only these users, had access to this dimension.
  • Residence gave the capability to track the care provided according to different areas of the country. The beneficiary's city, state, country, and postal code provided the capability to link care given to the geographic area.
  • Detail Demographics included information about the patient that would change over time - in particular, those items that were crucial to business decisions, such as marital status, number of dependents, employment, and age group. People in the 21-to-30 age group, for example, have differing needs from those in the 0-to-10-year-old population.
Information Architecture

The solution employed a four-tier architecture that consisted of a transactional data store, an extract, transform, and load (ETL) layer, a relational database server, and a metadata processor combined with a user interface OLAP tool.

Tier One: Transactional Data Store (TDS). The TDS loads each file as it arrives into a flat file in the database. Each source file has its own table in the database; each row in the input data corresponds to one row in the target table. Appended to each row in the table is the file date (used to determine which source file was sent each day) and the processing date (used by the ETL to determine which records still need processing). The source files can be deleted after processing or maintained online until the data is deemed stable.

These tables can now be used to implement various stages of data integrity. You have the advantage of having the raw data in a table that you can query. Automated queries can run each day to determine the number of files and the number of records loaded and rejected. These queries can be compared to the number of files and records sent, with anomalies reported to the appropriate level in the organization. One or more combinations of fields uniquely identify these records, so that they can be tracked. The table can also query the raw data to determine which fields are being left "blank" that are vital to the organization, which fields have data values falling outside of the prescribed domain, and if the returned values are indeed part of the data.

Storing the data in tables also makes the ETL process run more efficiently. Depending upon the ETL tool being used, loading data from one table for each data source can be a timesaving technique. Rather than directly processing the several hundred source files, the ETL server only needs to open and access one source.

Tier Two: ETL Layer. The ETL layer extracts data on a periodic basis from the TDS, based on the processing date. The ETL also provides data synchronization, combines data from several different sources, and uses reference tables to ensure transformation of data from two sources so that you can compare "apples to apples."

Having the ETL tier in a separate tier provides several advantages. First, you have unlimited scalability because the ETL is not part of the database engine. As you add more sources or the processing becomes more complex, you can simply add more ETL systems. Second, by moving the bulk of the processing from the database engine platform to a secondary platform, more cycles in the database engine were available to respond to user queries. The ETL tier also performed data integrity functions that would normally be assigned to a Third Normal Form (3NF) database. The ETL processes kept track of unique records, and thus knew whether inserts or updates were required. They also employed reference tables to ensure the data's authenticity.

Tier Three: Database Server. The database engine processed and loaded data in a multiprocessing parallel environment. The database design contained provisions for storing the hierarchy level of each row of data. These levels can then be employed in creating aggregated tables. These aggregate tables were then created or updated after the initial data load and each subsequent data load.

Separating the extraction and loading data into its own layer let the database engine be totally available for responding to user queries. This layer also allowed users to obtain more "bang for the buck." Platforms that run database engines are typically more expensive than the NT systems employed in the ETL tier. Thus, the more significant investment is dedicated to supporting the business solutions.

Tier Four: User Interface. A data warehouse is only as good as the data it provides to the user. The user can only obtain this data if it is easily accessible, so it is crucial to the success of any data warehousing effort to choose a user-friendly tool. The challenge lies in identifying a tool that a novice user can easily employ yet also offers streamlined processing for the more advanced analyst. We selected a user tool that supported the use of OLAP. Additionally, this tool supported the use of an aggregation tool that streamlined requests prior to querying of the database.

As a user query was received, the aggregation tool scanned the SQL statements used to create the aggregate tables to determine if the response was already computed. (Ideally, the aggregate tables are designed so that these tables satisfy 90 percent of the queries.) The full database was only accessed when the answer could not be obtained through the aggregates. The tool was also employed to track the queries being used and provide suggestions for additional or alternate aggregates. Additionally, you can implement tools that track each query by user. This information can determine who is accessing which data and how frequently; it can be used to allocate resources, as well as track costs.

Because a large percentage of the answers were preprocessed, users could receive a relatively quick reply to their queries. This technique leaves maximum CPU time for those power users whose strategic business questions cannot be foretold and who must access the entire data set. By using the point-and-click tool, non-SQL personnel could easily query data to respond to business concerns.

Flexible Solutions

As the previous solution was implemented, users had access to an extremely flexible tool. They had a "galaxy" of four stars, which were designed to work independently or in concert. The user could obtain data by hospital, provider, resources consumed, or patient being served. Demographic data for the entire healthcare population was stored so that calculations concerning the percentage of available patients that consumed resources could be monitored. A further enhancement would be to correlate the demographic information for this beneficiary population with all the population in that geographic area.

Because the demographics of both the insured and uninsured population are then known, you could use treatment information to specifically target those people who are eligible for services but are not obtaining them and those who are registered for services, but aren't receiving preventive healthcare, such as mammograms, eye exams, and so forth. A well-designed data warehouse will continue to evolve. As the users receive more information, they will continually identify new opportunities to use information to meet ongoing and changing business objectives.



Karen Shockley (karen.shockley@eds.com), a senior consultant with the EDS E.-Solutions product line, has been the software development manager for development and deployment of regional integrated database systems and accompanying data marts for the U.S. military's healthcare systems.

Return to Article