Power To The PeopleGetting the right information to the right decision makers is a complex process, but automating the data staging process can helpBy Stacy Bergert In business intelligence, the definitive goal is to turn data into information so that knowledge workers can make decisions that ultimately result in action. The steps to ensuring this success are obtaining, processing, transforming, and loading the data that supports the business. Data staging is the process of pre-positioning transactional data in a way that enables timely, uncomplicated transformation and loading of data into an online analytic processing (OLAP) system. For many business intelligence projects, data staging is often one of the most complicated steps in the extract, transform, and load (ETL) cycle. In a perfect world, the ETL process would execute data staging through real-time connectivity to the source systems, flawlessly extracting and preparing data. But the reality is quite different. If your organization is typical, it has several difficult challenges: How do you transfer thousands of data files quickly enough to get timely, relevant information into your business intelligence process without employing a large staff? How do you keep track of data pushed into the OLAP environment without knowing when or where the data was received? How do you solve these problems while methodically repeating the process for each loading cycle? In this article, I'll explain how to get around such nuisances in order to stage data quickly and effectively, making it ready and available for integrated processing. Facing the FactsAll business intelligence scalability assessments need to account for the volumes of data that will pass through the architecture. In cases of large data volumes, your team has to make choices about how to track the data going into the system and what to do with it. For example, most users want data validated to ensure it is received in its entirety. Users also want to know that the data was processed. Furthermore, the more quickly data "pushes through" through the information architecture, the more it quickly it will attain business value for users. One alternative is to employ a large staff (six people or more) whose sole purpose is to receive data, stage it, and communicate the results back to the rest of the team (and the users). When you make this choice, not only do you burden several people with a mundane job, but you also bring the human error factor into your data staging process. (Trust me, I've seen this strategy in action.) Alternatively, if you can automate a process, you can also automatically correct errors and make the process "learn" from its own mistakes. Another issue many organizations encounter in business intelligence projects is the inability to control the data feeding into the OLAP environment. For various reasons, such as security concerns or lack of direct connectivity to the source location, data is sometimes not directly accessible through traditional means. Some data providers, either entirely external or belonging to other internal organizations, prefer to "own" the transfer process and push the data into the architecture (via FTP, for example). If so, the business intelligence team can lose control of its own data feeds at precisely the point where that control is most important. The data reaching the OLAP environment might "touch down" in an unorganized form (because of the volumes of data received, the lack of control over the data feed, or other reasons), so the best way to organize that data is to automate the data staging process immediately following its arrival into the system. You can meet this goal in five steps: create a data catalog, understand ETL data requests, analyze the data catalog, organize or move the data, and process the data. Creating a Data CatalogThe first step is to come up with a data cataloging strategy. A data catalog is a list of the data that is available for processing and other bits of information related to the available data. (See Table 1.) Depending on the complexity of the data warehouse architecture, various data stores (such as the operational data store, data marts, or data warehouse) may have multiple data sources. Cataloging this information gives you the ability to track specific information back to a particular source of interest. You can store the catalog in an RDBMS table or append it to a text file. The goal is simply to capture the data in an easily available format. The complexity of the solution depends on how you will use the catalog. For example, you can manage the catalog through an interface with the data staging server. This approach is useful when you want "one version of the truth" to be captured and managed in the ETL layer. Perhaps if an ETL tool manages the movement of the data, the tool can feed the catalog with the metadata as well. Another way to create the data catalog is to leverage the expected file naming convention (or other known attributes of the expected data) and use code to scan the data touchdown location, documenting the data information into a data catalog. This method is useful when source file information (such as file naming patterns) are known and you can accurately predict when new data files are available for coding. You may also want to adjust the data touchdown, or the time the data first reaches the business intelligence architecture. Sometimes the touchdown process itself can help organize the data, and in doing so, simultaneously create the catalog. In extreme circumstances, you can create the catalog manually. In the approach I'll describe here, the catalog creation method is irrelevant because the ETL layer (which can, and very likely will, involve multiple servers) becomes the "owner" of the data. This strategy complements the decision to make the ETL process dictate data integration quite nicely; because the ETL server owns the data integration, it makes sense to develop a strategy in which the ETL server can own the data it needs, when it needs it. Understanding ETL Data RequestsHaving multiple ETL server implementations often entails dedicating servers to specific data requirements. Each ETL server might not need to process all the data, even though the data catalog will contain a list of all data available. The ETL server needs a way to learn (or be taught) the data it will process. (See Table 2.)You can accomplish this goal by creating an ETL "Request" job that creates an ETL-controlled lookup file storing information about the data the server needs to process. You should store the default information as it relates to source system and record type, and also develop ETL jobs with as much built-in flexibility as possible. Storing information about the source data can complement an enterprisewide metadata strategy and give individual team members the ability to understand the details (timeliness, source/record name, and size) about the data, some of which can enable more efficient coding techniques. Analyzing the Data CatalogIn this step, you move and analyze the data in the data catalog. To truly automate your ETL server, you need to give them convenient access to information about what data is available. This data (which is really metadata because it contains details about the source data) is stored in the data catalog. To move that metadata, you should integrate commercial-off-the-shelf (COTS) transfer software that has a command line invoked interface that can be called by code (in this case, code within the ETL tool). To read (or analyze via automation) the data catalog, create an ETL job that invokes the COTS transfer software to pull or transfer the catalog. (See Figure 1 .) At this point, knowing what data is available vs. what data the server needs is quite useful. It lets the server, rather than the individual, control the process. Each ETL server reads the catalog and creates a separate lookup file - its own copy of the catalog - that matches the data the server requires (which the server knows from the "Request" job). (See Table 3.) Organizing and Moving DataAt this stage, the ETL server's catalog version transfers, again using the integrated COTS transfer software, the data needed by that ETL server. (See Figure 2.) Some transfer tools allow bulk transfers in which many files transfer in one step. But in my experience, it's a better idea to make the ETL process transfer files one at a time while tracking the assigned transfer number. The COTS product can initiate file transfers one after another without checking the status or waiting for completion, which enables hundreds of file transfers within seconds of each other. By capturing a transfer number for each file, you can check the status of each file as well (after initiating all transfers), and if a transfer has failed, make the ETL job act accordingly by necessary business rules. To manage data movement complexity, you can implement data movement by source system and record type for a specific time period. Furthermore, you can automate this task in the ETL design and ensure flexibility at the same time. For example, as I explained earlier, most projects that require this approach do so because of a lack of control over the data entering the OLAP environment. Using this automated process, they can reduce their data into manageable portions, eliminating the need to understand how to transfer all the data at once. The team probably has limited understanding of whether all the data even reaches the business intelligence environment at the same time, anyway. Another useful extension in the design is to generate automated reports following the transfer of data. By focusing on subjects (one source per time period), you can constrain the output of a data transfer report to manageable amounts. Finally, this approach helps limit the amount of data being transferred at one time to a more easily calculated variable. You can create an element of your scalability assessment for data volumes within a time period for a single source, and guarantee space available for the data one source at a time - which is much more accurate than doing so for a grouping of sources. Furthermore, in an uncontrolled data feed environment, all the data for all the sources probably won't be available at the same time, nor will the loading frequency necessarily be the same for all data sources. You should also design as much flexibility into the data movement step as possible. Although you should pull data by source system and record type, it may also be useful - for instance when loading historical data - to pull all data required by the ETL server at once. Consider the frequency of this possible occurrence when designing your data organization ETL jobs. Other flexibility elements to consider are job overrides, such as those I mentioned in the "request" section. You should strongly consider allowing the job to "re-pull" data based on override values (such as time period, source system, and record type, which are all useful in cases of accidental source data deletion), job status characteristics ("file transfer failed," for example), or other business rules appropriate for the project. Processing the DataThe last step is optional, depending on the needs of your project's architecture. I prefer to implement this step using a Yes/No flag within the ETL environment, which lets me process the data (run the job) if the Yes flag applies. When the Y flag does apply, it invokes the controlling job stored during the data catalog analysis step, and the ETL server, initiates data processing. This step will or won't make sense depending on your architecture needs and ETL design. It's the last step detailed here because it nicely closes the data staging circle by initiating the integration of data into the overall architecture. Integrating data is, after all, the final goal of any ETL process. Clean Up Your ActIdeally, the business intelligence implementations you've run into have not required any of the complexities associated with an advanced data staging process. But when and if you encounter these complexities, perhaps this article will provide some food for thought. Stacy Bergert (stacy.bergert@eds.com) is a consultant with EDS's e-solutions group in its Southeast business intelligence and CRM practice. She has more than seven years of hands-on technical experience related to business intelligence, ranging from very small data marts to very large enterprise data warehouses. |
Most Popular This Week
IE Weekly Newsletter
Subscribe to the newsletter
|
|
|











