0 Share

5.23 K Views

Learning Data Mining And Data Warehousing

Educational & Professional | 8 Chapters

0 Share

5.23 K Views

This book, Learning Data Mining and Data Warehousing covers all aspects of data warehousing and mining in an easy to read and understand manner. It covers a varios topics, such as data warehousing and its benefits; architecture of data warehouse; data mart, data warehousing design strategies, dimensional modeling and features of a good dimensional model; different types of schemas such as star schema, snowflake schema; fact tables and dimension t....

DATA WAREHOUSE ENVIRONMENT AND DESIGN

Objectives

After studying this lesson, you should be able to understand:

Meaning and characteristics of data warehouse
Concept and structure of the data warehouse
Granularity in data warehouse
How data is structured in warehouse
Reporting and the architected environment
How incorrect data is corrected?
Operational data and Process data models
Concept of normalization/denormalization
How to create multiple profile records
Direct and indirect access of data warehouse

1.1 INTRODUCTION DATA WAREHOUSE ENVIRONMENT

A data warehouse is a subject-oriented, integrated, non volatile and time-variant collection of data in support of management’s decisions. The data warehouse contains granular corporate data.

The subject orientation of the data warehouse is shown in Figure 1.1. Classical operations systems are organized around the applications of the company. For an insurance company, the applications may be auto, health, life, and casualty. The major subject areas of the insurance corporation might be customer, policy, premium, and claim. For a manufacturer, the major subject areas might be product, order, vendor, bill of material, and raw goods. For a retailer, the major subject areas may be product, SKU, sale, vendor, and so forth. Each type of company has its own unique set of subjects.

Figure 1.1: An Example of a Subject Orientation of Data

The second salient characteristic of the data warehouse is that it is integrated. Of all the aspects of a data warehouse, integration is the most important. Data is fed from multiple disparate sources into the data warehouse. As the data is fed it is converted, reformatted, resequenced, summarized, and so forth. The result is that data — once it resides in the data warehouse — has a single physical corporate image. Figure 1.2 illustrates the integration that occurs when data passes from the application-oriented operational environment to the data warehouse. Data is entered into the data warehouse in such a way that the many inconsistencies at the application level are undone. For example, in Figure 1.2, as far as encoding of gender is concerned, it matters little whether data in the warehouse is encoded as m/f or 1/0. What does matter is that regardless of method or source application, warehouse encoding is done consistently. If application data is encoded as X/Y, it is converted as it is moved to the warehouse.

Figure 1.2: Issue of Integration

The third important characteristic of a data warehouse is that it is nonvolatile. Figure 1.3 illustrates non volatility of data and shows that operational data is regularly accessed and manipulated one record at a time. Data is updated in the operational environment as a regular matter of course, but data warehouse data exhibits a very different set of characteristics. Data warehouse data is loaded (usually en masse) and accessed, but it is not updated (in the general sense). Instead, when data in the data warehouse is loaded, it is loaded in a snapshot, static format. When subsequent changes occur, a new snapshot record is written. In doing so a history of data is kept in the data warehouse.

Figure 1.3: Issue of Non-volatility

The last salient characteristic of the data warehouse is that it is time variant. Time variance implies that every unit of data in the data warehouse is accurate as of some one moment in time.

1.2 STRUCTURE OF THE DATA WAREHOUSE

Figure 1.4 shows that there are different levels of detail in the data warehouse. There is an older level of detail (usually on alternate, bulk storage), a current level of detail, a level of lightly summarized data (the data mart level), and a level of highly summarized data. Data flows into the data warehouse from the operational environment. Usually significant transformation of data occurs at the passage from the operational level to the data warehouse level. Once the data ages, it passes from current detail to older detail. As the data is summarized, it passes from current detail to lightly summarized data, then from lightly summarized data to highly summarized data.

Figure 1.4: Structure of the Data Warehouse

1.3 GRANULARITY

The single most important aspect of design of a data warehouse is the issue of granularity. Indeed, the issue of granularity permeates the entire architecture that surrounds the data warehouse environment. Granularity refers to the level of detail or summarization of the units of data in the data warehouse. The more detail there is, the lower the level of granularity. The less detail there is, the higher the level of granularity. For example, a simple transaction would be at a low level of granularity. A summary of all transactions for the month would be at a high level of granularity. Granularity of data has always been a major design issue. In early operational systems, granularity was taken for granted. When detailed data is being updated, it is almost a given that data be stored at the lowest level of granularity.

In almost all cases, data comes into the data warehouse at too high a level of granularity. This means that the developer must spend a lot of resources breaking the data apart. Occasionally, though, data enters the warehouse at too low a level of granularity.

1.3.1 Benefits of Granularity

The granular data found in the data warehouse is the key to reusability, because it can be used by many people in different ways.

For example, within a corporation, the same data might be used to satisfy the needs of marketing, sales, and accounting. All three departments look at the basic same data. Marketing may want to see sales on a monthly basis by geographic district, sales may want to see sales by salesperson by sales district on a weekly basis, and finance may want to see recognizable revenue on a quarterly basis by product line. All of these types of information are closely related, yet slightly different. With a data warehouse, the different organizations are able to look at the data as they wish to see it.

l Looking at the data in different ways is only one advantage of having a solid foundation. A related benefit is the ability to reconcile data, if needed. Once there is a single foundation on which everyone relies, if there is a need to explain a discrepancy in analyses between two or more departments, then reconciliation is relatively simple.

Another benefit of granular data is that it contains a history of activities and events across the corporation. And the level of granularity is detailed enough that the data can be reshaped across the corporation for many different needs.
The largest benefit of a data warehouse foundation is that future unknown requirements can be accommodated.

1.3.2 Dual Levels of Granularity

Most of the time, there is a great need for efficiency in storing and accessing data, and for the ability to analyze data in great detail. (In other words, the organization wants to have its cake and eat it, too!) When an organization has lots of data in the warehouse, it makes eminent sense to consider two (or more) levels of granularity in the detailed portion of the data warehouse. In fact, there is such a need for more than one level of granularity that a dual level of granularity design should be the default for almost every shop.

For example, a phone company—fits the needs of most shops. There is a tremendous amount of detail at the operational level. Most of this detail is needed for the billing systems. Up to 30 days of detail is stored in the operational level. The data warehouse in this example contains two types of data-lightly summarized data and “true archival” detail data. The data in the data warehouse can go back 10 years. The data that emanates from the data warehouse is “district” data that flows to the different districts of the telephone company. Each district then analyzes its data independently from other districts. Much heuristic analytical processing occurs at the individual level.

Figure 1.5: The Trade-off with Granularity

1.4 STRUCTURING DATA IN THE DATA WAREHOUSE

So far, we haven’t gone into what the data structures found in the data warehouse really look like. Many kinds of structures are found in the data warehouse. We will look at some of the more common ones now. Perhaps the simplest and most common data structure found in the data warehouse is the simple cumulative structure, shown in Figure 1.6 shows the daily transactions being transported from the operational environment. After that, they are summarized into data warehouse records, which may be by customer, by account, or by any subject area in the data warehouse. The transactions in Figure 1.6 are summarized by day. In other words, all daily activity for a customer for an account are totaled and passed into the data warehouse on a day-by-day basis.

Figure 1.6: The Simplest Form of Data in the Data Warehouse

A rolling summary data structure handles many fewer units of data than does a simple cumulative structuring of data. A comparison of the advantages and the disadvantages of rolling summary versus simple cumulative structuring of data is shown in Figure 1.7

Figure 1.7: Comparing Simple Cumulative Data with Rolling Summary Data

Another possibility for the structuring of data warehouse data is the simple direct file, shown in Figure 1.8. Figure 1.8 shows that data is merely pulled from the operational environment to the data warehouse environment; there is no accumulation.

In addition, the simple direct file is not done on a daily basis. Instead, it is done over a longer period of time, such as a week or a month. As such, the simple direct file represents a snapshot of operational data taken as of one instant in time.

Figure 1.8: Creating a Continuous File from Direct Files

1.5 PURGING WAREHOUSE DATA

Data does not just eternally pour into a data warehouse. It has its own life cycle within the warehouse as well. At some point in time, data is purged from the warehouse. The issue of purging data is one of the fundamental design issues that must not escape the data warehouse designer. In some senses, data is not purged from the warehouse at all. It is simply rolled up to higher levels of summary. There are several ways in which data is purged or the detail of data is transformed, including the following:

1. Data is added to a rolling summary file where detail is lost.

2. Data is transferred to a bulk storage medium from a high-performance medium such as DASD.

3. Data is actually purged from the system.

4. Data is transferred from one level of the architecture to another, such as from the operational level to the data warehouse level.

There are, then, a variety of ways in which data is purged or otherwise transformed inside the data warehouse environment. The life cycle of data — including its purge or final archival dissemination — should be an active part of the design process for the data warehouse.

1.6 REPORTING AND THE ARCHITECTED ENVIRONMENT

It is a temptation to say that once the data warehouse has been constructed all reporting and informational processing will be done from there. That is simply not the case. There is a legitimate class of report processing that rightfully belongs in the domain of operational systems. Figure 1.9 shows where the different styles of processing should be located.

Figure 1.9: Differences between the Two Types of Reporting

Figure 1.9 shows that operational reporting is for the clerical level and focuses primarily on the line item. Data warehouse or informational processing focuses on management and contains summary or otherwise calculated information. In the data warehouse style of reporting, little use is made of line-item, detailed information, once the basic calculation of data is made.

Example: Consider a bank. Every day before going home a teller must balance the cash in his or her window. This means that the teller takes the starting amount of cash, tallies all the day’s transactions, and determines what the day’s ending cash balance should be. In order to do this, the teller needs a report of all the day’s transactions. This is a form of operational reporting. Now consider the bank vice president who is trying to determine how many new ATMs to place in a newly developed shopping center. The banking vice president looks at a whole host of information, some of which comes from within the bank and some of which comes from outside the bank. The bank vice president is making a long-term, strategic decision and uses classical DSS information for his or her decision.

There is then a real difference between operational reporting and DSS reporting. Operational reporting should always be done within the confines of the operational environment.

1.7 INCORRECT DATA IN THE DATA WAREHOUSE

The architect needs to know what to do about incorrect data in the data warehouse. The first assumption is that incorrect data arrives in the data warehouse on an exception basis. If data is being incorrectly entered in the data warehouse on a wholesale basis, then it is incumbent on the architect to find the offending ETL program and make adjustments. Occasionally, even with the best of ETL processing, a few pieces of incorrect data enter the data warehouse environment. How should the architect handle incorrect data in the data warehouse? There are at least three options. Each approach has its own strengths and weaknesses, and none are absolutely right or wrong. Instead, under some circumstances one choice is better than another.

Example: Suppose that on July 1 an entry for $5,000 is made into an operational system for account ABC. On July 2 a snapshot for $5,000 is created in the data warehouse for account ABC. Then on August 15 an error is discovered. Instead of an entry for $5,000, the entry should have been for $750. How can the data in the data warehouse be corrected?

Choice 1: Go back into the data warehouse for July 2 and find the offending entry. Then, using update capabilities, replace the value $5,000 with the value $750. This is a clean and neat solution when it works, but it introduces new issues:

The integrity of the data has been destroyed. Any report running between July 2 and Aug. 16 will not be able to be reconciled.
The update must be done in the data warehouse environment.
In many cases there is not a single entry that must be corrected, but many, many entries that must be corrected.

Choice 2: Enter offsetting entries. Two entries are made on August 16, one for $5,000 and another for _$750. This is the best reflection of the most up-to-date information in the data warehouse between July 2 and August 16. There are some drawbacks to this approach:

Many entries may have to be corrected, not just one. Making a simple adjustment may not be an easy thing to do at all.
Sometimes the formula for correction is so complex that making an adjustment cannot be done.

Choice 3: Reset the account to the proper value on August 16. An entry on August 16 reflects the balance of the account at that moment regardless of any past activity.
An entry would be made for $750 on August 16. But this approach has its own drawbacks:

The ability to simply reset an account as of one moment in time requires application and procedural conventions.
Such a resetting of values does not accurately account for the error that has been made.

Choice 3 is what likely happens when you cannot balance your checking account at the end of the month. Instead of trying to find out what the bank has done, you simply take the bank’s word for it and reset the account balance.

There are then at least three ways to handle incorrect data as it enters the data warehouse. Depending on the circumstances, one of the approaches will yield better results than another approach.

1.8 INTRODUCTION of DATA WAREHOUSE DESIGN

There are two major components to building a data warehouse: the design of the interface from operational systems and the design of the data warehouse itself. Yet, “design” is not entirely accurate because it suggests planning elements out in advance. The requirements for the data warehouse cannot be known until it is partially populated and in use and design approaches that have worked in the past will not necessarily suffice in subsequent data warehouses. Data warehouses are constructed in a heuristic manner, where one phase of development depends entirely on the results attained in the previous phase. First, one portion of data is populated. It is then used and scrutinized by the DSS analyst. Next, based on feedback from the end user, the data is modified and/or other data is added. Then another portion of the data warehouse is built, and so forth. This feedback loop continues throughout the entire life of the data warehouse.

1.9 BEGINNING WITH OPERATIONAL DATA

At the outset, operational transaction-oriented data is locked up in existing legacy systems. Though tempting to think that creating the data warehouse involves only extracting operational data and entering it into the warehouse, nothing could be further from the truth. Merely pulling data out of the legacy environment and placing it in the data warehouse achieves very little of the potential of data warehousing.