I had the pleasure of interviewing Ralph Kimball on one of his many visits to Slovenia, where he was always happy to come both to deliver trainings and to spend time relaxing and enjoying the beautiful Bled region. The interview was first published in Info SRC.SI 2005/Issue 42, p. 5-7.
1. This is your sixth visit to Slovenia. What brings you back?
Slovenia is a beautiful country and pleasant to visit. Also, the IT professionals in Slovenia are interested in data warehousing, which I teach each time I visit Slovenia!
2. You have been involved with data warehousing since its early beginnings. How do you see the evolving data warehouse from its early implementations to its recent widespread adoption? Do you believe data warehousing is fulfilling its promise to make information easier to access? What is still lacking?
In some ways, the goals of data warehousing have changed very little from the earliest days around 1980. The goal of the data warehouse has always been to be a “decision support system (DSS)”. Even though DSS is still a good description, using 2005 vocabulary, we would describe the data warehouse as a “platform for business intelligence”. In other words, our job as data warehouse professionals is to publish the data assets of an organization so as to most effectively serve as the foundation for business intelligence (BI) tools. That means we must prepare clean, reliable, simply formatted, understandable, high performance databases so that the BI tools can do their jobs most easily.
In the last few years, the data warehouse has become noticeably more operational. We are loading more sales data, manufacturing data, inventory data, and financial transaction data than we did in the early years. This data has much larger volumes, and in many cases the operational end users need to see the data more quickly than before. The reduced latency of the data is what we call “real time”. Probably the best definition of real time is “anything that is too fast for your current ETL (extract transform load) system!
Finally, especially in the United States there is an increased need for compliance and transparency in financial reporting. The data warehouse often must “maintain custody” of data from its original creation all the way through to final paper reports. Ironically, this extra burden on the data warehouse has frequently made funding for data warehouse projects easier, because the senior executives of most companies are very aware of the demands for compliance and transparency.
Data warehousing will not fulfill its overall promise to make information easier to access until data warehouse designers fully understand how to design data warehouses for ease of access. That is the foundation for the approach, called dimensional modeling, that I teach. The ETL class here in Slovenia will teach the technical details of how to extract, clean, conform, and deliver data so as to be easy to use.
3. How do you comment the Kimball vs. Inmon approach? Which works?
Bill Inmon created the name “data warehouse”. His architectural approach is called the Corporate Information Factory (CIF). At the highest level, the CIF approach and my approach (the Data Warehouse Bus Architecture) are quite similar. We both extract, transform, and load data into final databases accessed by end users. However, beneath this high level, Bill does not provide technical detail for the CIF approach, and that is what distinguishes my work from his. I provide specific, detailed guidance to data warehouse designers how to approach the central problems of data warehousing including data cleaning, integrating diverse data sources, and handling the natural time variance of data warehouse entities like customers and products.
4. Dimensional modeling is primarily suited for relational database technologies and SQL data querying. How does dimensional modeling fit into OLAP technologies which have been aggressively pushed by Microsoft in the recent years and more recently also by Oracle?
OLAP technologies are closely related to the dimensional approach, and in fact, all OLAP cubes are explicitly dimensional. Both Microsoft (with SQL Server 2005 and Analysis Services) and Oracle (with Oracle Warehouse Builder) have incorporated many of my specific techniques into their ETL tools that build their respective OLAP cubes. For instance, both companies have explicit wizard support for “slowly changing dimensions (SCDs)” built directly into their products when creating OLAP cubes. SCDs are a technique I invented to handle time variance in data warehouse data. I am delighted that both Microsoft and Oracle have incorporated both my techniques and my vocabulary in their main-line products. I will be showing screen shots of both of these products, as well as Informatica and Ascential tools in the ETL class in Slovenia.
5. Your seminar ETL Architecture in Depth discusses the 38 subsystems of ETL. What exactly are these subsystems are all of them necessary for a successful ETL implementation?
You will have to come to the class for a full explanation of the 38 subsystems! They are divided into four familiar groups: Extract (3 subsystems), Transform (5 subsystems), Load (15 subsystems), and Manage (15 subsystems). For example, the three subsystems in the Extract group are data profiling, change data capture, and physical extraction.
Although 38 subsystems sound like a lot, every student has the same reaction when they study the list: each subsystem is already familiar and each subsystem is necessary. For example, one of the Load subsystems is responsible for generating the data warehouse keys for each table. Would you leave that out?
6. How do you comment the current state of ETL technologies? Are ETL tools that are on the market sufficient for complex data warehouse loading or do you believe that custom coded ETL is still the way to go? How about ELT technologies, in your opinion, will they supersede the ETL approach?
In the class I teach a balanced approach, and I mention the advantages of hand coding as well as using vendor tools for ETL development. Today, probably 70% of all ETL is still done with hand coding and simple scripts. But, to be honest, after taking the class, you will probably recognize that the long term solution lies with integrated tools, because there are many aspects of data warehouse development that are usually omitted when coding by hand, such as creating a metadata repository, handling automatic load balancing, providing lineage and impact tracking, generating automatic documentation to support compliance, and scheduling the ETL workflow. The market for the integrated tools is changing, and both Microsoft and Oracle are attempting to set new price points for ETL development tools that are lower than the classic and more mature tools like Informatica and Ascential.
ELT refers to a variation of the normal ETL approach, in which the possibly dirty and poorly administered data is loaded directly into relational tables, and then all the subsequent manipulation is controlled by complex SQL statements, often using modern extensions of the SQL language available on some vendors’ databases. Since the Load takes place before the Transformation, this approach is called ELT. ELT is not a magic bullet that avoids any of the hard work of cleaning and transforming the data, rather it implements the hard work in SQL rather than in processing modules more typical of ETL systems. The advocates of ELT argue that with the advances of relational database engines that allow complex processing, they offer a better price point and more flexibility than the ETL approach. However, ELT has a couple of potential weaknesses. First, if the efficiency of the SQL transforms depends on a single vendor’s proprietary extensions to SQL, then you have just transferred your dependency from an ETL vendor to a relational database and an ELT software vendor. And second, there are many basic ETL processes like sorting and filtering that will never work as well in a relational database as they do using dedicated flat file processes like SyncSort and “grep”.
7. In your practice have you seen an improvement in data quality over the past years, are companies really becoming aware of the importance of data quality? Or is this area still not addressed adequately?
I don’t have statistics to prove it, but my impression is that people are paying more attention to data quality because everyone is now demanding to manage their businesses by looking at data. So “data awareness” has increased. But although some individual data quality has improved, there has been an explosion in the number of separate sources of data people are trying to use together. So, in that sense, there has been a big increase in the number of places that data quality needs to be addressed.
In the ETL class I introduce, for the first time anywhere, a specific architecture for controlling data quality all across the data warehouse. In the ETL class we will build two new data quality subsystems: the Error Event Fact Table, and the Audit Dimension. When a data warehouse has implemented these two back-room ETL subsystems, managers will be able to observe and control data quality. At my last ETL class, one student remarked to me that for the first time he could apply so-called “six sigma” quality methodologies to data warehouse data, using these subsystems.
8. One of the topics of your seminar is real time data warehousing, what are your views on this topic? Do you believe that real time data warehousing is the next great wave in the evolving data warehouse or do you believe that data as it is being analyzed should be static, that there is no pressing need for real time data?
As I stated earlier in this interview, “real time” is anything that is too fast for your ETL. Real time is not a specific up-to-the-second delivery of data. Few organizations need such up-to-the-second visibility of data. But many organizations want to break away from a conventional nightly batch extract of data, and move to multiple downloads of data each day. The key impact of moving to such a real time implementation is that you must change your ETL architecture in some significant way. It doesn’t help to buy a faster computer if you are still constrained to a nightly batch extract! In this case, you might replace the batch extract with an interface to message-gram traffic hosted by an EAI communications broker, like Tibco or Vitria. This fundamental architectural change is typical of altering your data warehouse to be “real time”. In the ETL class we will spend an entire class module on exactly this set of architectural changes, including new developments in EAI, EII, and CTF architectures.
9. How do you see service oriented architecture in relationship with the data warehouse? How will data warehousing or more specifically ETL fit into SOA?
Service oriented architectures are becoming an enormously important way to facilitate data exchange between geographically and organizationally distributed entities. For example, one of my consulting clients is a large systems vendor who maintains purchase order relationships with more than 100,000 suppliers. My client has a data warehouse surrounding this purchase order information. Since each of the 100,000 suppliers has their own computer system, an SOA architecture is an ideal way for my client to interface in a simple, uniform way to all these diverse systems, in order to send and receive standard information about purchase orders and inventory levels.
Service oriented architectures are built around XML documents and higher level protocols such as Web Services Description Language (WSDL). In the ETL class, one of the examples in the student worksheet is an SOA interface feeding data into the data warehouse.
Although SOA is currently only one of several approaches to accessing remote data, it may well become the dominant approach in the next few years, because it is specifically designed for transmitting data over the internet, and it “abstracts (hides) the complexity of the actual remote systems”.
10. What are your personal plans for the future: more books, more lectures, another visit to Slovenia?
Since I just finished my 5th book, the ETL book, I am in a quiet period where I cannot bear to think about another book! I have been enjoying the classes and the students in many interesting places. This September trip includes Iceland and Sweden as well as Slovenia! Certainly if the interest in data warehousing remains strong, I would be delighted to plan another trip to Slovenia.