Has this happened to you before? You spend months designing, building, testing, and delivering a data warehouse solution. You feel a sense of accomplishment, not just because you delivered on time and on budget, but because you delivered something that was needed and that is now being used and appreciated by the end users.
And then some vendor walks in and laughs at the amount of time you spent developing your solution. They say to management: why don’t you just buy our tool, and you could click-click-click have your solution ready in 15 minutes! Sometimes I want to grab such vendors by the neck and shake them: do you really know what you are talking about?
Before expanding on this topic, let’s be realistic: today’s data warehouses are not keeping up with the big data explosion. It takes too long to deliver reports to the business users who may have lost interest in the time it took from initial enthusiasm until they have results in their hands. With all the steps required to build a data warehouse, including gathering requirements, designing the data model, building ETL to populate the data, creating reports and training users how to access them, it is not surprising that there is little room for agility or quick adaptability to change.
15 minutes? Really?
Let’s return to the question: is it really possible to do something in 15 minutes that would otherwise take many months to build? The answer to that is: it depends. It depends on whether you use a tool or whether you want to build that tool from scratch. So yes, there may be situations where you can replace a multi-month effort with a tool that you can set up in 15 minutes.
What about data warehousing? Can you really build a data warehouse in 15 minutes? As a demo, I’m sure you can. I’m not doubting the vendors’ demos. I’m sure you can configure a tool to access your source system data and click-click-click create a report. But is this really a production-ready data warehouse? No, it is not. There is more to the story than setting up a tool.
In source systems, we have data that is designed to support the OLTP process. This type of data is typically not well-suited for reporting. Data for reporting should be restructured to eliminate complexity and to enable users to correctly interpret the data. For example, if the source system supports historization or soft deletes, do users know how to correctly filter only the valid records? What about various statuses during transaction processing, do users know how to eliminate records that do not represent a finalized state of data? The purpose of a data warehouse is to do exactly that: to transform data into a format that is unambiguous and easily accessible to the business users.
Alternatives to ETL
Given the understanding that a data warehouse provides data to the business users that has been transformed, interpreted and cleaned appropriately then you can’t do all that in 15 minutes by just setting up a tool. But you don’t have to build your own custom ETL completely from scratch either. Nowadays you have many tools and technologies at your disposal that enable you to build a data warehouse more efficiently while still applying careful design principles. Here are some considerations:
- Virtualization. Virtualization tools enable you to transform data by applying business rules without having to move it physically across layers. This would take much less time to implement than building custom ETL although it would still require effort to define and implement.
- One single source of truth. How do we ensure that the data warehouse represents one single source of truth? Actually, it is becoming more acceptable that one data warehouse is not necessarily one single source of truth for an organization and it may be fine that each department or business function has its own version of the truth, allowing data warehouses to be built on a smaller scale, allowing for a quicker turnaround.
- Combine data from more than one source system. More often than not, data cannot be easily combined from one source system to another. Organizations rarely have master data management in the sense that there is only one customer record across the organization. Typically each source system tracks customers in its own way and there is no easy way to deduplicate and combine them. Same goes for products, suppliers, even classifications such as type of product or lines of business. Effort is required to define algorithms to correctly combine such data. This effort is not easily replaced with a tool.
- Data freshness. Unless there is batch mode ETL, data that is accessed straight from the source system is changing in real time. Business users are not equipped to deal with real time data changing. They may be are looking at the current month revenue, which changes each time they refresh the report. Despite the fact that they want data as fresh as possible, what they really want is for the data to be stable while they are analyzing it. This means that another layer of abstraction must be introduced.
In addition, we can start thinking about how to take advantage of modern architectures and methodologies that would enable us to build a data warehouse in much less time than the multi-month effort that is currently still the norm:
- Cloud. Cloud architecture takes away management of infrastructure such as administration, system management, tuning and upgrading. This allows us to focus on getting value from the data warehouse rather than on managing system software and hardware. Many cloud vendors offer elasticity which means that the data warehouse can grow over time without having to purchase extensive hardware up-front.
- Self-service. Power users can do their own data transformations given the proper tools. They can do experiments and prototypes before committing to a full solution. With in-memory data solutions, response times are super-fast as compared to accessing data that is stored on disk and users can quickly create their own reports. This saves time in not having to gather user requirements and build custom reports and tweak them to meet expectations.
- Agile. To succeed today, businesses must adapt quickly. The old kind of data warehousing with just a single server where data is stored and retrieved, and managed by one team is not agile enough. Cross-functional teams that engage in continuous evolution are the preferred way to go forward. Such teams can build operational data warehouses to combine data from multiple sources in real time and use their data in day-to-day operations.
In conclusion, is this the future of data warehousing? Will we really be able to build a data warehouse in 15 minutes? Based on the discussion above, I strongly disagree. It would take more than 15 minutes to get it right. Which does not mean that we should be complacent. With all the tools, technologies and methodologies on the market today, we can certainly decrease the time to delivery from months to weeks.