When building your data and analytics program, you must decide whether you need a data warehouse, a data lake, or both. Understanding the difference is the first step.
Anyone building a data and analytics program has run across the terms data warehouse and data lake. You might even have a data warehouse already, serving up dashboards and scorecards to your users. If you are like me, you might question how your data warehouse is different from a data lake and whether both need to be part of your overall analytics strategy.
The answer is these two constructs have different roles in the organization. A data warehouse and a data lake differ in data structure, user base, and governance model. Once you understand these differences, it becomes much easier to evaluate whether you need just one or both in combination.
The first key difference between the data warehouse and the data lake is the structure of the data. The data warehouse has been around since the 1970s and has been described by Bill Inmon as a subject-oriented, integrated, time-variant, nonvolatile collection of data in support of management’s decisions. What does each of these parts mean?
Subject-oriented: The data is structured around a specific business subject area (e.g., sales, inventory, or employees). Often these warehouses grow subject by subject until they can answer the questions the business asks.
Integrated: As an organization’s data is spread across multiple transactional systems, data is merged and integrated to present a holistic view of the subject area. This entails a data quality process and matching and merging data elements to create cohesion among disparate data.
Time-variant: Historical data is maintained in the data warehouse. Changes to the data are maintained to allow reporting to accurately represent historical point-in-time views. An example is keeping past customer addresses to accurately represent sales by location across time and not update those sales as the customer relocates.
Non-volatile: Once data is loaded into the data warehouse, it doesn’t change. This creates stability for reporting. Versioning and slowly changing dimensions are used to show changes over time.
A data lake is much different. It is a repository of data in its raw or natural format. This can include relational, semistructured, unstructured, and even binary data. The idea with a data lake is to have less data restructuring and data cleansing to expedite loading this data quickly from multiple sources. This doesn’t mean the data lake is pure chaos; it is structured for optimal loading and for eventual structuring. The goal is to implement structure on data during the read process, merging and munging the raw data into something valuable on an as-needed basis.
A data lake is also structured to keep all data — both data that is being used and data that could be used in the future. The extensive nature of the data set in the data lake creates an optimal source from which to build out the data warehouse now and in the future. As new subject areas are identified as desirable in the data warehouse, a data lake already has the data necessary to do the initial build and the ongoing incremental loads.
Because a data warehouse is specifically formulated to answer subject-oriented questions, it has the highest usability level. This means a wide audience of information consumers and decision makers from across the organization can pull information from the data warehouse.
One of the most common models for the data warehouse, the star schema, has been integrated into reporting tools such as Microsoft Power BI, Tableau, and QlikView to allow users to drag and drop elements and easily build powerful reports, dashboards, and scorecards. However, if they need new and novel data, it often takes coordination and collaboration with the IT department to build out these data elements in the data warehouse. Once built, they can be used for a myriad of business purposes.
A data lake includes multiple types of structured, semistructured, and unstructured data, so the audience must be more technical to extract information — its users are data scientists, data engineers, and analytics developers who use a variety of tools to perform data discovery on this data. From programming languages such as R, Python, and SAS to graphical data discovery tools such as Knime and RapidMiner, these tools provide many methods to extract different types of data and perform analysis on it.
Users in this environment are given great flexibility in terms of what analysis can be performed and what answers can be discovered. They often extract small portions of the data into a temporary analysis environment and work with just that subset to find their answers.
More advanced users, such as data scientists, will find value in using both the data warehouse and the data lake. They are experienced in balancing the ease of extraction from the source with the nature of their analysis and will switch often between their tools. Often, experiments with the data in the data lake will lead to requests to build permanent features that create long-term enhancements to the data warehouse.
The data warehouse requires up-front planning and controls to ensure it remains accessible for the widest base of users — its governance model is much stronger than the data lake’s. Changes to the data warehouse require development and testing before they are moved into a production environment. When changes happen, data catalogs and business semantic layers have to be updated to ensure the new data elements can be used. This mature governance model is required when the information assets have the greatest ongoing operational value to the organization.
The data lake has much less governance in terms of what can be done within it. This is true for both the addition of new structures and the types of experimentation that can be done using the data in this environment. It is this freedom that allows your data scientists to discover the insights that have the greatest potential to change the business.
This is where a data lake and a data warehouse can interact beautifully. Data engineers and data scientists can perform analysis and experimentation in the data lake. Once the results are satisfactory, they can be systematically pushed through the development process and rolled into production in the data warehouse for a wider audience to leverage. This allows an organization to balance between the tried and true required to keep the business moving forward and the new and novel with the potential to disrupt business and drive it into new spaces.
A Final Word
The data warehouse and data lake are not the same. Each has value to the organization. They can be deployed independently or in concert to meet your business objectives. The key is to understand the role each plays and make deliberate choices based on your data and analytics strategy.
Fuente: Troy Hiltbrand