Data Quality in the Age of Big Data
Traditional data quality best practices and tool functions still apply to big data, but success depends on making the right adjustments and optimizations.
Whether data is big or small, old or new, traditional or modern, on premises or in the cloud, the need for data quality doesn’t change. Data professionals under pressure to get business value from big data and other new data assets can leverage existing skills, teams, and tools to ensure quality for big data. Even so, just because you can leverage existing techniques doesn’t mean that’s all you should do. We must adapt existing techniques to the requirements of the current times.
Data professionals must protect the quality of traditional enterprise data as they adjust, optimize, and extend data quality and related data management best practices to fit the business and technical requirements of big data and similar modern data sets. Unless an organization does both, it may fail to deliver the kind of trusted analytics, operational reporting, self-service functionality, business monitoring, and governance that are expected of all data assets.
Adjustments and Optimizations Make Data Quality Tasks Relevant to Big Data
The good news is that organizations can apply current data quality and other data management competencies to big data. The slightly bad news is that organizations need to understand and make certain adjustments and optimizations. Luckily, familiar data quality tasks and tool functions are highly relevant to big data and other valuable new data assets — from Web applications, social media, the digital supply chain, SaaS apps, and the Internet of Things — as seen in the following examples.
Standardization. A wide range of users expect to explore and work with big data, often in a self-service fashion that depends on SQL-based tools. Data quality’s standardization makes big data more conducive to ad hoc browsing, visualizing, and querying.
Deduplication. Big data platforms invariably end up with the same data loaded multiple times. This skews analytics outcomes, makes metric calculations inaccurate, and wreaks havoc with operational processes. Data quality’s multiple approaches to matching and deduplication can remediate data redundancy.
Matching. Links between data sets can be hard to spot, especially when the data comes from a variety of source systems, both traditional and modern. Data quality’s data matching capabilities help validate diverse data and identify dependencies among data sets.
Profiling and monitoring. Many big data sources — such as e-commerce, Web applications, and the Internet of Things (IoT) — lack consistent standards and evolve their schema unpredictably without notification. Whether profiling big data in development or monitoring it in production, a data quality solution can reveal new schema and anomalies as they emerge. Data quality’s business rule engines and new smart algorithms can remediate these automatically at scale.
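As a rough illustration of how profiling can surface schema changes, the sketch below compares the fields and value types seen in two batches of records. The function names, field names, and sample IoT readings are hypothetical and chosen only for illustration; a real data quality tool would profile far more than field names and types.

```python
def profile_schema(records):
    """Return {field: set of type names seen} across a batch of dict records."""
    schema = {}
    for record in records:
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema


def detect_drift(baseline, current):
    """Compare two profiles and report new, missing, and retyped fields."""
    return {
        "new_fields": sorted(set(current) - set(baseline)),
        "missing_fields": sorted(set(baseline) - set(current)),
        "type_changes": {
            field: sorted(current[field] - baseline[field])
            for field in set(baseline) & set(current)
            if current[field] - baseline[field]
        },
    }


# Hypothetical sample data: the second batch adds a field and changes a type.
baseline_batch = [
    {"device_id": "a1", "temp_c": 21.5},
    {"device_id": "b2", "temp_c": 19.0},
]
current_batch = [
    {"device_id": "a1", "temp_c": "21.5", "firmware": "2.0"},
    {"device_id": "b2", "temp_c": 18.2, "firmware": "2.0"},
]

drift = detect_drift(profile_schema(baseline_batch), profile_schema(current_batch))
print(drift)
```

Run against each incoming batch in production, a check like this could flag the new `firmware` field and the string-valued `temp_c` reading before they reach downstream reports.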
Customer data. As if maintaining the quality of traditional enterprise data about customers isn’t challenging enough, many organizations are now capturing customer data from smartphone apps, website visits, third-party data providers, social media, and a growing list of customer channels and touchpoints. For these organizations, customer data is the new big data. All mature data quality tools have functions designed for the customer domain. Most of these tools have been updated recently to support big data platforms and clouds to leverage their speed and scale.
Tool automation. Big data is so big — in size, complexity, origins, and uses — that data professionals and analysts have trouble scaling their work to big data accurately and efficiently. Furthermore, some business users want to explore and profile data, spot quality problems and opportunities, and even remediate data on their own, at scale and in a self-service manner. Both scenarios demand tool automation.
Tools for data quality have long supported business rules to automatically make some development and remediation decisions. Business rules are not going away — multiple types of users still find them useful, and many have a large library of rules they cannot abandon.
Business rules are being joined by new approaches to automation that have recently arrived for a variety of data management tools, including those for data quality. These usually take the form of smart algorithms that apply predictive functions, based on artificial intelligence and machine learning, to automatically determine what the state of data is, which quality function to apply, and how to coordinate these actions with developers and users.
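To make the role of business rules concrete, here is a minimal sketch of a rule library that standardizes customer records and then removes duplicates on a standardized match key. The rules, field names, and sample records are assumptions for illustration, not the behavior of any particular data quality product, and exact-key matching is a much simpler technique than the fuzzy matching commercial tools offer.

```python
import re

# Hypothetical rule library: each rule maps a field name to a normalizing function.
RULES = {
    "email": lambda v: v.strip().lower(),
    "phone": lambda v: re.sub(r"\D", "", v),           # keep digits only
    "name": lambda v: " ".join(v.split()).title(),     # collapse spaces, title-case
}


def standardize(record):
    """Apply every rule whose field is present; leave other fields untouched."""
    return {
        field: RULES[field](value) if field in RULES and isinstance(value, str) else value
        for field, value in record.items()
    }


def deduplicate(records, key_fields=("email",)):
    """Standardize each record, then keep the first record per match key."""
    seen = set()
    unique = []
    for record in map(standardize, records):
        key = tuple(record.get(field) for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Hypothetical sample data: the first two records describe the same customer.
customers = [
    {"name": "ada  lovelace", "email": " Ada@Example.com ", "phone": "(555) 010-0199"},
    {"name": "Ada Lovelace", "email": "ada@example.com", "phone": "555-010-0199"},
    {"name": "grace hopper", "email": "grace@example.com", "phone": "555.010.0142"},
]

clean = deduplicate(customers)
print(clean)
```

Standardizing before matching is what lets the two spellings of the same email address collapse into one record; without that step, the exact-key comparison would treat them as different customers.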
Data Quality Must Adopt the New Paradigms of Modern Data Management
Practices for data quality (and related practices for data integration, metadata management, and customer views) must be altered to follow different paradigms. Note that in the following examples most of the paradigm shifts are necessary to meet new requirements in big data analytics.
Ingest big data sooner, improve it later. One of the strongest trends in data management is to store incoming data far sooner so that big data is accessible as early as possible for time-sensitive processes such as operational reporting and real-time analytics. In these scenarios, persisting data takes priority over improving data’s quality. To accelerate the persistence of data to storage, up-front transformations or aggregations of data are minimal or omitted under the assumption that users and processes can make those improvements later when big data is accessed or repurposed.
Big data quality on the fly. The ramification of these paradigm shifts is that data aggregation and quality improvements are increasingly done on the fly, at read time or analysis time. This pushes data quality execution closer to real time. Furthermore, on-the-fly big data quality functions are sometimes embedded in other solutions, especially those for data integration, reporting, and analytics. To enable embedding and achieve real-time performance, modern tools offer most data quality functions as services. Luckily, today’s fast CPUs, in-memory processing, data pipelining, and MPP data architectures provide the high performance required to execute data quality on the fly at big data scale.
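A small sketch may help show what quality at read time looks like in practice. The generator below cleans each row only as a consumer iterates over it, so the stored data stays exactly as it was ingested. The CSV layout, column names, and cleaning choices are hypothetical; this is an illustration under those assumptions rather than a reference design.

```python
import csv
import io

# Hypothetical raw data, persisted exactly as it arrived from the source.
RAW_CSV = "sensor,reading\nS1, 20.5 \ns2,n/a\nS3,19\n"


def read_with_quality(stream):
    """Yield cleaned rows lazily; the stored data is never modified."""
    for row in csv.DictReader(stream):
        sensor = row["sensor"].strip().upper()
        text = row["reading"].strip()
        try:
            reading = float(text)
        except ValueError:
            reading = None  # keep the row but flag the unparseable value
        yield {"sensor": sensor, "reading": reading}


rows = list(read_with_quality(io.StringIO(RAW_CSV)))
valid = [row["reading"] for row in rows if row["reading"] is not None]
average = sum(valid) / len(valid)
print(rows, average)
```

Because the cleaning runs inside the read path, the same function could be wrapped as a service and called from a reporting or integration tool, which is the embedding pattern described above.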
Preserve big data’s arrival (original) state for future repurposing. A newly established best practice with big data is to preserve all the detailed content, structures, conditions, and even anomalies that it has when it arrives from a source. Storing and protecting big data’s arrival state provides a massive data store — usually a data lake — for use cases that demand detailed source information. Use cases include data exploration, data discovery, and discovery-oriented analytics based on mining, clustering, machine learning, artificial intelligence, and predictive algorithms or models.
Furthermore, the store of detailed source data can be repurposed repeatedly for future analytics applications whose data requirements are impossible to know in advance. Data that is aggregated, standardized, and fully cleansed cannot be repurposed as flexibly or broadly as data in its arrival state.
Data quality in parallel. The best practice today with Hadoop, data lakes, and other big data environments is to maintain a massive store of detailed raw data as a kind of source archive. Instead of transforming the source, users make copies of data subsets needing quality improvements and apply data quality functions to the subsets. Similarly, data scientists and analysts create so-called data labs and sandboxes where they improve data for analytics. This “data quality in parallel” is necessary to retain the original value of big data while creating a different kind of value through mature data quality functions.
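The copy-then-improve pattern can be sketched in a few lines. In this hypothetical example, the raw archive is never modified: a subset is deep-copied into a sandbox, and only the copy is cleansed. The record layout, the country alias table, and the selection predicate are assumptions made for illustration.

```python
import copy

# Hypothetical raw archive, kept in its arrival state (anomalies included).
RAW_ARCHIVE = (
    {"id": 1, "country": "us", "amount": "100"},
    {"id": 2, "country": "DE", "amount": "-5"},
    {"id": 3, "country": "U.S.", "amount": "250"},
)

# Hypothetical standardization table for country codes.
COUNTRY_ALIASES = {"us": "US", "u.s.": "US", "de": "DE"}


def make_sandbox(archive, predicate):
    """Deep-copy the matching subset so edits never touch the archive."""
    return [copy.deepcopy(record) for record in archive if predicate(record)]


def cleanse(records):
    """Standardize country codes and convert amounts to integers, in place."""
    for record in records:
        record["country"] = COUNTRY_ALIASES.get(record["country"].lower(), record["country"])
        record["amount"] = int(record["amount"])
    return records


# Copy only records with a positive amount, then improve the copy.
sandbox = cleanse(make_sandbox(RAW_ARCHIVE, lambda r: int(r["amount"]) > 0))
print(sandbox)
print(RAW_ARCHIVE[0])  # the archive record still holds its original values
```

The negative amount in record 2 stays in the archive untouched, so an analyst who later wants to study such anomalies can still find it there.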
Context-appropriate data quality. Analytics users today alter big data subsets as little as possible, for two reasons: most approaches to modern analytics work well with original detailed source data, and analytics often depends on anomalies for discoveries. For example, nonstandard data can be a sign of fraud, and outliers may be harbingers of a new customer segment. As another example, detailed source data may be required for the accurate quantification of customer profiles, complete views, and performance metrics.
For More Information
For an in-depth discussion of data quality, read the 2018 TDWI Checklist Report: Optimizing Data Quality for Big Data. Many of the key points discussed in this article are drawn from that report.
Source: Philip Russom