A case for managing data uniquely for each form of advanced analytics

From mining to machine learning, each form of advanced analytics has its own requirements for how data must be managed. Failing to satisfy these requirements leads to hamstrung analytics.

We say “analytics” as if it’s a single entity. In reality, it is a collection of technologies and user best practices, including data mining, text mining, clustering, statistics, graph analytics, artificial intelligence, machine learning, self-service, visualization, and so on. The list gets longer if we include techniques that are not very analytical or advanced, such as reporting, dashboarding, and online analytical processing (OLAP). Each of these analytics forms has its own methods, use cases, analytics tools, and, especially, data requirements.

For example, the dimensional modeling of carefully cleansed data required for OLAP differs sharply from the massive volumes of randomly structured raw source data typically required by data mining. This is why traditional data warehouses are ruthlessly structured, whereas the data lake’s primary mandate is to be a repository of unaltered source data, suited to data exploration and discovery-driven analytics.
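
To make the contrast concrete, here is a minimal Python sketch, not a production design. The sample records, field names, and helper functions (cleanse, build_star_schema, rollup_by_region) are all hypothetical. It keeps raw source records untouched, as a lake would, and separately derives a small cleansed dimensional model of the kind OLAP expects.

```python
from collections import defaultdict

# Hypothetical raw source records, stored unaltered as a data lake would keep them.
RAW_EVENTS = [
    {"product": "Widget", "region": "east", "amount": "10.50"},
    {"product": "widget ", "region": "East", "amount": "4.50"},
    {"product": "Gadget", "region": "WEST", "amount": "20"},
    {"product": "Gadget", "amount": "oops"},  # malformed: stays in the lake, excluded from the model
]


def cleanse(record):
    """Return a standardized copy of the record, or None if it cannot be repaired."""
    try:
        return {
            "product": record["product"].strip().title(),
            "region": record["region"].strip().title(),
            "amount": float(record["amount"]),
        }
    except (KeyError, ValueError):
        return None


def build_star_schema(raw_events):
    """Derive two small dimension tables and one fact table from cleansed records."""
    product_dim, region_dim, facts = {}, {}, []
    for record in raw_events:
        clean = cleanse(record)
        if clean is None:
            continue
        # Assign a surrogate key the first time each dimension value appears.
        product_key = product_dim.setdefault(clean["product"], len(product_dim) + 1)
        region_key = region_dim.setdefault(clean["region"], len(region_dim) + 1)
        facts.append(
            {"product_key": product_key, "region_key": region_key, "amount": clean["amount"]}
        )
    return product_dim, region_dim, facts


def rollup_by_region(region_dim, facts):
    """OLAP-style roll-up: total amount per region name."""
    names = {key: name for name, key in region_dim.items()}
    totals = defaultdict(float)
    for fact in facts:
        totals[names[fact["region_key"]]] += fact["amount"]
    return dict(totals)


product_dim, region_dim, facts = build_star_schema(RAW_EVENTS)
region_totals = rollup_by_region(region_dim, facts)
```

The raw list is never modified, so discovery-oriented work can still see the malformed record, while the roll-up only ever sees cleansed, keyed facts.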

As another example, most tools for natural language processing (NLP) work best with human language text stored in files. That is very different from the relational tables assumed by most tools for reporting and self-service data access, prep, and visualization. Finally, machine learning (in support of predictive analytics) is an extreme case: learning data, training data, and production data are integrated and managed quite differently.

When you put together all the data requirements (for integration, storage, and related tasks) for all the forms of advanced analytics, the list becomes daunting. Data management professionals are under pressure to develop solutions for many more use cases and tool types than ever before. Additional pressure comes from an unrelated trend: the appearance of many new data types (from IoT and SaaS apps) and data platforms (on clouds and open source).

Providing solutions for all these different use cases is time-consuming and expensive to staff, deploy, and administer. Yet it must be done to get full business value and organizational advantage from new analytics and data assets.

Trends in Data Management for Advanced Analytics

Why has tailoring data management to the needs of analytics programs become more urgent?

Modern business demands modern analytics. Many organizations will commit to multiple forms of analytics because each reveals different but valuable insights, opportunities, and solutions. Hence, many enterprises are diversifying their analytics portfolios so they can make better fact-based decisions, plan for an uncertain future, compete on analytics, and grow customer accounts. These high-value business goals demand advanced forms of analytics, which in turn demand use-case-appropriate data management. Without the right data in the right format on the right platform, critical and expensive efforts in advanced analytics have limited business value or return on investment (ROI).

The increasing adoption of advanced analytics tools and practices is forcing changes in data management. Satisfying the diverse data management requirements for such advanced analytics tools is the leading driver behind data warehouse modernization, the adoption of self-service analytics (data prep, visualization), the deployment of new data platforms (Hadoop, NoSQL, clouds, lakes), and multiplatform hybrid data architectures.

Data management must modernize to better support advanced analytics. From a data tooling viewpoint, managing data for advanced analytics involves every form of data integration (ETL/ELT, virtualization, quality processes, etc.), data semantics (metadata, catalogs), database management system (relational, columnar, NoSQL), and new data platform (based on clouds, open source, Hadoop). All these data management tools and platforms must be modernized to address the data requirements of advanced analytics.


Let me conclude by summarizing the assumptions stated here:

  • Each form of advanced analytics has distinct data requirements. For example, self-service analytics works best with subsets of lightly standardized data and business metadata, whereas mining and statistics tend to excel with massive volumes of raw source data and little or no metadata.
  • How well you satisfy the data requirements (for both integration and storage) of a specific form of advanced analytics influences its level of success or failure.
  • Hence, you cannot perform data management for advanced analytics in a single way and expect all implementations of advanced analytics to yield useful and accurate outcomes. Instead, data management solutions must be designed for specific forms of advanced analytics, sometimes down to specific analytics applications.
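
One way to picture the assumptions above is a per-form lookup of data requirements. The Python sketch below is only illustrative: the DataProfile class, the PROFILES registry, and the profile_for function are hypothetical names, and the two entries restate the self-service and mining examples from the first bullet. A real catalog would cover more forms and far more detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataProfile:
    """Hypothetical summary of how data should be managed for one analytics form."""
    scope: str
    preparation: str
    metadata: str


# Entries restate the examples given in the text; the keys are illustrative labels.
PROFILES = {
    "self_service": DataProfile(
        scope="subsets of data",
        preparation="lightly standardized",
        metadata="business metadata",
    ),
    "mining_and_statistics": DataProfile(
        scope="massive volumes of source data",
        preparation="raw",
        metadata="little or none",
    ),
}


def profile_for(analytics_form):
    """Look up the profile for a form; refuse to fall back to a one-size-fits-all default."""
    try:
        return PROFILES[analytics_form]
    except KeyError:
        raise ValueError(
            f"no data management profile defined for {analytics_form!r}"
        ) from None
```

Raising an error for an unknown form, rather than returning a generic default, mirrors the third assumption: each form needs its own deliberately designed data management.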

Source: Philip Russom