Machine learning can significantly automate generating insights from big data. Here’s how to get started.
Implementing machine learning (ML) is often misunderstood, yet knowledge of the tools and processes that generate data-derived insights is vital. As the volume of big data grows, generating insights with traditional analytics becomes more difficult. ML's ability to automate much of this process complements the growth of big data, especially when the ML infrastructure is well understood.
That means addressing the four key steps to preparing for ML:
Sourcing the data
Establishing a trusted zone or "single source of truth" (SSOT)
Establishing modeling environments
Provisioning model outputs or insights to downstream applications
Step 1: Source the Data
Data sourcing includes surveying accessible data types for inputs to the algorithm, as well as the processes and technologies needed to tap into these sources. Examples of data sources include core transactions, customer-provided information, external databases, market research data, social media, and website traffic.
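Part of data sourcing is mapping each source's records into a common schema before they reach the trusted zone. The following is a minimal sketch of that idea, assuming two hypothetical sources (core transactions and customer-provided surveys); all field names are illustrative, not drawn from a real system.

```python
# Hypothetical sketch: normalizing records from two data sources
# (core transactions and customer-provided surveys) into one common
# schema before loading them into the trusted zone. Field names are
# illustrative assumptions, not a real system's.

def normalize_transaction(rec):
    """Map a core-transaction record to the common schema."""
    return {
        "customer_id": rec["cust_no"],
        "source": "core_transactions",
        "amount": float(rec["amt"]),
    }

def normalize_survey(rec):
    """Map a customer-provided survey record to the common schema."""
    return {
        "customer_id": rec["customer"],
        "source": "customer_survey",
        "amount": None,  # surveys carry no monetary amount
    }

transactions = [{"cust_no": "C001", "amt": "19.99"}]
surveys = [{"customer": "C001", "satisfaction": 4}]

unified = ([normalize_transaction(r) for r in transactions]
           + [normalize_survey(r) for r in surveys])
```

Each new source then only needs its own normalizer; downstream aggregation and reconciliation work against one schema.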
Step 2: Establish a Trusted Zone
Once data is sourced, it must be curated through an SSOT, which consolidates the data into one consistent, structured location. Data validity and quality must be demonstrable at every stage of handling. Before data can be consumed for ML, it must be aggregated, reconciled, and validated. Key attributes of a trusted zone include:
A central repository of data, aggregated from multiple channels.
Clearly defined and documented data elements and data lineage.
Documentation of assumptions. For example, if hospital data from a previous management system conflicts with elements of the current system, perhaps the most recent data entry prevails. This assumption must be documented.
Protocol for addressing unintended exceptions. Consider the previous example and assume that a patient had conflicting same-date entries in both systems. The stack should capture such exceptions in a business intelligence report so the correct values can be manually entered into the trusted zone.
Daily reporting that matches and reconciles counts across systems.
Architecture that scales vertically and horizontally.
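Two of the attributes above, count reconciliation and exception capture, can be sketched in a few lines. This is an illustrative toy, using the hospital example; the record layout and field names are hypothetical assumptions.

```python
# Illustrative sketch of two trusted-zone checks: (1) reconcile record
# counts across systems for the daily report, and (2) flag conflicting
# same-date entries as exceptions for manual review. All records and
# field names are hypothetical.

legacy = [
    {"patient_id": "P1", "date": "2020-03-01", "blood_type": "A+"},
    {"patient_id": "P2", "date": "2020-03-01", "blood_type": "O-"},
]
current = [
    {"patient_id": "P1", "date": "2020-03-01", "blood_type": "A-"},  # conflict
    {"patient_id": "P2", "date": "2020-03-01", "blood_type": "O-"},
]

def reconcile_counts(a, b):
    """Daily report line: do record counts match across systems?"""
    return {"legacy": len(a), "current": len(b), "match": len(a) == len(b)}

def find_conflicts(a, b):
    """Same patient, same date, different values -> exception report."""
    index = {(r["patient_id"], r["date"]): r for r in a}
    exceptions = []
    for r in b:
        key = (r["patient_id"], r["date"])
        if key in index and index[key]["blood_type"] != r["blood_type"]:
            exceptions.append(key)
    return exceptions
```

In production these checks would run against the actual data stores on a schedule, with the exception list feeding the business intelligence report described above.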
The data store that houses the trusted zone should have high availability and be resilient to failure. Increasingly, data warehouses are hosted on cloud platforms, whose benefits include high availability, cost-effectiveness, and horizontal and vertical scaling. Another trend is the growing adoption of NoSQL databases (such as MongoDB), which provide greater flexibility and better performance for storing unstructured data than traditional relational databases.
As with all things digital, regulation and security of data are critical. Data is more personal today, and privacy and security regulations are more complex. The data governance team should be part of any ML implementation, and data lineage that tracks sourcing is necessary to ensure compliance.
Data collected and held must be protected. Security and risk management teams must be involved to initiate and monitor best practices and to develop security-breach response plans. For smaller institutions, investment in outsourced assistance is worthwhile. If cloud vendors are used, contracts must clearly assign responsibility for data security. Transmission of data from on-premises systems to the cloud and back must be in scope and should be carefully designed to address security risk. Encrypting data before transmittal to the cloud is valuable, even when transmission occurs over a secured virtual private network.
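The encrypt-before-transmitting pattern can be sketched as follows, using the third-party `cryptography` package's Fernet recipe (authenticated symmetric encryption). This is a minimal sketch: key management (a KMS, rotation) and the actual upload are out of scope, and the record content is hypothetical.

```python
# Minimal sketch: encrypt a record before sending it to a cloud store,
# using the third-party `cryptography` package (Fernet, authenticated
# symmetric encryption). Key management is out of scope here.

from cryptography.fernet import Fernet

key = Fernet.generate_key()   # in practice: fetch from a KMS, never hard-code
cipher = Fernet(key)

record = b'{"patient_id": "P1", "blood_type": "A+"}'
ciphertext = cipher.encrypt(record)

# ...transmit `ciphertext` over the VPN to the cloud store...

restored = cipher.decrypt(ciphertext)  # round-trip check on receipt
```

Even over a VPN, this ensures the data is unreadable if intercepted or if the cloud store itself is breached.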
Step 3: Create the ML Modeling Environment
Curated data from the SSOT can then be fed into a modeling environment created to implement ML algorithms. The modeling environment facilitates building models that generate meaningful insights in a way that satisfies model validation and audit requirements. There are three components: modeling infrastructure, development tools, and DevOps. Options for ML modeling environments include:
Ready-to-use services: These are pretrained general purpose models packaged as ready-to-use services such as text to speech, speech to text, OCR, etc. Examples are Amazon’s Polly and IBM’s Watson.
Automated ML: These are applications with a graphical user interface (GUI) and canned steps or workflows to perform ML. They allow subject matter experts and business users to apply prebuilt ML pipelines with very little programming knowledge. They do a decent job for many, but not all, use cases. One example is DataRobot.
ML Workbench: These are prebuilt ML modeling environments with configurable programming tools and DevOps built in. A programmer just needs to configure the tools and start building the models. An example is Amazon’s SageMaker.
Custom-/in-house-built ML modeling environments: All components of a modeling environment, programming tools, and DevOps tools are gathered, created, configured, and maintained by the institution.
A current trend is the movement of modeling platforms to the cloud from in-house implementations of Apache Hadoop. Hadoop-based stacks can have high up-front costs and can be complicated to maintain. Moving to the cloud offers several benefits, including flexibility and minimal up-front capital investment. As storage and computation needs change, cloud capacity adapts seamlessly; think of it as "pay as you go." Most major cloud providers also offer ML ready-to-use services and ML workbenches that can be used with minimal setup.
ML modeling environments should be set up to facilitate model validation and account for associated challenges. Models must be validated for bias, must be explainable, and must document parameter and method selection. Documentation must be detailed so that a third party could recreate the model without being provided source code. It is therefore important to standardize model development and validation processes.
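One way to standardize that documentation is a "model card" recorded alongside every trained model, capturing the parameter and method selection a validator would need to recreate it. The sketch below is illustrative; the field names and the hypothetical `readmission_risk_v1` model are assumptions, and a real validation regime would dictate its own schema.

```python
# Hedged sketch: a machine-readable "model card" saved with every
# trained model so a third party could recreate it without source
# code. Field names and values are illustrative assumptions.

import json

model_card = {
    "model_name": "readmission_risk_v1",        # hypothetical model
    "algorithm": "gradient boosted trees",
    "rationale": "nonlinear interactions among vitals and history",
    "features": ["age", "prior_admissions", "length_of_stay"],
    "feature_selection": "recursive elimination, 5-fold CV",
    "hyperparameters": {"n_estimators": 200, "max_depth": 3},
    "training_data": "trusted_zone snapshot 2020-04-01",
    "known_limitations": "not validated for pediatric patients",
}

# The card travels with the model artifact through validation and audit.
with open("model_card.json", "w") as f:
    json.dump(model_card, f, indent=2)
```

Because the card is structured data rather than free text, validation tooling can check required fields automatically.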
Assessing model risk is typically required before production. Regulatory guidelines require decision makers to understand the intent for building these models, assumptions made, and limitations. Using a model outside the scope of its initial intent should be avoided. Although ML is great at modeling complicated nonlinear scenarios, it is less transparent than traditional models, making ML model validation challenging. For example, with today’s hospitals overrun by coronavirus, ML-based models can help with triaging equipment based on clinical data. However, they cannot be practically used without documenting that such a model is not unreasonably biased against a certain population group.
The selected model must have conceptual reasoning behind its development and construction. It is important to document why the model was selected, the math behind it, and the feature-selection process. Sourcing of features and data integrity are also essential and are more easily accomplished with an SSOT. Special care should be taken with AutoML because its prebuilt models must still be assessed for conceptual soundness. Model validation should be closely assessed when selecting any AutoML product.
Step 4: Provision Insights from ML
Delivery of insights is categorized as real time or batch. Real-time insights must be processed, generated, and delivered within short time frames or in near real time, such as detecting fraudulent transactions. Batch insights are processed and generated in groups; examples include models that predict customer behavior.
Considerations for designing and hosting the compute tier for real-time models include request frequency and load. If these are unpredictable or highly variable, hosting the compute tier in the cloud is advisable, as is creating a dedicated web-service API layer for it. Real-time models should be registered with the API layer, which should let applications discover how to structure API requests and what output structure to expect.
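The registration idea can be sketched without a web framework: each real-time model registers its expected request and response structure with the API layer, so client applications can discover how to call it. Everything here, the registry, the `fraud_check` model, and its fields, is an illustrative assumption, not a real service.

```python
# Hedged sketch of API-layer model registration: each real-time model
# declares its input and output structure, and the scoring path
# validates requests against it. All names are hypothetical.

REGISTRY = {}

def register(name, input_fields, output_fields, fn):
    """Register a model and its request/response contract."""
    REGISTRY[name] = {"inputs": input_fields, "outputs": output_fields, "fn": fn}

def describe(name):
    """What a metadata endpoint might return to a client application."""
    m = REGISTRY[name]
    return {"model": name, "inputs": m["inputs"], "outputs": m["outputs"]}

def score(name, payload):
    """What a scoring endpoint might do: validate the request, then run."""
    m = REGISTRY[name]
    missing = [f for f in m["inputs"] if f not in payload]
    if missing:
        return {"error": f"missing fields: {missing}"}
    return m["fn"](payload)

# Toy fraud model: flag any transaction over a fixed amount.
register("fraud_check", ["amount", "merchant"], ["fraudulent"],
         lambda p: {"fraudulent": p["amount"] > 1000})
```

In a real deployment, `describe` and `score` would sit behind HTTP endpoints; the contract-discovery pattern is what matters here.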
ML models differ from traditional models in that they can be continuously trained. A training feedback loop should be created and should save inputs passed to the model, as well as resulting outputs and whether those outputs are meaningful. Visual analytics can also be used to present insights that are generated from the modeling platform in a meaningful way.
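A minimal version of that feedback loop records every request's inputs and outputs, attaches the outcome ("was the output meaningful?") once it is known, and feeds only confirmed entries to the next training run. The in-memory list below stands in for a durable store; the shapes of the records are assumptions.

```python
# Minimal sketch of a training feedback loop: save model inputs and
# outputs, attach outcomes later, and retrain only on confirmed
# entries. An in-memory list stands in for a durable store.

feedback_log = []

def record_prediction(inputs, output):
    """Save what went into and came out of the model."""
    feedback_log.append({"inputs": inputs, "output": output, "outcome": None})
    return len(feedback_log) - 1      # id for a later outcome update

def record_outcome(entry_id, was_correct):
    """Attach whether the output turned out to be meaningful."""
    feedback_log[entry_id]["outcome"] = was_correct

def training_examples():
    """Only entries with confirmed outcomes feed the next training run."""
    return [e for e in feedback_log if e["outcome"] is not None]

i = record_prediction({"amount": 2500}, {"fraudulent": True})
record_prediction({"amount": 12}, {"fraudulent": False})   # outcome unknown
record_outcome(i, was_correct=True)
```

Keeping unconfirmed entries out of retraining avoids reinforcing the model's own unverified guesses.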
Leveraging Data’s Benefits
By understanding how an ML technology stack is implemented, companies can leverage the benefits of data and build applications that could transform their businesses. Following the four operational steps described in this article and implementing supportive strategies will improve efficiency, and early adopters have a better chance of success.
By Ankur Garg