Automation of the Data Lifecycle: Focus on Data Storage

A deeper look into just how automation adds value at each of the phases of the data lifecycle and how automation at this level impacts the business (data) consumer.

In our first article, “Improve Data Lifecycle Efficiency with Automation,” we discussed how and where automation takes place throughout the data lifecycle. We discussed each phase and summarized how automation has increased the speed and efficiency with which we identify, collect, integrate, and utilize data. We also promised, in this piece and the ones to follow, to take a deeper look into just how automation adds value at each phase of the data lifecycle and how automation at this level impacts the business (data) consumer.

In our last article, “Automation of the Data Lifecycle: Focus on Data Creation,” we took an in-depth look at the first step of the process – the creation of enterprise data. We discussed the types of data and their sources, and touched on some of the major advances in automating the acquisition of enterprise data that, when collected and analyzed, can provide near-immediate value.

In the content that follows, we focus on the next (second) step in the data lifecycle – the storage of data. We look at just how automation can reduce the time it takes to move data to a method of storage that allows for its efficient consumption (the third step, which we will cover in a future article). We also discuss how automation at this phase can help increase the quality of the data that is being stored.

At the risk of stating the obvious, once the data has been created, those who wish to retain and leverage it must move it to a storage repository – hopefully, one that is secure, well organized, and well governed. It is in this step that most of the foundational data management activities (collection, integration, cleansing, quality assurance, etc.) take place. For purposes of this piece, we will focus less on the types of data repositories available to store large data sets and keep our eyes on the means of automating data collection and storage.

Automation and Data Storage

The movement of data from legacy systems to centralized repositories has been a business challenge regardless of the storage solution – Access, EDW, HDFS, or some of the cloud-native platforms offering low-cost, high-volume storage. Historically, the movement of data into those storage areas was undertaken on a periodic basis and accomplished via batch processing. While this did allow for some base-level analysis from a ‘what happened and how did we get to this point’ perspective, the data used could be days old – at best.

Real-time data ingestion to feed powerful analytics solutions demands the movement of high volumes of data from diverse sources without impacting source systems and with sub-second latency. Today, tools like Apache Kafka offer data professionals high throughput and low latency, providing real-time data feeds. This type of open-source message broker leverages a cluster-centric design that can be elastically scaled without downtime. Achieving visibility into business operations in real time allows organizations to identify and capitalize on opportunities, address risk, and harness these insights for strategic advantage.
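To make the idea concrete, here is a minimal sketch of how newly created records might be published to a Kafka topic as they are generated, using the open-source confluent-kafka Python client. The broker address, topic name, and record fields are illustrative assumptions rather than a prescribed design.

```python
# Minimal sketch: streaming records into Kafka as they arrive, using the
# confluent-kafka Python client. Broker address, topic name, and record
# structure are placeholders for illustration only.
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "broker1:9092"})

def delivery_report(err, msg):
    # Called once per message to confirm delivery or surface an error.
    if err is not None:
        print(f"Delivery failed: {err}")

def publish_event(event: dict):
    # Key by source system so related records land in the same partition
    # and retain their ordering.
    producer.produce(
        topic="enterprise-events",
        key=event["source_system"],
        value=json.dumps(event).encode("utf-8"),
        callback=delivery_report,
    )
    producer.poll(0)  # serve delivery callbacks without blocking

publish_event({"source_system": "orders", "order_id": 1234, "amount": 99.50})
producer.flush()  # block until all queued messages are delivered
```

Because the producer hands records to the cluster as they are created, downstream consumers can analyze them within seconds rather than waiting for the next batch window.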

Many of the early data repositories allowed for reporting – yet were limited to the data that was available within that legacy environment. Cross-functional analysis could only occur on data stored in warehouses whose data was anything but real-time and often cluttered with once-in, never-out data sets. Architectures available today allow information to move from the transactional store to the warehouse/lake almost instantaneously. Solutions like Amazon Kinesis offer a cloud-based service for processing data in real time over extremely large data sets. Such technologies offer the ability to capture and store terabytes of data from multiple sources, including social media, operational and financial systems, IT logs, clickstreams, and even wearable devices and industrial sensors. When we combine this with the idea of edge computing – moving the computing power closer to the data itself – we can achieve even lower latency, where real time is measured in sub-seconds rather than minutes.
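As a simple illustration, the following sketch shows how an individual sensor reading might be pushed into a Kinesis data stream with the boto3 client. The stream name, region, and record shape are hypothetical placeholders.

```python
# Hypothetical sketch: pushing a sensor or clickstream reading into an
# Amazon Kinesis data stream via boto3. Stream name, region, and record
# fields are assumptions for illustration only.
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

def put_reading(reading: dict):
    # The partition key determines the shard; using the device id keeps
    # each device's readings in order within its shard.
    kinesis.put_record(
        StreamName="iot-sensor-stream",
        Data=json.dumps(reading).encode("utf-8"),
        PartitionKey=reading["device_id"],
    )

put_reading({"device_id": "sensor-42", "temperature_c": 21.7,
             "ts": "2021-06-01T12:00:00Z"})
```

A consumer application (or a managed service attached to the stream) can then process these records within moments of their creation.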

Ensuring Data Quality at the Time and Place of Ingestion and Collection

Automating the ingestion and collection of data without an eye towards Data Quality and Governance may result in large volumes of data that is not fit for purpose. Today’s data professionals have come to realize that there is significant benefit in having checks and balances to ensure that the data being ingested is of the quality the business requires, which in turn increases confidence in the business decisions being made. Forward-thinking data professionals have an eye towards more than the automation of the ingestion process. They seek to move information into storage not only quickly and efficiently but also correctly, with less human intervention and dramatically fewer errors.

The utilization of Data Governance and Data Quality paradigms at this point decreases the data rejection rate and increases the overall quality, and therefore value, of the data. The implementation of data standards and governance provides the business rules that are used to cleanse the data, and this differs from the standard data validity checks that are undertaken as the information is entered (numeric versus alpha, valid date formats and ranges, etc.). The business rules look at different data items and compare them to one another based on the situation at hand. If ‘A’ is present, then ‘B’ must not only be numeric (from the data validity check) but must also be between the values of X and Y. These rules can be as simplistic or as complicated as necessary.

The rules are stored in an application that is then utilized as the data is being moved to verify the information and either accept it, reject it, or request further intervention. The systems and applications that perform these validations are the engine of Data Quality and include solutions from well-known providers such as Informatica, IBM, and SAP. The solutions provided by these vendors cover a wide range of critical functions, including profiling, parsing, cleansing, matching, standardization, and monitoring, amongst others. The basic process is relatively straightforward: as the information is analyzed, the system determines whether it is acceptable, should be rejected outright, or should be sent for further investigation through human intervention.

Further, one can leverage Artificial Intelligence to monitor the decisions made on rejected or investigated data and use the identified patterns to resolve similar issues automatically. This dramatically reduces the percentage of data requiring human intervention and moves that data to the store for analysis more effectively as the AI ‘learns’ the appropriate steps to take to resolve the anomalies.
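The sketch below illustrates the accept/reject/review pattern described above – a small set of conditional business rules evaluated at ingestion time, layered on top of basic type checks. The rule names, fields, and thresholds are purely hypothetical; commercial Data Quality platforms implement far richer versions of this logic.

```python
# Illustrative sketch of ingestion-time business rules that accept a record,
# reject it outright, or route it for human review. All rules, fields, and
# thresholds here are hypothetical examples.
from dataclasses import dataclass
from typing import Callable

ACCEPT, REJECT, REVIEW = "accept", "reject", "review"

@dataclass
class Rule:
    name: str
    applies: Callable[[dict], bool]   # when the rule is relevant (e.g. 'A' is present)
    check: Callable[[dict], bool]     # what must hold (e.g. 'B' is numeric and between X and Y)
    on_failure: str                   # REJECT outright, or flag for human REVIEW

RULES = [
    Rule(
        name="discount_requires_amount_in_range",
        applies=lambda rec: rec.get("discount_code") is not None,   # if 'A' is present...
        check=lambda rec: isinstance(rec.get("amount"), (int, float))
                          and 10 <= rec["amount"] <= 10_000,         # ...'B' must be numeric, between X and Y
        on_failure=REVIEW,
    ),
]

def validate(record: dict) -> str:
    # Apply each relevant rule; the first failure decides the record's fate.
    for rule in RULES:
        if rule.applies(record) and not rule.check(record):
            return rule.on_failure
    return ACCEPT

print(validate({"discount_code": "SPRING", "amount": 5}))  # -> review
print(validate({"discount_code": None, "amount": 5}))      # -> accept
```

In practice, the outcomes of the review queue are exactly the decisions an AI component could learn from, gradually resolving the recurring anomalies without human intervention.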

Automation of the data storage process can help to reduce the need for human intervention, save significant time and money, and provide the information consumer with value-added insights. As we indicated in the first piece in this series, “Organizations that look at data as they do any other critical corporate asset or resource will be the most successful.” This holds especially true in this important part of the data lifecycle, as having trusted, secure, high-quality data available in real time can provide the smart data professional with a significant competitive advantage.
