DataOps is NOT Just DevOps for Data
One common misconception about DataOps is that it is just DevOps applied to data analytics. While a little semantically misleading, the name “DataOps” has one positive attribute. It communicates that data analytics can achieve what software development attained with DevOps. That is to say, DataOps can yield an order of magnitude improvement in quality and cycle time when data teams utilize new tools and methodologies.
The specific ways that DataOps achieves these gains reflect the unique people, processes and tools characteristic of data teams (versus software development teams using DevOps). Here’s our in-depth take on both the pronounced and subtle differences between DataOps and DevOps.
The Intellectual Heritage of DataOps
DevOps is an approach to software development that accelerates the build lifecycle (formerly known as release engineering) using automation. DevOps focuses on continuous integration and continuous delivery of software by leveraging on-demand IT resources (infrastructure as code) and by automating integration, test and deployment of code. This merging of software development and IT operations (“DEVelopment” and “OPerationS”) reduces time to deployment, decreases time to market, minimizes defects, and shortens the time required to resolve issues.
Using DevOps, leading companies have been able to reduce their software release cycle time from months to (literally) seconds. This has enabled them to grow and lead in fast-paced, emerging markets. Companies like Google, Amazon and many others now release software many times per day. By improving the quality and cycle time of code releases, DevOps deserves a lot of credit for these companies’ success.
Optimizing code builds and delivery is only one piece of the larger puzzle for data analytics. DataOps seeks to reduce the end-to-end cycle time of data analytics, from the origin of ideas to the literal creation of charts, graphs and models that create value. The data lifecycle relies upon people in addition to tools. For DataOps to be effective, it must manage collaboration and innovation. To this end, DataOps introduces Agile Development into data analytics so that data teams and users work together more efficiently and effectively.
In Agile Development, the data team publishes new or updated analytics in short increments called “sprints.” With innovation occurring in rapid intervals, the team can continuously reassess its priorities and more easily adapt to evolving requirements. This type of responsiveness is impossible using a Waterfall project management methodology which locks a team into a long development cycle with one “big-bang” deliverable at the end.
Studies show that software development projects complete faster and with fewer defects when Agile Development replaces the traditional, sequential Waterfall methodology. The Agile methodology is particularly effective in environments where requirements evolve quickly, a situation well known to data analytics professionals. In a DataOps setting, Agile methods enable organizations to respond quickly to customer requirements and accelerate time to value.
Agile development and DevOps add significant value to data analytics, but there is one more major component to DataOps. Whereas Agile and DevOps relate to analytics development and deployment, data analytics also manages and orchestrates a data pipeline. Data continuously enters on one side of the pipeline, progresses through a series of steps and exits in the form of reports, models and views. The data pipeline is the “operations” side of data analytics. It is helpful to conceptualize the data pipeline as a manufacturing line where quality, efficiency, constraints and uptime must be managed. To fully embrace this manufacturing mindset, we call this pipeline the “data factory.”
In DataOps, the flow of data through operations is an important area of focus. DataOps orchestrates, monitors and manages the data factory. One particularly powerful lean-manufacturing tool is statistical process control (SPC). SPC measures and monitors data and operational characteristics of the data pipeline, ensuring that statistics remain within acceptable ranges. When SPC is applied to data analytics, it leads to remarkable improvements in efficiency, quality and transparency. With SPC in place, the data flowing through the operational system is verified to be working. If an anomaly occurs, the data analytics team will be the first to know, through an automated alert.
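To make the SPC idea concrete, here is a minimal sketch of a control-limit check on one pipeline statistic. The row counts, the three-sigma threshold and the function names are all assumptions chosen for illustration. It uses a plain sample standard deviation, whereas production control charts often estimate spread in other ways.

```python
from statistics import mean, stdev


def control_limits(baseline, sigmas=3.0):
    """Return (lower, upper) limits computed from known-good measurements."""
    center = mean(baseline)
    spread = stdev(baseline)
    return center - sigmas * spread, center + sigmas * spread


def within_limits(value, limits):
    """Return True when the value lies inside the control limits."""
    lower, upper = limits
    return lower <= value <= upper


# Hypothetical daily row counts from one pipeline step during a stable period.
baseline_row_counts = [1000, 1020, 980, 1010, 990, 1005, 995]
limits = control_limits(baseline_row_counts)

for todays_count in (1003, 400):
    if within_limits(todays_count, limits):
        print(f"{todays_count}: within control limits")
    else:
        # In a real data factory this is where the automated alert would fire.
        print(f"{todays_count}: ALERT, outside control limits")
```

With this baseline, a count of 1003 passes and a count of 400 raises the alert, which is the "first to know" behavior described above.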
While the name “DataOps” implies that it borrows most heavily from DevOps, it is all three of these methodologies (Agile, DevOps and statistical process control) that make up the intellectual heritage of DataOps. Agile governs analytics development, DevOps optimizes code verification, builds and delivery of new analytics, and SPC orchestrates and monitors the data factory. Figure 2 illustrates how Agile, DevOps and statistical process control flow into DataOps.
You can view DataOps in the context of a century-long evolution of ideas that improve how people manage complex systems. It started with pioneers like Deming and statistical process control. Gradually, these ideas crossed into the technology space in the form of Agile, DevOps and now, DataOps.
DevOps vs. DataOps — the Human Factor
As mentioned above, DataOps is as much about managing people as it is about tools. One subtle difference between DataOps and DevOps relates to the needs and preferences of stakeholders.
Figure 3: DataOps and DevOps users have different mindsets
DevOps was created to serve the needs of software developers. Dev engineers love coding and embrace technology. The requirement to learn a new language or deploy a new tool is an opportunity, not a hassle. They take a professional interest in all the minute details of code creation, integration and deployment. DevOps embraces complexity.
DataOps users are often the opposite of that. They are data scientists or analysts who are focused on building and deploying models and visualizations. Scientists and analysts are typically not as technically savvy as engineers. They focus on domain expertise. They are interested in getting models to be more predictive or deciding how to best visually render data. The technology used to create these models and visualizations is just a means to an end. Data professionals are happiest using one or two tools — anything beyond that adds unwelcome complexity. In extreme cases, the complexity grows beyond their ability to manage it. DataOps accepts that data professionals live in a multi-tool, heterogeneous world and it seeks to make that world more manageable for them.
DevOps vs. DataOps — Process Differences
We can begin to understand the unique complexity facing data professionals by looking at data analytics development and lifecycle processes. We find that data analytics professionals face challenges both similar and unique relative to software developers.
The DevOps lifecycle is commonly illustrated using a diagram in the shape of an infinity symbol (see Figure 4). The end of the cycle (“plan”) feeds back to the beginning (“create”), and the process iterates indefinitely.
Figure 4: The DevOps lifecycle is often depicted as an infinite loop
The DataOps lifecycle shares these iterative properties, but an important difference is that DataOps consists of two active and intersecting pipelines (Figure 5). The data factory, described above, is one pipeline. The other pipeline governs how the data factory is updated — the creation and deployment of new analytics into the data pipeline.
The data factory takes raw data sources as input and through a series of orchestrated steps produces analytic insights that create “value” for the organization. We call this the “Value Pipeline.” DataOps automates orchestration and, using SPC, monitors the quality of data flowing through the Value Pipeline.
The “Innovation Pipeline” is the process by which new analytic ideas are introduced into the Value Pipeline. The Innovation Pipeline conceptually resembles a DevOps development process, but upon closer examination, several factors make the DataOps development process more challenging than DevOps. Figure 5 shows a simplified view of the Value and Innovation Pipelines.
Figure 5: The DataOps lifecycle — the Value and Innovation Pipelines
DevOps vs. DataOps — Development and Deployment Processes
DataOps builds upon the DevOps development model. As shown in Figure 6, the DevOps process flow includes a series of steps that are common to software development projects:
Develop — create/modify an application
Build — assemble application components
Test — verify the application in a test environment
Deploy — transition code into production
Run — execute the application
DevOps introduces two foundational concepts: Continuous Integration (CI) and Continuous Deployment (CD). CI continuously builds, integrates and tests new code in a development environment. Build and test are automated so they can occur rapidly and repeatedly. This allows issues to be identified and resolved quickly. Figure 6 illustrates how CI encompasses the build and test process stages of DevOps.
Figure 6: Comparing the DataOps and DevOps processes
CD is an automated approach to deploying or delivering software. Once an application passes all qualification tests, DevOps deploys it into production. Together CI and CD resolve the main constraint hampering Agile development. Before DevOps, Agile created a rapid succession of updates and innovations that would stall in a manual integration and deployment process. With automated CI and CD, DevOps has enabled companies to update their software many times per day.
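The gating behavior of CI and CD can be sketched in a few lines: stages run in order, and deployment is reached only if every earlier stage succeeds. The stage functions below are hypothetical placeholders, not part of any real CI tool.

```python
def run_pipeline(stages):
    """Run (name, action) stages in order and stop at the first failure.

    Returns (completed_stage_names, failed_stage_name_or_None).
    """
    completed = []
    for name, action in stages:
        if not action():
            return completed, name
        completed.append(name)
    return completed, None


# Hypothetical stage implementations; each returns True on success.
def build():
    return True


def failing_test():
    return False


def passing_test():
    return True


def deploy():
    return True


# A failing test stage blocks deployment.
print(run_pipeline([("build", build), ("test", failing_test), ("deploy", deploy)]))

# When every check passes, the change flows through to production.
print(run_pipeline([("build", build), ("test", passing_test), ("deploy", deploy)]))
```

Because the whole sequence is automated, it can run on every change, which is what removes the manual integration and deployment bottleneck.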
The Duality of Orchestration in DataOps
It’s important to note that “orchestration” occurs twice in the DataOps process shown in Figure 6. As we explained above, DataOps orchestrates the data factory (the Value Pipeline). The data factory consists of a pipeline process with many steps. Imagine a complex directed acyclic graph (DAG). The “orchestrator” could be a software entity which controls the execution of the steps, traverses the DAG, and handles exceptions. For example, the orchestrator might create containers, invoke runtime processes with context-sensitive parameters, transfer data from stage to stage, and “monitor” pipeline execution. Orchestration of the data factory is the second “orchestration” in the DataOps process in Figure 7.
Figure 7: DataOps orchestrates the data factory.
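As a rough illustration of an orchestrator traversing a DAG, the sketch below runs each step after its dependencies, hands upstream outputs to downstream steps, and skips anything downstream of a failure. The step names and functions are invented for the example. It relies on `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later). A real orchestrator would also create containers, pass runtime parameters and monitor execution.

```python
from graphlib import TopologicalSorter


def orchestrate(dag, steps):
    """Run steps in dependency order.

    dag maps a step name to the set of step names it depends on.
    steps maps a step name to a callable that takes a dict of upstream outputs.
    Returns (results, failed), both keyed by step name.
    """
    results = {}
    failed = {}
    for name in TopologicalSorter(dag).static_order():
        upstream = dag.get(name, set())
        if any(dep in failed for dep in upstream):
            failed[name] = "skipped: upstream failure"
            continue
        try:
            results[name] = steps[name]({dep: results[dep] for dep in upstream})
        except Exception as exc:  # exception handling is part of the orchestrator's job
            failed[name] = f"error: {exc}"
    return results, failed


# Hypothetical three-step data factory.
dag = {"extract": set(), "transform": {"extract"}, "report": {"transform"}}
steps = {
    "extract": lambda inputs: [3, 1, 2],
    "transform": lambda inputs: sorted(inputs["extract"]),
    "report": lambda inputs: f"max={inputs['transform'][-1]}",
}

results, failed = orchestrate(dag, steps)
print(results["report"], failed)
```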
As noted above, the Innovation Pipeline has a representative copy of the data pipeline which is used to test and verify new analytics before deployment into production. This is the orchestration that occurs in conjunction with “testing” and prior to “deployment” of new analytics — as shown in Figure 8.
Orchestration occurs in both the Value and Innovation Pipelines. Similarly, testing fulfills a dual role in DataOps.
Figure 8: DataOps orchestration controls the numerous tools that access, transform, model, visualize and report data.
The Duality of Testing in DataOps
Tests in DataOps have a role in both the Value and Innovation Pipelines. In the Value Pipeline, tests monitor the data values flowing through the data factory to catch anomalies or flag data values outside statistical norms. In the Innovation Pipeline, tests validate new analytics before deploying them.
In DataOps, tests target either data or code. In a recent blog, we discussed this concept using Figure 9. Data that flows through the Value Pipeline is variable and subject to statistical process control and monitoring. Tests target the data which is continuously changing. Analytics in the Value Pipeline, on the other hand, are fixed and change only using a formal release process. In the Value Pipeline, analytics are revision controlled to minimize any disruptions in service that could affect the data factory.
In the Innovation Pipeline, code is variable and data is fixed. The analytics are revised and updated until complete. Once the sandbox is set up, the data doesn’t usually change. In the Innovation Pipeline, tests target the code (analytics), not the data. All tests must pass before promoting (merging) new code into production. A good test suite serves as an automated form of impact analysis that runs on any and every code change before deployment.
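Here is a small sketch of what a code test against fixed data might look like. The data set and the `revenue_by_region` analytic are hypothetical. The point is that the sandbox data is frozen, so any change in the test result must come from a change in the code.

```python
# Fixed sandbox data: it does not change between test runs.
SANDBOX_ORDERS = [
    {"region": "east", "amount": 100.0},
    {"region": "east", "amount": 50.0},
    {"region": "west", "amount": 25.0},
]


def revenue_by_region(orders):
    """Analytic under development: total order amount per region."""
    totals = {}
    for order in orders:
        totals[order["region"]] = totals.get(order["region"], 0.0) + order["amount"]
    return totals


def test_totals_match_known_answers():
    assert revenue_by_region(SANDBOX_ORDERS) == {"east": 150.0, "west": 25.0}


def test_empty_input_gives_empty_result():
    assert revenue_by_region([]) == {}


# Every test must pass before the new analytic is promoted.
test_totals_match_known_answers()
test_empty_input_gives_empty_result()
print("all code tests passed")
```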
Some tests are aimed at both data and code. For example, a test that makes sure that a database has the right number of rows helps your data and code work together. Ultimately both data tests and code tests need to come together in an integrated pipeline as shown in Figure 5. DataOps enables code and data tests to work together so all-around quality remains high.
Figure 9: In DataOps, analytics quality is a function of data and code testing
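The row-count example can be sketched as a check that guards both data and code. In the hypothetical step below, orders are enriched with customer names. A changed row count points to a bug in the join logic, while unmatched rows point to a problem in the incoming data. All names are assumptions made for the illustration.

```python
def enrich_orders(orders, customers):
    """Hypothetical pipeline step: attach a customer name to every order."""
    names = {customer["id"]: customer["name"] for customer in customers}
    return [
        {**order, "customer_name": names.get(order["customer_id"])}
        for order in orders
    ]


def check_enrichment(orders_in, orders_out):
    """Return a list of problems; an empty list means the test passed."""
    problems = []
    if len(orders_out) != len(orders_in):
        problems.append(
            f"row count changed: {len(orders_in)} in, {len(orders_out)} out"
        )
    unmatched = sum(1 for order in orders_out if order["customer_name"] is None)
    if unmatched:
        problems.append(f"{unmatched} order(s) have no matching customer")
    return problems


customers = [{"id": 1, "name": "Acme"}, {"id": 2, "name": "Globex"}]
orders = [{"order_id": 10, "customer_id": 1}, {"order_id": 11, "customer_id": 3}]

# The second order refers to a customer that does not exist, so the check flags it.
print(check_enrichment(orders, enrich_orders(orders, customers)))
```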
DataOps Complexity — Sandbox Management
When an engineer joins a software development team, one of their first steps is to create a “sandbox.” A sandbox is an isolated development environment where the engineer can write and test new application features, without impacting teammates who are developing other features in parallel. Sandbox creation in software development is typically straightforward — the engineer usually receives a bunch of scripts from teammates and can configure a sandbox in a day or two. This is the typical mindset of a team using DevOps.
Sandboxes in data analytics are often more challenging from a tools and data perspective. First of all, data teams collectively tend to use many more tools than typical software dev teams. There are literally thousands of tools, languages and vendors for data engineering, data science, BI, data visualization, and governance. Without the centralization that is characteristic of most software development teams, data teams tend to naturally diverge with different tools and data islands scattered across the enterprise.
Figure 10: A “sandbox” is an isolated development environment where the data professional can write and test new analytics without impacting teammates.
DataOps Complexity — Test Data Management
To create a dev environment for analytics, you have to create a copy of the data factory. This requires the data professional to replicate data which may have security, governance or licensing restrictions. It may be impractical or expensive to copy the entire data set, so some thought and care is required to construct a representative data set. Once a multi-terabyte data set is sampled or filtered, it may have to be cleaned or redacted (have sensitive information removed). The data also requires infrastructure which may not be easy to replicate due to technical obstacles or license restrictions.
Figure 11: The concept of test data management is a first order problem in DataOps.
The concept of test data management is a first order problem in DataOps whereas in most DevOps environments, it is an afterthought. To accelerate analytics development, DataOps has to automate the creation of development environments with the needed data, software, hardware and libraries so innovation keeps pace with Agile iterations.
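The sampling and redaction steps described above can be automated. The sketch below is one simple way to do it, assuming rows arrive as dictionaries and that `email` and `ssn` are the sensitive fields. Both the field names and the salt value are made up for the example. Salted hashing keeps values consistent across tables, but it is not a substitute for a reviewed anonymization policy.

```python
import hashlib
import random

SENSITIVE_FIELDS = ("email", "ssn")  # hypothetical field names


def sample_rows(rows, fraction, seed=0):
    """Keep roughly `fraction` of the rows; the same seed gives the same sample."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


def redact(row, salt, sensitive_fields=SENSITIVE_FIELDS):
    """Replace sensitive values with a salted hash token.

    The same input always maps to the same token, so joins on the field still work.
    """
    cleaned = dict(row)
    for field in sensitive_fields:
        if cleaned.get(field) is not None:
            raw = f"{salt}:{cleaned[field]}".encode("utf-8")
            cleaned[field] = "redacted-" + hashlib.sha256(raw).hexdigest()[:12]
    return cleaned


def build_test_data(rows, fraction, salt, seed=0):
    """Sample the production rows, then redact each sampled row."""
    return [redact(row, salt) for row in sample_rows(rows, fraction, seed)]


# Hypothetical production rows.
production = [
    {"id": i, "email": f"user{i}@example.com", "ssn": f"000-00-{i:04d}", "spend": i * 10}
    for i in range(1000)
]

sandbox = build_test_data(production, fraction=0.05, salt="sandbox-salt", seed=42)
print(len(sandbox), "rows in the sandbox data set")
```

Running a function like `build_test_data` as part of sandbox creation is one way to keep development environments supplied with fresh, safe data without manual copying.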
DataOps Connects the Organization in Two Ways
DevOps strives to help development and operations (information technology) teams work together in an integrated fashion. In DataOps, this concept is depicted in Figure 12. The development team are the analysts, scientists, engineers, architects and others who create data warehouses and analytics.
In data analytics, the operations team supports and monitors the data pipeline. This can be IT, but it also includes customers — the users who create and consume analytics. DataOps brings these groups together so they can work together more closely.
Figure 12: DataOps combines data analytics development and data operations.
Freedom vs. Centralization
DataOps also brings the organization together across another dimension. A great deal of data analytics development occurs in remote corners of the enterprise, close to business units, using self-service tools like Tableau, Alteryx, or Excel. These local teams, engaged in decentralized, distributed analytics creation, play an essential role in delivering innovation to users. Empowering these pockets of creativity maintains the enterprise’s competitiveness, but frankly, a lack of top-down control can lead to unmanaged chaos.
Centralizing analytics development under the control of one group, such as IT, enables the organization to standardize metrics, control data quality, enforce security and governance, and eliminate islands of data. The issue is that too much centralization chokes creativity.
Figure 13: DataOps brings together centralized and distributed development
One important benefit of DataOps is its ability to harmonize the back-and-forth between the decentralized and centralized development of data analytics — the tension between centralization and freedom. In a DataOps enterprise, new analytics originate and undergo refinement in the local pockets of innovation. When an idea proves useful or is worthy of wider distribution, it is promoted to a centralized development group who can more efficiently and robustly implement it at scale.
DataOps brings localized and centralized development together enabling organizations to reap the efficiencies of centralization while preserving localized development — the tip of the innovation spear. DataOps brings the enterprise together across two dimensions as shown in Figure 14 — development/operations as well as distributed/centralized development.
Figure 14: DataOps brings teams together across two dimensions — development/operations as well as distributed/centralized development.
DataOps brings three cycles of innovation between core groups in the organization: centralized production teams, centralized data engineering/analytics/science/governance development teams, and groups using self-service tools distributed into the lines of business closest to the customer. Figure 15 shows the interlocking cycles of innovation.
Figure 15: DataOps brings three cycles of innovation between production, central data, and self-service teams.
Enterprise Example — Data Analytics Lifecycle Complexity
Having examined the DataOps development process at a high level, let’s look at the development lifecycle in the enterprise context. Figure 16 illustrates the complexity of analytics progression from inception to production. Analytics are first created and developed by an individual and then merged into a team project. After completing user acceptance testing (UAT), analytics move into production. The goal of DataOps is to create analytics in the individual development environment, advance into production, receive feedback from users and then continuously improve through further iterations. This can be challenging due to the differences in personnel, tools, code, versions, manual procedures/automation, hardware, operating systems/libraries and target data. The columns in Figure 16 show the varied characteristics for each of these four environments.
The challenge of pushing analytics into production across these four quite different environments is daunting without DataOps. It requires a patchwork of manual operations and scripts that are in themselves complex to manage. Human processes are error-prone so data professionals compensate by working long hours, mistakenly relying on hope and heroism for success. All of this results in unnecessary complexity, confusion and a great deal of wasted time and energy. Slow progression through the lifecycle shown in Figure 16 coupled with high-severity errors finding their way into production can leave a data analytics team little time for innovation.
Figure 16: Data Analytics Development Lifecycle Complexities
DataOps simplifies the complexity of data analytics creation and operations. It aligns data analytics development with user priorities. It streamlines and automates the analytics development lifecycle — from the creation of sandboxes to deployment. DataOps controls and monitors the data factory so data quality remains high, keeping the data team focused on adding value.
You can get started with DataOps by implementing these seven steps. You can also adopt a DataOps Platform which will support DataOps methods within the context of your existing tools and infrastructure.
A DataOps Platform automates the steps and processes that comprise DataOps: sandbox management, orchestration, monitoring, testing, deployment, the data factory, dashboards, Agile, and more. A DataOps Platform is built for data professionals with the goal of simplifying all of the tools, steps and processes that they need into an easy-to-use, configurable, end-to-end system. This high degree of automation eliminates a great deal of manual work, freeing up the team to create new and innovative analytics that maximize the value of an organization’s data.
Source: Data Kitchen on Medium