Why DataOps?

Always more data... and more data sources

With widespread IoT, machine learning, and artificial intelligence adoption, data is critical to operations. We're constantly generating new data, and since storage is so cheap, we are storing information without fully understanding what all the use cases will be. We need a fast way of ingesting and organizing ALL the data and ongoing changes.

Compliance is critical

At the same time, we are seeing an increase in regulations like GDPR that govern how we acquire, transfer, store, process, and use data. To prove compliance, we need to be able to explain:

Where we got the data
How fresh it is
Who gave us permission to store and process it
What we did with it, including how we changed it
Who has access to it, what they can do to it, and what they have done to it

Data teams must now maintain metadata (data about their data and their pipelines) while also handling incoming change requests from business stakeholders. That would be manageable if stakeholder demands were not also becoming more frequent and more complex.
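As a rough illustration of what "data about data" could look like, the compliance questions listed earlier can be kept as a small record stored alongside each dataset. This is only a sketch: the `DatasetRecord` class, its field names, and the sample values are hypothetical and not taken from any particular DataOps tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class DatasetRecord:
    """Hypothetical metadata kept alongside one dataset."""
    source: str                 # where we got the data
    loaded_at: datetime         # used to work out how fresh it is
    consent_basis: str          # who gave permission to store and process it
    transformations: list = field(default_factory=list)  # what we did with it
    permissions: dict = field(default_factory=dict)      # who may do what
    access_log: list = field(default_factory=list)       # what they actually did

    def age(self, now):
        """Return how old the data is at the given moment."""
        return now - self.loaded_at

    def record_transformation(self, description):
        """Note a change that was applied to the data."""
        self.transformations.append(description)

    def record_access(self, user, action):
        """Log an attempted action and report whether it is permitted."""
        allowed = action in self.permissions.get(user, set())
        self.access_log.append((user, action, allowed))
        return allowed


# Illustrative values only.
loaded = datetime(2024, 1, 1, 6, 0, tzinfo=timezone.utc)
record = DatasetRecord(
    source="crm_export",
    loaded_at=loaded,
    consent_basis="customer opt-in form",
    permissions={"analyst": {"read"}},
)
record.record_transformation("masked email addresses")
read_allowed = record.record_access("analyst", "read")
delete_allowed = record.record_access("analyst", "delete")
hours_old = record.age(loaded + timedelta(hours=3)).total_seconds() / 3600
```

In practice this kind of record would typically be generated by the pipeline itself rather than maintained by hand, so that it stays accurate as the data changes.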

Business agility and responsiveness

Stakeholders now want the ability to do things with data at the same speed they've become used to with other technology. To help deliver on these demands, data teams need to be able to:

Acquire and ingest new data with minimal effort and energy
Build out use cases within sandboxes and test their value quickly
Assemble production pipelines of logic quickly, ready for testing
Acquire, load, curate, track, and deliver data to the business faster, so stakeholders can derive value as quickly as possible

Too many data "tools"

In most organizations today, the cost of cleaning, transforming, and integrating data is just too high. Organizations often have too many tools doing the same or similar things, and no one really shares what they create. We must stop the chaos that exists because of too many tools or ungoverned self-service data-prep environments. DataOps is designed to fuel reuse rather than re-invention.

Data is precious

A tension exists between agility and governance in the data world, similar to the software development world. Achieving a healthy balance is even harder in the data world, as the scope of data being managed is generally broader. In the dev world, we have to deal with the code, the logic, and the directly applicable data being processed; the data world encompasses all the code, the components, the logic, AND entire data sets.

This data is often subject to regulatory requirements (e.g., GDPR), whereas the data in application logic is much less likely to be. All data is a precious organizational asset that must be managed, protected, and assured. If data is damaged or lost, you devalue a critical business asset and risk reputational damage or regulatory sanctions. That risk is greater in the data world, where entire large datasets are at stake.

Gartner Definition

DataOps is a collaborative data management practice focused on improving the communication, integration and automation of data flows between data managers and data consumers across an organization. The goal of DataOps is to deliver value faster by creating predictable delivery and change management of data, data models and related artifacts. DataOps uses technology to automate the design, deployment and management of data delivery with appropriate levels of governance, and it uses metadata to improve the usability and value of data in a dynamic environment.

Wikipedia Definition

(Note that Wikipedia’s definition assimilates concepts gathered from Data Science Central, Datalytyx, WhatIs.com, DataKitchen, Tamr and Nexla)

DataOps is a set of practices, processes and technologies that combines an integrated and process-oriented perspective on data with automation and methods from agile software engineering to improve quality, speed, and collaboration and promote a culture of continuous improvement in the area of data analytics.^[1] While DataOps began as a set of best practices, it has now matured to become a new and independent approach to data analytics.^[2] DataOps applies to the entire data lifecycle^[3] from data preparation to reporting, and recognizes the interconnected nature of the data analytics team and information technology operations.^[4]

DataOps incorporates the Agile methodology to shorten the cycle time of analytics development in alignment with business goals.^[3]

DevOps focuses on continuous delivery by leveraging on-demand IT resources and by automating test and deployment of software. This merging of software development and IT operations has improved velocity, quality, predictability and scale of software engineering and deployment. Borrowing methods from DevOps, DataOps seeks to bring these same improvements to data analytics.^[4]

DataOps utilizes statistical process control (SPC) to monitor and control the data analytics pipeline. With SPC in place, the data flowing through an operational system is constantly monitored and verified to be working. If an anomaly occurs, the data analytics team can be notified through an automated alert.^[5]
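A minimal sketch of the SPC idea, assuming the monitored measurement is the number of rows loaded per pipeline run and that control limits are set at three standard deviations around a baseline mean (a common convention, not something the definition above prescribes). The function names and sample figures are hypothetical.

```python
from statistics import mean, stdev


def control_limits(baseline, sigmas=3.0):
    """Return (lower, upper) control limits from in-control baseline values."""
    center = mean(baseline)
    spread = stdev(baseline)
    return center - sigmas * spread, center + sigmas * spread


def out_of_control(values, limits):
    """Return (index, value) pairs that fall outside the control limits."""
    lower, upper = limits
    return [(i, v) for i, v in enumerate(values) if not lower <= v <= upper]


# Hypothetical row counts from earlier runs that were known to be healthy.
baseline_row_counts = [1000, 1020, 980, 1010, 990, 1005, 995]
limits = control_limits(baseline_row_counts)

# Hypothetical row counts from the most recent runs.
new_row_counts = [1003, 997, 400, 1012]
anomalies = out_of_control(new_row_counts, limits)

for index, value in anomalies:
    # A real pipeline would send a notification here instead of printing.
    print(f"ALERT: run {index} loaded {value} rows, "
          f"outside {limits[0]:.0f}..{limits[1]:.0f}")
```

The same check could be applied to any numeric measurement of a pipeline, such as null rates or load durations.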

DataOps is not tied to a particular technology, architecture, tool, language or framework. Tools that support DataOps promote collaboration, orchestration, quality, security, access and ease of use.^[6]

Eckerson Group Definition

DataOps, short for “data operations,” brings rigor to the development and management of data pipelines. It promises to turn the creation of analytic solutions from an artisanal undertaking by a handful of developers and analysts to an industrial operation that builds and maintains hundreds or thousands of data pipelines. DataOps not only increases the scale of data analytics development, but it accelerates delivery while improving quality and staff productivity. In short, it creates an environment where “faster, better, cheaper” is the norm, not the exception.

Mike Ferguson Definition

DataOps applies the use of DevOps technologies to the collaborative development and maintenance of data and analytical pipelines in order to produce trusted, reusable and integrated data and analytical products. These products include trusted, reusable datasets, predictive models, prescriptive models, decision services, BI reports and dashboards. The objective is to accelerate the creation and maintenance of these data and analytical products via continuous component based development of data and analytical pipelines that assemble and orchestrate data cleansing, data transformation, data matching and data integration and analytic component-based services. In addition, all changes to versions and operating configurations are managed with build, test and deployment automation in order to shorten time to value.

The Benefits of DataOps (or #TrueDataOps)

For anyone familiar with DevOps, the benefits of DataOps will look quite familiar:

Collaborative development

Collaborative development between business and data teams will increase the agility of the business and prevent business teams from buying their own self-service tools, reinventing pipelines, and creating inconsistent processes that are hard to maintain.

Increased efficiencies

Smaller teams with more powerful tools are faster and more productive. DataOps removes the need for bloated operational teams hand-cranking the management of development, test, and production infrastructure.

Reduced implementation costs

By shortening the time to production and (where required) recovery, businesses can reduce costs by more than 70%, and that is before measuring the additional value gained from their data analytics.

Maintainability and total cost of ownership

Account for both short-term implementation costs and long-term code maintenance. DataOps streamlines these aspects, cutting Total Cost of Ownership (TCO) by over 60%.

Simplified orchestration and management

The DataOps philosophy transcends vendor-specific limitations. This allows your business to store all its data together, enabling more flexible and more useful use cases at lower cost.

Faster development

Increasing the agility of data processes helps the business gain access to valuable insights in hours or days rather than weeks or months.

Build once, reuse anywhere

Create logic as small, reusable components and then reuse them many times. Avoid duplicated code, which tends to drift and become inconsistent.
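One way to picture this, as a sketch only: each cleaning rule is written once as a small function, and pipelines are assembled by composing those functions rather than copying their logic. The helper names (`trim_whitespace`, `lowercase_email`, `build_pipeline`) and the sample rows are hypothetical.

```python
def trim_whitespace(row):
    """Strip leading and trailing whitespace from every string value."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}


def lowercase_email(row):
    """Normalize the 'email' field to lower case when it is present."""
    if isinstance(row.get("email"), str):
        return {**row, "email": row["email"].lower()}
    return row


def build_pipeline(*steps):
    """Combine reusable steps into one function that processes a list of rows."""
    def run(rows):
        cleaned = []
        for row in rows:
            for step in steps:
                row = step(row)
            cleaned.append(row)
        return cleaned
    return run


# Two pipelines share the same components instead of duplicating their code.
clean_customers = build_pipeline(trim_whitespace, lowercase_email)
clean_products = build_pipeline(trim_whitespace)

customers = clean_customers([{"name": " Ada ", "email": "  Ada@Example.COM "}])
products = clean_products([{"sku": " A-100 ", "price": 9.5}])
```

A fix made to `trim_whitespace` then reaches every pipeline that uses it, which is the consistency benefit described above.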

Data assurance

Improve the quality of the data you deliver to your business and provide assurances and guarantees your business stakeholders can rely on.

Parallel development

Using technology to enhance collaboration allows data teams to do more. Using DataOps, we have seen a team of four people complete 200 cycles, 80 commits, and 50 pushes to production in just one day.

Improved supply chain

Establish a supply chain of data producers. Rather than treating a new data source as a one-time project, enable departments to become ongoing data consumers and producers that can share data across the enterprise.