The What and Why of DataOps
How the changing market and the many benefits of DataOps make it an essential component of today’s data management practice.
Always more data... and more data sources
With widespread IoT, machine learning, and artificial intelligence adoption, data is critical to operations. We're constantly generating new data, and since storage is so cheap, we are storing information without fully understanding what all the use cases will be. We need a fast way of ingesting and organizing ALL the data and ongoing changes.
Compliance is critical
At the same time, we are seeing an increase in regulations like GDPR that govern how we acquire, transfer, store, process, and use data. To prove compliance, we need to be able to explain:
- Where we got the data
- How fresh it is
- Who gave us permission to store and process it
- What we did with it, including how we changed it
- Who has access to it, what they can do to it, and what they have done to it
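To make this concrete, the provenance a team must track for each dataset could be captured in a simple record. This is a minimal sketch, not any specific tool's API; all names, fields, and identifiers here are hypothetical:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ProvenanceRecord:
    """Minimal provenance metadata for one dataset (illustrative only)."""
    source: str                # where we got the data
    ingested_at: datetime      # how fresh it is
    consent_reference: str     # who gave us permission to store and process it
    transformations: list = field(default_factory=list)  # what we did with it
    access_log: list = field(default_factory=list)       # who touched it, and what they did

    def record_transformation(self, description: str) -> None:
        # Append a timestamped entry describing how the data was changed
        self.transformations.append((datetime.now(timezone.utc), description))

# Example: tracking a GDPR-relevant dataset (all values hypothetical)
record = ProvenanceRecord(
    source="crm_export_eu",
    ingested_at=datetime.now(timezone.utc),
    consent_reference="dpa-2023-017",
)
record.record_transformation("pseudonymized email addresses")
```

In practice this metadata usually lives in a data catalog or lineage tool rather than application code, but the shape of what must be recorded is the same.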
Data teams must now maintain data about their data and their pipelines while also handling incoming business change requests from stakeholders. This might be manageable if stakeholder demands weren't becoming both more frequent and more complex.
Business agility and responsiveness
Stakeholders now want the ability to do things with data at the same speed they've become used to with other technology. To help deliver on these demands, data teams need to be able to:
- Acquire and ingest new data with minimal effort and energy
- Build out use cases within sandboxes and test their value quickly
- Assemble production pipelines of logic quickly, ready for testing
- Acquire, load, curate, track, and deliver data to the business faster, so stakeholders can derive value as quickly as possible
Too many data "tools"
In most organizations today, the cost of cleaning, transforming, and integrating data is just too high. Organizations often have too many tools doing the same or similar things, and no one really shares what they create. We must stop the chaos that exists because of too many tools or ungoverned self-service data-prep environments. DataOps is designed to fuel reuse rather than re-invention.
Data is precious
A tension exists between agility and governance in the data world, similar to the one in the software development world. Achieving a healthy balance is even harder with data, because the scope of what is being managed is broader. In the dev world, teams deal with the code, the logic, and the directly applicable data being processed; the data world encompasses all the code, the components, the logic, AND entire data sets.
This data is often subject to regulatory requirements (e.g., GDPR), whereas data embedded in application logic is much less likely to be. All data is a precious organizational asset that must be managed, protected, and assured. If data is damaged or lost, you devalue a critical business asset and risk reputational damage or regulatory sanctions. That risk is amplified in the data world, where large datasets are the norm.
So, what is DataOps?
There have been several attempts to define the concept of DataOps. Here are examples from Gartner, Wikipedia, the Eckerson Group, and Mike Ferguson.
Gartner Definition
DataOps is a collaborative data management practice focused on improving the communication, integration and automation of data flows between data managers and data consumers across an organization. The goal of DataOps is to deliver value faster by creating predictable delivery and change management of data, data models and related artifacts. DataOps uses technology to automate the design, deployment and management of data delivery with appropriate levels of governance, and it uses metadata to improve the usability and value of data in a dynamic environment.
Wikipedia Definition
(Note that Wikipedia's definition draws on concepts from Data Science Central, Datalytyx, WhatIs.com, DataKitchen, Tamr and Nexla)
DataOps is a set of practices, processes and technologies that combines an integrated and process-oriented perspective on data with automation and methods from agile software engineering to improve quality, speed, and collaboration and promote a culture of continuous improvement in the area of data analytics. While DataOps began as a set of best practices, it has now matured to become a new and independent approach to data analytics. DataOps applies to the entire data lifecycle from data preparation to reporting, and recognizes the interconnected nature of the data analytics team and information technology operations.
DevOps focuses on continuous delivery by leveraging on-demand IT resources and by automating test and deployment of software. This merging of software development and IT operations has improved velocity, quality, predictability and scale of software engineering and deployment. Borrowing methods from DevOps, DataOps seeks to bring these same improvements to data analytics.
DataOps utilizes statistical process control (SPC) to monitor and control the data analytics pipeline. With SPC in place, the data flowing through an operational system is constantly monitored and verified to be working. If an anomaly occurs, the data analytics team can be notified through an automated alert.
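The SPC idea above can be sketched in a few lines: derive control limits from recent history of a pipeline metric (here, daily row counts) and flag any run that falls outside them. The metric, thresholds, and data below are illustrative assumptions, not a specific DataOps tool's API:

```python
import statistics

def control_limits(history, sigmas=3.0):
    """Compute SPC lower/upper control limits (mean +/- N sigma) from history."""
    mean = statistics.mean(history)
    sd = statistics.stdev(history)
    return mean - sigmas * sd, mean + sigmas * sd

def in_control(history, latest):
    """Return True if the latest metric value falls within the control limits."""
    lower, upper = control_limits(history)
    return lower <= latest <= upper

# Hypothetical daily row counts from recent pipeline runs
history = [10_120, 9_980, 10_050, 10_210, 9_940, 10_080, 10_000]

in_control(history, 10_100)   # a normal run: within limits, no alert
in_control(history, 4_500)    # an anomalous run: outside limits, raise an alert
```

In a real deployment the `in_control` check would run automatically after each pipeline execution, with a failure routed to the team's alerting channel.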
DataOps is not tied to a particular technology, architecture, tool, language or framework. Tools that support DataOps promote collaboration, orchestration, quality, security, access and ease of use.
Eckerson Group Definition
DataOps, short for “data operations,” brings rigor to the development and management of data pipelines. It promises to turn the creation of analytic solutions from an artisanal undertaking by a handful of developers and analysts to an industrial operation that builds and maintains hundreds or thousands of data pipelines. DataOps not only increases the scale of data analytics development, but it accelerates delivery while improving quality and staff productivity. In short, it creates an environment where “faster, better, cheaper” is the norm, not the exception.
Mike Ferguson Definition
DataOps applies the use of DevOps technologies to the collaborative development and maintenance of data and analytical pipelines in order to produce trusted, reusable and integrated data and analytical products. These products include trusted, reusable datasets, predictive models, prescriptive models, decision services, BI reports and dashboards. The objective is to accelerate the creation and maintenance of these data and analytical products via continuous, component-based development of data and analytical pipelines that assemble and orchestrate data cleansing, data transformation, data matching, data integration and analytic component-based services. In addition, all changes to versions and operating configurations are managed with automated build, test and deployment in order to shorten time to value.
As we can see, even accepted authorities do not agree on a definition. And we can expect definitions to change further as data teams gain real-world experience of delivering with the DataOps philosophy.
Regardless, there’s one thing everyone can agree on…
The Benefits of DataOps (or #TrueDataOps)
For anyone familiar with DevOps, the benefits of DataOps will look quite familiar:
Collaborative development between business and data teams will increase the agility of the business and prevent business teams from buying their own self-service tools, reinventing pipelines, and creating inconsistent processes that are hard to maintain.
Smaller teams with more powerful tools are faster and more productive. DataOps removes the need for bloated operational teams hand-cranking the management of development, test, and production infrastructure.
By shortening the time to production and (where required) recovery, businesses can reduce costs by more than 70%—and that’s before we measure the additional value from their data analytics.
When you account for both short-term implementation costs and long-term code maintenance, DataOps streamlines both, cutting Total Cost of Ownership (TCO) by over 60%.
The DataOps philosophy transcends vendor-specific limitations, allowing your business to store all its data together and support more flexible, more valuable use cases at lower cost.
Increasing the agility of data processes helps the business gain access to valuable insights in hours or days rather than weeks or months.
Create logic as small, reusable components that can be used and reused many times, avoiding duplicated code that inevitably drifts and becomes inconsistent.
Improve the quality of the data you deliver to your business and provide assurances and guarantees your business stakeholders can rely on.
Using technology to enhance collaboration allows data teams to do more—using DataOps we have seen a team of four people complete 200 cycles, 80 commits and 50 pushes to production in just one day.
Establish a supply chain of data producers. Rather than treating a new data source as a one-time project, enable departments to become ongoing data consumers and producers that can share data across the enterprise.
Join the #TrueDataOps movement
Rethink the way you work with data. Join the #TrueDataOps movement and get regular access to DataOps content. Sign up now and get started with the DataOps for Dummies ebook.