Governance or agility?
Do you have to compromise one to get the other? Or is the way we approach data evolving?
The Philosophy of DataOps
The DataOps Manifesto has become a valuable resource to define the WHAT: what we are aiming to achieve with DataOps, and what DataOps will deliver. Our Philosophy of DataOps has been created to define the HOW: the way to deliver DataOps and the core pillars that should underpin any DataOps deployment.
Sign the Philosophy and join the #TrueDataOps Community. Get access to more DataOps content updates each month from members, and get your copy of the DataOps for Dummies book. Rethink the way you work with your data.
Having been involved in the delivery of big data analytics projects over the last 10 years, and data integration/data quality projects over the last 20 years, we have come to recognise similar challenges across the data teams in many organisations.
Business users, who 10 or even 5 years ago would have been happy with an answer "next quarter", now expect results within days. Why is this? The rapid improvements made across the Enterprise IT and software spaces have taught business users that response times measured in days are now the norm; their expectations have been reset to expect higher levels of service. But data teams have been left behind, with processes and technologies unable to keep up with this expectation. Naturally, this causes disappointment and frustration.
Data teams are overrun with demand from the business that they simply can't deliver. They operate with big backlogs of tasks that have been requested to help the business succeed. The work is heavily manual: building integration jobs, testing and retesting jobs, assembling data pipelines, managing dev / test / production environments and maintaining documentation. All these manual tasks are disconnected, slow and prone to error.
Minor changes to any workload that has made it through to a production environment take weeks and months to implement. This is because of the amount of manual effort that must go into discovery, design, build, test and deploy processes around even those minor changes. Current waterfall approaches are having a major impact on agility. Too many data stores (e.g. sources >> ODS >> ETL >> DW >> ETL >> marts >> ETL >> BI tool) mean that any change early on (e.g. changing a data source) has a domino effect all the way down the line. Minor changes have the potential to impact existing environments significantly; the rollback of any faulty change is even more involved.
Data teams are plagued with comments from the business: “loading new data just takes too long” and “I can't trust the data”. Often over 50% of the team’s time is hijacked, so this is unsurprising. They're side-tracked with work they didn’t plan to do, or diverted to take on new governance requirements around how they acquire, process, secure and deliver data to support the business needs.
The issues above mean data teams get avoided. Ungoverned self-service data preparation creates a ‘wild-west’ of inconsistency and re-invention of data sets throughout the enterprise. The business ends up with various copies of the same fundamental datasets (e.g. customer info) prepared by different teams, all with different results, and no idea which to trust. Additionally, there is no corporate standard around common data names and definitions when producing data for consumption.
The diagram shown here is from The Eckerson Group, and visualises the underbelly of the modern data environment. In it you can see all the elements involved in building and managing a modern data platform. The key thing to note is that it IS complicated and there are many moving parts in modern data environments. The more complicated and governance-heavy these data environments are, the less agile they become under the weight of all that governance, security, and privacy. This is because of the largely manual methods we use today to build and manage these environments. We need to find a way out of this...
Many years ago, software development faced similar challenges – agility versus governance. The problem was solved with the emergence of the DevOps philosophy, to improve software development agility.
DevOps streamlines the process of responding to new or changed customer needs, and develops solutions quickly. At the same time there is clear governance, auditability and maintainability, yet teams are still able to push updates to production in a short amount of time with very low impact to running systems. DevOps successfully achieves a careful balance between governance and agility. Governed agility; agile governance.
The idea of introducing DevOps to data (aka DataOps) was first raised in a 2014 blog post on the IBM Big Data & Analytics Hub entitled "Three reasons why DataOps is essential for big data success". The DataOps concept really began to gain traction in 2017, with a significant growth in ecosystem development, analyst coverage, keyword searches, surveys, publications and open source projects. And its importance was further confirmed when Gartner added it to their 2018 Hype Cycle for Data Management.
With widespread adoption of IoT, machine learning and artificial intelligence, data is critical to operations. We're constantly generating new data. Given that storage is so cheap, we are storing information without fully understanding what all the use cases for it will be. Regardless, we need a fast way of ingesting and organising ALL the data and ongoing changes.
Data Teams are now expected to maintain data about data and their pipelines at the same time as dealing with incoming business change requests from stakeholders. This might not be a problem if stakeholder demands were not becoming more frequent and complex.
Stakeholders now want the ability to do things with data at the same speed they have become used to with other areas of technology. To help deliver on these demands, data teams need to be able to:
Today, in most organisations, the cost of cleaning, transforming and integrating data is just too high. There are often too many tools in use doing the same or similar things, and no-one really sharing what they create. We must stop the ungoverned chaos that tends to exist as a result of too many tools or ungoverned self-service data-prep environments. DataOps is designed to fuel re-use rather than re-invention.
A tension exists between agility and governance in the data world, similar to the software development world. Achieving a healthy balance is even harder in the data world as the scope of data being managed is generally broader. In the dev world we have to deal with the code, the logic and the directly applicable data being processed; the data world encompasses all the code, the components, the logic AND entire data sets. This data is often subject to regulatory requirements (e.g. GDPR) whereas the data in application logic is much less likely to be. All data is a precious organisational asset that must be managed, protected and assured. As ever, if data is damaged or lost, you devalue a critical business asset and risk reputational damage or regulatory sanctions. But that risk is higher in the data world, when dealing with large datasets.
For anyone familiar with DevOps, the benefits of DataOps will look quite familiar:
Collaborative development between business and data teams will increase the agility of the business and avoid the business teams going and buying their own self-service tools, reinventing pipelines and creating inconsistent processes that are hard to maintain.
Smaller teams with more powerful tools are faster and more productive. DataOps removes the need for bloated operational teams hand-cranking the management of development, test and production infrastructure.
By shortening the time to production and (where required) recovery, businesses can reduce costs by more than 70% - and that’s before we measure the additional value from their data analytics.
Don't just think about the time and cost to implement a feature/requirement. Think, too, about the long term support and maintainability of code and configuration. Think about the effort needed to perform routine tasks. DataOps streamlines all of these, so Total Cost of Ownership (TCO) can be reduced by over 60%.
The DataOps philosophy transcends vendor-specific limitations. This allows your business to store all data together, creating more flexible, more useful use cases at lower cost.
Increasing the agility of data processes will help gain access to valuable insights in hours or days rather than weeks and months.
Create logic as small, reusable components and then use/reuse many times. Avoid duplicated code which ends up inconsistent.
Improve the quality of the data you deliver to your business and provide assurances and guarantees your business stakeholders can rely on.
Using technology to enhance collaboration allows data teams to do more – using DataOps we have seen a team of four people complete 200 cycles, 80 commits and 50 pushes to production in just one day.
Establish a supply chain of data producers. Rather than treating a new data source as a one-time project, treat each department as both a producer and a consumer of data products used across the data-driven enterprise.
As with any new technology philosophy, there have been plenty of people jumping onto the DataOps bandwagon. But because everyone has their own motivations and goals, there are 50+ different definitions of what DataOps is – and most of them we disagree with in some part.
Almost all of them start from the perspective of "this is how we do data" and they try to add elements of DevOps and Agile to deliver some of the value of DevOps. This is not #TrueDataOps. Instead #TrueDataOps delivers an order of magnitude improvement in quality and cycle time for data-driven operations by starting with pure DevOps and Agile principles (which have been battle hardened over 20+ years) and determining where they don't meet the demands of Data and adapting accordingly. This is the same approach that the Infrastructure as Code (sometimes called InfraOps) movement adopted nearly a decade ago with great success.
The DataOps Manifesto sets out a great set of high level goals that everyone can agree with and support. Importantly, the manifesto defines the WHAT:
We subscribe wholeheartedly to the DataOps Manifesto.
#TrueDataOps focuses on value led development of pipelines (e.g. reduce fraud, improve customer experience, increase uptake, identify opportunities, etc.) to produce analytical processes that support and enable the achievement of that value.
#TrueDataOps starts with the truest principles of DevOps, Agile, Lean, test-driven development and Total Quality Management. We must then apply these principles to the unique discipline of data: data warehousing, data lakes, data management and data analytics - this encompasses everything from traditional analytics to machine learning, data science and AI.
The tension between governance and agility is the biggest risk to the achievement of value. And it is our view that this tension is not only unhelpful, but it doesn’t even need to exist. Our belief – and experience – indicates that technology has evolved to the point where governance and agility can be combined to deliver sustainable development by #TrueDataOps.
#TrueDataOps is a philosophy and a specific way of doing things. Just like DevOps, companies must embrace the philosophy before they can be successful with any #TrueDataOps project.
Of course just like DevOps, #TrueDataOps is underpinned, accelerated and improved through the use of the right technology at the right time.
Any implementation that lacks one of these pillars cannot call itself #TrueDataOps. More likely it will be another variation of DevOps with data management concepts added on.
Extract, load, transform (ELT), an alternative to traditional extract, transform, load (ETL), offers several benefits when used with data lake implementations. Where ETL transforms data before it enters the lake, ELT models store information in its original raw format. This enables faster loading times during analytics operations. And because the data is not processed on entry to the data lake, the query and schema do not need to be defined in advance. At the most basic level, ELT is a data pipeline model that offers “lift and shift with minimal changes”, but it’s more than that. By loading data before transformation no details are lost. This approach maximises future data processing and analysis potential by not throwing information away.
Implementing a new data source can be extremely costly, so it is important to do this once and only once. This may mean loading data even though we don’t know what we will do with it. By focusing on ELT, we take the energy out of the process by loading everything in the format that is available.
ELT also supports the principles of data governance, data auditability and data lineage because information ingested into the data lake or data warehouse is almost exactly what was extracted with zero, or near zero changes. Understanding the provenance and reliability of data derived from this ingested data is therefore far easier. In situations where there are “near zero changes” an ELT model must be an EtLT model; the small “t” reflects the business or regulatory need to reject certain secure or PII data items, or the need to obfuscate or hash data before it is transferred or loaded to any downstream system.
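As a minimal sketch of the small “t” in EtLT, the step below drops or hashes sensitive fields before a record is loaded and leaves everything else untouched. The field names (`email`, `national_id`, `card_number`), the policy sets and the choice of a salted SHA-256 digest are assumptions made purely for illustration; a real deployment would take its rules from its own regulatory requirements and manage the salt or key securely.

```python
import hashlib

# Hypothetical policy for illustration: fields to hash and fields to reject.
HASH_FIELDS = {"email", "national_id"}
DROP_FIELDS = {"card_number"}


def light_transform(record: dict, salt: str) -> dict:
    """Apply only the minimal 't': drop rejected fields, hash sensitive ones."""
    out = {}
    for key, value in record.items():
        if key in DROP_FIELDS:
            continue  # rejected before it reaches any downstream system
        if key in HASH_FIELDS and value is not None:
            # Salted SHA-256 hex digest replaces the raw value.
            payload = (salt + str(value)).encode("utf-8")
            out[key] = hashlib.sha256(payload).hexdigest()
        else:
            out[key] = value  # loaded exactly as extracted
    return out
```

Because the same salt and input always produce the same digest, hashed columns can still be joined on downstream without exposing the raw values.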
We already do this ALL the time – consider this database table: Employee ID, Employee Name, Salary, Job Title. When we reload the table a week later it looks exactly the same, which is all very standard. But we have also lost information, not by transforming in flight like ETL, but by overwriting valuable data; specifically we have lost all record of the old values and therefore the history. The "Spirit of ELT" takes the concept of ELT further and advocates that we avoid ALL actions which remove data that could be useful or valuable to someone later, including changes in data. This has been a well understood problem, to which there are well understood solutions, but they are time consuming and expensive - in order to be able to use these everywhere, we need to make the cost of configuration and implementation trivially low.
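One way to picture the "Spirit of ELT" is an append-only load: instead of overwriting the employee table on each reload, every new or changed row is appended as a new version. The sketch below is an assumption-laden illustration, not a prescribed method; the function name `load_with_history`, the `loaded_on` column and the in-memory lists standing in for tables are all hypothetical, and deleted rows are not handled.

```python
from datetime import date


def load_with_history(history: list, snapshot: list, key: str, load_date: date) -> list:
    """Append a new version for every new or changed row; never overwrite."""
    # Latest known version per key (later rows win, since history is append-only).
    latest = {}
    for row in history:
        latest[row[key]] = row

    result = list(history)
    for incoming in snapshot:
        current = latest.get(incoming[key])
        unchanged = current is not None and all(
            current.get(col) == val for col, val in incoming.items()
        )
        if not unchanged:
            # Keep the old version and add the new one alongside it.
            result.append({**incoming, "loaded_on": load_date})
    return result
```

Loading the same snapshot twice adds nothing, while a salary or job title change adds one row and leaves the previous values queryable.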
Continuous Integration/Continuous Delivery in the world of development is the practice of frequently integrating new or changed code with the existing central code repository, building and testing it and having it in a position to deploy on demand (the CD can also stand for Continuous Deployment where at this point it's automatically pushed to production). Under #TrueDataOps integrations should occur frequently enough that no intervening window remains between commit and build, and such that no errors can arise without developers noticing them and correcting them immediately.
To achieve these objectives, continuous integration relies on the use of a revision control system and branches to manage the project's source code. All artefacts required to build the project should be placed in this repository. In this practice and in the revision control community, convention dictates that the system should be buildable from a fresh checkout without requiring additional dependencies.
In the new world of data development, all logic and jobs performed on the data as part of an overall pipeline (or pipelines) must be treated as a single code base. The same principles of management and governance are also applied.
While #TrueDataOps also started as a set of best practices, it is evolving into an independent approach to data lifecycle management. DataOps applies to the entire lifecycle - from integrating with model generation, orchestration, and deployment, to health, diagnostics, governance, and business metrics.
Similar to DevOps code development, prototyping begins with a new branch, including data. Importantly, prototyping can take place within the same system with continuous improvements being made until it is ready to go live. The ability to branch will help you escape the traditional waterfall approach which requires a definitive business goal before development can begin. Instead you can take a general goal and create a new branch (which includes configuration, code AND data, i.e. a fully self-contained and complete sandbox). You can then build a first version and keep iterating and testing until you deliver against the stakeholder’s (sometimes vague!) requirements without compromising data integrity or quality.
DataOps also seeks to streamline data lifecycle management – or even to make it completely invisible. For example, choosing to build on the Snowflake Cloud Data Platform makes data lifecycle management redundant. Not only is data automatically compressed, but it is built on AWS S3, Google Cloud storage, or Azure blob technologies, the cheapest storage options available today. All cloud data platforms will follow this design pattern over the next few years, making environment management seamless and data lifecycle management effectively redundant.
Cloud computing technologies place (theoretically) infinite resources at the disposal of your developers. Taking advantage of this increase in computing power, #TrueDataOps shifts the attention from CPU time to developer productivity.
Ultimately you pay the same for CPU cycles – you just decide whether code runs faster or slower. Optimising code prematurely is a drag on your developers' time, reducing the number of stakeholder changes they are able to process.
In engineering, maintainability is the ease with which a product can be maintained in order to minimise the downtime or impact to the end user. There are principles of component or code design that have been developed and refined over many years that enable a code base to be more maintainable and robust for the long term. Take some of Eric Raymond's rules from The Art of Unix Programming, published nearly 20 years ago but still the seminal work in this area, which suggest that maintainability can be improved by:
Whether it's "programs" as it was then, or "components" as it is now, these principles remain just as valuable.
The #TrueDataOps philosophy favours small, atomic pieces of code that can be reused quickly and easily to reduce overall prototyping and development time. Components in effect. By preferring configuration over code wherever possible, choosing low-code everywhere else and reducing code to small, reusable components, it becomes easier to refine, optimise and replace sections without affecting the end user experience.
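To illustrate "configuration over code" with small reusable components, the sketch below registers a couple of tiny row-level transforms and assembles a pipeline purely from a list of step names. The component names, the `COMPONENTS` registry and `build_pipeline` are hypothetical and exist only for this example; the list of names stands in for whatever configuration format a real platform would use.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]
Step = Callable[[Row], Row]


def trim_strings(row: Row) -> Row:
    """Strip surrounding whitespace from every string value."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}


def lowercase_keys(row: Row) -> Row:
    """Normalise column names to lower case."""
    return {k.lower(): v for k, v in row.items()}


# Registry of small, independently replaceable components.
COMPONENTS: Dict[str, Step] = {
    "trim_strings": trim_strings,
    "lowercase_keys": lowercase_keys,
}


def build_pipeline(step_names: Iterable[str]) -> Step:
    """Assemble a pipeline from configuration (a list of component names)."""
    steps: List[Step] = [COMPONENTS[name] for name in step_names]

    def run(row: Row) -> Row:
        for step in steps:
            row = step(row)
        return row

    return run
```

Changing the pipeline then means editing the configured list, for example `build_pipeline(["lowercase_keys", "trim_strings"])`, rather than rewriting code, and a single component can be refined or replaced without touching the others.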
Environment management is one of the most complex elements of #TrueDataOps. Organisations doing web development have cracked not just multiple long-lived environments (such as PROD, QA, DEV) but also spinning up multiple environments dynamically for a specific purpose. These "feature branches" enable an individual engineer to do their own complete integration testing. They use technologies such as Kubernetes, where creating a new environment to run a specific version of code can be done in seconds. Typical data environments are a long way away from this; even the most advanced organisations rarely have more than 2 or 3 manually created long-lived environments. These organisations usually spend an inordinate amount of time trying to manage how the environments differ from each other, since they are manually created and updated and therefore diverge quickly.
Great environment management for #TrueDataOps requires all environments to be built, changed and (where relevant) destroyed automatically. It also requires something not really relevant in the web world: a fast way of making a new environment look just like Production. This could involve the duplication of TB of data.
For example, the Snowflake data platform is one of the only cloud data platforms providing underlying capabilities (such as Zero Copy Cloning) that enable #TrueDataOps environment management facilities today. But many will follow. Cloud data platforms with these capabilities could do for data analytics what Kubernetes has done for microservices.
Done correctly, properly automated Environment Management is not just a massive step forward for the development of an organisation's Data Platform. It also reduces costs since manual steps for creating, destroying and (worst of all) trying to cross check environments, are all automated.
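As a rough sketch of what automated environment creation and teardown could look like, the helpers below derive a database name from a branch name and generate the Snowflake statements to clone Production into it and to drop it again. The `DEV` prefix, the `PROD` source database name and the helper names are assumptions for illustration only; the statements are returned as strings rather than executed, so wiring them to a real connection is left out.

```python
import re


def environment_name(branch: str, prefix: str = "DEV") -> str:
    """Derive a safe database identifier from a branch name."""
    cleaned = re.sub(r"[^A-Za-z0-9]+", "_", branch).strip("_").upper()
    if not cleaned:
        raise ValueError("branch name has no usable characters")
    return f"{prefix}_{cleaned}"


def create_environment_sql(branch: str, source_db: str = "PROD") -> str:
    """Zero Copy Clone of the source database for this branch."""
    # source_db is trusted here; a real tool would validate it too.
    return f"CREATE DATABASE {environment_name(branch)} CLONE {source_db}"


def destroy_environment_sql(branch: str) -> str:
    """Tear the branch environment down once it is no longer needed."""
    return f"DROP DATABASE IF EXISTS {environment_name(branch)}"
```

For a branch called `feature/add-orders`, these produce `CREATE DATABASE DEV_FEATURE_ADD_ORDERS CLONE PROD` and the matching `DROP DATABASE IF EXISTS` statement, which a CI job could run when the branch is opened and merged.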
#TrueDataOps requires Governance By Design and Privacy By Design - that is to say, to be effective, a #TrueDataOps Platform needs to have very strong Governance and Privacy models included in the core by design, and not something retrofitted later.
Faced with multi-million pound fines for improper use of data, governance is essential to data operations. Under the #TrueDataOps philosophy, every change is subject to automated testing to catch errors and bugs. The philosophy also demands that there are always two pairs of eyes on everything, ensuring a manual check is completed before each pull request and code merge.
You also need to define your source of truth, the baseline by which data quality and accuracy is measured. When dealing with software, the source of truth is the application’s supporting code repository rather than the application itself. The same principle exists under the #TrueDataOps philosophy: the code repository supporting your code and components is the definitive source of truth for your applications. Governance is further strengthened with automatically created audit trails. Every change, test and approval is fully documented and stored forever, allowing you to prove how data has been, and is being, used.
The reality is that, traditionally, the more money you invest into your data platform, the more expensive future work will be. Increasing complexity, platform sprawl, more code: everything combines to make testing more convoluted and resource intensive – and therefore expensive – particularly as you continue to scale upwards.
If a data team already cannot keep up with stakeholder requests, then what chance do they have of completing all required testing? The solution is to use automated data testing to identify issues and relieve the burden on the team.
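A minimal sketch of what automated data tests might look like is shown below: two generic checks (not-null and uniqueness) and a runner that reports only the failures. The check names, the tuple-based test definition and the in-memory rows are hypothetical; a real platform would run equivalent checks against the warehouse on every change.

```python
from typing import Callable, Dict, List, Tuple

Rows = List[dict]
Check = Callable[[Rows, str], List[int]]


def check_not_null(rows: Rows, column: str) -> List[int]:
    """Return the indexes of rows where the column is missing or None."""
    return [i for i, row in enumerate(rows) if row.get(column) is None]


def check_unique(rows: Rows, column: str) -> List[int]:
    """Return the indexes of rows that repeat an earlier value."""
    seen, duplicates = set(), []
    for i, row in enumerate(rows):
        value = row.get(column)
        if value in seen:
            duplicates.append(i)
        seen.add(value)
    return duplicates


def run_tests(rows: Rows, tests: List[Tuple[str, Check, str]]) -> Dict[str, List[int]]:
    """Run every (name, check, column) test and keep only the failures."""
    failures = {}
    for name, check, column in tests:
        bad_rows = check(rows, column)
        if bad_rows:
            failures[name] = bad_rows
    return failures
```

An empty result means the change can proceed; anything else can fail the pipeline before the data reaches production.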
In the fast-paced agile development environment it is impossible to balance optimism (the belief that a new deployment will work first time) with a lack of development process. Automated testing and monitoring provides a way to counter optimism and detect problems caused by optimism.
Organisations like Amazon, Netflix and Etsy are regarded as world leaders when it comes to rapid testing and deployment. They can successfully deploy new updates incredibly fast (in some cases, every few seconds) with near-perfect availability – particularly impressive when you consider they manage millions of lines of code. Automated testing is the key capability that underpins this, and indeed underpins most of the efficiency improvements in software development over the last two decades.
The mathematics for this is simple - if you want to confidently release, you have to test every part of the system ("total" test coverage). As your system gets bigger, this is more work. If you want to release faster, you have to repeat this far more frequently. The only way to solve the equation where the test sets are getting larger and the frequency of execution is getting orders of magnitude faster is full automation.
Automated testing is about trying to stop "bad stuff" getting into production. Automated monitoring is about accepting that, while automated testing is critically important, it can also never be perfect, and eventually something will slip through, and this is where good monitoring is key. The most advanced DevOps organisations today don't see this as a failing, but as a reality that they can control.
#TrueDataOps monitoring involves more than confirming system availability. It also checks the production quality of all data and artifacts in your analytics processes, and tests new code during deployment. By detecting unexpected variations and generating operations statistics, your data team gathers new insights that can be used to further enhance performance – of the pipeline and the underlying platform.
It is also important to realise that the definition of data availability has been expanded under #TrueDataOps. As well as being able to run queries, #TrueDataOps defines availability as the ability to return data that is valid for decision making. Even if your platform achieves 99.999% uptime, availability may not be as high as you think if it cannot deliver actionable insights. It is perfectly possible that broken pipelines are compromising your data. But without automated testing these failures may be undetected – and the accuracy of your analytics affected.
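The expanded definition of availability can be sketched as a monitoring check that asks whether the data is valid for decision making, not merely whether the platform answers queries. The thresholds (a 24 hour freshness window and a minimum row count) and the function names are assumptions chosen for illustration; real validity rules would be specific to each dataset.

```python
from datetime import datetime, timedelta
from typing import List


def is_available(
    last_loaded_at: datetime,
    row_count: int,
    now: datetime,
    max_age: timedelta = timedelta(hours=24),
    min_rows: int = 1,
) -> bool:
    """Data counts as available only if it is fresh and populated."""
    fresh = now - last_loaded_at <= max_age
    populated = row_count >= min_rows
    return fresh and populated


def data_availability(check_results: List[bool]) -> float:
    """Fraction of monitoring checks in which the data was usable."""
    if not check_results:
        return 0.0
    return sum(1 for ok in check_results if ok) / len(check_results)
```

Measured this way, a platform that never went down but served stale data for one check in four would report 75% data availability rather than its uptime figure.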
Collaboration and self-service are the recognisable results and clear benefits of #TrueDataOps having been delivered. And we can think of this in two ways.
Firstly, collaboration and self-service at a development and operations level. As we have seen throughout this philosophy, there are many different people and teams that contribute to different parts of an overall data and analytics platform: data engineers, data analysts, data prep teams, data scientists, visualisation teams, machine learning teams and so on, all using their own specific tools to do their jobs. It's like a manufacturing line developing data products, and it's the customer that receives that product at the end of the line. Today all those teams operate independently, coordination is both challenging and fragile, and product quality is unpredictable at best. #TrueDataOps is designed to coordinate and operate this like a highly performing, predictable, quality orientated, industrial manufacturing line. It expects a heterogeneous tooling environment and is focused on enabling all those teams to collaborate efficiently, and orchestrating the entire lifecycle to deliver the highest quality product to the end customer.
Secondly, collaboration and self-service at a customer and stakeholder level. Most organisations are trying to become more data-driven - they want data insights to inform decision making. This requires pulling data from multiple different parts of the business into one place. The theory goes that once we have all the data in one place, everyone can use it, can answer their own questions and solve their own use cases by joining multiple datasets together. Historically this has been seen as a technological problem, solved with a Data Warehouse where we can assemble all these different data sets together. This is, of course, part of the solution. However the reality for most enterprises has been somewhat different.
The ideal state here has always been that, with enough access to data across the whole organisation, people will be able to create and discover insight that had never occurred to them in advance. The reality today is quite different: usually you need to justify how and why you need the data to get access to it, and this destroys the discovery element that is so core to a data-driven business.
To address these issues, and close the gaps to the original data-driven business goals, #TrueDataOps requires:
In order to uphold the Seven Pillars of #TrueDataOps, any underlying platform must have certain capabilities:
With every system, the bigger and more complex it is, the more testing is needed to ensure that new features and capabilities don’t compromise it. With an old-fashioned, manual approach, this testing burden increases exponentially over time. In theory, it can increase to the point where Development Efficiency (productivity) is effectively zero. However, in reality, organisations do not allow this. Instead, time allocated to testing is capped. As the system gets more complex, the net effect is that less of the system is tested, and thereby major risk is introduced.
At the very start, with a totally new project, the automated Data Regression Testing approach of DataOps is less efficient than the legacy approach...
This is because, in addition to development work, automated tests need to be created. However, as a data team gets familiar with DataOps techniques they become more efficient at creating Data Regression Tests. As a result, their efficiency increases.
The point at which DataOps becomes more efficient than the legacy approach is usually reached very quickly.
Very quickly, the efficiency of the #TrueDataOps approach and the legacy approach diverge and a gulf between them is created. The gulf represents the increased cost, reduced efficiency and lost opportunity of the introduced technical debt.
Data environments of the future will be unable to function effectively without DataOps. Those adopting DataOps will have the agility to cope with the data tsunami, the rigour to maintain control and the power to deliver business value. The rest will face ever-increasing costs and diminishing value, and will ultimately reach a crisis in which they are replaced.
Over the next 5 years every organisation will move to DataOps. It will be the only way of sensibly building and governing data environments.
#TrueDataOps is the purest philosophy for the way to deliver DataOps, and achieve the greatest ROI from its implementation.