The 7 Pillars of #TrueDataOps
1. ELT (and the Spirit of ELT)
Extract, load, transform (ELT)—an alternative to traditional extract, transform, load (ETL)—offers several benefits when used with data lake implementations. Where ETL transforms data before it enters the lake, ELT models store information in its original raw format. This enables faster loading times during analytics operations. And because the data is not processed on entry to the data lake, the query and schema do not need to be defined in advance.
At the most basic level, ELT is a data pipeline model that offers “lift and shift with minimal changes.” But it’s more than that, too. By loading data before transformation, no details are lost. This approach maximizes future data processing and analysis potential by not throwing information away.
Implementing a new data source can be extremely costly, so it is important to do this once and only once. This may mean loading data even though we don’t know what we will do with it. By focusing on ELT, we take the energy out of the process by loading everything in the format that is available.
ELT also supports the principles of data governance, data auditability, and data lineage because information ingested into the data lake or data warehouse is almost exactly what was extracted with zero, or near-zero changes. Understanding the provenance and reliability of data derived from this ingested data is therefore far easier.
In situations where there are “near zero changes” an ELT model must be an EtLT model; the small “t” reflects the business or regulatory need to reject certain secure or PII data items, or the need to obfuscate or hash data before it is transferred or loaded to any downstream system.
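As a minimal sketch of the small "t" step, the Python below drops rejected fields and hashes PII fields before a record is loaded, passing everything else through unchanged. The field names, the two classification sets, and the salt handling are assumptions for illustration only, not part of any specific platform.

```python
import hashlib

# Hypothetical classifications; a real pipeline would read these from governance config.
REJECTED_FIELDS = {"national_id"}
HASHED_FIELDS = {"employee_name"}


def light_transform(record: dict, salt: str) -> dict:
    """Apply the small "t": drop rejected fields, hash PII, pass the rest through."""
    loaded = {}
    for field, value in record.items():
        if field in REJECTED_FIELDS:
            continue  # this field never leaves the source system
        if field in HASHED_FIELDS:
            digest = hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()
            loaded[field] = digest
        else:
            loaded[field] = value  # raw value, loaded unchanged
    return loaded


raw = {"employee_id": 17, "employee_name": "Ada", "national_id": "X1", "salary": 50000}
print(light_transform(raw, salt="example-salt"))
```

Everything that is neither rejected nor hashed arrives exactly as extracted, which keeps the lineage benefits described above.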
We already lose data this way ALL the time, even without transformation. Consider a database table with the columns Employee ID, Employee Name, Salary, and Job Title. When we reload the table a week later, it has exactly the same structure, which is standard. But if the reload overwrote the previous contents, we may have lost valuable historical data. Even though there was no transformation, we’ve lost the history.
The "Spirit of ELT" takes the concept of ELT further and advocates that we avoid ALL actions which remove data that could be useful or valuable to someone later, including changes in data. This has been a well-understood problem, to which there are well-understood solutions, but they're time-consuming and expensive. To use these solutions everywhere, we need to make the cost of configuration and implementation trivial.
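One of the well-understood solutions is to append each load with a load timestamp rather than overwrite the table. The sketch below assumes an in-memory list standing in for a table and hypothetical column names; it illustrates the idea only and is not a particular product's implementation.

```python
from datetime import datetime, timezone


def append_snapshot(history: list, rows: list, loaded_at: datetime) -> None:
    """Append every incoming row with its load time instead of overwriting."""
    for row in rows:
        history.append({**row, "loaded_at": loaded_at})


def latest_view(history: list, key: str) -> dict:
    """Rebuild the 'current' table: the most recently loaded row per key."""
    current = {}
    for row in sorted(history, key=lambda r: r["loaded_at"]):
        current[row[key]] = row  # later loads replace earlier ones in the view only
    return current


history = []
append_snapshot(history, [{"employee_id": 1, "salary": 50000}],
                datetime(2024, 1, 1, tzinfo=timezone.utc))
append_snapshot(history, [{"employee_id": 1, "salary": 55000}],
                datetime(2024, 1, 8, tzinfo=timezone.utc))

print(latest_view(history, "employee_id")[1]["salary"])  # 55000
print(len(history))  # 2: the earlier salary is still stored
```

The "current" table becomes a view over the history, so the change in salary is kept rather than silently discarded.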
2. Continuous Integration/Continuous Delivery (including Orchestration)
In the world of development, Continuous Integration/Continuous Delivery (CI/CD) is the practice of frequently integrating new or changed code with an existing central code repository, building and testing it, and having it in a position to deploy on demand. The CD can also stand for Continuous Deployment, in which each change that passes the build and tests is automatically pushed to production.
Under #TrueDataOps, integrations should occur often enough that no intervening window remains between commit and build, and that no errors can arise without developers noticing and correcting them immediately.
To achieve these objectives, continuous integration relies on using a revision control system and branches to manage the project's source code. All artifacts required to build the project should be placed in this repository. In this practice and in the revision control community, convention dictates that the system should be buildable from a fresh checkout without requiring additional dependencies.
In the new world of data development, all logic and jobs performed on the data as part of an overall pipeline (or pipelines) must be treated as a single code base. The same principles of management and governance are also applied.
While #TrueDataOps also started as a set of best practices, it is evolving into an independent approach to data lifecycle management. DataOps applies to the entire lifecycle, from integrating with model generation, orchestration, and deployment, to health, diagnostics, governance, and business metrics.
Similar to DevOps code development, prototyping begins with a new branch—including data. Importantly, prototyping can take place within the same system with continuous improvements being made until it is ready to go live. The ability to branch will help you escape the traditional waterfall approach, which requires a definitive business goal before development can begin. Instead, you can take a general goal and create a new branch (which includes configuration, code, AND data, i.e., a fully self-contained and complete sandbox). You can then build a first version and can keep iterating and testing until you deliver against the stakeholder’s (sometimes vague!) requirements without compromising data integrity or quality.
DataOps also seeks to streamline data lifecycle management, or even to make it completely invisible. For example, choosing to build on the Snowflake Data Cloud makes data lifecycle management redundant. Not only is data automatically compressed, but the platform is built on AWS S3, Google Cloud Storage, or Azure Blob Storage, which are among the cheapest storage options available today. All cloud data platforms will follow this design pattern over the next few years, making environment management seamless and data lifecycle management effectively redundant.
3. Component design and maintainability
Cloud computing technologies place (theoretically) infinite resources at the disposal of your developers. Taking advantage of this increase in computing power, #TrueDataOps shifts attention from CPU time to developer productivity.
Ultimately, you pay the same for CPU cycles—you just decide whether code runs faster or slower. Optimizing code prematurely is a drag on your developers’ time, reducing the number of stakeholder changes they can process.
In engineering, maintainability is the ease with which a product can be maintained to minimize the downtime or impact to the end user. There are principles of component or code design that have been developed and refined over many years that enable a code base to be more maintainable and robust for the long term.
Take some of Eric Raymond's rules from The Art of Unix Programming. Though published nearly 20 years ago, it’s still the seminal work in this area, and it suggests that maintainability can be improved by:
- Building modular programs
- Writing readable programs
- Using composition
- Writing simple programs
- Writing small programs
- Writing transparent programs
- Writing robust programs
- Building on potential users' expected knowledge
- Avoiding unnecessary output
- Writing programs that fail in a way that's easy to diagnose
- Valuing developer time over machine time
- Writing abstract programs that generate code instead of writing code by hand
- Prototyping software before polishing it
- Writing flexible and open programs
- Making the program and protocols extensible
Whether it's "programs" as it was then, or "components" as it is now, these principles remain just as valuable.
The #TrueDataOps philosophy favors small, atomic pieces of code that can be reused quickly and easily to reduce overall prototyping and development time. In effect, components.
By preferring configuration over code wherever possible, choosing low-code everywhere else, and reducing code to small, reusable components, it becomes easier to refine, optimize, and replace sections without affecting the end user experience.
4. Environment management
Environment management is one of the most complex elements of #TrueDataOps. Organizations doing web development have cracked not just multiple long-lived environments (such as PROD, QA, DEV) but also spinning up multiple environments dynamically for specific purposes.
These "feature branches" enable an individual engineer to do their own complete integration testing. They use technologies such as Kubernetes, where creating a new environment to run a specific version of code can be done in seconds. Typical data environments are a long way away from this; even the most advanced organizations rarely have more than 2 or 3 manually created long-lived environments. These organizations usually spend an inordinate amount of time trying to manage how the environments differ from each other, since they are manually created and updated and therefore diverge quickly.
Great environment management for #TrueDataOps requires all environments to be built, changed, and (where relevant) destroyed automatically. It also requires something not relevant in the web world: a fast way of making a new environment look just like Production. This could involve duplicating terabytes of data.
For example, the Snowflake data platform is one of the few cloud data platforms providing underlying capabilities (such as Zero Copy Cloning) that enable #TrueDataOps environment management facilities today. But many will follow. Cloud data platforms with these capabilities could do for data analytics what Kubernetes has done for microservices.
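To illustrate how a branch-per-environment workflow could drive this, the sketch below turns a branch name into a database name and builds the Snowflake statements to clone Production and later drop the clone. The naming scheme, the function names, and the PROD source database are assumptions for illustration; the statements are only built as strings here, not executed.

```python
import re


def env_name_for_branch(branch: str, prefix: str = "DEV") -> str:
    """Derive a database-safe name from a branch name (hypothetical naming scheme)."""
    slug = re.sub(r"[^A-Za-z0-9]+", "_", branch).strip("_").upper()
    return f"{prefix}_{slug}"


def create_env_sql(branch: str, source_db: str = "PROD") -> str:
    """Statement that creates the branch environment as a clone of the source database."""
    return f"CREATE DATABASE {env_name_for_branch(branch)} CLONE {source_db};"


def drop_env_sql(branch: str) -> str:
    """Statement that destroys the branch environment once the branch is merged."""
    return f"DROP DATABASE IF EXISTS {env_name_for_branch(branch)};"


print(create_env_sql("feature/add-sales-report"))
# CREATE DATABASE DEV_FEATURE_ADD_SALES_REPORT CLONE PROD;
print(drop_env_sql("feature/add-sales-report"))
# DROP DATABASE IF EXISTS DEV_FEATURE_ADD_SALES_REPORT;
```

Because creation and destruction are generated from the branch name, no environment needs to be built or cross-checked by hand.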
Done correctly, properly automated Environment Management is a massive step forward for the development of an organization's data platform. It also reduces costs by automating manual steps for creating, destroying, and (worst of all) trying to cross-check environments.
5. Governance and change control
#TrueDataOps requires Governance By Design and Privacy By Design. In other words, to be effective, a #TrueDataOps Platform needs to have very strong Governance and Privacy models included in the core by design, and not something retrofitted later.
With organizations facing multi-million-dollar fines for improper use of data, governance is essential to data operations. Under the #TrueDataOps philosophy, every change is subject to automated testing to catch errors and bugs. The philosophy also demands that there are always two pairs of eyes on everything, ensuring a manual check is completed before each pull request is merged.
You also need to define your source of truth—the baseline by which data quality and accuracy is measured. When dealing with software, the source of truth is the application’s supporting code repository rather than the application itself.
The same principle exists under #TrueDataOps philosophy. The code repository supporting your code and components is the definitive source of truth for your applications.
Governance is further strengthened with automatically created audit trails. Every change, test, and approval is fully documented and stored forever, allowing you to prove how data has been and is being used.
6. Automated data testing and monitoring
The reality is that, traditionally, the more money you invest into your data platform, the more expensive future work will be. Increasing complexity, platform sprawl, more code: everything combines to make testing more convoluted and resource intensive, and therefore expensive, particularly as you continue to scale upwards.
If a data team already can't keep up with stakeholder requests, then what chance do they have of completing all required testing? The solution is to use automated data testing to identify issues and relieve the burden on the team.
In a fast-paced agile development environment, optimism (the belief that a new deployment will work first time) cannot make up for a lack of development process. Automated testing and monitoring provide a way to counter that optimism and detect the problems it causes.
Organizations like Amazon, Netflix, and Etsy are regarded as world leaders when it comes to rapid testing and deployment. They can successfully deploy new updates incredibly fast (in some cases, every few seconds) with near-perfect availability. This is particularly impressive when you consider they manage millions of lines of code. Automated testing is the key capability that underpins this, and indeed underpins most of the efficiency improvements in software development over the last two decades.
The mathematics for this is simple; if you want to confidently release, you have to test every part of the system (“total” test coverage). As your system gets bigger, this is more work. If you want to release faster, you have to repeat this far more frequently. The only way to solve the equation where the test sets are getting larger and the frequency of execution is getting orders of magnitude faster is full automation.
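The arithmetic can be made concrete with a small sketch. All figures below are illustrative assumptions, not measurements: the workload is taken to be the number of tests multiplied by the number of releases.

```python
def executions_per_year(num_tests: int, releases_per_year: int) -> int:
    """Total test executions needed for full coverage on every release."""
    return num_tests * releases_per_year


def manual_hours_per_year(num_tests: int, releases_per_year: int,
                          minutes_per_test: float) -> float:
    """Effort required if every test were run by hand."""
    return executions_per_year(num_tests, releases_per_year) * minutes_per_test / 60


# Illustrative scenarios only: a small system released quarterly
# versus a larger system released every working day.
small_slow = executions_per_year(num_tests=500, releases_per_year=4)
large_fast = executions_per_year(num_tests=5000, releases_per_year=250)

print(small_slow)                # 2000
print(large_fast)                # 1250000
print(large_fast // small_slow)  # 625 times the workload
print(round(manual_hours_per_year(5000, 250, minutes_per_test=5)))
```

Ten times the tests and roughly sixty times the release frequency multiply into hundreds of times the work, which is why manual testing cannot keep pace.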
Automated testing is about trying to stop "bad stuff" getting into production. Automated monitoring is about accepting that automated testing, while critically important, can never be perfect. Eventually something will slip through, and this is where good monitoring is key. The most advanced DevOps organizations today don't see this as a failing, but as a reality that they can control.
#TrueDataOps monitoring involves more than confirming system availability. It also checks the production quality of all data and artifacts in your analytics processes, and tests new code during deployment. By detecting unexpected variations and generating operations statistics, your data team gathers new insights that can be used to further enhance performance of the pipeline and the underlying platform.
It is also important to realize that the definition of data availability has been expanded under #TrueDataOps. As well as being able to run queries, #TrueDataOps defines availability as the ability to return valid data for decision-making.
Even if your platform achieves 99.999% uptime, availability may not be as high as you think if it cannot deliver actionable insights. It is perfectly possible that broken pipelines are compromising your data. But without automated testing, these failures may go undetected and negatively affect the accuracy of your analytics.
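A minimal sketch of what automated data tests might look like, assuming rows held as Python dictionaries and hypothetical column names and rules. A real deployment would more likely run equivalent checks inside the pipeline against the data platform itself.

```python
def check_not_null(rows: list, column: str) -> list:
    """Return the indexes of rows where the column is missing or None."""
    return [i for i, row in enumerate(rows) if row.get(column) is None]


def check_unique(rows: list, column: str) -> list:
    """Return the indexes of rows that repeat an earlier value in the column."""
    seen, duplicates = set(), []
    for i, row in enumerate(rows):
        value = row.get(column)
        if value in seen:
            duplicates.append(i)
        seen.add(value)
    return duplicates


def check_range(rows: list, column: str, low: float, high: float) -> list:
    """Return the indexes of rows whose non-null value falls outside [low, high]."""
    return [i for i, row in enumerate(rows)
            if row.get(column) is not None and not (low <= row[column] <= high)]


def run_data_tests(rows: list) -> dict:
    """Run all checks and return only the failing ones, keyed by test name."""
    results = {
        "employee_id not null": check_not_null(rows, "employee_id"),
        "employee_id unique": check_unique(rows, "employee_id"),
        "salary in range": check_range(rows, "salary", 0, 1_000_000),
    }
    return {name: bad for name, bad in results.items() if bad}


rows = [
    {"employee_id": 1, "salary": 50000},
    {"employee_id": 1, "salary": -5},
    {"employee_id": None, "salary": 40000},
]
failures = run_data_tests(rows)
print(failures)
print("data available" if not failures else "data NOT available for decisions")
```

Under the expanded definition of availability, a table that can be queried but fails these checks would not count as available.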
7. Collaboration and self-service
Collaboration and self-service are the recognizable results and clear benefits of a #TrueDataOps delivery. And we can think of this in two ways.
#1: Collaboration and self-service at a development and operations level.
As we have seen throughout this philosophy, there are many different people and teams that contribute to different parts of an overall data and analytics platform. Data engineers, data analysts, data prep teams, data scientists, visualization teams, machine learning teams, etc.—and they all use their own specific tools to do their jobs.
It's like a manufacturing line developing data products, and it's the customer that receives that product at the end of the line. Today all those teams operate independently; coordination is both challenging and fragile, and product quality is unpredictable at best.
#TrueDataOps is designed to coordinate and operate like a highly performing, predictable, quality-orientated, industrial manufacturing line. It expects a heterogeneous tooling environment and is focused on enabling all those teams to collaborate efficiently, and orchestrating the entire lifecycle to deliver the highest quality product to the end customer.
#2: Collaboration and self-service at a customer and stakeholder level.
Most organizations are trying to become more data-driven. They want data insights to inform decision-making. This requires pulling data from multiple different parts of the business into one place. The theory goes that once we have all the data in one place, everyone can use it, answer their own questions, and solve their own use cases by joining multiple datasets together.
Historically, this has been seen as a technological problem, solved with a data warehouse where we can assemble all these different data sets together. This is, of course, part of the solution. However, the reality for most enterprises has been somewhat different.
- It's very hard for typical business users to understand the data that's available and how to use it. The more successful an organization is at bringing data sets together, the worse this problem gets.
- Many datasets have some level of data sensitivity. This has meant that while datasets are loaded into the same data warehouse, access to them has been restricted to just the department that originated them (and a few select data gurus). Technology silos have been replaced with privacy silos.
- Ingesting new data has been a long, painful, and expensive process and has therefore needed a strong business case. Where one or two users just wanted a simple spreadsheet included so that they could join/analyze some other data already in the data warehouse, this has rarely been possible or cost-effective.
The ideal state here has always been that, with enough access to data across the whole organization, people will be able to create and discover insights that had never occurred to them in advance.
The reality today is quite different: usually you need to justify how and why you need the data to get access to it, and this destroys the discovery element that is so core to a data-driven business.
To address these issues, and close the gaps to the original data-driven business goals, #TrueDataOps requires:
- A business-user-facing data catalog for users to be able to search and understand the data available and how to use it.
- A way of creating different versions of data with different levels of anonymization, so that datasets with some level of sensitivity can still be safely shared with the rest of the organization, allowing the discovery element.
- Very lightweight and fast ways of getting simpler data in from end users. This doesn't necessarily have to follow all the same controls and processes as the main organizational data—what we might call "Community Data."
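The second requirement above can be pictured as a function that renders the same record at different anonymization levels. The level names, the column classification, and the masking rules below are assumptions chosen for illustration; a production system would normally use a keyed hash or a tokenization service rather than the plain hash shown here.

```python
import hashlib

# Hypothetical classification of sensitive columns.
IDENTIFIER_COLUMNS = {"employee_name"}
NUMERIC_SENSITIVE_COLUMNS = {"salary"}


def anonymize(record: dict, level: str) -> dict:
    """Return a copy of the record at the requested anonymization level.

    raw           -> unchanged (for the originating department)
    pseudonymized -> identifiers hashed, sensitive numbers banded to 10,000s
    redacted      -> sensitive columns removed entirely
    """
    if level not in {"raw", "pseudonymized", "redacted"}:
        raise ValueError(f"unknown level: {level}")
    sensitive = IDENTIFIER_COLUMNS | NUMERIC_SENSITIVE_COLUMNS
    result = {}
    for column, value in record.items():
        if level == "raw" or column not in sensitive:
            result[column] = value
        elif level == "pseudonymized":
            if column in IDENTIFIER_COLUMNS:
                result[column] = hashlib.sha256(str(value).encode("utf-8")).hexdigest()[:12]
            else:
                result[column] = (value // 10_000) * 10_000  # band to nearest lower 10,000
        # level == "redacted": sensitive column is dropped
    return result


record = {"employee_id": 1, "employee_name": "Ada", "salary": 54321, "job_title": "Engineer"}
for level in ("raw", "pseudonymized", "redacted"):
    print(level, anonymize(record, level))
```

Each level can then be published to a different audience, so sensitive datasets stay discoverable without exposing the underlying values.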
Data environments of the future will be unable to function effectively without DataOps. Those adopting DataOps will have the agility to cope with the data tsunami, the rigor to maintain control and the power to deliver business value. The rest will face ever-increasing costs, reducing value and ultimately creating a crisis in which they are replaced.
Over the next 5 years, every organization will move to DataOps. It will be the only way of sensibly building and governing data environments.
#TrueDataOps is the purest philosophy for the way to deliver DataOps, and achieve the greatest ROI from its implementation.
What any DataOps technology needs to provide
To uphold the Seven Pillars of #TrueDataOps, any underlying platform must have certain capabilities:
- Source control: The ability to branch code during development will accelerate development cycles without compromising existing operations or, more importantly, your data.
- End-to-end orchestration: Automation is key to accelerating development, so you need a platform that offers push-button execution of your data pipelines. Importantly, the platform needs to be able to execute multiple different pipelines against the same data.
- Environment Management: Your chosen platform must provide functionality to create and delete code and data branches quickly and efficiently.
- Automated Testing and Monitoring: You need a platform that identifies code and performance failings before completing a deployment, monitors for rare production issues, and alerts engineers so that they can roll back updates immediately.
- Configuration Control and Governance: To ensure your business is upholding its legal and regulatory obligations, the platform needs to automatically document how data is being accessed, processed, and transformed.
- Collaboration: The platform must open up access to data, both within the data team and out to the rest of the organization, without compromising governance.