With tech giants such as Facebook and Google monetizing data to drive innovation and create competitive advantage in the digital economy, data has become the new oil, gold, or lifeblood. Affordable big data storage has led to the creation of data swamps, and organizations are going gaga over them. With massive amounts of data available, more and more business segments want to explore it and make data-driven decisions. This has led to the growth of a “self-serve your data” culture that aims to reduce the time-to-market of analytics and enable faster decision making. As individual teams self-serve their data, the analysis often ends up reflecting analysts’ idiosyncrasies, communicating conflicting insights and exposing issues of data inconsistency and quality. Providing quality, compliant datasets from a range of disparate sources across multiple deployment options, in time for decision making, is an ongoing problem for data teams. However, the enormous success of agile approaches and DevOps in software development has opened new doors for applying them to data engineering, bringing new solutions for data teams.
The last few years have seen an immense proliferation of methodology acronyms with the “Ops” suffix. Type “ops” into Google and you will find at least 20 established methodologies for operationalizing and standardizing different processes. DataOps is one such methodology: it aims to bring DevOps, lean manufacturing, and agile development and deployment practices to data processing and integration pipelines, making data operations more efficient and effective. Implementing DataOps and laying a strong foundation of robust data pipelines has recently become a key priority for most CDOs as they execute state-of-the-art AI/ML initiatives at scale.
The purpose of this article is to give a better understanding of the challenges faced by data teams, how DataOps can help, and a strategy for implementing DataOps at scale.
What is a Data Pipeline?
Let us return to our oil analogy for data.
To create value for customers and earn profits, crude oil needs to be processed in refineries and blending facilities to produce finished goods such as fuel, plastics, and chemicals. Similarly, the true value of data is realized when raw data is processed and transformed into actionable insights for the business. A data pipeline is the refinery and blending facility that processes raw data into actionable insights.
As shown in Figure 1, a data pipeline is a set of actions that ingests raw data from disparate sources, performs a series of transformation steps using different tools/technologies and delivers the output as an insight in the form of reports, models, and views.
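To make the idea concrete, here is a minimal sketch of a pipeline as a chain of small steps: ingest, transform, deliver. It uses only plain Python, and every name in it (`ingest`, `drop_incomplete`, `normalize_channel`, `deliver`, and the sample records) is a hypothetical illustration rather than part of any particular tool.

```python
from functools import reduce
from typing import Callable, Iterable

Record = dict
Step = Callable[[list], list]


def ingest(*sources: Iterable[Record]) -> list:
    """Combine raw records from several (hypothetical) sources into one list."""
    return [dict(rec) for source in sources for rec in source]


def drop_incomplete(records: list) -> list:
    """Transformation step: discard records that have no customer id."""
    return [r for r in records if r.get("customer_id") is not None]


def normalize_channel(records: list) -> list:
    """Transformation step: trim and lowercase the channel name."""
    return [{**r, "channel": r["channel"].strip().lower()} for r in records]


def run_pipeline(records: list, steps: list) -> list:
    """Apply each transformation step in order to the ingested records."""
    return reduce(lambda data, step: step(data), steps, records)


def deliver(records: list) -> dict:
    """Deliver a simple insight: number of interactions per customer."""
    report = {}
    for r in records:
        report[r["customer_id"]] = report.get(r["customer_id"], 0) + 1
    return report


# Made-up sample data standing in for two disparate sources.
web = [{"customer_id": 1, "channel": " Web "}, {"customer_id": None, "channel": "web"}]
support = [{"customer_id": 1, "channel": "CHAT"}, {"customer_id": 2, "channel": "Call"}]

clean = run_pipeline(ingest(web, support), [drop_incomplete, normalize_channel])
report = deliver(clean)
print(report)
```

Because each step is a small function with the same shape, steps can be tested, reused, and reordered independently, which is the property the rest of this article builds on.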
Let’s understand this with an example. A customer recently purchased a product from your company’s website and shared feedback on social media. The customer is now browsing other products on the website and inquiring about them through customer support channels (chat, call). How soon can an insight be derived from the feedback, user interactions, purchase history, and customer details for targeted marketing? A robust data pipeline can deliver the analytics on time, with no manual steps. The sooner you engage the customer with your product features, pricing deals, and so on, the better your chances of making a sale and delighting your customer.
A data pipeline can be conceptualized in two aspects:
· Value Pipeline (Data in Production): Creates value for the organization as data flows through production, where it is ingested, stored, integrated, computed, and visualized to obtain an actionable insight. This pipeline in production is the “operations” side of data.
· Innovation Pipeline (New Features): Produces new analytics (new features or changes in data pipeline code) that are developed and then deployed to the production pipeline.
The DataOps methodology aims to standardize and optimize these data pipelines so that they are fast to build and deploy, fault-tolerant, adaptive, and self-healing.
Challenges faced by Data Teams
According to a Gartner survey in 2020, only 22% of data management teams’ time is spent on innovation. The majority of their time goes to addressing operational issues and fixing errors in data pipelines instead of generating more business value. As a result, many data teams are unable to meet business expectations and drown in technical debt. Below are the major data challenges that contribute to this technical debt:
Integrating Data Silos: Massive amounts of data are gathered today from customer interactions, equipment usage, IT operations, and so on. This data resides in silos on myriad platforms and needs to be integrated in time with third-party datasets, such as demographics, to yield actionable insights. Data in disparate systems is usually not structured in the desired formats. This integration and consolidation is a complex, time-consuming process that requires a wide range of skills to implement efficiently.
Data Quality: The quality of the analytics delivered depends on the quality of the underlying data. “Garbage In, Garbage Out” is a phrase we have all heard when encountering data quality issues. Data quality errors can be difficult to trace and are therefore time-consuming and tedious to resolve.
Data Pipeline Maintenance: Delivering new analytical insights is often constrained by long development cycles, rigid old processes, and brittle IT systems. Any change in the data flow, such as adding or updating a source, modifying schemas, or adding new functionality, triggers a change in a data pipeline. When many processes in the data lifecycle are managed manually, the result is fatigue that often causes errors. Along with working on the business requirements, the team ends up investing a lot of unnoticed and unrewarded effort to ensure that the changes don’t break the production pipelines.
Shifting Business Demands: Business requirements evolve over time, and the insights need to be refined accordingly. Each insight delivered lets users understand the business problem in a new way, spurring a series of new asks for more analytics and, in turn, more iterative changes to the data pipeline.
Understanding Data Context: With inconsistent data governance, such as stale metadata, it becomes difficult for data engineers (who are usually not domain experts) to locate, understand, and integrate the right datasets from different sources.
These challenges are not unique to any one organization; they look similar across most organizations.
According to IDC findings in 2021, global data creation and replication will experience a 23% CAGR over the 2020–2025 forecast period. With the amount of data created and replicated growing steadily, many organizations are struggling to manage the demand and ensure a steady flow of value creation. DataOps is an emerging approach for implementing end-to-end, reliable, robust, scalable, and repeatable data solutions through agile, collaborative practices.
It is an automated, process-oriented methodology, used by data and analytics teams, to enable:
· Faster delivery of insights to customers.
· Collaboration between groups within data organizations, technologies, and environments.
· Reduced inefficiencies and error rates in the current processes.
· Measurement, monitoring, and alerting on operations.
· A culture of rapid experimentation and innovation (fail fast, learn fast).
“The Manifesto for Agile Software Development,” which debuted in 2001, and “The DataOps Manifesto,” published in 2017, have shown similar adoption trajectories on different timelines. A common misconception is that DataOps is just DevOps applied to data. However, a key difference is that, in addition to applying agile principles and the DevOps methodology, DataOps also focuses on quality control of data operations through lean manufacturing principles.
Strategy to Implement DataOps
“A goal without a plan is just a wish” — Antoine de Saint-Exupéry
In mid-2018, Gartner placed DataOps on the fastest-emerging part of its Hype Cycle for Data Management. Around the same time, the research firm Pulse Q&A conducted a survey and found that 73% of companies were planning to invest in DataOps initiatives. The shift toward DataOps is real, but how can an organization implement it?
DataOps can be transformational because it impacts an organization’s end-to-end analytics lifecycle. Understanding the organization’s current maturity level and its teams’ proficiency is crucial. To start with, a standard maturity assessment helps an organization understand where it stands and how capable its teams are of delivering business-ready data on time. Figure 3 provides a maturity model that can be leveraged for the assessment. The characteristics listed in the model can be used as a reference to gauge the current maturity level.
DataOps takes a holistic, transformational approach to the people, processes, and technology required to build and automate robust data pipelines. The methodology focuses on six key tenets (Figure 4) in the lifecycle of DataOps adoption.
Each stage of a data pipeline is defined by source code or scripts, configuration files, parameter files, containers, and so on. Together, these artifacts form the code responsible for converting raw content into meaningful information. As in software development, the pipeline should be maintained in a version control system such as Git. This centralized repository helps the team organize and manage changes and revisions to the code.
Continuous Integration: A major productivity boost is achieved when developers can work in parallel on separate features rather than on a single monolithic code base. Branching and merging allow the team to make changes to the same code in parallel without slowing down development velocity.
DevOps teams should be able to quickly spin up and down test data environments as they iterate on new code. Automating data delivery into the CI/CD toolchain breaks the data bottleneck, so continuous integration can truly scale.
Continuous Deployment: To reduce conflicts and dependencies across environments, an organization should provide a repeatable and consistent deployment pipeline for the code.
A sustainable way of producing robust data pipelines is to break them down into smaller reusable components and containerize those components.
Many organizations have very fractured data environments. For example, data is stored across disparate repositories in hybrid environments and needs to be merged with third-party data. There are dependencies on different processes, such as FTP transfers and manual data preparation using multiple tools, to deliver the analytics. Because knowledge of the data and processes in such pipelines is tribal, each pipeline is mostly handled by a single team member, which puts it at high risk. Moving away from cowboy coding is crucial for a team to remove dependencies on individuals and inconsistencies in code.
Use a central hub to design and orchestrate automated end-to-end workflows. The code components need to be parameterized to enable automation, so that jobs can be kicked off automatically, passing data from tool to tool and delivering the final insight. Tools such as Apache Airflow, AWS Glue, and Apache NiFi can be used for these scenarios.
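As a rough sketch of what parameterized orchestration means, the example below declares tasks and their dependencies, then runs them in dependency order with run-time parameters. It deliberately uses only Python's standard library (`graphlib.TopologicalSorter`, available from Python 3.9) as a stand-in for a real orchestrator; it is not the API of Airflow, Glue, or NiFi, and the task names, sample rows, and `region` parameter are hypothetical.

```python
from graphlib import TopologicalSorter


def extract(ctx: dict, params: dict) -> None:
    """Hypothetical extract task: load raw rows into the shared context."""
    ctx["rows"] = [
        {"region": "EU", "amount": 10},
        {"region": "US", "amount": 25},
        {"region": "EU", "amount": 5},
    ]


def filter_region(ctx: dict, params: dict) -> None:
    """Keep only the rows for the region supplied at run time."""
    ctx["rows"] = [r for r in ctx["rows"] if r["region"] == params["region"]]


def summarize(ctx: dict, params: dict) -> None:
    """Deliver the final insight: total amount for the selected region."""
    ctx["total"] = sum(r["amount"] for r in ctx["rows"])


TASKS = {"extract": extract, "filter_region": filter_region, "summarize": summarize}

# Each task maps to the set of tasks that must finish before it can start.
DEPENDENCIES = {
    "extract": set(),
    "filter_region": {"extract"},
    "summarize": {"filter_region"},
}


def run_workflow(params: dict) -> dict:
    """Run all tasks in dependency order, passing data along through ctx."""
    ctx = {}
    for name in TopologicalSorter(DEPENDENCIES).static_order():
        TASKS[name](ctx, params)
    return ctx


result = run_workflow({"region": "EU"})
print(result["total"])
```

The same workflow definition can be started with different parameters (for example `{"region": "US"}`) without touching the task code, which is what lets a scheduler kick off jobs automatically.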
Tests of data, models, and business logic must be applied at each stage of the data pipeline. Adding tests is analogous to the statistical process control technique used in manufacturing operations: the tests check for accuracy, potential deviations, and errors before each release of a feature to ensure consistent quality. Manual testing has no place in high-performance organizations because it is error-prone, time-consuming, and laborious. Hence DataOps automates testing: test scripts are written in advance and executed in test environments that are spun up on demand. Parameterizing the code components (individual tasks) is a good practice that gives automated tests the flexibility to accommodate different run-time circumstances. The version of a dataset, the filtering criteria applied to the data, and the steps to include in a flow are some of the parameters that can be supplied at runtime to validate concurrent instances of a data pipeline with different values. DataOps employs automated testing to address errors in both the Value and Innovation Pipelines:
· Value Pipeline (Data in Production): Incorporates lean manufacturing techniques such as Statistical Process Control (SPC) to optimize operational efficiency. SPC measures different operational characteristics and monitors the data pipeline for anomalies, ensuring that the metrics stay within acceptable ranges. The quality of a value pipeline solely depends on the quality of the data flowing through production.
· Innovation Pipeline (New Features): Validates new analytics before deploying them, automates continuous deployment, and simplifies the release of new features. It reduces the overall cycle time of productionizing ideas, enabling the data analytics team to focus on innovation by building more analytics. The quality of an innovation pipeline solely depends on the quality of the code developed.
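The automated, parameterized testing described above might look something like the following sketch. The check functions, column names, sample batches, and the `max_amount` parameter are all hypothetical, and a real project would more likely use a test framework or a dedicated data-validation library; the point here is only that the checks are written once and then run automatically with values supplied at runtime.

```python
def check_not_null(rows: list, column: str) -> bool:
    """Pass when every row has a value for the given column."""
    return all(row.get(column) is not None for row in rows)


def check_unique(rows: list, column: str) -> bool:
    """Pass when no value in the given column is repeated."""
    values = [row[column] for row in rows]
    return len(values) == len(set(values))


def check_range(rows: list, column: str, low: float, high: float) -> bool:
    """Pass when every value in the column lies within [low, high]."""
    return all(low <= row[column] <= high for row in rows)


def build_checks(max_amount: float) -> list:
    """Build the test suite; max_amount is a run-time parameter."""
    return [
        ("order_id is present", check_not_null, {"column": "order_id"}),
        ("order_id is unique", check_unique, {"column": "order_id"}),
        ("amount within range", check_range,
         {"column": "amount", "low": 0, "high": max_amount}),
    ]


def run_checks(rows: list, checks: list) -> list:
    """Run every check and return the names of those that failed."""
    return [name for name, func, kwargs in checks if not func(rows, **kwargs)]


# Made-up batches: one clean, one with a duplicate id and an outlier amount.
good_batch = [{"order_id": 1, "amount": 20.0}, {"order_id": 2, "amount": 35.5}]
bad_batch = [{"order_id": 1, "amount": 20.0}, {"order_id": 1, "amount": 900.0}]

print(run_checks(good_batch, build_checks(max_amount=500)))
print(run_checks(bad_batch, build_checks(max_amount=500)))
```

A release or a production run can then be blocked whenever `run_checks` returns a non-empty list.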
Clear measurement and monitoring of results is paramount to building robust, high-quality data pipelines.
“You can’t manage what you don’t measure” - W. Edwards Deming
Unlike code in software development, data in production changes continuously. Monitoring checks are therefore embedded in data operations so that any deviation of a metric from its configured threshold triggers a notification to the users and a remediation process.
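A threshold check of this kind can be sketched with a simplified statistical process control rule: derive control limits from a baseline of past values (here mean plus or minus three sample standard deviations) and notify when a new value falls outside them. This is an illustrative simplification, since production SPC charts often estimate the spread differently, and the metric name, baseline values, and `notify` callback are hypothetical.

```python
from statistics import mean, stdev


def control_limits(baseline: list, sigmas: float = 3.0) -> tuple:
    """Derive lower and upper control limits from historical metric values."""
    center = mean(baseline)
    spread = stdev(baseline)
    return center - sigmas * spread, center + sigmas * spread


def monitor(metric_name: str, value: float, limits: tuple, notify) -> bool:
    """Return True if the value is in control; otherwise notify and return False."""
    lower, upper = limits
    in_control = lower <= value <= upper
    if not in_control:
        notify(f"{metric_name}={value} outside control limits "
               f"[{lower:.1f}, {upper:.1f}]")
    return in_control


# Made-up history of a daily row count for one pipeline stage.
baseline_row_counts = [1000, 1020, 980, 1010, 990, 1005, 995]
limits = control_limits(baseline_row_counts)

alerts = []  # stand-in for an email, chat, or paging integration
monitor("daily_row_count", 1003, limits, alerts.append)
monitor("daily_row_count", 400, limits, alerts.append)
print(alerts)
```

In practice the `notify` callback would post to whatever alerting channel the team already uses, and an out-of-control result could also pause downstream tasks until the issue is remediated.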
An enterprise dashboard can be built to keep constant watch on team performance, production issues and their impact, data quality, business confidence levels, and so on. As these metrics are evaluated over time, you will be able to justify the value that data pipelines bring to the organization.
Focus on enabling the DataOps culture at an organizational level.
· Educate: Train your teams on the DataOps strategy.
· Analyze: Understand and document the bottlenecks in your current processes.
· Optimize: Iteratively identify the processes to be automated and break them into individual tasks or components.
· Communicate: Enable cross-team collaboration by agreeing on expectations, setting up a cadence with the team, and evaluating the issues reported.
An organization centered on a strong DataOps culture ensures a smooth transition to a higher maturity level and better alignment with business goals.
Imagine the next time the marketing team requests additional attributes in the customer data to evaluate customer lifetime value (CLV) for targeting new promotions. With a successful DataOps implementation, the analytics team can agree to stringent timelines and deliver, the business can have confidence in the quality of the governed data being served, and executives can improve their strategies by testing new ideas in the market more rapidly.
Although DataOps is no longer in its nascent stage, it is still being adopted gradually on a global scale. Organizations that have matured in their DataOps journey realize that it is not an overnight change but a continuous journey toward agility: optimizing processes, increasing cross-team collaboration, and building systems of trust. Focusing on the six tenets discussed above may require some upfront investment, but the payoff can be dramatic: trusted, high-quality, business-ready, and timely data delivery to business users.
Apart from technology and process changes, data analytics teams have a huge role to play in evolving the DataOps culture in an organization. It is ideal to establish a practice to champion the DataOps concepts across different teams.
Below are a few pointers for starting a DataOps Center of Excellence (CoE):
· Assess your organization’s DataOps maturity and track its progress on the maturity scale over time. This helps to justify the investments made.
· Plan a roadmap for a gradual transition toward DataOps maturity.
· Identify a pilot project and invest in expanding and growing DataOps skills in the organization.
· Promote the pilot project’s success and encourage other projects and teams to participate.
· Build a DataOps CoE to provide guidance, suggest best practices, discuss and share lessons learned with teams, and standardize the overall implementation approach.
Investments made in building a DataOps practice lay a strong foundation for executing strategic AI/ML initiatives in the future. Keep progressing, and soon you will realize the benefits of transitioning from artisanal to industrial-scale processes for developing and running data pipelines.