Alongside DevOps, which ideally means deploying code into production with the click of a button through a well-defined process, data professionals are thinking about a similar concept for moving data into production. If a business user wants new data in an analytical application, can we deliver it with the click of a button?

Despite new technology that can handle vast volumes of data at lightning speed, organizations still struggle to implement analytics in a timely manner, with good-quality data, before the results become obsolete in a constantly evolving environment.

Roadblocks

There are misconceptions about what it takes to capture, store, organize and analyze data and eventually get it into a production environment. While tool vendors want you to believe it all happens at the click of a button, the reality is that data typically can’t be used without prior knowledge and understanding of what it represents. Data often contains errors that must be recognized and dealt with, it may be incomplete, and some of it is simply not of good enough quality to be used for analytics straight from the source.

Integrating data from various sources is another challenge. Technically, we can dump all of the data into a data lake quickly and efficiently. But practically it takes effort and a deep understanding of the data to be able to combine it from different data sources in a meaningful way.

It has long been known that the traditional ETL (Extract-Transform-Load) approach used to load data warehouses is inefficient. The transformations required as part of ETL do not scale in a big data environment, especially when new data is constantly flowing in. Manual effort is required to maintain and enhance the ETL process, which creates a bottleneck between data being captured and data being available for analytics.
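To make that maintenance burden concrete, here is a minimal, hypothetical sketch of a hand-written incremental ETL step in Python. The orders source table, the fact_orders warehouse table and their columns are assumptions for illustration only; every new source, renamed column or added business rule means more code like this to write and keep working by hand.

    import sqlite3

    def load_new_orders(src: sqlite3.Connection, dwh: sqlite3.Connection) -> int:
        """Extract rows added since the last load, transform them, load them."""
        # Extract: find the high-water mark already present in the warehouse.
        last_loaded = dwh.execute(
            "SELECT COALESCE(MAX(order_ts), '1970-01-01') FROM fact_orders"
        ).fetchone()[0]
        rows = src.execute(
            "SELECT order_id, order_ts, amount, currency FROM orders "
            "WHERE order_ts > ?", (last_loaded,)
        ).fetchall()

        # Transform: every business rule lives in hand-maintained code like this.
        cleaned = [
            (order_id, order_ts, round(amount, 2), currency.upper())
            for (order_id, order_ts, amount, currency) in rows
            if amount is not None
        ]

        # Load: append the cleaned rows to the warehouse fact table.
        dwh.executemany(
            "INSERT INTO fact_orders (order_id, order_ts, amount, currency) "
            "VALUES (?, ?, ?, ?)", cleaned
        )
        dwh.commit()
        return len(cleaned)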

Due to a lack of process automation and poor data governance practices, there are often shortcuts in the form of manual processes and data shared through spreadsheets that have been tinkered with. Such shortcuts feel quick and efficient, because it is easy to edit and email a spreadsheet, but they compromise data quality and raise data security concerns.

Enter DataOps

How can we provide business users with data and analytics in a timely manner? One possible approach is DataOps. It is analogous to DevOps: an approach for delivering production-ready data from various data sources to enterprise users quickly, repeatedly and reliably.

As in DevOps, where a continuous build, test and deploy process releases production-ready code frequently, we want a process that moves data from the source to the target environment in a repeatable and, where possible, automated manner. This calls for tools that provide capabilities such as an inventory of source data, data lineage from one layer to the next, flexible data models that accommodate various data sources and formats, and feedback loops to improve data quality.
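As an illustration of the lineage piece, the following sketch in Python records where a dataset came from each time it is promoted from one layer to the next. The layer names, the LineageRecord structure and the promote helper are hypothetical and not the API of any particular tool.

    from dataclasses import dataclass, field
    from datetime import datetime, timezone

    @dataclass
    class LineageRecord:
        dataset: str          # logical dataset name
        source_layer: str     # e.g. "raw"
        target_layer: str     # e.g. "curated"
        transformation: str   # step that produced the data
        row_count: int
        loaded_at: str = field(
            default_factory=lambda: datetime.now(timezone.utc).isoformat()
        )

    # In a real setup this would be a data catalog or metadata store.
    LINEAGE_LOG = []

    def promote(dataset, rows, transformation, source_layer, target_layer):
        """Move rows to the next layer and record where they came from."""
        LINEAGE_LOG.append(LineageRecord(dataset, source_layer, target_layer,
                                         transformation, len(rows)))
        return rows

    curated = promote("orders", [{"order_id": 1, "amount": 10.0}],
                      transformation="deduplicate_and_cast",
                      source_layer="raw", target_layer="curated")

With every hop logged this way, it becomes possible to trace where any number in a report came from, which is exactly the transparency DataOps relies on.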

Another aspect of DataOps is getting data science models into production. This is a workflow in which data scientists access source data, create analytical models and then put these models into production, where the results are typically calculated on the fly. It requires collaboration among data scientists, application developers and architects, data governance and security team members, and those in operations where business processes are executed. A fast time to value is achieved only through close collaboration.
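A minimal sketch of that hand-off, assuming scikit-learn and a hypothetical churn model, might look like this: the data scientist trains a model and publishes it as an artifact, and operations load the artifact into a scoring function that calculates results on the fly (in practice exposed behind an API).

    import joblib
    from sklearn.linear_model import LogisticRegression

    # Data science side: train a model and publish it as an artifact.
    X_train = [[0.1, 200.0], [0.9, 15.0], [0.2, 180.0], [0.8, 30.0]]
    y_train = [0, 1, 0, 1]                     # e.g. churned: no / yes
    model = LogisticRegression().fit(X_train, y_train)
    joblib.dump(model, "churn_model.joblib")

    # Operations side: load the artifact once and score requests on the fly.
    deployed = joblib.load("churn_model.joblib")

    def score(features):
        """Return the probability of the positive class for one record."""
        return float(deployed.predict_proba([features])[0][1])

    print(score([0.7, 25.0]))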

DataOps and agile go hand in hand, as both promote short delivery times and responsiveness to change. As in DevOps, there is a strong emphasis on testing: in DataOps we test data for quality and run automated regression tests before deploying to production.
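As an illustration, data quality checks that a CI pipeline could run before promoting data (or a pipeline change) to production might look like the following sketch. pandas is used here, and the column names and allowed values are hypothetical.

    import pandas as pd

    def check_orders(df: pd.DataFrame) -> list:
        """Return a list of data quality violations; empty means the checks pass."""
        problems = []
        if df["order_id"].duplicated().any():
            problems.append("duplicate order_id values")
        if df["amount"].isna().any():
            problems.append("missing amount values")
        if (df["amount"] < 0).any():
            problems.append("negative amounts")
        if not df["currency"].isin(["EUR", "USD", "GBP"]).all():
            problems.append("unexpected currency codes")
        return problems

    sample = pd.DataFrame({
        "order_id": [1, 2, 3],
        "amount": [10.0, 25.5, 7.2],
        "currency": ["EUR", "USD", "EUR"],
    })
    # Fail the deployment if any check is violated.
    assert check_orders(sample) == []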

Benefits of DataOps

The first benefit of DataOps that comes to mind is agility. Data scientists can rapidly deploy new models to production and improve existing models as new data arrives. Overall productivity increases because everyone involved in the process focuses on their own specialty while collaborating closely with the others.

Practices such as governance, with a focus on data ownership, transparency, and comprehensive data-lineage tracking through the entire data pipeline, are a must for efficient DataOps. They deliver better data quality than traditional analytics implementations, where such practices were often overlooked.

DataOps is a framework that supports agility in getting data to where it is needed. It covers both aspects of data processing: sourcing and combining data on the one hand, and safeguarding data security and accuracy on the other.

The ultimate goal of DataOps is that IT is no longer a bottleneck in procuring data, but rather the facilitator of an efficient flow through a predictable process.
