Getting better value from big data projects with DataOps

Published February 15, 2018, 6:30am EST

Putting data science models into production

Data science and machine learning techniques are playing an increasingly important role in driving value for big data projects. In the following slideshow, Crystal Valentine, vice president of technology strategy for MapR Technologies, introduces the concept of DataOps and explains how this approach to data science workflows plays a critical role in putting data science models into production.

DataOps: An agile methodology for data-driven organizations

“DataOps is an emerging practice utilized by large organizations with teams of data scientists that need to train models and deploy them to production,” Valentine explains. “The goal of a DataOps methodology is to create an Agile, self-service workflow that fosters collaboration and boosts creativity while respecting data governance policies. The teams that work together to implement a DataOps methodology include data scientists, application developers and architects, data governance and security team members, and those in operations. A successful DataOps practice supports cross-functional collaboration and fast time to value.”

Comparison to DevOps

“DataOps is, in some ways, an extension of DevOps practices,” Valentine says. “DevOps is a practice that grew out of the observation that building and deploying applications exhibited certain procedures that were repeatable and could be automated, so developers and operations professionals could collaborate tightly to speed the process. However, DevOps is not particularly amenable to data-intensive applications, such as data science and machine learning, because some of the underlying assumptions that went into formalizing DevOps do not hold for these newer types of applications. Above all, DevOps practices assume that the types of applications being deployed are lightweight or ephemeral and do not have a large reliance on persistent data. DataOps emerged from the realization that data-intensive applications require a different approach from the ground up.”

Different method, different functional groups

“We also have different functional groups involved in DataOps,” Valentine says. “Whereas DevOps practices largely involve collaboration between Developers, Quality Assurance, and Operations, data-intensive applications cross over a broader group. In particular, with machine learning and data science applications, we increasingly see that data scientists represent a new group that has to be brought into the process. These folks typically come from a background in academia and have different skills and use different tools than traditional software developers, so they need to be integrated into the workflow. Additionally, whenever data access is central to the workflow, data governance becomes important for the entire process, so IT governance also plays a central role.”

Motivation and goals

DataOps can be thought of as deriving from a set of axioms that we believe to be true, Valentine says.
Axioms include:
· Data is central to disruptive enterprise applications.
· Lightweight, stateless functions do not represent the majority of workloads.
· Data science and machine learning are an important paradigm.
· Data scientists become active users, no longer just application developers.
· Workflows are iterative, with different data usage patterns.
· Data volumes continue to grow.
· Moving data is a performance bottleneck.

The goals of DataOps

According to Valentine, the goals of a DataOps practice are:
· Promote continuous model deployment (see the sketch after this list)
· Promote repeatability
· Promote productivity, with each group focused on its core competencies
· Promote agility
· Promote self-service
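
To make continuous model deployment concrete, the following minimal Python sketch shows a registry that versions trained models and promotes only validated ones to production. Every name in it is illustrative; Valentine's description does not prescribe any particular tooling.

    # Minimal sketch of continuous model deployment: a registry that versions
    # trained models and promotes a validated one to production. All names
    # are illustrative assumptions; no specific tooling is implied.
    from dataclasses import dataclass, field
    from typing import Callable, List, Optional

    @dataclass
    class ModelVersion:
        version: int
        predict: Callable[[float], float]  # the trained scoring function
        validated: bool = False

    @dataclass
    class ModelRegistry:
        versions: List[ModelVersion] = field(default_factory=list)
        production: Optional[ModelVersion] = None

        def register(self, predict: Callable[[float], float]) -> ModelVersion:
            mv = ModelVersion(version=len(self.versions) + 1, predict=predict)
            self.versions.append(mv)
            return mv

        def promote(self, mv: ModelVersion) -> None:
            # Only validated models reach production; the swap is a single
            # reference assignment, so serving is never interrupted.
            if not mv.validated:
                raise ValueError("model must pass validation before promotion")
            self.production = mv

A data scientist can then register and promote a new version at any time, independent of the application release cycle, which is the self-service, repeatable loop the goals above describe.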


The DataOps process

“A DataOps workflow supports cross-functional collaboration and fast time to value,” Valentine says. “With an emphasis on both people and process, as well as the empowering platform technologies that underlie it, a DataOps process allows each collaborating group to increase productivity by focusing on their core competencies while enabling an agile, iterative workflow. A DataOps workflow paradigm must be embraced and implemented by several different functional groups within an organization which all collaborate to deliver business value. At a high level, the process involves several functional components, including: data engineering and curation, data governance, application development, application testing, model management, model deployment, and model monitoring and rescoring. These components are integrated into a streamlined process that promotes repeatability, automation, continuous model deployment, and self-service data access.”
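
As a rough illustration of how those components might chain together, the sketch below wires placeholder Python functions for curation, governance, training, and testing into one repeatable pipeline. The stage bodies are toy assumptions; only the ordering follows Valentine's description.

    # A repeatable pipeline sketch with one placeholder stage per component
    # named in the quote. The stage bodies are toy assumptions; only the
    # ordering (curate, govern, train, test, deploy) follows the text.
    from typing import Callable, List, Optional

    def curate(raw: List[Optional[float]]) -> List[float]:
        # Data engineering and curation: drop malformed records.
        return [r for r in raw if r is not None]

    def enforce_policies(data: List[float]) -> List[float]:
        # Data governance checkpoint: e.g. mask or drop restricted fields.
        return data

    def train(data: List[float]) -> Callable[[float], float]:
        # Model management: produce a new candidate model (a toy one here).
        mean = sum(data) / len(data)
        return lambda x: x - mean

    def test(model: Callable[[float], float], holdout: List[float]) -> bool:
        # Application/model testing gate before deployment.
        return all(isinstance(model(x), float) for x in holdout)

    def run_pipeline(raw, holdout):
        data = enforce_policies(curate(raw))
        model = train(data)
        if not test(model, holdout):
            return None            # failed models never reach production
        return model               # hand off to deployment and monitoring

    production_model = run_pipeline([1.0, None, 3.0], holdout=[2.0])

In a real deployment each stage would wrap an organization's own frameworks; the point of the sketch is that the hand-offs between groups become explicit, repeatable code rather than ad hoc manual steps.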

A platform approach

“While DataOps focuses largely on people and process, it also requires an enterprise-grade platform to enable collaboration and the sharing of data and compute resources by the different groups involved,” Valentine says. “Rather than having each group work off of siloed technology and data, a collaborative effort should be able to leverage a single, unified data platform. Using a single platform is key to agility, reduces the need to copy or move large data sets, and supports a holistic approach to data access and security. The technical requirements of a platform that can support a DataOps process include: enterprise-grade reliability, multi-tenancy and resource utilization, native support for any data type to accommodate diverse and evolving data sources, support for distributed architectures, and self-service data access through a metadata-driven data marketplace. The platform should empower the security and governance teams to enforce privacy and security policies with granular access control expressions while promoting a self-service, agile data access workflow.”
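
The granular access control expressions mentioned above can be pictured as boolean expressions over users and groups that gate access to a data set. The u:/g: syntax and the evaluator below are an assumption modeled loosely on that idea, not a documented platform API.

    # Toy evaluator for access control expressions of the form
    # "u:alice | g:datascience" (| means any clause grants access,
    # & means every term in a clause must hold). The syntax is an
    # assumption, not a documented API.
    from typing import Set

    def allowed(expr: str, user: str, groups: Set[str]) -> bool:
        def term_holds(tok: str) -> bool:
            kind, _, name = tok.strip().partition(":")
            return name == user if kind == "u" else name in groups
        return any(all(term_holds(t) for t in clause.split("&"))
                   for clause in expr.split("|"))

    # A member of the data science group may read; outsiders may not.
    assert allowed("u:alice | g:datascience", "bob", {"datascience"})
    assert not allowed("g:governance & g:datascience", "bob", {"datascience"})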

The benefits of DataOps

A DataOps methodology yields many benefits to data-driven organizations, Valentine says, but principal among them are:
• Agility. “Data scientists can iterate rapidly to improve models. The publication of new models can happen independently of the application development work, and new models can be deployed to production without interrupting the production application’s operation.” (See the sketch of such a zero-downtime swap after this list.)
• Increased productivity. “Each group can focus on their core competencies. Data scientists do not get bogged down in doing the ‘plumbing’ work of finding, copying, curating, and transforming data. Application developers do not waste time refactoring code written by data scientists so it can run in production.”
• Security. “With a unified data platform, organizational data access and privacy policies can be enforced holistically across organizations. Model development and application deployment activities inherit from the data access policies specified by the governance group.”
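
The zero-downtime swap referenced in the agility point can be pictured as replacing a model reference behind a lock while requests keep flowing. The sketch below is illustrative only; it is not MapR's mechanism, and all names are invented.

    # Sketch of deploying a new model without interrupting the serving
    # application: the model reference is swapped behind a lock, so each
    # request scores against a consistent model. Names are illustrative.
    import threading
    from typing import Callable

    class ModelServer:
        def __init__(self, model: Callable[[float], float]) -> None:
            self._model = model
            self._lock = threading.Lock()

        def swap(self, new_model: Callable[[float], float]) -> None:
            # Publication of a new model is independent of application code.
            with self._lock:
                self._model = new_model

        def predict(self, x: float) -> float:
            with self._lock:
                model = self._model
            return model(x)

    server = ModelServer(lambda x: x * 2)    # version 1 in production
    server.swap(lambda x: x * 2 + 1)         # deploy version 2, no downtime
    print(server.predict(10.0))              # 21.0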

A new agile methodology

“A DataOps methodology focuses on improving the business value of data science and machine learning investments by speeding time to market for intelligent applications,” Valentine says. “A DataOps practice makes data consumable in an efficient and agile way while still respecting governance and security policies. While DataOps is still emerging as an enterprise practice for organizing the work of small teams involved in a collaborative process to build data applications, it represents a significant new trend. It is a paradigm shift to recognize that data-intensive applications have their own set of considerations for managing and securing large, complex data sets while still enabling agile access to that data by the people who need it.”

Freeing up the data access bottleneck

“As data access is often the bottleneck in supporting data-intensive applications, an intelligent application development practice should consider the use and management of data as an organizing principle,” Valentine says. “Thinking about the management and access to data as part of the development process turns conventional application-centric thinking on its head. New DataOps development practices are a sign that we are moving toward a new reality in which IT is no longer a bottleneck to productivity but instead part of a fluid process that emphasizes self-service and faster time to market while still being enterprise grade.”