How Kubernetes marries IT and data science roles


There is a clear division between the tasks and challenges IT and data science teams face, including how each impacts the organization. At what point can these two business units unify and present ROI back to the organization as one function? An unlikely answer: Kubernetes.

Today’s IT leaders are challenged with centralizing data science infrastructure in a way that increases governance without constraining freedom and flexibility. Kubernetes’ strength as an open source container orchestration system is well documented. It has the potential to weave together the tools data scientists need, enabling self-service analytical research and scale that was previously out of reach, but IT leaders should take note of key trends and associated pitfalls.

Adoption of Kubernetes is increasing with little sign of slowdown, and its capabilities will continue to improve over the coming years. IT leaders will want to avoid building or acquiring data science platforms that simply run on Kubernetes, and instead favor Kubernetes-native data science platforms in which all core components are built from the ground up on Kubernetes. Only natively built platforms will be able to take full advantage of coming Kubernetes innovation and integrate with the data-science-specific open source breakthroughs that will continue to arrive.

IT leaders should also avoid tying their platform to a single cloud or infrastructure provider. Providers that charge by the cycle or hour, or that are tied to a particular infrastructure, have little incentive to build certain efficiencies into their solutions, or are simply unable to.

For example, the trend toward multi-cloud and hybrid-cloud adoption is strong, but on-prem workloads will continue to be a necessity for many organizations. Moving between cloud providers, or requiring some workloads to run on-prem, should not require a lift-and-shift of your data science platform or alter the way data scientists do their research and collaborate.

If a data science platform is native-built in Kubernetes, it can deploy seamlessly to on-prem Kubernetes clusters or to clusters from any cloud provider. The result is a single system of research across cloud and on-prem assets, one that allows best practices to take root and, in turn, fosters scale.

In another example, consider utilization costs. They can be controlled through features such as pluggable clusters that can run on the same hardware as standard workloads. Additionally, automatic demand-based scale-down and automatic cleanup of resources reduce wasteful utilization. Kubernetes-native platforms that are cloud- and infrastructure-agnostic are better able to offer these kinds of features.
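To make the idea of demand-based scale-down and automatic cleanup concrete, here is a minimal sketch in Python. It is illustrative only: the `Workspace` class, the function names, and the thresholds are all hypothetical and do not come from Kubernetes or from any particular data science platform. It shows only the kind of decision logic such a feature might apply.

```python
import math
from dataclasses import dataclass


@dataclass
class Workspace:
    """Hypothetical running resource (a notebook or cluster) tracked by a platform."""
    name: str
    idle_minutes: float


def desired_workers(pending_tasks: int, tasks_per_worker: int,
                    min_workers: int = 0, max_workers: int = 10) -> int:
    """Demand-based sizing: just enough workers for the queue, kept within bounds."""
    if tasks_per_worker <= 0:
        raise ValueError("tasks_per_worker must be positive")
    needed = math.ceil(pending_tasks / tasks_per_worker)
    return max(min_workers, min(max_workers, needed))


def to_clean_up(workspaces, idle_limit_minutes: float = 60.0):
    """Return the names of workspaces idle for longer than the (assumed) limit."""
    return [w.name for w in workspaces if w.idle_minutes > idle_limit_minutes]
```

With an empty queue and `min_workers` left at zero, `desired_workers` returns zero, which is the scale-down case: no demand, no compute billed. A real platform would likely base these decisions on richer signals than a queue length and an idle timer.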

At the end of the day, the best thing IT can offer data science is the ability to scale. Scale is the key to gaining competitive advantage in today’s market of broad data science awareness. The ways in which a platform like the one we have been discussing can enable scale in organizational data science include:

  • Powerful systems for serving a library of approved, customized, and configurable containers to speed up research and facilitate reproducibility.
  • Access to a variety of compute, such as CPUs and GPUs of various sizes and both permanent and ephemeral clusters.
  • Tracking of project metadata, data, code, and research logic in a thoughtful and systematic way to facilitate collaboration and knowledge management.
  • Pushing models, reports, and batch jobs into production without the hassle of DevOps gymnastics.
  • Monitoring of models and other assets in production.
  • Instant, safe access to all data scientists’ favorite analytical packages and IDEs from the open-source world without friction from IT.
  • Assurance that whichever tools and high-performance computing paradigms (Spark, Dask, Ray, RAPIDS, etc.) win out over the next 10 years, they will be supported.
  • The ability to run proprietary and open-source analytical software under the same research platform.
  • A single platform and work-style for any cloud provider and underlying infrastructure.

Meanwhile, IT enjoys a safe, secure, and repeatable methodology for managing data science research and productionization as it scales. IT benefits include:

  • Eliminating sprawl and shadow IT.
  • Strict container management.
  • Comprehensive security and permissioning.
  • Managerial oversight at the platform, research project, production asset, data asset, team, and user level.
  • Cost and waste reductions through utilization controls.

The beauty of Kubernetes, and one of the main reasons for its skyrocketing popularity, is in its simplicity. It simplifies the world of container orchestration. The requirements to install a full-featured enterprise data science platform that is native-built in Kubernetes can fit on one page.

IT can breathe a sigh of relief. Simplification enables security, flexibility, and efficiency, which equates to competitive advantage for those keeping score in the C-suite. IT and data scientists may finally have the common ground for which they have been searching.
