Google Cloud rolls out Cloud Dataproc on Kubernetes

11 Sep 2019

Google Cloud is trialling alpha availability of a new platform for data scientists and engineers through Kubernetes.

Cloud Dataproc on Kubernetes combines open source, machine learning and cloud to help modernise big data resource management.

The alpha will start with workloads on Apache Spark, with more environments to come.

According to Google Cloud product managers Christopher Crosbie and James Malone, Google Cloud Dataproc can provide open source data analytic processing for those who need to process data and train models at scale, faster.

“However, as enterprise infrastructure becomes increasingly hybrid in nature, machines can sit idle, single workload clusters continue to sprawl, and open source software and libraries continue to become outdated and incompatible with your stack,” they explain.

“It’s critical that Cloud Dataproc continues to empower data professionals to focus more on workloads than infrastructure by combining the best of cloud and open source.”

The platform will include key benefits such as faster workloads, unified resource management, job isolation, collaboration, and expertise sharing.

Unified resource management will allow data scientists to work with a central view that spans both Kubernetes and YARN cluster management systems.

“Kubernetes has flipped the big data and machine learning open source software (OSS) world on its head, since it gives data scientists and data engineers a way to unify resource management, isolate jobs, and build resilient infrastructures across any environment.”

More resilient infrastructure – a self-healing GKE environment can support the smooth operation of mission-critical ETL and machine learning jobs on Spark.

“Data scientists and data engineers don’t have to worry about sizing and building clusters, manipulating Docker files, or messing around with Kubernetes networking configurations. It just works. With leading support from the team that built Kubernetes, enterprises have access to the skills they need to close any Kubernetes skills gap on their team.”

Less time and fewer resources spent on infrastructure, more on workloads – new applications and models can be developed faster and at scale.

Isolate jobs to accelerate analytics life cycles – users can package up entire jobs in standalone containers to allow for testing, upgrading and patching without breaking the underlying cluster.

Collaboration and expertise sharing to close the Kubernetes skills gap – new capabilities, bugs and security issues can be discussed and resolved by the open source community.

“This is the first step in a larger journey to a container-first world. While Apache Spark is the first open source processing engine we will bring to Cloud Dataproc on Kubernetes, it won’t be the last,” comment Crosbie and Malone.

They add that Google Cloud’s data and analytics strategy has always involved open source as a core pillar.

“This alpha announcement of bringing enterprise-grade support, management, and security to Apache Spark jobs on Kubernetes is the first of many as we aim to simplify infrastructure complexities for data scientists and data engineers around the world.”