
Running Apache Spark on OpenShift: A Hands-On Guide from Notebooks to Spark Operator

OpenShift Feb 26, 2026

Authors: Motohiro Abe, Monson Xavier

Introduction

Apache Spark is one of the most widely used engines for large-scale data processing. Running Spark in a production environment requires careful consideration of scaling, monitoring, and resource management.

This blog explores one approach: using the Spark Operator on OpenShift, combined with OpenShift AI for notebook-based development. We will walk through a real-world Telecom CDR data pipeline, showing how the same ETL logic moves from a Jupyter notebook in client mode to a production-ready SparkApplication.

The blog is split into two parts, each targeting a different persona:

  • Part 1 is for Data Engineers who want to build and test Spark ETL logic using a Jupyter notebook in client mode.

  • Part 2 is for ML Engineers who need a scalable, production-ready pipeline using the Spark Operator.

All manifests, notebooks, and setup instructions are available in the GitHub repository linked in each section.

Acknowledgments

This project builds on and reuses content from the following upstream repositories:

What is Apache Spark?

Apache Spark is an open-source distributed computing engine designed for large-scale data processing. It provides high-level APIs in Python, Scala, Java, and R, and supports batch processing, streaming, machine learning, and graph processing within a single framework.

A simple Spark application looks like this:


from pyspark.sql import SparkSession

# Create (or reuse) the session — the entry point for all Spark operations
spark = SparkSession.builder.appName("SimpleApp").getOrCreate()

# Read a text file into a DataFrame, one row per line
df = spark.read.text("README.md")

num_lines = df.count()
print(f"Total lines: {num_lines}")

spark.stop()

The SparkSession is the entry point for all Spark operations. From here, you can read data, apply transformations, and write results — across a single machine or a distributed cluster without changing your code.

For more details, see the Apache Spark Official Documentation.

So, How Does Spark Work on OpenShift?

Spark applications run as a set of processes distributed across a cluster, coordinated by the Driver program via SparkContext.

When running on OpenShift:

  1. The driver connects to the Kubernetes API.

  2. Executor pods are created through that API; in cluster mode, the Spark Operator also creates the driver pod itself, based on your SparkApplication Custom Resource (CR).

  3. Application code (JAR or Python files) is sent to the executors.

  4. Executors run the tasks and return results to the driver.
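In cluster mode, the steps above are driven by a SparkApplication custom resource that you submit to the operator. A minimal sketch of such a CR (the image, namespace, service account, and file path here are placeholders, not values from this demo):

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: simple-app
  namespace: spark-demo            # placeholder namespace
spec:
  type: Python
  mode: cluster                    # driver runs as a pod inside the cluster
  image: "spark:3.5.0"             # placeholder Spark image
  mainApplicationFile: "local:///opt/spark/app/simple_app.py"
  sparkVersion: "3.5.0"
  driver:
    cores: 1
    memory: "1g"
    serviceAccount: spark          # needs RBAC to create executor pods
  executor:
    instances: 2
    cores: 1
    memory: "1g"
```

The operator watches for resources of this kind, creates the driver pod from the spec, and the driver then requests its executors from the Kubernetes API.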

Figure: Spark components on OpenShift (spark-component.png)

This blog covers two modes for running Spark on OpenShift:

  • Client Mode — the driver runs inside the Jupyter notebook environment. Good for development and interactive exploration.

  • Cluster Mode — the driver runs inside the cluster as a pod, managed by the Spark Operator. Suitable for production workloads.
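In client mode, the notebook's SparkSession has to be pointed at the Kubernetes API, and at the notebook pod itself so that executors can call back to the driver. A sketch of the configuration properties typically involved (all values are placeholders; the driver service name, namespace, and image depend on your setup, and each entry would be passed to `SparkSession.builder.config(key, value)` in the notebook):

```python
# Spark properties for client mode on Kubernetes (placeholder values).
client_mode_conf = {
    "spark.master": "k8s://https://kubernetes.default.svc:443",
    "spark.submit.deployMode": "client",
    "spark.driver.host": "my-notebook-svc",             # headless service for the notebook pod
    "spark.driver.port": "7078",
    "spark.kubernetes.namespace": "spark-demo",         # placeholder namespace
    "spark.kubernetes.container.image": "spark:3.5.0",  # placeholder executor image
    "spark.executor.instances": "2",
}

# Spark expects every configuration value as a string
assert all(isinstance(v, str) for v in client_mode_conf.values())
```

The important design point is `spark.driver.host`: executors are pods that must resolve and reach the driver, so the notebook pod needs a stable, routable address inside the cluster.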

Blog Structure & Personas

This blog is split into two parts, each targeting a different persona.

Part 1 — Data Engineer You are building and testing Spark ETL logic. You want to iterate quickly without dealing with Kubernetes complexity. This section uses an OpenShift AI Jupyter notebook with Spark in client mode to develop and validate a pipeline against sample Telecom CDR data.

Part 2 — ML Engineer You are taking a validated pipeline to production. You need scaling, resource management, and monitoring. This section uses the Spark Operator to submit SparkApplication workloads, with Grafana for observability.

Part 1 — Data Engineer: Spark ETL from a Jupyter Notebook

This section is for Data Engineers who want to build and test Spark ETL logic using an OpenShift AI Jupyter notebook in client mode.

What You Will See in the Demo

The demo walks through a complete Spark ETL pipeline against Telecom Call Detail Record (CDR) data:

  • Setting up a Spark session in client mode inside an OpenShift AI Jupyter notebook

  • Extracting CDR records from a MySQL database

  • Transforming and writing Parquet files to object storage (MinIO/S3)

  • Running analytics on the processed data

  • Running the notebooks in sequence using an Elyra pipeline
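The extract step amounts to pointing Spark's JDBC reader at the MySQL database. A minimal sketch of how those reader options are assembled (the helper name, host, database, and credentials are placeholders, not taken from the demo; in the notebook the resulting dict would be passed to `spark.read.format("jdbc").options(**opts).load()`):

```python
def mysql_jdbc_options(host: str, database: str, table: str,
                       user: str, password: str) -> dict:
    """Build the option dict for Spark's JDBC data source (illustrative helper)."""
    return {
        "url": f"jdbc:mysql://{host}:3306/{database}",
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "com.mysql.cj.jdbc.Driver",  # MySQL Connector/J driver class
    }

# Placeholder connection details for a CDR table
opts = mysql_jdbc_options("mysql.spark-demo.svc", "cdr", "call_records",
                          "etl_user", "secret")
print(opts["url"])  # jdbc:mysql://mysql.spark-demo.svc:3306/cdr
```

The transform-and-load side is symmetric: the loaded DataFrame is cleaned up and written out with `df.write.parquet("s3a://...")`, with the `s3a` endpoint and credentials configured to point at MinIO.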

GitHub Repository

Full notebooks, manifests, and setup instructions are available here: spark-etl-ods-demo

Part 2 — ML Engineer: Production-Ready Spark with Operator

This section is for ML Engineers who need a scalable, production-ready Spark pipeline. The same Telecom CDR ETL logic from Part 1 is packaged and submitted as a SparkApplication using the Spark Operator.

What You Will See in the Demo

  • Submitting the ETL workload as a SparkApplication custom resource (CR)

  • Mounting Python job files from a ConfigMap — no image rebuild needed for transformation changes

  • Submitting the analytics workload as a second SparkApplication

  • Monitoring job performance and resource usage with Grafana

The overall flow looks like this:

Figure: Spark ETL flow with the Spark Operator (spark-etl-flow.svg)
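Mounting the job file from a ConfigMap is what lets you change transformation logic without rebuilding the container image. A sketch of the relevant SparkApplication fields (the ConfigMap name and mount path are placeholders, not values from this demo):

```yaml
spec:
  # The driver runs the file served from the mounted ConfigMap
  mainApplicationFile: "local:///opt/spark/jobs/etl_job.py"
  volumes:
    - name: job-files
      configMap:
        name: etl-job-files        # placeholder ConfigMap holding etl_job.py
  driver:
    volumeMounts:
      - name: job-files
        mountPath: /opt/spark/jobs
```

Updating the ConfigMap and resubmitting the SparkApplication picks up the new job code, so the image only changes when dependencies do.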

GitHub Repository

Full manifests, SparkApplication CRs, and a runbook are available here: spark-etl-operator-demo

Conclusion

In this blog, we walked through two approaches to running Spark on OpenShift — from interactive notebook development to production-ready SparkApplication deployments.

The key takeaway is that the same ETL logic written in a Jupyter notebook can be promoted to a production workload without rewriting your code.

OpenShift AI provides the development environment, and the Spark Operator handles the rest.

As infrastructure and data platforms continue to converge, being able to run and experiment with your own Spark environment on OpenShift is a valuable skill for both Data Engineers and ML Engineers.

We hope this gives you a solid starting point. All code, manifests, and runbooks are available in the GitHub repositories linked throughout this blog.
