
Running Apache Spark on OpenShift: A Hands-On Guide from Notebooks to Spark Operator

OpenShift Feb 26, 2026

Authors: Motohiro Abe, Monson Xavier

Introduction

Apache Spark is one of the most widely used engines for large-scale data processing. Running Spark on Kubernetes requires careful consideration of cluster setup, resource management, and job submission.

In this blog, we walk through how to run Spark on OpenShift using the Kubeflow Spark Operator, integrated with OpenShift AI for notebook-based development. We use Jupyter Enterprise Gateway to submit Spark jobs as SparkApplication custom resources (CRs) directly from a Jupyter notebook, running in cluster mode on OpenShift.

To demonstrate the setup, we use a real-world Telecom Call Detail Record (CDR) pipeline — extracting data from MySQL, transforming it, and writing results to object storage.

All manifests, notebooks, and setup instructions are available in the GitHub repository linked throughout this blog.

Acknowledgments

This project learned from and reused content from the following upstream repositories:

What is Apache Spark?

Apache Spark is an open-source distributed computing engine designed for large-scale data processing. It provides high-level APIs in Python, Scala, Java, and R, and supports batch processing, streaming, machine learning, and graph processing within a single framework.

A simple Spark application looks like this:

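from pyspark.sql import SparkSession

# Create a session, or reuse one that already exists
spark = SparkSession.builder.appName("SimpleApp").getOrCreate()

# Read a text file into a DataFrame with one row per line
df = spark.read.text("README.md")
num_lines = df.count()
print(f"Total lines: {num_lines}")

# Release the resources held by the session
spark.stop()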

The SparkSession is the entry point for all Spark operations. From here, you can read data, apply transformations, and write results. The same application code runs on a single machine or on a distributed cluster; only the deployment configuration changes.

For more details, see the Apache Spark Official Documentation.

So, How Does Spark Work on OpenShift?

Spark applications run as a set of processes distributed across a cluster, coordinated by the Driver program via SparkContext.

When running on OpenShift:

  1. The Spark Operator watches for SparkApplication Custom Resources (CRs) and launches a driver pod for each one.

  2. The driver connects to the Kubernetes API and requests executor pods, sized according to your SparkApplication CR.

  3. Application code (JAR or Python files) is sent to the executors.

  4. Executors run the tasks and return results to the driver.

[Figure: spark-component.png]
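
To make the relationship between the CR and the pods concrete, here is a minimal sketch of a SparkApplication manifest, built as a Python dictionary and printed as JSON. The field names follow the Spark Operator's v1beta2 API as we understand it. The application name, namespace, image, file path, service account, Spark version, and resource sizes are all placeholders, not values taken from the demo repositories.

```python
import json


def spark_application(name: str, namespace: str, image: str,
                      main_file: str, executors: int = 2) -> dict:
    """Build a minimal SparkApplication custom resource as a plain dict."""
    return {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "type": "Python",
            "mode": "cluster",              # the driver runs in its own pod
            "image": image,
            "mainApplicationFile": main_file,
            "sparkVersion": "3.5.0",        # assumption: should match the image
            "driver": {
                "cores": 1,
                "memory": "1g",
                "serviceAccount": "spark",  # hypothetical; needs RBAC to create pods
            },
            "executor": {
                "instances": executors,     # how many executor pods to request
                "cores": 1,
                "memory": "2g",
            },
        },
    }


manifest = spark_application(
    name="simple-app",                                  # placeholder
    namespace="spark-demo",                             # placeholder
    image="registry.example.com/spark-py:3.5.0",        # placeholder
    main_file="local:///opt/spark/app/simple_app.py",   # placeholder
)
print(json.dumps(manifest, indent=2))
```

The `executor.instances` value is what ends up as the number of executor pods, while the `driver` block describes the single pod that coordinates them.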

Integration with Jupyter Enterprise Gateway

To run PySpark code directly from a Jupyter notebook on OpenShift AI, we add Jupyter Enterprise Gateway to the architecture. The notebook forwards execution requests to the gateway, which submits a SparkApplication CR to the Spark Operator on behalf of the user. Results flow back to the notebook cell that issued the request.

For more details, see the Kubeflow Spark Operator — Integration with Kubeflow Notebooks.
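
From the notebook server's point of view, the gateway is a remote kernel service with a Jupyter-compatible REST API. The sketch below builds, but does not send, the kind of kernel start request a client makes to the gateway. The gateway URL, kernelspec name, user name, and namespace are hypothetical; in practice the notebook server issues this call for you once it is configured to use the gateway.

```python
import json
import urllib.request

# Hypothetical in-cluster address of the Enterprise Gateway service.
GATEWAY_URL = "http://enterprise-gateway.example.svc:8888"


def kernel_start_request(kernelspec: str, username: str,
                         namespace: str) -> urllib.request.Request:
    """Build a POST /api/kernels request asking the gateway to start a kernel."""
    body = {
        "name": kernelspec,
        # KERNEL_* variables are passed through to the kernel launcher.
        "env": {
            "KERNEL_USERNAME": username,
            "KERNEL_NAMESPACE": namespace,
        },
    }
    return urllib.request.Request(
        url=f"{GATEWAY_URL}/api/kernels",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumption: a kernelspec with this name is installed on the gateway.
request = kernel_start_request("spark_python_operator", "alice", "spark-demo")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Nothing is sent over the network here; passing `request` to `urllib.request.urlopen` would perform the call against a reachable gateway.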

Blog Structure

This blog is split into two parts:

Part 1: Spark ETL from a Jupyter Notebook
For when you are building and testing Spark ETL logic from a familiar notebook interface. This section uses an OpenShift AI Jupyter notebook with Jupyter Enterprise Gateway, which submits Spark jobs as SparkApplication CRs to the Spark Operator, running in cluster mode on OpenShift. We use a real-world Telecom CDR dataset to demonstrate the pipeline.

Part 2: Spark ETL with Spark Operator
For when you are running a scalable, repeatable pipeline without a notebook interface. This section submits the same Telecom CDR pipeline directly as SparkApplication CRs using the Spark Operator. Python job files are mounted from a ConfigMap, and Grafana is used for monitoring job performance and resource usage.

Part 1: Spark ETL from a Jupyter Notebook

This section covers running Spark ETL logic from an OpenShift AI Jupyter notebook using Jupyter Enterprise Gateway. The gateway submits Spark jobs as SparkApplication CRs to the Spark Operator, and the jobs run in cluster mode on OpenShift.

What You Will See in the Demo

The demo walks through a complete Spark ETL pipeline against Telecom Call Detail Record (CDR) data:

  • Setting up Jupyter Enterprise Gateway to submit Spark jobs from the notebook
  • Extracting CDR records from a MySQL database
  • Transforming and writing Parquet files to object storage (MinIO/S3)
  • Running analytics on the processed data
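
The extract, transform, and write steps listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the demo repository: the column names `duration_sec` and `call_start`, the helper names, and the shape of the transformation are hypothetical. The JDBC and `fs.s3a` option keys are standard Spark and Hadoop settings.

```python
def jdbc_options(host: str, database: str, table: str,
                 user: str, password: str) -> dict:
    """Options for Spark's built-in JDBC data source, pointed at MySQL."""
    return {
        "url": f"jdbc:mysql://{host}:3306/{database}",
        "dbtable": table,
        "user": user,
        "password": password,
        # MySQL Connector/J must be available on the Spark classpath.
        "driver": "com.mysql.cj.jdbc.Driver",
    }


def s3a_conf(endpoint: str, access_key: str, secret_key: str) -> dict:
    """Spark settings that point the s3a:// filesystem at a MinIO/S3 endpoint."""
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # Path-style addressing is typically required for MinIO.
        "spark.hadoop.fs.s3a.path.style.access": "true",
    }


def run_etl(spark, source: dict, target_path: str) -> None:
    """Extract CDR rows over JDBC, derive a date column, write partitioned Parquet."""
    # Imported here so the helpers above can be used without PySpark installed.
    from pyspark.sql import functions as F

    cdr = spark.read.format("jdbc").options(**source).load()
    cleaned = (
        cdr.filter(F.col("duration_sec") > 0)                  # hypothetical column
           .withColumn("call_date", F.to_date("call_start"))   # hypothetical column
    )
    cleaned.write.mode("overwrite").partitionBy("call_date").parquet(target_path)
```

With a live session, a call such as `run_etl(spark, jdbc_options(...), "s3a://<bucket>/<prefix>/")` would run the pipeline. Settings like those returned by `s3a_conf` would be supplied when the session is created, for example through the `sparkConf` section of the SparkApplication CR.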

GitHub Repository

Full notebooks, manifests, and setup instructions are available here:
spark-etl-ods-demo

Part 2: Spark ETL with Spark Operator

This section covers running the same Telecom CDR pipeline as a SparkApplication using the Spark Operator. Instead of submitting jobs from a notebook, workloads are submitted directly as SparkApplication CRs, giving you more control over scaling and resource management.
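
Submitting a CR directly usually means applying a YAML file with `oc apply -f`. As a minimal sketch of the same step in Python, the block below uses the Kubernetes Python client's `CustomObjectsApi`. The example manifest is a stripped-down placeholder, and `submit` assumes the `kubernetes` package is installed and that a kubeconfig with access to the cluster is available.

```python
GROUP = "sparkoperator.k8s.io"
VERSION = "v1beta2"
PLURAL = "sparkapplications"


def create_args(manifest: dict) -> dict:
    """Keyword arguments for CustomObjectsApi.create_namespaced_custom_object."""
    return {
        "group": GROUP,
        "version": VERSION,
        "namespace": manifest["metadata"]["namespace"],
        "plural": PLURAL,
        "body": manifest,
    }


def submit(manifest: dict) -> dict:
    """Create the SparkApplication; the operator then launches the driver pod."""
    # Imported lazily: needs the `kubernetes` package and cluster credentials.
    from kubernetes import client, config

    config.load_kube_config()  # inside a pod, use config.load_incluster_config()
    api = client.CustomObjectsApi()
    return api.create_namespaced_custom_object(**create_args(manifest))


# Placeholder manifest; a real one also needs image, driver, and executor settings.
example = {
    "apiVersion": f"{GROUP}/{VERSION}",
    "kind": "SparkApplication",
    "metadata": {"name": "cdr-etl", "namespace": "spark-demo"},
    "spec": {"type": "Python", "mode": "cluster"},
}
print(create_args(example)["plural"])
```

Only `create_args` runs here; calling `submit(example)` would contact the cluster.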

What You Will See in the Demo

  • Submitting the ETL workload as a SparkApplication CR

  • Mounting Python job files from a ConfigMap, so transformation changes need no image rebuild

  • Submitting the analytics workload as a second SparkApplication

  • Monitoring job performance and resource usage with Grafana

The overall flow looks like this:

[Figure: spark-etl-flow.svg]
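
The ConfigMap mount is what lets a transformation change ship without an image rebuild. A rough sketch of the two pieces involved is shown below: a ConfigMap holding the Python source, and a SparkApplication spec that mounts it into the driver and executor pods. The ConfigMap name, volume name, mount path, and file name are hypothetical. The `volumes` and `volumeMounts` fields follow the Spark Operator's v1beta2 API as we understand it.

```python
import copy


def job_configmap(name: str, namespace: str, files: dict) -> dict:
    """A ConfigMap whose keys are file names and whose values are Python source."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name, "namespace": namespace},
        "data": files,
    }


def mount_jobs(spec: dict, configmap_name: str,
               mount_path: str, main_file: str) -> dict:
    """Return a copy of a SparkApplication spec with the ConfigMap mounted."""
    volume = {"name": "jobs", "configMap": {"name": configmap_name}}
    mount = {"name": "jobs", "mountPath": mount_path}

    updated = copy.deepcopy(spec)  # leave the caller's spec untouched
    updated["volumes"] = updated.get("volumes", []) + [volume]
    for role in ("driver", "executor"):
        role_spec = updated.setdefault(role, {})
        role_spec["volumeMounts"] = role_spec.get("volumeMounts", []) + [mount]

    # local:// tells Spark the file is already present inside the container.
    updated["mainApplicationFile"] = f"local://{mount_path}/{main_file}"
    return updated


configmap = job_configmap(
    "cdr-jobs", "spark-demo", {"etl.py": "print('hello from etl')\n"}
)
spec = mount_jobs(
    {"type": "Python", "mode": "cluster", "driver": {}, "executor": {}},
    configmap_name="cdr-jobs",
    mount_path="/opt/spark/jobs",
    main_file="etl.py",
)
print(spec["mainApplicationFile"])
```

Updating the job then amounts to replacing the ConfigMap contents and resubmitting the SparkApplication; the container image stays the same.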

GitHub Repository

Full manifests, SparkApplication CRs, and the runbook are available here:
spark-etl-operator-demo

Conclusion

In this blog, we walked through two approaches to running Spark on OpenShift — from interactive notebook development using Jupyter Enterprise Gateway, to direct SparkApplication submissions with the Spark Operator.

The key takeaway is that the same ETL logic written in a Jupyter notebook can be promoted to a repeatable, scalable pipeline without rewriting your code. OpenShift AI provides the development environment, and the Spark Operator launches and manages the Spark jobs on the cluster.

We hope this gives you a practical starting point for running Spark on OpenShift. All code, manifests, and runbooks are available in the GitHub repositories linked throughout this blog.
