Hi,
We are trying to read Iceberg Hive tables with Apache Spark from a Jupyter notebook pod running on Kubernetes.
Spark runs on an external YARN cluster, and when we try to read the Iceberg Hive tables the job shows as Failed in the YARN application logs.
The Spark code we tried is as follows:
import os
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType,IntegerType,StructType,StructField
from pyspark.sql import functions as f
from pyspark.sql import Window
spark = (SparkSession.builder.master("yarn").appName("iceberg_test")
.config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.4.3")
.config("spark.jars", "/usr/hdp/3.1.4.0-315/spark3/jars/iceberg-spark-runtime-3.4_2.12-1.4.3.jar,/usr/hdp/3.1.4.0-315/hive/lib/iceberg-hive-runtime-1.4.3.jar,/usr/hdp/3.1.4.0-315/spark3/jars/hive-serde-2.3.9.jar")
.config("spark.sql.catalog.spark_catalog.type", "hive")
.config("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog")
.config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
.config("spark.sql.catalog.local.type", "hadoop")
.config("spark.sql.catalog.local.warehouse", "$PWD/warehouse")
.config("iceberg.hive.engine.enabled", "true")
.enableHiveSupport()
.getOrCreate()
)
test_df = spark.sql("select * from icebergdb.default")
test_df.show()
The Spark job is submitted with master set to "yarn", so I have placed the Iceberg runtime jars on the YARN cluster, and the spark.jars paths in the code point to the jars on the remote YARN cluster, not on the Jupyter pod.
Is this approach correct, or am I missing any points that should be taken into consideration?
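For comparison, here is a stripped-down sketch of the same read that relies only on spark.jars.packages, so Spark resolves and ships the Iceberg runtime itself instead of depending on jar paths on the cluster nodes. The spark.sql.extensions value is the extension class from the Iceberg documentation, and icebergdb.default is just the table from the snippet above:

from pyspark.sql import SparkSession

# Minimal session: Iceberg runtime pulled via spark.jars.packages,
# Hive metastore used through the Iceberg SparkSessionCatalog.
spark = (SparkSession.builder.master("yarn").appName("iceberg_test_minimal")
.config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.4.3")
.config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
.config("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog")
.config("spark.sql.catalog.spark_catalog.type", "hive")
.enableHiveSupport()
.getOrCreate())

# Same table as above, read through the session catalog.
spark.sql("select * from icebergdb.default").show()

Would the jar handling be the expected difference between this and the original configuration, or is something else likely causing the YARN failure?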