Hi,
We are trying to read Iceberg Hive tables with Apache Spark from a Jupyter notebook pod running on Kubernetes.
Spark runs on an external YARN cluster, and when we try to read the Iceberg Hive tables the job shows as Failed in the YARN application logs.
The Spark code we tried is as follows:
import os
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType,IntegerType,StructType,StructField
from pyspark.sql import functions as f
from pyspark.sql import Window
spark = (SparkSession.builder.master("yarn").appName("iceberg_test")
.config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.4.3")
.config("spark.jars", "/usr/hdp/3.1.4.0-315/spark3/jars/iceberg-spark-runtime-3.4_2.12-1.4.3.jar,/usr/hdp/3.1.4.0-315/hive/lib/iceberg-hive-runtime-1.4.3.jar,/usr/hdp/3.1.4.0-315/spark3/jars/hive-serde-2.3.9.jar")
.config("spark.sql.catalog.spark_catalog.type", "hive")
.config("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog")
.config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
.config("spark.sql.catalog.local.type", "hadoop")
.config("spark.sql.catalog.local.warehouse", "$PWD/warehouse")
.config("iceberg.hive.engine.enabled", "true")
.enableHiveSupport()
.getOrCreate()
)
test_df = spark.sql("select * from icebergdb.default")
test_df.show()
The Spark job is submitted with master set to "yarn", so I have placed the Iceberg runtime jars on the YARN cluster, and the spark.jars paths in the code point to the jars on the remote YARN cluster, not on the Jupyter pod.
Is this approach correct, or am I missing any points that should be taken into consideration?
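For comparison, here is a stripped-down sketch of the same read that relies only on spark.jars.packages, so Spark resolves and ships the Iceberg runtime itself instead of depending on jar paths on the cluster nodes. The spark.sql.extensions value is the extension class from the Iceberg documentation, and icebergdb.default is just the table from the snippet above:

from pyspark.sql import SparkSession

# Minimal session: Iceberg runtime pulled via spark.jars.packages,
# Hive metastore used through the Iceberg SparkSessionCatalog.
spark = (SparkSession.builder.master("yarn").appName("iceberg_test_minimal")
.config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.4.3")
.config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
.config("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog")
.config("spark.sql.catalog.spark_catalog.type", "hive")
.enableHiveSupport()
.getOrCreate())

# Same table as above, read through the session catalog.
spark.sql("select * from icebergdb.default").show()

Would the jar handling be the expected difference between this and the original configuration, or is something else likely causing the YARN failure?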