I reckon I might be lacking a fundamental piece of understanding here.
So far, I have my own pyspark kernels running in k8s (with my local jupyter lab connected to the EG), and inside a kernel I managed to read a parquet file from s3 (i.e. I am using a kernel image with all the necessary jars installed).
However, I thought it would be very useful to have Hive table functionality available.
My first question: is the following a misconception or bad practice? I can see that just creating dataframes and then saving them as parquet files on s3 (to persist work between sessions and to share it between multiple users) is possible. However, being able to save data as tables and then call listTables() seems powerful and a lot more convenient. Is this already a misconception and not the intended use?
Assuming the above is fine, what I tried to do is to select a persistent path for spark.sql.warehouse.dir, enable Hive support, and just give it a try.
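from pyspark import SparkConf
from pyspark.sql import SparkSession
# s3_endpoint_loc, s3_access_key, s3_secret_key and app_name are set earlier in the notebook
conf = SparkConf()
conf.set("spark.hadoop.fs.s3a.endpoint", s3_endpoint_loc)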
conf.set("spark.hadoop.fs.s3a.access.key", s3_access_key)
conf.set("spark.hadoop.fs.s3a.secret.key", s3_secret_key)
conf.set("spark.hadoop.fs.s3a.path.style.access", "true")
conf.set("spark.hadoop.fs.s3a.connection.maximum", 20)
conf.set("spark.hadoop.fs.s3a.attempts.maximum", 20)
conf.set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
warehouse_loc = "s3a://<bucket_name>/"
conf.set("spark.sql.warehouse.dir", warehouse_loc)
spark = SparkSession.builder.appName(app_name).config(conf=conf).enableHiveSupport().getOrCreate()
What happened is that this setup seemed to work initially, i.e. I was able to create a table and then read it back. The parquet files were created in my s3 bucket. However, if I run the notebook again, or run another notebook, the state is gone.
In particular, spark.catalog.listTables() returns an empty list. However, the data is still there in the s3 bucket, and I even get an exception if I try to create the table again, because the path is already taken.
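To make the symptom concrete, this is roughly the check that can be run in a fresh session. It is only a sketch: `describe_catalog` is a hypothetical helper name (not part of pyspark), and it assumes an already created `SparkSession` is passed in.

```python
def describe_catalog(spark):
    """Collect what a session knows about its catalog (hypothetical helper)."""
    return {
        # "hive" when Hive support is active, "in-memory" otherwise
        "catalog_impl": spark.conf.get("spark.sql.catalogImplementation"),
        # location where managed table data is written
        "warehouse_dir": spark.conf.get("spark.sql.warehouse.dir"),
        # names of the tables registered in the current database
        "tables": [t.name for t in spark.catalog.listTables()],
    }
```

In a new kernel, `tables` comes back empty for me even though `warehouse_dir` still points at the bucket that holds the parquet files.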
I assume I am missing a crucial component in my setup. Could you please point me in the right direction, and possibly explain how to set that up?
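My current guess, which may well be wrong, is that the table metadata lives in a metastore that does not survive the kernel pod. As far as I understand, when Hive support is enabled without a hive-site.xml, Spark creates an embedded Derby metastore (metastore_db) in the working directory, so only the data files end up in the warehouse path. The sketch below shows how the settings could be collected with an optional pointer to an external Hive metastore. The helper names and the example `thrift://` address are hypothetical, and I have not verified that `spark.hadoop.hive.metastore.uris` is the right key for this setup.

```python
def build_spark_settings(s3_endpoint, s3_access_key, s3_secret_key,
                         warehouse_loc, metastore_uri=None):
    """Return Spark settings as a dict (hypothetical helper)."""
    settings = {
        "spark.hadoop.fs.s3a.endpoint": s3_endpoint,
        "spark.hadoop.fs.s3a.access.key": s3_access_key,
        "spark.hadoop.fs.s3a.secret.key": s3_secret_key,
        "spark.hadoop.fs.s3a.path.style.access": "true",
        "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
        "spark.sql.warehouse.dir": warehouse_loc,
    }
    if metastore_uri is not None:
        # Assumption: a Hive metastore service is reachable at this address,
        # e.g. "thrift://hive-metastore:9083" (made-up host name).
        settings["spark.hadoop.hive.metastore.uris"] = metastore_uri
    return settings


def build_session(app_name, settings):
    """Create a Hive-enabled session from the settings dict."""
    from pyspark.sql import SparkSession  # imported here so the dict helper works without pyspark

    builder = SparkSession.builder.appName(app_name).enableHiveSupport()
    for key, value in settings.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

If this is the right direction, every kernel would be started with the same `metastore_uri`, so that table definitions are shared in the same way the parquet files already are.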