There is a lot to find on the net about talking to Hive from Spark. Sadly, most of it refers to Spark before version 2 or is not valid for HDP3. You need to use the Hive Warehouse Connector, bundled in HDP3.
This is an example of a minimalistic connection from PySpark to Hive on HDP3.
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf

# Yes, llap even if you do not use it.
from pyspark_llap import HiveWarehouseSession

settings = [
    ('spark.sql.hive.hiveserver2.jdbc.url', 'jdbc:hive2://{your_hiveserver2_url:port}/default'),
]

conf = SparkConf().setAppName("Pyspark and Hive!").setAll(settings)

# Spark 2: use SparkSession instead of SparkContext.
spark = (
    SparkSession
    .builder
    .config(conf=conf)
    .master('yarn')
    # There is no HiveContext anymore either.
    .enableHiveSupport()
    .getOrCreate()
)

# This is mandatory. Just using spark.sql will not be enough.
hive = HiveWarehouseSession.session(spark).build()

hive.showDatabases().show()
hive.execute("select 2 group by 1 order by 1").show()

spark.stop()
You can then run this with the following command:
HDP_VERSION=3.0.1.0-187 \
PYSPARK_PYTHON=python3 \
HADOOP_USER_NAME=hive \
SPARK_HOME=/usr/hdp/current/spark2-client \
spark-submit \
  --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.0.1.0-187.jar \
  --py-files /usr/hdp/current/hive_warehouse_connector/pyspark_hwc-1.0.0.3.0.1.0-187.zip \
  {your_python_script.py}
Note:
- HDP_VERSION is needed when you use Python 3. If not set, HDP uses a script (/usr/bin/hdp-select) which is Python 2 only (although fixing it is trivial).
- PYSPARK_PYTHON is optional; it will default to just python otherwise (which might or might not be Python 3 on your server).
- Without HADOOP_USER_NAME the script will run as your current user. Alternatively, you could sudo first.
- Without SPARK_HOME some jars would not be found and you would end up with an error like: py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext. : java.lang.NoClassDefFoundError: com/sun/jersey/api/client/config/ClientConfig
- --jars and --py-files: as you can see, the HDP version is part of the file names. Make sure you are using the proper one for your cluster.
- There is no --master option; this is handled in the script while building the SparkSession.
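Once this minimal connection works, the next step is usually to move actual data between Hive and Spark. Here is a small sketch of how that can look with the connector. It is not part of the original script: the table names demo.src and demo.dst are made up, and depending on your cluster (in particular whether LLAP is really running) you may have to use hive.executeQuery instead of hive.execute for reads.

from pyspark.sql import SparkSession
from pyspark_llap import HiveWarehouseSession

spark = (
    SparkSession
    .builder
    .appName("HWC read and write sketch")
    .master('yarn')
    .config('spark.sql.hive.hiveserver2.jdbc.url',
            'jdbc:hive2://{your_hiveserver2_url:port}/default')
    .enableHiveSupport()
    .getOrCreate()
)
hive = HiveWarehouseSession.session(spark).build()

# Read: execute() goes over JDBC and only returns a limited number of rows
# by default; executeQuery() is the LLAP-backed variant for big result sets.
df = hive.execute("select * from demo.src")
df.show()

# Write: hand the DataFrame to the connector and name the target Hive table.
(
    df.write
    .format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector")
    .option("table", "demo.dst")
    .mode("append")
    .save()
)

spark.stop()

The long format string is the data source class shipped in the assembly jar, so the same --jars and --py-files options as above are still needed when submitting this.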
There is some documentation from Hortonworks you can follow to go further: Integrating Apache Hive with Spark and BI.
Just before I posted this article, a new write-up appeared on Hortonworks.com describing some use cases for the Hive Warehouse Connector.
Excellent post. I have been trying to get Spark and Hive working together for my upcoming certification. All I have is an HDP 3.0.1 Docker image, and whenever I try to use HWC all I get is a permission denied exception.
Could you please let me know a workaround for that permission error when trying to access Hive from Spark?
And by the way, do we still need to use enableHiveSupport?
enableHiveSupport is indeed needed.
It’s hard to give a workaround without seeing your error, but usually, this is due to the user you are running the Spark job as (HADOOP_USER_NAME) not having the proper permissions to read some files or directories on HDFS.
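If you want to narrow it down, here is a quick sketch (not from the post itself, and the JDBC URL placeholder is the same as above) to confirm which user the job actually runs as, since that is the user HDFS (and Ranger, on HDP3) checks permissions against:

from pyspark.sql import SparkSession
from pyspark_llap import HiveWarehouseSession

spark = (
    SparkSession
    .builder
    .master('yarn')
    .config('spark.sql.hive.hiveserver2.jdbc.url',
            'jdbc:hive2://{your_hiveserver2_url:port}/default')
    .enableHiveSupport()
    .getOrCreate()
)

# This is the user permissions are checked against, and the one that
# HADOOP_USER_NAME overrides when set on spark-submit.
print("Running as:", spark.sparkContext.sparkUser())

# If this already throws the permission error, the query itself is not the
# problem; that user lacks rights on the Hive warehouse files or databases.
hive = HiveWarehouseSession.session(spark).build()
hive.showDatabases().show()

spark.stop()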
Hey,
thanks for such a useful post.
Can you tell me how I can import the pyspark_llap library in my HDFS?
I get an error saying “No module named ‘pyspark_llap’”.
How can I get around that?
Same issue on my side: how can I install / download this “pyspark_llap” library?
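For what it is worth, pyspark_llap is not a regular pip package; it ships with HDP inside the pyspark_hwc zip that the spark-submit example above passes with --py-files. As an alternative, here is a sketch (same path and version as in that example, adjust them to your HDP install) that adds the zip at runtime instead:

from pyspark.sql import SparkSession

spark = (
    SparkSession
    .builder
    .master('yarn')
    .enableHiveSupport()
    .getOrCreate()
)

# Make the module importable without --py-files by shipping the zip from
# the driver; the path below is the one used in the spark-submit example.
spark.sparkContext.addPyFile(
    "/usr/hdp/current/hive_warehouse_connector/pyspark_hwc-1.0.0.3.0.1.0-187.zip"
)

# Import only after the zip has been added.
from pyspark_llap import HiveWarehouseSession
hive = HiveWarehouseSession.session(spark).build()

The JVM side of the connector, the assembly jar, still has to be on the classpath (the --jars option shown earlier), otherwise building the session will fail.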