There is a lot to find on the net about talking to Hive from Spark. Sadly, most of it refers to Spark before version 2 or is not valid for HDP3. On HDP3 you need to use the Hive Warehouse Connector, which comes bundled with the platform.
This is a minimalistic example of connecting from pyspark to Hive on HDP3.
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf

# Yes, llap even if you do not use it.
from pyspark_llap import HiveWarehouseSession

settings = [
    ('spark.sql.hive.hiveserver2.jdbc.url',
     'jdbc:hive2://{your_hiveserver2_url:port}/default'),
]

conf = SparkConf().setAppName("Pyspark and Hive!").setAll(settings)

# Spark 2: use SparkSession instead of SparkContext.
spark = (
    SparkSession
    .builder
    .config(conf=conf)
    .master('yarn')
    # There is no HiveContext anymore either.
    .enableHiveSupport()
    .getOrCreate()
)

# This is mandatory. Just using spark.sql will not be enough.
hive = HiveWarehouseSession.session(spark).build()

hive.showDatabases().show()
hive.execute("select 2 group by 1 order by 1").show()

spark.stop()
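The same session object can then move DataFrames in and out of Hive. The sketch below is not part of the minimal example above (it would have to run before the spark.stop() call): the database and table names are placeholders of mine, and the write path uses the connector format constant as shown in the Hortonworks documentation, so double-check it against your HDP version.

# Hedged sketch: my_db and my_table are placeholders, adjust to your cluster.
hive.setDatabase("my_db")

# execute() returns a Spark DataFrame and is fine for small result sets.
# For large reads the documented executeQuery() goes through LLAP instead.
df = hive.execute("select * from my_table limit 10")
df.show()

# Writing a DataFrame back to Hive goes through the connector data source.
(
    df.write
    .format(HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR)
    .option("table", "my_table_copy")
    .save()
)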
You can then run this with the following command:
HDP_VERSION=3.0.1.0-187 \
PYSPARK_PYTHON=python3 \
HADOOP_USER_NAME=hive \
SPARK_HOME=/usr/hdp/current/spark2-client \
spark-submit \
  --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.0.1.0-187.jar \
  --py-files /usr/hdp/current/hive_warehouse_connector/pyspark_hwc-1.0.0.3.0.1.0-187.zip \
  {your_python_script.py}
Note:
- HDP_VERSION is needed when you use python 3. If not set, HDP uses a script (/usr/bin/hdp-select) which is python 2 only (although fixing it is trivial).
- PYSPARK_PYTHON is optional; if not set, it defaults to plain python (which may or may not be python 3 on your server).
- without HADOOP_USER_NAME the script will run as your current user. Alternatively, you could sudo first.
- without SPARK_HOME some jars would not be found, and you would end up with an error like: py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext. : java.lang.NoClassDefFoundError: com/sun/jersey/api/client/config/ClientConfig
- --jars and --py-files: as you can see, the HDP version is part of the file names. Make sure you use the ones matching your installation.
- there is no --master option; it is handled in the script while building the SparkSession.
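One last remark about the script itself: the HiveServer2 JDBC URL is hardcoded in the settings list. If you prefer to keep the script identical across clusters, you can read the URL from the environment and export it next to the other variables on the spark-submit line. This is a small sketch of mine, not part of the original example; the HIVE_JDBC_URL variable name is an arbitrary choice.

import os

# Hypothetical environment variable of my own naming; export it before
# spark-submit, e.g. HIVE_JDBC_URL=jdbc:hive2://host:10000/default
jdbc_url = os.environ['HIVE_JDBC_URL']

settings = [
    ('spark.sql.hive.hiveserver2.jdbc.url', jdbc_url),
]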
There is some doc from Hortonworks you can follow to go further: Integrating Apache Hive with Spark and BI.
Just before I posted this article, a new write-up appeared on Hortonworks.com to describe some use cases for the Hive-Warehouse-Connector.