There is a lot to find on the net about talking to Hive from Spark. Sadly, most of it refers to Spark before version 2 or is not valid for HDP 3. You need to use the Hive Warehouse Connector, bundled in HDP 3.
This is an example of a minimalistic connection from pyspark to Hive on HDP 3.
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
# Yes, llap even if you do not use it.
from pyspark_llap import HiveWarehouseSession
settings = [
    ('spark.sql.hive.hiveserver2.jdbc.url',
     'jdbc:hive2://{your_hiveserver2_url:port}/default'),
]
conf = SparkConf().setAppName("Pyspark and Hive!").setAll(settings)
# Spark 2: use SparkSession instead of SparkContext.
spark = (
    SparkSession
    .builder
    .config(conf=conf)
    .master('yarn')
    # There is no HiveContext anymore either.
    .enableHiveSupport()
    .getOrCreate()
)
# This is mandatory. Just using spark.sql will not be enough.
hive = HiveWarehouseSession.session(spark).build()
hive.showDatabases().show()
hive.execute("select 2 group by 1 order by 1").show()
spark.stop()
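The hive object gives you regular Spark DataFrames back, so you can keep working with them. A quick sketch, assuming the setDatabase, executeQuery and table calls documented by Hortonworks (these reads go through LLAP, unlike execute which goes over JDBC); the database and table names are made up:
# Hypothetical follow-up, to run before spark.stop().
hive.setDatabase('default')
# executeQuery returns a full Spark DataFrame (it reads through LLAP).
df = hive.executeQuery('select id, name from some_table where id > 10')
df.printSchema()
df.show(5)
# A whole Hive table as a DataFrame:
other = hive.table('some_other_table')
print(other.count())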
You can then run the script with the following command:
HDP_VERSION=3.0.1.0-187 \
PYSPARK_PYTHON=python3 \
HADOOP_USER_NAME=hive \
SPARK_HOME=/usr/hdp/current/spark2-client \
spark-submit \
--jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.0.1.0-187.jar \
--py-files /usr/hdp/current/hive_warehouse_connector/pyspark_hwc-1.0.0.3.0.1.0-187.zip \
{your_python_script.py}
Note:
- HDP_VERSION is needed when you use Python 3. If it is not set, HDP relies on a script (/usr/bin/hdp-select) which is Python 2 only (although fixing it is trivial).
- PYSPARK_PYTHON is optional; it defaults to plain python otherwise, which might or might not be Python 3 on your server (the small check after this list shows what the driver actually sees).
- without HADOOP_USER_NAME the script will run as your current user. Alternatively, you could sudo first.
- without SPARK_HOME some jars would not be found and you would end up with an error like:
  py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
  : java.lang.NoClassDefFoundError: com/sun/jersey/api/client/config/ClientConfig
- --jars and --py-files: as you can see, the file names contain the HDP version. Make sure you use the one matching your cluster.
- there is no --master option; this is handled in the script while building the SparkSession.
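The check mentioned above is nothing more than a few prints at the top of your script; the variable names are the ones from the command line, everything else is just a hypothetical debugging helper:
import os
import sys

# Which interpreter runs the driver (reflects PYSPARK_PYTHON, or
# PYSPARK_DRIVER_PYTHON if you set that one as well).
print('python:', sys.executable)
# The user name the Hadoop client libraries will pick up, if set.
print('HADOOP_USER_NAME:', os.environ.get('HADOOP_USER_NAME'))
# Where the Spark client and its jars are taken from.
print('SPARK_HOME:', os.environ.get('SPARK_HOME'))
print('HDP_VERSION:', os.environ.get('HDP_VERSION'))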
There is some documentation from Hortonworks you can follow to go further: Integrating Apache Hive with Spark and BI.
Just before I posted this article, a new write-up appeared on Hortonworks.com describing some use cases for the Hive Warehouse Connector.
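One of those use cases is writing a Spark DataFrame into a Hive managed table through the connector. A rough sketch, assuming the HIVE_WAREHOUSE_CONNECTOR data source constant from the Hortonworks documentation; the DataFrame and the table name are made up:
# Hypothetical write example, using the same SparkSession as in the script above.
df = spark.createDataFrame([(1, 'a'), (2, 'b')], ['id', 'name'])
(
    df.write
    .format(HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR)
    .option('table', 'my_managed_table')  # made-up name
    .save()
)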