Assembling a Spark job (for Spark 1.6)
To run a Spark job on a Hadoop cluster you need an uber jar
with all the dependencies required by your job. However, some of those dependencies are supplied by Spark itself, so it is recommended to mark them as provided
in pom.xml.
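As a minimal sketch, a dependency marked as provided could look like this in pom.xml; the Scala suffix (_2.10) and the version (1.6.0) here are assumptions and should match the Spark installed on your cluster:

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.6.0</version>
      <!-- provided: the cluster's Spark installation supplies this at runtime,
           so it is not packaged into the uber jar -->
      <scope>provided</scope>
    </dependency>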
All dependencies provided by Spark are listed in the spark-assembly jar: there is a DEPENDENCIES file
in its META-INF folder (the jar is an ordinary zip archive, so you can print this file with, for example, unzip -p <spark-assembly jar> META-INF/DEPENDENCIES).
Part of this file:
- stream-lib (https://github.com/addthis/stream-lib) com.clearspring.analytics:stream:jar:2.7.0 License: Apache License, Version 2.0 (http://www.apache.org/licenses/LICENSE-2.0.txt)
- Kryo (http://code.google.com/p/kryo/) com.esotericsoftware.kryo:kryo:bundle:2.21 License: New BSD License (http://www.opensource.org/licenses/bsd-license.php) …
In case of the Hortonworks distribution, the path of this file is /usr/hdp/hadoop_version/spark/lib/spark-assembly-spark+hadoop_version.jar
Additionally, you do not need to include in your jar:
- spark-core
- spark-bagel
- spark-mllib
- spark-streaming
- spark-graphx
- spark-sql
Anything else Spark-related should be added to the jar, for example spark-streaming-kafka.
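As a sketch under the same assumptions (Scala 2.10, Spark 1.6.0), such a dependency keeps the default compile scope so it ends up in the uber jar, which can be built, for example, with the maven-shade-plugin (the plugin version below is only an example):

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming-kafka_2.10</artifactId>
      <version>1.6.0</version>
      <!-- default (compile) scope: not part of the Spark assembly,
           so it must be packaged into the uber jar -->
    </dependency>

    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <version>2.4.3</version>
      <executions>
        <execution>
          <!-- build the uber jar during the package phase -->
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
        </execution>
      </executions>
    </plugin>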