To run a Spark job on a Hadoop cluster, you need an uber jar containing all the dependencies your job requires. However, some of those dependencies are provided by Spark itself at runtime, so it is recommended to mark them with the provided scope in pom.xml rather than packaging them twice.
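
As a minimal sketch of what that looks like (the Spark version and Scala suffix here are assumptions; match them to your cluster), a provided dependency entry in pom.xml, together with the maven-shade-plugin that builds the uber jar, could be:

    <!-- Provided by the cluster at runtime, so excluded from the uber jar -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.6.0</version>
      <scope>provided</scope>
    </dependency>

    <!-- Builds the uber jar during the package phase -->
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <version>2.4.3</version>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
        </execution>
      </executions>
    </plugin>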

All dependencies provided by Spark are listed in the spark-assembly jar, in the DEPENDENCIES file inside the META-INF folder.

An excerpt from this file:

  • stream-lib (https://github.com/addthis/stream-lib) com.clearspring.analytics:stream:jar:2.7.0 License: Apache License, Version 2.0 (http://www.apache.org/licenses/LICENSE-2.0.txt)
  • Kryo (http://code.google.com/p/kryo/) com.esotericsoftware.kryo:kryo:bundle:2.21 License: New BSD License (http://www.opensource.org/licenses/bsd-license.php) …

In the case of the Hortonworks distribution, the path of this file is /usr/hdp/hadoop_version/spark/lib/spark-assembly-spark+hadoop_version.jar
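
You can print the list without unpacking the jar, for example with unzip -p /usr/hdp/hadoop_version/spark/lib/spark-assembly-spark+hadoop_version.jar META-INF/DEPENDENCIES (substituting the versions in the path for your installation).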

Additionally, you do not need to include the following in your jar:

  • spark-core
  • spark-bagel
  • spark-mllib
  • spark-streaming
  • spark-graphx
  • spark-sql

Anything else from the Spark ecosystem should be added to the jar, for example spark-streaming-kafka.
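
A sketch of that contrast in pom.xml (the artifact suffixes and versions are assumptions; use the ones matching your cluster's Spark and Scala versions):

    <!-- In spark-assembly, so provided by the cluster -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_2.10</artifactId>
      <version>1.6.0</version>
      <scope>provided</scope>
    </dependency>

    <!-- Not in spark-assembly: default (compile) scope, so it ends up in the uber jar -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming-kafka_2.10</artifactId>
      <version>1.6.0</version>
    </dependency>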