Enum in key for Pair Rdd
Be careful using java enums in key for PairRDD
Example: java enum:
public enum KeyType {
ODD,
EVEN
}
and simple spark job that calculate amount of ODD and EVEN numbers between 1 to 10000
sc.parallelize(1 to 10000, 4)
.map { i => (if (i % 2 == 0) KeyType.EVEN else KeyType.ODD, 1) }
.reduceByKey(_ + _)
.foreach{case (key, value) => println(key + " - " + value)}
The output of that spark job is
EVEN - 2500
ODD - 5000
EVEN - 2500
The problem is based on the fact that reduce by key is using equals and hashCode methods of key. However, as spark job is executed in different JVMs, hashcodes of the same enum value are different.