Produce the result of a Spark Streaming job to Kafka
Environment
Kafka 0.9, Spark 1.6
Producing to Kafka according to the documentation
A quite common problem: you have a Spark Streaming job and its result should be pushed to Kafka. In most cases when you google for 'spark streaming produce to kafka', you will find an example like:
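Something along these lines (a sketch of that typical snippet; the `stream` DStream of strings, the broker address, the serializers and the `output-topic` name are assumptions for illustration):

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

stream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092") // assumption: broker address
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    // a brand new producer for every partition of every micro-batch
    val producer = new KafkaProducer[String, String](props)
    records.foreach { record =>
      producer.send(new ProducerRecord[String, String]("output-topic", record)) // assumption: topic name
    }
    producer.close()
  }
}
```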
Problem
This code means that for every micro-batch, a new Kafka producer is created for every partition of the RDD. This common approach worked for us for a long time, until we deployed a Spark job with a 500 ms micro-batch interval and 64 partitions. After several days of running, we noticed that GC on the Kafka JVM was not performing well (constant major GC).
A heap dump of Kafka showed a huge amount of JMX metrics for Kafka producers. I am not sure it is a memory leak in Kafka, as it was able to clean up memory before we significantly increased the number of producers created every second. There is one Stack Overflow post that looks similar to our problem.
Solution
The obvious solution: decrease the number of new producers created, which means we should reuse the same producer. There are two posts by [Marcin Kuthan](http://mkuthan.github.io/) describing how to do it:
- broadcast a lazily initialized Kafka producer to each executor
- hold a map of producers in an object (singleton)
Both solutions work; however, the second one requires fewer code changes.
Broadcast producer
The first post is on the Allegro technical blog.
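The pattern looks roughly like this (a sketch paraphrased from that post, not copied verbatim; the `KafkaSink` name and the `Properties`-based config are my assumptions):

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Serializable wrapper: the producer itself is created lazily, on the executor
class KafkaSink(createProducer: () => KafkaProducer[String, String]) extends Serializable {

  lazy val producer = createProducer()

  def send(topic: String, value: String): Unit =
    producer.send(new ProducerRecord[String, String](topic, value))
}

object KafkaSink {
  def apply(config: Properties): KafkaSink = {
    val createProducer = () => {
      val producer = new KafkaProducer[String, String](config)
      // close the producer when the executor JVM shuts down
      sys.addShutdownHook {
        producer.close()
      }
      producer
    }
    new KafkaSink(createProducer)
  }
}
```

Only the function that creates the producer is shipped to the executors; the producer itself is instantiated there on first use, once per executor.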
Usage:
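Roughly like this (assuming a `StreamingContext` named `ssc`, a producer `config`, a `stream` of strings and an `output-topic` topic):

```scala
val kafkaSink = ssc.sparkContext.broadcast(KafkaSink(config)) // assumption: config is a Properties instance

stream.foreachRDD { rdd =>
  rdd.foreach { record =>
    // every executor reuses its single, lazily created producer
    kafkaSink.value.send("output-topic", record) // assumption: topic name
  }
}
```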
Hold map of producers in object (singleton)
From the second post, partial code:
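A sketch of that factory (the object name, the map key and the shutdown hook are my assumptions; see the post for the exact code):

```scala
import java.util.Properties
import scala.collection.mutable
import org.apache.kafka.clients.producer.KafkaProducer

// One producer per config, per executor JVM, created on first use
object KafkaProducerFactory {

  private val producers = mutable.Map[Properties, KafkaProducer[String, String]]()

  def getOrCreateProducer(config: Properties): KafkaProducer[String, String] =
    this.synchronized {
      producers.getOrElseUpdate(config, {
        val producer = new KafkaProducer[String, String](config)
        // close the cached producer when the executor JVM shuts down
        sys.addShutdownHook {
          producer.close()
        }
        producer
      })
    }
}
```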
Usage:
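Roughly (again assuming `stream`, `config` and the `output-topic` topic from above):

```scala
import org.apache.kafka.clients.producer.ProducerRecord

stream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // created once per executor, then reused across micro-batches
    val producer = KafkaProducerFactory.getOrCreateProducer(config)
    records.foreach { record =>
      producer.send(new ProducerRecord[String, String]("output-topic", record))
    }
  }
}
```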
For more details, I suggest looking at the code on GitHub.