Versions

Spark - 2.2

Problem

If you read Avro files with Spark, you may notice that all fields in the resulting schema are nullable (optional).

For example, suppose you read an Avro file with the following schema into a DataFrame:

{
	"type": "record",
	"name": "user",
	"namespace": "kakfa-avro.test",
	"fields": [
		{
			"name": "id",
			"type": "int"
		}
  ]
}
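Reading such a file already shows the field as nullable. A minimal sketch (in Spark 2.2, Avro support comes from the external Databricks spark-avro package; the file path here is a placeholder):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("avro-nullable").getOrCreate()

// Read the Avro file; the format name comes from the spark-avro package
val df = spark.read.format("com.databricks.spark.avro").load("/tmp/users.avro")

df.printSchema()
// root
//  |-- id: integer (nullable = true)   <- nullable even though the Avro field is required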

If you then save the DataFrame back to an Avro file, the resulting schema will be:

{
	"type": "record",
	"name": "user",
	"namespace": "kakfa-avro.test",
	"fields": [
		{
			"name": "id",
			"type": [
				"int",
				"null"
			]
		}
  ]
}
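The round trip that produces this schema can be reproduced like this (same assumptions as in the sketch above; df is the DataFrame read there):

// Write the DataFrame back; the Avro schema is derived from the (now nullable) Spark schema,
// so the nullable int becomes a union with "null"
df.write.format("com.databricks.spark.avro").save("/tmp/users-copy.avro")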

There is one difference between the schemas: the field id became optional (nullable). That happens because of the line dataSchema = dataSchema.asNullable in the DataSource class source code:

HadoopFsRelation(
          fileCatalog,
          partitionSchema = partitionSchema,
          dataSchema = dataSchema.asNullable,
          bucketSpec = bucketSpec,
          format,
          caseInsensitiveOptions)(sparkSession)
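asNullable is internal to Spark, but its effect can be illustrated with public types: it recursively marks every struct field, array element, and map value as nullable. A rough user-level equivalent (an illustrative sketch, not the actual Spark implementation):

import org.apache.spark.sql.types._

// Recursively mark all struct fields, array elements and map values as nullable,
// mimicking what DataType.asNullable does inside Spark
def makeNullable(dt: DataType): DataType = dt match {
  case s: StructType =>
    StructType(s.fields.map(f => f.copy(dataType = makeNullable(f.dataType), nullable = true)))
  case a: ArrayType =>
    a.copy(elementType = makeNullable(a.elementType), containsNull = true)
  case m: MapType =>
    m.copy(valueType = makeNullable(m.valueType), valueContainsNull = true)
  case other => other
}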

There is a JIRA ticket where this topic was discussed. This logic makes sense for formats such as CSV, where there is no way to tell whether a field is nullable or not.

Workaround

Idea: read the correct schema from one of the Avro files and then recreate the DataFrame based on that schema.

val avroSchema = getSchema(avroFiles.head, spark)    // non-nullable schema taken from the file itself
val dfWithOptionalFields = readDF(spark, avroFiles)  // schema forced to nullable by Spark
val dfWithCorrectFields = dfWithOptionalFields.sqlContext
  .createDataFrame(dfWithOptionalFields.rdd, avroSchema)  // re-apply the correct schema
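getSchema and readDF are not library functions; one possible implementation (a sketch, assuming the Databricks spark-avro package and the Avro Java library on the classpath; SchemaConverters.toSqlType keeps the nullability recorded in the file) could look like:

import org.apache.avro.file.DataFileStream
import org.apache.avro.generic.GenericDatumReader
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.StructType
import com.databricks.spark.avro.SchemaConverters

// Read the writer schema from a single Avro file and convert it to a Spark StructType
def getSchema(file: String, spark: SparkSession): StructType = {
  val path = new Path(file)
  val fs = path.getFileSystem(spark.sparkContext.hadoopConfiguration)
  val reader = new DataFileStream(fs.open(path), new GenericDatumReader[AnyRef]())
  try {
    SchemaConverters.toSqlType(reader.getSchema).dataType.asInstanceOf[StructType]
  } finally reader.close()
}

// Plain Avro read; Spark will still mark all fields nullable here
def readDF(spark: SparkSession, files: Seq[String]): DataFrame =
  spark.read.format("com.databricks.spark.avro").load(files: _*)

With the corrected schema in place, writing dfWithCorrectFields back to Avro should yield the original schema with the plain (non-union) int field.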