Methods taken into consideration (Spark 2.2.1):
- `DataFrame.repartition` (the two implementations that take `partitionExprs: Column*` parameters)
- `DataFrameWriter.partitionBy`
Note: This question doesn't ask about the difference between these methods.
From the docs of `partitionBy`:
If specified, the output is laid out on the file system similar to Hive's partitioning scheme. As an example, when we partition a `Dataset` by year and then month, the directory layout would look like:

- year=2016/month=01/
- year=2016/month=02/
From this, I infer that the order of column arguments will decide the directory layout; hence it is relevant.
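
To make that inference concrete, a minimal sketch (the sample data, column names, and output paths here are made up for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partition-layout").getOrCreate()
import spark.implicits._

// Hypothetical sample data with year and month columns
val df = Seq((2016, 1, "a"), (2016, 2, "b")).toDF("year", "month", "value")

// partitionBy("year", "month") nests month under year:
//   .../year=2016/month=1/
df.write.partitionBy("year", "month").parquet("/tmp/by_year_month")

// Reversing the argument order reverses the nesting:
//   .../month=1/year=2016/
df.write.partitionBy("month", "year").parquet("/tmp/by_month_year")
```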
From the docs of `repartition`:
Returns a new `Dataset` partitioned by the given partitioning expressions, using `spark.sql.shuffle.partitions` as number of partitions. The resulting `Dataset` is hash partitioned.
As per my current understanding, `repartition` decides the degree of parallelism in handling the DataFrame. With this definition, the behaviour of `repartition(numPartitions: Int)` is straightforward, but the same can't be said about the other two implementations of `repartition` that take `partitionExprs: Column*` arguments.
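
For reference, a sketch of the three overloads side by side, reusing the hypothetical `df` from the sketch above:

```scala
import org.apache.spark.sql.functions.col

val byCount = df.repartition(10)                        // numPartitions only
val byCols  = df.repartition(col("year"), col("month")) // partitionExprs only: hash partitioning,
                                                        // spark.sql.shuffle.partitions partitions
val byBoth  = df.repartition(10, col("year"), col("month")) // both arguments

// 200 unless spark.sql.shuffle.partitions was changed
println(byCols.rdd.getNumPartitions)
```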
All things said, my doubts are the following:
- Like the `partitionBy` method, is the order of column inputs relevant in the `repartition` method too?
- If the answer to the above question is
  - No: Does each chunk extracted for parallel execution contain the same data as would have been in each group had we run a `SQL` query with `GROUP BY` on the same columns? (see the sketch after this list)
  - Yes: Please explain the behaviour of the `repartition(columnExprs: Column*)` method
- What is the relevance of having both `numPartitions: Int` and `partitionExprs: Column*` arguments in the third implementation of `repartition`?
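
A minimal sketch of how the second doubt could be checked empirically, again on the hypothetical `df`: tag each row with its partition id after the shuffle, then count how many distinct partitions each group spans.

```scala
import org.apache.spark.sql.functions.{col, countDistinct, spark_partition_id}

df.repartition(col("year"), col("month"))
  .withColumn("pid", spark_partition_id())   // id of the partition the row landed in
  .groupBy("year", "month")
  .agg(countDistinct("pid").as("partitions_per_group"))
  .show()
// If hash partitioning keeps each group together, partitions_per_group
// should be 1 for every (year, month) combination.
```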