I used to use df.repartition(1200).write.parquet(...), which created 1200 files, as specified in the repartition argument. I am now using partitionBy, i.e. df.repartition(1200).write.partitionBy("mykey").parquet(...). This works fine, except that it now creates up to 1200 files per value of mykey. I would like to have 1200 files overall.
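For reference, this is roughly what the two calls look like (the output path is just a placeholder):

```python
# Old approach: exactly 1200 output files, but no directory layout by key
df.repartition(1200).write.parquet("/path/to/output")

# New approach: partitioned directories, but up to 1200 files per mykey value
df.repartition(1200).write.partitionBy("mykey").parquet("/path/to/output")
```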
Other posts suggest repartitioning across certain keys, but the relevant documentation for my Spark version (2.4.0) seems to suggest that this feature was added later. Is there another way to achieve this? I guess I could repartition to 1200 / len(unique("mykey")) partitions, but that feels a bit hacky. Is there a better way to do it? I am also worried that reducing the number of partitions will result in out-of-memory errors.
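To illustrate the hacky workaround I have in mind, here is a sketch (assuming the number of distinct mykey values is small enough to count cheaply, and that each output partition ends up holding rows for most keys, so the total file count comes out near 1200):

```python
# Count distinct keys and split the target file count across them
n_keys = df.select("mykey").distinct().count()
partitions = max(1, 1200 // n_keys)

# Each of the `partitions` tasks writes at most one file per mykey value,
# so the total is roughly partitions * n_keys ≈ 1200 files
(df.repartition(partitions)
   .write
   .partitionBy("mykey")
   .parquet("/path/to/output"))  # placeholder output path
```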