I have two PySpark dataframes, tdf and fdf, where fdf is much larger than tdf. The sizes of both dataframes change daily, and I don't know them in advance. I want to randomly pick rows from fdf to build a new dataframe rdf, whose size is approximately equal to the size of tdf. Currently I have these lines:
tdf_count = tdf.count()
fdf_count = fdf.count()  # this full count of fdf is the expensive step
sampling_fraction = float(tdf_count) / float(fdf_count)
rdf = fdf.sample(fraction=sampling_fraction, seed=SEED)
These lines produce the correct result. But as the size of fdf increases, the fdf.count() call alone takes a few days to finish. Can you suggest another approach that is faster in PySpark?
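For context, here is a minimal self-contained sketch of what I am running; the read paths, the app name, and the SEED value are placeholders, not my actual inputs:

from pyspark.sql import SparkSession

# placeholder session and inputs; my real tdf and fdf come from much larger sources
spark = SparkSession.builder.appName("sample-fdf").getOrCreate()
tdf = spark.read.parquet("/path/to/tdf")
fdf = spark.read.parquet("/path/to/fdf")

SEED = 42  # placeholder seed

# count both dataframes, then sample fdf at the ratio of the two counts
tdf_count = tdf.count()
fdf_count = fdf.count()  # this is the call that takes days as fdf grows
sampling_fraction = float(tdf_count) / float(fdf_count)
rdf = fdf.sample(fraction=sampling_fraction, seed=SEED)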