My problem is as follows:
I have a large dataframe called `details` containing 900K rows, and another one named `attributes` containing 80M rows. Both have a column `A` on which I would like to do a left-outer join, the left dataframe being `details`. There are only 75K unique entries in column `A` in the dataframe `details`; the dataframe `attributes` has 80M unique entries in column `A`.
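To make the shapes concrete, here is a scaled-down toy version of the two dataframes (the column names `detail`/`attr` and all the values are made up; only the key distribution matters):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Scaled-down stand-ins: `details` has many rows but few distinct keys in A,
# while `attributes` has one row per key over a much larger key space.
details = spark.createDataFrame(
    [(i % 75, "detail_%d" % i) for i in range(900)],
    ["A", "detail"],
)
attributes = spark.createDataFrame(
    [(i, "attr_%d" % i) for i in range(8000)],
    ["A", "attr"],
)
```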
What is the best possible way to achieve the join operation?
What have I tried?
The simple join, i.e.

```python
details.join(attributes, "A", how="left_outer")
```

just times out (or runs out of memory).

Since there are only 75K unique entries in column `A` in `details`, we don't care about the rest of the rows in `attributes`. So, first I filter those out using:

```python
# Collect the 75K distinct keys to the driver
uniqueA = details.select('A').distinct().collect()
uniqueA = map(lambda x: x.A, uniqueA)
# Keep only the attributes rows whose key appears in details
attributes_filtered = attributes.filter(attributes.A.isin(*uniqueA))
```

I thought this would work out because the `attributes` table comes down from 80M rows to a mere 75K rows. However, the join still takes forever to complete (it never finishes).
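As an aside, I also wondered whether the same filtering could be done without collecting the 75K keys to the driver, e.g. with a broadcast semi-join. A rough sketch of what I mean (untested at the full scale):

```python
from pyspark.sql.functions import broadcast

# The distinct key set from the small side: only 75K rows, so it should
# be small enough to broadcast to every executor.
unique_keys = details.select("A").distinct()

# Keep only the attributes rows whose key appears in details, without
# materializing the key list on the driver.
attributes_filtered = attributes.join(broadcast(unique_keys), "A", "leftsemi")
```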
Next, I thought that there were too many partitions and that the data to be joined was not on the same partitions. Although I don't know how to bring all the data to the same partition, I figured repartitioning might help. So here it goes:
```python
details_repartitioned = details.repartition("A")
attributes_repartitioned = attributes.repartition("A")
```

The above operation brings the number of partitions in `attributes` down from 70K to 200. The number of partitions in `details` is about 1100.

```python
from pyspark.sql.functions import broadcast

details_attributes = details_repartitioned.join(
    broadcast(attributes_repartitioned), "A", how="left_outer"
)  # tried without broadcast too
```
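For reference, this is how I inspected the partition counts quoted above (assuming a `SparkSession` named `spark`); I suspect the 200 simply comes from the `spark.sql.shuffle.partitions` default, since `repartition("A")` without an explicit number falls back to that setting:

```python
# Partition counts after repartitioning by the join key
print(details_repartitioned.rdd.getNumPartitions())
print(attributes_repartitioned.rdd.getNumPartitions())

# repartition("A") without a number uses this value (200 by default)
print(spark.conf.get("spark.sql.shuffle.partitions"))
```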
After all this, the join still doesn't work. I am still learning PySpark, so I might have misunderstood the fundamentals behind repartitioning. If someone could shed light on this, it would be great.
P.S. I have already seen this question, but it does not answer my question.