I am executing a join using a JavaHiveContext in Spark.
The large table (ft) is 1.76 GB and has 100 million records.
The second table (dt) is 273 MB and has 10 million records.
I get a JavaSchemaRDD and call count() on it:
String query="select attribute7,count(*) from ft,dt where ft.chiavedt=dt.chiavedt group by attribute7";
JavaSchemaRDD rdd=sqlContext.sql(query);
System.out.println("count="+rdd.count());
If I force a BroadcastHashJoin (SET spark.sql.autoBroadcastJoinThreshold=290000000) and run with 5 executors on 5 nodes, each with 8 cores and 20 GB of memory, the query completes in 100 seconds.
If I don't force the broadcast, it completes in 30 seconds.
N.B. Both tables are stored as Parquet files.
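To make the threshold setting above concrete, here is a minimal sketch (plain Java, no Spark dependency) of the size comparison Spark performs when deciding whether a table qualifies for a BroadcastHashJoin. The sizes are taken from the question; the 10 MB figure is Spark's documented default for spark.sql.autoBroadcastJoinThreshold. The class and variable names are hypothetical, purely for illustration.

```java
public class BroadcastThresholdCheck {
    public static void main(String[] args) {
        // Dimension table size from the question: 273 MB
        long dimTableBytes = 273L * 1024 * 1024;   // 286,261,248 bytes

        // Threshold forced in the question via SET spark.sql.autoBroadcastJoinThreshold
        long forcedThreshold = 290_000_000L;

        // Spark's default autoBroadcastJoinThreshold: 10 MB
        long defaultThreshold = 10L * 1024 * 1024; // 10,485,760 bytes

        // A table is eligible for broadcast when its size is at or below the threshold
        System.out.println(dimTableBytes <= forcedThreshold);   // prints true  -> broadcast is chosen
        System.out.println(dimTableBytes <= defaultThreshold);  // prints false -> shuffle join instead
    }
}
```

So the forced setting works as intended: 286,261,248 bytes is just under 290,000,000, which is why the 273 MB table gets broadcast; under the default 10 MB threshold it would not be, and Spark falls back to a shuffle-based join.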