The underlying Hadoop API that Spark uses to access S3 allows you to specify input files using a glob expression.
From the Spark docs:
All of Spark’s file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").
So in your case you should be able to open all those files as a single RDD using something like this:
rdd = sc.textFile("s3://bucket/project1/20141201/logtype1/logtype1.*.gz")
Just for the record, you can also specify files using a comma-delimited list, and you can even mix that with the * and ? wildcards.
For example:
rdd = sc.textFile("s3://bucket/201412??/*/*.gz,s3://bucket/random-file.txt")
Briefly, what this does is:
- The * matches all strings, so in this case all gz files in all folders under 201412?? will be loaded.
- The ? matches a single character, so 201412?? will cover all days in December 2014 like 20141201, 20141202, and so forth.
- The , lets you load several separate files into the same RDD at once, like the random-file.txt in this case.
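Putting that together, here is a minimal sketch for the layout described above (the bucket, project, and log-type names are hypothetical placeholders): it loads a whole month of one log type into a single RDD and sanity-checks what was matched.

from pyspark import SparkContext

sc = SparkContext(appName="s3-glob-example")

# Hypothetical layout: s3://bucket/<project>/<yyyymmdd>/<logtype>/<logtype>.*.gz
# ? matches one character (all days in December 2014), * matches the rest of the file name.
logs = sc.textFile("s3://bucket/project1/201412??/logtype1/logtype1.*.gz")

print(logs.count())   # total number of lines across every matched file
print(logs.take(3))   # peek at a few lines to confirm the right files were loaded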
Some notes about the appropriate URL scheme for S3 paths:
- If you're running Spark on EMR, the correct URL scheme is s3://.
- If you're running open-source Spark (i.e. no proprietary Amazon libraries) built on Hadoop 2.7 or newer, s3a:// is the way to go.
- s3n:// has been deprecated on the open-source side in favor of s3a://. You should only use s3n:// if you're running Spark on Hadoop 2.6 or older.
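If you go the s3a:// route on open-source Spark, a minimal sketch of wiring up credentials through the Hadoop configuration might look like the following. The key values are placeholders, this assumes the hadoop-aws jar is already on your classpath, and in practice you may prefer IAM instance profiles or environment variables over hard-coded keys.

from pyspark import SparkContext

sc = SparkContext(appName="s3a-example")

# Hadoop 2.7+ s3a connector settings; the key values shown are placeholders.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
hadoop_conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

rdd = sc.textFile("s3a://bucket/project1/20141201/logtype1/logtype1.*.gz")
print(rdd.count())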