If you know how to parse the files using plain Scala code, you just need Spark to distribute that work across the executors:
1. List the file names in your S3 bucket; this gives you a Seq[String].
2. Transform them into a Dataset/DataFrame (spark.createDataset).
3. Do a mapPartitions operation.
4. Inside mapPartitions, initialize an S3 client and use it to read the content of each file. Parse it with plain Scala code and output a case class.
5. The output of mapPartitions will be a Dataset of the parsed files.
upd: here is a code sample.
I do not know what RocksDB files are or how to read them with plain Scala code. Assuming you can read them with plain Scala, without Spark, the following code sample can be used to parallelize the read/parse with Spark:
import java.nio.charset.StandardCharsets

import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.ListObjectsV2Request
import org.apache.commons.io.IOUtils
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

import scala.collection.JavaConverters._

val session: SparkSession = SparkSession.builder().config(new SparkConf().setMaster("local")).getOrCreate()
import session.implicits._

// list the object keys under the given prefix, skipping "directory" placeholders
def listS3Files(bucket: String, prefix: String): Seq[String] = {
  val s3Client = AmazonS3ClientBuilder.standard.build // add credentials, region if needed
  val listObjectsRequest = (new ListObjectsV2Request)
    .withBucketName(bucket)
    .withPrefix(prefix)
  s3Client.listObjectsV2(listObjectsRequest).getObjectSummaries.asScala.map(_.getKey).filter(p => !p.endsWith("/"))
}

val bucket = "your-bucket"
val files = listS3Files(bucket, "test")

val result = files
  .toDS()
  .mapPartitions(fs => {
    // the client is created once per partition, on the executor
    val s3Client = AmazonS3ClientBuilder.standard.build // add credentials, region if needed
    // here the content is read as a string; in your case you need to read the RocksDB files and parse them into a case class
    val content = fs.map(f => IOUtils.toString(s3Client.getObject(bucket, f).getObjectContent, StandardCharsets.UTF_8))
    content
  }).cache()
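
If you want the result typed as a case class rather than raw strings, here is a minimal sketch of the same mapPartitions step. Record and parseRocksDb are hypothetical placeholders; the parser body is where your own plain-Scala RocksDB/SST reading logic would go.

// hypothetical output type and parser -- replace with whatever your plain Scala parsing produces
case class Record(key: String, value: String)
def parseRocksDb(bytes: Array[Byte]): Seq[Record] = ??? // plug in your plain Scala parsing here

val parsed = files
  .toDS()
  .mapPartitions(fs => {
    val s3Client = AmazonS3ClientBuilder.standard.build // add credentials, region if needed
    fs.flatMap { f =>
      // read the raw object bytes and hand them to the parser
      val bytes = IOUtils.toByteArray(s3Client.getObject(bucket, f).getObjectContent)
      parseRocksDb(bytes)
    }
  })
// parsed is a Dataset[Record] and can be cached or written out like any other Dataset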
A DataFrame from the files, yes. I've no idea how an SST file is structured, but I guess that if you're able to convert it to JSON it should be easy enough to read into a DataFrame. Also note that you could read the JSON directly with Spark; there's no strict need to convert it to Parquet first.
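
For example, a minimal sketch of reading the converted JSON straight into a DataFrame, reusing the session from the snippet above; the s3a paths are just example placeholders, and the Parquet write is optional:

// read the JSON files directly into a DataFrame
val df = session.read.json("s3a://your-bucket/converted-json/")
df.printSchema()
// only write Parquet if you actually want a columnar copy
df.write.parquet("s3a://your-bucket/parquet-output/")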