Programmatically adding additional Java packages to your Spark Context
When developing Spark jobs, regardless of language, we may face a situation that requires a specific Java library, and this can be tough if we don't have an automated deploy pipeline for the Hadoop cluster, or if that pipeline is not configurable.
Spark has built-in functionality that allows us to programmatically download packages and add them to the Spark context, without downloading and configuring them manually. Everything is done by adding a single configuration when instantiating the Spark context: spark.jars.packages.
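A minimal sketch of what that looks like in PySpark (the app name and package coordinate here are illustrative; the import is kept inside the function so the snippet loads even on a machine without Spark installed):

```python
def build_spark(packages):
    """Return a SparkSession that downloads the given Maven packages at startup.

    Requires pyspark and a local Java installation; the import lives inside
    the function so defining it doesn't require Spark to be installed.
    """
    from pyspark.sql import SparkSession
    return (
        SparkSession.builder
        .appName("packages-demo")  # hypothetical app name
        .config("spark.jars.packages", packages)
        .getOrCreate()
    )

# Usage (downloads the jar and its dependencies when the session starts):
# spark = build_spark("org.apache.hadoop:hadoop-aws:3.3.4")
```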
With Java installed and JAVA_HOME correctly configured, Spark resolves the requested packages (and their transitive dependencies) from Maven repositories when the context is loaded, adding them to the classpath automatically.
Note: the package string must follow the Maven coordinate structure groupId:artifactId:version, and if you need more than one package, just list them separated by commas (just like in the image).
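For example, building that comma-separated string in plain Python (the versions below are illustrative, not a recommendation):

```python
# Each package is a Maven coordinate: groupId:artifactId:version.
# Multiple packages are passed to spark.jars.packages as one
# comma-separated string.
packages = [
    "com.amazonaws:aws-java-sdk-bundle:1.12.262",  # illustrative version
    "org.apache.hadoop:hadoop-aws:3.3.4",          # illustrative version
]
packages_conf = ",".join(packages)
print(packages_conf)
# com.amazonaws:aws-java-sdk-bundle:1.12.262,org.apache.hadoop:hadoop-aws:3.3.4
```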
>>> BONUS
If you look at the example image again, you can see that the two libraries I'm configuring are aws-java-sdk and hadoop-aws.
Among a lot of other cool stuff, they let you read data directly from S3 using the s3a protocol, e.g. spark.read.csv('s3a://bucket/file.csv'). This helps a lot when developing Spark for AWS projects and you don't want to download data just for a quick test, or when your cluster lives outside the AWS ecosystem but your data is stored in S3 buckets.
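Putting it together, a hedged end-to-end sketch (the bucket path is a placeholder, the version is illustrative, and actually running this requires pyspark, Java, and valid AWS credentials from the usual provider chain):

```python
def read_csv_from_s3(path, packages="org.apache.hadoop:hadoop-aws:3.3.4"):
    """Start a session with hadoop-aws on the classpath and read a CSV via s3a."""
    from pyspark.sql import SparkSession  # local import: needs pyspark + Java
    spark = (
        SparkSession.builder
        .config("spark.jars.packages", packages)
        .getOrCreate()
    )
    # s3a:// paths work once hadoop-aws (and its AWS SDK dependency) are on
    # the classpath; credentials come from env vars, profiles, instance
    # roles, etc.
    return spark.read.csv(path, header=True)

# Usage (hypothetical bucket):
# df = read_csv_from_s3("s3a://my-bucket/file.csv")
```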
The snippet source code can be found here.
Feel free to access my other repositories, I post a lot of snippets and personal projects that could help you!
Thanks for reading :)