Programmatically adding additional Java packages to your Spark Context

Flávio Teixeira
2 min read · Jan 12, 2020

When developing Spark jobs, regardless of the language, we may run into a situation that requires a specific Java library, and that can be tough if we don't have an automated deployment pipeline for the Hadoop cluster, or if that pipeline is not configurable.

Spark has built-in functionality that allows us to programmatically download packages and add them directly to the Spark context, without having to download and configure them manually. Everything is done by adding a single configuration when instantiating the Spark context: spark.jars.packages.

Instantiating the Spark context with additional libraries
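The original post showed this step as a screenshot; the snippet below is a minimal PySpark sketch of the same idea. The app name and the package versions are illustrative, and the versions need to be compatible with your cluster's Hadoop build.

from pyspark.sql import SparkSession

# Build a session that pulls the extra packages from Maven at startup.
# Coordinates follow groupId:artifactId:version; the versions here are
# placeholders, pick ones that match your Hadoop distribution.
spark = (
    SparkSession.builder
    .appName("extra-packages-example")  # hypothetical app name
    .config(
        "spark.jars.packages",
        "com.amazonaws:aws-java-sdk:1.11.563,"
        "org.apache.hadoop:hadoop-aws:3.2.0",
    )
    .getOrCreate()
)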

If you have both Java and Maven installed, with their environment variables (JAVA_HOME and MAVEN_HOME) correctly configured, the desired packages will be downloaded when the context is loaded and added to the classpath automatically.

Note: the package string must follow the structure groupId:artifactId:version, and if you need more than one package, just list them separated by commas (as in the snippet above).

>>> BONUS

If you look at the example again, you can see that the two libraries I'm configuring are aws-java-sdk and hadoop-aws.


They are responsible for letting you, among a lot of other cool stuff, load data directly from S3 using the s3a protocol, e.g. spark.read.csv('s3a://bucket/file.csv'). This helps a lot when you're developing Spark jobs for AWS projects and don't want to download data just for a quick test, or when your cluster runs outside the AWS ecosystem but your data lives in S3 buckets.
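Here is a quick sketch of what that looks like in practice, reusing the session created above. The bucket, file name, and credentials are placeholders; in many setups the credentials come from the default AWS provider chain (environment variables, instance profile, etc.) instead of being set explicitly.

# Hadoop S3A credentials, set explicitly here only for illustration.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", "<AWS_ACCESS_KEY_ID>")
hadoop_conf.set("fs.s3a.secret.key", "<AWS_SECRET_ACCESS_KEY>")

# Read a CSV straight from S3 over the s3a protocol.
df = spark.read.csv("s3a://bucket/file.csv", header=True)
df.show(5)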

The snippet source code can be found here.

Feel free to browse my other repositories; I post a lot of snippets and personal projects that might help you!

Thanks for reading :)


Flávio Teixeira

Data engineer, gamer, and technology addict. Currently working at Riot Games.