A guide to adapting an existing Spark Scala library for Spark Connect

Spark-Connect: I'm starting to love it!

https://semyonsinchenko.github.io/ssinchenko/post/porting_deequ_to_sparkconnect/

Summary: This blog post is a detailed story about how I ported a popular data quality framework, AWS Deequ, to Spark-Connect. Deequ is a reliable and scalable framework that lets you compute many metrics and run checks and anomaly detection suites on your data using an Apache Spark cluster. But the Deequ core is a Scala library that uses a lot of low-level Apache Spark APIs for better performance, so it cannot be run directly in any Spark-Connect environment.

Let me share my post, a detailed step-by-step guide on how an existing Spark Scala library can be adapted to work with the recently introduced Spark Connect. As an example I chose a popular open source data quality tool, AWS Deequ. I wrote all the necessary protobuf messages and a Spark Connect plugin. I tested it from PySpark Connect 3.5.1 and it works. Of course, all the code is public in a Git repository.
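To give a rough idea of the plugin approach, here is a minimal toy model in plain Python. It is not the code from the post and it does not use any Spark or protobuf API: `AnyEnvelope`, `deequ_like_plugin`, `dispatch`, and the type URL are all hypothetical names made up for illustration. The assumption it sketches is that the client wraps a custom message in a typed envelope, and the server asks each registered plugin whether it recognizes that message.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


# Hypothetical stand-in for a protobuf Any: a type URL plus serialized bytes.
@dataclass(frozen=True)
class AnyEnvelope:
    type_url: str
    value: bytes


# A plugin returns a result when it recognizes the message, otherwise None.
Plugin = Callable[[AnyEnvelope], Optional[str]]

# Made-up type URL for this sketch; a real one would come from the .proto file.
SUITE_TYPE_URL = "type.example/VerificationSuite"


def deequ_like_plugin(msg: AnyEnvelope) -> Optional[str]:
    """Toy plugin: handles only the message type it knows about."""
    if msg.type_url != SUITE_TYPE_URL:
        return None  # not ours, let another plugin try
    column = msg.value.decode("utf-8")
    return f"run completeness check on column '{column}'"


def dispatch(msg: AnyEnvelope, plugins: List[Plugin]) -> str:
    """Toy server side: the first plugin that recognizes the message wins."""
    for plugin in plugins:
        result = plugin(msg)
        if result is not None:
            return result
    raise ValueError(f"no plugin handles {msg.type_url}")


if __name__ == "__main__":
    request = AnyEnvelope(SUITE_TYPE_URL, b"id")
    print(dispatch(request, [deequ_like_plugin]))
```

In the real setup the envelope would be a protobuf message and the plugin would be a Scala class registered on the Spark Connect server, so treat this only as a picture of the control flow.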
