Guide for Databricks Certified Associate Developer for Apache Spark 3.0

Carlos Alonso Capilla
Apr 1, 2021

In this post I am going to share the resources and methodology I used to pass the “Databricks Certified Associate Developer for Apache Spark 3.0” certification.

First of all, when I took the exam (28/03/2021) the most recent Spark version was 3.1.1, but the exam covers the 3.0 major release (link).

The exam is available in Python or Scala. You will not be tested on your knowledge of either language, only on Apache Spark, so use the one you are more comfortable with. In my case I took it in Python. The exam window is divided into two parts:
-Left part: The exam questions and their possible answers.
-Right part: This part is itself split in two. The upper half shows a giant PDF with the official documentation (link), which you CAN’T search: there is no search option. The lower half is a kind of notepad where you can write down your own notes. It is not a shell or anything like that, just a notepad.

These are my recommendations for the exam.

1. Whether or not you already know Spark well, I consider the book Spark: The Definitive Guide (book) a must-read. There is now a somewhat lighter book that covers version 3.0, but I personally recommend the first one, as the fundamentals are practically the same.

2. Make sure you are comfortable finding things in the documentation PDF you are given. This is VERY IMPORTANT. For example, if you need to check what the withColumn method does, you have to know that it is documented under the DataFrame class and not under SparkSession. The PDF will be very useful to you, but only if you know where to look.

3. The exam is mostly focused on the DataFrame API, so if you only know SQL and don’t know how DataFrames work, don’t take the exam because you will fail!

4. You must know how the Spark architecture works and the concepts in its execution hierarchy (Jobs, Stages, Tasks, Partitions, Accumulators, Workers, Driver, and so on).

Resources.

There are many resources available on the internet, but here I am going to put some links where I think there is great content about Spark.

1- https://www.linkedin.com/company/justenough-spark/ . Some of the test questions they post are very similar to those in the exam, and their content is usually very good.

2- If you have been working with Spark for a long time, I am 100% sure that sooner or later you have come across a post by Jacek Laskowski. I recommend reading his online book ( https://jaceklaskowski.github.io/mastering-spark-sql-book/overview/ )

3- Bryan Cafferky has a YouTube channel with a playlist about Databricks and Apache Spark. At the time of writing this post the video series is not finished, but the videos published so far are very good. His channel also has other very interesting content.

4- Bartosz Konieczny has a website focused almost entirely on data engineering, with topics ranging from basic to advanced. It is very worthwhile.

I hope this post gives you a better understanding of how to approach the certification. Remember that if you fail, it is not a failure: it is an opportunity to find out where you should spend more time learning.

Update: 03/04/2021

I’m going to add some more tips after a few days of thinking about how to help people get certified.

5- Spark memory tuning. You should have a general idea of how memory works in Spark. Be aware that this topic is very broad, but you need at least the basics: everything covered in the following link should sound familiar to you. If you want to go deeper, go ahead!

6- Regarding the documentation they give you for the exam (the PDF): as I said before, you don’t have to memorize all the functions, you just need to know which section they are in (SparkSession, DataFrame, functions, Row, and so on). Also keep in mind that the links inside the PDF are disabled. For example, the links that appear in the online documentation for the to_date function do not work in the exam PDF.

7- Read the questions very slowly. Some of the questions are negated, so read carefully what exactly they are asking. Some questions are tricky. For example, they say “return a new dataframe which has a column with the average salary”, and the possible options use either the withColumn or the avg method. Be very careful here: the withColumn method ADDS one more column to the original dataframe, and in this case they are asking for a dataframe with a single column. When they want you to use withColumn, they usually give you a hint such as “in addition to the current columns, we want a new column with xxxxx”. Maybe right now you are thinking “bro, this is super easy, I would never fall into this kind of trap”, but during the exam you are nervous and in a hurry to finish (even if you have plenty of time), so pay attention to what they are asking you!

Finally, if you have any questions, leave me a comment here or on LinkedIn (link) and I’ll try to answer them! Add me to your network if you want, I’ll accept the request without a problem!
