Google’s Data Engineer certification is one of the new and emerging certifications in the industry right now for data engineering and machine learning. The exam tests your ability to design, build, and maintain data processing systems, and to analyze data to gain insight into business outcomes and enable machine learning. I passed the Data Engineer exam in Jan 2018. Through this blog, I would like to share my experience and a preparation guide, because at the time of writing I did not have enough material to guide me through the exam.
*Note: The following article relates to the Google Cloud Professional Data Engineer Certification exam v1 (before March 2019).
Below is an outline of the exam format, the preparation materials, and the major GCP service offerings covered.
Type of Exam
- 50 Questions in 2 hrs.
- Question Type: MCQ (Multi-Select)
- Proctored exam that must be taken at a Kryterion-certified location
Exam materials I followed:
- I highly recommend the Coursera Big Data Specialization by Valliappa Lakshmanan.
- For people with AWS experience, there is another course, Google Cloud Platform for AWS Professionals, that can be used as a Level 1 primer before taking the Big Data course.
- Cloud Academy course by Guy Hummel — a good last-week refresher. An 8-hour course that focuses on the important points.
- Udemy course by Loony Corn — its appendix on the Hadoop ecosystem is a must-read for folks with little or no Hadoop experience.
- YouTube playlists — credit: Hermine Hovhannisyan
As the questions are drawn from a pool, the number of questions to expect for each module may vary.
To sum up, I took two months for preparation, and I also had some prior basic experience in the machine learning and data science areas. It’s better to choose a timeline based on your own experience.
For me, I only followed the Coursera course modules and some YouTube content. But do note that not all aspects are covered in the Coursera modules; for example, Stackdriver and Data Studio are not covered in depth. I was able to answer around half the questions with the help of the Coursera material, and below I list some scenarios that are not extensively covered in the course.
To be fair, the exam is a bit hard. But I really enjoyed taking it, because almost every question starts with a practical scenario, so you get the feeling that you are solving real-world problems. That also means the questions are very long. I am not a native English speaker, and I barely had a few minutes to run through the question set once again. So manage your time wisely.
The examination platform has a revisit option: if you are not sure about an answer, you can mark the question for review. As only one question is shown at a time, it’s a very helpful feature for revising the doubtful ones.
General Big Data Related Questions
Whether you’re familiar with Big Data topics or not, it is essential to spend some time with the Hadoop Ecosystem, particularly focusing on Apache Pig, Hive, Spark, and Beam. The Hadoop Ecosystem is a huge part of Google’s Big Data and Analytics service offerings. You will encounter questions related to the use cases of these technologies.
Learn some basic usage scenarios. For example, if an on-premise MySQL database is shifted to a Hadoop-based data warehouse, how can you let the existing users query it? To answer this kind of question you need an understanding of what HiveQL and MapReduce are, and their use cases.
You should also know the general characteristics of databases. Get to know the advantages and features of each major database offering. For example, given a business requirement (high availability, non-transactional nature), you should be able to choose the correct database option.
Do also study and master common database jargon like self joins and DB normalization. Also, know the use cases of MapReduce and how to use it efficiently. Try to implement a MapReduce operation and run it on Dataproc infrastructure. For example, if a value is missing in a series of MapReduce jobs, how would you tackle it: by modifying all the jobs, or by adding a new job to the series? There are several ways to do it, but you should be able to identify the efficient way to perform the job. Only if you have actually implemented some use cases will you know the pain points, what to omit, and the recommended approaches.
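To make the map and reduce phases concrete, here is a minimal pure-Python word-count sketch (function names are my own; on Dataproc you would typically run the equivalent via Hadoop Streaming or rewrite it in Spark):

```python
from collections import defaultdict

def map_phase(lines):
    """Mapper: emit a (word, 1) pair for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reducer: sum the counts for each key."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

result = reduce_phase(map_phase(["Hadoop runs MapReduce", "MapReduce runs on Hadoop"]))
# result == {"hadoop": 2, "runs": 2, "mapreduce": 2, "on": 1}
```

In a real cluster the mapper output is shuffled by key across workers before reducing; this sketch only shows the data flow, which is what the exam scenarios reason about.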
Read through the case studies given by Google Cloud; they account for 5–8 questions, and you don’t want to spend exam time reading the case study.
Data Studio
I had just 2 or 3 questions related to Data Studio. Know the basic use cases, try to implement a report with BigQuery integration, and know about query caching and how to disable it in Data Studio.
Cloud SQL
Not much to expect here; I got 1 or 2 questions regarding Cloud SQL. Do some reading on which scenarios would make you prefer Cloud SQL over the other database offerings.
BigQuery
This was quite a surprise: around 25% of my questions related to BigQuery, so study this product extensively. In GCP you can derive the same results with different approaches, so know the pros and cons of each.
- Know how to query large data sets effectively, for example with time-based partitioning and table sharding. If the requirement is to keep only one table and the data is very big, you should be able to devise a strategy using these techniques.
- Do also look into creating views and aggregate tables. If you are aggregating multiple day-wise tables, you should be able to devise a strategy that still works when the number of tables to be aggregated increases.
- Do also look into cost-saving options. If you own the data and let third-party users query it, you should be able to devise a strategy where the third parties can use it without you incurring the cost of their queries.
- Also, master the security principles in BQ. If you want to share data across multiple departments or parties, how can you secure it? These kinds of scenarios will involve your knowledge of data sharing based on projects and roles.
- Do also look into regional data storage.
- When answering, always consider the cost and time involved. If your database has two related columns and an IT developer needs a way to query the concatenation of the two columns, there are several approaches: you can create a new table with the combined column, run an update job, or write a Dataflow job. Always think about the cost and effort needed to do it correctly.
- Know about the SQL anti-patterns and data skew.
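To illustrate the sharding point above: with day-wise sharded tables (e.g. events_20180101, events_20180102, …) a wildcard plus a `_TABLE_SUFFIX` filter lets one query cover an arbitrary date range, so nothing changes as the number of daily tables grows. A sketch that only builds the query string (the dataset and table names are hypothetical):

```python
def wildcard_aggregate_query(dataset, table_prefix, start_suffix, end_suffix):
    """Build a BigQuery Standard SQL query that aggregates over
    day-sharded tables using a table wildcard and a _TABLE_SUFFIX
    range filter, instead of UNIONing every daily table by name."""
    return (
        f"SELECT COUNT(*) AS events\n"
        f"FROM `{dataset}.{table_prefix}*`\n"
        f"WHERE _TABLE_SUFFIX BETWEEN '{start_suffix}' AND '{end_suffix}'"
    )

sql = wildcard_aggregate_query("mydataset", "events_", "20180101", "20180131")
```

The same requirement is often better served today by a single time-partitioned table; the exam expects you to weigh both options.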
Do also look into BQ optimization strategies. You should be able to glance at a query explanation and tell what is going on in each query stage.
Datastore
To my surprise, I encountered some questions on configuration, which I was not prepared for. The exam always tests your knowledge of practical implementation, so go through some basic labs. Learn also about the scenarios where Datastore would fit in. The Coursera course module discusses these points extensively.
Dataflow
Dataflow is a very important part of the Data Engineer exam. It validates your knowledge of ETL and event-processing basics. If an error statement is given, you should be able to identify its root cause. I would strongly advise implementing some streaming and batch pipelines. Master the windowing and triggering concepts and do some labs in this domain.
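As a concrete illustration of the fixed-windowing concept these questions probe, here is a plain-Python sketch that buckets timestamped events into fixed one-minute windows. It only mimics the idea behind Beam’s FixedWindows; it is not the Beam API:

```python
from collections import defaultdict

def assign_fixed_windows(events, window_size_s=60):
    """Group (timestamp_seconds, value) events into fixed,
    non-overlapping windows of window_size_s seconds, keyed by
    each window's start time (as Beam's FixedWindows would)."""
    windows = defaultdict(list)
    for ts, value in events:
        window_start = (ts // window_size_s) * window_size_s
        windows[window_start].append(value)
    return dict(windows)

w = assign_fixed_windows([(5, "a"), (59, "b"), (61, "c")])
# events at t=5 and t=59 fall in window 0; t=61 falls in window 60
```

Triggering then decides *when* each window’s contents are emitted (e.g. at the watermark, or early/late firings for late data), which is the other half of the concept worth drilling.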
Machine Learning and TensorFlow
I have read in some blogs that there were a surprising number of questions on ML, but I encountered only 5–8 questions in this domain. Mostly the questions targeted the general ML domain rather than hardcore TensorFlow specifics.
Learn the basics of machine learning thoroughly. I was able to answer most of the questions with the help of the machine learning course in Coursera’s data engineering specialization series, so if you have a basic understanding of ML, the Coursera module will be enough to refresh your knowledge. Do look into scenarios where you can speed up the training process, whether by subsampling or by adding or removing features.
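Subsampling is the simplest of those speed-ups: train on a random fraction of the data to iterate faster, at the cost of some accuracy. A minimal sketch (function name and fraction are my own illustration):

```python
import random

def subsample(dataset, fraction, seed=42):
    """Return a random fraction of the dataset, without replacement,
    to speed up training iterations at some cost in accuracy."""
    rng = random.Random(seed)  # fixed seed for reproducible experiments
    k = max(1, int(len(dataset) * fraction))
    return rng.sample(dataset, k)

small = subsample(list(range(1000)), 0.1)  # 100 examples instead of 1000
```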
And do get to know what terms such as regularization actually mean.
Learn about overfitting in ML, and also learn how to overcome overfitting in neural networks.
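One standard remedy worth knowing cold is L2 regularization: adding a penalty on large weights to the training loss so the model generalizes instead of memorizing. A minimal sketch of how the penalty enters the loss (illustrative only, not tied to any specific course):

```python
def l2_penalty(weights, lam):
    """L2 regularization term: lam * sum of squared weights.
    Discourages large weights, one common way to fight overfitting
    (alongside dropout, early stopping, and more training data)."""
    return lam * sum(w * w for w in weights)

def regularized_loss(data_loss, weights, lam=0.01):
    """Total training objective: data loss plus the L2 penalty."""
    return data_loss + l2_penalty(weights, lam)

# e.g. penalty for weights [3.0, 4.0] with lam=0.1 is 0.1 * (9 + 16) = 2.5
```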
Do play around with the TF playground. Get a basic understanding of what feature engineering is and how you can introduce new features without adding more layers. These topics are also extensively covered in the Coursera course module.
IAM – Identity and Access Management
Apart from the general stuff, do look into service accounts and how you can federate and limit access to components in GCP. These are areas that are not covered extensively in the course module, so do some external reading on best practices.
Bigtable
Some questions are to be expected on Bigtable (5–8). You should clearly know the difference between BQ and Bigtable and when to choose each.
- Learn about data skew and the anti-patterns to avoid when designing solutions with Bigtable. If you observe delays in reads and writes, you should be able to modify the schema to reduce the skew.
- Do also look into the appropriate usage based on business requirements.
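The classic skew scenario above comes down to row key design: a key that starts with a plain timestamp sends every new write to the same tablet (a hotspot), while leading with a well-distributed field spreads the load. A sketch of the two schemes (the key layout and field names are my own illustration):

```python
import hashlib

def hot_row_key(ts, sensor_id):
    """Anti-pattern: a monotonically increasing prefix means all new
    writes land on the same tablet, creating a write hotspot."""
    return f"{ts:010d}#{sensor_id}"

def balanced_row_key(ts, sensor_id):
    """Better: lead with a well-distributed field (here a short hash
    of the sensor id) so writes spread across tablets, while keeping
    the timestamp in the key for per-sensor range scans."""
    prefix = hashlib.md5(sensor_id.encode()).hexdigest()[:4]
    return f"{prefix}#{sensor_id}#{ts:010d}"

k = balanced_row_key(1516000000, "sensor-7")
```

Note the trade-off: the balanced key sacrifices global time-ordered scans to gain write throughput, which is exactly the kind of judgment these questions test.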
And as I said before, be prepared for very long questions. I encountered several questions with lots of text providing much excess information. Skim through the text and find out the exact requirement of the question.
The Data Engineer certification exam is a fair assessment of the skills required if you want to demonstrate the ability to work effectively with Google Cloud Platform on analytics, big data, data processing, or machine learning projects. If you have already used the majority of these products on real-world projects, the exam should not present you with too many problems. If you haven’t yet used some of the products above, then get studying!
If you have reached this point: I have prepared a short notes document based on the course module, so please do send me an email if you want to learn more.