In response to the COVID-19 pandemic, Udacity, the popular tech learning platform, is offering one month free on various data-centric Nanodegrees, including the popular Data Engineering Nanodegree. Heeren Sharma, data engineer at DAIN Studios, looked at its content and warmly recommends having a look if you want to expand your skills from software engineering to data engineering.
This is what Heeren has to say about the content of the Data Engineering Nanodegree. For more information on Udacity and the offer, please visit https://blog.udacity.com/2020/03/one-month-free-on-nanodegrees.html
Like other Nanodegrees, this one is split into modules that are nicely structured and logical to navigate. Each module sets the stage for the next, and the learnings transfer nicely from one to the other.
Data Modelling

The first module is about data modelling and design for both relational and NoSQL databases, exploring Postgres as the SQL database and Apache Cassandra as the NoSQL columnar datastore. Data modelling is a crucial aspect, and it is often overlooked when starting a project. Only once the data grows and queries require massive joins does the need to refactor the data models become apparent. It gets more complicated as your data capabilities grow and new features are integrated, each requiring further updates to the data models. In no time, those quickly spun-up data models can become a bottleneck, not only for adding new features but for overall system performance.
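To make the relational side of this concrete, here is a minimal sketch of the star-schema style of modelling the module teaches with Postgres: a narrow fact table joined to dimension tables. The table and column names are hypothetical, and SQLite is used only so the example is self-contained.

```python
import sqlite3

# In-memory database standing in for Postgres, purely for illustration.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One fact table (events) plus dimension tables (descriptive attributes).
cur.executescript("""
CREATE TABLE dim_users (user_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_songs (song_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE fact_plays (
    play_id   INTEGER PRIMARY KEY,
    user_id   INTEGER REFERENCES dim_users(user_id),
    song_id   INTEGER REFERENCES dim_songs(song_id),
    played_at TEXT
);
INSERT INTO dim_users VALUES (1, 'Ada');
INSERT INTO dim_songs VALUES (10, 'Blue Train');
INSERT INTO fact_plays VALUES (100, 1, 10, '2020-03-01T10:00:00');
""")

# Analytical queries join the fact table to its dimensions.
row = cur.execute("""
    SELECT u.name, s.title
    FROM fact_plays f
    JOIN dim_users u ON f.user_id = u.user_id
    JOIN dim_songs s ON f.song_id = s.song_id
""").fetchone()
print(row)  # ('Ada', 'Blue Train')
```

In Cassandra, by contrast, the module shows the opposite approach: tables are denormalised around each query, precisely because such joins are not available.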
Cloud Data Warehouses
The second module focuses on data warehouses and, more precisely, the benefits of hosting them in the cloud. The cloud platform selected for the hands-on exercises is Amazon Web Services (AWS). This section provides a basic AWS introduction: creating EC2 instances, setting up IAM roles, and using boto3 to interact with various AWS resources. There is also a small practical project in which one builds a simple ETL pipeline, loading data from S3 into tables in Redshift (a popular AWS data warehouse solution). Feeling confused by all these abbreviations? Once you advance in this module, things will become much clearer.
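The core of that S3-to-Redshift load is Redshift's COPY command, which bulk-reads files from S3 into a table. A minimal sketch of building such a statement is below; the table name, bucket path, and IAM role ARN are all hypothetical placeholders, and in the actual project the statement would be executed against the cluster through a database connection (e.g. psycopg2).

```python
def build_copy_statement(table: str, s3_path: str, iam_role: str) -> str:
    """Build a Redshift COPY command that bulk-loads a table from S3."""
    return (
        f"COPY {table} FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role}' "
        "FORMAT AS JSON 'auto';"
    )

# Hypothetical names, for illustration only.
stmt = build_copy_statement(
    "staging_events",
    "s3://my-bucket/log_data/",
    "arn:aws:iam::123456789012:role/redshift-s3-read",
)
print(stmt)
```

The IAM role here is what the module's IAM exercises build up to: it grants the Redshift cluster read access to the bucket without embedding credentials in the pipeline.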
Spark and Data Lakes
It’s time to scale things up, and the third module is dedicated to exactly that. It introduces what big data is and the reasoning behind some of the most famous names you might have heard of, like Hadoop and Spark. It also explains quite smoothly some complex concepts of distributed file systems and cluster computing, i.e. what to do when data can’t fit on one machine. In the exercise part, the gentle introduction to PySpark is a definite highlight. The ETL pipeline exercise covers data wrangling with PySpark, data partitioning, and deployment of the overall Spark process on a cluster in AWS. One key learning from this module is the ability to recognise, based on your business needs, when a data lake is the right choice over the data warehouse introduced in the previous section.
Data Pipelines with Airflow
Real-world big data applications often involve many data pipelines, and if left unattended, you can end up maintaining a lot of them. To make matters worse, when something goes wrong in pipelines that depend on each other, frustration can reach the next level. This fourth and final module introduces scheduling, automating and monitoring data pipelines using Apache Airflow. The associated exercise involves configuring and scheduling data pipelines with Airflow and running data quality checks on them. Apart from the orchestration and monitoring of pipelines, the introduction to data quality is a key takeaway from this module.
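The pattern the module builds can be sketched as an Airflow DAG in which a quality-check task runs only after the load succeeds. This is a hedged sketch, not the course's actual project code: the DAG and task names are hypothetical, the task bodies are stubbed, and the imports follow the Airflow 1.x API current at the time of writing.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def load_table():
    # Stub: in the project this would e.g. COPY data from S3 into Redshift.
    pass


def check_quality():
    # Stub: in the project this would query the loaded table.
    row_count = 1
    if row_count < 1:
        raise ValueError("Data quality check failed: table is empty")


dag = DAG(
    "etl_with_quality_checks",
    start_date=datetime(2020, 3, 1),
    schedule_interval="@daily",
    catchup=False,
)

load = PythonOperator(task_id="load_table", python_callable=load_table, dag=dag)
quality = PythonOperator(
    task_id="check_quality", python_callable=check_quality, dag=dag
)

# The quality check depends on the load, so a failed load halts the pipeline.
load >> quality
```

Raising an exception in the check task is the key trick: Airflow marks the task failed, downstream tasks are held back, and the problem surfaces in the UI instead of propagating bad data.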
There is a lot one can learn from this Nanodegree. The program is well structured and intuitive to follow. The exercises accompanying each module reinforce the learnings and provide hands-on experience. Keep in mind that intermediate Python and SQL programming knowledge is a prerequisite. So, if you are a software engineer or backend engineer starting your data journey, you might find it quite useful. The final capstone project is a fantastic way to consolidate all those learnings in one place. The estimated time to complete the Nanodegree is five months (5-10 hr/week); however, if you can invest 3 hours per day, it can comfortably be done in a month.
What better way to train those data engineering brain cells!
Stay safe and enjoy learning!
Heeren Sharma is a Senior Data Engineer at DAIN Studios Munich.