Data Engineering Course in Udacity

Getting started with

As a response to the COVID-19 pandemic, Udacity, the popular tech learning platform, is offering one month free on various data-centric nanodegrees, including its popular data engineering one. Heeren Sharma, data engineer at DAIN Studios, looked at the content of the data engineering nanodegree and warmly recommends it if you are looking to expand your skills from software engineering to data engineering.

This is what Heeren has to say about the content of the data engineering nanodegree. For more information on Udacity and the offer, please visit https://blog.udacity.com/2020/03/one-month-free-on-nanodegrees.html

Like other Nanodegrees, this one consists of modules that are nicely structured and logical to navigate. Every module sets the stage for the next one, and the learnings transfer well.

Data Modelling

The first module is about data modelling and design for both relational and NoSQL databases. This section explores Postgres as a SQL database and Apache Cassandra as a NoSQL wide-column datastore. Data modelling is a crucial aspect that is often overlooked when starting a project. Only when the data grows and massive joins become necessary does the need to refactor the data model become apparent. It gets more complicated as your data capabilities grow and new features are integrated, because the data models have to be updated to incorporate them. In no time, those quickly spun-up data models can prove to be a bottleneck, not only for adding new features but also for overall system performance.
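To make the join problem concrete, here is a minimal sketch of a fact table joined to a dimension table. It uses Python's built-in sqlite3 module purely as a stand-in for Postgres so that it runs without a server. The table and column names (users, plays, level) are hypothetical and are not taken from the course material.

```python
import sqlite3

# In-memory SQLite database as a stand-in for Postgres (assumption for illustration).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension table: one row per user.
cur.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, level TEXT NOT NULL)")
# Fact table: one row per event, referencing the dimension by key.
cur.execute(
    "CREATE TABLE plays ("
    "play_id INTEGER PRIMARY KEY, "
    "user_id INTEGER NOT NULL REFERENCES users(user_id), "
    "song TEXT NOT NULL)"
)

cur.executemany("INSERT INTO users VALUES (?, ?)", [(1, "free"), (2, "paid")])
cur.executemany(
    "INSERT INTO plays (user_id, song) VALUES (?, ?)",
    [(1, "song_a"), (2, "song_b"), (2, "song_c")],
)

# Analytical query: plays per subscription level, which requires a join.
cur.execute(
    "SELECT u.level, COUNT(*) FROM plays p "
    "JOIN users u ON u.user_id = p.user_id "
    "GROUP BY u.level ORDER BY u.level"
)
plays_per_level = cur.fetchall()
conn.close()
```

Cassandra does not support joins, so a Cassandra data model would typically store a table that is already shaped for this specific query instead of combining tables at read time.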

Cloud Data Warehouses

To showcase data collection, the second module focuses on the data warehouse, and specifically on the benefits of cloud hosting. The cloud platform chosen for the hands-on exercises is Amazon Web Services (AWS). This section provides a basic introduction to AWS, such as creating EC2 instances and IAM roles and using boto3 to interact with various AWS resources. There is also a small practical project in which one builds a simple ETL pipeline that loads data from S3 into tables in Redshift (a popular AWS data warehouse solution). Are you feeling confused by all these abbreviations? Once you advance in this module, things will become much clearer.
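The S3-to-Redshift load is usually expressed as a Redshift COPY statement. The sketch below only builds such a statement as a string; it does not connect to AWS. The table name, bucket path, role ARN and region are hypothetical placeholders, and the JSON format is an assumption about the source files.

```python
def build_copy_statement(table: str, s3_path: str, iam_role_arn: str, region: str) -> str:
    """Return a Redshift COPY statement that loads JSON files from S3 into a table."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"REGION '{region}'\n"
        f"FORMAT AS JSON 'auto';"
    )


# Hypothetical values, for illustration only.
statement = build_copy_statement(
    table="staging_events",
    s3_path="s3://example-bucket/events/",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftRole",
    region="us-west-2",
)
print(statement)
```

In a real pipeline the statement would be executed against the cluster through a database connection. Plain string formatting like this is only acceptable for trusted, hard-coded values.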

Spark and Data Lakes

It’s time to scale things up, and the third module is dedicated to exactly that. This section introduces what big data is and the reasoning behind some of the most famous names you might have heard of, like Hadoop and Spark. It also explains quite smoothly some complex concepts of distributed file systems and cluster computing, i.e. what to do when data can’t fit on one machine. In the exercise part, the gentle explanation of PySpark is a definite highlight. The ETL pipeline development exercise consists of data wrangling with PySpark, data partitioning, and deployment of the overall Spark process on a cluster in AWS. One key learning from this module is the ability to recognize, based on your business needs, when a data lake is needed instead of the data warehouse introduced in the previous section.
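The idea behind data partitioning can be illustrated without a cluster. The sketch below is plain Python, not PySpark: it splits records into partitions by key, aggregates each partition independently (as separate machines could), and merges the partial results. The records, the partition count and the toy hash function are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def partition_by_key(records, num_partitions):
    """Assign each (key, value) record to a partition based on its key."""
    partitions = defaultdict(list)
    for key, value in records:
        # Summing code points is a deterministic toy hash; real systems use stronger ones.
        index = sum(ord(ch) for ch in key) % num_partitions
        partitions[index].append((key, value))
    return partitions


def aggregate_partition(partition):
    """Sum the values per key within a single partition."""
    totals = Counter()
    for key, value in partition:
        totals[key] += value
    return totals


records = [("berlin", 3), ("munich", 5), ("berlin", 2), ("helsinki", 7), ("munich", 1)]
partitions = partition_by_key(records, num_partitions=2)

# Each partition could be processed on a different machine; results are merged afterwards.
merged = Counter()
for partition in partitions.values():
    merged.update(aggregate_partition(partition))

totals = dict(merged)
```

Because all records with the same key land in the same partition, each partial result is already complete for its keys and the merge step is trivial.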

Data Pipelines with Airflow

In real-world big data applications, there are often many data pipelines. If their number is left unchecked, you might end up maintaining a lot of them. To make matters worse, if something goes wrong and many pipelines depend on each other, it can take your frustration to the next level. This fourth and final module introduces scheduling, automating and monitoring data pipelines using Apache Airflow. The associated exercise involves configuring and scheduling data pipelines with Airflow and running data quality checks. Apart from the orchestration and monitoring of pipelines, the introduction to data quality is a key takeaway from this module.
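Two ideas from this module, dependency-aware ordering and data quality checks, can be sketched with the Python standard library alone. This is not Airflow code: it uses graphlib (Python 3.9+) to compute a valid run order, and the task names and the check function are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "stage_events": set(),
    "load_fact_table": {"stage_events"},
    "load_dimension_table": {"stage_events"},
    "run_quality_checks": {"load_fact_table", "load_dimension_table"},
}

# A valid execution order: every task appears after all of its dependencies.
execution_order = list(TopologicalSorter(dependencies).static_order())


def check_not_empty(table_name, rows):
    """A simple data quality check: fail loudly if a table has no rows."""
    if len(rows) == 0:
        raise ValueError(f"Data quality check failed: {table_name} is empty")
    return True
```

An orchestrator such as Airflow adds scheduling, retries and monitoring on top of this kind of dependency graph, and a failing check would mark the run as failed instead of letting bad data flow downstream.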

Final Thoughts

There is a lot one can learn from this nanodegree. The program is well structured and, at times, intuitive to follow. The complementary exercises associated with each module boost practical learning and provide hands-on experience. Keep in mind that intermediate Python and SQL programming knowledge is a prerequisite for this nanodegree. So, if you are a software engineer or backend engineer starting your data journey, you might find it quite useful. The nanodegree ends with a capstone project, which is a fantastic way to consolidate everything you have learned. The estimated time to complete this nanodegree is five months (5-10 hr/week). However, if you can invest 3 hours per day, it can be done comfortably in a month.

What better way to train those data engineering brain cells?

Stay safe and enjoy learning!

Heeren Sharma is a Senior Data Engineer at DAIN Studios Munich.
