Cleaning up Data Messes and Training Citizen Data Scientists

UnTidy by DAIN Studios

Taming the data beast: How to clean up data messes and train citizen data-scientists

Most data starts life as a mess. Turning raw data into tidy information is an integral part of any data science and analytics workflow. Learning the essentials of data cleaning is a must for every employee before they can tap the potential of the data they work with, says Gyorgy Paizs.

On April 1st, DAIN Studios released a Python package called Untidy, which turns clean datasets into messy ones. Our consultants had been looking for a tool that would help demonstrate the multitude of ways data can be corrupted in domain-specific datasets. Data generated by organizations is rarely clean to begin with, and tidying it remains one of the most time-consuming and least automated processes in a typical data workflow. Equipping companies with the necessary skills to do this has become a key mission for DAIN Studios as we help European companies on their journey towards data maturity – the point at which a company becomes proficient at making data-driven decisions at scale across the whole organization.

Why data literacy is important

Organizations with a high level of data maturity distinguish themselves from laggards in a number of ways. One element common to all of them is the wide availability of data skills spread across the organization. A study by McKinsey showed that 62 percent of executives at “high-performing companies” have an understanding of “data concepts”, against 43 percent at “all other organizations” – at 53 to 38 percent, the disparity is similar among managers.

However, understanding basic “data concepts” and their implications is not only a must for data practitioners, executives and managers. Every single employee should have an appreciation for fact-based decision-making. All employees should have at least the basic skills for reading and interpreting data they come across in their daily work. Only broadly anchored data literacy will allow a company to tap the full potential of the data it generates. That does not mean that every employee has to be a data scientist, but that each one can spot a data opportunity – and find someone in the organization able to help them exploit it.

Raw data is as messy as the world it measures and can come riddled with problems: values can be missing, the character encoding used to store text can be corrupted, and statistical outliers may lurk among the variables. Such problems can be the result of technical issues, but also of human error. Crucially, each raw dataset’s deficiencies are slightly different, so tackling them is as much an art as a science.
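To make these three problem types concrete, here is a minimal sketch in plain Python that counts them in a single column of raw readings. It is an illustration only, not part of Untidy: the function name `profile_column`, the sample `readings`, and the choice of the 1.5 × interquartile-range rule for outliers are all assumptions made for this example.

```python
import statistics


def profile_column(values):
    """Count three common problems in a column of raw string values."""
    missing = 0
    garbled = 0
    numbers = []
    for value in values:
        # Missing: no value at all, or an empty / whitespace-only string.
        if value is None or value.strip() == "":
            missing += 1
            continue
        # U+FFFD is the replacement character decoders emit for bytes
        # they cannot decode, a typical trace of an encoding problem.
        if "\ufffd" in value:
            garbled += 1
            continue
        try:
            numbers.append(float(value))
        except ValueError:
            pass  # non-numeric text is skipped by the outlier check

    # Outliers: values beyond 1.5 interquartile ranges from the quartiles.
    outliers = []
    if len(numbers) >= 4:
        q1, _, q3 = statistics.quantiles(numbers, n=4)
        spread = q3 - q1
        low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
        outliers = [x for x in numbers if x < low or x > high]

    return {"missing": missing, "garbled": garbled, "outliers": outliers}


# Hypothetical raw readings: one empty string, one None,
# one value with a decoding error and one implausible spike.
readings = ["20.1", "19.8", "", "20.4", "20.0", "19.9",
            "2\ufffd.3", "20.2", "500.0", None]
report = profile_column(readings)
print(report)
```

A report like this only flags candidates. Whether a flagged value is an error or a genuine measurement still depends on the domain, which is why cleaning remains partly a judgment call.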

As the amount of data generated and collected by companies increases exponentially, there is ever more demand for data insights from all areas of an organization. Fostering data literacy beyond the hard core of data experts will reduce the likelihood of bottlenecks arising and new data opportunities being missed. It will increase a company’s chances of becoming truly data-driven as its employees start thinking about data as they go about their daily tasks. Like sustainability, data has to be “lived” in and by every employee.

Training the next generation of citizen data scientists

In an increasingly competitive labor market for data talent, most organizations’ best bet will be to help existing employees become (more) data literate. We at DAIN Studios have been helping companies raise employees’ skills in order to increase data literacy and rear the first generation of “citizen data scientists”. In making tidy data messy again, the Untidy Python package helps expose non-experts to the reality of messy data and the need to clean it before tapping the insights it contains. Untidy was designed to replicate the most common problems found in domain-specific datasets. By using real (not generic) data, the package puts trainees in a much better position to understand what data – and how much of it – can actually be saved in the cleaning process.

We are currently using Untidy as part of our data-training offering for a client from the manufacturing industry. The company collects millions of rows of sensor data from equipment across the world as part of normal daily operations. In this case, technical failures such as connection issues can introduce missing values, and a change of operational systems can introduce inconsistencies into otherwise complete datasets. With Untidy, we can create and demonstrate the specific types of data problems that are possible in this environment – and how to deal with them. This not only equips employees with the necessary skills, it also builds awareness of the importance of data quality – a concept equally crucial to any data transformation.
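The idea of deliberately corrupting a clean dataset can be sketched in a few lines of plain Python. The example below does not use Untidy and makes no claim about its API. The function `make_messy`, the `sensor` and `temperature` fields, and the Celsius-to-Fahrenheit switch are hypothetical stand-ins for the connection drops and system changes described above.

```python
import random


def make_messy(rows, missing_rate=0.2, switch_at=None, seed=0):
    """Return a corrupted copy of clean sensor rows; the input is not modified."""
    rng = random.Random(seed)  # seeded so a training exercise is reproducible
    messy = []
    for index, row in enumerate(rows):
        row = dict(row)  # copy, so the clean data stays intact
        if rng.random() < missing_rate:
            # Simulated connection drop: the reading never arrives.
            row["temperature"] = None
        elif switch_at is not None and index >= switch_at:
            # Simulated system change: later readings are logged
            # as Fahrenheit text instead of Celsius numbers.
            fahrenheit = row["temperature"] * 9 / 5 + 32
            row["temperature"] = f"{fahrenheit:.1f} F"
        messy.append(row)
    return messy


# Hypothetical clean dataset: ten Celsius readings from one sensor.
clean = [{"sensor": "A1", "temperature": 20.0 + i} for i in range(10)]
messy = make_messy(clean, missing_rate=0.3, switch_at=5, seed=42)

for row in messy:
    print(row)
```

Because the clean original is kept, trainees can compare their cleaned result against it and see how much of the data was actually recoverable.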

The package is now officially available on PyPI (pypi.org) and can be installed like any other Python package.
