Most data starts life as a mess. Turning raw data into tidy information is an integral part of any data science and analytics workflow. Learning the essentials of data cleaning is a must for every employee before they can tap the potential of the data they work with, says Gyorgy Paizs.
On April 1st, DAIN Studios released a Python package called Untidy, which turns clean datasets into messy ones. Our consultants had been looking for a tool that would help demonstrate the multitude of ways data can be corrupted in domain-specific datasets. Data generated by organizations is rarely clean to begin with, and tidying it remains one of the most time-consuming and least automated processes in a typical data workflow. Equipping companies with the necessary skills to do this has become a key mission for DAIN Studios as we help European companies on their journey towards data maturity – the point at which a company becomes proficient at making data-driven decisions at scale across the whole organization.
Why data literacy is important
Organizations with a high level of data maturity distinguish themselves from laggards in a number of ways. One element common to all of them is the wide availability of data skills spread across the organization. A study by McKinsey showed that 62 percent of executives at “high-performing companies” have an understanding of “data concepts”, against 43 percent at “all other organizations” – at 53 to 38 percent, the disparity is similar among managers.
However, understanding basic “data concepts” and their implications is not only a must for data practitioners, executives and managers. Every single employee should have an appreciation for fact-based decision-making. All employees should have at least the basic skills for reading and interpreting data they come across in their daily work. Only broadly anchored data literacy will allow a company to tap the full potential of the data it generates. That does not mean that every employee has to be a data scientist, but that each one can spot a data opportunity – and find someone in the organization able to help them exploit it.
Raw data is as messy as the world it measures and can come riddled with a multitude of problems: values can be missing, the character encoding used to store text can go wrong, and statistical outliers may be lurking among the variables. Such problems can be the result of technical issues, but also of human error. Crucially, each raw dataset’s deficiencies are slightly different, so tackling them is as much an art as a science.
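To make these three problems concrete, here is a minimal sketch in plain Python that checks a handful of values for each of them. The sample readings, the function names and the outlier threshold are all illustrative assumptions, and the outlier rule (median absolute deviation) is only one of several reasonable choices.

```python
import statistics

# Hypothetical raw readings: None marks a missing value, 250.0 looks suspicious.
readings = [20.1, 19.8, None, 20.4, 250.0, 20.0, 19.9, None, 20.2]


def count_missing(values):
    """Count the entries that are missing (None)."""
    return sum(1 for v in values if v is None)


def find_outliers(values, threshold=3.5):
    """Flag values far from the median, scaled by the median absolute deviation."""
    present = [v for v in values if v is not None]
    median = statistics.median(present)
    mad = statistics.median(abs(v - median) for v in present)
    if mad == 0:
        return []  # no spread to measure against
    return [v for v in present if 0.6745 * abs(v - median) / mad > threshold]


def fix_mojibake(text):
    """Undo one common encoding mishap: UTF-8 bytes that were read as Latin-1."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # not this kind of damage, leave the text untouched


print(count_missing(readings))
print(find_outliers(readings))
print(fix_mojibake("MÃ¼ller"))
```

Real datasets rarely fail this neatly, which is why each check usually has to be adapted to the dataset at hand.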
As the amount of data generated and collected by companies increases exponentially, there is ever more demand for data insights from all areas of an organization. Fostering data literacy beyond the hard core of data experts will reduce the likelihood of bottlenecks arising and new data opportunities being missed. It will increase a company’s chances of becoming truly data-driven as its employees start thinking about data as they go about their daily tasks. Like sustainability, data has to be “lived” in and by every employee.
Training the next generation of citizen data scientists
In an increasingly competitive labor market for data talent, most organizations’ best bet will be to help existing employees become (more) data literate. We at DAIN Studios have been helping companies raise employees’ skills in order to increase data literacy and rear the first generation of “citizen data scientists”. In making tidy data messy again, the Untidy Python package helps expose non-experts to the reality of messy data and the need to unscramble it before tapping the insights it contains. Untidy was designed to replicate the most common problems found in domain-specific datasets. By using real (not generic) data, the package puts trainees in a much better position to understand what data – and how much of it – can actually be saved in the cleaning process.
We are currently using Untidy as part of our data-training offering for a client from the manufacturing industry. The company collects millions of rows of sensor data from equipment across the world as part of normal daily operations. In this case, technical failures such as connection issues can introduce missing values, and a change of operational systems can introduce inconsistencies into otherwise complete datasets. With Untidy, we can create and demonstrate the specific types of data problems that are possible in this environment – and how to deal with them. This not only equips employees with the necessary skills, it also builds awareness about the importance of data quality – a concept equally crucial to any data transformation.
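The idea of deliberately corrupting a clean dataset can be illustrated with a short, self-contained sketch. It does not use Untidy's own API, which is not described here; the helper names, the column name and the toy sensor rows are hypothetical and only mimic the two problems mentioned above, namely dropped readings and a mid-stream change of units.

```python
import random


def inject_missing(rows, column, fraction, seed=0):
    """Return a copy of rows with `fraction` of the values in `column` set to None."""
    rng = random.Random(seed)  # fixed seed so the same rows are hit every run
    n_missing = round(len(rows) * fraction)
    targets = set(rng.sample(range(len(rows)), n_missing))
    return [
        {**row, column: None} if i in targets else dict(row)
        for i, row in enumerate(rows)
    ]


def switch_units(rows, column, start_index):
    """Report `column` in Fahrenheit from start_index on, mimicking a system change."""
    return [
        {**row, column: row[column] * 9 / 5 + 32}
        if i >= start_index and row[column] is not None
        else dict(row)
        for i, row in enumerate(rows)
    ]


# Toy stand-in for clean sensor data (temperatures in Celsius).
clean = [{"sensor_id": i % 3, "temp_c": 20.0 + i * 0.1} for i in range(10)]

# Simulate connection drops first, then a unit change from row 7 onwards.
messy = switch_units(inject_missing(clean, "temp_c", fraction=0.3), "temp_c", start_index=7)

for row in messy:
    print(row)
```

Because the corruption is applied to a copy, trainees can compare their cleaned result against the untouched original and see how much of the data was actually recoverable.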
The package is now officially available on pypi.org and can be installed just like any other standard Python package.