Exploring the Data Realm: Running a Data Audit

Running a Data Audit

Many companies want to embrace digital transformation and use Artificial Intelligence (AI). They have the ideas, the resources, and a vision, but the foundation is missing: a deep knowledge of their data, from how they define and use terminology to how they store, use, and archive data. Most AI and data analytics use cases require connecting multiple data sources of different types to gain deep insights and optimise the business. A lack of knowledge of the data ecosystem, or too narrow a view of it, can therefore threaten the success of these projects.

This post gives an overview of the principles of the Data Audit, which is the first milestone towards sustainable data management.

Principles

The Data Audit is composed of three main iterative steps: identifying the data assets, measuring and evaluating the data quality, and enriching the data through better quality processes.

Identify the data assets

First things first: get familiar with the data ecosystem. Data exists in many forms and can be described in many different ways, so the characteristics of a data asset need to be defined carefully to be fully understood. It is important to ask questions such as:

How to identify a data asset? What is the right level of detail?

To identify a data asset in the most efficient way, it is necessary to know its life cycle.

First, let’s define a data asset.

Any entity that is comprised of data. A data asset may be a system or application output file, database, document, or Web page, also a service that may be provided to access data from an application.
National Institute of Standards and Technology – U.S. Department of Commerce

Every data asset has a life cycle: planning, obtainment, storage and sharing, maintenance, use, and disposal. Let’s explore each phase in more detail:

Planning: Identify objectives, plan information architecture, develop standards and definitions;
Obtainment: Ways to acquire data. For example, by creating records, purchasing data, or loading external files;
Storage and Sharing: Form of storing and sharing data. Data may be stored electronically such as in databases or files. Data can be shared through such means as networks, emails or APIs;
Maintenance: Update, change, and manipulate data. Cleanse, wrangle and transform data. Match and merge records, or optimize data;
Use: Retrieve data. Use information. Use can be completing a transaction, writing a report, making a management decision from a report, and/or running automated processes;
Disposal: Archive information or delete data or records.

For each phase, four aspects have to be taken into account:

the What: what data is concerned during these phases?
the How – Processes: how are these phases executed from a procedural point of view?
the Who: who is involved in these phases?
the How – Technology: how are these phases executed from a technological point of view?

This framework gives a holistic view of the analyzed data assets: it will be the foundation for improving the data asset, imagining new use cases and making data governance implementable.
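The framework above (six phases, four aspects per phase) can be captured in a small inventory structure. The following is only a minimal sketch: the class and method names, the asset name, and the descriptions are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Phase(Enum):
    PLANNING = "planning"
    OBTAINMENT = "obtainment"
    STORAGE_AND_SHARING = "storage and sharing"
    MAINTENANCE = "maintenance"
    USE = "use"
    DISPOSAL = "disposal"


# The four aspects to document for every phase.
ASPECTS = ("what", "how_process", "who", "how_technology")


@dataclass
class DataAsset:
    name: str
    # Maps a phase to {aspect: free-text description}.
    lifecycle: dict = field(default_factory=dict)

    def describe(self, phase, **aspects):
        """Record one or more aspect descriptions for a phase."""
        unknown = set(aspects) - set(ASPECTS)
        if unknown:
            raise ValueError(f"unknown aspects: {sorted(unknown)}")
        self.lifecycle.setdefault(phase, {}).update(aspects)

    def gaps(self):
        """Return the (phase, aspect) pairs that are still undocumented."""
        return [
            (phase, aspect)
            for phase in Phase
            for aspect in ASPECTS
            if not self.lifecycle.get(phase, {}).get(aspect)
        ]


# Hypothetical example: only the obtainment phase has been documented so far.
contracts = DataAsset("mobile_customer_contracts")
contracts.describe(
    Phase.OBTAINMENT,
    what="new contract records",
    how_process="created by sales agents at sign-up",
    who="sales team",
    how_technology="CRM form writing to a database",
)
remaining = contracts.gaps()
print(len(remaining))  # 6 phases x 4 aspects = 24, minus 4 documented = 20
```

Listing the gaps makes it easy to see which phases still need an interview or a documentation review before the asset can be considered identified.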

Measure and Evaluate Data Quality

Now that the data assets have been identified, the next step is to determine whether the data is usable. Not all data is usable or relevant to the AI and analytics use cases, so it is necessary to assess data quality. To do this, we ask the following questions:

What makes data quality? Is it necessary to have an impeccable data asset?

Data quality relies on metrics that are compared against business rules. The metrics are the same for all data assets, but the business rules are not: they are specific to each asset. Comparing the two is what validates the data quality.

Data Quality Metrics

Data quality can be measured with 6 main metrics:

Timeliness: Latency between the event and the availability of the information – e.g. the contracts data set is refreshed once per year;
Completeness: Percentage of values that are considered as complete – e.g. only 80% of the timestamps are complete;
Validity: Respect of a certain syntax, range or set of rules – e.g. age has to be between 18 and 110 for some cases;
Uniqueness: Absence of duplicate records – e.g. if a business serves 100 clients, there shouldn’t be more than 100 client records in the data set;
Consistency: Consistency in the way to represent data – e.g. dates are always encoded in the same way;
Accuracy: Exactness of the data compared with reality – e.g. the data in the billing system should match, and be consistent with, the financial data sets.

With the exception of timeliness, all the metrics are measured at the level of individual fields.
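To make the metrics concrete, here is a rough sketch of how some of them could be computed on a handful of records. The sample records, field names, and email pattern are hypothetical. Consistency is read here as "share of dates using one agreed encoding", which is only one possible interpretation. Accuracy is left out because it needs a trusted reference source to compare against.

```python
import re
from datetime import datetime

# Hypothetical sample records; None marks a missing value.
records = [
    {"contract_id": "C001", "email": "ann@example.com", "signed": "2023-01-15"},
    {"contract_id": "C002", "email": None, "signed": "2023-02-01"},
    {"contract_id": "C002", "email": "bob@example", "signed": "01/03/2023"},
    {"contract_id": "C004", "email": "eve@example.com", "signed": "2023-04-20"},
]

# Deliberately simple pattern, used for illustration only.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def present_values(field):
    """Values of a field that are neither missing nor empty."""
    return [r[field] for r in records if r.get(field) not in (None, "")]


def completeness(field):
    """Share of records in which the field is filled in."""
    return len(present_values(field)) / len(records)


def validity(field, is_valid):
    """Share of filled-in values that pass the rule (assumes at least one value)."""
    present = present_values(field)
    return sum(1 for v in present if is_valid(v)) / len(present)


def uniqueness(field):
    """Share of filled-in values that are distinct (assumes at least one value)."""
    present = present_values(field)
    return len(set(present)) / len(present)


def is_iso_date(value):
    """True if the value is encoded as YYYY-MM-DD."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True


def timeliness(event_time, available_time):
    """Latency between an event and its availability, measured per data set."""
    return available_time - event_time


email_completeness = completeness("email")                # 3 of 4 filled in
email_validity = validity("email", EMAIL_PATTERN.match)   # 2 of 3 well formed
id_uniqueness = uniqueness("contract_id")                 # 3 distinct out of 4
date_consistency = validity("signed", is_iso_date)        # 3 of 4 in ISO format
latency = timeliness(datetime(2023, 4, 20, 9, 0), datetime(2023, 4, 21, 6, 0))
```

Each field-level metric is expressed as a share between 0 and 1, which makes it straightforward to compare against a threshold later on.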

Business Rules

It’s good to be able to quantify the data quality but the final goal is to qualify data reliability. Indeed, not all the fields of a data set have the same importance: inaccuracy, incompleteness, and non-validity can be accepted for some of them.

To qualify data reliability, the first step is to define business-related data quality rules. These rules are defined by the owner of the data asset and represent the standard against which data can be qualified as reliable. Each metric seen previously should have its own business rule, although in some cases there may be none.

For example, a telecom company has a data set containing all the mobile customer contracts. Two of its fields are contract ID and email address. Let’s define the business rules for both of them.

Timeliness is the same for the whole data set – it should be refreshed daily.

Contract ID:

Email Address:

The key idea here is that not all fields have the same importance. For those that are central to the core business, the business rules should be strict and carefully defined. For secondary fields, however, it may not be worth spending too much time on defining rules.

Data Quality Evaluation

The goal here is to compare the measured metrics with the defined business rules. If all the metrics respect the rules, the field is validated.

In the best case, every field is validated and therefore the entire data asset is validated too. In the worst, but simple, case, every field fails and the entire data asset needs correction.

In reality, the evaluation usually yields a mix: some fields are compliant and others are not. In that case, the mandatory fields should be defined, and their compliance required, for validating the data asset. This validation should, however, come with some indication of how the problematic fields could affect (or not) current or future business use cases.
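The comparison between measured metrics and business rules could look roughly like the sketch below. All numbers are made up for illustration: neither the measured values nor the thresholds come from a real data set, and the choice of contract ID as the only mandatory field is an assumption.

```python
# Measured metrics per field, as shares between 0 and 1 (hypothetical values).
measured = {
    "contract_id": {"completeness": 1.00, "uniqueness": 1.00, "validity": 0.99},
    "email": {"completeness": 0.80, "uniqueness": 0.97, "validity": 0.90},
}

# Business rules: minimum acceptable value per metric (hypothetical thresholds).
# A metric without a rule is simply not checked.
rules = {
    "contract_id": {"completeness": 1.00, "uniqueness": 1.00, "validity": 1.00},
    "email": {"completeness": 0.75, "validity": 0.85},
}

# Fields whose compliance is required to validate the whole asset (assumption).
mandatory_fields = {"contract_id"}


def failed_metrics(field_name):
    """Metrics of a field that fall below their business rule."""
    return sorted(
        metric
        for metric, minimum in rules.get(field_name, {}).items()
        if measured[field_name].get(metric, 0.0) < minimum
    )


def evaluate():
    """Return a per-field report and whether the data asset is validated."""
    report = {name: failed_metrics(name) for name in measured}
    asset_valid = all(not report[name] for name in mandatory_fields)
    return report, asset_valid


report, asset_valid = evaluate()
print(report)       # {'contract_id': ['validity'], 'email': []}
print(asset_valid)  # False
```

In this made-up case the secondary field passes its looser rules, while the mandatory field misses its strict validity rule, so the asset as a whole is not validated.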

Implement quality processes and enrich data

After the data quality evaluation, the issues have been highlighted and have to be resolved. The best way to resolve them is to find their root causes.

Fortunately, the data life cycles of problematic data assets are known. Therefore, it is possible to locate the stage where the problem appears, its technical and/or business reason, and the people to talk with to fix it.
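One way to picture this, purely as a sketch, is to measure the same metric on a snapshot taken at each life-cycle phase and look for the first phase where it drops below the business rule. The measurements, the threshold, and the contact names below are hypothetical.

```python
# Life-cycle phases in the order used during asset identification.
PHASES = [
    "planning",
    "obtainment",
    "storage and sharing",
    "maintenance",
    "use",
    "disposal",
]

# Hypothetical completeness of one field, measured at each phase where data exists.
completeness_by_phase = {
    "obtainment": 0.98,
    "storage and sharing": 0.97,
    "maintenance": 0.80,
    "use": 0.80,
}

# Hypothetical "Who" aspect recorded for each phase.
contacts = {
    "obtainment": "sales team",
    "storage and sharing": "database administrators",
    "maintenance": "data engineering team",
    "use": "reporting analysts",
}


def first_failing_phase(measurements, minimum):
    """Earliest phase whose measurement falls below the business rule, if any."""
    for phase in PHASES:
        value = measurements.get(phase)
        if value is not None and value < minimum:
            return phase
    return None


phase = first_failing_phase(completeness_by_phase, minimum=0.95)
print(phase)                # maintenance
print(contacts.get(phase))  # data engineering team
```

Here the drop appears during maintenance, which suggests looking at the cleansing or merging steps first and talking to the people recorded for that phase.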

To conclude, a Data Audit is divided into three phases. The first is the identification of the data asset and the investigation of its life cycle. The second is the evaluation of the data quality, which requires measuring the six main metrics and comparing them to the defined business rules. The output of these two phases is a clear view of any problematic data. The third and last phase consists of treating the root causes of these issues. The data life cycle shows why and where a problem occurs, and with whom to interact to reach a sustainable data management solution.
