Have you ever wanted to cook something, only to find that the kitchen is a mess? The sink is filled to bursting with pots and pans, and the counters are covered with all manner of food in sticky, multicoloured patches (is that last night’s bolognese congealing on the microwave?). And that’s not even to mention the state of the floor. You can’t possibly cook in a kitchen like this – it would be some sort of health hazard. So, what are you going to do? You’re going to grab a sponge and get cleaning.
This is exactly what it’s like to start a machine learning project with dodgy data. You can’t cook in a dirty kitchen, and you can’t come up with a brilliant machine learning solution with low-quality data. The data is a key ingredient of the project: using bad data would be like using bad eggs to bake a cake. It’s just going to go badly for everyone involved.
What is high-quality data?
For us to design the best solution in a project, data must be of a good standard. We consider data to be high-quality if it is easily available to us and in an accurate, credible and complete state.
The importance of data quality assessment
Assessing the quality of the data that we are given is a crucial element of the machine learning solution design process. A poor assessment can lead to problems further down the line. For example, a poor assessment of data availability can put pressure on the project timeline, forcing the actual modelling phase to be rushed. Similarly, if the data’s relevance is assessed poorly, there can be a mismatch between the available data and the promised goals. So, before a project even begins, it’s important that we know exactly what state the data is in.
The success of a project is determined by the quality of the data that we are given: the better the data, the better the solution we can create for our client.
The ABCD checklist
That’s why we created a checklist to assess the quality of the data. We designed a scale from D to A that grades the data according to whether it is available, usable, reliable and relevant to the project. We ask ourselves:
- Is the data available to the team?
- Is the data in a structure the team can fully comprehend and use?
- Can the team trust the data and can it explain edge cases?
- Does this data help to solve the client’s problem?
If the answer to these four questions is yes, then we know that we have high-quality data.
We use the scale internally, on an informal basis, to clarify not only the quality of the data, but also what we need to do to it and what to ask of the client. That way, when we do start the project, we know that we’re starting off on the right foot, with high-grade, usable and reliable data.
Each transition from one level to the next reflects a different skillset and may require someone other than a machine learning specialist to perform it, in the same way that you wouldn’t ask an Italian chef to show off their sushi-making skills.
- Data collection / creation: going from level D to level C, i.e. making the data available.
- Data extraction: going from level C to level B, i.e. making the data usable.
- Data cleaning: going from level B to level A, i.e. ensuring the data is clean and ready for machine learning purposes.
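The grading implied by these transitions can be sketched as a small function. This is our own illustration, not part of any Scyfer tooling; the argument names simply mirror the three transitions above.

```python
def grade_data(available: bool, usable: bool, clean: bool) -> str:
    """Grade a dataset from D (lowest) to A, following the transitions above.

    Each step only matters once the previous one holds: data that is not
    available cannot be usable, and data that is not usable cannot be clean.
    """
    grade = "D"
    if available:        # data collection / creation done
        grade = "C"
        if usable:       # data extraction done
            grade = "B"
            if clean:    # data cleaning done
                grade = "A"
    return grade
```

Data that is available and usable but not yet cleaned, for example, would sit at grade B – the minimum standard we need before starting a project.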
Example: The World Cup
Let’s take an example. Imagine a client comes along who wants to bet on the 2018 World Cup. He needs an algorithm that will predict the winning team of every match during the cup. The inputs we would use for this project, as defined in our previous article on the Machine Learning Canvas, would be the history of each team in terms of wins, losses and earnings; the players that each team will use; and the history of each player in terms of goals, penalties, injuries, salaries, teams played for and leagues played in. The data will come from online football databases, and perhaps also from the Football Manager game.
We’d first ask ourselves whether this data was relevant for the project. Although we mention the problem definition at the end of the data assessment, it should really be discussed from the very beginning of the project. If the data is of high quality but is irrelevant to our project objective, then we’ve no use for it. So it’s important to remember that the ABCD questions are not sequential, as this table would imply, but should be given equal weight in the decision making process.
The relevance of the data should be matched against the project goals. For example, if the goal of the project was to test the ‘home advantage’ hypothesis, but your data only consisted of matches played in the World Cup, where both teams play on foreign soil, then this data is not relevant for your project. For the purposes of this project, which aims to predict the winning team, the data consisting of information about each team and player is relevant. If we decide that the data is relevant, it has met the criteria for grade A data.
Next, we’d think about whether the data was available for us to use. We’d consider whether the data is accessible to us, whether it exists and, if not, how long it would take to create it. For this project, we know that the data on previous football matches already exists, and new data will be generated after every new World Cup match.
We’d also consider whether we are authorised to use this data. Before starting any project, we need to be clear on the degree of privacy involved, and to know whether the department initiating the project has all the appropriate approvals from the other departments, as well as from their clients. For example, sensitive patient data from a medical organisation will be handled differently to quality inspection data from the steel industry.
For our betting project, authorisation might be a point of consideration, since data from Football Manager or other football databases may not be available for commercial use – but for the sake of this article let’s consider it solved. A possible solution to this problem would be signing a non-disclosure agreement (NDA).
Finally, we would need to consider the timeliness of the data by asking ourselves whether the data is stable or growing, and when it will be accessible for use in the project. In this case, part of the data is historical, as it is information from previous matches, and so it is stable and accessible instantly for the project. The data will also grow throughout the project, as new information becomes available after every new World Cup match. As our data meets the requirements of accessibility, authorisation and timeliness, it passes the D category.
The next step would be to consider the data’s usability. There should be enough processing power for the data to be handled correctly, so we would think about the size of the data and whether our server has enough memory and disk space to handle it. Also, before starting a project, it’s important to understand the format of the documents and plan ahead for the sort of programs that will be needed to open a specific file. For example, DICOM files are commonly used in the medical field, but are a more complex format than, say, a JPEG image. In terms of data structure, we would consider how the data tables are related to each other. We would also ask the client to provide a schema along with the data files, to make it quicker for us to get to know the structure within the data.
We would also think about documentation: whether all the data is specified and whether every field is documented and unambiguous. This is an important step for being able to spot outliers in the data. For example, if you are collecting temperature data, the unit should be specified, so that if you see a 0 value from data collected in a furnace, you are able to see that this is a strange value. Then, we have to do a bit of detective work, determining whether the value is a mistake or correct. If we spot missing values, they can be dropped as long as there is enough data left. However, if the data is needed, then it’s necessary to ask the client for it, or to impute it ourselves where possible.
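To make the furnace example concrete, here is a minimal sketch of such a sanity check. The plausible temperature range is an assumption invented for illustration; the point is that a check like this can only be written down once the unit is documented.

```python
# Assumed plausible operating range for a furnace, in degrees Celsius.
# This range is hypothetical and exists only to illustrate the idea.
FURNACE_RANGE_C = (400.0, 1600.0)

def flag_outliers(readings, valid_range=FURNACE_RANGE_C):
    """Return the readings that fall outside the documented plausible range."""
    lo, hi = valid_range
    return [t for t in readings if not (lo <= t <= hi)]

# A 0-degree reading from a furnace stands out immediately:
flag_outliers([1250.0, 1310.0, 0.0, 1288.0])  # -> [0.0]
```

Whether a flagged value is a sensor fault, a unit mix-up, or a genuine reading is exactly the detective work described above.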
For example, if the composition of a sports team is missing for one particular match but we already know that it stayed the same, we can go ahead and fill it in. We would try not to drop data in limited datasets, such as our Football Manager data, since there have been a finite number of football matches played.
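That kind of imputation can be sketched as follows, assuming match records are ordered by date; the record fields are made up for illustration.

```python
def fill_missing_lineups(matches):
    """Carry the last known line-up forward when a match record is missing one.

    Assumes `matches` is ordered chronologically and that a missing line-up
    means the team composition stayed the same as the previous match.
    """
    last_lineup = None
    filled = []
    for match in matches:
        lineup = match.get("lineup") or last_lineup
        filled.append({**match, "lineup": lineup})
        last_lineup = lineup
    return filled

matches = [
    {"match": 1, "lineup": ["Ada", "Ben", "Cal"]},
    {"match": 2, "lineup": None},  # missing: assume the line-up was unchanged
    {"match": 3, "lineup": ["Ada", "Ben", "Dee"]},
]
```

Note that this only works because we know something about the domain; blindly forward-filling fields we don’t understand would be exactly the kind of undocumented guesswork the checklist is meant to prevent.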
In this case, the football data wouldn’t be too big for Scyfer’s server to handle, and the format and structure would be fairly simple and easy to understand. By determining the data’s structure, format and documentation, and the processing power needed to handle it, we can say it meets the requirements of the C category.
It is also important to consider the reliability of the data and who collected it. In an ideal world, the data can be checked against a second, external source: if you’re dealing with financial data, for instance, you can cross-check it with Bloomberg. This is not always possible, but in our case we could cross-check the data across several databases, or check the Football Manager data against other games such as PES or FIFA.
We’d also consider whether the data was objective or subjective. For example, temperature data from a sensor is objective, while a pain scale is subjective; objective data is more reliable than subjective data. Another aspect to consider is whether the values actually make sense and reflect the true state of the source information. In terms of consistency, where data is split among multiple databases, we’d consider whether the fields have consistent names and values across those databases, and whether modifications to a field over time are documented.
This is a relevant question for our project, since we’ll likely be extracting data from numerous football databases as well as Football Manager. It is probable that the values and fields in these databases are fairly consistent, e.g. ‘number of goals scored’, since in football, words like ‘goal’, ‘player’ and ‘score’ are fixed terms. Therefore, we can say that the data is accurate, consistent and complete, and so it satisfies the elements of the B category. This is the minimum standard of data that we need to start a project.
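A quick way to surface consistency problems before merging sources is to compare field names across them. The database names and fields below are invented for illustration.

```python
def field_mismatches(schemas):
    """Return the fields that do not appear in every source schema.

    `schemas` maps a source name to the set of field names it exposes.
    """
    all_fields = set().union(*schemas.values())
    common = set.intersection(*(set(s) for s in schemas.values()))
    return sorted(all_fields - common)

schemas = {
    "db_one": {"player", "goals_scored", "team"},
    "db_two": {"player", "goals", "team"},  # 'goals' vs 'goals_scored'
}
field_mismatches(schemas)  # -> ['goals', 'goals_scored']
```

A non-empty result doesn’t necessarily mean the data is bad – ‘goals’ and ‘goals_scored’ may well be the same thing – but every mismatch is a question to settle with the client before the sources are merged.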
Finally, after grades D, C and B have been fulfilled, we would conduct a simple statistical analysis to show whether there is a correlation between figures and targets, in order to confirm the gut feeling we had about the data’s relevance at the beginning of the data quality assessment.
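Such a check can be as simple as computing the Pearson correlation between one input figure and the target; the numbers below are invented for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical figures: one input (past wins) against the target
# (wins this season) for five teams.
past_wins = [10, 14, 7, 20, 12]
season_wins = [3, 5, 2, 8, 4]
pearson(past_wins, season_wins)  # close to 1: the input looks informative
```

A coefficient near zero for every input would be a strong signal that the gut feeling about relevance was wrong, and that the problem definition needs revisiting before any modelling starts.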
It’s best that the client performs the data extraction prior to the project. However, in most cases Scyfer can do the data cleaning, since it’s beneficial for us to get accustomed to the data’s structure.
So, it’s pretty cool how a simple four-step process can help to grade the quality of the data, right? Making sure that we start our projects with high-quality data is in both our interest and our client’s, as it ensures that we have the best possible ingredients to design a machine learning solution that fulfils our client’s needs. It’s also an important part of our relationship with the client, as it helps to set out clear and appropriate expectations about what can be achieved during the project.