The Importance of Data Quality in Machine Learning

We are at an exciting point in time, where Machine Learning (ML) is applied across sectors from self-driving cars to personalised medicine. Although ML models have been around for a while (algorithmic trading models date from the 1980s, and Bayes’ theorem from the 1700s), we are still in the nascent stages of productionising ML.

From a technical viewpoint this is ‘Machine Learning Ops’, or MLOps. MLOps involves figuring out how to build models, deploy them via continuous integration and deployment, and track and monitor models and data in production. These areas have been captured by a recent Google paper on technical debt.

From a human, risk and regulatory viewpoint, we are grappling with big questions around ethical AI (Artificial Intelligence) systems and where and how they should be used. Risk, privacy and security of data, accountability, fairness and adversarial AI all come into play in this topic. Additionally, the debate over supervised, semi-supervised and unsupervised machine learning brings further complexity to the mix.

A lot of the focus is on the models themselves, such as Google’s BERT and OpenAI’s GPT-3. Yet everyone can get their hands on pre-trained models or licensed APIs; what differentiates a good deployment is the quality of the data.

However, the one common theme that underpins all this work is the rigour required to develop production-level systems, and especially the data necessary to ensure they are reliable, accurate and trustworthy. This is especially important for ML systems, given the role that data and processes play and the impact of poor-quality data on ML algorithms and learning models in the real world.

Data as a common theme 

If we shift our gaze from the model side and model performance to the data side, including:

  • Data management – what processes do I have to manage data end to end, especially generating accurate training data?
  • Data integrity – how am I ensuring I have high quality data throughout?
  • Data cleansing and improvement – what am I doing to prevent bad data from reaching data scientists?
  • Dataset labelling – how am I avoiding the risk of unlabeled data?
  • Data preparation – what steps am I taking to ensure my data is data science ready?

– then a far greater understanding of performance and model impact (consequences) could be achieved. However, this is often viewed as less glamorous or exciting work and as such is often undervalued. It also raises the question of what impetus companies or individuals have to invest at this level: regulatory (e.g. BCBS), financial, reputational or legal?
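As a rough illustration of what the data-side questions above could look like in practice, here is a minimal sketch in Python using only the standard library. The record layout (`id`, `text`, `label`) and the function names are assumptions made for this example, not a prescribed schema or tool.

```python
from collections import Counter


def is_blank(value):
    """Treat None and empty or whitespace-only strings as missing."""
    return value is None or (isinstance(value, str) and not value.strip())


def profile_records(records, required_fields=("id", "text"), label_field="label"):
    """Count basic data quality problems in a list of dict records."""
    # Dataset labelling: how many records have no label at all?
    unlabelled = sum(1 for row in records if is_blank(row.get(label_field)))
    # Data integrity: how many records are missing a required field?
    incomplete = sum(
        1 for row in records
        if any(is_blank(row.get(field)) for field in required_fields)
    )
    # Data cleansing: which ids appear more than once?
    id_counts = Counter(row.get("id") for row in records if not is_blank(row.get("id")))
    duplicate_ids = sorted(key for key, count in id_counts.items() if count > 1)
    return {
        "total": len(records),
        "unlabelled": unlabelled,
        "incomplete": incomplete,
        "duplicate_ids": duplicate_ids,
    }


# Hypothetical sample records, invented for illustration only.
sample = [
    {"id": 1, "text": "invoice overdue", "label": "finance"},
    {"id": 2, "text": "", "label": "finance"},           # missing required text
    {"id": 2, "text": "reset password", "label": None},  # duplicate id, no label
    {"id": 3, "text": "card declined", "label": "payments"},
]
report = profile_records(sample)
print(report)
```

Counts like these are only a starting point, but reporting them before any model training makes the state of the data visible to the whole team rather than leaving it for data scientists to discover.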

Yet, as a recent Google paper puts it,

“Data largely determines performance, fairness, robustness, safety, and scalability of AI systems…[yet] In practice, most organizations fail to create or meet any data quality standards, from under-valuing data work vis-a-vis model development.[1]” 

This has a direct impact on people’s lives and society, where “…data quality carries an elevated significance in high-stakes AI due to its heightened downstream impact, impacting predictions like cancer detection, wildlife poaching, and loan allocations”.

What this looks like in practice

We have seen this with the recent exam predictions in the UK, when exams were cancelled due to Covid. In this case, teachers predicted the grades of their students, and the Office of Qualifications and Examinations Regulation (Ofqual) then applied an algorithm to these predictions to correct for any potential grade inflation. This algorithm was quite complex and, in the first instance, non-transparent. When the results were released, 39% of grades were downgraded. The algorithm captured the distribution of grades from previous years, the predicted distribution of grades for past students, and then the current year.

In practice this meant that if you were a candidate who had performed really well at GCSE but attended a historically poorly performing school, then it was very difficult to achieve a top grade. Teachers had to rank the students in their class, resulting in a relative ranking system that could not equate to absolute performance. It meant that even if you were predicted a B, if you were ranked fifteenth out of 30 in your class and the pupils ranked fifteenth in each of the last three years received a C, you would likely get a C.
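The ranking effect described above can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the regulator's actual model: the function name, the percentile mapping and the historical grade distribution are all assumptions invented for this example.

```python
def grade_from_rank(rank, class_size, historical_grades):
    """Map a pupil's class rank onto the school's historical grade distribution.

    historical_grades is a list of past grades ordered best to worst.
    The teacher's predicted grade is deliberately not an input.
    """
    if not 1 <= rank <= class_size:
        raise ValueError("rank must be between 1 and class_size")
    # Relative position of the pupil in the class, between 0 and 1.
    position = (rank - 0.5) / class_size
    # Look up the grade that pupils at that position received in the past.
    index = min(int(position * len(historical_grades)), len(historical_grades) - 1)
    return historical_grades[index]


# Hypothetical history for one school: 3 As, 7 Bs, 12 Cs and 8 Ds.
history = ["A"] * 3 + ["B"] * 7 + ["C"] * 12 + ["D"] * 8

teacher_prediction = "B"
awarded = grade_from_rank(rank=15, class_size=30, historical_grades=history)
print(f"predicted {teacher_prediction}, awarded {awarded}")
```

With this made-up history, the pupil ranked fifteenth of 30 is awarded a C whatever the teacher predicted, because the only inputs are the rank and the school's past results.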

The application of this algorithm caused uproar, not least because schools with small class sizes (usually private, or fee-paying, schools) were exempt from the algorithm, so their teacher-predicted grades were used instead. Additionally, it baked in past socioeconomic biases, benefitting under-performing students in affluent (and previously high-scoring) areas while suppressing the capabilities of high-performing students in lower income regions.

A major lesson to learn from this, therefore, is the need for transparency in the process and in the data that is used.

An example from healthcare

Data quality also had an impact on ML cancer prediction when IBM’s ‘Watson for Oncology’ partnered with The University of Texas MD Anderson Cancer Center in 2013 to “uncover valuable insights from the cancer center’s rich patient and research databases”. The system was trained on a small number of hypothetical cancer patients, rather than on real patient data. This resulted in erroneous and dangerous cancer treatment advice.

Significant questions that must be asked include:

  • Where did it go wrong here: certainly in the data, but also more generally in the wider AI system?
  • Where was the risk assessment?
  • What testing was performed?
  • Where did responsibility and accountability reside?

Machine Learning practitioners know well the statistic that 80% of ML work is data preparation. Why, then, don’t we focus on this 80% of the effort and deploy a more systematic approach, so that data quality is embedded in our systems and treated as important work for an ML team to perform?

This is a view recently articulated by Andrew Ng, who urges the ML community to be more data-centric and less model-centric[2]. In fact, Andrew demonstrated this with a steel sheet defect detection use case, in which a deep learning computer vision model achieved a baseline accuracy of 76.2%. By addressing inconsistencies in the training dataset and correcting noisy or conflicting labels, the classification accuracy reached 93.1%. Interestingly, and compellingly from the perspective of this blog post, only minimal performance gains were achieved by addressing the model side alone.
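One simple way to start looking for conflicting labels of this kind is to group annotations by example and flag any example that has received more than one distinct label. The sketch below assumes annotations arrive as (example id, label) pairs; the ids, label names and function name are hypothetical, and this is not a description of the method used in the steel defect work.

```python
from collections import Counter, defaultdict


def find_label_conflicts(annotations):
    """Return examples that were given more than one distinct label.

    annotations is an iterable of (example_id, label) pairs, for instance
    one pair per annotator per image.
    """
    labels_by_example = defaultdict(list)
    for example_id, label in annotations:
        labels_by_example[example_id].append(label)

    conflicts = {}
    for example_id, labels in labels_by_example.items():
        counts = Counter(labels)
        if len(counts) > 1:  # annotators disagreed on this example
            conflicts[example_id] = dict(counts)
    return conflicts


# Hypothetical annotations, invented for illustration only.
annotations = [
    ("sheet_001", "scratch"),
    ("sheet_001", "scratch"),
    ("sheet_002", "scratch"),
    ("sheet_002", "no_defect"),
    ("sheet_003", "dent"),
]
conflicts = find_label_conflicts(annotations)
print(conflicts)
```

Flagged examples can then be sent back for relabelling against clearer guidelines, which is the kind of data-side fix that the accuracy figures above reward.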

Our view is that if data quality is a key limiting factor in ML performance, then let’s focus our efforts on improving data quality, and ask whether ML can be deployed to address this. This is the central theme of the work that the ML team at Datactics undertakes. Our focus is automating the manual, repetitive (often referred to as boring!) business processes of DQ and matching tasks while embedding subject matter expertise into the process. To do this, most of our solutions employ a human-in-the-loop approach where we capture human decisions and expertise, and use this to inform and re-train our models. Having this human expertise is essential in guiding the process and in providing context to improve both the data and the data quality process. We are keen to free up clients from manual, mundane tasks so that they can instead apply their expertise to tricky cases, using simpler agree / disagree options.
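To make the human-in-the-loop idea concrete, here is a generic sketch of how such a workflow is often structured: confident model decisions are applied automatically, uncertain ones go to a reviewer, and the reviewer's agree / disagree answers are stored for re-training. It is an illustration under stated assumptions, not the Datactics implementation; the thresholds, field names and functions are hypothetical.

```python
def route_candidates(candidates, auto_accept=0.95, auto_reject=0.20):
    """Split model-scored candidates into automatic decisions and a review queue."""
    accepted, rejected, review = [], [], []
    for candidate in candidates:
        score = candidate["score"]
        if score >= auto_accept:
            accepted.append(candidate)   # confident match, no human needed
        elif score <= auto_reject:
            rejected.append(candidate)   # confident non-match
        else:
            review.append(candidate)     # tricky case, ask a human
    return accepted, rejected, review


def record_review(candidate, reviewer_agrees, training_log):
    """Store a reviewer's agree/disagree decision as a labelled example."""
    training_log.append({"pair": candidate["pair"], "is_match": reviewer_agrees})


# Hypothetical matching candidates with model scores, invented for illustration.
candidates = [
    {"pair": ("ACME Ltd", "ACME Limited"), "score": 0.98},
    {"pair": ("ACME Ltd", "Acme Holdings plc"), "score": 0.61},
    {"pair": ("ACME Ltd", "Zenith Bank"), "score": 0.03},
]
accepted, rejected, review = route_candidates(candidates)

training_log = []
for candidate in review:
    # In a real system this decision would come from a reviewer's click.
    record_review(candidate, reviewer_agrees=False, training_log=training_log)

print(len(accepted), len(rejected), len(review), training_log)
```

The design choice worth noting is that reviewers only see the middle band of scores, so their time goes to the cases where their expertise changes the outcome, and every decision they make becomes a new training example.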

To learn more about an AI-Driven Approach to Data Quality, download our whitepaper by Dr. Browne.