How Data Quality Tools Deliver Clean Data for AI and ML

In her previous blog, Dr Fiona Browne, Head of AI and Software Development, assessed the need for the AI and Machine Learning world to prioritise the data being fed into models and algorithms (you can read it here). This blog looks at some of the critical capabilities data quality tools need in order to support specific AI and ML use cases with clean data.


A Broad Range of Data Quality Tool Features On Offer

The data quality tools market is full of vendors with a wide range of capabilities, as referenced in the recent Gartner Magic Quadrant. Regardless of a firm's data volumes, or whether it is a small, midsize or large enterprise, it will rely on high-quality data for every conceivable business use case, from the smallest product data problem to enterprise master data management. Consequently, data leaders should explore the competitive landscape fully to find the best fit for their data governance culture and the growth opportunities that the right vendor-client fit can offer.

Labelling Datasets

A supervised Machine Learning (ML) model learns from a training dataset consisting of features and labels.

We do not often hear about the effort required to produce a consistent, well-labelled dataset, yet this has a direct impact on the quality of a model and its predictive performance, regardless of organisation size. A recent Google research report estimates that data labelling can account for between 25% and 60% of an ML project's total budget.

Labelling is often a manual process requiring a reviewer to assign a tag to a piece of data, e.g. to identify a car in an image, state whether a case is fraudulent, or assign sentiment to a piece of text.

Succinct, well-defined labelling instructions should be provided to reduce labelling inconsistencies. Data quality solutions can be applied in this context through metrics that measure label consistency within a dataset, which can then be used to review and improve consistency scores.
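To make this concrete, here is a minimal sketch of one such consistency metric, inter-annotator agreement measured with Cohen's kappa. The labels, reviewers and threshold below are purely illustrative, not the specific metrics used by any particular tool.

```python
# Illustrative sketch: measuring label consistency between two reviewers
# with Cohen's kappa. The labels below are invented for demonstration.
from sklearn.metrics import cohen_kappa_score

reviewer_a = ["fraud", "ok", "ok", "fraud", "ok", "fraud", "ok", "ok"]
reviewer_b = ["fraud", "ok", "fraud", "fraud", "ok", "ok", "ok", "ok"]

kappa = cohen_kappa_score(reviewer_a, reviewer_b)
print(f"Inter-annotator agreement (Cohen's kappa): {kappa:.2f}")

# A low score flags where labelling instructions may need to be
# clarified and the labels reviewed before training a model.
if kappa < 0.6:  # illustrative threshold, not a universal standard
    print("Agreement is low - review the labelling guidelines.")
```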

As labelling is a laborious process, and access to the resources needed to provide labels can be limited, we can reduce the volume of manual labelling using an active learning approach.

Here, ML is used to identify the trickiest edge cases within a dataset to label. These prioritised cases are passed to a reviewer to annotate manually, without the need to label the complete dataset. This approach also captures the rationale from a human expert as to why a label was provided, which adds transparency to predictions further downstream.
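The sketch below illustrates one common active learning strategy, uncertainty sampling, using scikit-learn on synthetic data. It is a simplified illustration of the general idea rather than a description of any specific product.

```python
# Illustrative uncertainty sampling: ask a human to label only the cases
# the current model is least sure about. Data is synthetic.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
labelled = np.arange(50)          # a small seed set that already has labels
unlabelled = np.arange(50, 1000)  # the pool still awaiting labels

model = LogisticRegression(max_iter=1000)
model.fit(X[labelled], y[labelled])

# Uncertainty = how close the predicted probability sits to 0.5
proba = model.predict_proba(X[unlabelled])[:, 1]
uncertainty = 1 - np.abs(proba - 0.5) * 2

# Send the 10 trickiest cases to a human reviewer, who also records a
# short rationale for each label to aid transparency downstream.
to_review = unlabelled[np.argsort(uncertainty)[-10:]]
print("Indices prioritised for manual labelling:", to_review)
```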

Entity resolution

For data matching and entity resolution, Datactics has used ML as a ‘decision aid’ for low-confidence matches, again to reduce the burden of manual review. The approach implemented by Datactics provides information ranging from the confidence of the predictions through to the rationale as to why a prediction was made. Additionally, the solution has the built-in capability to accept or reject predictions, so the client can continually update and improve them, using that fully-explainable, human-in-the-loop approach. You can see more information on this in our White Paper here.
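As a highly simplified sketch of the general pattern (not the Datactics implementation itself), a classifier scores candidate record pairs, high-confidence decisions pass straight through, and low-confidence pairs are routed to a reviewer whose accept/reject decisions feed back into the model. The thresholds, names and scores below are illustrative.

```python
# Simplified sketch of ML as a 'decision aid' for low-confidence matches.
# Thresholds, features and data are illustrative only.
from dataclasses import dataclass

@dataclass
class CandidatePair:
    record_a: str
    record_b: str
    score: float  # match confidence from a trained model, 0..1

AUTO_ACCEPT = 0.95   # illustrative thresholds
AUTO_REJECT = 0.20

def triage(pairs):
    """Route each candidate pair: auto-decide or send to a human."""
    for p in pairs:
        if p.score >= AUTO_ACCEPT:
            yield p, "match"
        elif p.score <= AUTO_REJECT:
            yield p, "non-match"
        else:
            # Low confidence: a reviewer decides, and their accept/reject
            # decision is captured to retrain and improve the model.
            yield p, "manual review"

pairs = [CandidatePair("ACME Ltd", "ACME Limited", 0.97),
         CandidatePair("ACME Ltd", "ACNE Ltd", 0.55),
         CandidatePair("ACME Ltd", "Beta Corp", 0.05)]

for pair, decision in triage(pairs):
    print(pair.record_a, "|", pair.record_b, "->", decision)
```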

Detecting outliers and predicting rules

This is a critical step in a fully AI-augmented data quality journey, occurring in the key data profiling stage, before data cleansing. It empowers business users, who are perhaps not familiar with big data techniques, coding or programming, to rapidly get to grips with the data they are exploring. Using ML in this way helps them to uncover relationships, dependencies and patterns which can influence which data quality rules they wish to use to improve data quality or deliver better business outcomes, for example regulatory reporting or digital transformation.

This automated approach to identifying potentially erroneous data within your dataset, and highlighting it within the context of data profiling, reduces the manual effort spent trying to find these connections across different data sources or within an individual dataset. It removes much of the heavy lifting associated with data profiling, especially when complex data integration or connectivity to data lakes or data stores is required.
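As one illustrative approach among many, an unsupervised method such as scikit-learn's IsolationForest can surface potentially erroneous values during profiling without the business user writing any code. The column names and data below are invented.

```python
# Illustrative outlier detection during data profiling with an
# unsupervised model. Data and column names are invented.
import pandas as pd
from sklearn.ensemble import IsolationForest

df = pd.DataFrame({
    "trade_amount": [100.0, 102.5, 98.7, 101.2, 9_999_999.0, 99.8],
    "settlement_days": [2, 2, 3, 2, 2, 250],
})

model = IsolationForest(contamination=0.2, random_state=0)
df["outlier"] = model.fit_predict(df[["trade_amount", "settlement_days"]]) == -1  # -1 marks an anomaly

# Flagged rows are surfaced in the profiling view for a business user
# to inspect alongside the rest of the profile.
print(df[df["outlier"]])
```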

The rule prediction element complements the outlier detection. It involves reviewing a dataset and suggesting data quality rules that can be run against it to ensure compliance with regulations and with standard dimensions of data quality (e.g. consistency, accuracy, timeliness), as well as with business dimensions or policies such as credit ratings or risk appetite.
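The sketch below illustrates the idea in its simplest form: profile a column and suggest candidate rules (completeness, range, format) based on what the data itself exhibits. The heuristics, thresholds and example ID pattern are illustrative, not those of any particular product.

```python
# Simplified rule suggestion from a column profile. The heuristics and
# thresholds here are illustrative only.
import pandas as pd

def suggest_rules(series: pd.Series) -> list[str]:
    rules = []
    if series.notna().mean() >= 0.99:
        rules.append("Completeness: column must not be null")
    if pd.api.types.is_numeric_dtype(series):
        lo, hi = series.min(), series.max()
        rules.append(f"Range check: value between {lo} and {hi}")
    else:
        # If almost every value matches a simple pattern, propose it
        pattern = r"^[A-Z]{2}\d{6}$"  # an illustrative ID format
        if series.dropna().str.match(pattern).mean() > 0.95:
            rules.append(f"Format check: value matches {pattern}")
    return rules

ids = pd.Series(["GB123456", "GB654321", "GB111111", None])
print(suggest_rules(ids))
```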

Fixing data quality breaks

Again, ML helps in this area, where the focus is on the manual tasks involved in remediating erroneous or broken data. Can we detect trends in this data, for example a finance dataset ingested on the first day of the month that causes a spike in data quality issues? Is there an optimal path to remediation that we can predict, or remediation values that we can suggest?
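As a minimal sketch of trend detection on break volumes (the data and cut-off are invented), daily break counts can be aggregated and days that sit well above the norm flagged, such as the first of the month when a new finance feed lands.

```python
# Illustrative trend detection on data quality breaks: flag days whose
# break count sits well above the typical level. Data is invented.
import pandas as pd

breaks = pd.DataFrame({
    "date": pd.to_datetime(
        ["2024-05-30", "2024-05-31",
         "2024-06-01", "2024-06-01", "2024-06-01", "2024-06-01", "2024-06-01",
         "2024-06-02", "2024-06-03"]),
    "rule": ["R1", "R2", "R1", "R1", "R2", "R3", "R3", "R1", "R2"],
})

daily = breaks.groupby("date").size()
threshold = 2 * daily.median()   # crude, illustrative cut-off

spikes = daily[daily > threshold]
print("Days with unusually high break volumes:")
print(spikes)   # here the first of the month stands out
```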

For fixing breaks, we have seen the use of rewards for the best-performing teams, which builds the value of the work. This gamification approach can support business goals through optimal resolution of the key issues that matter to the business, rather than simply trying to fix everything that is wrong all at once.

Data Quality for Explainability & Bias

We hear a lot about the deployment of ML models and the societal issues around model bias and fairness. Applications of models can have a direct, potentially negative impact on people, and it stands to reason that everyone involved in the creation, development, deployment and evaluation of these models should take an active role in preventing such negative impacts from arising.

Having diverse, representative teams building these systems is important. For example, a diverse team could have ensured that Google’s speech recognition software was trained on a diverse selection of voices. In 2016, Rachael Tatman, a research fellow in linguistics at the University of Washington, found that Google’s speech-recognition software was 70% more likely to accurately recognise male speech.

Focusing on the data quality of the data that feeds our models can help identify areas of potential bias and unfairness. Interestingly, bias isn’t necessarily a bad thing. Models need bias in the data in order to discriminate between outcomes, e.g. having a history of a disease results in a higher risk of having that disease again.

The bias we want to be able to detect is unintended bias and, accordingly, unintended outcomes (and of course, intentional bias created by bad actors), for example by using techniques to identify potential proxy features such as post or ZIP code, even when explicitly discriminatory variables such as race have been removed. IBM’s AI Fairness 360 toolkit suggests metrics to run against datasets to highlight potential bias, e.g. using protected attributes such as race or gender and running metrics against the decisions made by the classifier. From this identification, different approaches can be taken to address the issues, from rebalancing a dataset, through penalising bias within the algorithm, to post-processing that adjusts outcomes in favour of a particular group.
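Here is a minimal sketch of one dataset-level fairness metric of the kind such toolkits provide, disparate impact: the ratio of favourable-outcome rates between groups. The groups, decisions and data are invented; the 0.8 cut-off is a common rule of thumb rather than a legal test.

```python
# Minimal sketch of disparate impact: the ratio of favourable outcomes
# between an unprivileged and a privileged group. Data is invented.
import pandas as pd

df = pd.DataFrame({
    "group":    ["A", "A", "A", "A", "B", "B", "B", "B"],
    "approved": [1,    1,   1,   0,   1,   0,   0,   0],
})

rate = df.groupby("group")["approved"].mean()
disparate_impact = rate["B"] / rate["A"]
print(f"Approval rate by group:\n{rate}")
print(f"Disparate impact: {disparate_impact:.2f}")

# A common rule of thumb treats a ratio below 0.8 as a warning sign
# that the classifier's decisions may be unfairly skewed.
if disparate_impact < 0.8:
    print("Potential bias detected - investigate before deployment.")
```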

Explainable AI (XAI)’s Role In Detecting Bias

XAI is a nascent field in which ML is used to explain the predictions made by a classifier. For instance, LIME (Local Interpretable Model-agnostic Explanations) provides a measure of ‘feature importance’. So if we find that postcode, which correlates with race, is a key driver of a prediction, this could highlight discriminatory behaviour within the model.

These approaches explain the local behaviour of a model by fitting an interpretable model, such as a decision tree or linear regression, around an individual prediction. Again, the type of explanation will differ depending on the audience; for example, different processes may be needed to provide an explanation at an internal or data scientist level compared to an external client or customer level. Explanations could be extended by providing reason and action codes as to why credit was refused.
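A minimal sketch of the idea follows, using the open-source lime package alongside scikit-learn (both assumed to be installed). The model, feature names and data are synthetic; in practice the raw weights would be translated into audience-appropriate reason codes.

```python
# Minimal sketch of local feature importance with LIME on a synthetic
# credit-style dataset. Features, model and data are illustrative.
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

feature_names = ["income", "postcode_region", "loan_amount", "age"]
X, y = make_classification(n_samples=500, n_features=4, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X, y)

explainer = LimeTabularExplainer(
    X, feature_names=feature_names,
    class_names=["refused", "approved"], mode="classification")

# Explain one individual prediction
explanation = explainer.explain_instance(X[0], model.predict_proba,
                                         num_features=4)
for feature, weight in explanation.as_list():
    print(f"{feature}: {weight:+.3f}")

# If a proxy feature such as postcode_region dominates the explanation,
# that is a prompt to investigate potential discriminatory behaviour.
```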

Transparency can also be provided through model cards, a structured framework for reporting on ML model provenance, usage, and ethics-informed evaluation, giving a detailed overview of a model’s suggested uses and limitations. This can be extended to the data side to contain metadata such as data provenance, the consent sought, and so on.
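As an illustrative sketch only (the fields and values are examples, not a mandated schema), a model card and its accompanying data provenance can be captured as structured metadata stored alongside the model.

```python
# Illustrative model card captured as structured metadata. The fields
# and values shown are hypothetical examples, not a mandated schema.
import json

model_card = {
    "model": {
        "name": "entity-match-classifier",   # hypothetical model name
        "version": "1.2.0",
        "intended_use": "Decision aid for low-confidence record matches",
        "limitations": "Not evaluated on non-Latin character sets",
    },
    "evaluation": {
        "metrics": {"precision": 0.94, "recall": 0.89},  # illustrative
        "fairness_checks": ["disparate impact by region"],
    },
    "data": {
        "provenance": "Internal CRM extract, 2024-01 snapshot",
        "consent": "Collected under customer terms of service",
    },
}

print(json.dumps(model_card, indent=2))
```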

That being said, there is no single ‘silver-bullet’ approach to address these issues. Instead we need to use a combination of approaches and to test often.  

Where to next – Machine Learning Ops (MLOps)

These days, the ‘-ops’ suffix is often appended to business practices right across the enterprise, from DevOps to PeopleOps, reflecting a systematic approach to how a function behaves and is designed to perform.

In Machine Learning, that same systematic approach, providing transparency and auditability, helps to move the business from brittle data pipelines to a proactive data approach that embeds human expertise.

Such an approach would identify issues within a process rather than relying on an engineer spotting an issue by chance or through individual expertise, which of course does not scale and is not robust. This system-wide approach embeds governance, security, risk and ownership at all levels. It does require the integration of expertise; for example, model developers gain an understanding of risk from knowledge transferred by risk officers and subject matter experts.

We need a maturing of the MLOps approach to support these processes. This is essential for a high-quality, consistent flow of data throughout all stages of a project, and to ensure that the process is repeatable and systematic.

It also necessitates monitoring the performance of the model once in production, to take into account potential data drift or concept drift, and to address this as and when it is identified. Testing for bias, robustness and adversarial attacks is still at a nascent stage, but this only highlights the importance of adopting an MLOps approach now, rather than waiting until these capabilities are fully developed.
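As a minimal sketch of production monitoring for data drift (the feature, data and significance threshold are illustrative), a two-sample statistical test can compare a feature's live distribution against its training distribution and raise an alert when they diverge.

```python
# Minimal sketch of data drift monitoring: compare a feature's live
# distribution to its training distribution with a two-sample KS test.
# Data and the alert threshold are illustrative.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
training_values = rng.normal(loc=50, scale=10, size=5000)  # at training time
live_values = rng.normal(loc=58, scale=10, size=1000)      # in production

statistic, p_value = ks_2samp(training_values, live_values)
print(f"KS statistic: {statistic:.3f}, p-value: {p_value:.3g}")

if p_value < 0.01:   # illustrative threshold
    print("Data drift detected - investigate and consider retraining.")
```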

In practical terms, groups such as the Bank of England’s AI Public-Private Forum have significant potential to help the public and private sectors better understand the key issues, clarify the priorities and determine what actions are needed to support the safe adoption of AI in financial services.

Get ADQ 1.4 today!

With Snowflake connectivity, SQL rule wizard and the ability to bulk assign data quality breaks, ADQ 1.4 makes end-to-end data quality management even easier.