Dataset Labelling For Entity Resolution & Beyond with Dr Fiona Browne

In late 2019 our Head of AI, Dr Fiona Browne, delivered a series of talks to the Enterprise Data Management Council on AI-Enabled Data Quality in the context of AML operations, specifically for resolving differences in dataset labelling for legal entity data.

In this blog post, Fiona goes under the hood to explain some of the techniques that underpin Datactics’ extensible AI Framework.

Across the financial sector, Artificial Intelligence (AI) and Machine Learning (ML) have been applied to a number of areas, from the profiling of behaviour for fraud detection and Anti-Money Laundering (AML) through to the use of natural language processing to enrich data in Know-Your-Customer (KYC) processes.

An important part of the KYC/AML process is entity resolution: identifying records across multiple data sources that refer to the same real-world entity. This is traditionally the space in which high-performance matching engines have been deployed, with associated fuzzy-match capabilities used to account for differences between records, whether trivial or significant (indeed, this is part of Datactics’ existing self-service platform).
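To make the fuzzy-matching idea concrete, here is a minimal sketch using only Python’s standard library; the records, threshold and function name are illustrative assumptions, not the Datactics matching engine.

```python
# Minimal fuzzy-match sketch (illustrative only; names and threshold are assumptions).
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return a 0-1 similarity ratio between two entity names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

record_a = "Acme Holdings Ltd"
record_b = "ACME Holdings Limited"

score = similarity(record_a, record_b)
print(f"similarity = {score:.2f}")   # close to 1.0 despite spelling differences
if score >= 0.85:                    # assumed match threshold
    print("Candidate match: route for review or auto-link")
```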

In this arena, Machine Learning (ML) techniques have been applied to the task of entity resolution using approaches ranging from graph and network analysis to probabilistic matching.

Although ML is a sophisticated approach for democratizing entity resolution, a key limitation of supervised ML is its requirement for large volumes of labelled data for the model to learn from.

What is Supervised ML? 

For supervised ML, a classifier is trained using a labelled dataset: a dataset containing example inputs paired with their correct output labels. In the case of entity resolution, this means examples of matches and non-matches that have been correctly labelled. The machine learning algorithm learns from these examples, identifying patterns that link inputs to specific outcomes. The trained classifier then uses this learning to make predictions on new, unseen cases based on their input values.
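As a minimal sketch of this idea, the example below trains a scikit-learn classifier on a handful of hand-labelled candidate pairs; the similarity features and data are illustrative assumptions rather than a production setup.

```python
# Supervised-ML sketch: each row is a candidate record pair described by
# similarity features; the label marks it as a match (1) or non-match (0).
# Features and values below are assumptions for illustration.
from sklearn.ensemble import RandomForestClassifier

X_train = [
    # [name_similarity, address_similarity, id_exact_match]
    [0.95, 0.90, 1],   # labelled match
    [0.91, 0.40, 1],   # labelled match
    [0.30, 0.20, 0],   # labelled non-match
    [0.55, 0.10, 0],   # labelled non-match
]
y_train = [1, 1, 0, 0]

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)              # learn patterns from the labelled examples

new_pair = [[0.88, 0.75, 0]]           # new, unseen candidate pair
print(clf.predict(new_pair))           # predicted label: match or non-match
print(clf.predict_proba(new_pair))     # prediction confidence
```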

Dataset Labelling

As we see from above, supervised ML needs high-quality labelled examples for the classifier to learn from; unlabelled or poorly labelled data only makes it harder for data labelling tools to work effectively. Labelling raw data from scratch can be time-consuming and labour-intensive, especially if experts are required to provide labels for, in this example, entity resolution outputs. The process is repetitive in nature, and consistency is needed to ensure that high-quality, correct labels are applied. It is also costly in monetary terms: those processing the entity data need a high level of understanding of entities and ultimate beneficial owners, and the cost of failure is high, since mistakes can lead to regulatory sanctions and fines.

Approaches for Dataset Labelling

As AI/ML progresses across all sectors, we have seen the rise of industrial-scale dataset labelling, where companies and individuals can outsource their labelling tasks to annotation tools and labelling services. One example is Amazon Mechanical Turk, which enables data labelling to be crowdsourced and can reduce labelling work from months to hours. Machine Learning models can also be harnessed for data annotation tasks using approaches such as weak and semi-supervised learning, along with Human-In-The-Loop (HITL) learning. HITL improves ML models by incorporating human feedback at stages such as training, testing and evaluation.

ML approaches for Budgeted Learning

We can think of budgeted learning as a balancing act between the expense (in terms of cost, effort and time) of acquiring training data and the predictive performance of the model you are building. For example, can we label a few hundred examples instead of hundreds of thousands? A number of ML approaches can help answer this question and reduce the burden of manually labelling large volumes of training data. These include transfer learning, where previously gained knowledge is reused; for instance, leveraging existing labelled data from a related sector or a similar task. The recent open-source system Snorkel uses a form of weak supervision to label datasets via programmatic labelling functions.
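The sketch below illustrates the labelling-function idea behind weak supervision. It is a simplified, hand-rolled version that does not use the Snorkel API, and the rules, fields and data are assumptions.

```python
# Weak supervision sketch: simple programmatic labelling functions vote on
# whether a candidate pair is a match, and the votes are combined.
# (Snorkel combines votes with a generative label model; a majority vote
# is used here purely for illustration.)
ABSTAIN, NON_MATCH, MATCH = -1, 0, 1

def lf_same_registration_number(pair):
    """Vote MATCH if both records share a company registration number."""
    if pair["reg_no_a"] and pair["reg_no_a"] == pair["reg_no_b"]:
        return MATCH
    return ABSTAIN

def lf_different_country(pair):
    """Vote NON_MATCH if the registered countries differ."""
    if pair["country_a"] != pair["country_b"]:
        return NON_MATCH
    return ABSTAIN

def weak_label(pair, lfs):
    """Combine labelling-function votes by simple majority."""
    votes = [lf(pair) for lf in lfs if lf(pair) != ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)

pair = {"reg_no_a": "NI123456", "reg_no_b": "NI123456",
        "country_a": "GB", "country_b": "GB"}
print(weak_label(pair, [lf_same_registration_number, lf_different_country]))  # 1 (MATCH)
```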

Active learning is a semi-supervised ML approach that can also reduce the burden of manually labelling datasets. The ‘active learner’ proactively selects the training data it needs to learn from, based on the idea that an ML model can achieve good predictive performance with fewer training instances by prioritising the examples it learns from. During training, the active learner poses queries, typically a selection of unlabelled instances from a dataset, and these ML-selected instances are then presented to an expert to label manually.
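A minimal uncertainty-sampling sketch of this loop is shown below, assuming a small seed of labelled pairs and a pool of unlabelled ones; the features, data and batch size are illustrative assumptions.

```python
# Active-learning sketch: the model asks a human to label only the unlabelled
# pairs it is least certain about, then retrains on the enlarged labelled set.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Small seed of labelled pairs: [name_similarity, address_similarity] -> match?
X_labelled = np.array([[0.95, 0.90], [0.92, 0.80], [0.30, 0.25], [0.40, 0.10]])
y_labelled = np.array([1, 1, 0, 0])

# Pool of unlabelled candidate pairs
X_pool = np.array([[0.60, 0.55], [0.10, 0.05], [0.97, 0.93], [0.52, 0.48]])

model = LogisticRegression().fit(X_labelled, y_labelled)

# Uncertainty = how close the predicted match probability is to 0.5
probs = model.predict_proba(X_pool)[:, 1]
uncertainty = np.abs(probs - 0.5)
query_idx = np.argsort(uncertainty)[:2]   # the two most uncertain pairs

print("Ask the expert to label pool items:", query_idx)
# After the expert labels them, add those examples to X_labelled / y_labelled,
# retrain, and repeat until the labelling budget is spent.
```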

As we have seen, there are wide and varied approaches to the task of dataset labelling. Which approach to select depends on a number of factors, from the prediction task through to expense and labelling budget. The connecting tenet is ensuring high-quality labelled datasets for classifiers to learn from.

Click here for more from Datactics, or find us on LinkedIn, Twitter or Facebook for the latest news.

Get ADQ 1.4 today!

With Snowflake connectivity, SQL rule wizard and the ability to bulk assign data quality breaks, ADQ 1.4 makes end-to-end data quality management even easier.