This blog from Fiona Browne, Head of Software Development & AI at Datactics, covers matching data across open datasets, work for which the firm secured Innovate UK funding.
The Rapid Match project addresses the complexity of integrating and matching data at scale, providing a platform for reproducible data pipelines for current and post-COVID analysis.
The project provides a generalised, easy-to-use, and reproducible framework for data quality, preparation, and matching, supporting the integration and merging of diverse datasets at scale.
We highlighted this capability through a use case on identifying financial risk across UK regions. Using the Datactics platform, data quality, preparation, and matching tasks were undertaken to integrate diverse UK Office for National Statistics (ONS) and UK Companies House (CH) datasets, providing a view of regional funding, sectors, and the impact of COVID.
COVID-19-related datasets are being generated at speed and volume, from governmental sources such as ONS and local authorities, through open data, to third-party datasets. Value is obtained from integrating these data to provide a view on a particular problem area, such as fraud detection. It is estimated that British banks have lent about £68 billion through a trio of loan programs, with repayments backstopped by the Government. Concerns have been raised about the risk of fraud, and one estimate found that defaults and fraud in the Bounce Back program for small businesses could reach 80% in the worst case.
Institutions and governments need rapid access to high-quality data to inform decision-making. That data must be accurate, complete, and obtained in a timely fashion, with value achieved by integrating sources generated at speed and volume. This is often a tricky and time-consuming process; furthermore, the processes used to perform it are often fragmented, ad hoc, non-systematic, brittle, and difficult to reproduce and maintain.
The Rapid Match project addressed the challenges around data quality and matching at scale through a systematic process which joins large amounts of messy, incomplete data in varying formats, from multiple sources. We provide a reliable ‘match engine’ allowing governments and organisations to accurately and securely integrate diverse sources of data.
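The match engine itself is part of the Datactics platform, but the general idea of fuzzy matching across messy sources can be illustrated with a minimal sketch using only the Python standard library. The record values, suffix list, and similarity threshold below are illustrative assumptions, not the platform's actual logic:

```python
import difflib

def normalise(name: str) -> str:
    """Crude normalisation: lowercase, strip punctuation and common suffixes."""
    name = name.lower()
    for token in (" limited", " ltd", " plc", ".", ","):
        name = name.replace(token, "")
    return " ".join(name.split())

def match(records_a, records_b, threshold=0.85):
    """Pair each record in A with its best-scoring candidate in B,
    keeping only pairs whose similarity clears the threshold."""
    matches = []
    for a in records_a:
        best, score = None, 0.0
        for b in records_b:
            s = difflib.SequenceMatcher(None, normalise(a), normalise(b)).ratio()
            if s > score:
                best, score = b, s
        if score >= threshold:
            matches.append((a, best, round(score, 2)))
    return matches

# Two "sources" recording the same company under different spellings.
pairs = match(["ACME Trading Ltd.", "Beta Foods PLC"],
              ["Acme Trading Limited", "Gamma Logistics Ltd"])
```

A production engine replaces the all-pairs comparison with blocking and indexing strategies so the process scales to millions of records, but the normalise-then-score shape of the pipeline is the same.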
A key outcome of the project has been the data quality work applied to the UK Companies House datasets. Companies House data supports a wide range of applications, from providing a register of incorporated UK companies to the KYC on-boarding and AML checks performed by institutions. It is estimated that “millions of professionals use Companies House data daily”, for example in due diligence to verify ultimate beneficial ownership, or in matching against financial crime and terrorism lists.
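To give a flavour of what data quality work on a company register involves, here is a minimal sketch of two common checks: standardising company numbers and flagging incomplete or malformed fields. The record shape and rules below are simplified assumptions for illustration; real Companies House extracts have many more fields and edge cases:

```python
import re

# Hypothetical register extract; field names are illustrative only.
records = [
    {"CompanyNumber": "ni012345", "CompanyName": "Example Widgets Ltd", "PostCode": "BT1 1AA"},
    {"CompanyNumber": "1234567",  "CompanyName": "", "PostCode": "SW1A 2AA"},
]

def clean(record):
    """Standardise a record in place and return a list of quality issues."""
    issues = []
    # Company numbers are 8 characters: zero-pad plain numerics,
    # upper-case any regional prefix (e.g. NI, SC).
    number = record["CompanyNumber"].upper()
    if number.isdigit():
        number = number.zfill(8)
    record["CompanyNumber"] = number
    if not record["CompanyName"].strip():
        issues.append("missing CompanyName")
    # Simplified UK postcode pattern, for illustration only.
    if not re.fullmatch(r"[A-Z]{1,2}\d{1,2}[A-Z]? \d[A-Z]{2}", record["PostCode"]):
        issues.append("malformed PostCode")
    return issues

reports = [clean(r) for r in records]
```

Checks like these, run systematically rather than ad hoc, are what make the downstream matching against other datasets reliable and reproducible.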
What to do next
If you are considering how to approach your data matching strategies and would like to view the work we carried out, please get in touch with Fiona Browne on LinkedIn.