Self-Service Data Quality for DataOps

At the recent A-Team Data Management Summit Virtual, Datactics CEO Stuart Harvey delivered a keynote on “Self-Service Data Quality for DataOps – Why it’s the next big thing in financial services.” The keynote (available here) can be read below, with slides from the keynote included for reference. Should you wish to discuss the subject with us, please don’t hesitate to contact Stuart, or Kieran Seaward, Head of Sales.

I started work in banking in the 1990s as a programmer, developing real-time software systems written in C++. In those good old days, I’d be given a specification, I’d write some code, then test and document it. After a few weeks it would be deployed on the trading floor. If my software broke or the requirements changed, it would come back to me and I’d start the process all over again. This ‘waterfall’ approach was slow and, if I’m honest, apart from the professional pride of not wanting to create buggy code, I didn’t feel a lot of ownership for what I’d created.

In the last five years a new methodology in software engineering has changed all that – it’s called DevOps, and brings a very strategic and agile approach to building new software.

More recently DevOps had a baby sister called DataOps, and it’s this subject that I’d like to talk about today.

Many Chief Data Officers (CDOs) and analysts have been impressed by the increased productivity and agility their Chief Technology Officer (CTO) colleagues are seeing through the use of DevOps. Now they’d like to get in on the act. In the last few months at Datactics we’ve been talking a lot to CDO clients about their desire to have a more agile approach to data governance and how DataOps fits into this picture.

In these conversations we’ve talked a great deal about the ownership of data. A key question is how to associate the measurement and fixing of a piece of broken data with the person most closely responsible for it. In our experience the owner of a piece of data usually makes the best data steward. These are the people who can positively affect business outcomes through accurate measurement and monitoring of data, work that typically falls within the CDO’s remit.

We have seen a strong desire to push data science processes, including data governance and the measurement of actual data quality (at a record level) into the processes and automation that exist in a bank.

I’d like to share with you some simple examples of what we are doing with our investment bank and wealth management clients. I hope they show that a self-service approach to data quality (with appropriate tooling) can empower highly agile data quality measurement for any company wishing to implement the standard DataOps processes of validation, sorting, aggregation, reporting and reconciliation.

Roles in DataOps and Data Quality

We work closely with the people who use the Datactics platform, the people who are responsible for the governance of data and reporting on its quality. They have titles like Chief Data Officer, Data Quality Manager, Chief Digital Officer and Head of Regulation. These data consumers are responsible for large volumes of often messy data relating to entities, counterparties, financial reference data and transactions. This data does not reside in just one place; it transitions through multiple bank processes. It is sometimes “at rest” in a data store and sometimes “in motion” as it passes via Extract, Transform, Load (ETL) processes to other systems that live downstream of the point at which it was sourced.

For example, a bank might download counterparty information from Companies House to populate its Legal Entity Master. This data is then published out to multiple consuming applications for Know Your Customer (KYC), Anti-Money Laundering (AML) and Life Cycle Management. In these systems the counterparty records are augmented with information such as a Legal Entity Identifier (LEI), a Bank Identifier Code (BIC) or a ticker symbol.

This ability to empower subject matter experts and business users who are not programmers to measure data at rest and in motion has led to the following trends:

Ownership: Data quality management moves from being the sole responsibility of a potentially remote data steward to all of those who are producing and changing data, encouraging a data driven culture.
Federation: Data quality becomes everyone’s job. Let’s think about end of day pricing at a bank. The team that owns the securities master will want to test the accuracy and completeness of data arriving from a vendor. The analyst working downstream, who takes an end of day price from the securities master to calculate a volume-weighted average price (VWAP), will have different checks relating to the timeliness of information. Finally, the data scientist further downstream, who uses the VWAP to create predictive analytics, will want to build their own rules to validate data quality.
Governance: A final trend that we are seeing is the tighter integration with standard governance tools. To be effective, self-service data quality and DataOps require tight integration with the existing systems that hold data dictionaries, metadata, and lineage information.
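To make the end of day pricing example under Federation a little more concrete, here is a minimal Python sketch of how each team might own its own check on the same data. It is illustrative only and is not how the Datactics platform expresses rules: the Trade record, its field names and the function names are all assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Trade:
    """Hypothetical trade record; the field names are illustrative only."""
    symbol: str
    price: Optional[float]
    volume: Optional[int]
    received_at: datetime


def is_complete(trade: Trade) -> bool:
    """Securities master team: price and volume must be present and positive."""
    return (
        trade.price is not None and trade.price > 0
        and trade.volume is not None and trade.volume > 0
    )


def is_timely(trade: Trade, as_of: datetime, max_age: timedelta) -> bool:
    """Analyst: the data must have arrived no longer than max_age before as_of."""
    return timedelta(0) <= as_of - trade.received_at <= max_age


def vwap(trades: List[Trade]) -> float:
    """Volume-weighted average price: sum(price * volume) / sum(volume).

    Expects trades that have already passed is_complete.
    """
    total_volume = sum(t.volume for t in trades)
    if total_volume == 0:
        raise ValueError("cannot compute VWAP with zero total volume")
    return sum(t.price * t.volume for t in trades) / total_volume
```

In this sketch the securities master team would filter on is_complete, the analyst would add is_timely before calling vwap, and a data scientist consuming the result could layer further rules of their own on top.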

Here’s an illustration of how we see the Datactics Self-Service Data Quality (SSDQ) Platform integrating with DataOps in a high-impact way that you might want to consider in your own data strategy.

1. Data Governance Team

First off, we offer a set of pre-built dashboards for PowerBI, Tableau and Qlik that give your data stewards rapid access to the data quality measurements that relate just to them. A user in the London office might be enabled to see data for Europe or, perhaps, just data in their department. Within just a few clicks a data steward for the Legal Entity Master system could identify all records that are in breach of an accuracy check where an LEI is incorrect, or a timeliness check where the LEI has not been revalidated in the Global LEI Foundation’s (GLEIF) database within the last 12 months.
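For readers who want a feel for what sits behind checks like these, here is a rough Python sketch of an LEI accuracy check and a timeliness check. It is not the Datactics implementation. The accuracy check uses the publicly documented LEI check digits (ISO 17442, which applies the ISO 7064 MOD 97-10 scheme), so it only confirms that the code is well formed, not that it exists in GLEIF. The function names are made up for illustration, and the 365-day window is an assumption standing in for “12 months”.

```python
from datetime import date, timedelta


def _to_number(text: str) -> int:
    """Map 0-9 to 0-9 and A-Z to 10-35, then read the result as one integer."""
    return int("".join(str(int(ch, 36)) for ch in text))


def lei_check_digits(prefix18: str) -> str:
    """Compute the two check digits for an 18-character LEI prefix."""
    return f"{98 - _to_number(prefix18.upper() + '00') % 97:02d}"


def lei_checksum_ok(lei: str) -> bool:
    """Accuracy: 20 ASCII alphanumerics whose numeric form is 1 modulo 97."""
    lei = lei.strip().upper()
    if len(lei) != 20 or not (lei.isascii() and lei.isalnum()):
        return False
    if not lei[18:].isdigit():
        return False
    return _to_number(lei) % 97 == 1


def revalidated_recently(last_validated: date, as_of: date,
                         max_age_days: int = 365) -> bool:
    """Timeliness: last revalidation falls within max_age_days of as_of.

    The 365-day default is an assumption approximating 12 months.
    """
    return timedelta(0) <= as_of - last_validated <= timedelta(days=max_age_days)
```

A record would breach the accuracy check when lei_checksum_ok returns False, and the timeliness check when revalidated_recently returns False for the date of its last revalidation.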

2. Data Quality Clinic: Data Remediation

Data Quality Clinic extends the management dashboard by allowing a bank to return broken data to its owner for fixing. It effectively quarantines broken records and passes them to the data engineer in a queue, improving data pipelines and overall data governance and data quality. Clinic runs in a web browser and is tightly integrated with information relating to data dictionaries, lineage and third-party sources for validation. Extending our LEI example from a moment ago, I might be the owner of a number of entities which have failed an LEI check. Clinic would show me the records in question and highlight the fields in error. It would connect to GLEIF as the source of truth for LEIs and provide me with hints on what to correct. As you’d expect, this entity resolution process can be enhanced by Machine Learning to automate it under human supervision.
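The quarantine-and-return idea can be pictured with a short Python sketch. This is a simplified illustration and not how Data Quality Clinic works internally: the record layout, the "owner" field and the triage function are all hypothetical.

```python
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, List, Tuple

Record = Dict[str, str]
# A rule pairs the field it checks with a function returning True on a pass.
Rule = Tuple[str, Callable[[Record], bool]]
Failure = Tuple[Record, List[str]]


def triage(records: List[Record],
           rules: List[Rule]) -> Tuple[List[Record], Dict[str, Deque[Failure]]]:
    """Split records into clean ones and per-owner queues of broken ones.

    Each queued item carries the names of the fields that failed, so the
    owner can see what needs fixing.
    """
    clean: List[Record] = []
    queues: Dict[str, Deque[Failure]] = defaultdict(deque)
    for record in records:
        failed_fields = [field for field, check in rules if not check(record)]
        if failed_fields:
            owner = record.get("owner", "unassigned")
            queues[owner].append((record, failed_fields))
        else:
            clean.append(record)
    return clean, dict(queues)
```

Clean records carry on through the pipeline, while each owner works through their own queue of quarantined records and the fields flagged on each.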

3. FlowDesigner Studio: Rule creation, documentation, sharing

FlowDesigner is the rules studio in which the data governance team of super users build, manage, document and source-control rules for the profiling, cleansing and matching of enterprise data. We like to share these rules across our clients so FlowDesigner comes pre-loaded with rules for everything from name and address checking to CUSIP or ISIN validation.
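As an indication of what an identifier validation rule involves, here is a minimal Python sketch of the public check-digit algorithms for ISIN and CUSIP. It is not the rule set shipped with FlowDesigner, and the function names are invented for this example. A passing check digit shows only that an identifier is well formed, not that the security exists.

```python
def _luhn_ok(digits: str) -> bool:
    """Luhn check: double every second digit from the right, sum the digits."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def isin_ok(isin: str) -> bool:
    """ISIN: 2-letter prefix, 9 alphanumerics, 1 check digit (Luhn)."""
    isin = isin.strip().upper()
    if len(isin) != 12 or not (isin.isascii() and isin.isalnum()):
        return False
    if not isin[:2].isalpha() or not isin[-1].isdigit():
        return False
    # Letters expand to two digits (A=10 ... Z=35) before the Luhn check.
    expanded = "".join(str(int(ch, 36)) for ch in isin)
    return _luhn_ok(expanded)


def cusip_ok(cusip: str) -> bool:
    """CUSIP: 8 characters plus a modulus-10 "double add double" check digit."""
    cusip = cusip.strip().upper()
    if len(cusip) != 9 or not cusip.isascii() or not cusip[-1].isdigit():
        return False
    special = {"*": 36, "@": 37, "#": 38}
    total = 0
    for i, ch in enumerate(cusip[:8]):
        if ch.isdigit():
            value = int(ch)
        elif "A" <= ch <= "Z":
            value = ord(ch) - ord("A") + 10
        elif ch in special:
            value = special[ch]
        else:
            return False
        if i % 2 == 1:  # every second character is doubled
            value *= 2
        total += value // 10 + value % 10
    return (10 - total % 10) % 10 == int(cusip[-1])
```

For example, isin_ok("US0378331005") and cusip_ok("037833100") both return True, while changing the final digit of either identifier makes the check fail.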

4. Data Quality Manager: Connecting to data sources; scheduling, automating solutions

This part of the Datactics platform allows your technology team to connect to data flowing from multiple sources and to schedule how rules are applied to data at rest and in motion. It allows for the sharing and re-use of rules across all parts of your business. We have many clients solving big data problems involving hundreds of millions of records using Data Quality Manager across multiple environments and data sources, on-premises or in public (or more typically private) cloud.

Summary: Self-Service Data Quality for DataOps

Thanks for joining me today as I’ve outlined how self-service data quality is a key part of successful DataOps. CDOs need real-time data quality insights to keep up with business needs while technical architects require a platform that doesn’t need a huge programming team to support it. If you have any questions about this topic, or how we’ve approached it, then we’d be glad to talk with you. Please get in touch below.

