Identifying outliers and errors in data is an important but time-consuming task. Depending on the context and domain, errors can be impactful in a variety of ways, some very severe. One of the issues with detecting outliers and errors is that they come in many different forms. There are syntactic errors, where a value like a date or time is in the wrong format, and semantic errors, where a value is in the correct format but doesn’t make sense in the context of the data, like an age of 500. The biggest problem with creating a method for detecting outliers in dataset is how to identify a vast range of different errors with the one tool.
At Datactics, we’ve been working on a tool to solve some of these problems and enable errors and outliers to be quickly identified with minimal user input. With this project, our goal is to assign a number to each value in a dataset which represents the likelihood that the value is an outlier. To do this we use a number of different features of the data, which range from quite simple methods like looking at the frequency of a value or its length compared to others in its column, to more complex methods using n-grams and co-occurrence statistics. Once we have used these features to get a numerical representation of each value, we can then use some simple statistical tests to find the outliers.
When profiling a dataset, there are a few simple things you can do to find errors and outliers in the data. A good place to start could be to look at the least frequent values in a column or the shortest and longest values. These will highlight some of the most obvious errors but what then? If you are profiling numeric or time data, you could rank the data and look at both ends of the spectrum to see if there are any other obvious outliers. But what about text data or unique values that can’t be profiled using frequency analysis? If you want to identify semantic errors, this profiling would need to be done by a domain expert. Another factor to consider is the fact that this must all be done manually. It is evident that there are a number of aspects of the outlier detection process that limit both its convenience and practicality. These are some of the things we have tried to address with this project.
When designing this tool, our objective was to create a simple, effective, universal approach to outlier detection. There are a large number of statistical methods for outlier detection that, in some cases, have existed for hundreds of years. These are all based on identifying numerical outliers, which would be useful in some of the cases listed above but has obvious limitations. Our solution to this is to create a numerical representation of every value in the data set that can be used with a straightforward statistical method. We do this using features of the data. The features currently implemented and available for use are:
- Character N-Grams
- Co-Occurrence Statistics
- Date Value
- Numeric Value
- Symbolic N-Grams
- Text Similarities
- Time Value
We are also working on creating a feature of the data to enable us to identify outliers in time series data. Some of these features, such as date and numeric value are only applicable on certain types of data. Some incorporate the very simple steps discussed above, like occurrence and length analysis. Others are more complicated and could not be done manually, like co-occurrence statistics. Then there are some, like the natural language processing text similarities, which make use of machine learning algorithms. While there will be some overlap in the outliers identified by these features, on the most part, they will all single out different errors and outliers, acting as an antidote to the heterogenous nature of errors discussed above.
One of the benefits of this method of outlier detection is its simplicity which leads to very explainable results. Once features of our dataset have been generated, we have a number of options in terms of next steps. In theory, all of these features could be fed into a machine learning model which could then be used to label data as outlier and non-outlier. However, there are a number of disadvantages to this approach. Firstly, this would require a labelled dataset to train the model with, which would be time-consuming to create. Moreover, the features will differ from dataset to dataset so it would not be a case of “one model fits all”. Finally, if you are using a “black box” machine learning method when a value is labelled as an outlier, you have no way of explaining this decision or evidence as to why this value has been labelled as opposed to others in the dataset.
All three of these problems are avoidable using the Datactics approach. The outliers are generated using only the features of the original dataset and, because of the statistical methods being used, can be identified with nothing but the data itself and a confidence level (a numerical value representing the likelihood that a value is an outlier). There is no need for any labelling or parameter-tuning with this approach. The other big advantage is, that due to the fact we assign a number to every value, we have evidence to back-up every outlier identified and are able to demonstrate how they differ from other none-outliers in the data.
Another benefit of this approach is that it is modular and therefore completely expandable. The features the outliers are based on can be selected based on the data being profiled which increases accuracy. Using this architecture also give us the ability to seamlessly expand the number of features available to be used and if trends or common errors are encounter that aren’t identified using the current features, it is very straightforward to create another feature to rectify this.