Garbage training data in, garbage model out. Here are four things to address to solve data quality problems.
Machine learning (ML) and other forms of artificial intelligence are evolving quickly and creating a powerful array of valuable new business processes. So far, most experimentation has been geared to finding specific solutions to specific problems. However, data quality challenges are likely to become increasingly important. As the old saying “garbage in, garbage out” suggests, the nature of the input data can strongly influence the results that come from these systems.
Data quality has always been an issue in database and data collection systems. Transactional databases have established procedures for data quality assurance, but ML raises a new range of concerns. The types of data errors and their potential consequences differ from those experienced with transaction-based systems. The use of very large data sources, streaming data, complex data, and unstructured data adds to quality issues, and modeling and training raise new concerns of their own.
Data with a Difference
ML uses very large data sets both to train its models and in practice when the models are run. This data can be subject to systemic bias that can create serious accuracy problems and potentially violate laws and social norms. Biases may not be immediately apparent, particularly when models use training data that is not obviously suspect. The algorithms, the data, and the results are conditioned by the definition of the problem and its solution. For example, if the data includes only male respondents, the model's results can be trusted only for males. The same holds for minorities and other significant differentiating characteristics that may be embedded in the data.
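One way to make the male-respondents example concrete is to compare each group's share of the training data against the share expected in the target population. The sketch below is a minimal illustration, not a complete bias audit. The function names, the `sex` attribute, the 50/50 expected shares, and the 10-point tolerance are all assumptions chosen for the example.

```python
from collections import Counter

def group_shares(records, attribute):
    """Return each group's share of the records for one attribute."""
    counts = Counter(r.get(attribute, "<missing>") for r in records)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}

def underrepresented(records, attribute, expected, tolerance=0.10):
    """List groups whose observed share falls short of the expected
    share by more than `tolerance` (an assumed threshold)."""
    shares = group_shares(records, attribute)
    return sorted(
        group
        for group, target in expected.items()
        if shares.get(group, 0.0) < target - tolerance
    )

# Hypothetical sample: nine male respondents and one female respondent.
sample = [{"sex": "male"}] * 9 + [{"sex": "female"}]
flagged = underrepresented(sample, "sex", {"male": 0.5, "female": 0.5})
print(flagged)
```

A check like this only catches imbalance in attributes that are recorded. Bias carried by proxy variables or by the problem definition itself still needs human review.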
The problem of bias is well recognized in ML circles, but it is only the tip of the iceberg. In ML, models and data quality are intrinsically linked through the use of training data. Algorithms may be viewed as a kind of scientific experiment; if the wrong data is selected, then the experiment can fail to produce an adequate result.
In addition to questions of bias, the need to use extremely large data sets brings more familiar problems such as noise, missing values, outliers, unbalanced distributions, inconsistency, redundancy, heterogeneity, stale data, duplication, and integration issues. Coding issues can creep in where preparation and attention to detail are lacking.
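Several of these problems can be counted mechanically before any modeling starts. The following sketch profiles one numeric field for missing values, exact duplicate rows, and outliers. It assumes rows are flat dictionaries with hashable values, and it uses the common 1.5 × IQR rule for outliers, which is one convention among several. The field names and sample rows are hypothetical.

```python
import statistics

def profile(rows, numeric_field):
    """Count missing values, exact duplicate rows, and IQR outliers
    for one numeric field."""
    values = [r[numeric_field] for r in rows if r.get(numeric_field) is not None]
    missing = len(rows) - len(values)

    # A row is a duplicate if an identical row was already seen.
    seen, duplicates = set(), 0
    for r in rows:
        key = tuple(sorted(r.items()))
        if key in seen:
            duplicates += 1
        seen.add(key)

    # Flag values more than 1.5 * IQR outside the quartiles.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in values if v < low or v > high]

    return {"missing": missing, "duplicates": duplicates, "outliers": outliers}

# Hypothetical rows: one missing amount, one repeated row, one extreme value.
rows = [
    {"id": 1, "amount": 10},
    {"id": 2, "amount": 12},
    {"id": 3, "amount": 11},
    {"id": 4, "amount": 13},
    {"id": 5, "amount": 12},
    {"id": 6, "amount": 500},
    {"id": 7, "amount": None},
    {"id": 2, "amount": 12},
]
report = profile(rows, "amount")
print(report)
```

Whether a flagged value is an error or a legitimate rare event is a domain question, so a profile like this is a starting point for review rather than a filter.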
Huge data sets can be screened and wrangled through programmatic methods, some of which include ML or other AI-based methodologies. However, even in these cases it is difficult to ensure that systemic bias or incorrect problem definition does not occur. Checking algorithms and training them against diverse data is imperative for ensuring data quality. The algorithm and data need to be understood in terms of the desired result.
Quality Issues from Models
Another issue with data prepared for ML and AI is the need to create static models for real-time use after training has completed. Although AI provides considerable flexibility in discovering patterns and creating workable models for specific cases, changes in conditions reflected in the data stream can result in another kind of error. The data may be processed in real time, but the use of a static model means that even small changes in the data stream can produce incorrect results. For this reason, results need to be continuously monitored to ensure that new biases or wrong conclusions are not derived due to alterations in the data.
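Monitoring of this kind often starts by comparing the distribution of an incoming feature against the distribution seen at training time. The sketch below uses the population stability index (PSI), a swapped-in technique the article does not name. The bin count, the smoothing constant, and the 0.25 alert level (a commonly quoted rule of thumb) are assumptions to tune per use case.

```python
import math

def psi(baseline, current, bins=5, eps=1e-6):
    """Population stability index of `current` against `baseline`,
    using equal-width bins over the baseline's range."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant data

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the baseline range go to the edge bins.
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor empty bins at eps so the logarithm is defined.
        return [max(c / len(values), eps) for c in counts]

    expected, actual = shares(baseline), shares(current)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))

# Hypothetical feature values: training data versus a shifted live stream.
training = list(range(100))
live = [v + 50 for v in training]

ALERT_LEVEL = 0.25  # assumed threshold
score = psi(training, live)
print("drift detected" if score > ALERT_LEVEL else "stable")
```

A score near zero means the live data still resembles the training data. A high score does not say the model is wrong, only that its inputs have moved and its results deserve a closer look.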
An additional cause for concern is the interaction of algorithm, training, data quality, and result. The algorithm itself can include data definitions that are inherently prejudicial, or data used in training may not reflect the global data against which the system is to be used. This problem is compounded where data is collected from an area entirely different from the domain of the training data and original use of the model.
Finding a Solution
To solve your data quality problem, you must ensure that both your training data and your working data repository have sufficiently high quality for the task at hand. This requires:
- Data analysis including data characteristics, distribution, source, and relevance.
- Review of outliers, exceptions, and anything that stands out as suspicious with respect to the business conditions being considered.
- Domain expertise from subject matter experts to explain unexpected data patterns so that potentially valid information is not lost and potentially invalid information does not influence the result.
- Documentation: the process used must be transparent and repeatable. A data quality reference store is a good way to maintain metadata and validity rules, and this should make the creation of new algorithms and adjustments easier.
Additionally, the processing pipeline needs to be continuously validated based on the rules and experience of previous analysis. Although the specifics might need to be adjusted as data changes, each business will have its own set of domain rules that need to be applied to determine validity.
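A data quality reference store can be as simple as a named set of validity rules that every batch passes through. The sketch below is one possible shape for that idea, under the assumption that records are dictionaries. The rule names, the `amount` and `currency` fields, and the allowed currency list are hypothetical stand-ins for a business's own domain rules.

```python
# Hypothetical rule store: rule name -> predicate returning True for valid records.
RULES = {
    "amount_non_negative": lambda r: r.get("amount") is not None and r["amount"] >= 0,
    "currency_known": lambda r: r.get("currency") in {"USD", "EUR"},
}

def validate(records, rules):
    """Return (record_index, rule_name) for every rule a record fails."""
    return [
        (i, name)
        for i, record in enumerate(records)
        for name, check in rules.items()
        if not check(record)
    ]

# Hypothetical batch arriving in the pipeline.
batch = [
    {"amount": 25.0, "currency": "USD"},
    {"amount": -3.0, "currency": "EUR"},
    {"amount": 10.0, "currency": "XYZ"},
]
violations = validate(batch, RULES)
for index, rule in violations:
    print(f"record {index} failed {rule}")
```

Keeping the rules as named data rather than scattered conditionals makes the process easier to document and repeat, and rules can be added or adjusted as the data changes without touching the validation loop.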
Doing all this requires a data quality team and a sufficient set of tools to operate on the data used in machine learning and AI programs. Given the complexity of data and the individuality of domains, each case is likely to be significantly different. In general, the greater the use of complex and unstructured data, the more careful the evaluation needs to be.
As digital transformation proceeds, more enterprises are rapidly jumping on the ML bandwagon and creating larger and more complex data streams with greater data quality difficulties. Quality tools will continue to evolve in response.