Data Quality

Data Quality concepts

Data Quality Overview

  • Determine Data Quality rules

  • Determine threshold values for columns/attributes

  • Ensure referential integrity

  • Ensure constraints are determined

  • Conditional expectations across multiple columns

Data Quality Dimensions

Data Quality Summary

Data Quality Dimensions

Functionality

Time Series (Time)

Geo (Location)

Graph (relation)

Data Quality Paper - Automated Data Quality

Prashant's paper on automated data quality

Autdo_DQ_Paper_Submission_Final.pdf

Automated Data Quality

The goal of automated data quality is to determine data quality rules automatically by scanning the schema and historical data. After scanning, the system determines which rules and thresholds the historical data supports and builds them automatically.

A domain expert can review these suggestions and approve or edit them.

Quality-check jobs then scan incoming data at a defined frequency and highlight when approved rules are violated.
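
A minimal sketch of this suggest-review-monitor loop, assuming pandas DataFrames; the function and rule names are illustrative, not from the paper or any product:

```python
# Suggest rules from historical data, then check incoming batches against the
# approved rules. All names here are illustrative.
import pandas as pd

def suggest_rules(reference: pd.DataFrame) -> list[dict]:
    """Scan historical data and propose simple rules with thresholds."""
    rules = []
    for col in reference.columns:
        series = reference[col]
        # Completeness: allow at most the historically observed null rate.
        rules.append({"column": col, "check": "null_rate",
                      "threshold": float(series.isna().mean())})
        if pd.api.types.is_numeric_dtype(series):
            # Range: values should stay inside the historical min/max.
            rules.append({"column": col, "check": "range",
                          "min": float(series.min()), "max": float(series.max())})
    return rules  # a domain expert reviews/edits these before approval

def check_batch(batch: pd.DataFrame, approved_rules: list[dict]) -> list[str]:
    """Run approved rules against an incoming batch; return violations."""
    violations = []
    for rule in approved_rules:
        col = batch[rule["column"]]
        if rule["check"] == "null_rate" and col.isna().mean() > rule["threshold"]:
            violations.append(f"{rule['column']}: null rate too high")
        elif rule["check"] == "range" and ((col < rule["min"]) | (col > rule["max"])).any():
            violations.append(f"{rule['column']}: value outside historical range")
    return violations
```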


Automated Data Quality

Financial services companies depend on petabytes of data to make decisions about investments, services, and operations. Data-centric methods are needed to ensure the quality of the data used for ML model-based and other business process automation. This paper presents AutoDQ, an end-to-end data quality assurance framework that monitors production data quality and leverages ML to identify and select validation constraints. AutoDQ introduces novel unit tests derived from the automatic extraction of data semantics and inter-column relationships, in addition to constraints based on predictability and statistical profiling of data. It operates on both tabular and time-series data without requiring a schema or any metadata. The components of our framework have been tested on over 100 public datasets as well as several internal transactional datasets.


1 Introduction

The financial industry has made growing use of ML for a range of applications [1, 2]. The performance of financial ML systems depends on the quality of the data used [3, 4, 5]. Data quality (DQ) is a business requirement, and automated data quality methods are important for scaling assurance systems and reducing cost [6, 7, 8, 9]. In ML model deployments, errors in data may be propagated and amplified as they move through the pipeline [10, 11], a common delivery pattern [12, 13, 14, 15]. AutoDQ is a data quality assurance framework for ML production pipelines and other data-dependent applications. It automatically determines constraint checks for quality metrics and constantly monitors the data. The framework does not require data schemas or any metadata to achieve data validation along these quality dimensions [16, 17, 18, 19]: completeness, consistency, accuracy or correctness, and timeliness. To ensure that the inspected data complies with these dimensions, AutoDQ combines constraint checking and data monitoring via anomaly detectors for time series. Constraint checking occurs in two steps: (1) constraints, metrics, and thresholds are automatically learned from reference data, producing unit tests; (2) the unit tests use these metrics and thresholds to check quality and detect anomalies in production data.

Previous work and new work

Previous work [20, 21] presents automated DQ methods based on constraints extracted from statistical analyses and validation of values through ML-based predictions. AutoDQ extends this work by automatically determining both inter- and intra-column semantic constraints. The advantages include bringing data quality metrics closer to the application use of the data, and enabling data monitoring to leverage a richer range of data constraints.

AutoDQ makes these new contributions:

  • Automatic identification of semantic-based constraints for both inter- and intra-column data relationships.




2 AutoDQ Framework

Figure 1 shows our AutoDQ framework for ensuring data quality in ML production pipelines and other data-dependent services. The framework validates tabular data with predictability-based, semantic-based, and statistics-based constraints. These constraints are determined for inter- and intra-column relationships and used for the detection of anomalies through unit tests. Predictability targets data accuracy, while the statistics-based method validates data completeness and consistency. The semantic-based method uses inferred specialized data types to determine semantic constraints. AutoDQ also monitors production time-series data using ML-based algorithms. Our framework assumes the availability of reference data for the generation of constraints and anomaly detection models.

Precomputation. This phase computes the artifacts needed for the generation of constraints and trains our time-series anomaly detectors. In the case of the predictability constraint, given a tabular dataset D formed by the set of column vectors {d_i | i = 1, ..., n}, a model m_i is trained for each column vector d_i. For i = k, the corresponding model m_k is trained so that d_k is the dependent variable and the remaining columns D − {d_k} are the independent variables. Whether each model m_i is a classifier or a regressor is determined by the data type of the column d_i. Metrics such as accuracy, F1-score, MSE, and RMSE [22] are produced by testing these models with a portion of the reference data.
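
A minimal sketch of this per-column training loop, assuming pandas and scikit-learn; the model family, categorical encoding, and metric handling are simplifications rather than the paper's exact setup:

```python
# Per-column predictability precomputation: train m_i with column d_i as the
# target and the remaining columns as features; the metric depends on the
# column's data type.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.model_selection import train_test_split

def precompute_predictability(D: pd.DataFrame) -> dict:
    results = {}
    for target in D.columns:
        X = pd.get_dummies(D.drop(columns=[target]))  # naive categorical encoding
        y = D[target]
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2)
        if pd.api.types.is_numeric_dtype(y):
            model = RandomForestRegressor().fit(X_tr, y_tr)
            metric = -mean_squared_error(y_te, model.predict(X_te))  # higher is better
        else:
            model = RandomForestClassifier().fit(X_tr, y_tr)
            metric = accuracy_score(y_te, model.predict(X_te))
        results[target] = {"model": model, "metric": metric}
    return results
```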

The semantic-based method is one of our main contributions. AutoDQ infers specialized data types on tabular data using a set of heuristics and data properties. Examples of specialized data types include: credit score, account number, CVV, VIN, routing number, email, IP address, port, phone, street address, datetime, and various categorical types. These specialized types are leveraged to generate new constraints during the constraint suggestion phase.
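
As a rough illustration of heuristic type inference, the toy sketch below matches raw string values against a few regular expressions; AutoDQ's actual heuristics and type catalog are richer, and these patterns are illustrative only:

```python
# Toy regex heuristics for specialized data types; real systems combine many
# more signals (value pools, checksums, column names, distributions).
import re
from typing import Optional

PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "ip_address": re.compile(r"^(\d{1,3}\.){3}\d{1,3}$"),
    "us_routing_number": re.compile(r"^\d{9}$"),
    "cvv": re.compile(r"^\d{3,4}$"),
}

def infer_specialized_type(values: list, min_ratio: float = 0.9) -> Optional[str]:
    """Return the first specialized type matching most non-null values."""
    values = [str(v) for v in values if v is not None]
    for name, pattern in PATTERNS.items():
        if values and sum(bool(pattern.match(v)) for v in values) / len(values) >= min_ratio:
            return name
    return None

# e.g. infer_specialized_type(["a@b.com", "c@d.org"]) -> "email"
```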

For the statistics-based approach, AutoDQ extracts statistical metrics and thresholds from the data, which are used later to select constraints. Thresholds are estimated parametrically. When parametric approaches are not applicable or they depend on the underlying distribution, thresholds are approximated using bootstrapping methods [23].
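
A small sketch of the bootstrapping idea, assuming NumPy: resample the reference column, compute the statistic on each resample, and take an empirical quantile band as the threshold:

```python
# Bootstrap a threshold band for a column statistic (the mean by default);
# a future batch whose statistic falls outside the band would be flagged.
import numpy as np

def bootstrap_threshold(column: np.ndarray, stat=np.mean,
                        n_boot: int = 1000, alpha: float = 0.01):
    rng = np.random.default_rng(0)
    stats = [stat(rng.choice(column, size=len(column), replace=True))
             for _ in range(n_boot)]
    lower, upper = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return lower, upper
```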

AutoDQ trains time-series anomaly detection models with the reference data and collected metrics. These models are used to detect anomalies in the range and distribution of values. ML algorithms that have been tested include MLR, SVM, ANN, CatBoost, LGBM, Extra Trees, and Random Forest.
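
One way such a detector could look, sketched with a Random Forest (one of the families listed) on lagged features; the feature construction and the quantile-based residual threshold are assumptions, not the paper's implementation:

```python
# Fit a regressor on lagged values of a metric series, then learn a residual
# threshold on the reference window; future points whose residual exceeds it
# are flagged as anomalous.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fit_detector(series: np.ndarray, n_lags: int = 7):
    # Each row holds the previous n_lags values; the target is the next value.
    X = np.column_stack([series[i:len(series) - n_lags + i]
                         for i in range(n_lags)])
    y = series[n_lags:]
    model = RandomForestRegressor().fit(X, y)
    residuals = np.abs(y - model.predict(X))
    threshold = np.quantile(residuals, 0.99)  # learned on reference data
    return model, threshold
```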



Constraint Suggestion. The predictability constraint is suggested based on user-defined thresholds. Given the computed metric and the threshold thr, the set of highly predictable columns (HPC) in the dataset D is defined as HPC = {d_i | metric > thr}. This constraint allows us to find deterministic patterns in data (e.g., the column group <child, adult, senior> is determined by the column <age>).
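
Continuing the earlier precomputation sketch, selecting the HPC set is then a simple filter over the per-column metrics (`results` and `thr` are the hypothetical names from that sketch):

```python
# HPC = {d_i | metric > thr}: keep the columns whose predictability metric
# exceeds the user-defined threshold.
def suggest_predictability_constraints(results: dict, thr: float) -> list[str]:
    # Each selected column later yields the unit test
    # "predicted value == actual value" on production data.
    return [col for col, r in results.items() if r["metric"] > thr]
```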

The semantic-based method uses specialized data types to select semantic inter- and intra-column constraints. Examples of semantic constraints are: is_credit_score_consistent(), is_credit_score_in_pool(). For semantic-based inter-column constraints, AutoDQ automatically discovers three types of patterns: (1) Comparative (e.g., the spread between the bid and ask columns is usually positive), (2) Monotonic (e.g., a company's cumulative sales follow a monotonically increasing pattern), and (3) State Transition (e.g., an account status cannot go from the closed to the pending state).
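
Illustrative unit tests for the three pattern types, assuming pandas; the column names (bid, ask, cumulative_sales, status) and the transition table are taken from the examples above and are otherwise hypothetical:

```python
# One toy check per inter-column pattern type.
import pandas as pd

def check_comparative(df: pd.DataFrame) -> bool:
    # Comparative: the bid-ask spread should be positive.
    return bool(((df["ask"] - df["bid"]) > 0).all())

def check_monotonic(df: pd.DataFrame) -> bool:
    # Monotonic: cumulative sales never decrease.
    return bool(df["cumulative_sales"].is_monotonic_increasing)

VALID_TRANSITIONS = {("open", "pending"), ("pending", "closed"), ("open", "closed")}

def check_state_transitions(df: pd.DataFrame) -> bool:
    # State transition: e.g. an account cannot go from closed back to pending.
    pairs = zip(df["status"], df["status"].shift(-1).dropna())
    return all(a == b or (a, b) in VALID_TRANSITIONS for a, b in pairs)
```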

In the case of the statistics-based approach, we collected multiple public datasets from which we extracted data type, distinctness, average, standard deviation, skewness, and other statistics. From this combination of datasets, we created a rule-based mechanism that suggests the single- and multi-column statistical constraints listed in [19], along with their previously computed thresholds.

Data Validation and Reference Data Update. Production data is validated through unit testing, to verify that the data complies with the constraints, and, in the case of time-series data, through the generated anomaly detectors. Production data typically comes in batches delivered over time. Once a batch is validated, it is appended to the reference data and the process repeats. By analyzing the incremental data as new batches arrive and by keeping our reference data up to date, we ensure the timeliness of our framework.
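
A minimal sketch of this validate-then-append loop, with illustrative names; a real pipeline would also log violations and quarantine failing batches rather than simply raising:

```python
# Validate a production batch against approved unit tests, then fold it into
# the reference data so that future thresholds reflect recent history.
import pandas as pd

def process_batch(batch: pd.DataFrame, reference: pd.DataFrame,
                  unit_tests: list) -> pd.DataFrame:
    failures = [test.__name__ for test in unit_tests if not test(batch)]
    if failures:
        raise ValueError(f"batch failed unit tests: {failures}")
    # Batch passed all checks: append it to the reference data.
    return pd.concat([reference, batch], ignore_index=True)
```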

3 Experiments

This section presents a summary of the evaluation of AutoDQ conducted with both public and proprietary data. It also includes detailed results obtained in experiments run over a selected number of AutoDQ components. Due to restrictions regarding the privacy of our proprietary data, the experiments presented with detailed results are those obtained with public datasets.

3.1 Test of AutoDQ Functionalities

Table 1 presents the number of tests performed by AutoDQ functionality. The summary includes all the permutations and types of constraints in the framework that have been evaluated with both public and proprietary datasets. The component Specialized Data Types Precomputation has been tested and released, allowing AutoDQ to discover constraints based on specialized data types. Other components in the framework are expected to evolve with additional experimentation.

3.2 Use Case of Our Semantic-Based Data Validation on Proprietary Data

This section presents results obtained with proprietary datasets over which we were able to identify multi-column patterns without any prior knowledge of the evaluated data. Two different patterns were discovered: (1) a deterministic pattern found through predictability, in which two categorical columns followed a hierarchical relationship, where the value of the higher-level column is determined by the value of the lower one; and (2) a comparative pattern in which multiple pairs of numeric columns strictly followed a rule whereby one value in the pair was always smaller (or larger) than its peer. These results were validated with a data analyst who manages the data.


3.3 Deterministic Pattern Algorithm Through Predictability on Simulated Data

We tested our algorithm on deterministic patterns using a 100K-record synthetic dataset. This dataset has 23 columns, of which 22 are filled with random values between -1 and 1. The remaining column is a categorical column indicating the quadrant of the Cartesian plane. Two of the 22 numeric columns, named t_x and t_y, determine the value of the quadrant column. For instance, in rows with the tuples (t_x, t_y) = (0.1, 0.1) and (t_x, t_y) = (0.1, −0.1), the corresponding quadrant values are "First" and "Third", respectively. The dataset was split 60-40 to obtain the simulated reference and future data, respectively. To test our algorithm, we induced well-identified errors in the quadrant column by randomly changing some values to incorrect ones; twenty records in the future dataset contain such invalid values. Through predictability, we identified the deterministic pattern in the reference dataset. Using the generated model m for the column vector quadrant, the constraint <predicted quadrant == actual quadrant> was used to check data quality on the future dataset. Records with invalid values were identified as anomalous when they violated the learned constraint. Our algorithm detected all the invalid records with a precision and recall of 100%. For comparison, iForest on the same dataset achieved a precision of 0.13% and a recall of 25%.
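
A rough reconstruction of this experiment under stated assumptions: conventional quadrant labels (which may differ from the paper's exact labeling) and a Random Forest standing in for the predictability model:

```python
# 22 random columns in [-1, 1], a quadrant label derived from t_x and t_y,
# 20 injected label errors, and a predictability model for the quadrant
# column that flags records violating the learned constraint.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 100_000
cols = [f"c{i}" for i in range(20)] + ["t_x", "t_y"]
df = pd.DataFrame(rng.uniform(-1, 1, size=(n, 22)), columns=cols)
df["quadrant"] = np.select(
    [(df.t_x > 0) & (df.t_y > 0), (df.t_x < 0) & (df.t_y > 0),
     (df.t_x < 0) & (df.t_y < 0)],
    ["First", "Second", "Third"], default="Fourth")

ref, fut = df.iloc[:60_000], df.iloc[60_000:].copy()   # 60-40 split
bad = rng.choice(fut.index, size=20, replace=False)    # induce 20 label errors
fut.loc[bad, "quadrant"] = np.where(
    fut.loc[bad, "quadrant"] == "First", "Second", "First")

model = RandomForestClassifier().fit(ref.drop(columns="quadrant"), ref["quadrant"])
# Constraint: predicted quadrant == actual quadrant; violations are anomalous.
violations = fut[model.predict(fut.drop(columns="quadrant")) != fut["quadrant"]]
```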

3.4 Anomaly Detection on Time Series Data

This experiment shows how AutoDQ employs prediction models to monitor the value ranges observed in the evaluated data. A gold price prediction model was trained using 10 years of 27 publicly available economic and financial time-series variables, such as silver and crude oil prices, the EUR-USD exchange rate, and the dollar index, which were treated as predictors of the gold price. Time series were up-sampled to daily frequency where needed. Once the model was trained and tested, a controlled anomaly generator was used to perturb the held-out test data for a set of the most significant time-series variables.

We used random and burst perturbations. In the former, 10% of the test data was randomly modified across the entire dataset. In the latter, two bursts of contiguous points, together containing 5% of the dataset, were added. In both cases, values were modified either by adding values drawn from a zero-centered Gaussian distribution or by dropping the values. With these anomalies added, we proceeded with the detection evaluation using three different distance metrics: absolute error, relative absolute error, and normalized squared error. Values in the test data beyond a threshold pre-learned at training time were considered anomalous. Table 2 shows the results obtained. As shown, we achieved high detection performance with the absolute and normalized squared errors. We were more interested in the random perturbations, as they were expected to be harder to detect. In both cases we obtained ROC AUCs above 90%, with a TPR of 0.75 and an FPR below 0.02.
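
A sketch of the two perturbation schemes and the absolute-error flagging, using a synthetic signal in place of the model's predictions; all parameters here are illustrative:

```python
# Random vs. burst perturbation of a series, then flag points whose absolute
# error to the clean signal exceeds a threshold that, in AutoDQ, would have
# been learned at training time.
import numpy as np

rng = np.random.default_rng(1)
clean = np.sin(np.linspace(0, 20, 1000))   # stand-in for the predicted series

perturbed = clean.copy()
# Random perturbation: modify 10% of points across the whole series.
rand_idx = rng.choice(len(clean), size=len(clean) // 10, replace=False)
perturbed[rand_idx] += rng.normal(0, 0.5, size=len(rand_idx))
# Burst perturbation: one contiguous run of modified points.
perturbed[400:450] += rng.normal(0, 0.5, size=50)

# Absolute-error detection (0.1 is an illustrative threshold).
flagged = np.flatnonzero(np.abs(perturbed - clean) > 0.1)
```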



4 Conclusion

We present AutoDQ, a framework for the assurance of data quality, with special attention to the automation of the process. Our system is primarily based on three types of constraints and their data validation unit tests: predictability-based, statistics-based, and semantic-based constraints. In addition, AutoDQ relies on anomaly detection mechanisms for continuous monitoring of both production time-series data and changes in the metrics over time. We introduce a novel approach to extracting specialized data types. These types provide important semantic information that enables the application of new semantic-based constraints. Components of AutoDQ have been tested with more than 100 public datasets as well as a number of proprietary datasets involving transactional data. These experiments show promising results, including the ability to detect anomalies using automatically generated models.



References

[1] J.P. Morgan - Artificial intelligence research. https://www.jpmorgan.com/technology/artificial-intelligence. Accessed: 2021-09-30.

[2] Wells Fargo - Engaging tech for your business. https://www.wellsfargo.com/biz/wells-fargo-works/planning-operations/planning-marketing/engaging-tech-for-your-business/. Accessed: 2021-09-30.

[3] Nithya Sambasivan, Shivani Kapania, Hannah Highfill, Diana Akrong, Praveen Paritosh, and Lora M Aroyo. “Everyone wants to do the model work, not the data work”: Data cascades in high-stakes AI. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pages 1–15, 2021.

[4] Lawrence A Palinkas, Sarah M Horwitz, Carla A Green, Jennifer P Wisdom, Naihua Duan, and Kimberly Hoagwood. Purposeful sampling for qualitative data collection and analysis in mixed method implementation research. Administration and Policy in Mental Health and Mental Health Services Research, 42(5):533–544, 2015.

[5] Alon Halevy, Peter Norvig, and Fernando Pereira. The unreasonable effectiveness of data. IEEE Intelligent Systems, 24(2):8–12, 2009.

[6] Mohammad Mahdavi, Felix Neutatz, Larysa Visengeriyeva, and Ziawasch Abedjan. Towards automated data cleaning workflows. Machine Learning, 15:16, 2019.

[7] Felix Biessmann, Jacek Golebiowski, Tammo Rukat, Dustin Lange, and Philipp Schmidt. Automated data validation in machine learning systems. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 2021.

[8] Shazia Sadiq, Tamraparni Dasu, Xin Luna Dong, Juliana Freire, Ihab F Ilyas, Sebastian Link, Renée J Miller, Felix Naumann, Xiaofang Zhou, and Divesh Srivastava. Data quality: The role of empiricism. ACM SIGMOD Record, 46(4):35–43, 2018.

[9] Tammo Rukat, Dustin Lange, Sebastian Schelter, and Felix Biessmann. Towards automated ml model monitoring: Measure, improve and quantify data quality. In ML Ops workshop at MLSys, 2019.

[10] Neoklis Polyzotis, Sudip Roy, Steven Euijong Whang, and Martin Zinkevich. Data management challenges in production machine learning. In Proceedings of the 2017 ACM International Conference on Management of Data, pages 1723–1726, 2017.

[11] David Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, and Dan Dennison. Hidden technical debt in machine learning systems. Advances in Neural Information Processing Systems, 28:2503–2511, 2015.

[12] Denis Baylor, Eric Breck, Heng-Tze Cheng, Noah Fiedel, Chuan Yu Foo, Zakaria Haque, Salem Haykal, Mustafa Ispir, Vihan Jain, Levent Koc, et al. Tfx: A tensorflow-based production-scale machine learning platform. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1387–1395, 2017.

[13] Joos-Hendrik Böse, Valentin Flunkert, Jan Gasthaus, Tim Januschowski, Dustin Lange, David Salinas, Sebastian Schelter, Matthias Seeger, and Yuyang Wang. Probabilistic demand forecasting at scale. Proceedings of the VLDB Endowment, 10(12):1694–1705, 2017.

[14] Tushar Chandra, Eugene Ie, Kenneth Goldman, Tomas Lloret Llinares, Jim McFadden, Fernando Pereira, Joshua Redstone, Tal Shaked, and Yoram Singer. Sibyl: a system for large scale machine learning. Keynote I PowerPoint presentation, Jul 28, 2010.

[15] Evan R Sparks, Shivaram Venkataraman, Tomer Kaftan, Michael J Franklin, and Benjamin Recht. KeystoneML: Optimizing pipelines for large-scale advanced analytics. In 2017 IEEE 33rd International Conference on Data Engineering (ICDE), pages 535–546. IEEE, 2017.

[16] Carlo Batini, Cinzia Cappiello, Chiara Francalanci, and Andrea Maurino. Methodologies for data quality assessment and improvement. ACM computing surveys (CSUR), 41(3):1–52, 2009.

[17] Monica Scannapieco and Tiziana Catarci. Data quality under a computer science perspective. Archivi & Computer, 2:1–15, 2002.

[18] Felix Naumann. Quality-driven query answering for integrated information systems, volume 2261. Springer, 2003.


[19] Jie Song and Yeye He. Auto-validate: Unsupervised data validation using data-domain patterns inferred from data lakes. In Proceedings of the 2021 International Conference on Management of Data, pages 1678–1691, 2021.

[20] Eric Breck, Neoklis Polyzotis, Sudip Roy, Steven Whang, and Martin Zinkevich. Data validation for machine learning. In MLSys, 2019.

[21] Sebastian Schelter, Dustin Lange, Philipp Schmidt, Meltem Celikel, Felix Biessmann, and Andreas Grafberger. Automating large-scale data quality verification. Proceedings of the VLDB Endowment, 11(12):1781–1794, 2018.

[22] Kathrin Blagec, Georg Dorffner, Milad Moradi, and Matthias Samwald. A critical analysis of metrics used for measuring progress in artificial intelligence. arXiv preprint arXiv:2008.02577, 2020.

[23] Bradley Efron. Bootstrap methods: another look at the jackknife. In Breakthroughs in statistics, pages 569–593. Springer, 1992.