Supervised Learning for data cleaning in the coherence and completeness dimensions
Information has become an asset for companies because most strategic business decisions are based on data analysis; however, these analyses do not always yield the best results because of low information quality. Data quality has several evaluation dimensions, which makes achieving an adequate level of quality a complex task. One of the main activities before any type of analysis is data pre-processing. This activity is among the most time-consuming, yet it does not always reach the expected quality levels or cover the evaluation dimensions with the most significant impact. This work presents the use of machine learning as a tool to clean data in the completeness and coherence dimensions; it is validated on a data set provided by a government entity in charge of protecting children's rights at the national level. The work proceeds through the selection of the information processing tools, the descriptive analysis of the data, the identification of the specific problems to which machine learning techniques will be applied to improve data quality, the experimentation with and evaluation of different models, and finally the implementation of the best-performing model. Among the results, the completeness dimension improved, with null data decreasing by 4.9%. In the coherence dimension, 2.6% of the records were identified as containing contradictions, thus validating machine learning for data cleaning.
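To make the two cleaning tasks concrete, the following is a minimal sketch and not the authors' implementation. It assumes a tiny hypothetical data set (age, height, age group) and a hand-written k-nearest-neighbours classifier chosen only for brevity; the abstract does not state which models or fields the study used. A missing label is filled in with the model's prediction (completeness), and a recorded label that disagrees with the prediction is flagged as a possible contradiction (coherence).

```python
from collections import Counter


def knn_predict(train, query, k=3):
    """Predict a label by majority vote among the k nearest training rows."""
    # Squared Euclidean distance is enough for ranking neighbours.
    nearest = sorted(
        train,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], query)),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Hypothetical records: (age_years, height_cm) -> age_group.
# None marks a null value. These values are illustrative assumptions.
records = [
    ((2, 85), "toddler"),
    ((3, 95), "toddler"),
    ((4, 102), "toddler"),
    ((10, 138), "child"),
    ((11, 143), "child"),
    ((12, 150), "child"),
    ((3, 96), None),          # completeness problem: label is missing
    ((11, 140), "toddler"),   # coherence problem: label contradicts features
]

cleaned = []   # records with null labels imputed
flagged = []   # indices of records whose label looks contradictory

for i, (features, label) in enumerate(records):
    # Leave-one-out: train on every other record that has a label.
    train = [
        (f, y) for j, (f, y) in enumerate(records) if j != i and y is not None
    ]
    predicted = knn_predict(train, features, k=3)
    if label is None:
        cleaned.append((features, predicted))   # impute the null value
    else:
        cleaned.append((features, label))       # keep the recorded value
        if predicted != label:
            flagged.append(i)                   # send to manual review

print("imputed label:", cleaned[6][1])
print("flagged record indices:", flagged)
```

In this sketch flagged records are only reported, not overwritten, since a disagreement between a model and a recorded value indicates a candidate contradiction rather than a confirmed error.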
Accepted 2021-11-26
Published 2022-05-26
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Authors grant the journal and Universidad del Valle the economic rights over accepted manuscripts, but may make any reuse they deem appropriate for professional, educational, academic or scientific reasons, in accordance with the terms of the license granted by the journal to all its articles.