Predicting Readability of Health Educational Resources for Children Using Semantic Features

  • Yanmeng Liu, School of Languages and Cultures, The University of Sydney, Australia
Keywords: health education materials, children, readability, semantic, machine learning


The success of health education resources depends largely on their readability: target readers can understand and accept health information only when it is presented at an appropriate reading difficulty. Compared with other populations, children have limited background knowledge and still-developing reading comprehension, which makes readability research on health education resources for them more challenging. This research explores readability prediction for children's health education resources by using semantic features to train machine learning algorithms. A data-driven method was applied: 1000 health education articles were collected from the websites of international health organizations and grouped, according to their sources, into resources for kids and resources for non-kids. Then, 73 semantic features were used to train five machine learning algorithms (decision tree, support vector machine, k-nearest neighbors algorithm, ensemble classifier, and logistic regression). The results showed that the k-nearest neighbors algorithm and the ensemble classifier outperformed the other algorithms in terms of area under the receiver operating characteristic curve (AUC), sensitivity, specificity, and accuracy, and that they achieved good performance in predicting whether the readability of a health education resource is suitable for children.
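The workflow summarized above (a binary kids versus non-kids classifier scored by AUC, sensitivity, specificity, and accuracy) can be illustrated with a minimal pure-Python sketch. This is not the study's implementation: the two-dimensional feature vectors are invented toy values standing in for the 73 semantic features, the names `knn_score`, `confusion_metrics`, and `auc` are hypothetical helpers, and the choices `k=3` and a 0.5 decision threshold are assumptions made only for illustration.

```python
from math import dist  # Euclidean distance, available in Python 3.8+

def knn_score(train_X, train_y, x, k=3):
    """Return the fraction of the k nearest training articles labelled 1 (kids)."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], x))[:k]
    return sum(label for _, label in nearest) / k

def confusion_metrics(y_true, y_pred):
    """Sensitivity, specificity, and accuracy for binary labels (1 = kids)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return {
        "sensitivity": tp / (tp + fn),  # share of kids articles detected
        "specificity": tn / (tn + fp),  # share of non-kids articles detected
        "accuracy": (tp + tn) / len(y_true),
    }

def auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random
    negative, counting ties as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data (invented): each article is a vector of two hypothetical
# semantic-feature frequencies; label 1 = written for kids, 0 = not.
train_X = [(0.9, 0.1), (0.8, 0.2), (0.7, 0.3), (0.85, 0.25),
           (0.2, 0.8), (0.1, 0.9), (0.3, 0.7), (0.25, 0.6)]
train_y = [1, 1, 1, 1, 0, 0, 0, 0]

test_X = [(0.75, 0.2), (0.6, 0.4), (0.4, 0.6),
          (0.15, 0.85), (0.3, 0.65), (0.6, 0.45)]
test_y = [1, 1, 1, 0, 0, 0]

scores = [knn_score(train_X, train_y, x) for x in test_X]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

metrics = confusion_metrics(test_y, y_pred)
metrics["auc"] = auc(test_y, scores)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Reporting sensitivity and specificity alongside accuracy shows whether errors fall mainly on kids or non-kids articles, while AUC summarizes ranking quality independently of the chosen threshold.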

