Securing healthcare systems: a random forest approach to malicious URL detection

Conte L.; De Nunzio G. (last author)
2025-01-01

Abstract

Purpose: With technological advancements, uniform resource locators (URLs) are increasingly used in healthcare to store patient records, reducing paperwork. However, security concerns arise because malicious URLs can deceive users, leading to data breaches. Machine learning (ML) offers a solution by analyzing past data to predict whether a URL is malicious or benign. Methods: In this work, a dataset from GitHub containing 151,828 URL samples was pre-processed, revealing unique characteristics of malicious URLs. Ad hoc feature extraction techniques were applied to capture these distinguishing traits. To classify URLs, various supervised ML classifiers were used, including logistic regression (LR), perceptron, decision tree (DT), random forest (RF), extreme gradient boosting (XGBoost), adaptive boosting (AdaBoost), gradient boosting (GB), k-nearest neighbors (KNN), support vector machine (SVM), CatBoost (CB), multinomial naive Bayes (MNB), Bernoulli naive Bayes (BNB), light gradient boosting machine (LGBM), and passive aggressive classifier (PAC). Additionally, “automatic” feature extraction was performed using the term frequency-inverse document frequency (TF-IDF) method, and the extracted features were then used with LR, DT, RF, XGBoost, CB, KNN, LGBM, PAC, MNB, and BNB. Results: Experimental results demonstrate that automatic feature extraction improves classification accuracy, making it a reliable method for detecting malicious URLs. The RF classifier performed best with both methods, achieving 99.82% accuracy with automatic feature extraction compared to 99.57% with hand-crafted features. The other metrics also improved with automatic feature extraction, reaching 99.84% precision, 99.44% recall, and 99.64% F1 score. Conclusion: This approach has potential applications in securing healthcare systems, web browsers, and cybersecurity platforms, helping prevent unauthorized access to sensitive information.
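
As an illustration of the "automatic" feature extraction step mentioned in the abstract, the following is a minimal sketch of TF-IDF weighting applied to URL tokens. It is not the authors' implementation: the tokenizer (splitting on non-alphanumeric characters), the smoothed IDF variant, the function names `tokenize` and `tfidf`, and the toy URLs are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(url):
    """Split a URL into lowercase alphanumeric tokens (an assumed tokenizer)."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]


def tfidf(urls):
    """Return (vocabulary, rows); each row maps token -> TF-IDF weight."""
    docs = [tokenize(u) for u in urls]
    n = len(docs)
    # Document frequency: number of URLs that contain each token.
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed inverse document frequency (one common variant, assumed here).
    idf = {tok: math.log((1 + n) / (1 + d)) + 1.0 for tok, d in df.items()}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1
        # Term frequency is the token count divided by the URL's token count.
        rows.append({tok: (c / total) * idf[tok] for tok, c in counts.items()})
    return sorted(df), rows


# Hypothetical toy URLs, not taken from the paper's dataset.
urls = [
    "http://example.com/login",
    "http://example.com/records/view",
    "http://secure-login.example.test/verify/account/login",
]
vocab, rows = tfidf(urls)
```

Tokens that occur in every URL (such as `http`) receive the minimum IDF, while rarer tokens are weighted more heavily. In a pipeline like the one the abstract describes, such weights would typically be assembled into a feature matrix and passed to a classifier such as a random forest.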
Files in this item:
2025 Securing healthcare systems a random forest approach to malicious URL detection.pdf
Access: open access
Type: Publisher's version
License: Creative Commons
Size: 3.98 MB
Format: Adobe PDF

Use this identifier to cite or link to this document: https://hdl.handle.net/11587/576274