Spam SMS Classification Analysis Using Naive Bayes with Python Language

Beny Yusman

Authors

Beny Yusman, Universitas Hafshawaty Zainul Hasan

Keywords:

SMS fraud (spam); promotional SMS; normal SMS; Naïve Bayes; classification.

Abstract

Short Message Service (SMS) remains widely used in Indonesia by both official institutions and private entities, despite the growing prevalence of internet-based communication. This study classifies SMS messages into three categories (normal, promotional, and fraudulent or spam) using the Naïve Bayes algorithm. The dataset comprises 1,143 records obtained from an open-source platform on GitHub. The research proceeds in four stages: dataset collection; text preprocessing (case folding, tokenization, filtering, normalization, and stemming); term weighting with two text representation techniques, Count Vectorizer and TF-IDF; and classification with the Multinomial Naïve Bayes algorithm. Classification performance was evaluated using a confusion matrix together with accuracy, precision, recall, and F1-score. Both combinations, Multinomial Naïve Bayes with Count Vectorizer and with TF-IDF, performed well in classifying SMS messages. The Count Vectorizer model achieved an accuracy of 93%, while the TF-IDF model showed competitive precision and recall. These findings indicate that Naïve Bayes, when paired with an appropriate text representation, can serve as an effective basis for automatic SMS classification systems, particularly for short messages in Indonesian. The research also opens opportunities to explore more advanced classification algorithms in future studies.
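
The pipeline summarized in the abstract can be illustrated with a minimal, dependency-free Python sketch. It is not the study's implementation. It assumes raw term counts (the Count Vectorizer variant, with TF-IDF weighting not shown), Laplace smoothing with alpha = 1, and a simplified preprocessing step that omits normalization and stemming. The toy messages, the label names `normal`, `promo`, and `fraud`, the stop-word list, and the function names `preprocess`, `train`, and `predict` are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical toy messages; the study's 1,143-record dataset is not reproduced here.
TRAIN = [
    ("Nanti sore kita ketemu di kampus ya", "normal"),
    ("Besok rapat jam sembilan di kantor", "normal"),
    ("Promo kuota internet murah diskon hari ini", "promo"),
    ("Diskon spesial paket internet beli sekarang", "promo"),
    ("Selamat Anda menang hadiah undian transfer pajak sekarang", "fraud"),
    ("Anda menang undian hadiah mobil hubungi nomor ini", "fraud"),
]

# Illustrative stop-word list only (an assumption, not the study's filter list).
STOPWORDS = {"di", "ya", "ini", "anda"}


def preprocess(text):
    """Case folding, tokenization, and stop-word filtering (no stemming)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def train(samples, alpha=1.0):
    """Collect per-class document counts and per-class term counts."""
    class_docs = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for text, label in samples:
        tokens = preprocess(text)
        class_docs[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return {
        "class_docs": class_docs,
        "term_counts": term_counts,
        "vocab": vocab,
        "alpha": alpha,
        "n_docs": len(samples),
    }


def predict(model, text):
    """Return the class with the highest log prior plus summed log likelihoods."""
    # Terms never seen in training are ignored.
    counts = Counter(t for t in preprocess(text) if t in model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n_class in model["class_docs"].items():
        total_terms = sum(model["term_counts"][label].values())
        denom = total_terms + model["alpha"] * len(model["vocab"])
        score = math.log(n_class / model["n_docs"])
        for term, c in counts.items():
            smoothed = model["term_counts"][label][term] + model["alpha"]
            score += c * math.log(smoothed / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAIN)
print(predict(model, "Promo diskon kuota internet"))
```

Scores are computed in log space so that multiplying many small probabilities does not underflow, and the smoothing term keeps a word that never appeared in a class from driving that class's probability to zero. Evaluating such a model as the abstract describes would additionally require a held-out test split, from which the confusion matrix, accuracy, precision, recall, and F1-score are computed.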