End‑to‑End Data‑Mining Pipelines with Fine‑Tuned Language Models for Reliable Text Classification

Manasi Murlidhar Bendale

doi:10.67231/g6kree09

Authors

Manasi Bendale

Santa Clara University

DOI:

https://doi.org/10.67231/g6kree09

Keywords:

Text classification, Transformer Models, Fine-Tuning, Data Mining, Interpretability, Learning Curves, Model Evaluation, Reproducible Pipelines

Abstract

This paper proposes a practical data-mining workflow for supervised text classification that pairs pretrained neural architectures with a systematic, reproducible development process. The workflow covers the complete model life cycle, from corpus setup and data preprocessing to training, evaluation, and interpretation. It minimizes laborious manual feature extraction by leveraging the representational power of pretrained models while adhering to established protocols for data partitioning, model selection, and performance measurement. We use standard evaluation tools (precision, recall, F1-score, confusion matrices, and learning-curve analysis) to characterize classification performance, assess generalization, and study the impact of training-set size on predictive accuracy. Experimental results on several datasets show consistent improvements over standard machine-learning baselines while remaining computationally efficient and reproducible. The paper also provides practical advice on GPU-accelerated training to reduce execution time and improve scaling to larger text collections. To support transparent model evaluation, the workflow incorporates local, model-agnostic explanation techniques that indicate which input tokens drive a given prediction, enabling detailed error analysis and validation of individual classification decisions. Overall, the framework offers a portable, efficient, and interpretable toolbox for building accurate text classifiers in a common statistical computing environment.
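
As a rough illustration of the evaluation metrics named in the abstract, the following minimal Python sketch derives per-class precision, recall, and F1-score from a confusion matrix. The function names, the two labels, and the toy predictions are all hypothetical and chosen only for illustration; they are not taken from the paper, which works in its own computing environment.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Return counts[i][j]: items whose true label is labels[i] and predicted label is labels[j]."""
    index = {label: k for k, label in enumerate(labels)}
    counts = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        counts[index[t]][index[p]] += 1
    return counts


def per_class_scores(counts, labels):
    """Return {label: (precision, recall, f1)} computed from the confusion matrix."""
    scores = {}
    for k, label in enumerate(labels):
        tp = counts[k][k]
        fp = sum(row[k] for row in counts) - tp  # predicted as this class, but wrong
        fn = sum(counts[k]) - tp                 # truly this class, but missed
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Toy data (hypothetical): six documents with binary sentiment labels.
labels = ["neg", "pos"]
y_true = ["pos", "pos", "neg", "neg", "pos", "neg"]
y_pred = ["pos", "neg", "neg", "pos", "pos", "neg"]

counts = confusion_matrix(y_true, y_pred, labels)
scores = per_class_scores(counts, labels)
for label in labels:
    precision, recall, f1 = scores[label]
    print(f"{label}: precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Reporting the scores per class, rather than a single accuracy figure, makes it easier to see which classes account for the errors, which is the kind of error analysis the abstract describes.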

Published

2026-06-29

Issue

Vol. 1 No. 3 (2026)

Section

Articles

License

This work is licensed under a Creative Commons Attribution 4.0 International License.