SQLEM++: An Explainable Hybrid Learning Framework for SQL Query Cost Prediction and Optimization
DOI:
https://doi.org/10.59992/IJCI.2026.v5n6p6
Keywords:
SQL Query Optimization, SQL-BERT, Learned Cost Estimation, Graph Neural Networks, SQLEM++
Abstract
Relational database systems can process billions of SQL queries per day, yet their cost-based optimizers rely on hand-crafted heuristics and simplified cardinality estimators that can produce large estimation errors. These errors propagate into query plan selection, often leading to poor or even disastrous execution performance. This paper presents SQLEM++ (SQL Query Learning and Embedding Model), a new framework for SQL query cost prediction and recommendation of optimized SQL queries. SQLEM++ is a dual-encoder architecture that combines a domain-adapted transformer (SQL-BERT) with a structural Graph Neural Network (GNN) over query Abstract Syntax Trees to encode queries as semantically rich representations. The framework also includes heterogeneous feature extraction covering join complexity, filter selectivity, index coverage, and schema-level statistics. A multi-task prediction module estimates execution cost and classifies query quality into actionable levels, using an asymmetric Huber loss. Further, an explainability layer based on SHAP attribution and attention rollout converts model predictions into interpretable optimization recommendations.
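The abstract names an asymmetric Huber loss but does not define it. The following is a minimal sketch of one plausible form, assuming the asymmetry is a sign-dependent weight on the standard Huber penalty. The function name and the parameters `delta`, `under_weight`, and `over_weight` are illustrative assumptions, not values taken from the paper.

```python
def asymmetric_huber(pred, target, delta=1.0, under_weight=2.0, over_weight=1.0):
    """Huber loss with a weight that depends on the sign of the residual.

    All parameter names and defaults are hypothetical; the paper's exact
    formulation is not given in the abstract.
    """
    residual = pred - target
    abs_res = abs(residual)
    # Standard Huber: quadratic near zero, linear beyond delta.
    if abs_res <= delta:
        base = 0.5 * residual * residual
    else:
        base = delta * (abs_res - 0.5 * delta)
    # Assumed asymmetry: underestimating cost (pred < target) costs more.
    weight = under_weight if residual < 0 else over_weight
    return weight * base


def mean_asymmetric_huber(preds, targets, **kwargs):
    """Average the per-query loss over a batch of predictions."""
    losses = [asymmetric_huber(p, t, **kwargs) for p, t in zip(preds, targets)]
    return sum(losses) / len(losses)
```

Weighting underestimates more heavily reflects the intuition that a plan predicted to be cheap but actually expensive is the more harmful mistake; whether the paper uses this direction of asymmetry is an assumption.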
On top of this, SQLEM++ is introduced as an adaptive extension with three components: a hybrid fusion of the learned cost model with traditional optimizers, an uncertainty-aware decision algorithm for reliability control, and a reinforcement learning feedback loop for continuous self-improvement.
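As an illustration of how an uncertainty-aware decision rule might gate a learned cost model against a traditional optimizer, the sketch below assumes that uncertainty is measured as the relative spread of an ensemble's cost predictions. The function `choose_plan`, the threshold `max_rel_std`, and the ensemble setup are hypothetical and are not described in the abstract.

```python
from statistics import mean, pstdev


def choose_plan(ensemble_costs, optimizer_plan, max_rel_std=0.2):
    """Pick the learned model's plan only when its prediction is trusted.

    ensemble_costs: dict mapping a candidate plan id to the list of costs
        predicted by each ensemble member (hypothetical input format).
    optimizer_plan: plan id chosen by the traditional optimizer.
    max_rel_std: assumed threshold on relative standard deviation.
    """
    # Candidate with the lowest mean predicted cost.
    best = min(ensemble_costs, key=lambda plan: mean(ensemble_costs[plan]))
    preds = ensemble_costs[best]
    mu = mean(preds)
    # Disagreement among ensemble members serves as the uncertainty proxy.
    rel_std = pstdev(preds) / mu if mu > 0 else float("inf")
    if rel_std <= max_rel_std:
        return best, "learned"
    # Too uncertain: fall back to the traditional optimizer's choice.
    return optimizer_plan, "fallback"
```

For example, `choose_plan({"a": [10, 11, 9], "b": [20, 21, 19]}, "b")` selects plan `"a"` because the ensemble agrees closely, while widely scattered predictions for the cheapest candidate trigger the fallback.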
An extensive evaluation on TPC-H (scale factors 1–100) and TPC-DS shows that SQLEM achieves an average query execution speed-up of 3.24× over the PostgreSQL 16 optimizer and a 71.4% reduction in mean absolute prediction error, with an R² of 0.94. Ablation studies confirm that the semantic and structural encoders complement each other.
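For readers unfamiliar with the reported metrics, the sketch below shows how speed-up, relative error reduction, mean absolute error, and R² are conventionally computed. It uses the standard textbook definitions; the abstract does not state the exact evaluation protocol, and the function names are illustrative.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted cost."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def relative_reduction(baseline_error, model_error):
    """Percentage by which the model lowers the baseline error."""
    return 100.0 * (baseline_error - model_error) / baseline_error


def speedup(baseline_time, optimized_time):
    """Ratio of baseline execution time to optimized execution time."""
    return baseline_time / optimized_time
```

Under these definitions, a speed-up of 3.24 means the baseline plan takes 3.24 times as long as the optimized one, and a 71.4% reduction means the model's error is 28.6% of the baseline's.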