Plant diseases are detrimental to the agriculture industry, causing substantial crop losses globally. To mitigate these losses, IoT- and AI-based smart agriculture solutions are being deployed for plant disease detection. However, the diversity of crops and their diseases poses enormous challenges for these methods. Additionally, the limited generalizability and black-box nature of existing deep learning models, together with the scarcity of in-field datasets, are the main bottlenecks in developing efficient and acceptable solutions for large-scale applications. In the present work, a lightweight model, 'ConViTX', is proposed for plant disease classification that demonstrates improved generalizability and explainability. The compact architecture of ConViTX fuses convolutional neural networks and vision transformers to capture local and global features simultaneously. Remarkably, ConViTX outperforms nine state-of-the-art deep learning methods on four publicly available datasets and a self-collected in-field maize dataset. Furthermore, the model provides explainable predictions through Gradient-weighted Class Activation Mapping (Grad-CAM) and Local Interpretable Model-Agnostic Explanations (LIME). ConViTX attains 98.8% accuracy on the maize dataset and 61.42% on raw images captured by a drone camera. With only 0.7 million parameters and 0.647 billion operations per second, the proposed model has the potential for deployment on resource-constrained precision agriculture setups.
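To make the local/global fusion idea concrete, the sketch below pairs a convolutional stem (local lesion texture) with a transformer encoder layer (global leaf context) in PyTorch. It is a minimal illustrative sketch: the class names, layer sizes, and hyperparameters are assumptions for demonstration only and do not reproduce the authors' actual ConViTX architecture.

```python
# Minimal sketch of a CNN + vision-transformer fusion block (illustrative only;
# not the authors' ConViTX implementation).
import torch
import torch.nn as nn


class ConvViTFusionBlock(nn.Module):
    """Convolutional stem for local features, then self-attention for global context."""

    def __init__(self, in_channels: int = 3, embed_dim: int = 64, num_heads: int = 4):
        super().__init__()
        # Convolutional stem: captures local texture/lesion patterns and downsamples.
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, embed_dim, kernel_size=7, stride=4, padding=3),
            nn.BatchNorm2d(embed_dim),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2),
        )
        # Transformer encoder layer: models global dependencies across spatial tokens.
        self.attn = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads,
            dim_feedforward=2 * embed_dim, batch_first=True,
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feat = self.conv(x)                        # (B, C, H, W) local features
        b, c, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)   # (B, H*W, C) token sequence
        tokens = self.attn(tokens)                 # global self-attention
        return tokens.transpose(1, 2).reshape(b, c, h, w)


class TinyConvViTClassifier(nn.Module):
    """Toy classifier head on top of the fusion block."""

    def __init__(self, num_classes: int = 4, embed_dim: int = 64):
        super().__init__()
        self.block = ConvViTFusionBlock(embed_dim=embed_dim)
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(embed_dim, num_classes)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.block(x))


if __name__ == "__main__":
    model = TinyConvViTClassifier(num_classes=4)
    logits = model(torch.randn(2, 3, 224, 224))    # e.g. a small batch of leaf images
    print(logits.shape)                            # torch.Size([2, 4])
```

In this sketch, the convolutional stem reduces spatial resolution before attention so that the token sequence stays short, which is one common way such hybrids keep parameter and compute budgets small enough for edge deployment.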