
Comparative analysis of machine learning and statistical models for cotton yield prediction in major growing districts of Karnataka, India

Abstract

Background

Cotton is one of the most important commercial crops after food crops, especially in countries like India, where it’s grown extensively under rainfed conditions. Because of its usage in multiple industries, such as textile, medicine, and automobile industries, it has greater commercial importance. The crop’s performance is greatly influenced by prevailing weather dynamics. As climate changes, assessing how weather changes affect crop performance is essential. Among various techniques that are available, crop models are the most effective and widely used tools for predicting yields.

Results

This study compares statistical and machine learning models to assess their ability to predict cotton yield across major producing districts of Karnataka, India, utilizing a long-term dataset spanning from 1990 to 2023 that includes yield and weather factors. The artificial neural networks (ANNs) performed best, with yield deviations within the acceptable range of ± 10% during both the vegetative stage (F1) and the mid stage (F2) of cotton. The model evaluation metrics such as root mean square error (RMSE), normalized root mean square error (nRMSE), and modelling efficiency (EF) were also within the acceptance limits in most districts. Furthermore, the tested ANN model was used to assess the importance of the dominant weather factors influencing crop yield in each district. Specifically, morning relative humidity as an individual parameter and its interaction with maximum and minimum temperature had a major influence on cotton yield in most of the districts for which yield was predicted. These differences highlighted the differential interactions of weather factors in cotton yield formation in each district, reflecting the individual response to each weather factor under different soils and management conditions across the major cotton growing districts of Karnataka.

Conclusions

Compared with statistical models, machine learning models such as ANNs showed higher efficiency in forecasting cotton yield because of their ability to consider the interactive effects of weather factors on yield formation at different growth stages. This highlights the suitability of ANNs for yield forecasting in rainfed conditions and for studying the relative impacts of weather factors on yield. Thus, the study aims to provide valuable insights to support stakeholders in planning effective crop management strategies and formulating relevant policies.

Introduction

Cotton (Gossypium hirsutum L.) is one of the most important and widely produced commercial crops in the world (Aslam et al. 2020) and is cultivated mainly in tropical regions under rainfed conditions. With respect to acreage, India ranks first in the world (13.06 million hectares) and second in production, with an annual production of 34.34 million bales (170 kg per bale in India). This crop is produced as a raw material for the textile industry and has important uses in medicine and automobile industries. As cotton is cultivated mainly in rainfed regions, its production dynamics rely on the dynamics of weather factors, which prevail during the crop growth period. Although the final yield of the crop is dependent on the interaction between the genotype and environment, the dynamics of weather and resource availability (water and nutrients) play a pivotal role in the yield potentials. The crop has shown a positive response to solar radiation and temperature under optimum availability of other resources like soil moisture and nutrients (Mao et al. 2019). Biomass formation is also closely related to the accumulation of resources such as photosynthetically active radiation (PAR), effective temperature accumulation, and soil water content (SWC) during the crop growth period (Wu et al. 2022). In addition to the variability in above ground microclimate, variations in soil temperature and moisture are known to affect root growth, which in turn affects above ground biomass formation (Tang et al. 2010). Many studies have been conducted to prove the intimate association of weather factors with the metabolism and physiological activation of cotton phenology (Wang et al. 2019). Resource scarcity is projected in the future because of the changing climate, so improving the utilization of light, temperature and water resources is highly important for sustainable production of cotton (Howden 2008). A key step toward this purpose is to advance the prediction of crop performance and estimate the major factors impacting growth and yield using theoretical and applied techniques.

Most yield forecasting studies have focused on food crops, such as wheat (Kogan et al. 2013), and rice (Wang et al. 2010), but there has also been considerable interest in forecasting important fibre crops, such as cotton (Baigorria et al. 2010). As these crops are cultivated under natural/rainfed conditions, weather-based crop yield forecasting is essential for shaping policies related to supply, trade, and production exchange (Dharmaraja et al. 2020). The reliability of such weather-based crop yield forecasting depends on the choice of model, its input requirements (Hara 2021; Chipanshi 2015), and the objective evaluation of model performance. Many techniques have been developed to forecast growth and to make the best forecast of cotton area, production, and yield in different cultivation conditions of India, but their suitability depends on the ability of the model to describe the observed data. Crop simulation models and statistical models are two broad approaches to yield forecasting (Bocca et al. 2016). Crop simulation models offer detailed insights into crop biology through their reliance on extensive data such as soil, plant, and weather data. However, these models often face challenges due to limited data availability. In response to these difficulties, statistical models based on weather parameters have been developed to provide reliable crop acreage estimation and yield predictions (Sharma et al. 2018). Although statistical models provide forecasts with reasonable precision, the calibration and testing of those models using historical datasets are crucial. Multiple linear regressions (MLRs) are commonly used statistical crop yield prediction models (Rai et al. 2013; Dhekale et al. 2014; Kumar et al. 2014). Vashisth et al. (2018) conducted studies on maize at the flowering stage and the grain filling stage with weather-based statistical model. 
However, there is a risk of model over-fitting when the number of samples is smaller than the number of predictors or when multicollinearity exists among the independent factors (Verma et al. 2016). To overcome such problems, feature selection techniques such as stepwise multiple linear regression (SMLR), least absolute shrinkage and selection operator (LASSO), and elastic net (ENet), or feature extraction techniques such as principal component analysis (PCA), are used (Das et al. 2017) in forecasting yields in many crops (Paswan et al. 2013; Das et al. 2018; Bali et al. 2021), showcasing their enhanced effectiveness in yield forecasting. Statistical models have the potential to expand the scope of advance yield estimation and examine more crop types, particularly those for which established process-based models are lacking due to a scarcity of crop-specific parameters. Going beyond major crops to include more will provide a better picture of future global food availability under climate change (Hu et al. 2024).

Considering the above background, this study used long-term weather and yield datasets spanning from 1990 to 2021 to identify the most suitable forecasting technique for predicting cotton yield in major production districts of Karnataka, India, to estimate the range of variability in cotton yield as predicted by different models, and to examine the factors that limit the ability of a model to predict the yield and the options to overcome these limitations. Weather-induced production variability impacts regional food security; thus, it is necessary to study the major weather factors behind crop production.

Materials and methods

Study districts

Among the top cotton producing states in India, Karnataka ranks fourth, accounting for 7% of the area under cotton in the country and 4% of the production, with a productivity of 653 kg·hm–2. The top ten districts of Karnataka were selected for the study, including Ballari, Belagavi, Chitradurga, Dharwad, Haveri, Kalaburagi, Koppal, Mysuru, Raichur, and Vijayapura, which collectively contribute approximately 70% of the state’s cotton area and production. The rainfall and rainy days of the ten districts are presented in Supplementary Table 1. In 2021, Kalaburagi emerged as a significant contributor to cotton cultivation, with an area of 67 065 hm2, producing 205 930 bales and achieving a higher productivity of 522 kg·hm–2. Raichur, with an extensive cultivation area of 169 518 hm2, also played a substantial role, producing 427 785 bales of cotton at a productivity of 429 kg·hm–2 (Table 1). In 2023, a notable change in the cotton area was observed across these districts. Raichur contributed the most to the cotton cultivation area (179 701 hm2), followed by Kalaburagi. These districts were the focus of the study for forecasting cotton yield using different models, considering their significant contribution to the state’s cotton production.

Table 1 The planted area, production, and productivity of cotton in top ten districts of Karnataka

The cotton yield distribution across ten districts from 1990 to 2021 is shown in Fig. 1, highlighting notable variations in yield patterns. Raichur has the highest variability, whereas Mysuru, Dharwad, Chitradurga, and Belagavi display moderate variability in yield. Koppal exhibits relatively low variability in yield, suggesting a more stable and consistent yield pattern. The median of yield is positioned towards the upper end of the range, indicating a tendency for higher yields. In Kalaburagi, the median also leans towards the upper end, indicating a tendency for higher yields, but with notable variability. Vijayapura shows a wide interquartile range, indicating significant variability in yield. The median yield is toward the lower end, suggesting that most yields are concentrated at the lower end of the range.

Fig. 1

Box plot representing district wise cotton yield distribution during 1990–2021. The central box in each plot represents the interquartile range, with the median line inside the box. The whiskers extend to the minimum and maximum values

Dataset sources

A long-term (from 1990 to 2021) dataset on planted area, production, and productivity of cotton during the kharif season (crop sown during the south-west monsoon season and grown under rainfed conditions) in major growing districts of Karnataka was sourced from the Directorate of Economics and Statistics, Government of Karnataka (https://des.karnataka.gov.in). The datasets were checked for the presence of outliers, i.e., extremes, and were detrended based on their regression on the time factor. After each detrending step, the significance of the trend was checked; if there was no significant change in yield with time, the same process was repeated until a significant change in yield with time was observed. A dataset of daily weather parameters for the study years, i.e., maximum and minimum temperature, morning and evening relative humidity, and rainfall, was sourced from the India Meteorological Department (https://mausam.imd.gov.in) and derived using the inverse distance weighting method.
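The detrending step can be illustrated with a minimal sketch. It assumes a simple linear trend fitted by ordinary least squares of yield on year and returns the residuals as the detrended yield; the function names are hypothetical, and the significance testing and repeated detrending described above are not reproduced here.

```python
def fit_linear_trend(years, yields):
    """Ordinary least squares fit of yield on year; returns (slope, intercept)."""
    n = len(years)
    mean_t = sum(years) / n
    mean_y = sum(yields) / n
    sxx = sum((t - mean_t) ** 2 for t in years)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(years, yields))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_t
    return slope, intercept


def detrend(years, yields):
    """Remove the fitted linear time trend and return the residual yields."""
    slope, intercept = fit_linear_trend(years, yields)
    return [y - (intercept + slope * t) for t, y in zip(years, yields)]


# Illustrative values only (not data from the study).
years = [1990, 1991, 1992, 1993]
yields = [200.0, 210.0, 220.0, 230.0]
residuals = detrend(years, yields)
```

With a perfectly linear series such as the illustrative one, the residuals are all close to zero; real yield series would leave year-to-year fluctuations that can then be related to weather.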

Calculation of weather indices

To formulate a composite model that considers the individual and interactive impacts of weather variables, a set of independent factors consisting of weather variables and weather indices was calculated. These factors can be classified into two categories: unweighted and weighted weather variables. Unweighted weather variables represent direct observations, whereas weighted weather variables were calculated to account for the relative impact of weather factors on crop performance. To account for yield variability due to both sole and interactive effects of weather factors, forecasting models that depend on both individual and interactive weather factors, in unweighted and weighted form, are often employed for predicting yields in crops like rice, wheat, sugarcane and potato (Manideep 2022; Mehta et al. 2010). To generate these weather indices, two distinct methodologies were used. The unweighted weather indices were computed by aggregating the weekly weather variables encountered throughout the crop period. The weighted indices, on the other hand, were established by summing the product of the correlation coefficient and the value of the corresponding weekly weather variable. The formulas for computing unweighted and weighted weather indices are summarized below. By doing this, a time series dataset comprising 32 (from 1990 to 2021) weather variables (Table 2) and yield was generated.

Table 2 Derived unweighted and weighted indices of composite weather parameters for model analysis

Unweighted weather indices:

$$Z_{ij}=\sum_{w\;=\;1}^mx_{iw}$$
$$Z_{ii'j}=\sum_{w\;=\;1}^mx_{iw}x_{i'w}$$

Weighted weather indices:

$$Z_{ij}=\sum_{w\;=\;1}^mr_{iw}^jx_{iw}$$
$$Z_{ii'j}=\sum_{w\;=\;1}^mr_{ii'w}^jx_{iw}x_{i'w}$$

where \(x_{iw}\) and \(x_{i'w}\) are the values of two distinct weather variables (the ith and i′th) for the same time period (the wth week), \(r_{iw}^{j}\) represents the correlation coefficient between the detrended yield and the ith weather variable during the wth week of the jth time period, and \(r_{ii'w}^{j}\) represents the correlation coefficient between the detrended yield and the interaction of the ith and i′th weather variables during the wth week of the jth time period.
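The index formulas can be sketched in code as follows. This is a minimal illustration, assuming weekly values are stored per year as lists and that the weights are Pearson correlation coefficients between each week's values across years and the detrended yield; all function and variable names are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    saa = sum((x - mean_a) ** 2 for x in a)
    sbb = sum((y - mean_b) ** 2 for y in b)
    if saa == 0 or sbb == 0:
        return 0.0
    return sab / math.sqrt(saa * sbb)


def weekly_correlations(weekly_by_year, detrended_yield):
    """Correlation of each week's values (across years) with the detrended yield."""
    n_weeks = len(weekly_by_year[0])
    return [pearson([year[w] for year in weekly_by_year], detrended_yield)
            for w in range(n_weeks)]


def interaction(weekly_a, weekly_b):
    """Week-by-week product of two weather variables (the x_iw * x_i'w term)."""
    return [x * z for x, z in zip(weekly_a, weekly_b)]


def unweighted_index(weekly_values):
    """Sum of weekly values over the crop period."""
    return sum(weekly_values)


def weighted_index(weekly_values, correlations):
    """Sum of correlation-weighted weekly values over the crop period."""
    return sum(r * x for r, x in zip(correlations, weekly_values))


# Illustrative values only: three years, two weeks of maximum temperature.
tmax = [[30.0, 32.0], [31.0, 30.0], [32.0, 31.0]]
detrended_yield = [1.0, 2.0, 3.0]
r = weekly_correlations(tmax, detrended_yield)
z_unweighted = [unweighted_index(year) for year in tmax]
z_weighted = [weighted_index(year, r) for year in tmax]
```

For an interaction index, the same two index functions can be applied to the output of `interaction`, with correlations computed on the product series.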

Brief background of multivariate models used in the study

The details of multivariate models used in this study to develop kharif cotton yield prediction are described below and the structured framework for yield forecasting models is illustrated in Fig. 2.

Fig. 2

Framework representing different stages in model for yield forecasting of cotton

Stepwise multiple linear regression

MLR is the standard and simplest approach for developing calibration models. However, its application to datasets with more independent variables and a greater sample size is not always successful (Balabin 2011). Feature selection in the form of SMLR gives good results on large datasets. A stepwise regression procedure was adopted to select the best regression variables among many independent variables (Singh 2014). The ICAR-Indian Agricultural Statistical Research Institute developed models to express the effect of weather variables on crop yield, in which yield is the dependent variable and weekly weather variables are the independent variables. Weekly weather variables are generated from daily data by averaging daily maximum temperature, minimum temperature, morning relative humidity, and evening relative humidity, and by summing daily rainfall. Two weather indices (unweighted and weighted) are developed for each weather variable, and indices are also generated for the interactions of weather variables; the resulting combination of weather indices is listed in Table 2. Regression analysis is used to fit the equations, and the weighting coefficients in the equations are obtained empirically using standard statistical procedures such as multivariable regression analysis in SPSS software. In this study, regression models including SMLR were used to analyze and predict the relationship between weather variables and crop yield.

Artificial neural networks

Artificial intelligence (AI) methodologies provide a more effective means of tackling complexities within natural systems characterized by a multitude of inputs. ANNs are nonlinear and non-statistical models that mimic the learning process of the human brain (Starks 2019; Lawrence 1994), and no assumption of normality of the data is implied. Achieving optimal crop yield at minimal cost is a primary objective in agricultural production. The timely identification and management of issues associated with crop yield indicators play a pivotal role in amplifying overall productivity. The recent application of AI, encompassing technologies like ANNs, fuzzy systems, and genetic algorithms, has showcased enhanced efficiency in addressing challenges linked to agricultural yield. In the current study, a three-layered feed-forward artificial neural network comprising input, hidden, and output layers was used. The neurons or nodes in each layer are interconnected, with the number of nodes in the input and output layers predetermined by the dataset. Care is needed in choosing the optimum number of hidden nodes when implementing the ANN for yield forecasting; this was done using the ‘train’ function of the ‘caret’ package with the method ‘nnet’ and 10-fold cross-validation in R software (Kuhn 2008). The ANN model is iteratively trained and evaluated until its predictive accuracy is maximized (Yang 2017). The analysis involved allocating 80% of the dataset for calibration (training) and the remaining 20% for validation (testing). A comprehensive set of 32 weather indices was utilized as inputs, with yield serving as the dependent variable (Fig. 3).
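The data partitioning described above can be sketched as follows. This is an illustration in Python rather than the R workflow used in the study, and it assumes a random shuffle with a fixed seed and round-robin fold assignment; the function names are hypothetical and the study's actual partitioning procedure is not specified in this detail.

```python
import random


def train_test_split(n_samples, train_fraction=0.8, seed=0):
    """Randomly split sample indices into calibration and validation sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    cut = int(round(n_samples * train_fraction))
    return sorted(indices[:cut]), sorted(indices[cut:])


def k_fold_indices(indices, k=10):
    """Yield (training, held-out) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for held_out in folds:
        held = set(held_out)
        training = [i for i in indices if i not in held]
        yield training, held_out


# Illustrative use: 32 yearly records, 80/20 split, 10 folds on the calibration set.
calibration, validation = train_test_split(32)
folds = list(k_fold_indices(calibration, k=10))
```

Each candidate network configuration would be fitted on the training part of every fold and scored on the held-out part, and the configuration with the best average score retained.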

Fig. 3

Graphical image of the established artificial neural network for cotton yield forecasting (I1, I2, I3………. I31 represent the input layer consisting of independent weather indices, B1 represents the hidden layer accounting for the intermediate effect of the combination of weather indices as eight neurons viz., H1, H2…H8, B2 indicates the second hidden layer addressing the impact of the first hidden layer and at the end, O1 indicates the output i.e. yield.)

Least absolute shrinkage and selection operator

LASSO and ENet are two shrinkage regression methods used for handling multicollinearity by penalizing the magnitude of regression coefficients (Piaskowski et al. 2016). LASSO reduces the number of predictors in a regression model and identifies the important ones. By shrinking the coefficients of less useful predictors to zero, LASSO can automatically select important variables and exclude the rest from the model. By adopting a regularization technique, the variance of the estimated regression coefficients is reduced, and thus, the estimators are more stable.
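The way LASSO shrinks coefficients to exactly zero can be illustrated with a small coordinate descent sketch. It assumes centred predictors and response (no intercept) and the objective (1/2n)·Σ(y − Xβ)² + λ·Σ|β|; the function names and the toy data are hypothetical and do not reproduce the software used in the study.

```python
def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values within [-lam, lam] become exactly 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimise (1/2n)*sum((y - X beta)^2) + lam*sum(|beta|) by cyclic updates."""
    n, p = len(y), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho, z = 0.0, 0.0
            for i in range(n):
                # Prediction from all predictors except j (partial residual).
                partial = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - partial)
                z += X[i][j] ** 2
            rho /= n
            z /= n
            beta[j] = soft_threshold(rho, lam) / z if z > 0 else 0.0
    return beta


# Toy data: the response depends strongly on the first predictor, weakly on the second.
X = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
y = [2.1, -1.9, 1.9, -2.1]
beta = lasso_coordinate_descent(X, y, lam=0.5)
```

In this toy case the penalty removes the weak second predictor entirely and shrinks the first coefficient, which is the variable selection behaviour described above.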

Random forest (RF)

The RF model is a supervised technique for both classification and regression, including non-linear problems. It is an ensemble learning method based on bagging: it constructs multiple decision trees during training and combines them to yield better results, giving class labels for classification problems or the mean prediction for regression. An advantage of the RF model is that it handles missing values and maintains accuracy (Fang et al. 2021). It can also be used in both univariate and multivariate time series forecasts by manually creating lag and seasonal component variables. Different algorithms react differently depending on the nature of the data.

Autoregressive integrated moving average with exogenous variables (ARIMAX)

The ARIMAX model is an extended version of the ARIMA model. The ARIMAX model is linear in nature and hence does not explain the nonlinearity components. Here, we have tried to improve the performance of the ARIMAX model by explaining residuals through machine learning approaches such as ANN and support vector machines (Zhang 2003).

Model performance evaluation

Model performance was tested using different statistical evaluation measures. The use of more than one measure helps to evaluate a single model’s performance and to compare multiple models. In this study, the coefficient of determination (R2), root mean square error (RMSE), normalized root mean square error (nRMSE), modelling efficiency (EF), and mean absolute percentage error (MAPE) were calculated.

The R2 is important for measuring the effectiveness of the models (Shaikh et al. 2021; Ağbulut et al. 2020) and ranges from 0 to 1. It provides insight into how well the trend of the model results tracks the trend of the observed data (Ağbulut et al. 2021); a value closer to 1 indicates that the model is more accurate. RMSE measures the average magnitude of the error, i.e., the deviation of the predictions from the actual values. An RMSE value of 0 indicates a perfect fit, and the lower the RMSE, the better the model and its predictions. The nRMSE expresses the relative spread of the error around the measurements and is used to classify model performance into distinct groups (excellent, good, fair, or poor when the values are in the range of < 10%, 10–20%, 20–30%, or > 30%, respectively). The modelling efficiency indicates whether the model describes the data better than simply the average of the observations; the optimal values are those near 1 (Thimmegowda et al. 2023). The MAPE is the mean of the absolute errors expressed as a percentage of the actual values (Kumar et al. 2020). For a good model, a smaller MAPE value is desirable. A MAPE of less than 5% is considered an indication that the forecast is acceptably accurate. A MAPE greater than 10% but less than 25% indicates low accuracy, and a MAPE greater than 25% indicates very low accuracy. The model with a lower MAPE is preferred for forecasting purposes.
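These measures can be computed with a short sketch. It assumes the commonly used definitions: nRMSE is the RMSE expressed as a percentage of the observed mean, and EF is one minus the ratio of the residual sum of squares to the sum of squares of the observations about their mean. The function names are hypothetical, and the handling of values falling exactly on a class boundary is an assumption.

```python
import math


def rmse(observed, predicted):
    """Root mean square error."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def nrmse(observed, predicted):
    """RMSE as a percentage of the observed mean (assumed normalisation)."""
    mean_obs = sum(observed) / len(observed)
    return 100.0 * rmse(observed, predicted) / mean_obs


def modelling_efficiency(observed, predicted):
    """1 - residual sum of squares / sum of squares about the observed mean."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def mape(observed, predicted):
    """Mean absolute percentage error."""
    n = len(observed)
    return 100.0 * sum(abs(o - p) / abs(o) for o, p in zip(observed, predicted)) / n


def classify_nrmse(value):
    """Performance class from nRMSE (%), following the ranges given in the text."""
    if value < 10:
        return "excellent"
    if value < 20:
        return "good"
    if value < 30:
        return "fair"
    return "poor"


# Illustrative values only (kg per hectare), not data from the study.
observed = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 300.0]
summary = {
    "RMSE": rmse(observed, predicted),
    "nRMSE": nrmse(observed, predicted),
    "EF": modelling_efficiency(observed, predicted),
    "MAPE": mape(observed, predicted),
}
```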

Stage specific evaluation of cotton yield prediction models using weather data

The quantification of weather impacts on crop growth is a cumbersome task, as weather factors affect yield through their direct and interactive effects. In our study, kharif cotton yield was forecasted at the vegetative stage (leaf development, stem growth, and root expansion, i.e., 40 to 60 days after sowing) and the mid stage (flowering and fruit development, i.e., 80 to 100 days after sowing) using five different models (SMLR, LASSO, ANN, RF, and ARIMAX). The crop duration in cotton is around 160–190 days, depending on the variety and growing conditions. Here the models were calibrated (1990–2018) and validated (2019–2021) using the historical dataset of weather variables and crop yield datasets, and the yield in 2023 was forecasted. Previous studies have reported that data from a couple of months prior to harvest can be used for short range crop predictions using statistical regression models (Chipanshi 2015; Mkhabela 2011; Seiler 2000). A similar methodology was used to analyse yield prediction with 16 standard meteorological weeks (SMWs) corresponding to the vegetative stage, and 20 SMWs corresponding to the mid stage of crop growth. This approach ensured that the stage-specific weather influences were accurately captured and integrated into the prediction model.

Results

Cotton yield prediction using SMLR model

The kharif cotton yield was validated in 2020 and 2021 at the F1 and F2 stages using SMLR across ten districts (Table 3). The model’s prediction accuracy is shown by the actual yield, the predicted yield, and the percent deviation between them. Prediction accuracy was good, with low deviations, in a few districts, while other districts showed large deviations, indicating varying performance of the SMLR model across districts and stages. Negative deviations indicated that the model overestimated the yield, and positive deviations indicated underestimation. The yield forecasted in Dharwad district at both stages in 2020 and 2021 was more accurate than in the other districts, while the forecast results for Kalaburagi were the worst, with overestimation reaching − 77% at the F1 stage and − 30% at the F2 stage, followed by Raichur district with the same trend.
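The sign convention described above (negative for overestimation, positive for underestimation) is consistent with computing the deviation as actual minus predicted, relative to the actual yield. The sketch below assumes that formula, which is inferred from the sign convention rather than stated in the text; the function names and example values are hypothetical.

```python
def percent_deviation(actual, predicted):
    """Deviation (%) relative to the actual yield; negative means overestimation."""
    return 100.0 * (actual - predicted) / actual


def within_acceptable_limit(deviation, limit=10.0):
    """True when the deviation lies within +/- limit percent."""
    return abs(deviation) <= limit


# Illustrative values only (kg per hectare), not data from the study.
over = percent_deviation(actual=400.0, predicted=440.0)   # model predicts too high
under = percent_deviation(actual=400.0, predicted=380.0)  # model predicts too low
```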

Table 3 District-wise deviation percent of kharif cotton yield at F1 and F2 stages validated in 2020 and 2021 using SMLR model

Cotton yield prediction using ARIMAX model

The deviations of the cotton yield predicted by the ARIMAX model from the actual yield in the ten districts are shown in Table 4. The RMSE ranged from 41 (Chitradurga and Vijayapura) to 87 (Koppal) at the F1 stage and from 32 (Haveri) to 81 (Koppal) at the F2 stage. Except for Koppal district at the F1 stage and Belagavi district at the F2 stage, all districts had a MAPE value of less than 25% at both the F1 and F2 stages, indicating lower but acceptable accuracy. At the F1 stage, the yield was overestimated in several districts, with deviations ranging from − 16% to − 87% in 2020 and from − 4% to − 49% in 2021, whereas it was underestimated in the remaining districts (for example, Ballari, Chitradurga, Koppal, Mysuru, and Raichur in 2020 and Chitradurga, Dharwad, and Vijayapura in 2021). Similar results were also observed at the F2 stage. The model consistently performed better in districts like Dharwad and Chitradurga, with smaller deviations and lower RMSE. On the other hand, Kalaburagi and Koppal districts showed large deviations with higher RMSE, suggesting that the model needs improvement or that external factors are influencing cotton yield in these areas.

Table 4 District-wise deviation percent of kharif cotton yield at F1 and F2 stages validated in 2020 and 2021 using ARIMAX model

Cotton yield prediction using LASSO model

The yields forecasted for 2020 and 2021 at the F1 and F2 stages using the LASSO model were calibrated and validated against the actual yields (Table 5). The RMSE ranged from 33 (Dharwad) to 91 (Koppal) at the F1 stage and from 31 (Haveri) to 85 (Koppal) at the F2 stage. At the F1 stage, all districts except Koppal had MAPE values of less than 25%, indicating lower but acceptable accuracy, and at the F2 stage, the MAPE values were less than 25% for all districts except Belagavi and Koppal. The LASSO model demonstrated variable performance across districts and years. In 2020 at the F1 stage, the yield was overestimated in several districts, with deviations of − 18% (Belagavi), − 4% (Dharwad), − 19% (Haveri), − 79% (Kalaburagi), and − 35% (Vijayapura). At the F2 stage in 2021, the percent deviation was − 33% in Belagavi, − 13% in Haveri, − 41% in Kalaburagi, − 79% in Kalaburagi, and − 17% in Raichur district, while the other districts showed underestimates. Underestimates and overestimates were observed in different districts and different years, so further analysis and refinement of the LASSO model may be necessary to improve accuracy, especially in districts where significant deviations were observed.

Table 5 District-wise deviation percent of kharif cotton yield at F1 and F2 stages validated in 2020 and 2021 using LASSO model

Cotton yield prediction using RF model

The cotton yield prediction was validated in 2020 and 2021 at the F1 and F2 stages using random forest (Table 6). At the F1 stage, the RMSE of the calibrated model ranged from 29 (Raichur) to 105 (Koppal) and the MAPE from 13 (Raichur) to 39 (Koppal). Similarly, at the F2 stage, the RMSE ranged from 28 (Mysuru) to 89 (Kalaburagi) and the MAPE from 10 (Mysuru) to 30 (Belagavi). Dharwad and Vijayapura districts showed smaller deviations, indicating more accurate predictions than in the other districts, whereas districts like Haveri, Kalaburagi, and Koppal showed higher deviations, particularly at the F1 stage. However, the validated results at the F1 and F2 stages in 2020 and 2021 showed a mix of underestimation and overestimation by the model; in a few districts, the deviation percentage was within the acceptable limit, and in other districts, the predicted yield tended to vary.

Table 6 District-wise deviation percent of kharif cotton yield at F1 and F2 stages validated in 2020 and 2021 using random forest model

Cotton yield prediction using ANN model

The percentage difference between the forecasted and actual yields was validated for 2020 and 2021 to determine the accuracy of the ANN model (Table 7). In 2020, at the F1 stage, Koppal (–5.2%), Kalaburagi (–13.6%), Mysuru (–0.6%), Raichur (–2.1%), and Vijayapura (–0.4%) exhibited overestimation, and the other districts exhibited underestimation, ranging from 0.6 to 3.5%. At the F2 stage, the yields of Ballari and Vijayapura were overestimated, with deviations of − 0.4% and − 9.3%, respectively.

Table 7 District-wise deviation percent of kharif cotton yield at F1 and F2 stages validated in 2020 and 2021 using ANN model

Similarly, at the F1 stage in 2021, the cotton yield was overestimated in two districts (Mysuru and Vijayapura), with a − 1% deviation each, whereas the forecasted yields were underestimated in the other eight districts. At the F2 stage, the yield was underestimated in six of the ten districts, with deviations ranging from 0.3% to 9.1%, and overestimated in the rest. The results revealed an excellent agreement between the actual and forecasted yields. The errors calculated by this model were within the acceptable limits, i.e., ± 10%, for most of the districts except for Kalaburagi at the F1 stage in both 2020 and 2021; hence, this model is well suited for yield prediction.

The performance of the calibrated kharif cotton yield prediction model using ANN was evaluated across various districts (Table 8). A model with smaller RMSE and nRMSE and higher EF values is considered to be better. The ANN models were evaluated for the F1 and F2 stages, with RMSE values ranging from 1.30 to 49.0 for the F1 stage and 1.8 to 60.1 for the F2 stage. The nRMSE values ranged from 0.4 to 16.9 for the F1 stage and 0.6 to 16.9 for the F2 stage, while the EF values ranged from 0.9 to 1.0 for both stages. Among the districts for which yield was predicted, at the F1 stage, the lowest RMSE (1.30) and nRMSE (0.4) and the highest EF (1.00) were found in Haveri district, and the highest values were observed in Belagavi district, with 49.0, 16.9, and 0.80 for RMSE, nRMSE, and EF, respectively. Similarly, at the F2 stage, the lowest RMSE (1.8) and nRMSE (0.6) and the highest EF (1.00) were found in Haveri district, and higher RMSE values were observed in Kalaburagi and Belagavi districts, with the highest nRMSE of 19.7. Overall, the model performed excellently, with an nRMSE value of less than 10% (categorized as excellent) for eight out of ten districts at the F1 stage and for seven districts at the F2 stage. Moreover, the nRMSE value was categorized as good in two districts at the F1 stage and in three districts at the F2 stage.

Table 8 Statistical evaluation of validated kharif cotton yield using ANN model

Inter comparison of models for their yield predictability

The kharif cotton yield was forecasted in 2023 at the F1 stage using SMLR for ten districts. The model performance was evaluated using R2, the F value, and the standard error (SE) of the estimates resulting from different weather variables (Table 9). The R2 value of the model ranged from 0.52 to 0.87. The model generally performed well across the districts, with R2 values above 0.7 for nine districts. Dharwad district had a lower SE (44.16), indicating relatively accurate predictions, and a higher R2 (0.78). Kalaburagi district had a higher SE (112.16), suggesting less accurate predictions, but still with a moderate R2 (0.76). The R2 value of Belagavi district was less than 0.6, indicating only a moderate fit and suggesting that the model may not fit the data well in that region. While R2 provides an overall measure of goodness-of-fit, it is essential to consider the specific context of each district and the agricultural factors that might influence the predictions.

Table 9 Kharif cotton yield forecast in 2023 at the F1 stage using SMLR

Similarly, the yield was forecasted in 2023 at the F2 stage using SMLR for ten districts, and the regression equations, the weather variables entering them, and the model performance were evaluated (Table 10). The model generally performed well across all districts, with consistently high R² values. The low SE values suggest that the model provides accurate estimates for most districts. Dharwad district had a low SE (41.91), indicating relatively accurate predictions, and the highest R² (0.89). Kalaburagi district had a higher SE (95.26), suggesting less accurate predictions, but still a good R² of 0.84. All districts had an R² value above 0.7, indicating a stronger fit for predicting cotton yield at the F2 stage than at the F1 stage.

Table 10 Kharif cotton yield forecast in 2023 at the F2 stage using SMLR
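The fit statistics reported for SMLR (R², F value, and SE of the estimate) can be sketched as follows. This is a minimal illustration under the assumption of an ordinary least squares fit with an intercept, where the regression sum of squares equals the total minus the residual sum of squares. The function name and the sample numbers are hypothetical and are not taken from Tables 9 or 10.

```python
import math

def regression_fit_stats(observed, fitted, n_predictors):
    """Return (R2, F value, SE of estimate) for a fitted linear model.

    Assumes an OLS fit with an intercept, so SSR = SST - SSE and the
    residual degrees of freedom are n - k - 1.
    """
    n = len(observed)
    mean_obs = sum(observed) / n
    sse = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    sst = sum((o - mean_obs) ** 2 for o in observed)
    df_resid = n - n_predictors - 1
    r_squared = 1.0 - sse / sst
    f_value = ((sst - sse) / n_predictors) / (sse / df_resid)
    se_estimate = math.sqrt(sse / df_resid)
    return r_squared, f_value, se_estimate

# Illustrative values only: five observations and a one-predictor model.
obs = [1.0, 2.0, 3.0, 4.0, 5.0]
fit = [1.2, 1.9, 3.1, 3.9, 4.9]
r2, f_val, se = regression_fit_stats(obs, fit, n_predictors=1)
```

Because SE is expressed in yield units, it depends on the yield level of a district, which is one reason a district can combine a comparatively high SE with a reasonable R².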

The kharif cotton yield was forecasted in 2023 at the F1 and F2 stages for the ten districts using ARIMAX, LASSO, RF, and ANN (Tables 11 and 12). The estimated yield at the F1 stage ranged from 206 kg·hm⁻² (Vijayapura) to 916 kg·hm⁻² (Kalaburagi) using ARIMAX, 196 kg·hm⁻² (Chitradurga) to 860 kg·hm⁻² (Kalaburagi) using LASSO, 170 kg·hm⁻² (Chitradurga) to 792 kg·hm⁻² (Kalaburagi) using RF, and 145 kg·hm⁻² (Chitradurga) to 486 kg·hm⁻² (Vijayapura) using ANN.

Table 11 Kharif cotton yield forecast in 2023 at the F1 stage using different machine learning models
Table 12 Kharif cotton yield forecast in 2023 at the F2 stage using different machine learning models

Similarly, at the F2 stage, the estimated yields ranged from 201 kg·hm⁻² (Mysuru) to 883 kg·hm⁻² (Kalaburagi) using ARIMAX, 144 kg·hm⁻² (Mysuru) to 839 kg·hm⁻² (Kalaburagi) using LASSO, 144 kg·hm⁻² (Koppal) to 797 kg·hm⁻² (Kalaburagi) using RF, and 193 kg·hm⁻² (Mysuru) to 819 kg·hm⁻² (Ballari) using ANN.

The district average yields forecasted at the F1 stage in 2023 were 401, 414, 408, and 380 kg·hm⁻² using ARIMAX, LASSO, RF, and ANN, respectively. Similarly, the average yields predicted at the F2 stage in 2023 were 424, 408, 383, and 395 kg·hm⁻² using the respective models. Under all models, the predicted mean yield in 2023 was higher than the long-term average yield (1990–2021) of 289 kg·hm⁻² (Fig. 4).
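The gap between the 2023 mean forecasts and the long-term average can be made explicit with a short worked calculation. The sketch below uses only the means quoted in the paragraph above; the variable and function names are hypothetical.

```python
# Mean forecasts for 2023 and the long-term mean, as quoted in the text.
LONG_TERM_MEAN = 289.0  # district average yield, 1990-2021
MODELS = ("ARIMAX", "LASSO", "RF", "ANN")
F1_MEANS = (401.0, 414.0, 408.0, 380.0)
F2_MEANS = (424.0, 408.0, 383.0, 395.0)

def percent_above(forecast, baseline):
    """Relative difference of a forecast from the baseline, in percent."""
    return 100.0 * (forecast - baseline) / baseline

def deviations(means, baseline=LONG_TERM_MEAN):
    """Map each model name to its percent deviation, rounded to 0.1."""
    return {
        model: round(percent_above(value, baseline), 1)
        for model, value in zip(MODELS, means)
    }

f1_deviation = deviations(F1_MEANS)
f2_deviation = deviations(F2_MEANS)
```

With these figures, every model's mean forecast lies roughly 31% to 47% above the long-term mean, with ANN at the F1 stage closest to it and ARIMAX at the F2 stage furthest from it.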

Fig. 4 Inter comparison of different multivariate models for their kharif cotton yield predictability during the vegetative (F1) and mid (F2) stages in major cotton growing districts of Karnataka

Assessment of major weather factors influencing cotton yield

As the performance of the ANN was statistically good in comparison with the other models, the tested model was further used to assess variable importance. Variable importance assessment is a statistical methodology commonly used to identify the variables that contribute most to the dependent parameter, and it depends on the ‘weights’ assigned by the ANN during model formulation. The weather variables that strongly influenced cotton production in each district in the present study were identified (Figs. 5 and 6). Firstly, ANN and SMLR differed in the important variables they identified, because SMLR is limited in its ability to consider a large number of variables beyond the input variables. The ANN, in contrast, treats the interaction between two variables as a new variable and assigns ‘weights’ to it after iterating over all other variable combinations. Secondly, the type of important variable identified by the ANN differed among districts because of the relative variability in the interaction of weather factors in each district (Das 2018). The districts differed in variable importance for cotton yield formation during the vegetative stage (F1); however, the most influential factor, the interaction between Tmax and Rh II (Z151), was shared by Dharwad and Vijayapura districts. In Kalaburagi, Belagavi, and Mysuru districts, the interaction between Tmin and Rf (Z230) was highly influential on yield. In the remaining districts, the most influential variables were mixed, signifying a differential role of weather variables in yield formation across districts.
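The text states only that variable importance depends on the weights assigned by the ANN and does not name the algorithm used. One common weights-based approach for a network with a single hidden layer is Garson-style weight partitioning, sketched below as an illustration rather than as the method of this study. The function name, the weight layout, and the toy weights are all assumptions; the index labels are borrowed from the text only to make the output readable.

```python
def garson_importance(input_hidden, hidden_output):
    """Relative importance of each input from network weights.

    input_hidden[i][j] is the weight from input i to hidden unit j;
    hidden_output[j] is the weight from hidden unit j to the output.
    For each hidden unit, the absolute input weights are converted to
    shares, scaled by the absolute output weight, and summed per input.
    """
    n_inputs = len(input_hidden)
    n_hidden = len(hidden_output)
    scores = [0.0] * n_inputs
    for j in range(n_hidden):
        column_total = sum(abs(input_hidden[i][j]) for i in range(n_inputs))
        for i in range(n_inputs):
            share = abs(input_hidden[i][j]) / column_total
            scores[i] += share * abs(hidden_output[j])
    total = sum(scores)
    return [score / total for score in scores]

# Toy weights (made up) for three weather indices and two hidden units.
index_names = ["Z151", "Z230", "Z130"]
toy_input_hidden = [[0.8, -0.2], [0.1, 0.9], [-0.4, 0.3]]
toy_hidden_output = [1.5, -0.5]

importance = garson_importance(toy_input_hidden, toy_hidden_output)
ranking = sorted(zip(index_names, importance), key=lambda pair: pair[1], reverse=True)
```

Because this approach works with absolute weights, it ranks variables by the strength of their contribution but does not indicate whether the effect on yield is positive or negative.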

Fig. 5 Importance of top 10 weather indices in predicting cotton yield using the ANN model at the F1 stage. The y-axis indicates the weather indices, and the x-axis indicates the importance of the particular feature in predicting the yield

Fig. 6 Importance of top 10 weather indices in predicting cotton yield using the ANN model at the F2 stage. The y-axis indicates the weather indices, and the x-axis indicates the importance of the particular feature in predicting the yield

During the mid stage (F2), the influential weather factors also differed among districts. In Mysuru and Raichur districts, the unweighted interaction of Tmax and Rf (Z130) was responsible for yield formation, whereas in Dharwad and Ballari, the weighted interaction of Tmin and Rf (Z231) was responsible. In the remaining six districts, a mixed influence of weather factors was observed. The influence of critical weather variables varied notably among districts and growth stages. This variability is largely due to the crop’s specific weather requirements for optimal growth and yield. For example, in dry farming districts such as Kalaburagi and Ballari, soil moisture (rainfall) is limited and the scope for evapotranspiration is high; hence, yield is most likely to be influenced by these factors. These variables not only affect crop yield directly but also influence pest and disease epidemiology, which in turn affects cotton yield (Madasamy et al. 2020). Previous correlation studies have revealed a positive correlation of morning and evening relative humidity, and a negative correlation of maximum and minimum temperature, with the population of sucking insect pests (Shivaray Navi et al. 2021; Krishna et al. 2020).
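The distinction between unweighted and weighted interaction indices (for example Z130 versus Z231) can be illustrated with a short sketch. It assumes a formulation that is common in weather-index yield models: the unweighted index is the sum over weeks of the product of two weather variables, and the weighted index is the same sum with each week weighted by the correlation between yield and that week's product across years. The exact formulation used in this study may differ (for instance in detrending the yield or in normalizing the weights), and all names and numbers below are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def interaction_indices(var_a, var_b, yields):
    """Per-year unweighted and weighted interaction indices.

    var_a[year][week] and var_b[year][week] hold weekly weather values;
    yields[year] holds the yield. Assumed formulation, see lead-in.
    """
    n_years = len(yields)
    n_weeks = len(var_a[0])
    products = [
        [var_a[y][w] * var_b[y][w] for w in range(n_weeks)]
        for y in range(n_years)
    ]
    # One correlation per week, computed across years.
    weights = [
        pearson([products[y][w] for y in range(n_years)], yields)
        for w in range(n_weeks)
    ]
    unweighted = [sum(products[y]) for y in range(n_years)]
    weighted = [
        sum(weights[w] * products[y][w] for w in range(n_weeks))
        for y in range(n_years)
    ]
    return unweighted, weighted

# Toy data (made up): three years, two weeks.
toy_a = [[1.0, 2.0], [2.0, 3.0], [3.0, 1.0]]
toy_b = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
toy_yield = [1.0, 2.0, 3.0]
unweighted_index, weighted_index = interaction_indices(toy_a, toy_b, toy_yield)
```

In this formulation, weeks whose weather is more strongly correlated with yield contribute more to the weighted index, whereas the unweighted index treats all weeks equally.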

Discussion

Crop yield is a complex function of edaphic (soil), climatic, and management factors, and it depends on the variability among them. When the edaphic factors are relatively stable and management is constant, yield is determined mainly by climatic factors. Being dynamic in time and space, these factors have a variable impact on the crop. Although efforts have been made to estimate crop yield under climate variability, traditional techniques such as SMLR fail to capture the complex interactive effects of climatic parameters, which necessitates the application of machine learning models. These models also differ in their ability to capture the influence of weather factors; models such as ANNs have been the most successful in predicting yield (Khaki 2021; Alvarez 2009; Li et al. 2007).

To evaluate the effectiveness of ANN in cotton yield prediction, the different methodologies were compared based on RMSE and MAPE values during model calibration. The ANN approach outperformed the other methods, as evidenced by its lower error values, highlighting the superiority of the ANN model in predicting yield across different districts of Karnataka (Table 8). The superiority of the ANN approach over conventional empirical models has also been reported for predicting the yield of maize (Uno et al. 2005), rice (Paswan et al. 2013), and other food crops (Behroozi-Khazaei et al. 2017; Basir et al. 2021). Based on nRMSE during model calibration, the ANN approach gave excellent results at the F1 stage in all districts except Belagavi and Raichur; at the F2 stage, performance was good in Dharwad, Belagavi, and Kalaburagi districts and excellent in the remaining districts. This might be due to the ability of ANN to account for collinearity between weather variables in yield prediction (Haghverdi 2018; Abrouguia 2019). Variations in average weather patterns and extreme weather conditions pose major risks to crop production worldwide. Machine learning algorithms offer a reliable method for yield forecasting with lower error. Proper tuning of model parameters and the inclusion of large datasets for model calibration and validation are key to successful prediction. Studies on the effects of remote sensing, data size, and climate on cotton yield prediction indicate that cotton yield is affected by many factors, which can be broadly categorized as genetics, environment, and management practices (Sawan 2017; Bakhsh 2005; Chaudhry 2009; Haghverdi 2018; Pokhrel 2018; Niedbala 2019). Therefore, more studies are needed to determine how ANN models can be used to assess the effects of these factors on cotton yield (Yildirim et al. 2022).

Machine learning tools such as ANN, LASSO, and NNet offer a promising approach for precision yield forecasting in other rainfed crops such as sorghum and rice, where the observed variability in yield can be attributed mainly to variations in weather conditions during the crop growth period. In turn, the outcomes of such studies provide advance estimates of crop yield based on weather conditions during the initial crop growth period, which support decision making in crop management and policy planning. Furthermore, including more features related to soil and crop growth parameters in the future can help improve the accuracy of machine learning models. The observed differences in model performance can be minimized by incorporating edaphic variables (e.g., soil moisture, nutrient availability) and management practices (e.g., irrigation, nutrient application). These factors may be gathered through physical observations or through remote sensing, for example by measuring plant vigor using the normalized difference vegetation index (NDVI).
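For reference, NDVI is computed from near-infrared and red reflectance as (NIR − Red) / (NIR + Red). The minimal sketch below shows the calculation; the function name and the reflectance values are illustrative assumptions and are not part of this study.

```python
def ndvi(nir, red):
    """Normalized difference vegetation index from two reflectances.

    Values range from -1 to 1; denser, healthier vegetation typically
    gives higher values because it reflects more near-infrared light.
    """
    total = nir + red
    if total == 0:
        raise ValueError("NIR and red reflectance sum to zero")
    return (nir - red) / total

# Illustrative reflectances only.
vigorous_canopy = ndvi(nir=0.5, red=0.1)
sparse_canopy = ndvi(nir=0.3, red=0.3)
```

A seasonal NDVI series of this kind could serve as an additional input feature alongside the weather indices.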

Further, machine learning models are not devoid of critical limitations. For example, ANNs may perform well, but they often function as black-box models that lack interpretability and fail to reveal underlying relationships. Such limitations have to be taken into account when the intention is to understand the underlying relationships in addition to estimating yield (Hu et al. 2023). The results provide an understanding of the model’s performance across different districts and years, shedding light on both successes and areas where improvement or further exploration may be beneficial. The interplay of factors influencing agricultural yield is complex, and these analyses serve as valuable guides for refining predictive models and agricultural strategies.

Conclusions

The study compares different statistical and machine learning techniques for forecasting kharif cotton yield in the major cotton growing districts of Karnataka, India. To account for the individual and interactive impacts of weather factors, weighted and unweighted weather indices were calculated and used as independent factors. One statistical model (SMLR) and four machine learning models (ANN, LASSO, ARIMAX, and RF) were tested and compared for their performance in cotton yield forecasting at two growth stages (F1, vegetative stage, and F2, mid stage). The ANN model outperformed all other models, as demonstrated by the satisfactory ranges of RMSE, nRMSE, and EF values. Furthermore, the tested ANN model was used to identify the top ten variables of importance affecting kharif cotton yield in each district. The set of important variables differed among districts because of the variability in weather factors and their interactions in each district under study. Morning relative humidity, along with its interactions with maximum and minimum temperatures, significantly affected cotton yield in most of the districts studied. This necessitates the development of appropriate plans and agricultural policies to mitigate the negative impacts of weather variables.

Data availability

The data that supports the findings of this study can be provided upon reasonable request.

References

Acknowledgements

The authors gratefully acknowledge the FASAL-India Meteorological Department, New Delhi, and the Directorate of Economics and Statistics, Bangalore, for providing the weather and yield data.

Funding

This study was funded by the India Meteorological Department, New Delhi, India, under the Forecasting Agricultural output using Space, Agrometeorology and Land based observations (FASAL) project (fund No. ASC/FASAL/KT-11/01/HQ-2010).

Author information

Authors and Affiliations

Authors

Contributions

Thimmegowda MN and Manjunatha MH: resources, conceptualization, validation. Lingaraj H and Soumya DV: analysis, investigation, original draft preparation. Satish GS and Nagesha L: data curation, original draft preparation. Jayaramaiah R: review and editing, visualization, supervision. All authors have read and agreed to the published version of the manuscript.

Corresponding author

Correspondence to Soumya D. V.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Supplementary Information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Thimmegowda, M.N., Manjunatha, M.H., Lingaraj, H. et al. Comparative analysis of machine learning and statistical models for cotton yield prediction in major growing districts of Karnataka, India. J Cotton Res 8, 6 (2025). https://doi.org/10.1186/s42397-024-00208-8

  • DOI: https://doi.org/10.1186/s42397-024-00208-8

Keywords