
Objective 1: analyze trends in ASD prevalence (2000–2024)
Hypothesis 1
ASD prevalence has significantly increased from 2000 to 2024. To test this, the null hypothesis (H0) states no significant increase in ASD prevalence during this period, while the alternative hypothesis (H1) asserts a significant increase in ASD prevalence from 2000 to 2024.
Objective
To analyze trends in ASD prevalence from 2000 to 2024. Understanding these trends is essential for identifying potential factors contributing to the increase in ASD prevalence and for informing public health strategies.
Data sources
ASD prevalence data were obtained from the Centers for Disease Control and Prevention (CDC) Autism and Developmental Disabilities Monitoring (ADDM) Network and included annual prevalence rates per 1,000 live births and the corresponding autism rates (1 in x children).
Statistical analysis
Yearly prevalence rates were calculated and plotted to identify trends. Linear regression and trend analysis examined changes over time (see Table 2; Fig. 1).

Line chart illustrating the trend in ASD prevalence from 2000 to 2024.
Figure 2 highlights the rate of increase over time, providing a clear visual representation of the annual changes in ASD prevalence.

Bar Chart of Annual Percentage Change in ASD Prevalence.
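The annual percentage change plotted in the bar chart can be computed from consecutive yearly values. The following is a minimal sketch; the prevalence values and the helper name `percentage_change` are hypothetical placeholders, not the ADDM Network data or the authors' code.

```python
# Hypothetical prevalence values (per 1,000) for consecutive years;
# placeholders only, not the ADDM Network data.
years = [2000, 2001, 2002, 2003, 2004]
prevalence = [6.0, 6.6, 8.0, 9.0, 11.25]


def percentage_change(values):
    """Percentage change between each consecutive pair of values."""
    return [
        (curr - prev) / prev * 100.0
        for prev, curr in zip(values, values[1:])
    ]


changes = percentage_change(prevalence)
for year, change in zip(years[1:], changes):
    print(f"{year}: {change:+.1f}%")
```

Each bar then corresponds to one entry of `changes`, labeled with the later year of the pair.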
The regression analysis indicates that 97.1% of the variance in ASD prevalence is explained by the year (R² = 0.971, β = 0.8945, p < .001), suggesting an average annual increase of approximately 0.8945 in the prevalence rate (per 1,000). Figure 3 presents a scatter plot with a regression line, illustrating the trend and model fit.
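A simple linear regression of prevalence on year, with the slope and R² reported above, can be sketched as follows. The data are hypothetical placeholders, and the use of NumPy's `polyfit` is an assumption for illustration; the original analysis may have used a different tool.

```python
import numpy as np

# Hypothetical prevalence values (per 1,000); placeholders only.
years = np.array([2000, 2004, 2008, 2012, 2016, 2020], dtype=float)
prevalence = np.array([6.0, 9.5, 13.0, 17.5, 20.0, 24.0])

# Center the years for numerical stability; the slope is unchanged.
x = years - years[0]

# Fit prevalence = intercept + slope * x by ordinary least squares.
slope, intercept = np.polyfit(x, prevalence, deg=1)

# R-squared: share of variance in prevalence explained by the fitted line.
predicted = intercept + slope * x
ss_res = np.sum((prevalence - predicted) ** 2)
ss_tot = np.sum((prevalence - prevalence.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print(f"slope = {slope:.4f} per year, R^2 = {r_squared:.3f}")
```

The slope plays the role of β in the text: the average change in the prevalence rate per calendar year.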

Scatter Plot with Regression Line for ASD Prevalence (2000–2024).
Findings
The data visualization includes a line chart, a scatter plot with a regression line, and a bar chart of annual percentage change, illustrating the trends in ASD prevalence from 2000 to 2024. The line chart shows a clear upward trend, with a significant increase, particularly after 2010. The scatter plot and regression line confirm the significant upward trajectory, and the bar chart highlights fluctuations in the rate of increase, with some years showing more substantial changes.
Conclusion
Based on visual inspection and statistical analysis, ASD prevalence has significantly increased from 2000 to 2024, supporting Hypothesis 1. The high R-squared value and significant p-value indicate a strong, consistent upward trend. This rise may reflect improved diagnostic practices, increased awareness, and potential environmental or genetic factors. Understanding these trends is crucial for developing effective public health strategies to address the rising prevalence of ASD.
Objective 2: investigate trends in female reproductive health parameters over the same period
Hypothesis 2
Female reproductive health parameters have significantly changed from 2000 to 2024. The null hypothesis (H0) states that there is no significant change in female reproductive health parameters during this period, while the alternative hypothesis (H1) asserts a significant change in female reproductive health parameters from 2000 to 2024.
Objective
To analyze trends in female reproductive health parameters from 2000 to 2024. Understanding these trends is essential for identifying potential factors contributing to reproductive health changes and informing public health strategies.
Data sources
Data on female reproductive health parameters, including FSH levels, Estradiol levels, AMH levels, Antral Follicle Count, CCCT FSH levels, Fertility Rate, Ovarian Volume, and Maternal Age, were obtained from reputable sources such as peer-reviewed journals, national and international health organizations, and comprehensive meta-analyses (see Table 3).
Statistical analysis
Yearly data for each reproductive health parameter were calculated and plotted to identify trends. Linear regression and trend analysis were used to examine changes over time.
Table 4 summarizes the central tendency, dispersion, and shape of the distributions in the dataset, highlighting trends in ASD prevalence and reproductive health parameters over the years. The descriptive analysis reveals significant trends over the past two decades, including increased ASD prevalence, declining ovarian reserve markers (AMH levels and antral follicle count), and increasing maternal age. These shifts in population health dynamics warrant further investigation to understand their causes and inform public health strategies, and they provide a basis for further research and policy decisions related to reproductive health and ASD.
Results: The regression analysis for reproductive health indicators demonstrates strong model fits, as reflected by high R² values. FSH levels were negatively associated with ASD prevalence (R² = 0.999, β = -0.0933, p < .001), as were estradiol levels (R² = 0.999, β = -0.9333, p < .001). AMH levels also showed a significant negative association (R² = 0.976, β = -0.0661, p < .001). The antral follicle count was similarly negatively associated (R² = 0.948, β = -0.4564, p < .001). Finally, CCCT FSH levels exhibited a strong negative association with ASD prevalence (R² = 0.997, β = -0.0933, p < .001).
The fertility rate was negatively associated with ASD prevalence (R² = 0.995, β = -0.2038, p < .001). Similarly, ovarian volume showed a significant negative association (R² = 0.999, β = -0.0933, p < .001). Maternal age was positively associated with ASD prevalence (R² = 0.987, β = 0.1110, p < .001). For a visual representation, see Figs. 4 and 5.

Matplotlib Chart (Box Plots of First Half of Variables).

Matplotlib Chart (Box Plots of Second Half of Variables).
Interpretation
The high R² values indicate strong model fits, suggesting significant trends in female reproductive health parameters from 2000 to 2024. The coefficients for the year indicate the direction and magnitude of changes, while the very low p-values (p < .001 for all parameters) confirm statistical significance.
Data visualization
To support the analysis, line charts illustrate the trends in each reproductive health parameter from 2000 to 2024 (see Fig. 6).

Line Charts: Show distinct trends for each parameter, highlighting significant changes over the study period.
The charts visually represent changes in reproductive health variables over the past two decades, aiding further analysis and interpretation. Key trends include a gradual decline in FSH and estradiol levels, indicating changes in ovarian function and estrogen production. AMH levels and antral follicle counts also declined, reflecting reduced fertility and ovarian reserve. CCCT FSH levels showed consistent changes in ovarian response. The fertility rate decreased, possibly due to delayed childbearing and lifestyle changes. Ovarian volume reduced while maternal age increased, reflecting societal trends. These trends highlight shifts in reproductive health indicators and societal behaviors, informing future research, healthcare policies, and clinical practices in gynecology.
Findings
The data visualization includes line charts, scatter plots with regression lines, and bar charts of annual percentage changes, illustrating the trends in reproductive health parameters from 2000 to 2024. The visualizations highlight significant changes, with specific parameters showing clear trends over the years.
Conclusion
Based on the regression analysis results, female reproductive health parameters have shown significant changes over the past two decades, supporting Hypothesis 2. These changes suggest that shifts in lifestyle, environmental exposures, and healthcare practices may have influenced key reproductive health indicators. Understanding these trends is crucial for developing effective public health strategies to address changes in reproductive health.
Objective 3: explore statistically significant associations between female reproductive health indicators and ASD prevalence
Hypothesis 3
There is a significant relationship between female reproductive health indicators and ASD prevalence from 2000 to 2024. The null hypothesis (H0) states that no significant relationship exists between female reproductive health indicators and ASD prevalence during this period, while the alternative hypothesis (H1) asserts a significant relationship.
Objective
To examine the relationship between female reproductive health indicators and ASD prevalence from 2000 to 2024. Understanding these relationships is crucial for identifying potential factors contributing to changes in ASD prevalence and for informing public health strategies.
Methods
To examine the relationship between female reproductive health indicators and ASD prevalence, we employed a range of statistical techniques: correlation analysis, multiple regression analysis, partial correlation analysis, ANOVA, principal component analysis (PCA), hierarchical clustering analysis, and a correlation heatmap. These methods allowed us to evaluate the significance and strength of the associations between the reproductive health parameters and ASD prevalence, as described below.
Correlation analysis
First, we calculated the Pearson correlation coefficients between ASD prevalence and each reproductive health parameter. The analysis revealed a strong positive correlation with maternal age (r = 0.986). Conversely, strong negative correlations were observed with antral follicle count (r = -0.972), fertility rate (r = -0.982), and FSH levels, CCCT FSH levels, ovarian volume, estradiol levels, and AMH levels (all approximately -0.986) (see Fig. 7).

Bar Chart of Correlation Coefficients Between ASD Prevalence and Reproductive Health Parameters. This chart visualizes the strength of correlations between ASD prevalence and each reproductive health parameter. Positive correlations are indicated by bars extending to the right, while negative correlations are indicated by bars extending to the left.
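The Pearson coefficients summarized in the bar chart can be computed per parameter as sketched below. The series and the variable names (`maternal_age`, `fertility_rate`) are hypothetical placeholders rather than the study data.

```python
import numpy as np

# Hypothetical yearly series; placeholders only, not the study data.
asd_prevalence = np.array([6.0, 9.5, 13.0, 17.5, 20.0, 24.0])
parameters = {
    "maternal_age": np.array([27.0, 27.6, 28.1, 28.9, 29.4, 30.1]),
    "fertility_rate": np.array([65.0, 64.1, 62.8, 61.0, 59.9, 58.2]),
}


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    return float(np.corrcoef(x, y)[0, 1])


correlations = {
    name: pearson_r(values, asd_prevalence)
    for name, values in parameters.items()
}
for name, r in correlations.items():
    print(f"{name}: r = {r:+.3f}")
```

A horizontal bar chart of `correlations` would place positive values to the right and negative values to the left, as in the figure.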
Multiple regression analysis
A multiple regression analysis was conducted to identify which reproductive health parameters significantly predict ASD prevalence. The model explained 97.8% of the variance in ASD prevalence (R² = 0.978, adjusted R² = 0.962), and the F-statistic (F = 62.56, p < .001) indicated the model was statistically significant. However, individual predictors, including FSH levels (β = -0.0247, p = .941), estradiol levels (β = -0.2442, p = .941), AMH levels (β = -14.4825, p = .382), and maternal age (β = 14.2098, p = .390), did not show significant effects, as their p-values exceeded the 0.05 threshold.
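The overall-fit statistics quoted above (R², adjusted R², and the F-statistic) follow from an ordinary least squares fit. The sketch below uses synthetic data and NumPy's least-squares solver; the sample size, number of predictors, and coefficients are assumptions chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: n observations, p predictors (placeholders, not study data).
n, p = 25, 3
X = rng.normal(size=(n, p))
y = 2.0 + X @ np.array([1.5, -0.8, 0.3]) + rng.normal(scale=0.5, size=n)

# Ordinary least squares with an intercept column.
design = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)

residuals = y - design @ coef
ss_res = float(residuals @ residuals)
ss_tot = float(np.sum((y - y.mean()) ** 2))

# R^2, adjusted R^2 (penalizes the number of predictors), and overall F.
r2 = 1.0 - ss_res / ss_tot
adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
f_stat = (r2 / p) / ((1.0 - r2) / (n - p - 1))

print(f"R^2 = {r2:.3f}, adjusted R^2 = {adj_r2:.3f}, F = {f_stat:.2f}")
```

The F-statistic tests the predictors jointly, which is why a model can be significant overall while no single coefficient is.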
Partial correlation analysis
To address multicollinearity and better understand the unique contribution of each parameter, we conducted a partial correlation analysis with ASD prevalence as the dependent variable. The study revealed varying effect sizes for reproductive health parameters. Maternal age (r = .327, moderate effect size) indicated that higher maternal age is associated with increased ASD prevalence. The fertility rate (r = .137, small effect size) showed a slight association with ASD prevalence. Ovarian volume (r = .341, moderate effect size) and estradiol levels (r = .296, moderate effect size) also suggested contributions to higher ASD prevalence. Conversely, AMH levels (r = -.333, moderate effect size) and CCCT FSH levels (r = -.578, strong effect size) exhibited negative associations, suggesting that higher levels of these parameters are linked to lower ASD prevalence. These findings highlight the diverse effects of reproductive health factors on ASD prevalence, providing valuable insights for future research and intervention efforts.
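One common way to obtain a partial correlation is to regress both variables on the control variables and correlate the residuals. The sketch below illustrates this with synthetic data; the function names `residualize` and `partial_corr` are hypothetical, and the original analysis may have used a different routine.

```python
import numpy as np


def residualize(target, controls):
    """Residuals of target after least-squares regression on controls."""
    design = np.column_stack([np.ones(len(target)), controls])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return target - design @ coef


def partial_corr(x, y, controls):
    """Correlation between x and y after removing the effect of controls."""
    rx = residualize(x, controls)
    ry = residualize(y, controls)
    return float(np.corrcoef(rx, ry)[0, 1])


# Synthetic example: x and y are both driven by a shared variable z.
rng = np.random.default_rng(1)
z = rng.normal(size=200)
x = z + 0.3 * rng.normal(size=200)
y = z + 0.3 * rng.normal(size=200)

raw_r = float(np.corrcoef(x, y)[0, 1])
partial_r = partial_corr(x, y, z.reshape(-1, 1))
print(f"raw r = {raw_r:.3f}, partial r = {partial_r:.3f}")
```

In this synthetic case the raw correlation is high while the partial correlation is close to zero, which shows how controlling for shared variance can change the apparent strength of an association.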
ANOVA analysis
An ANOVA was conducted to assess significant differences in ASD prevalence across various reproductive health parameters. The analysis showed that FSH levels significantly impact ASD prevalence, F(1, 4) = 310.68, p < .001, with an effect size (η² = 0.489), indicating a strong association between variations in FSH levels and changes in ASD prevalence. Conversely, other parameters, including estradiol levels (F(1, 4) = 2.09, p = .236, η² = 0.007), AMH levels (F(1, 4) = 0.96, p = .408, η² = 0.003), antral follicle count (F(1, 4) = 0.004, p = .988, η² < 0.001), CCCT FSH levels (F(1, 4) = 0.30, p = .464, η² = 0.002), fertility rate (F(1, 4) = 1.57, p = .343, η² = 0.004), ovarian volume (F(1, 4) = 0.42, p = .592, η² = 0.001), and maternal age (F(1, 4) = 0.50, p = .565, η² = 0.001) were not statistically significant. These results suggest that while FSH levels are critical in influencing ASD prevalence, the other reproductive health parameters assessed do not show a strong individual effect on ASD prevalence (see Figs. 8 and 9).

F-Statistics From ANOVA Analysis.

P-Values From ANOVA Analysis.
Multicollinearity assessment
To evaluate multicollinearity among the independent variables (reproductive health parameters), we calculated the Variance Inflation Factor (VIF) for each predictor. Multicollinearity occurs when predictors are highly correlated, which can affect the stability and interpretability of regression coefficients. Table 5 summarizes the VIF values for all predictors.
The analysis revealed that several variables exhibited high VIF values, with FSH Levels (62.35), Estradiol Levels (32.20), Fertility Rate (42.19), and Maternal Age (33.54) being notably high. This indicates a strong multicollinearity problem, as VIF values exceeding 10 generally indicate multicollinearity. Such high values suggest that these variables share significant variance, potentially confounding the regression analysis and reducing the reliability of individual coefficients.
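The VIF of a predictor is 1 / (1 − R²), where R² comes from regressing that predictor on all the others. A minimal sketch with synthetic data follows; the function name `vif` and the example columns are hypothetical.

```python
import numpy as np


def vif(X):
    """VIF per column of X: 1 / (1 - R^2) from regressing it on the others."""
    n, p = X.shape
    out = []
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)


# Synthetic predictors (placeholders, not the study data).
rng = np.random.default_rng(2)
a = rng.normal(size=100)
b = a + 0.1 * rng.normal(size=100)  # nearly a copy of a: high VIF
c = rng.normal(size=100)            # independent of a and b: VIF near 1
vifs = vif(np.column_stack([a, b, c]))
print(np.round(vifs, 2))
```

Columns `a` and `b` exceed the conventional threshold of 10, while `c` stays close to 1.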
To address this issue, dimensionality reduction techniques, such as Principal Component Analysis (PCA), were employed in the subsequent analysis to cluster correlated variables and mitigate the effects of multicollinearity. Additionally, partial correlation analysis was performed to evaluate the independent contributions of each parameter to ASD prevalence. These steps ensured that the influence of multicollinearity was minimized, enhancing the robustness and interpretability of the results.
Principal component analysis (PCA)
PCA reduced data dimensionality, identifying key components that explained most of the variance. The first two principal components explained a significant portion of the total variance, with Principal Component 1 explaining 99.68% of the variance and Principal Component 2 explaining an additional 0.20%. Scatter plots with regression lines were created for reproductive health parameters significantly correlated with ASD prevalence, including maternal age, antral follicle count, fertility rate, FSH levels, CCCT FSH levels, ovarian volume, estradiol levels, and AMH levels. A bar chart of correlation coefficients highlighted the strength and direction of these relationships (see Fig. 10).

Scatter Plots with Regression Lines: ASD Prevalence Vs. Reproductive Health Parameters.
To better understand the biological relevance of the identified components, Principal Component 1 (PC1) and Principal Component 2 (PC2) were interpreted based on their variable loadings. PC1, which explains 99.68% of the total variance, primarily reflects the influence of maternal age, FSH levels, and antral follicle count, suggesting a potential relationship between ovarian aging and ASD prevalence. These variables had the highest loadings (absolute value ≥ 0.6) on PC1, indicating their strong contribution to the variance explained. PC2, which accounts for an additional 0.20% of the variance, highlights estradiol levels and ovarian volume as key contributors, pointing to possible hormonal influences on ASD risk. The threshold for significant contributions was set at an absolute loading value of ≥ 0.6, aligning with standard practices in PCA. These findings align with prior literature suggesting that ovarian function and hormonal balance play critical roles in neurodevelopmental outcomes.
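Explained-variance ratios and loadings of the kind discussed here can be obtained from a singular value decomposition of the standardized data. The sketch below assumes the variables were standardized before PCA and uses synthetic trending series; the function name `pca` and the 0.6 cut-off applied in code are illustrative.

```python
import numpy as np


def pca(X):
    """PCA on standardized columns: explained-variance ratios and loadings."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # SVD of the standardized data: rows of Vt are the principal axes.
    _, singular_values, Vt = np.linalg.svd(Z, full_matrices=False)
    eigenvalues = singular_values ** 2 / (len(X) - 1)
    ratios = eigenvalues / eigenvalues.sum()
    # Loadings: correlation between each variable and each component.
    loadings = Vt.T * np.sqrt(eigenvalues)
    return ratios, loadings


# Synthetic yearly series that all trend with time (placeholders only).
rng = np.random.default_rng(3)
t = np.arange(25, dtype=float)
X = np.column_stack([
    27.0 + 0.12 * t + rng.normal(scale=0.05, size=25),  # rising
    12.0 - 0.40 * t + rng.normal(scale=0.20, size=25),  # falling
    7.0 - 0.09 * t + rng.normal(scale=0.05, size=25),   # falling
])
ratios, loadings = pca(X)

# Variables whose absolute loading on PC1 meets the 0.6 threshold.
strong = np.abs(loadings[:, 0]) >= 0.6
print(np.round(ratios, 4), strong)
```

When every variable follows the same time trend, the first component absorbs almost all of the variance, which is the pattern reported for PC1.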
Multiple regression analysis showed that the model was significant overall, but individual reproductive health parameters did not show statistically significant effects on ASD prevalence due to multicollinearity. Partial correlation analysis, controlling for other parameters, revealed significant partial correlations for several parameters. ANOVA confirmed significant differences in ASD prevalence across various reproductive health parameters, identifying FSH levels as a significant factor. PCA simplified the data structure, visualizing clusters and relationships within reproductive health parameters.
Figure 10 illustrates the relationships between ASD prevalence and various significant reproductive health parameters, displaying the fit of the regression lines for the following pairs: ASD prevalence vs. maternal age (years), ASD prevalence vs. antral follicle count (number of follicles), ASD prevalence vs. fertility rate (births per 1,000), ASD prevalence vs. FSH levels (IU/L), ASD prevalence vs. CCCT FSH levels (IU/L), ASD prevalence vs. ovarian volume (mL), ASD prevalence vs. estradiol levels (pg/mL), and ASD prevalence vs. AMH levels (ng/mL).
The combined analyses provided a comprehensive understanding of the relationship between female reproductive health indicators and ASD prevalence. Significant correlations and differences were identified for parameters such as AMH levels, antral follicle count, FSH levels, CCCT FSH levels, ovarian volume, estradiol levels, and maternal age. These findings highlight the importance of these parameters in understanding ASD prevalence, suggesting they are significant risk factors associated with ASD. Although individual predictors were insignificant in the multiple regression model due to multicollinearity, the overall model indicated a strong relationship between female reproductive health parameters and ASD prevalence (see Figs. 11 and 12).

Explained Variance By Principal Components.

PCA of Reproductive Health Parameters.
Hierarchical clustering analysis
The dendrogram illustrates the hierarchical clustering analysis of standardized reproductive health features. It shows how the years are grouped based on the similarity of their reproductive health parameters. Key components include clusters represented by vertical lines merging at different stages, with the height indicating the distance or dissimilarity between clusters. The horizontal axis labels correspond to the sample indices, showing the merging order, while the vertical axis represents the distance between clusters (see Fig. 13). The observations are as follows:
- Years close together on the dendrogram, merged at a lower height, have similar reproductive health parameters.
- Major clusters merge at higher distances, indicating broader similarities among groups of years and providing insights into periods with similar reproductive health profiles.
Conclusion
Hierarchical clustering analysis identifies patterns and groupings based on reproductive health parameters, which is useful for understanding temporal trends and relationships between different years in the context of ASD prevalence and reproductive health factors.

Hierarchical clustering dendrogram.
Correlation heatmap
The correlation heatmap visually represents the strength and direction of relationships between ASD prevalence and various reproductive health parameters. Positive correlations, shown in shades of red, indicate that as the reproductive health parameter increases, ASD prevalence also tends to increase. Negative correlations, shown in shades of blue, indicate that as the reproductive health parameter increases, ASD prevalence tends to decrease. The color’s intensity reflects the correlation’s strength, with darker shades representing stronger relationships.
The heatmap highlights key reproductive health parameters, including FSH levels, estradiol levels, AMH levels, antral follicle count, CCCT FSH levels, fertility rate, ovarian volume, and maternal age. By examining these correlations, researchers can identify potential risk factors and determine which parameters have the strongest associations with changes in ASD prevalence. This information is essential for guiding further research and public health strategies aimed at addressing the rising prevalence of ASD.
Findings
The analysis revealed significant relationships between ASD prevalence and various female reproductive health parameters. Maternal age demonstrated strong positive correlations with ASD prevalence, while parameters such as antral follicle count, fertility rate, FSH levels, CCCT FSH levels, ovarian volume, estradiol levels, and AMH levels showed strong negative correlations. These relationships suggest that certain reproductive health parameters may be associated with trends in ASD prevalence.
The heatmap visually illustrates these findings, helping to identify critical risk factors and trends. Multiple regression analysis explained 97.8% of the variance in ASD prevalence but did not identify any statistically significant individual predictors due to multicollinearity. Partial correlation analysis revealed independent positive correlations for maternal age and ovarian volume and negative correlations for AMH and CCCT FSH levels. ANOVA identified FSH levels as the only parameter with a statistically significant effect on ASD prevalence, while PCA and hierarchical clustering grouped years based on similarities in reproductive health profiles. These insights collectively highlight potential reproductive health risk factors influencing ASD prevalence (see Fig. 14).

Correlation Heatmap Between ASD Prevalence and Reproductive Health Parameters. The heatmap was generated using Seaborn (v0.11.2) and Matplotlib (v3.4.3) in Python (v3.x). The software was run in a Jupyter Notebook environment (part of the Anaconda distribution). The dataset was processed using Pandas (v1.x). More details about Seaborn and Matplotlib are available at and https://seaborn.pydata.org/.
Objective 4: identify specific reproductive health factors contributing to ASD prevalence
Hypothesis 4
There are significant risk factors associated with ASD prevalence from 2000 to 2024. The null hypothesis (H0) states that specific reproductive health parameters are not significant risk factors for ASD, while the alternative hypothesis (H1) asserts that specific reproductive health parameters are significant risk factors for ASD.
Objective
To identify potential risk factors associated with ASD prevalence from 2000 to 2024. Understanding these risk factors is crucial for informing public health strategies and interventions to address the rising prevalence of ASD.
Methods
To identify potential risk factors associated with ASD, we employed multiple regression analysis to identify significant predictors of ASD prevalence and interpret the coefficients to understand their impact. Additionally, we conducted logistic regression analysis to calculate odds ratios and performed feature importance and sensitivity analysis to quantify the effects of these predictors on ASD prevalence.
Regression analysis
We removed highly collinear variables to improve the regression analysis and reassessed the model to better understand the relationships between reproductive health parameters and ASD prevalence. While none of the individual predictors reached statistical significance, the coefficients provided meaningful indications of potential relationships. AMH levels (β = -20.9948) suggested that lower AMH levels may be associated with higher ASD prevalence. Antral follicle count (β = 2.3177) and fertility rate (β = 6.6193) showed positive associations with ASD prevalence, indicating that higher values in these parameters could correspond to increased prevalence. Similarly, maternal age (β = 23.9583) exhibited a positive association, reinforcing the link between older maternal age and higher ASD prevalence. Conversely, FSH levels (β = -24.125), estradiol levels (β = -1.776), ovarian volume (β = -1.758), and CCCT FSH levels (β = -5.722) all had negative coefficients, suggesting that higher levels of these parameters may be associated with lower ASD prevalence. Although these findings were not statistically significant, they offer valuable insights and underscore the need for further research to explore the role of reproductive health parameters in ASD prevalence.
Logistic regression and odds ratios
A logistic regression analysis was conducted to identify significant predictors of ASD prevalence, estimating the odds ratios for each reproductive health parameter. This method provided insights into the likelihood of ASD prevalence associated with changes in each factor. The calculated odds ratios for each reproductive health parameter were as follows: FSH levels (1.059), estradiol levels (0.507), AMH levels (0.000), antral follicle count (2.007), CCCT FSH levels (1.003), fertility rate (0.737), ovarian volume (1.978), maternal age (3.191).
The odds ratios indicate the multiplicative change in the odds of ASD prevalence associated with a one-unit increase in each reproductive health parameter. An odds ratio greater than 1 suggests that an increase in the parameter is associated with higher odds of ASD prevalence, while an odds ratio less than 1 suggests that an increase is associated with lower odds (see Fig. 15).
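The reported odds ratios can be read as percentage changes in the odds per one-unit increase, and each corresponds to a logistic regression coefficient β = ln(OR). The sketch below applies both conversions to the values listed above; the helper names are hypothetical, and the odds ratio of 0.000 for AMH is treated as a rounded value whose logarithm cannot be recovered.

```python
import math

# Odds ratios as reported in the text for a one-unit increase in each parameter.
odds_ratios = {
    "FSH levels": 1.059,
    "estradiol levels": 0.507,
    "AMH levels": 0.000,
    "antral follicle count": 2.007,
    "CCCT FSH levels": 1.003,
    "fertility rate": 0.737,
    "ovarian volume": 1.978,
    "maternal age": 3.191,
}


def percent_change_in_odds(odds_ratio):
    """Percentage change in the odds for a one-unit increase in the predictor."""
    return (odds_ratio - 1.0) * 100.0


def implied_coefficient(odds_ratio):
    """Logistic regression coefficient implied by an odds ratio: ln(OR)."""
    # An odds ratio reported as 0.000 has been rounded; treat its log as -inf.
    return math.log(odds_ratio) if odds_ratio > 0 else float("-inf")


summary = {
    name: (percent_change_in_odds(value), implied_coefficient(value))
    for name, value in odds_ratios.items()
}
for name, (pct, beta) in summary.items():
    print(f"{name}: {pct:+.1f}% odds per unit, beta = {beta:.3f}")
```

Positive percentages correspond to odds ratios above 1 and negative percentages to odds ratios below 1.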

Odds Ratios of Reproductive Health Parameters for ASD Prevalence.
Figure 15 presents the odds ratios for the reproductive health parameters, with each bar representing the odds ratio for one parameter. Odds ratios above 1 were obtained for maternal age (3.191), antral follicle count (2.007), ovarian volume (1.978), and FSH levels (1.059), indicating that higher values of these parameters are associated with higher odds of ASD prevalence, although the FSH estimate is close to 1. Odds ratios below 1 were obtained for fertility rate (0.737), estradiol levels (0.507), and AMH levels (0.000), indicating that higher values are associated with lower odds and, conversely, that lower AMH levels, estradiol levels, and fertility rates are associated with higher odds. The odds ratio for CCCT FSH levels (1.003) is approximately 1, indicating little change in the odds per unit change.
The logistic regression analysis provides a comprehensive view of how each reproductive health parameter influences the likelihood of ASD prevalence, identifying significant risk factors and informing public health strategies.
The regression and logistic regression analyses, combined with the previous PCA visualization, provided a comprehensive understanding of potential risk factors associated with ASD. Reproductive health parameters such as maternal age and antral follicle count were identified as potential risk factors. These findings warrant further investigation into these parameters to develop targeted interventions and preventive measures for ASD.
While none of the predictors showed statistical significance in the refined model, the overall model indicates a relationship between female reproductive health parameters and ASD prevalence. This supports Hypothesis 4 to some extent, suggesting that specific reproductive health parameters are associated with ASD prevalence. Further investigation with a larger sample size or addressing multicollinearity issues may be needed.
Feature importance and sensitivity analysis
Feature Importance Bar Plots: To determine the most influential reproductive health parameters affecting ASD prevalence, we utilized two machine learning models: Random Forest and Gradient Boosting. Feature importance scores were derived from both models, highlighting the impact of each parameter on ASD prevalence predictions. These scores help identify which parameters contribute the most to the model’s predictions, providing valuable insights into potential risk factors (see Figs. 16 and 17).

Feature Importance From Random Forest Model.

Feature Importance From Gradient Boosting Model.
The bar plots display the feature importance scores for each reproductive health parameter determined by the Random Forest and Gradient Boosting models. These plots indicate which parameters have the most significant impact on ASD prevalence. In both models, parameters such as maternal age, estradiol levels, and FSH levels showed high importance scores, suggesting they play a crucial role in predicting ASD prevalence.
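Impurity-based feature importance scores of the kind plotted above can be extracted from fitted scikit-learn models through the `feature_importances_` attribute. The sketch below uses synthetic data and illustrative feature names; the hyperparameters are assumptions, since the settings used in the study are not stated.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

rng = np.random.default_rng(4)

# Synthetic predictors and outcome (placeholders, not the study data).
feature_names = ["maternal_age", "estradiol", "fsh", "noise"]
X = rng.normal(size=(200, 4))
y = (3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0 * X[:, 2]
     + rng.normal(scale=0.3, size=200))

models = {
    "random_forest": RandomForestRegressor(n_estimators=200, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}

importances = {}
for name, model in models.items():
    model.fit(X, y)
    # Impurity-based importances: non-negative and normalized to sum to 1.
    importances[name] = dict(zip(feature_names, model.feature_importances_))

for name, scores in importances.items():
    print(name, {k: round(float(v), 3) for k, v in scores.items()})
```

These scores rank predictors by their contribution to the model's splits; they do not indicate the direction of an effect.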
Sensitivity analysis plots
Sensitivity analysis was conducted to understand how variations in key reproductive health parameters affect ASD predictions. This analysis helps assess the robustness of the model’s predictions and the extent to which parameter changes influence the outcome (see Figs. 18, 19, and 20).

Sensitivity Analysis for Maternal Age.

Sensitivity Analysis for Estradiol Levels.

Sensitivity Analysis for FSH Levels.
These plots illustrate how variations in maternal age, estradiol levels, and FSH levels affect ASD predictions, demonstrating the sensitivity of ASD prevalence to changes in these parameters. For instance, changes in maternal age showed a significant effect on ASD predictions, with an increase in maternal age leading to a higher predicted prevalence of ASD. Similarly, variations in estradiol and FSH levels also had notable effects on ASD predictions, underscoring their importance in the model.
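A one-at-a-time sensitivity analysis varies a single predictor over a grid while holding the others at their means and records the model's predictions. The sketch below uses an ordinary least squares model as a stand-in for the Random Forest and Gradient Boosting models described in the text; the data, feature names, and grid are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic standardized predictors and outcome (placeholders only).
feature_names = ["maternal_age", "estradiol", "fsh"]
X = rng.normal(size=(100, 3))
y = (10.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] - 0.5 * X[:, 2]
     + rng.normal(scale=0.2, size=100))

# Stand-in model: ordinary least squares with an intercept.
design = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)


def predict(features):
    """Predicted outcome for each row of features."""
    return coef[0] + features @ coef[1:]


def sensitivity_curve(index, grid):
    """Predictions as one feature sweeps the grid; others stay at their means."""
    baseline = np.tile(X.mean(axis=0), (len(grid), 1))
    baseline[:, index] = grid
    return predict(baseline)


grid = np.linspace(-2.0, 2.0, 9)
curves = {
    name: sensitivity_curve(i, grid)
    for i, name in enumerate(feature_names)
}
for name, curve in curves.items():
    print(name, np.round(curve, 2))
```

Plotting each entry of `curves` against `grid` gives one sensitivity plot per parameter; a steeper curve indicates greater sensitivity of the prediction to that parameter.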
Findings
The analysis aimed to identify potential risk factors associated with ASD prevalence from 2000 to 2024. The regression analysis revealed several reproductive health parameters, such as AMH levels, antral follicle count, fertility rate, maternal age, FSH levels, estradiol levels, and ovarian volume, associated with ASD prevalence. However, none were statistically significant individually due to multicollinearity. Logistic regression analysis provided odds ratios for each parameter, indicating their relative impact on ASD prevalence. Feature importance scores from Random Forest and Gradient Boosting models highlighted maternal age, estradiol levels, and FSH levels as key risk factors. Sensitivity analysis further demonstrated how variations in these parameters influenced ASD prevalence predictions.
Conclusion
The analyses support the hypothesis that specific reproductive health parameters are significant risk factors for ASD prevalence, identifying maternal age, estradiol levels, and FSH levels as key factors. Although individual predictors were not statistically significant due to multicollinearity, the overall models indicate a strong relationship between reproductive health parameters and ASD prevalence. These findings underscore the importance of targeted public health strategies and interventions to address the rising prevalence of ASD, suggesting further investigation into these risk factors for effective prevention and management.
Objective 5: develop and validate predictive models for ASD prevalence using statistical and machine learning techniques
Hypothesis 5
Predictive models can accurately forecast ASD prevalence using female reproductive health indicators from 2000 to 2024. The null hypothesis (H0) states that predictive models cannot accurately forecast ASD prevalence based on these indicators, while the alternative hypothesis (H1) asserts that predictive models can.
Objective
To develop predictive models for ASD prevalence using female reproductive health indicators from 2000 to 2024. Accurate predictive models are essential for anticipating trends in ASD prevalence and informing public health strategies and interventions.
Methods
This study aimed to develop and evaluate predictive models for ASD prevalence based on reproductive health parameters using statistical and machine learning techniques. Our approach focused on three main areas.
- Model Development: We constructed two predictive models (multiple regression and Random Forest) using a standardized dataset to forecast ASD prevalence.
- Model Evaluation: The models were evaluated using a range of performance metrics, including R-squared, Mean Squared Error (MSE), Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), Explained Variance Score, and cross-validation scores. These metrics assessed the models’ ability to explain the variance in ASD prevalence, quantify the prediction errors, and evaluate their generalizability to unseen data.
- Visualization: We created visualizations such as scatter plots for predicted vs. actual values and bar charts for model performance metrics. These visualizations helped illustrate the models’ behavior and highlight the importance of different features.
These methods collectively enabled us to develop robust models capable of forecasting ASD prevalence and identifying critical factors for targeted public health interventions.
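As a reference for the evaluation metrics listed above, the following is a minimal sketch of how they are conventionally defined. It is not the study's code: the function name `regression_metrics` and the example numbers are illustrative assumptions, and the values are made up rather than taken from the dataset.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return R^2, MSE, MAE, RMSE, and explained variance for paired values."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    mean_resid = sum(residuals) / n

    mse = sum(r * r for r in residuals) / n           # mean squared error
    mae = sum(abs(r) for r in residuals) / n          # mean absolute error
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    var_true = ss_tot / n
    var_resid = sum((r - mean_resid) ** 2 for r in residuals) / n

    return {
        "r2": 1.0 - (mse * n) / ss_tot,               # 1 - SS_res / SS_tot
        "mse": mse,
        "mae": mae,
        "rmse": math.sqrt(mse),                       # RMSE is the square root of MSE
        "explained_variance": 1.0 - var_resid / var_true,
    }

# Made-up observed and predicted values, for illustration only.
observed = [5.0, 7.0, 9.0, 12.0, 15.0, 18.0]
predicted = [5.5, 6.5, 9.5, 11.0, 15.5, 17.0]
metrics = regression_metrics(observed, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

R-squared and the Explained Variance Score differ only when the residuals have a non-zero mean, so reporting both can reveal a systematic bias in the predictions.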
Model development
1. Multiple Regression Model: A linear regression model was developed to identify the relationship between reproductive health parameters and ASD prevalence.
2. Random Forest Model: A Random Forest model, a non-linear machine learning technique, was also trained to predict ASD prevalence.
The models were trained on a standardized dataset and evaluated on a separate test set. To assess their performance in explaining the variance and predicting ASD prevalence, we calculated several metrics, including R-squared, MSE, MAE, RMSE, and the Explained Variance Score. Additionally, cross-validation using multiple folds was performed to evaluate the models’ generalizability to unseen data.
Scatter plots were created to visualize the predicted vs. actual values for both models, while a bar chart was generated to compare their performance metrics, including R-squared, MSE, MAE, and RMSE. Additionally, feature importance analysis for the Random Forest model was conducted to identify the reproductive health parameters most statistically associated with ASD prevalence [27]. These associations do not imply causation but highlight variables that may warrant further investigation in future studies.
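The workflow described above (standardization, a held-out test set, error metrics, and multi-fold cross-validation) could be sketched with scikit-learn roughly as follows. This is an illustration under stated assumptions, not the study's implementation: the data are synthetic, and the sample size, split ratio, number of folds, and hyperparameters are arbitrary choices made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import (explained_variance_score, mean_absolute_error,
                             mean_squared_error, r2_score)
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): three made-up predictors, made-up target.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = 10.0 + 2.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.5, size=120)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Scaling sits inside each pipeline so every cross-validation fold is
# standardized using only its own training portion.
models = {
    "multiple_regression": make_pipeline(StandardScaler(), LinearRegression()),
    "random_forest": make_pipeline(
        StandardScaler(),
        RandomForestRegressor(n_estimators=200, random_state=0),
    ),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    mse = mean_squared_error(y_test, pred)
    cv_r2 = cross_val_score(model, X_train, y_train, cv=5, scoring="r2")
    results[name] = {
        "r2": r2_score(y_test, pred),
        "mse": mse,
        "mae": mean_absolute_error(y_test, pred),
        "rmse": float(np.sqrt(mse)),
        "explained_variance": explained_variance_score(y_test, pred),
        "cv_r2_mean": float(cv_r2.mean()),
        "cv_r2_std": float(cv_r2.std()),   # large spread hints at overfitting
    }

for name, scores in results.items():
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

A large standard deviation of the fold-wise R² relative to its mean is one way to quantify the cross-validation variability mentioned in the text.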
Multiple regression model
A multiple regression model was used to predict ASD prevalence from reproductive health parameters. It explained 86.5% of the variance in ASD prevalence (R² = 0.865), but cross-validation R² scores varied widely across folds, indicating potential overfitting and limited generalizability. Prediction errors were moderate, with a Mean Absolute Error (MAE) of 2.70 and a Root Mean Squared Error (RMSE) of 2.83. A scatter plot of predicted against observed values was used to assess fit: under perfect prediction, all points would lie on the 45-degree line. Residual analysis examined the distribution of prediction errors, which should scatter randomly around zero in a well-fitted model. Together, these metrics and visualizations characterize the model’s accuracy and reliability for predicting ASD prevalence and can inform public health strategies (see Fig. 21).

Multiple Regression Model: Predictions Vs. Actual Values.
Random forest model
The Random Forest model outperformed the multiple regression model, achieving an R² of 0.969, that is, it explained 96.9% of the variance in ASD prevalence. Its error metrics were notably lower, with a Mean Squared Error (MSE) of 1.85, a Mean Absolute Error (MAE) of 1.08, and a Root Mean Squared Error (RMSE) of 1.36, reflecting higher predictive accuracy. Although cross-validation scores showed some variability, the model’s overall performance remained superior. Additionally, a feature importance analysis highlighted the reproductive health parameters most strongly associated with ASD prevalence, offering insights for developing targeted public health interventions and strategies (see Fig. 22).

Random Forest Model: Feature Importance.
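A feature importance ranking of this kind can be obtained from a fitted Random Forest, as in the hedged sketch below. The column labels `maternal_age`, `estradiol`, and `fsh` are hypothetical names chosen to echo the parameters discussed in the text, and the data are synthetic. The target is deliberately constructed so that the first column dominates, which means the resulting ranking reflects the synthetic setup and not any finding of the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical column labels (assumption), paired with synthetic data.
feature_names = ["maternal_age", "estradiol", "fsh"]
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
# Made-up target: strong dependence on column 0, weak on column 1, none on column 2.
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=200)

forest = RandomForestRegressor(n_estimators=200, random_state=1).fit(X, y)

# Impurity-based importances are non-negative and sum to 1 across features.
importances = forest.feature_importances_
ranking = sorted(zip(feature_names, importances), key=lambda p: p[1], reverse=True)

for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Impurity-based importances describe how much each feature contributed to the forest's splits. They measure association within the fitted model and, as noted in the text, do not imply causation.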
Visualization of model predictions
To conduct this analysis, we examined how well the estimated values from three models, Random Forest (RF), Linear Regression (LR), and Support Vector Regressor (SVR), aligned with historical ASD prevalence trends (see Fig. 23). These models do not predict ASD prevalence but analyze statistical associations between reproductive health parameters and ASD trends.
- Random Forest (RF): Utilizes an ensemble of decision trees to capture complex relationships, often producing robust predictions.
- Linear Regression (LR): Fits a straight line through data points, providing easy interpretation but sometimes missing non-linear relationships.
- Support Vector Regressor (SVR): Maps input features into higher-dimensional spaces to handle linear and non-linear relationships.

Visualization of Model Predictions: The plot compares actual ASD prevalence with the predicted values from three models: Random Forest (RF), Linear Regression (LR), and Support Vector Regressor (SVR).
Comparison and insights
- Accuracy: Data points closest to the 45-degree line indicate the best model performance.
- Error Distribution: Tighter clustering around the line shows lower prediction errors.
- Model Strengths and Weaknesses: Identifies each model’s accuracy and error margins, highlighting RF for complex patterns and LR for linear relationships.
- Feature Importance: RF’s feature importance analysis identifies key reproductive health parameters influencing predictions.
This comparison aids in selecting the most effective model for predicting ASD prevalence and informs targeted public health strategies.
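The comparison criteria above can be made quantitative by measuring how far each model's predictions fall from the 45-degree line, that is, the absolute difference between predicted and actual values. The sketch below does this for RF, LR, and SVR on synthetic data. The data, hyperparameters, and the SVR scaling step are assumptions made for illustration, and the ranking it produces says nothing about the models in the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (assumption): three made-up predictors, made-up target.
rng = np.random.default_rng(2)
X = rng.normal(size=(150, 3))
y = 12.0 + 2.5 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.4, size=150)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=2
)

candidates = {
    "RF": RandomForestRegressor(n_estimators=200, random_state=2),
    "LR": LinearRegression(),
    # SVR is sensitive to feature scale, so it is wrapped with a scaler.
    "SVR": make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0)),
}

deviation = {}
for name, model in candidates.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    abs_err = np.abs(pred - y_test)        # vertical distance from the 45-degree line
    deviation[name] = {
        "mean": float(abs_err.mean()),     # accuracy: smaller is closer to the line
        "spread": float(abs_err.std()),    # error distribution: smaller is tighter
    }

best = min(deviation, key=lambda k: deviation[k]["mean"])
print(deviation)
print("closest to the 45-degree line on this synthetic data:", best)
```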
The findings from developing and evaluating predictive models for ASD prevalence using female reproductive health indicators (2000–2024) demonstrated the potential of statistical and machine learning techniques to provide valuable insights. The multiple regression model explained 86.5% of the variance in ASD prevalence (R² = 0.865), but high variability in cross-validation scores suggested potential overfitting and limited generalizability. In contrast, the Random Forest model achieved superior performance, explaining 96.9% of the variance (R² = 0.969) and showing lower prediction errors, reflecting higher predictive accuracy. Additionally, feature importance analysis identified critical reproductive health parameters influencing ASD prevalence, offering insights for further research and public health strategies.
Conclusion
The predictive models developed in this study demonstrated the feasibility of forecasting ASD prevalence using reproductive health indicators. The Random Forest model, in particular, showed superior performance and identified maternal age, estradiol levels, and FSH levels as the factors most strongly associated with ASD prevalence. These findings support the use of advanced machine learning techniques for accurate predictions and inform public health strategies aimed at addressing the rising prevalence of ASD. Further research with larger datasets and additional validation techniques is crucial to confirm these findings and enhance the model’s generalizability. This study lays the groundwork for future investigations into the potential link between reproductive health and ASD risk, contributing to better-targeted interventions and preventive measures.