Abstract
Two-vehicle crashes resulting from distracted driving have led to an increasing number of fatalities and serious injuries over time. This study used machine learning and econometric models to investigate two-vehicle distracted driving crashes from the Crash Report Sampling System (CRSS) within the United States. XGBoost and Random Forest were used to identify the top variables based on SHAP values, while a mixed logit model with unobserved heterogeneity was used to model injury severity. The model results indicate a complex interaction of driver characteristics, such as demographics (male drivers), driver actions (careless driving, exceeding the speed limit by more than 15 mph, hitting a stopped vehicle), drivers without a violation history, turning violations, and drinking; roadway characteristics (non-interstate highways, undivided and divided roadways with positive barrier, curved roadways, dry surface); environmental conditions (rainy weather); vehicle attributes (motorcycle, displacement volume up to 2500 cc, newer vehicle within five years of crash involvement); and temporal characteristics (4–6 PM, July–September, and year 2017). These findings underscore the importance of driving behavior and roadway design. As such, it is crucial to prioritize efforts to address distracted driving behavior through driver training and law enforcement, and to consider its implications for roadway design and maintenance.

Keywords: machine learning; unobserved heterogeneity; injury severity; multivehicle crashes; mixed logit model; CRSS

Correction Statement
This article has been corrected with minor changes. These changes do not impact the academic content of the article.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Notes
1 For more details on comparative injury severity analysis, see Islam and Mannering (2020) on aggressive and non-aggressive driving, Islam et al. (2022) on straight and curved segments, and Islam (2022b) on work-zone and non-work-zone crashes involving large trucks.
2 Temporal instability was not considered within the scope of this study, as the primary focus here is on developing an empirical framework to integrate machine learning and econometric modeling.
3 The CRSS is a weighted sample of police-reported motor vehicle crashes in the United States from 2016 through 2020. The General Estimates System (GES) of the National Highway Traffic Safety Administration was superseded by this data set. The CRSS data is a national sample drawn from the nearly six million crashes documented by police each year. The CRSS sampling system records all types of crashes, from minor to fatal.
4 The extensive discussions and justifications for accounting for unobserved heterogeneity in crash data modeling are highlighted in a study by Mannering et al. (2016). In contrast to traditional models, if unobserved heterogeneity is ignored and the effects of observable variables are restricted to be the same across all observations, the model will be mis-specified and the estimated parameters will, in general, be biased and inefficient, which could in turn lead to erroneous inferences and predictions.
5 Ohio, Indiana, Illinois, Michigan, Wisconsin, Minnesota, North Dakota, South Dakota, Nebraska, Iowa, Missouri, and Kansas in the Midwest (Region 2), per the CRSS user manual.
6 Maryland, Delaware, Washington, D.C., West Virginia, Virginia, Kentucky, Tennessee, North Carolina, South Carolina, Georgia, Florida, Alabama, Mississippi, Louisiana, Arkansas, Oklahoma, and Texas in the South (Region 3), per the CRSS user manual.