Accident Data Analysis Using Statistical Methods: A Case Study of an Indian Highway

Rahul Badgujar¹, Priyam Mishra², Mayank Chandra³, Sayali Sandbhor⁴, Humera Khanum⁵
1,2,3 Undergraduate scholars, Department of Civil Engineering, Symbiosis Institute of Technology, SIU, Pune, India.
4,5 Assistant Professors, Department of Civil Engineering, Symbiosis Institute of Technology, SIU, Pune, India.
¹rahul.badgujar@sitpune.edu.in, ²priyam.mishra@sitpune.edu.in, ³mayank.chandra@sitpune.edu.in, ⁴sayali.sandbhor@sitpune.edu.in, ⁵humera.khanum@sitpune.edu.in


A. Introduction to case study
As the number of vehicles on the road increases, road accidents are increasing at a similar rate. Road accidents are one of the biggest killers in India. Statistics suggest that one person dies in a road accident in India every four minutes [12]. Although these numbers are so high, little effort is made to make roads safer.
National Highway 9 (NH9) is a major east-west highway in India and passes through almost 7 states. The present case study analyses accident data for a 101 km stretch of this highway between the cities of Pune and Solapur in the state of Maharashtra, India. Data for 3 years are considered in the analysis. This case study is an effort to make roads safer for road users by using a prediction model developed through regression analysis [6].

II. RESEARCH METHODOLOGY
The methodology for the research work is divided into three parts:
- Analysis based on location: identifies the points on the stretch with a higher number of accidents.
- Analysis based on time: identifies the hours with the highest number of accidents on the stretch.
- Regression analysis: obtains a prediction equation from the parameters available in the accident data.

I. Analysis based on location
For the location-based analysis, the stretch is divided into 11 sections of equal length, and each section is further divided into right and left lanes. The data are sorted by lane, and the accidents recorded in each section are shown in Table 1. Figure 4 shows the percentage distribution of accidents on the basis of time.
The available data were then processed so that regression analysis could be applied to predict the possibility of accidents. The following section explains this in detail.
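The tallies by section, lane and time window can be sketched as follows. This is a minimal illustration, not the authors' procedure: the record layout, the sample values and the function names are hypothetical. The 10 km section width and the 3-hour time window are assumptions inferred from labels such as KM180-189 and 18:00-20:59.

```python
from collections import Counter

def section_label(km, width=10):
    """Return a label such as 'KM180-189' for the section containing km."""
    start = int(km // width) * width
    return f"KM{start}-{start + width - 1}"

def hour_window(hour, width=3):
    """Return a label such as '18:00-20:59' for the window containing hour."""
    start = (hour // width) * width
    return f"{start:02d}:00-{start + width - 1:02d}:59"

def count_by_section(records, width=10):
    """Count accidents per (section, lane) pair."""
    counts = Counter()
    for km, lane, _hour in records:
        counts[(section_label(km, width), lane)] += 1
    return counts

# Hypothetical records: (chainage in km, lane, hour of day)
records = [
    (183.4, "right", 19),
    (187.0, "right", 7),
    (241.2, "left", 18),
    (199.9, "left", 20),
]

counts = count_by_section(records)
by_time = Counter(hour_window(hour) for _km, _lane, hour in records)

for (section, lane), n in sorted(counts.items()):
    print(section, lane, n)
for window, n in sorted(by_time.items()):
    print(window, n)
```

Sorting the resulting counts in descending order gives the sections and time windows with the most accidents.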

III. REGRESSION ANALYSIS
The regression model is developed to predict accidents and fatalities. The factors contributing to the occurrence of accidents are taken as independent variables, and the accident is taken as the dependent variable in the regression equation.

A. Introduction
The linear regression models developed take the following form:

$$y = mx + c$$

where $y$ is the accident to be predicted, $x$ is a factor contributing to the occurrence of the accident, $m$ is the slope and $c$ is the intercept. For multiple regression with $n$ predictor variables, the regression equation is

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \varepsilon$$

where $x_1, x_2, \dots, x_n$ represent the $n$ predictor variables. The parameters play the same roles as before: $\beta_0$ is the constant, $\beta_1$ is the coefficient on the first predictor variable, $\beta_2$ is the coefficient on the second predictor variable, and so on. $\varepsilon$ is the error term, or residual, that cannot be explained by the model.
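As an illustration of how the coefficients of a multiple regression can be estimated, the sketch below solves the least-squares normal equations in plain Python. It is not the procedure used in the study, which relied on Microsoft Excel. The function names and the noise-free sample data are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Assumes the system is non-singular (no collinear predictors).
    """
    n = len(b)
    # Augmented matrix so that row operations act on both sides.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_linear_regression(xs, ys):
    """Return [b0, b1, ..., bn] that minimise the sum of squared errors."""
    design = [[1.0] + list(row) for row in xs]  # leading 1 for the constant
    k = len(design[0])
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Hypothetical data generated from y = 1 + 2*x1 + 3*x2 with no noise.
xs = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
ys = [1, 3, 4, 6, 8]

beta = fit_linear_regression(xs, ys)
print([round(b, 6) for b in beta])
```

Because the sample data contain no noise, the fitted coefficients recover the constant and the two slopes used to generate them.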
Microsoft Excel was used to perform the regression analysis on the available data. The df(Regression) is one less than the number of parameters being estimated. There are $n$ predictor variables, so there are $n$ parameters for the coefficients on those variables. There is always one additional parameter for the constant, so there are $n + 1$ parameters. Since the df is one less than the number of parameters, there are $n + 1 - 1 = n$ degrees of freedom. That is, df(Regression) = number of predictor variables. [8]
The df(Residual) is the sample size minus the number of parameters being estimated. With $N$ denoting the sample size, df(Residual) = $N - (n + 1)$, or equivalently df(Residual) = $N - n - 1$. It is easier to obtain this by subtraction once the total and the regression degrees of freedom are known. The df(Total) is still one less than the sample size, as before: df(Total) = $N - 1$.
A variance is a variation divided by its degrees of freedom, that is, MS = SS / df. The F test statistic is the ratio of two sample variances, with the denominator always being the error variance, so F = MS(Regression) / MS(Residual).
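The degrees of freedom, mean squares and F statistic described above can be computed as in the following sketch. The function name `anova_table` and the five sample observations are hypothetical. The identity SS(Regression) = SS(Total) - SS(Residual) used here assumes a least-squares fit that includes a constant term.

```python
def anova_table(ys, fitted, n_predictors):
    """Return df, SS, MS and F for a least-squares fit with a constant."""
    n = len(ys)
    mean_y = sum(ys) / n
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    ss_resid = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    ss_reg = ss_total - ss_resid        # valid for least squares with constant
    df_reg = n_predictors               # number of predictor variables
    df_resid = n - n_predictors - 1     # sample size minus parameters
    df_total = n - 1
    ms_reg = ss_reg / df_reg            # MS = SS / df
    ms_resid = ss_resid / df_resid
    return {
        "df": (df_reg, df_resid, df_total),
        "SS": (ss_reg, ss_resid, ss_total),
        "MS": (ms_reg, ms_resid),
        "F": ms_reg / ms_resid,
    }

# Hypothetical data with one predictor, fitted by simple linear regression.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x
fitted = [intercept + slope * x for x in xs]

table = anova_table(ys, fitted, n_predictors=1)
print(table["df"], round(table["F"], 4))
```

The three degrees of freedom always satisfy df(Regression) + df(Residual) = df(Total), which is a quick check on the table.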
The null hypothesis claims that there is no significant correlation at all. That is, all of the coefficients are zero and none of the variables belong in the model. The alternative hypothesis is not that every variable belongs in the model, but that at least one of the variables belongs in the model. If the p-value is reported as 0.000, we reject the null hypothesis and conclude that there is a significant correlation, so the model is useful for prediction. For an individual variable, if the coefficient is zero, that variable drops out of the model because it does not contribute significantly to it. If the p-value < 0.05, we reject the null hypothesis; otherwise we retain it. [13] [2][3]

B. Steps for Regression Analysis
1. Enter the data to be evaluated into the spreadsheet. There should be at least two columns of numbers, representing the Input Y Range and the Input X Range. Input Y represents the dependent variable, while Input X is the independent variable.
2. Open the Regression Analysis tool.
3. Define the Input Y Range.
4. Repeat the previous step for the Input X Range.
5. Click OK. The summary of the regression output appears in the designated location.
For the regression analysis, the following parameters were chosen on the basis of correlation, and regression was applied using Microsoft Excel (2007).

C. Application of regression analysis
The variables are: $y$ = Classification of accident, $x_1$ = Road Feature, $x_2$ = Road Condition, $x_3$ = Intersection Type and Control, $x_4$ = Weather Condition. Table 4 below shows the results of the regression analysis.

D. Prediction
Using the equation obtained from regression analysis of the accident data, predictions are made for different values of the independent variables along the stretch in order to find the classification of accident that can occur. The prediction model is validated using the available accident data. The error in the model is identified and its validity is checked (Table 5). With the prediction equation and proper parameters, the classification of accident that can happen on the stretch can be predicted. For this purpose, 60 random values from the available accident data were chosen as input variables and predictions were made using the equation. The predicted values were compared with the available accident data. The prediction model for classification of accident predicts 66% of the values with an error of approximately 10%, 12% of the values with an error of 20%, and 22% of the values with an error higher than 30%. Using the field details of the stretch and the prediction model, potential accident points can be predicted.
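A comparison of predicted and observed values of this kind can be sketched as follows. The sample values and function names are hypothetical, and the 10% and 20% limits are assumed readings of the error bands reported above. The percentage error is taken relative to the observed value, which must be non-zero.

```python
def percent_errors(actual, predicted):
    """Absolute percentage error of each prediction (actual must be non-zero)."""
    return [abs(p - a) / abs(a) * 100 for a, p in zip(actual, predicted)]

def share_within(errors, limit):
    """Percentage of predictions whose error does not exceed limit."""
    return 100 * sum(e <= limit for e in errors) / len(errors)

# Hypothetical observed and predicted values of the dependent variable.
actual = [1, 2, 3, 4, 2]
predicted = [1.05, 2.3, 2.9, 3.0, 2.1]

errors = percent_errors(actual, predicted)
print("within 10%:", share_within(errors, 10))
print("within 20%:", share_within(errors, 20))
print("above 20%:", 100 - share_within(errors, 20))
```

Reporting the shares at several limits, rather than a single average error, shows how many predictions are close to the observed values and how many are far off.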

V. CONCLUSION
The severity of accidents can be reduced by applying the prediction model with proper input parameters. The likelihood of accidents on the study stretch can be reduced, and the need for costly remedial work can also be reduced. The total cost of stretch safety to the community, including accidents, disruption and trauma, is minimized. Safety provisions are needed for KM180-189 on the right lane and KM240-249 on the left lane, as these sections have the highest number of accidents. Lighting provisions on the study stretch must be improved for the period 18:00-20:59 hrs.