Development of Methodology for Depth Estimation for Images and Videos

Abstract: Recent experience with stereoscopic images and videos has sharply increased demand for them. Although a 3D stereoscopic view enhances visual quality compared to 2D, the depth information required to generate a 3D view is unavailable for existing 2D content, so there is a strong need to generate depth information. This paper fuses four monocular cues, namely motion, the Aerial Perspective (AP) cue, the Linear Perspective (LP) cue, and the defocus cue, to estimate depth. The proposed system includes a mechanism to re-estimate the depth map when the estimate is inaccurate, for example under fast motion or false foreground estimation. The algorithm is tested under different conditions: camera motion with multiple objects, a static camera with a stationary background, a highly dynamic foreground, a background with little motion, and motion behind the foreground. The experimental results show that the generated depth map is very close to the real depth map, so the algorithm can be applied to 2D-to-3D conversion. To evaluate performance, the results are compared with existing algorithms, and a subjective evaluation test was performed on the proposed algorithm. The results show that the proposed system performs well.


II. EXPERIMENTAL METHODOLOGY
This paper uses four monocular depth cues to estimate depth: the motion cue, the aerial perspective cue, the defocus cue, and the linear perspective cue. Key frames are first extracted from a scene, and the four cues are then extracted. A multi-cue depth estimation module is proposed to generate the depth map. Foreground and background are separated to reduce complexity. Finally, a Depth Image Based Rendering (DIBR) algorithm produces the 3D views. The workflow of the proposed system consists of the following steps:
• Key frame and non-key frame extraction
• Estimation of depth from four monocular cues: motion, LP cue, AP cue, and defocus
• Multi-cue fusion and depth refinement
• Stereo view frame synthesis

A. Motion Monocular Cue
The algorithm focuses on the moving objects in the scene. Moving objects are treated as foreground, and non-moving objects are treated as background. Because the foreground and background must be separated, a pixel-based (frame-based) method is employed to extract them. The difference between the current frame and the past frame is calculated, and this difference is compared to a threshold value. The threshold is calculated from the standard deviation $\sigma_t$, where $N$ is the number of pixels and $\bar{p}$ is the mean of the pixel values for the frame.
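The frame-difference step can be sketched as follows. This is a minimal illustration, not the paper's exact formula: the threshold is assumed to be a multiple `k` of the standard deviation of the difference image, and both `k` and the function name `foreground_mask` are hypothetical.

```python
from statistics import pstdev


def foreground_mask(current, past, k=1.0):
    """Return a binary mask (1 = moving pixel) obtained by frame differencing.

    current, past: 2D lists of grayscale intensities with the same shape.
    k: hypothetical scale factor applied to the standard deviation.
    """
    # Absolute per-pixel difference between the current and the past frame.
    diff = [[abs(c - p) for c, p in zip(row_c, row_p)]
            for row_c, row_p in zip(current, past)]

    # Assumption: the threshold is k times the population standard deviation
    # of the difference image (N pixels, deviation taken about their mean).
    flat = [value for row in diff for value in row]
    threshold = k * pstdev(flat)

    # Pixels whose difference exceeds the threshold are marked as foreground.
    return [[1 if value > threshold else 0 for value in row] for row in diff]
```

With two frames that differ only at one pixel, the mask marks that pixel alone as foreground; if the frames are identical, the threshold is zero and the mask is empty.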

B. Linear Perspective Monocular Cue
To estimate a depth map from the linear perspective monocular cue, the vanishing point is used. The vanishing point is the point in the image plane at which parallel lines in the scene appear to converge, although they do not actually meet. Locating the vanishing point accurately is therefore the key task in estimating depth. The following steps are used to estimate the depth map from the linear perspective monocular cue:
• Extract edges with the Canny edge detector.
• Remove noise from the edge-detected image.
• Apply the Hough transform to identify straight lines in the image.
• Locate the intersection points of the detected lines.
• Find the vanishing point.
• Estimate the depth map from the vanishing point.
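The last three steps can be sketched as follows, assuming the Hough transform has already returned lines in normal form (rho, theta). The least-squares intersection and the radial depth gradient are illustrative assumptions, not the paper's stated method, and the function names are hypothetical. The sketch also assumes that depth level 0 denotes the farthest point.

```python
import math


def vanishing_point(lines):
    """Least-squares intersection of lines given as (rho, theta) pairs.

    Each line satisfies x*cos(theta) + y*sin(theta) = rho (Hough normal form).
    """
    # Accumulate the 2x2 normal equations (A^T A) v = A^T b.
    a11 = a12 = a22 = b1 = b2 = 0.0
    for rho, theta in lines:
        c, s = math.cos(theta), math.sin(theta)
        a11 += c * c
        a12 += c * s
        a22 += s * s
        b1 += c * rho
        b2 += s * rho
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        raise ValueError("lines are parallel; no unique intersection")
    # Solve the 2x2 system with Cramer's rule.
    x = (a22 * b1 - a12 * b2) / det
    y = (a11 * b2 - a12 * b1) / det
    return x, y


def depth_from_vanishing_point(width, height, vp):
    """Assign depth levels 0-255 that grow with distance from the vanishing point.

    Assumption: the vanishing point is the farthest point (level 0) and the
    pixel farthest from it is the nearest (level 255).
    """
    vx, vy = vp
    dist = [[math.hypot(x - vx, y - vy) for x in range(width)]
            for y in range(height)]
    max_dist = max(max(row) for row in dist) or 1.0
    return [[round(255 * d / max_dist) for d in row] for row in dist]
```

In practice the lines passed to `vanishing_point` would first be filtered, since noisy or unrelated edges pull the least-squares estimate away from the true vanishing point.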

C. Aerial Perspective Monocular Cue
Aerial perspective is a result of the scattering of light. When a scene is viewed from a distance, vapour and dust in the atmosphere scatter the light, which gives the landscape blurred outlines. This paper uses a dark channel prior based algorithm for removal of haze in the scene. The aerial perspective cue [30] of pixel $p$ is defined in terms of $\theta$, a variable used to adjust irradiance, and $t(p)$, the medium transmission, which is calculated using the dark channel prior explained in [30]. The dark channel takes the lowest intensity at each pixel across the RGB channels, and $\Omega$ is the window of the minimum filter.
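The dark channel and the transmission estimate can be sketched as follows. This follows the commonly used dark channel prior formulation, $t(p) = 1 - \omega \min_{q \in \Omega(p)} \min_c I^c(q)/A^c$. The atmospheric light `atmospheric_light` is assumed to be known, and `omega`, `radius`, and the function names are illustrative assumptions rather than values from this paper.

```python
def dark_channel(image, radius=1):
    """Dark channel of an image given as a 2D list of (r, g, b) tuples."""
    height, width = len(image), len(image[0])
    # Lowest intensity across the RGB channels at each pixel.
    min_rgb = [[min(pixel) for pixel in row] for row in image]
    dark = []
    for y in range(height):
        row = []
        for x in range(width):
            # Minimum filter over the window Omega centred on (x, y),
            # clipped at the image borders.
            window = [min_rgb[j][i]
                      for j in range(max(0, y - radius), min(height, y + radius + 1))
                      for i in range(max(0, x - radius), min(width, x + radius + 1))]
            row.append(min(window))
        dark.append(row)
    return dark


def transmission(image, atmospheric_light, omega=0.95, radius=1):
    """Estimate the medium transmission t(p) from the dark channel.

    atmospheric_light: assumed (r, g, b) tuple of non-zero values.
    omega: assumed haze-retention factor.
    """
    # Normalise each channel by the atmospheric light before taking minima.
    normalised = [[tuple(c / a for c, a in zip(pixel, atmospheric_light))
                   for pixel in row] for row in image]
    dark = dark_channel(normalised, radius)
    return [[1.0 - omega * d for d in row] for row in dark]
```

Lower transmission corresponds to more haze between the camera and the scene point, which is why it can serve as a cue for distant regions.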

D. Defocus Monocular Cue
The mechanism provides an interactive system for adding the defocus cue so that the resulting depth map is close to the real depth map. The depth value from the defocus cue is calculated from the following quantities: $\alpha_{\min}(I)$ is the minimum depth value in the image; $s$ is the control factor, which is set to 0.05 in our system; and $r$ is a variable determined by the stereo effect, which is set to 0.5.

E. Fusion of Monocular Cues
The algorithm uses four monocular cues: the motion cue, the AP cue, the LP cue, and the defocus cue. Each cue produces its own depth estimate. Combining the cues compensates for the weaknesses of each individual cue and gives a better depth estimate, so this paper fuses the depth maps of the individual cues. The proposed system interacts with the user to select particular shots for depth refinement. The proposed algorithm divides an input scene into two regions: foreground and background.
a) Background Region: The depth of the objects that make up the background is obtained from the linear perspective and aerial perspective cues, where $w_{LP}$ and $w_{AP}$ represent the weights of the linear perspective cue and the aerial perspective cue. These weight values are determined by the human visual system as explained in [22].

For stereo view synthesis, $f$ is the focal length of the camera and $B$ is the distance between the left and right cameras. $Z$ represents the depth value of each pixel and is calculated from $d$, the depth level from 0 to 255, and $z_{\min}$ and $z_{\max}$, the minimum and maximum depth values.
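The quantities defined above can be tied together in a short sketch. Three forms are assumed here because the original equations are not available: the background fusion is taken to be a weighted sum of the LP and AP depth maps, the depth level is converted to $Z$ with the inverse-depth mapping commonly used in DIBR, and the disparity is taken as $fB/Z$. The default weights and all function names are hypothetical.

```python
def fuse_background(depth_lp, depth_ap, w_lp=0.5, w_ap=0.5):
    """Assumed weighted-sum fusion of LP and AP depth maps (2D lists).

    The default weights are placeholders, not the values derived in [22].
    """
    return [[w_lp * lp + w_ap * ap for lp, ap in zip(row_lp, row_ap)]
            for row_lp, row_ap in zip(depth_lp, depth_ap)]


def level_to_depth(d, z_min, z_max):
    """Map a depth level d in [0, 255] to a depth value Z.

    Assumed convention: d = 255 maps to z_min (nearest) and d = 0 maps to
    z_max (farthest), with linear interpolation in inverse depth.
    """
    return 1.0 / ((d / 255.0) * (1.0 / z_min - 1.0 / z_max) + 1.0 / z_max)


def disparity(d, f, B, z_min, z_max):
    """Assumed horizontal disparity f * B / Z for a pixel with depth level d."""
    return f * B / level_to_depth(d, z_min, z_max)
```

Under these assumptions, near pixels (large `d`) receive a large disparity and far pixels a small one, which is the behaviour a DIBR stage needs when shifting pixels to synthesize the left and right views.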

III. EXPERIMENTAL RESULTS
The experimental platform is a PC with a 2.90 GHz i5 processor and 4 GB of RAM. The algorithm can be applied to image sequences as well as video sequences.

A. Simulated and Experimental Results for Image Sequences
Four depth-map-based stereoscopic image sequences, Interview, Orbi, Break dance, and Ballet, were used in the experiment; they cover different conditions. Interview is captured with a static camera against a static background. Orbi has camera motion and multiple objects. The Break dance and Ballet sequences are taken from the multiview image sequences [33], which were generated by the Interactive Visual Media group at Microsoft Research, and both are captured by a stationary camera. Break dance contains a highly dynamic break dancer in the foreground and a number of supporters with little motion in the background. Ballet has a stationary observer in the foreground and a ballet dancer behind the foreground observer. These datasets include true depth maps, so the performance of our 2D-to-3D conversion system can be compared against them. Fig. 1 and Fig. 2 show the experimental evaluation of the estimated depth map. The foreground object and depth estimation may be incorrect if the object is in fast motion, so the mechanism provides a user interface to adjust the depth estimate as required. It is not possible to make the estimated depth map match the true depth map exactly; the goal is to approximate the true depth map as closely as possible. Our system uses four monocular cues: motion, LP, AP, and defocus. In still images the motion cue does not exist, so it is not used. To evaluate the performance of the proposed algorithm, it is compared with three different algorithms [36], [37], and [38], as shown in Fig. 3.

B. Simulated and Experimental Results for Video Sequences
The proposed algorithm can also produce a stereoscopic view for real-time 2D video as well as for video test sequences. Three depth-map-based stereoscopic video sequences, Poznan Street, Kendo, and Akko&Kayo, were used in the experiment; they cover different conditions. These datasets include the true stereoscopic view captured by a stereoscopic camera, and the results of the proposed method are compared against it. The four monocular cues (motion, AP cue, LP cue, and defocus cue) are selected first. The depth from each individual cue is estimated, and these depth maps are then fused to obtain the final depth map. Fig. 4 shows the converted 3D left and right views (stereoscopic view), which show the 3D effect when viewed through red-cyan glasses.

C. Subjective Evaluation of Proposed Algorithm
A subjective evaluation test was performed on the proposed system. The results obtained by the proposed system are compared with the true stereoscopic view captured by a stereoscopic camera. The Poznan Street, Kendo, and Akko&Kayo video sequences were selected for this test. The depth map was generated using the left view of the available stereoscopic pair. The evaluation parameters were depth quality and visual comfort, as described in ITU-R BT.500-13. The synthesized two-view video was displayed on a 3D monitor with active-shutter glasses. Fifteen volunteers, comprising professors, students, and non-teaching staff, took part in the evaluation. The volunteers watched the stereoscopic video captured by the stereoscopic camera and the stereoscopic view generated by the proposed algorithm in random order and were asked to rate each video. Depth quality was assessed on a five-segment scale: Excellent (80-100), Good (60-80), Fair (40-60), Poor (20-40), and Bad (0-20) [34]. Visual comfort was assessed by comparing the generated depth map (proposed algorithm) with the true depth map, on the scale Very comfortable (80-100), Comfortable (60-80), Mildly comfortable (40-60), Uncomfortable (20-40), and Extremely uncomfortable (0-20). The video sequences were shown in random order and repeated multiple times to make the ratings more reliable.
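Ratings on the depth quality scale can be aggregated as in the following sketch. It assumes the five segments are contiguous 20-point bands (so Poor covers 20-40) and that a score on a band boundary belongs to the higher band. Both conventions, and the function names, are assumptions made for illustration.

```python
def depth_quality_label(score):
    """Map a 0-100 rating to the five-segment depth quality scale."""
    if not 0 <= score <= 100:
        raise ValueError("score must lie in [0, 100]")
    # Lower bound of each band, checked from the best category downwards.
    bands = [(80, "Excellent"), (60, "Good"), (40, "Fair"),
             (20, "Poor"), (0, "Bad")]
    for lower, label in bands:
        if score >= lower:
            return label


def mean_score(scores):
    """Average rating over all volunteers for one video sequence."""
    return sum(scores) / len(scores)
```

For example, ratings from several volunteers for one sequence can be averaged with `mean_score` and the result passed to `depth_quality_label` to obtain the category for that sequence.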

IV. CONCLUSION
This paper has presented an effective semi-automatic 2D-to-3D conversion algorithm. The proposed algorithm uses four selected monocular cues: AP, LP, motion, and defocus. Each cue has its own advantages: the motion cue helps differentiate the foreground from the background, the defocus cue is better suited to estimating mid-range depth, the aerial perspective cue can estimate the depth of objects at long distances, and the linear perspective cue provides the basic depth layout of the scene. The performance of the proposed algorithm is evaluated by comparison with existing algorithms. The experimental results show that the estimated depth maps are very close to the true depth maps. A subjective assessment was also performed on the proposed algorithm, and its results show that the proposed system performs well. In future work, we may integrate a system that allows 3D content to be viewed with the naked eye.