Design and Development of a Personal Scheduler Based on Continuous Speech for Smart Phones

- A scheduler is one of the most essential applications for people such as business and IT professionals. Smartphone usage is rising rapidly because the device replaces many essential gadgets and makes everyday life easier. Hands-free access, such as speech input, is therefore in high demand on smartphones. Speech recognition, which converts speech to text, is a very convenient input mechanism during hectic schedules. Hence, a smartphone application that schedules a person's activities through speech recognition is a great boon for busy users. This research aims at designing and developing a personal mobile scheduler application based on "Continuous Speech" for smartphones.


SPEECH RECOGNITION
Speech Recognition or Automatic Speech Recognition (ASR) or computer speech recognition is the process of converting speech signals to a sequence of words, by means of an algorithm implemented as a computer program [3].

A. Architecture of Speech Recognition
The architecture of a typical speech recognition system is shown in Figure 1. Using frequency response analysis, the input speech signal is processed to extract feature values, which are fed to the speech recognition engine. The engine compares the input feature values with acoustic and language models trained on previously accumulated data and determines a list of the most likely morphemes as output. The acoustic model expresses the correspondence between speech feature values and phonemes, while the language model expresses the likelihood that one morpheme precedes or follows another.

Fig. 1 Architecture of speech recognition [8]

The accuracy of speech recognition depends on how closely the acoustic and language models are trained to the actual input environment [7]. It is necessary to reflect the characteristics of the actual user when training the acoustic model, and it is important to cover a large vocabulary when training the language model so that a wide range of utterances can be recognized. Thus, a large text data set is required to build the language model.

The user's voice input is recorded by a microphone connected to a speech server. The streamed speech data is sent to the speech server, which converts it to phonetics and then to text. The scheduler automatically detects the end of the utterance, and the user can also stop the recording manually. When an ending is detected, the speech server returns the recognized sentence to the scheduler. During recognition the server can report several types of errors, such as audio recording error, client-side errors, insufficient permissions, no recognition result matched, recognition service busy, server error status and no speech input. If the scheduler fails to detect the end of the sentence and the user does not stop the recorder, the scheduler shows a timeout error message after 5 minutes.
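The error statuses and the 5-minute timeout described above can be sketched as a small status-dispatch routine. This is an illustrative sketch only: the paper does not give the actual interface, so the `RecognitionError` names and the `session_status` function are assumptions; only the error categories and the 5-minute limit come from the text.

```python
from enum import Enum, auto

# Hypothetical error statuses reported by the speech server
# (categories from the text; names are assumed).
class RecognitionError(Enum):
    AUDIO_RECORDING = auto()
    CLIENT_SIDE = auto()
    INSUFFICIENT_PERMISSIONS = auto()
    NO_MATCH = auto()
    SERVICE_BUSY = auto()
    SERVER_ERROR = auto()
    NO_SPEECH_INPUT = auto()

TIMEOUT_SECONDS = 5 * 60  # the scheduler reports a timeout after 5 minutes

def session_status(elapsed_seconds, error=None):
    """Return a user-facing status string for a recognition session."""
    if error is not None:
        return f"recognition failed: {error.name.lower()}"
    if elapsed_seconds >= TIMEOUT_SECONDS:
        return "timeout: no end of speech detected"
    return "listening"
```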
If the user cancels the recording midway, the partial voice data is still sent to the server, but the result is later discarded without any response to the user. When a correctly recognized sentence is delivered to the scheduler, the following steps take place:
- The recognized sentence is parsed into a date and an event,
- The event is entered in the scheduler, and
- The user is reminded of the scheduled event at 6 o'clock on the morning of the appointment date and one hour before the appointment.
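The reminder step above can be sketched as follows. The function name is an assumption, but the two reminder times (6 a.m. on the appointment day, and one hour before the appointment) are taken from the text.

```python
from datetime import datetime, timedelta

def reminder_times(appointment):
    """Return the two reminder times described in the text:
    6 a.m. on the appointment day, and one hour before the appointment."""
    morning = appointment.replace(hour=6, minute=0, second=0, microsecond=0)
    return morning, appointment - timedelta(hours=1)
```

For an appointment at 10 a.m. on 19 March 2016, this yields reminders at 6:00 and 9:00 that morning.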

RESULTS
An experiment has been conducted on human speech in the proposed scheduler, as shown in Fig. 3. Continuous speech is converted to text, the text is given to the scheduler, and the scheduler returns a text reminder to the user. The proposed scheduler accepts normal human conversational language for scheduling day-to-day appointments and reminds the user on the morning of the appointment date and one hour before the appointment time.

Fig. 3 shows the output, in which a toggle button starts and stops the speech recording. The button shows whether recording is in progress. When the user presses the toggle button, recording starts and the speech is converted to text. For example, if the user speaks the sentence "Doctor's appointment is on March 19th 2016 at 10 am", CSBPS converts the speech into text and parses the text into a string, a date and a time. The output of CSBPS is String: Doctor's appointment, Date: March 19th 2016 and Time: 10 am. The parsed string, date and time are given to the scheduler, where the string is set as the appointment and the date and time are set as the schedule date and time respectively.

If there is any mispronunciation, the application displays all possible sentences and lets the user select one of them. After selecting the correct sentence, the user can click the 'OK' button to see the parsed sentence. The scheduled appointment is then reminded to the user on the morning of the appointment date and one hour before the appointment time.

The scheduler has been tested with 100 sentences containing 1066 words, of which 1024 words were recognized, yielding an accuracy of 96%.
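The parsing step in the example above can be approximated with a simple pattern match. This is a minimal sketch: the paper does not give CSBPS's actual grammar, so the regular expression and the `parse_sentence` function are assumptions; only the example sentence and its expected string/date/time output come from the text.

```python
import re
from datetime import datetime

# Hypothetical pattern for sentences of the form
# "<event> is on <Month> <day> <year> at <hour> am/pm".
PATTERN = re.compile(
    r"(?P<event>.+?) is on (?P<month>[A-Za-z]+) (?P<day>\d{1,2})(?:st|nd|rd|th)? "
    r"(?P<year>\d{4}) at (?P<hour>\d{1,2}) ?(?P<ampm>am|pm)"
)

def parse_sentence(text):
    """Split a recognized sentence into an event string and a datetime."""
    m = PATTERN.search(text)
    if m is None:
        return None  # sentence does not match the appointment pattern
    hour = int(m["hour"]) % 12 + (12 if m["ampm"] == "pm" else 0)
    when = datetime.strptime(f'{m["month"]} {m["day"]} {m["year"]}', "%B %d %Y")
    return m["event"], when.replace(hour=hour)
```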
Among the 100 sentences, 80 are related to appointments while 20 are not, so the scheduler's performance in detecting sentences that are not related to appointments is also analyzed with a confusion matrix. Fig. 6 shows the percentages of recognized and non-recognized words.

Table 1 summarizes the comparison between SIRI and the proposed scheduler for normal human conversational language. The developed scheduler has been tested with three input datasets of 10, 20 and 30 sentences respectively. SIRI yields 78, 71.3 and 70.775 percent recognition for the 10-, 20- and 30-sentence datasets respectively, while the proposed scheduler yields 97, 97.95 and 97.3 percent. Thus the proposed scheduler shows improvements of 19, 26.65 and 26.525 percentage points over SIRI for the three datasets.
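The reported figures can be rechecked numerically; the values below are taken directly from the Results section, and the variable names are only for illustration.

```python
# Word-level accuracy: 1024 of 1066 words recognized (reported as 96%).
word_accuracy = 1024 / 1066

siri_pct = [78.0, 71.3, 70.775]      # SIRI recognition rates per dataset
proposed_pct = [97.0, 97.95, 97.3]   # proposed scheduler recognition rates

# Percentage-point improvement of the proposed scheduler over SIRI.
improvement = [round(p - s, 3) for p, s in zip(proposed_pct, siri_pct)]
```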

DISCUSSION
The confusion matrix determines the performance of the proposed scheduler. It describes all possible outcomes of a prediction in a table structure. The possible outcomes of a two-class prediction are True Positive (TP), True Negative (TN), False Positive (FP) and False Negative (FN). Sentences related to appointments and sentences not related to appointments that are correctly classified are counted as True Positives and True Negatives respectively. True Positives (TP) are sentences predicted to be appointments that actually are appointments. True Negatives (TN) are sentences predicted not to be appointments that actually are not. False Positives (FP) are sentences predicted to be appointments that actually are not. False Negatives (FN) are sentences predicted not to be appointments that actually are. In the confusion matrix given in Table 2, 'a' is the number of correct predictions that an instance is negative, 'b' is the number of incorrect predictions that an instance is positive, 'c' is the number of incorrect predictions that an instance is negative, and 'd' is the number of correct predictions that an instance is positive [6].
The recall and precision can be calculated using the equations shown below:

Recall = TP / (TP + FN)
Precision = TP / (TP + FP)

By applying these formulae to the developed scheduler, the recall and precision are 100% and 93.02% respectively.
Sensitivity and specificity are statistical measures of the performance of a binary classification function. Sensitivity measures the proportion of actual positives that are correctly identified, and specificity measures the proportion of actual negatives that are correctly identified. They are calculated as shown below [10]:

Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)

By applying the above formulae to the developed scheduler, the sensitivity and specificity are computed as 100% and 70% respectively.
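The four metrics can be computed from the confusion matrix counts as follows. The counts below are inferred from the paper's 80/20 sentence split and its reported recall, precision and specificity; they are assumptions, not stated explicitly in the text.

```python
# Inferred confusion matrix counts: 80 appointment sentences all detected
# (recall 100% -> FN = 0), 6 false alarms among the 20 non-appointment
# sentences (specificity 70% -> TN = 14, FP = 6). These are assumptions.
TP, FN, FP, TN = 80, 0, 6, 14

recall = TP / (TP + FN)          # also called sensitivity
precision = TP / (TP + FP)
specificity = TN / (TN + FP)
```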
CONCLUSION
This research work is concerned with the design and development of a personal mobile scheduler application based on "Continuous Speech", wherein inputs are given in normal human conversational language for scheduling meetings, the user is reminded of the scheduled events, and the effectiveness of Speech Recognition (SR) is improved. A comparative study of existing tools reveals that SIRI has limited accuracy for normal conversational language. Google Now, although highly accurate, does not pass the sentence to a scheduler; since it is embedded inside Google Search, it sends the given speech to the search engine instead. The proposed scheduler is developed for normal human speech, specifically for scheduling appointments, and reminds the user of the appointment on the morning of the appointment date and one hour before the appointment time. The concepts developed are tested through a mobile application, developed on the Android platform using Eclipse, which yields good performance.

REFERENCES