Smart Design Pattern and Applicable Model for Fine Tuning of Sensor Data Analysis with Local Data Historian

Abstract—Process industries are heavily equipped with wireless sensors for monitoring locations where human intervention must be limited. Existing approaches to sensor data analysis and anomaly detection using predictive analytics in process industries deliver only average performance, are applicable only to standalone installations, and require substantial memory and processing speed. In the current work we propose two solutions that significantly improve applicability and performance in both standalone and distributed environments. The standalone model is implemented through file-level partition-based data analysis, and the distributed analysis is implemented through an intra-node cluster with a local historian. Users are free to choose either model based on their requirements. Prescriptive analysis is used here because outputs, i.e., old values, are fed back as input to the proposed elastic clustering algorithm. Our simulation results show the performance of both models in terms of time and data size.

B. Apache Cassandra: Apache Cassandra is a highly scalable, high-performance distributed database designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. It is a row-partitioned NoSQL database.
III. PROPOSED APPROACH
We propose two solutions in the current work to significantly improve applicability and performance in both standalone and distributed environments. The standalone model is implemented through file-level partition-based data analysis, and the distributed analysis is implemented through an intra-node cluster with a local historian.
A. Standalone Model
This model can work on a single system with minimal primary memory and a moderate hardware environment; it is a pure desktop application. The following modules explain the standalone model of sensor data analysis.
1) Extracting Data from the Remote Process Data Historian Server: A plug-in is required to extract the data according to the historian model and provider. The plug-in is installed at the remote server, where it first exports the data from the historian to a local spreadsheet file. That file is then downloaded using a remote desktop application, email, or a cloud drive. The file must not exceed a fixed size; this is not a limitation of the plug-in but of spreadsheet applications and text editors. To overcome it, the only solution is to partition the data and export it in parts.
2) Building the Database Schema: This module builds a separate database for each process industry to store information about sensors and the values of selected sensors for further analysis. The number of columns allowed per table in the selected database is limited, so the data must be divided across databases accordingly. For this reason, a vertically partitioned database structure is developed for the proposed standalone approach. The databases are created based on the number of sensors present in the input file header. Each database table holds at most a predefined constant number of columns, "max-allowed-columns". These columns are further divided into two types: one for the actual data, and one to store the status of the corresponding data value.
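The vertical-partition mapping described above can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the table limit, the pairing of each sensor with a value column plus a status column, and all identifiers are assumptions for demonstration.

```java
// Maps a sensor (by 0-based index in the file header) to its table and
// columns under a "max-allowed-columns" limit. Each sensor occupies two
// columns: one for the value, one for its status. All names/limits assumed.
public class VerticalPartitionMap {
    static final int MAX_ALLOWED_COLUMNS = 1000;                 // DB-specific limit (assumed)
    static final int SENSORS_PER_TABLE = MAX_ALLOWED_COLUMNS / 2; // value + status per sensor

    /** Table that stores the given sensor. */
    static int tableIndex(int sensorIndex) {
        return sensorIndex / SENSORS_PER_TABLE;
    }

    /** Column of the sensor's value inside its table; status sits in the next column. */
    static int valueColumn(int sensorIndex) {
        return (sensorIndex % SENSORS_PER_TABLE) * 2;
    }

    /** Number of tables needed for the sensors found in the file header. */
    static int tablesNeeded(int totalSensors) {
        return (totalSensors + SENSORS_PER_TABLE - 1) / SENSORS_PER_TABLE;
    }
}
```

With this mapping, 2000 sensors (as in the simulation later) would spread across four tables.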
3) Local Partitioning of the Downloaded File: The "max-file-size" property is used to partition the file into smaller files. The file must first be partitioned horizontally. Each partition is then loaded into a matrix, and the header is extracted to determine the database table of each sensor. These files must then be partitioned vertically in the same order as described in the previous module. This is an essential step for bulk export of data from file to database.
4) Machine State Assignment: The current sensor data is compared with past sensor data, and the state of the machine is marked using the available state behaviors. This state identification can be achieved through various data mining and machine learning techniques, such as regression analysis. The proposed work, however, depends on a non-parametric unsupervised learning technique called "elastic clustering". This novel method is proposed for sensor state analysis to handle concept drift in sensor data streams. The algorithm is as follows.
Algorithm: Elastic Clustering
Input: Dataset, Steady State Range
Output: Clusters
Step 1: Initialize sim-th to a random number in (0, 1).
Step 2: Load existing similarity thresholds, if any, from the database for all sensors.
Step 3: Apply the incremental adaptive micro-clustering proposed in our previous paper [1] with the current sim-th.
Step 4: Auto-tune sim-th based on the quality of the clusters, using the existing similarity thresholds and additional random candidates.
Step 5: Repeat Steps 3 and 4 until the best-quality clusters are found, i.e., by calculating the error variance among the errors of previous iterations.
Step 6: Extract the best threshold and store it in the database for the respective sensor.
Step 7: Mark the cluster states and, accordingly, mark the state of the points that fall in those clusters.
Step 8: Return the final clusters.
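The auto-tuning loop (Steps 3-6) can be sketched as below. This is a deliberately simplified one-dimensional illustration: the real algorithm uses the incremental adaptive micro-clustering of [1] and a Davies-Bouldin quality measure, whereas here nearest-mean assignment and mean within-cluster variance stand in for them, and all identifiers are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

public class ElasticClusteringSketch {

    /** Micro-cluster: stores statistics only, never the raw points long-term. */
    static class MicroCluster {
        double sum, sumSq;
        int n;
        void add(double x) { sum += x; sumSq += x * x; n++; }
        double mean() { return sum / n; }
        double variance() { return sumSq / n - mean() * mean(); }
    }

    /** One clustering pass: each value joins the nearest cluster within sim-th. */
    static List<MicroCluster> cluster(double[] data, double simTh) {
        List<MicroCluster> clusters = new ArrayList<>();
        for (double x : data) {
            MicroCluster nearest = null;
            double nearestDist = Double.MAX_VALUE;
            for (MicroCluster c : clusters) {
                double d = Math.abs(c.mean() - x);
                if (d <= simTh && d < nearestDist) { nearest = c; nearestDist = d; }
            }
            if (nearest == null) { nearest = new MicroCluster(); clusters.add(nearest); }
            nearest.add(x);
        }
        return clusters;
    }

    /** Quality stand-in: mean within-cluster variance (lower is better). */
    static double quality(List<MicroCluster> clusters) {
        double total = 0;
        for (MicroCluster c : clusters) total += c.variance();
        return total / clusters.size();
    }

    /** Steps 3-6: re-cluster under each candidate sim-th, keep the best one. */
    static double tuneThreshold(double[] data, double[] candidates) {
        double bestTh = candidates[0], bestQ = Double.MAX_VALUE;
        for (double th : candidates) {
            double q = quality(cluster(data, th));
            if (q < bestQ) { bestQ = q; bestTh = th; }
        }
        return bestTh;
    }
}
```

The key "elastic" property is visible even in this sketch: the number of clusters is not fixed in advance but emerges from the threshold and the data.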
According to the proposed algorithm, clusters do not store data points for long. Data points are retained only to calculate the quality of the cluster; once the best quality is found, they are cleared from their respective clusters. These are micro-clusters, i.e., they store only cluster statistics such as mean, variance, number of points, and timestamps. The next step in the algorithm involves states. Four states are considered: Shut-down, Transient, Steady State, and No Change. Shut-down means the machine is not running. Transient means no stable value is maintained within a range for a sufficient time. Steady State means the values generated by the machine, or by the sensors monitoring it, lie within the supplied range; this state is critical for any further analysis, because machine behavior can be predicted only when the machine is in steady state. No Change is used to identify interface or sensor failures, detected when the data generated by a sensor remains unchanged. Finally, cluster quality is calculated using the Davies-Bouldin index proposed in [2].
5) Exporting Data to the Database: After state assignment, the matrix is updated with state codes for all sensor values and written back to the data file. A bulk loader is then applied to update the database table in which the current set of sensors is placed. This data can then be used by any data analysis technique.
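The four states described in the state-assignment module could be decided from a window of recent readings along the following lines. The exact detection rules are not spelled out in the text, so the checks here (all-zero readings for Shut-down, identical readings for No Change, the supplied range for Steady State, everything else Transient) are illustrative assumptions.

```java
// Illustrative classifier for the four machine states; thresholds and rules
// are assumptions, not the paper's exact method.
public class MachineState {
    enum State { SHUT_DOWN, NO_CHANGE, STEADY, TRANSIENT }

    static State classify(double[] window, double steadyLow, double steadyHigh) {
        boolean allZero = true, allSame = true, allInRange = true;
        for (double v : window) {
            if (v != 0.0) allZero = false;
            if (v != window[0]) allSame = false;
            if (v < steadyLow || v > steadyHigh) allInRange = false;
        }
        if (allZero) return State.SHUT_DOWN;   // machine not running
        if (allSame) return State.NO_CHANGE;   // likely sensor/interface failure
        if (allInRange) return State.STEADY;   // safe window for prediction
        return State.TRANSIENT;                // no stable value maintained
    }
}
```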
In this way, the standalone model can work with any number of records and any number of columns, although it is somewhat time-consuming. To reduce the time complexity, concurrent execution on a multi-core system is possible; the application reads the system configuration and applies this when the selected system has multiple cores.
B. Distributed Analysis
This is largely a hybrid approach, implemented through an intra-node cluster with a local data historian and a cloud interface. We simulate this model locally. The model consists of the following modules.
1) Extracting Data: The OPC (OLE for Process Control) protocol through which sensor data is accessed has no data storage capacity, and the value of each sensor is periodically replaced with a new value, so the previous data must be stored each time. The data generated from the sensors over time is called the sensor data stream, and the data read through OPC is called data-in-highway or snapshot data. For analysis, this data must be stored. Process industries may or may not have data historians: they might discard the data, retain it for some period (the last month or six months, for example), or use traditional relational databases. Relational databases, however, are not scalable, do not support high dimensionality, and have high query response times, which degrades performance. To overcome these limitations, the proposed approach uses a columnar database instead of a traditional relational database. Columnar databases have several prominent features compared to an RDBMS, including high-speed query processing, significant compression, high scalability, and support for high dimensionality. Among the available columnar databases, Apache Cassandra [3] is one of the best options, and the proposed approach uses it as the local data historian at the data processing centre. To store the data in this local historian, however, periodic interaction with a cloud drive is required.
This is because data from the highway is maintained in flat files for some period and flushed periodically; before flushing, those files are stored in the cloud drive for remote access, making them available to remote data processing centres.
2) Cloud Plug-in: The next stage of the work is a cloud plug-in to download data from the cloud to the local data historian. This plug-in uses a third-party API corresponding to the chosen cloud service provider; for example, if Microsoft Azure is used, the respective cloud API has to be used to develop the plug-in. The plug-in serves as an interface between the industry sensor network and the data processing centre. In our case, a common cloud gateway is developed so that any cloud API can be accessed from a single window with the supplied credentials.
3) Local Data Historian: As discussed in the previous module, Apache Cassandra is used as the local data historian. It is a hybrid NoSQL data store, row-partitioned and partially columnar, whose stable version was recently released this year (January 18, 2016). Sensor data streams are treated as temporal or time-series data, so each point is associated with a timestamp. In this model, one record is a snapshot retrieved from the DCS (Distributed Control System) at the process industry, uniquely identified by its timestamp. Snapshots are read periodically at one-minute or n-minute intervals, based on a user parameter. Each snapshot contains the required number of values for the respective sensors, together with a timestamp. The database structure required to build the data historian includes the SensorMaster, DataArchive, ErrorCodes, and Interfaces column families.
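A toy in-memory stand-in for the DataArchive idea above: one record per snapshot, uniquely keyed by its timestamp, plus the arithmetic linking the read interval to the number of snapshots per hour. The real historian keeps these rows in Cassandra; all names here are illustrative.

```java
import java.util.TreeMap;

public class SnapshotArchive {
    // One record per DCS snapshot, keyed by timestamp (millis); a TreeMap
    // stands in for the Cassandra DataArchive column family.
    private static final TreeMap<Long, double[]> ARCHIVE = new TreeMap<>();

    /** Stores a snapshot and returns the archive size (same timestamp overwrites). */
    static int storeAndCount(long timestampMillis, double[] sensorValues) {
        ARCHIVE.put(timestampMillis, sensorValues);
        return ARCHIVE.size();
    }

    /** Snapshots expected per hour at an n-minute read interval. */
    static int snapshotsPerHour(int intervalMinutes) {
        return 60 / intervalMinutes;
    }
}
```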
A bulk data loader program is used to store the data downloaded from the cloud drive; it avoids the time complexity of inserting records individually. After this step, the cloud drive is cleared. The proposed local data historian can work in both standalone and distributed modes, so the data can be stored on a single system or on several. The proposed system, however, uses a single system for the data historian, while the other systems act as processing systems that access the data from the local data historian.
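The bulk loader's advantage over row-by-row insertion can be sketched as follows; the batch size and identifiers are illustrative assumptions, not the paper's actual loader.

```java
import java.util.ArrayList;
import java.util.List;

public class BulkLoader {
    static final int BATCH_SIZE = 500;   // assumed tuning parameter

    /** Splits rows into fixed-size batches; each batch would be one bulk write. */
    static List<List<String>> toBatches(List<String> rows) {
        List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < rows.size(); i += BATCH_SIZE)
            batches.add(rows.subList(i, Math.min(i + BATCH_SIZE, rows.size())));
        return batches;
    }

    /** Write requests issued: one per batch instead of one per row. */
    static int writeRequests(int rowCount) {
        return (rowCount + BATCH_SIZE - 1) / BATCH_SIZE;
    }
}
```

For example, 1200 downloaded rows would cost 3 bulk writes rather than 1200 individual inserts.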

4) Distributed Processing
As discussed in the earlier modules, the volume of sensor data is high and it is difficult to run the analysis on a single system. For that purpose, the proposed system introduces a hybrid model combining RMI and Apache Cassandra. Sensor analysis runs as a service on each system, and RMI (Remote Method Invocation) is used to set or get the state of the service; it acts as a bridge between the actual client process (the sensor analytics) and the data historian. The systems participating in the sensor analysis are treated as a cluster: each system in the cluster gets data from the local data historian, processes it, and pushes the results back to the historian. The machine state assignment and elastic clustering discussed earlier are converted into a parallel and distributed algorithm. Here, "parallel" means efficient utilization of the multi-core processors on each system, and "distributed" means that a unique subset of the sensor data is processed at each node, with all results finally merged at the server. This process is partly like the Map-Reduce framework in Hadoop.
IV. RESULTS
A. Simulation Environment
In the Java domain, the Matrikon OPC Simulation Server, OpenScada as the OPC server interface, and Apache Cassandra as the data historian are used to design and develop our simulation environment. We set up the simulation environment for both the standalone and distributed models; by tuning some of the parameters in the configuration file, Cassandra can serve both. Real-time data is used in the simulation to test the workload; due to a privacy agreement with the data providers, we cannot disclose the source of the data. The proposed model is simulated with a single-node cluster only; its performance and availability would increase further if it were deployed in a multi-node environment. Sensors are sampled at different rates, e.g., one sensor reading is noted for each one-minute interval while another is noted for each five-minute interval in an hour.
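The sampling mismatch just described (one reading per minute versus one per five minutes) is reconciled by linear interpolation so that every sensor ends up with the same number of readings per hour. A minimal sketch, with illustrative names:

```java
public class Upsample {
    /** Linearly interpolates `coarse` readings up to `targetLen` readings. */
    static double[] interpolate(double[] coarse, int targetLen) {
        double[] fine = new double[targetLen];
        for (int i = 0; i < targetLen; i++) {
            // Fractional position of the i-th fine sample along the coarse series.
            double pos = (double) i * (coarse.length - 1) / (targetLen - 1);
            int lo = (int) pos;
            int hi = Math.min(lo + 1, coarse.length - 1);
            double frac = pos - lo;
            fine[i] = coarse[lo] * (1 - frac) + coarse[hi] * frac;
        }
        return fine;
    }
}
```

For instance, a five-minute sensor's 12 hourly readings can be expanded to 60 to line up with a one-minute sensor.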
Accordingly, the first sensor has 60 readings per hour while the second has only 12, but for data analytics all sensors must maintain an equal number of readings. Interpolation is used to achieve this; linear interpolation is sufficient here. As a result, generating the readings takes slightly longer than the raw data download. The figure above shows the result of elastic clustering. In that figure, a signature represents a group of sensors that technically monitor one section, module, or machine in the real plant; the data of all those sensors influences one sensor, called the dependent sensor, whose value is to be predicted by the proposed algorithm. Total fields is the total number of independent sensors participating for a section or machine. Total clusters is the number of clusters formed by elastic clustering; this number can vary with the threshold and the sensor data stream values. The table in the figure describes the micro-clusters: cluster id is the serial number of the cluster, density is the number of points or values that fell into the cluster, first timestamp is when the cluster was created, last timestamp is when it was most recently updated, mean is the average of all values in the cluster, and var is the variance of the cluster values. For one month of data from 2000 sensors, the clustering process took less than two minutes to generate the above clustering results. We considered 2000 sensors because most medium and semi-large industries require analysis of about 2000 sensors' data, even though they have more than that number installed.
V. CONCLUSION
The proposed approach fine-tunes performance and improves the scalability and availability of the data with two different approaches. The standalone model is implemented through file-level partition-based data analysis, and the distributed analysis is implemented through an intra-node cluster with a local historian.
Users are free to choose either model based on their requirements. The proposed approach is economically and technically feasible even for small and medium-sized industries, and it is also a more reliable model. In general, 1 TB of disk space can store nearly a decade of time-series data for 2000 sensors, ready to be queried and analyzed at any time. With the proposed model, machine behavior and performance can be evaluated periodically and the necessary actions taken in the production or work environment, reducing operational, calibration, and machine and manpower safety costs. In simple words, the proposed work is a pilot for "smart big time-series analytics with a local data historian".

VI. FUTURE WORK
We have simulated the current work. It can be extended through real-time deployment and by testing its performance and security issues.