Visual Data Mining of SARS distribution using Self-Organization Maps

Background: From a public health perspective, socioeconomic conditions correlate with the occurrence of infectious diseases. Our premise is that the number of SARS patients is a non-linear function of socioeconomic effects that are not normally distributed among regions. The objective was to integrate multivariate data sets representing social and economic factors in order to evaluate the hypothesis that regions with similar socioeconomic characteristics exhibit similar distributions of SARS. Results: The SOM algorithm used the intrinsic distributions of 21 social and economic variables to classify 31 regions into five clusters. The SOM-determined clusters were compared with the distributions of SARS outcomes. The resulting picture shows that the variability between region clusters was significant with respect to the distribution of SARS occurrence. Conclusion: Our study demonstrated a positive relationship between socioeconomic conditions and SARS outcomes across regions, using the SOM method to overcome data and methodological challenges traditionally encountered in public health research. The results demonstrated that community health can be classified using socioeconomic variables and that the SOM method may be applied to multivariate socioeconomic health studies.


INTRODUCTION
Huge amounts of data are collected every day, and common databases contain several terabytes of data, while today's data management systems provide only limited means of accessing the data. Visual Data Mining (VDM) is a theory, a methodology and a technology that combines traditional data mining with information visualization techniques [1]. It uses visualization to transform the data in a large data set into graphics or images and displays the result on screen. VDM grew out of the integration of visualization and data mining and takes advantage of both. It can analyze a large data set at once and has potential for automation, while the user remains directly involved in the exploration process. It is applicable even when little is known about the data, when the exploration goals are vague, or when the data are highly inhomogeneous and noisy. VDM thus combines the power of automatic calculation with the capabilities of human processing.
A Self-Organizing Map (SOM), also called a Kohonen network [2], is a feed-forward artificial neural network and an important method in VDM. It uses an unsupervised training algorithm to perform non-linear, non-parametric regression. It can project multidimensional data sets onto low-dimensional, usually 1-D or 2-D, displays while preserving the useful information in the raw data, which enhances the analyst's ability to extract knowledge in the form of novel patterns, structures and correlations hidden in the data. Kohonen's SOM techniques provide an excellent tool for analyzing disparate multidimensional data sets from different sources [3]. Using SOM techniques, multi-sourced data sets, even with inconsistent labeling, can be analyzed collectively to discover implicit, previously unknown and useful knowledge embedded in a complex data set. SOMs are increasingly seen as effective tools for data clustering, visualization and dimension reduction in data mining and knowledge discovery applications. SOM applications to exploratory data analysis have been successful in many fields, such as medicine, financial markets, customer segmentation, industrial engineering and manufacturing, to name a few from the thousands of SOM applications listed in Kaski et al. (1998) and Oja et al. (2002) [4][5][6][7][8]. In this paper, we use a SOM network to analyze the relationship between socioeconomic variables and the outbreak of SARS.

A. Self-Organization Map
The SOM algorithm, first introduced by Teuvo Kohonen, was developed from a basic model of information processing in the cortical cells of the human brain, known from the neurophysiological experiments of the late twentieth century (Kohonen, 1982). At the start of training, the SOM output layer units, or nodes, are assigned a random set of vectors, usually referred to as a code book. Each input vector is then presented to the SOM input nodes and matched against the output units to find the best matching unit (BMU) in the code book. Once a BMU is found, the input vector is assigned to it, and the weight vector of that output unit is moved closer to the input vector. The neighboring units of the BMU are also moved closer to the input vector. In this way the whole set of input data is assigned to BMUs in the output layer, so that similar input vectors are mapped close together on the 2-D display with most of the original attributes preserved. Hence, the trained SOM display enables analysts to view implicit, previously unknown and useful knowledge within the raw data in the form of patterns, structures and relationships.
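The training procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this study: the function names (`train_som`, `find_bmu`), the 4 x 4 grid, the Gaussian neighborhood and the linearly decaying learning rate and radius are all assumptions chosen for brevity, and the two synthetic groups of points stand in for real input vectors.

```python
import math
import random

def find_bmu(codebook, x):
    """Return the (row, col) of the unit whose weight vector is closest to x."""
    best, best_d2 = None, float("inf")
    for r, row in enumerate(codebook):
        for c, w in enumerate(row):
            d2 = sum((wi - xi) ** 2 for wi, xi in zip(w, x))  # squared Euclidean distance
            if d2 < best_d2:
                best, best_d2 = (r, c), d2
    return best

def train_som(data, rows=4, cols=4, n_iter=2000, lr0=0.5, sigma0=2.0, seed=0):
    """Train a rows x cols map; returns the code book as nested lists."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Random initial code book: one weight vector per output unit.
    codebook = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
                for _ in range(rows)]
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)                  # learning rate decays linearly towards 0
        sigma = sigma0 + (0.5 - sigma0) * frac   # neighborhood radius shrinks to 0.5
        x = rng.choice(data)                     # present one randomly chosen input vector
        br, bc = find_bmu(codebook, x)
        for r in range(rows):
            for c in range(cols):
                # Gaussian neighborhood: units near the BMU on the grid move the most.
                h = math.exp(-((r - br) ** 2 + (c - bc) ** 2) / (2.0 * sigma ** 2))
                w = codebook[r][c]
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
    return codebook

# Synthetic example: two well-separated groups of 3-D points (made-up data).
rng = random.Random(1)
low = [[rng.gauss(0.2, 0.03) for _ in range(3)] for _ in range(20)]
high = [[rng.gauss(0.8, 0.03) for _ in range(3)] for _ in range(20)]
som = train_som(low + high)
print(find_bmu(som, [0.2, 0.2, 0.2]), find_bmu(som, [0.8, 0.8, 0.8]))
```

After training, points from the two groups are expected to land on different units of the grid, which is the property the map relies on when regions with similar socioeconomic profiles are to be displayed near one another.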

B. The principle of Self-Organization Map
A typical architecture of the SOM network is shown in figure 1. The essential constituents of feature maps are as follows [9][10]:
- an array of neurons that compute simple output functions of incoming inputs of arbitrary dimensionality;
- a mechanism for selecting the neuron with the largest output;
- an adaptive mechanism that updates the weights of the selected neuron and its neighbors.
The training algorithm proposed by Kohonen for forming a feature map is summarized as follows:
Step 1: Initialization: Choose random values for the initial weights $w_j(0)$.
Step 2: Winner Finding: Find the winning neuron $j^*$ at time $k$, using the minimum Euclidean distance criterion:
$$j^* = \arg\min_j \| x(k) - w_j(k) \|, \quad j = 1, \ldots, N^2$$
where $x(k)$ represents the $k$th input pattern, $N^2$ is the total number of neurons, and $\| \cdot \|$ indicates the Euclidean norm.
Step 3: Weights Updating: Adjust the weights of the winner and its neighbors, using the following rule
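The rule referred to in Step 3 is not reproduced in this copy of the text. In its standard textbook form, which is presumably what is intended here, the Kohonen update reads as follows. The learning-rate symbol $\eta(k)$ and the neighborhood set $N_{j^*}(k)$ are notational assumptions, not symbols recovered from the original.

```latex
w_j(k+1) =
\begin{cases}
  w_j(k) + \eta(k)\,\bigl[\, x(k) - w_j(k) \,\bigr], & j \in N_{j^*}(k), \\[4pt]
  w_j(k), & \text{otherwise,}
\end{cases}
```

Here $\eta(k)$ is a positive learning rate that typically decreases with $k$, and $N_{j^*}(k)$ is the topological neighborhood of the winning neuron $j^*$ at time $k$. Only the winner and its neighbors move towards the input pattern $x(k)$; all other weights are left unchanged.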

III. CASE STUDY AREA AND VARIABLES
Severe acute respiratory syndrome (SARS) was first recognized as a global threat in mid-March 2003 [11][12]. The first known cases occurred in November 2002 and the last case in July 2003. The international spread of SARS resulted in 8098 cases in 26 countries, with 774 deaths. As an example of visual data mining using SOM, we train the map on socioeconomic data and then examine the relationship between SARS occurrence and socioeconomic factors.
The SARS data, which include the number of SARS patients in every region of China, were obtained from the report of the Ministry of Health of the People's Republic of China, and the socioeconomic data from the China Population and Employment Statistics Yearbook 2007. The data format is shown in table 1.
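The Results section trains the map first on the variables that are significantly related to SARS counts and then on all variables, but the screening step itself is not described. One simple possibility, shown purely as an illustrative sketch, is to compute the Pearson correlation between each variable and the case counts and keep those above a cutoff. The helper names (`pearson`, `select_related`), the 0.5 cutoff, the variable names and all numbers below are made up for illustration and are not taken from the study's data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def select_related(variables, outcome, threshold=0.5):
    """Names of variables whose absolute correlation with the outcome reaches the threshold."""
    return [name for name, values in variables.items()
            if abs(pearson(values, outcome)) >= threshold]

# Made-up values for five hypothetical regions (not the paper's data).
cases = [10, 20, 30, 40, 50]
variables = {
    "var_rising":    [1, 2, 3, 4, 5],   # increases with the case counts
    "var_falling":   [5, 4, 3, 2, 1],   # decreases as the case counts increase
    "var_unrelated": [2, 4, 3, 4, 2],   # no linear relation to the case counts
}
print(select_related(variables, cases))
```

A fixed cutoff on the correlation coefficient is only a stand-in for a proper significance test; with 31 regions, a test that accounts for the sample size would be the more defensible choice.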

IV. RESULT
Using only the variables significantly related to SARS occurrence as input, we obtained the clustering shown in figure 2. In this clustering the data fall into three groups: Guangdong province forms a cluster of its own, Beijing city is grouped with Shanghai city, and all other regions form a single cluster. Putting all the related factors into the net gives a second result. In both clusterings Guangdong province and Beijing city are separated out, which resembles the distribution of the SARS epidemic (see figure 4). Comparing the two results, the factors with low correlation are still important in training the net: the more information is supplied, the more exact the result.
In this result, Tianjin city and Shanghai city fall into the same cluster, and both are similar to Beijing city (the nearer on the map, the more similar). In fact, however, the numbers of SARS cases in these three cities were 2521 (Beijing), 175 (Tianjin) and 8 (Shanghai), so the situation in Beijing was very different from that in Shanghai. SARS is closely related to air temperature and environmental factors, and Beijing and Shanghai differ greatly in climate and environment, which are important in SARS transmission, even though their population and social structures are highly similar. Because of data limitations, air temperature and other environmental factors were not considered in this paper, so the result differs somewhat from reality. The result suggests that, under the same temperature and environment, Shanghai would closely resemble Beijing in SARS transmission. To obtain a more exact result, more data should be added.

In this paper, a SOM network was used to visualize the relationship between socioeconomic factors and SARS occurrence. The result shows that socioeconomic factors are related to SARS transmission: regions that are more similar socioeconomically have more similar attributes in SARS transmission. To obtain more information, more detailed data should be collected. A SOM network can find hidden information in huge amounts of data and present it in an intuitive visual form. It has a simple network structure, a strong capacity for unsupervised learning, and computes quickly. It could be useful in visual data mining for public health.