Information-Theory Analysis of Cell Characteristics in Breast Cancer Patients

The problem of selecting a subset of parameters that contains the maximum amount of information about all parameters of a given set is considered. The proposed selection method is based on information-theory analysis and rank statistics. The uncertainty coefficient (normalized mutual information) is used as a measure of the information about one parameter contained in another parameter. The most informative characteristics are selected from a set of cytological characteristics of breast cancer patients.


Introduction
The selection of a subgroup of parameters containing the maximum amount of information about all parameters of a given group is an important problem of medical informatics. This problem has applications in the analysis of oncology patient data [1], the mathematical modeling of tumor growth [2], and the development of intelligent medical systems [3,4].
In the present article, we select a subgroup of parameters that describe cytological characteristics of breast cancer patients and that contain the maximum amount of information about all parameters of this group.
This selection is required for constructing models of cancer diagnosis and prognosis.
The problem under consideration is rather difficult. First, the group includes both discrete and continuous parameters; second, the distributions of the continuous parameters are non-Gaussian; and third, the interrelations between parameters are nonlinear.
Any method of selecting information-intensive parameters relies on a measure of the correlation between parameters. Most methods use a correlation coefficient as this measure. However, the correlation coefficient presupposes that the parameter distributions are Gaussian and that the correlations between parameters are linear. Therefore, in the present article, the uncertainty coefficient (normalized mutual information) is used as the measure of correlation between parameters. This coefficient evaluates nonlinear correlations between parameters with arbitrary distributions. Thus, this article presents a method of selecting the most informative parameters that places no restrictions on the distributions of the parameters or on the correlations between them.
Application of the uncertainty coefficient has made it possible to obtain interesting results in medicine [5] and, in particular, in oncology [6,7,8]. The approach proposed in [6] is presented in monograph [9].
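For reference, a commonly used form of the uncertainty coefficient for two discrete parameters X and Y is given here. The notation is ours and is an assumption; the exact form and direction of normalization used in [5] may differ.

$$
U(X \mid Y) = \frac{H(X) - H(X \mid Y)}{H(X)} = \frac{I(X;Y)}{H(X)}, \qquad H(X) = -\sum_{x} p(x)\,\log p(x)
$$

Here H(X) is the entropy of X, H(X | Y) is the conditional entropy of X given Y, and I(X;Y) is the mutual information. In this form the coefficient lies between 0 (Y carries no information about X) and 1 (Y determines X completely), and it is not symmetric in X and Y.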

Materials
To illustrate the method, we used data from the Wisconsin Breast Cancer Database [10]. These data cover 663 patients, each described by 10 cytological parameters, 9 of which are continuous (clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, and mitoses) and one of which is discrete (Class).

Statement of the problem
Assume that the initial data on n objects are presented in the form of an $n \times m$ matrix, where each row k is an object described by m parameters. The task is to find a parameter or a subgroup of parameters containing the greatest amount of information about all m parameters.

Description of the algorithm
The algorithm for selecting a subgroup of the most informative parameters from the entire group of parameters consists of four procedures. A short description of each procedure follows. A more complete description of the application of information-theory analysis to the selection problem is presented in [11].

Discretization.
This procedure transforms parameters with continuous values into parameters with discrete values.
If no accepted discretization method is available for a continuous parameter, a formal approach to discretization [12] is used.
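The formal approach of [12] is not reproduced here. As an illustration only, the following sketch discretizes a continuous parameter by equal-frequency binning, which is a swapped-in technique chosen for simplicity. The function name `equal_frequency_bins` and the parameter `n_bins` are hypothetical.

```python
from bisect import bisect_right


def equal_frequency_bins(values, n_bins=4):
    """Map each value to a bin index in 0..n_bins-1.

    Assumption: equal-frequency (quantile) binning, not the method of [12].
    Cut points are taken from the sorted sample, so equal values always
    fall into the same bin.
    """
    ordered = sorted(values)
    n = len(ordered)
    # One cut point per inner bin boundary, read off the sorted sample.
    cuts = [ordered[(b * n) // n_bins] for b in range(1, n_bins)]
    # A value's bin index is the number of cut points not exceeding it.
    return [bisect_right(cuts, v) for v in values]


if __name__ == "__main__":
    print(equal_frequency_bins([1, 2, 3, 4, 5, 6, 7, 8], n_bins=4))
```

With heavily tied data some bins may end up empty or unequal in size, so the number of bins would need to be chosen with the sample in mind.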

Construction of the uncertainty coefficient matrix.
For the i-th and j-th parameters, $1 \le i, j \le m$, calculate the uncertainty coefficient $c_{ij}$ [5] and construct the $m \times m$ matrix $[c_{ij}]$.

Construction of the rank matrix.
For each column of the matrix $[c_{ij}]$, we rank its elements and assign rank 1 to the smallest element of the column. We obtain the $m \times m$ matrix $[r_{ij}]$, in which each column contains the ranks from 1 to m.
We estimate the amount of information about all m parameters contained in the i-th parameter by the sum of all the entries of the i-th row of the matrix $[r_{ij}]$.
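The construction of the uncertainty coefficient matrix and of the rank matrix can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes that $c_{ij}$ is the mutual information of parameters i and j divided by the entropy of parameter j (information about j contained in i), that tied entries receive average ranks, and that a constant parameter is given coefficient 0. All function names are hypothetical.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def uncertainty_coefficient(target, predictor):
    """I(target; predictor) / H(target) for two discrete sequences."""
    h_target = entropy(target)
    if h_target == 0.0:
        return 0.0  # assumption: a constant target is assigned coefficient 0
    joint = entropy(list(zip(target, predictor)))
    return (h_target + entropy(predictor) - joint) / h_target


def uncertainty_matrix(columns):
    """c[i][j]: information about parameter j contained in parameter i."""
    m = len(columns)
    return [[uncertainty_coefficient(columns[j], columns[i]) for j in range(m)]
            for i in range(m)]


def rank_columns(matrix):
    """Rank the entries of each column; rank 1 goes to the smallest entry.

    Ties (exactly equal values) receive the average of their ranks.
    """
    m = len(matrix)
    ranks = [[0.0] * m for _ in range(m)]
    for j in range(m):
        col = [matrix[i][j] for i in range(m)]
        order = sorted(range(m), key=lambda i: col[i])
        pos = 0
        while pos < m:
            end = pos
            while end + 1 < m and col[order[end + 1]] == col[order[pos]]:
                end += 1
            average_rank = (pos + end) / 2 + 1
            for k in range(pos, end + 1):
                ranks[order[k]][j] = average_rank
            pos = end + 1
    return ranks


def information_scores(columns):
    """Row sums of the rank matrix, one score per parameter."""
    return [sum(row) for row in rank_columns(uncertainty_matrix(columns))]


if __name__ == "__main__":
    x = [0, 0, 1, 1]
    y = [0, 0, 1, 1]   # identical to x
    z = [0, 1, 0, 1]   # independent of x and y in this sample
    print(information_scores([x, y, z]))
```

In the toy data, x and y each determine the other and so obtain higher row sums than z, which is informative only about itself.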

Application of the multiple comparison method.
Apply the multiple comparison method to the row sums of the matrix $[r_{ij}]$ [13]. This gives a clustering of parameters that contains the desired subgroup of parameters.
Remark. Programs for calculating the uncertainty coefficient and for the Friedman test are implemented in the SPSS package [14].

[...] 1) columns and obtain a rank matrix $[r_{ij}]$ (Table 2). Consider Table 2 as the Friedman statistical model [15] and examine the row effect of this table.
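Treating a rank table as a Friedman model, with rows as treatments and columns as blocks, the row effect can be examined with the classical Friedman statistic. The sketch below computes only that statistic, under the assumption that each column contains the ranks 1 to k without ties. It does not reproduce the multiple comparison procedure of [13] or the SPSS implementation [14], and the function name `friedman_statistic` is hypothetical.

```python
def friedman_statistic(ranks):
    """Friedman statistic for the row effect of a rank table.

    ranks[i][j] is the rank of row i within column j.
    Assumption: every column holds the ranks 1..k with no ties,
    so no tie correction is applied.
    """
    k = len(ranks)       # rows: treatments (here, parameters)
    b = len(ranks[0])    # columns: blocks
    row_sums = [sum(row) for row in ranks]
    sum_of_squares = sum(r * r for r in row_sums)
    return 12.0 / (b * k * (k + 1)) * sum_of_squares - 3.0 * b * (k + 1)


if __name__ == "__main__":
    # Every column ranks the rows in the same order: strongest row effect.
    agree = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
    # Ranks rotate across columns: all row sums are equal, no row effect.
    rotate = [[1, 2, 3], [2, 3, 1], [3, 1, 2]]
    print(friedman_statistic(agree), friedman_statistic(rotate))
```

Under the hypothesis of no row effect, this statistic is approximately chi-square distributed with k - 1 degrees of freedom for a sufficiently large table, so a large value indicates that the row sums differ by more than chance would explain.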

Conclusions
In the present paper, a method of selecting parameters based on the assessment of correlations between single parameters is considered. However, problems arising in medical practice often require that the selection of parameters take into account correlations between groups of parameters. The development of methods for selecting informative parameters that account not only for correlations between single parameters but also for correlations between groups of parameters should be a subject of further research.