SVD BASED LATENT SEMANTIC INDEXING WITH USE OF THE GPU COMPUTATIONS

The purpose of this article is to determine the usefulness of Graphics Processing Unit (GPU) calculations for implementing the Latent Semantic Indexing (LSI) reduction of the TERM-BY-DOCUMENT matrix. The considered reduction of the matrix is based on the SVD (Singular Value Decomposition) decomposition. The high computational complexity of the SVD decomposition, O(n^3), makes the reduction of a large indexing structure a difficult task. This article compares the time complexity and accuracy of the algorithms implemented in two different environments. The first environment is associated with the CPU and MATLAB R2011a. The second environment is related to graphics processors and the CULA library. The calculations were carried out on generally available benchmark matrices, which were combined to obtain a single large matrix. For both considered environments, computations were performed for double and single precision data.


INTRODUCTION
Nowadays, there is a tremendous increase in the number of text resources associated with various themes. The most obvious example of this phenomenon is the still growing World Wide Web. With the increase in the number of text documents placed in various databases, methods of automatic document indexing become increasingly important. The LSI method, introduced in 1990, uses the theory of linear algebra to automate the process of indexing and retrieval [1]. This method is widely used in information retrieval systems, for example:
• in [1] for indexing of medical (1033 documents) and information science (1460 documents) abstracts,
• in [2] for indexing a collection of 382845 documents divided into 9 subcollections,
• in [3] for indexing patent documents,
• in [4] for cross language information retrieval (CLIR), more precisely for the Malay-English CLIR system,
• in [5] for the Greek-English CLIR system.
Furthermore, applications of the LSI can also be found in the field of ontology mapping [6] or even in genomic studies [7]. The LSI offers up to 30% better performance than traditional lexical techniques [8]. Despite a number of advantages, the LSI method is computationally problematic: implementation of the algorithm requires equipment with significant computing power. In particular, this drawback is visible in the case of databases of text documents with a huge number of positions (expressed in thousands), where there is a need for continuous updating. In relation to such large structures, realization of a search system based on the LSI method is a difficult task. The time analysis of the LSI method applied to a large number of documents can be found in a relatively small number of papers. Referring to [9], application of the LSI method to a matrix of size 11152×3891 (43392432 elements) required calculations that lasted 2.3*10^5 seconds. In this article we used a matrix with a similar number of elements (a matrix of size 5896×7095, i.e. 41832120 elements). We used the MATLAB environment to determine the computation time for the CPU environment. The results are compared to computations using the GPU environment. The authors would like to point out that, in the literature known to us, nobody has made a similar comparison. Taking into account the computational power of modern graphics cards, such a comparison is desirable. Usually the computing power of commercial GPUs is much higher for single precision data than for double precision data. In the literature known to us there is no study determining the usefulness of the LSI method applied to single precision data. Therefore the authors carried out a comparison of the time complexity and correctness of the results obtained using single precision data.
The results are promising. The use of the GPU in place of a CPU from a similar market segment has allowed the computation time to be reduced almost by half. Additionally, changing the data format from double precision to single precision allowed a further reduction of the computation time. Moreover, regardless of the used environment and data format, for the considered case the precision of the search system remained at a similar level. Despite the fact that the methodology proposed in this article allows the computation time to be reduced several times, we realize that the use of GPUs for the LSI method has its limitations. Application of the LSI for indexing of huge document collections may be achievable by combining decompositional techniques and parallel computing using a computer cluster with GPUs.

VECTOR SPACE MODEL
The Vector Space Model is a widely used technique in information retrieval [10,11], in which both documents and queries are represented by vectors. Each element of a vector represents the weight of a given expression in a given document or query.
The first step of building the Vector Space Model is to identify occurrences of keywords in the entire collection of documents. Selection of the appropriate keyword list significantly affects the performance of a retrieval system. In the next step the index structure is built. Each keyword is mapped into a multidimensional space in such a way that all documents containing a given phrase are correlated with that phrase. The result is the TERM-BY-DOCUMENT matrix, which contains weighted connections between the documents and the concepts.
There are several methods of selecting the factors lying at the intersections of the columns and rows of the TERM-BY-DOCUMENT matrix. The most commonly used are:
• Boolean model: the occurrence of a concept in the document is marked by one, the lack of the concept is marked by zero,
• Term Frequency model (TF): the values contained in the matrix specify the number of occurrences of a concept in a document,
• Term Frequency - Inverse Document Frequency model (TF-IDF): the model takes into account both the frequency of occurrence of a concept in a document and its power of discrimination (a measure of the prevalence of the concept); an example weighting of this type is sketched below.
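As an illustration, one commonly used TF-IDF weighting (the exact variant applied in a given system may differ) assigns to term i in document j the weight

    a_{ij} = tf_{ij} \cdot \log \frac{N}{df_i}

where tf_{ij} is the number of occurrences of term i in document j, df_i is the number of documents containing term i, and N is the total number of documents in the collection.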
In the space of concepts (term space) each document creates a vector, while each concept (term) represents a dimension. In the document space, terms are represented by vectors located between the axes defined by the documents. The idea of both spaces, for the sample TERM-BY-DOCUMENT matrix in fig. 1, is shown in fig. 2. The two most commonly used measures of similarity between two vectors are the scalar product (1) and the cosine distance (2).
In practice, the cosine distance is used most frequently. A smaller angle between two vectors means closer similarity. The value of the cosine distance varies between -1 (180 degrees, dissimilar vectors) and 1 (0 or 360 degrees, a hundred percent similarity).
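In their standard form (assumed here), for a document vector d_j and a query vector q both measures can be written as

    sim(d_j, q) = d_j^T q = \sum_{i=1}^{m} a_{ij} q_i        (1)

    \cos(d_j, q) = \frac{d_j^T q}{\|d_j\| \, \|q\|}          (2)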

SVD DECOMPOSITION AND REDUCTION
The high-order TERM-BY-DOCUMENT matrix A of size m×n is decomposed using the SVD transformation into a product of three matrices (3), where the columns of the matrix U (of size m×m) are the orthonormal eigenvectors of the product AA^T, the columns of the matrix V (of size n×n) are the orthonormal eigenvectors of the product A^T A, and S (of size m×n) is a diagonal matrix containing the square roots of the eigenvalues (the singular values) of the product of these matrices. In order to determine the low rank approximation of the original TERM-BY-DOCUMENT matrix, it is possible to use equation (4).
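In the standard notation (assumed here to correspond to the referenced equations), the decomposition (3) and the rank-k approximation (4) take the form

    A = U S V^T                     (3)

    A_k = U_k S_k V_k^T             (4)

where U_k and V_k contain the first k columns of U and V, and S_k is the k×k diagonal matrix of the k largest singular values.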
The low rank representation of the TERM-BY-DOCUMENT matrix is obtained by removing the last (m-k) and (n-k) columns of the matrices U and V respectively, and by retaining only the k largest singular values of the matrix S. As shown in [8], the matrix A_k is the best rank-k approximation of the matrix A with respect to the Frobenius norm (5).
The approximation error can be determined from (6), where σ denotes the removed singular values.
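In their standard form (assumed here), the optimality property (5) and the corresponding error (6) can be written as

    \|A - A_k\|_F = \min_{rank(B) \le k} \|A - B\|_F                         (5)

    \|A - A_k\|_F = \sqrt{\sigma_{k+1}^2 + \sigma_{k+2}^2 + \ldots + \sigma_r^2}    (6)

where r = rank(A) and σ_{k+1}, ..., σ_r are the singular values removed during the reduction.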
In order to reduce the size of the original TERM-BY-DOCUMENT matrix, it must be mapped into a space of a smaller number of dimensions. This is done by the linear transformation (7). After the reduction of the multidimensional space, related documents are closer to each other. The relative distance between points in the reduced space represents the semantic similarity between documents.
Mapping of the user queries to the reduced vector space is done by (9), where q denotes a pseudo-document (the keywords entered by the user form a vector in the original space).
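A common form of these mappings (assumed here; the exact formulation of the original equations (7) and (9) may differ in notation) projects a document vector d and a query vector q onto the k-dimensional space as

    \hat{d} = S_k^{-1} U_k^T d      (7)

    \hat{q} = S_k^{-1} U_k^T q      (9)

Both reduced representations can then be compared with the cosine measure (2).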

THE EXPERIMENT
This chapter provides information about the data and the equipment used to perform the calculations, as well as the results of the experiment.

Data
The study presented in this article was made on generally available benchmarks consisting of connected sets of documents. The resulting TERM-BY-DOCUMENT matrix is organized in such a way that its columns represent the documents from the collection while its rows represent words. The size of the matrix is 5896 rows by 7095 columns. The structure of the original TERM-BY-DOCUMENT matrix is shown in fig. 3.

Implementation and hardware
The LSI algorithm has been developed for two parallel computing environments in order to compare the time and the accuracy of the calculations. The first environment, related to the CPU, is MATLAB R2011a, which uses the Intel math library version 10.2 supporting multi-core and data parallelism [12]. The second environment, related to the GPU, consists of programs developed in the C++ language with use of the CUDA, CULA [13] and CULYA [14] libraries.
The CULYA library (described in detail in [15]) uses the CULA, CUBLAS and CUDA libraries to provide basic linear algebra operations for vectors and matrices. The library includes 115 methods and overloaded operators (e.g. +, -, *, ^), which makes operations on matrices more intuitive. Table 1 presents the description of the selected methods of the CulyaMatrix class, which is the main class of the library. The difference between the architecture of conventional processors and GPUs is mainly based on the number of cores and the cache size (fig. 4). The GPUs have a large number of cores performing basic arithmetic operations and a small cache. This architecture is very effective for processing matrix/block data. The basic unit of code that is responsible for performing computations on the GPU is the CUDA kernel. The declaration of a kernel function is similar to the C language; however, it must be preceded by the __global__ directive, and the function cannot return a value via the stack (void). Although the kernel performs the addition of two vectors, the code does not contain any software loop that would occur in a classic solution of this problem. The CUDA computing environment automatically allocates the GPU computing resources for each data element, calling the kernel code many times simultaneously. The structure of a simple program that performs the addition of two vectors, using the computing power of the graphics card, is presented below.
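A minimal sketch of such a kernel (the function name addVectors and the parameter names are illustrative, not the original listing) is:

    __global__ void addVectors(const float *a, const float *b, float *c, int n)
    {
        // each thread computes its own global index and adds a single pair of elements
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                 // guard: a launch may start more threads than elements
            c[i] = a[i] + b[i];
    }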
After starting the application in the GPU environment, execution of the CUDA kernel is equivalent to running multiple threads simultaneously, each of which performs the same operation (SIMT architecture: Single Instruction Multiple Thread). The threads are combined into groups called thread blocks. The thread blocks are grouped in grids; each grid contains only blocks belonging to the same CUDA kernel. Each block and each thread has an identifier that uniquely determines its location in the grid and in the block, respectively. Each running thread has structures that determine the coordinates of its block in the grid (blockIdx) and the coordinates of the thread in the block (threadIdx). The CUDA kernel code refers to the specific coordinates by the field components x, y, z. Threads are also able to determine the dimensions of the block and the grid with use of the structures blockDim and gridDim.
The application developer has a direct impact on the division of the computational problem into a certain number of threads and blocks. This is done by specifying two parameters contained in the brackets <<<…>>> at the place of the CUDA kernel invocation. The first value determines the segmentation of the computational problem into a certain number of blocks. The second value determines the number of threads running in each block. In fig. 5 there is an example of the division of a computational problem into 4 blocks of 5 threads each. It should also be mentioned that modern x86 processors have a whole range of solutions that affect the calculation time. The first important solution accelerating the calculations is the multicore architecture; however, the number of cores in a CPU is not as large as the number of cores in a GPU (for example, the unit used in our case, Core i7 3770, has 4 cores). Another element worthy of mention are the vector extensions supporting matrix computations [16] (also used by MATLAB R2011a). The comparison of the used equipment is summarized in table 2.
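For the division shown in fig. 5 (4 blocks of 5 threads each), the host part of the vector-addition sketch given above could allocate the device memory, copy the data and invoke the kernel as follows (the vector length n = 20 and all names remain illustrative):

    #include <cuda_runtime.h>

    int main()
    {
        const int n = 20;
        float a[n], b[n], c[n];
        for (int i = 0; i < n; ++i) { a[i] = (float)i; b[i] = 2.0f * i; }

        float *dA, *dB, *dC;                             // buffers in the graphics card memory
        cudaMalloc(&dA, n * sizeof(float));
        cudaMalloc(&dB, n * sizeof(float));
        cudaMalloc(&dC, n * sizeof(float));
        cudaMemcpy(dA, a, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dB, b, n * sizeof(float), cudaMemcpyHostToDevice);

        addVectors<<<4, 5>>>(dA, dB, dC, n);             // 4 blocks x 5 threads = 20 threads

        cudaMemcpy(c, dC, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        return 0;
    }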

The reduction error
As the reduction error, the difference between the cosine distances of document vectors computed for the original and for the reduced TERM-BY-DOCUMENT matrix is assumed. The sample error maps for the case of reduction to k=100 rows, for double precision calculations, for the MATLAB and GPU environments, are shown in fig. 6.
Figure 6. Reduction error maps for k=100 (MATLAB: left part, GPU: right part)

In order to present the reduction error for all studied cases, the mean square error (10) was used, where n = 7095, COS_ORG_ij denotes the cosine distance between the i-th and j-th documents in the original TERM-BY-DOCUMENT matrix, and COS_RED_ij denotes the cosine distance between the i-th and j-th documents in the reduced TERM-BY-DOCUMENT matrix. The MSE errors are shown in fig. 7. The results are consistent with initial expectations. The error decreases with the increase of the size of the reduced TERM-BY-DOCUMENT matrix. It should be noted, however, that the results determine only the similarity of the reduced matrix relative to the original one. This measure does not take into account the positive effects of applying the LSI method, which is able to detect existing relationships between the semantic content of various documents.
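A mean square error consistent with this description (the exact normalization used in the original equation is assumed) is

    MSE_k = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \left( COS\_ORG_{ij} - COS\_RED_{ij} \right)^2     (10)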
Search systems based on the LSI are able to return a list of documents related to the terms of a user's query, despite differences in the keywords used.

Reduction time
The reduction times, depending on the environment and the precision, are shown in fig. 8. Since the main operation affecting the computation time is the SVD decomposition (which takes the same time in each case), the results are given for the reduction to k=100 rows. The times of searching for the most similar document, according to the cosine distance, depending on the rank of the TERM-BY-DOCUMENT matrix, are shown in fig. 9. Summarizing the results, the calculation of the reduced TERM-BY-DOCUMENT matrix using the GPU environment brings a reduction of the calculation time by almost half compared to the time obtained for the CPU environment (fig. 8). In addition, the search system using the reduced TERM-BY-DOCUMENT matrix in place of the original one is able to find the closest matching document (considering the cosine distance) in a time which is only a fraction of the original searching time (fig. 9). The charts show that, regardless of the used computational environment, the precision associated with searching for the most similar documents (relative to the set obtained for the original TERM-BY-DOCUMENT matrix) is at a comparable level. What is interesting, the graphs also show that changing the arithmetic from double precision to single precision (for calculations in the GPU environment) does not cause a drastic deterioration of the results. Considering the fact that the calculation for single precision arithmetic is faster, this suggests the possibility of using only single precision values in a target retrieval system.
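As an illustration of the search step, a minimal CPU-side sketch (the function and variable names are illustrative; the reduced document vectors docs and the reduced query q are assumed to be given as k-dimensional arrays) may look as follows:

    #include <cmath>
    #include <vector>

    // returns the index of the document whose reduced vector is closest to the
    // reduced query vector q according to the cosine measure (2)
    int mostSimilarDocument(const std::vector<std::vector<float>> &docs,
                            const std::vector<float> &q)
    {
        int best = -1;
        float bestCos = -2.0f;                       // cosine values lie in [-1, 1]
        for (size_t d = 0; d < docs.size(); ++d) {
            float dot = 0.0f, nd = 0.0f, nq = 0.0f;
            for (size_t i = 0; i < q.size(); ++i) {
                dot += docs[d][i] * q[i];
                nd  += docs[d][i] * docs[d][i];
                nq  += q[i] * q[i];
            }
            float c = dot / (std::sqrt(nd) * std::sqrt(nq) + 1e-12f);
            if (c > bestCos) { bestCos = c; best = (int)d; }
        }
        return best;
    }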

CONCLUSIONS
The LSI method based on the SVD decomposition is an effective way to reduce the size of the original TERM-BY-DOCUMENT matrix. Referring to the literature, the method is frequently applied to collections of homogeneous documents. Applying it to large collections of heterogeneous documents is problematic, mainly due to the long computation time. This is caused by the high computational complexity of the SVD decomposition, O(n^3). One step towards solving this problem could be the use of parallel computing; in this article, the authors focused on the application of the graphics processor. In the literature, up to now, there was no comparison of the efficiency and precision of the two possible computing environments, associated with the CPU and the GPU, for the purpose of the LSI method.
The article shows that the use of the GPU can reduce the time needed to reduce the TERM-BY-DOCUMENT matrix: application of the presented equipment allowed reducing the computational time by almost half (comparing the time obtained in the GPU environment to the time for the CPU). The precision obtained in both environments was similar. We should also clearly indicate that, despite the fact that we did not use the most modern equipment available on the market (mostly for financial reasons), both devices were released at a similar time (Core i7 3770: Q2'2012, GTX 580: Q4'2010) and belong to a similar segment of the market. The presented methodology allows the computation time to be reduced almost by half, but it does not give the possibility of reducing huge index structures, where the number of documents is counted in the hundreds of thousands. Our future experiments will focus on different decomposition methods of the TERM-BY-DOCUMENT matrix before the reduction (like K-means and Epsilon decomposition) and on the use of the Krylov methods to replace the original SVD decomposition. This will allow a multiple increase of the size of the considered indexing structures. The decomposition of one large computational problem into a number of smaller ones has a number of advantages. The first is associated with a reduction of the computational complexity.
Assuming that the original TERM-BY-DOCUMENT matrix is divided into k parts of equal size, we obtain a computational complexity of k*O((n/k)^3) = O(n^3/k^2) instead of the original O(n^3). Moreover, the matrices obtained from the decomposition process could be reduced simultaneously by using a parallel computer cluster equipped with GPUs.

Figure 1. The sample TERM-BY-DOCUMENT matrix

Figure 2. Terms' space (left part) and documents' space (right part) of the sample TERM-BY-DOCUMENT matrix

Figure 3. The structure of the TERM-BY-DOCUMENT matrix. This is a sparse TF matrix with 247,157 elements different from zero.

Table 1. The selected methods of the CulyaMatrix class

Method                  Description
to_card                 copies the matrix from the main memory to the graphics card memory
from_card               copies the matrix from the graphics card memory to the main memory
del_dev                 removes the matrix from the graphics card memory
del_host                removes the matrix from the main memory
operator *              matrix product
operator +              matrix addition
operator -              matrix subtraction
operator ++(int)        returns the matrix transpose
operator ++             transposes the matrix, overwriting the source data
dup                     returns a copy of the matrix
inv                     matrix inversion
qr_full                 the QR decomposition of the matrix; the method returns an object of class Tqr containing the matrices R and Q of the decomposition
eig_full(char vector)   the eigenvalues and the eigenvectors of the matrix; the method returns an object of class Teig containing the eigenvalues vector and the right and left eigenvectors
svd(char param)         the SVD decomposition; the method returns an object of class TSvd containing the singular values vector and the U and VT matrices

Figure 4. Comparison of CPU and GPU architectures

Figure 5. Example of the division of a computational problem into blocks and threads

Figure 7. The MSE errors of the reduction

Figure 8. The reduction time depending on the environment and the precision

Figure 9. The times of searching for the most similar document, depending on the rank of the TERM-BY-DOCUMENT matrix

Table 2. Comparison of the used equipment