On Randomness of Compressed Data Using Non-parametric Randomness Tests

ABSTRACT


INTRODUCTION
Compression is a method of reducing the size of a file for a purpose such as saving space, utilizing bandwidth, or increasing transmission speed.
Compression algorithms work by removing the redundancy in data, which is why further compressing compressed data seems impossible [1]. Different approaches deal with this redundancy differently; for example, encoding similar sequences of data (runs) in a shorter form, encoding the differences between adjacent neighbors, or using a dictionary so that similar sequences of data are substituted by an index into that dictionary.
The compressed data therefore contains almost no redundancy and is regarded as random. The ordinary meaning of randomness in the statistical literature [2], or in the context of random number generation [3], refers to sequences of independent and identically distributed (i.i.d.), or uniformly distributed, numbers.
The authors in [4] used the NIST and Diehard tests to confirm the randomness of compressed data. Their aim was to check the quality of the compressed data as a random number generator. The researchers tested the randomness of five corpora, each treated as a binary sequence. In our work, we prefer converting the compressed data into unsigned integers of one-byte length, as this format is suitable for the software we selected. Any other format that reflects the positions or the magnitude of the data can still serve as a measure of randomness [2]. For the tests, four non-parametric randomness tests were used.
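To make the conversion step concrete, here is a minimal Python sketch (the paper itself used R, and the file name is purely illustrative) of reading a compressed file as a sequence of unsigned one-byte integers:

```python
def bytes_as_uint8(path):
    """Read a file and return its contents as unsigned one-byte integers (0-255)."""
    with open(path, "rb") as f:
        return list(f.read())  # iterating over bytes yields ints in 0..255

# In Python 3, a bytes object already behaves as a sequence of 0-255 integers:
sample = list(b"\x00\x7f\xff")
print(sample)  # [0, 127, 255]
```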
This paper investigates the relationships between the compression ratio and the selected randomness tests, and between the compression ratio and the statistical information accompanying these tests. Section 2 briefly describes the randomness tests that were used, the sample images, and the compression algorithms applied to obtain the compressed data. Section 3 (empirical study) lists the relationships between the compression ratios and the randomness tests and their corresponding statistical information. Section 4 summarizes the findings of the experiments.

METHODOLOGY
Four randomness tests were used to test the randomness of the compressed data files. These data files were generated by applying four lossless algorithms to 49 grayscale images. The following subsections describe the sample data, the compression algorithms, and the randomness tests.

Compressed Data Files for Tests
A total of 49 images with a standard size of 512x512 in a raw format (PGM) were used. These grayscale images contain landscapes, satellite imagery, humans, animals, vehicles, buildings, and artificial images, chosen so that the compression algorithms yield very different compression ratios. Figure 1 shows the sample images, which were assigned numbers 1-49. The variations in compression ratios are important to mitigate bias in testing and to improve the accuracy of the relationships.

Randomness Tests
Four randomness tests were used. These tests are commonly used to determine the randomness of time series data [3].
The tests were used in a twofold manner: the p-values of the tests were used as a measurement of randomness, and the statistical information used to calculate those p-values was used as well. This statistical information helps us understand the structure of the compressed data files in the context of the four tests and their corresponding relationships to the compressed data.

Runs Test (Wald-Wolfowitz Runs Test)
This test is named after Abraham Wald and Jacob Wolfowitz [5]. If we have $n$ observations, the median value is used as a threshold to identify two types of outcomes, those above and those below the median, with counts denoted $n_1$ and $n_2$, so that $n = n_1 + n_2$.
The expected number of runs for random data is computed as

$$E(R) = \frac{2 n_1 n_2}{n} + 1 \qquad (1)$$

and the variance as

$$\operatorname{Var}(R) = \frac{2 n_1 n_2 (2 n_1 n_2 - n)}{n^2 (n - 1)} \qquad (2)$$

Then the test statistic is

$$Z = \frac{R - E(R)}{\sqrt{\operatorname{Var}(R)}} \qquad (3)$$

where $R$ in the above equations is the total number of runs, counting both the runs of above-median values and the runs of below-median values. The significance level $\alpha$ used for all tests was 0.05, so if $|Z| > 1.96$ we reject the null hypothesis, and the data file is considered non-random [6].
Besides using the result of the test in terms of the p-value, we also used the total number of runs, referred to as the statistical information; when divided by the size of the compressed data, the result is referred to as the statistical ratio, or ratio for short. A ratio in this case is the number of runs found in the compressed data as a percentage of its size. This naming convention is adhered to for the rest of the randomness tests.
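The runs test above can be sketched in a few lines. This is a minimal Python illustration (the paper's tests were run with the R "randtests" package); dropping values equal to the median is one common convention and is assumed here:

```python
import math
import statistics

def runs_test(x):
    """Wald-Wolfowitz runs test against the sample median.
    Returns (total_runs, z_statistic)."""
    med = statistics.median(x)
    signs = [v > med for v in x if v != med]  # True = above, False = below
    n1 = sum(signs)        # observations above the median
    n2 = len(signs) - n1   # observations below the median
    n = n1 + n2
    runs = 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)
    mu = 2.0 * n1 * n2 / n + 1.0                                    # E(R)
    var = 2.0 * n1 * n2 * (2.0 * n1 * n2 - n) / (n * n * (n - 1))   # Var(R)
    return runs, (runs - mu) / math.sqrt(var)

# A strictly alternating sequence has far too many runs to be random:
r, z = runs_test([0, 1] * 50)
print(r, round(z, 2))  # 100 runs, z well above 1.96
```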

The Difference-sign Test
This test counts the number of indices $i$ such that $x_i > x_{i-1}$, $i = 2, \ldots, n$, i.e., the number of times the difference series $x_i - x_{i-1}$ is positive, and assigns this count to $S$, where $n$ is the number of observations.
The mean and the variance for random data are

$$\mu_S = \frac{n - 1}{2} \qquad (4)$$

$$\sigma_S^2 = \frac{n + 1}{12} \qquad (5)$$

The null hypothesis is rejected when

$$\left| \frac{S - \mu_S}{\sigma_S} \right| > \Phi_{1-\alpha/2} \qquad (6)$$

where $\Phi_{1-\alpha/2} = 1.96$ when $\alpha = 0.05$ [7]. Besides the indicator of randomness represented by the p-value, the number of positive signs $S$ was used as statistical information, along with its ratio.
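A minimal Python sketch of the difference-sign statistic described above (illustrative only; the paper ran its tests in R):

```python
import math

def difference_sign_test(x):
    """Difference-sign test: S counts positive first differences.
    Under randomness, S has mean (n-1)/2 and variance (n+1)/12."""
    n = len(x)
    s = sum(1 for a, b in zip(x, x[1:]) if b > a)
    mu = (n - 1) / 2.0
    var = (n + 1) / 12.0
    return s, (s - mu) / math.sqrt(var)

# A monotonically increasing sequence: every difference is positive.
s, z = difference_sign_test(list(range(10)))
print(s, round(z, 2))  # s = 9, z above 1.96 -> rejected as non-random
```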
The Turning Point Test
Let $T$ be the count of turning points (strict local maxima and minima). The mean and the variance for random data are

$$\mu_T = \frac{2(n - 2)}{3} \qquad (7)$$

$$\sigma_T^2 = \frac{16 n - 29}{90} \qquad (8)$$

The null hypothesis is rejected when

$$\left| \frac{T - \mu_T}{\sigma_T} \right| > \Phi_{1-\alpha/2} \qquad (9)$$

where $\Phi_{1-\alpha/2} = 1.96$ when $\alpha = 0.05$ [7]. This test was published by Irénée-Jules Bienaymé in 1874 [8]. The count of turning points $T$ was used as statistical information, together with its corresponding ratio.

The Cox-Stuart Test
Each value in the second half of the sequence is paired with the corresponding value in the first half, and the signs of the differences are recorded. Let $S$ be the number of positive signs; since $S \sim B(m, 0.5)$, where $m$ is the number of untied pairs, the binomial cumulative distribution function yields the p-value, and the null hypothesis is rejected if the p-value is less than $\alpha$. More details can be found in [9]-[11]. The test was published in 1955 by D. R. Cox and A. Stuart [12]. Besides the p-value, we used the count of positive comparisons as statistical information, along with its ratio.
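Both statistics can be sketched in Python as follows (illustrative only; the tie-handling and half-pairing conventions are assumptions, since implementations differ):

```python
import math

def turning_point_test(x):
    """Turning point test: T counts strict local maxima and minima.
    Under randomness, T has mean 2(n-2)/3 and variance (16n-29)/90."""
    n = len(x)
    t = sum(1 for i in range(1, n - 1)
            if (x[i - 1] < x[i] > x[i + 1]) or (x[i - 1] > x[i] < x[i + 1]))
    mu = 2.0 * (n - 2) / 3.0
    var = (16.0 * n - 29.0) / 90.0
    return t, (t - mu) / math.sqrt(var)

def cox_stuart_test(x):
    """Cox-Stuart test: pair the two halves (middle element dropped when n is
    odd) and count positive differences S; under randomness S ~ Binomial(m, 0.5)
    over the m untied pairs. Returns (S, two-sided p-value)."""
    n = len(x)
    offset = (n + 1) // 2
    diffs = [x[i + offset] - x[i] for i in range(n // 2) if x[i + offset] != x[i]]
    s = sum(1 for d in diffs if d > 0)
    k = len(diffs)
    tail = min(s, k - s)
    p = 2.0 * sum(math.comb(k, j) for j in range(tail + 1)) * 0.5 ** k
    return s, min(p, 1.0)

t, zt = turning_point_test([1, 3, 2, 4, 3, 5, 4])  # every interior value turns
s, p = cox_stuart_test(list(range(10)))            # strong upward trend
```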

Compression Algorithms
We used four lossless compression algorithms; two are dedicated to images, and two are general-purpose algorithms:
• JPEG-LS is a lossless image compression algorithm. It works by predicting the next pixel value based on the MED (Median Edge Detection) technique, and uses Golomb-Rice coding for encoding [13]-[15]. The implementation provided by Columbia University, written in the C language, was used in this work.
• JPEG-2000 is a wavelet-based lossy-to-lossless transform coder [16]. IrfanView can be used to apply JPEG-2000.
• 7z uses the LZMA (Lempel-Ziv-Markov chain algorithm) algorithm, which includes an improved LZ77 and a range encoder. The 7-Zip software was used to apply this algorithm.
• Bzip2 (Bz2 for short) concatenates RLE, the Burrows-Wheeler transform, and Huffman coding. It was also applied using the 7-Zip software.
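As an illustration of how a compression ratio can be measured, the following Python sketch uses the standard-library bz2 module as a stand-in for the Bzip2 step (an assumption on two counts: the paper applied 7-Zip rather than Python, and it does not spell out its ratio formula, so original size divided by compressed size is used here):

```python
import bz2
import os

def compression_ratio(data: bytes) -> float:
    """Compression ratio defined as original size / compressed size."""
    return len(data) / len(bz2.compress(data))

redundant = compression_ratio(b"A" * 4096)             # highly redundant input
incompressible = compression_ratio(os.urandom(4096))   # already-random input
print(round(redundant), round(incompressible, 2))
```

Redundant input yields a ratio far above 1, while random input barely compresses at all, which is the intuition behind treating compressed output as near-random.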

EMPIRICAL STUDY
All the tests were conducted in R version 3.3.1. The open-source package "randtests", which includes all the tests, was used in our work. The 49 grayscale images were compressed by the compression algorithms using the software mentioned earlier. The average compression ratios are given in Table 1.
On this small and varied sample, 7z and Bzip2 achieved better compression ratios, despite being general-purpose algorithms, while JPEG-LS and JPEG-2000 are image compressors. For a large data set, this may differ [17], [18].

The Randomness Tests Results
Each randomness test mentioned in subsection 2.2 was applied to the outputs of the compressors (subsection 2.3). The numbers of files that passed the randomness tests (p-values > 0.05) are shown in Fig. 2. For the runs test, only 3 of the 49 files compressed by Bzip2 (Bz2) passed, while 19 passed the Cox-Stuart test. In the case of the turning point test, JPEG-2000 has the fewest files that passed, whereas most of its files passed the Cox-Stuart test. The files compressed with JPEG-LS show similar results in the runs, difference-sign, and Cox-Stuart tests. In terms of turning points, JPEG-LS comes after JPEG-2000, and near 7z, in having few files that passed the test.

Randomness and Compression Ratio
We did not find any significant relationship between the compression ratio and the results of the randomness tests. Table 2 shows the correlation results.
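For reference, a correlation of this kind can be computed as below; Pearson's r is assumed here, since the paper does not name the coefficient behind its tables:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = math.sqrt(sum((a - mx) ** 2 for a in xs))
    sy = math.sqrt(sum((b - my) ** 2 for b in ys))
    return cov / (sx * sy)

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # approx  1.0 (perfect positive)
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))   # approx -1.0 (perfect negative)
```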

Utilizing Statistical information
It is obvious from Table 2 that there are no significant relationships between randomness and the compression ratio. We pointed out earlier (subsection 2.2) that the statistical information generated by the tests would be used to understand the structures of the compressed data and their relationships to the compression ratio.
This information includes:
• The number of runs (NUM. OF RUNS); this count is used in the runs test as $R$ in equation (3).
• The number of positive signs (NUM. OF P_SIGNS_D_S) resulting from comparing adjacent numbers; this count is used in the difference-sign test as $S$ in equation (6).
• The number of turning points (NUM. OF T_POINTS), used in the turning point test as $T$ in equation (9).
• The number of positive signs (NUM. OF P_SIGNS_COX), resulting from pairing the second half with the first half; the symbol $S$ was used.
This statistical information, when divided by the corresponding compressed file sizes, yields the ratios (or statistical ratios).
It should be pointed out that all these ratios are nearly equal (after rounding to 2 decimal places), regardless of whether the corresponding files passed or failed the randomness test. These values are:
• Ratio of the number of runs (# RUNS): 0.50 (or 50% of the file size)
• Ratio of the number of positive signs in the difference-sign test (# PS_DS): 0.50
• Ratio of the number of turning points (# TURNING): 0.66
• Ratio of the number of positive signs in the Cox-Stuart test (# PS_COX): 0.25
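These near-constant values can be reproduced on synthetic i.i.d. byte data. The following self-contained Python sketch (tie and half-pairing conventions are assumptions) computes the four counts and their ratios:

```python
import random

def stat_ratios(x):
    """The four statistical counts described above, each divided by len(x)."""
    n = len(x)
    med = sorted(x)[n // 2]                      # (upper) median as threshold
    signs = [v > med for v in x if v != med]
    runs = 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)
    pos = sum(1 for a, b in zip(x, x[1:]) if b > a)
    turns = sum(1 for i in range(1, n - 1)
                if (x[i-1] < x[i] > x[i+1]) or (x[i-1] > x[i] < x[i+1]))
    cox = sum(1 for i in range(n // 2) if x[i + (n + 1) // 2] > x[i])
    return runs / n, pos / n, turns / n, cox / n

random.seed(1)
data = [random.randrange(256) for _ in range(100_000)]  # i.i.d. byte values
print([round(r, 2) for r in stat_ratios(data)])  # approx [0.5, 0.5, 0.66, 0.25]
```

For truly random data the expected ratios follow from the test means: runs and positive differences are each about half the length, turning points about two-thirds, and Cox-Stuart positive signs about a quarter.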

Correlations between Ratios and Randomness
The correlations between the ratios and randomness are shown in Table 3. These correlations differ from those of Table 2: while no significant correlations appear in Table 2, significant correlations are shown in Table 3. JPEG-2000 shows strong negative relationships between the ratios of the number of runs, the number of positive signs (difference-sign test), and the number of turning points and the corresponding randomness tests.
JPEG-LS shows a moderate negative relationship between the ratio of turning points and randomness, whereas the 7z files show a moderate positive relationship for the same test.

Correlations between Ratios and Compression Ratio
When taking the correlations between the ratios and the compression ratio, JPEG-LS files are the only type that shows strong positive relationships between the ratios of the number of runs, the number of positive signs (difference-sign test), and the number of turning points and the compression ratio, which are entirely logical relationships. Bzip2 files show a strong negative relationship only between the ratio of positive signs (difference-sign test) and the compression ratio. Table 4 shows these relations.

Statistical Information and Compression Ratio
When it comes to the correlations between the statistical information and the compression ratio, it is clear that a very strong negative relationship holds within each compressed file type, without exception. This relation is straightforward: the larger the statistical counts, the lower the compression ratio obtained. JPEG-2000 shows nearly perfect relationships, stronger than those of 7z. Table 5 shows these relations.

CONCLUSION
There are no direct relationships between the results of the randomness tests and the compression ratio, but the experiments show a strong relationship between the statistical ratios and the compression ratio for the JPEG-LS files only, except for the Cox-Stuart test.
A nearly perfect relationship was detected between the compression ratio and the statistical information within the compressed data generated by the same algorithm, which means that, given two images of the same size compressed by the same algorithm, their compression ratios conform to the statistical information of any of the four tests.
JPEG-2000 files show a strongly negative relationship between the statistical ratios and the randomness results, except for the Cox-Stuart test, which means that higher ratio values result in lower p-values.
The results showed that the statistical information and the statistical ratios are more informative than the raw randomness results when compared against the compression ratio. The results also indicated that, when random files are represented as integers, the files are harder to compress further because of the high variability within their data.

Figure 2.
Number of files that passed the tests

Table 1 .
Average of Compression Ratios

Table 2 .
Correlation between Randomness and Compression Ratio

Table 3 .
Correlations between the Ratios and Randomness

Table 4 .
Correlations between Ratios and Compression Ratio

Table 5 .
Correlations between Statistical Information and Compression Ratio