Okay, good afternoon everyone.
So I wanted to talk a little, well first of all, some of you may have heard my presentation
this morning and I deliberately did not want to cover the same ground that I covered this
morning for my keynote talk, so I decided to cover something a little different, which
is the topic of image quality.
We've done work in a couple different aspects of this, so that's what we'll talk about
today and I actually have three different subtopics here that we'll talk about which
will be three separate presentations.
I think it's pretty short, so hopefully we can stay on schedule.
The first one is, well actually here they are, I did put it in here.
So we're going to first talk about a three-stage spatial chromatic model for a low-level visual
system and show some applications of that.
Then I'll talk about video compression, which I guess is true multimedia.
This actually represents a fairly extensive project in what we call color-aware video
compression, but I'm mainly going to focus on one aspect of that, which is temporal
drift.
And then finally, I want to talk a little bit about the influence of local image structure
on the perception of image distortion in a near-threshold scenario where the noise is
barely visible.
So the first topic is the idea of a basic low-level spatial chromatic human visual system model.
So this model has three stages.
There's a trichromatic stage, an opponent stage, and then a spatial frequency filtering
stage.
This is very much driven, I guess I would say, by the thinking in the color science community,
with which I interact quite a bit.
And first of all, the trichromatic stage is a very simple model, motivated by the notion of
three separate receptor types on the retina, where we have a higher sensitivity at the long
wavelengths for red, high sensitivity at the medium wavelengths for green, and a high
sensitivity at the short wavelengths, each shown here as a function of wavelength
lambda, for the blue. And we get a tristimulus vector of three components, three scalars,
by taking each of these sensor response functions, multiplying it by the spectral stimulus, and
integrating.
So that's the first stage.
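Just to make the mechanics concrete, here is a minimal Python sketch of that integration; the sensitivity curves are made-up Gaussians, not real cone or color-matching data, so this only illustrates the bookkeeping:

    import numpy as np

    wavelengths = np.arange(380, 781, 5)  # nm

    def gaussian(lam, center, width):
        return np.exp(-0.5 * ((lam - center) / width) ** 2)

    # Hypothetical long/medium/short sensor response functions (placeholders only)
    s_long = gaussian(wavelengths, 600, 50)
    s_medium = gaussian(wavelengths, 550, 45)
    s_short = gaussian(wavelengths, 450, 30)

    def tristimulus(spectrum, responses, lam):
        """Integrate each sensor response against the spectral stimulus."""
        return np.array([np.trapz(r * spectrum, lam) for r in responses])

    stimulus = np.ones_like(wavelengths, dtype=float)  # flat spectrum as a toy example
    t = tristimulus(stimulus, [s_long, s_medium, s_short], wavelengths)
    print(t)  # three scalars: the tristimulus vector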
And by the way, these sensor response functions do not have to actually correspond to the
cone responses of the human eye.
They could instead be a linear transformation of those cone responses, and in fact that's
what the CIE standard XYZ observer is. This shows the x-bar, y-bar, z-bar color matching
functions that are used to compute CIE XYZ, which is typically the first stage of the model.
Now the second stage, I should have reproduced this slide, I forgot to do that.
The second stage is the opponent stage, so we'll talk about that for a minute, and there's
actually quite a few versions of this.
The first one to consider is CIE L*, a*, b*, which is a so-called
uniform color space, one of the models that is expressed in terms of CIE XYZ as shown.
It's considered an opponent color space because there are three dimensions: one is the lightness
axis, which is not an opponent channel; the first
opponent channel is red-green, which is the a* axis; and the b* axis is blue-yellow.
One of the things we've done with this model is to linearize it to make it more appropriate
for use with a digital half-toning algorithm.
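For concreteness, here is a small sketch of the standard XYZ-to-L*a*b* mapping I'm describing; the D65 white point and the sample XYZ triple are just placeholders:

    def xyz_to_lab(xyz, white=(95.047, 100.0, 108.883)):
        """CIE XYZ -> CIELAB (L* lightness, a* red-green, b* blue-yellow)."""
        def f(t):
            delta = 6.0 / 29.0
            return t ** (1.0 / 3.0) if t > delta ** 3 else t / (3 * delta ** 2) + 4.0 / 29.0
        xr, yr, zr = (c / w for c, w in zip(xyz, white))
        fx, fy, fz = f(xr), f(yr), f(zr)
        L = 116.0 * fy - 16.0
        a = 500.0 * (fx - fy)
        b = 200.0 * (fy - fz)
        return L, a, b

    print(xyz_to_lab((41.24, 21.26, 1.93)))  # roughly the XYZ of a saturated red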
There are other versions, one version that I will be talking about a little bit later
in part of the presentation is a model developed by Brian Wandell at Stanford University called
the O1O2O3 model.
Now, the next slide here just illustrates, based on the O1O2O3 representation,
what these opponent channels look like.
So this first one is the lightness, which we extract by essentially holding the two
opponent color channels constant.
The second one is the red-green, colors are a little off, but that's okay.
So basically we're holding the lightness and the blue-yellow channel constant, and then
finally we get the blue-yellow channel, and that looks like this.
Now, oops, I didn't quite get to it yet.
Okay, and again, the colors are a little off here in terms of how they're displayed.
There is physiological evidence for these opponent channels in the visual system.
But basically, certain neurons will fire faster in response to a red stimulus and slower in
response to a green stimulus, and others behave in an analogous way for blue versus yellow stimuli.
Okay, so that's the second stage of the model.
Now we're ready to do the third stage, which is spatial frequency filtering.
Now, I was trying to make this presentation fairly concise, so I left out an important
caveat: these models all depend on an assumed viewing distance.
And this is one of the weaknesses, I think, of a lot of the work
that's been done on image quality in the image processing community; SSIM, in particular,
makes no assumption about viewing distance.
So we're going to be assuming here a certain viewing distance.
And this particular pair here, these are expressed in cycles per sample.
So the viewing distance is essentially taken out of that.
But the idea is that in the first channel, the luminance channel, we have much greater bandwidth.
This channel here we're showing as O1.
And in the two color opponent channels, we have a much more limited bandwidth,
and therefore we don't perceive detail as well.
And of course, this feature has been exploited.
For example, in JPEG, more DCT coefficients are allocated to the lightness channel
than to the Cr and Cb channels, which are another version of color opponent channels.
So here you can see that the spatial frequency response rolls off much more quickly
as a function of frequency in the two chromatic channels than in the luminance channel.
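To give a feel for what "rolls off more quickly" means, here is a toy comparison of a wide luminance response and a narrow chromatic response as a function of spatial frequency; the exponential forms and constants are invented for illustration, not the Näsänen or Mullen fits:

    import numpy as np

    freqs = np.linspace(0, 0.5, 11)  # cycles per sample

    # Toy spatial frequency responses: the chromatic channel falls off much faster.
    luminance_response = np.exp(-3.0 * freqs)
    chromatic_response = np.exp(-12.0 * freqs)

    for f, lum, chrom in zip(freqs, luminance_response, chromatic_response):
        print(f"{f:4.2f} cyc/sample  luminance {lum:.3f}  chromatic {chrom:.3f}")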
Okay, so now I'm going to illustrate this with a concept that I stole from Brian Wandell again.
It's very nicely illustrated in his book.
What we do is we take an original image.
We deconstruct it into those three channels, the lightness and the two opponent color channels.
And then we filter just one of the channels, put the image back together again, and look at it in RGB.
So here we've taken the original image and we filtered it only in the lightness channel.
And I think, depending on where you're sitting in the room, you should observe that the detail,
particularly in the eye of the bird, has been blurred quite a bit.
So that shows that the filtering has a pretty significant effect when we apply it to the luminance channel.
Now, we're going to use exactly the same filter, but now we're going to apply it to the chromatic channel.
And we filter O2, and there's a little bit of a sharpening effect here, which I think is an artifact,
but basically it's definitely not blurred relative to the original image.
And the same thing is true of O3.
So here we look at the original, we look at the result of filtering just the O3 channel.
We don't see any difference, which says that our ability to perceive that detail really isn't that strong in those two chromatic channels.
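That demonstration can be sketched as a tiny pipeline: transform RGB into an opponent representation, blur exactly one channel, and transform back. The 3x3 matrix below is an arbitrary invertible placeholder, not the actual O1O2O3 transform:

    import numpy as np
    from scipy.ndimage import gaussian_filter

    # Placeholder opponent transform (invertible, but NOT the real O1O2O3 matrix).
    RGB_TO_OPP = np.array([[0.30, 0.60, 0.10],
                           [0.50, -0.50, 0.00],
                           [0.25, 0.25, -0.50]])
    OPP_TO_RGB = np.linalg.inv(RGB_TO_OPP)

    def blur_one_channel(rgb_image, channel, sigma=3.0):
        """Blur a single opponent channel, then reconstruct the RGB image."""
        opp = rgb_image @ RGB_TO_OPP.T                  # H x W x 3 opponent image
        opp[..., channel] = gaussian_filter(opp[..., channel], sigma)
        return opp @ OPP_TO_RGB.T

    image = np.random.rand(64, 64, 3)
    blurred_lightness = blur_one_channel(image, channel=0)  # detail visibly lost
    blurred_chroma = blur_one_channel(image, channel=2)     # hardly noticeable to a viewer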
Okay, now how do we use this model?
This model turns out to be very beneficial for some of the work that we've been doing with design of half-tone textures.
There's no image content.
We're looking at the basic ability of a human viewer to perceive fluctuations in texture,
and we'd like to minimize those fluctuations.
And so one of the projects that we've been currently investigating, and again, I don't know, actually it doesn't look too bad from here,
I see a little bit of an artifact here, typically when I project these half-tone images,
there isn't enough resolution in the projector to really show the effect.
But what you should see, let me just step back here a little bit, or see what it looks like.
Yeah, I think it looks pretty similar to what I see up front.
So here what we're doing, these are half-tone textures that have been designed with a model-based algorithm.
I talked about it this morning, actually; it's called direct binary search (DBS).
And we have a specific combination of what are called Neugebauer primaries, which are the combinations of colorants that actually get printed.
So it's 40% yellow, 20% magenta, 0% cyan-magenta, and 40% white.
So there's only really three colors here, yellow, magenta, and white.
And so the yellow is a little funny looking, but you can sort of see the zoomed-in version to get a sense of what it is.
Now, what we have here is one model that we've worked with for years, which uses the Näsänen filter, developed by a gentleman named Näsänen,
for the luminance channel, and then a single filter for both chromatic channels, which is based on some data collected by Mullen, who's a psychophysicist.
And over here, we have the same chromatic channel filter, the Mullen filter for both chromatic channels, but we're using a different luminance filter
due to Scott Daly, who worked for a compression company.
I can't remember the name, the audio compression company.
And then finally, this one used Wandell's model, which we had never really used in this application before.
And I'm going to let you folks vote. Which one do you think looks the smoothest?
Raise your hand if you think this one looks the smoothest.
Okay, raise your hand if you think this one looks the smoothest.
And raise your hand if you think this one looks the smoothest. Yeah, nobody; this one clearly is grainier.
Now, my student actually felt that this was smoother, but last night when I was looking at it, I thought, no, I really think this one is smoother.
So we have to do a little bit more of an evaluation of this, but it's good to get input; the audience here shouldn't just defer to me.
Okay, so that's the conclusion of the first part of my talk, which is this three-stage spatial chromatic human visual system model,
which we have very extensively used in real applications to develop real half toning algorithms for real products.
The next part of this is going to be newer work. Well, that work is actually very current; those patterns I just showed, I presented last week when I was visiting a division of HP.
But this is work that I don't know that it's been used yet.
So this represents some work that was done by one of my Ph.D. students. I co-supervised him with Ed Delp, who has done a lot of work with video compression.
The focus on this was temporal drift correction, and we'll talk about what that means in a minute.
And this was something we presented at the Color Imaging Conference in San Diego about a month ago, I guess.
So I think some of you may be familiar with video coding.
Basically, we're going to encode the interframe difference and send that residue information.
And Mark Shaw, who has a color science background but has been very interested in video compression,
had as his major premise that the way people have been doing this historically doesn't really take knowledge from color science into account.
So here's the MPEG framework, and I'm not going to go through the details, it's a bit complex,
but basically we're going to have this residue, which is a frame-to-frame difference,
and then we transform scale and quantize it, and that's what gets encoded.
And Mark Shaw's premise was that we should pre-process this residue to remove information that is not visually important,
where "not visually important" is determined by color models.
So I'll illustrate one aspect of that.
Whoops, yeah, here it is.
So we have here two frames: frame k+1 and the motion prediction of it.
We look at the difference between those, and the key idea is we're going into this uniform color space,
and we're looking at a color difference image expressed in terms of delta E units, which is Euclidean distance in CIELAB,
and we're going to suppress the residue where the color difference is not very large.
So this is a mapping that we will apply to it,
and you can see that if the color difference is less than 2 delta E units, we don't send any residue information,
and then we linearly increase the amount of residue information until we get to about a color difference of 4 delta E
for this particular tone map, and then we're sending all of it.
So that is a way to remove information that the observer wouldn't be sensitive to in the video sequence.
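As a rough sketch of that mapping, under my reading of it: compute the per-pixel delta E between the frame and its prediction, then scale the residue by a factor that is 0 below 2 delta E and ramps linearly to 1 at 4 delta E. The thresholds match the slide; everything else here is illustrative:

    import numpy as np

    def delta_e(lab_a, lab_b):
        """Euclidean distance in CIELAB (delta E) per pixel."""
        return np.sqrt(np.sum((lab_a - lab_b) ** 2, axis=-1))

    def residue_gain(de, low=2.0, high=4.0):
        """0 below `low` delta E, ramps linearly to 1 at `high` delta E, 1 above that."""
        return np.clip((de - low) / (high - low), 0.0, 1.0)

    # Toy frames in LAB, H x W x 3 (placeholders for frame k+1 and its prediction)
    frame = np.random.rand(32, 32, 3) * 100
    prediction = frame + np.random.randn(32, 32, 3)

    gain = residue_gain(delta_e(frame, prediction))
    residue = (frame - prediction) * gain[..., None]   # suppressed where delta E < 2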
So now this was just one aspect of a series of four things, or five things, that Mark investigated.
We published this in a series of publications, including his dissertation, if you're interested.
But what I'm mainly going to focus on today is this concept of drift correction, which is the most recent work that we've done.
The idea of drift correction is that, okay, so I go along here.
First of all, this is frame count, and this is SSIM, which is the structural similarity measure
that was developed by Professor Bovik and his students at the University of Texas.
It's very widely used; a value of one means very good quality, and as the number goes down, the quality becomes poorer.
So what happens here?
Normally, well, this is the standard codec, and we're trying to compress more than that, but without a visual change.
So when we compress more than that, we're going to get a lower SSIM value, but this is okay because we don't have a sudden change from frame to frame.
And here, all of a sudden, we get to an I-frame, and we reset everything, and there's a big difference.
And the viewer sees that as a change, which is undesirable.
So what we do is we start to send more residual information as we get closer to the I-frame,
and that's controlled by this term called the drift factor.
Now, of course, there's no free lunch.
If we're sending more residual information, we're going to lose a little bit in terms of compression, as you'll see.
But the idea is to do this just a little bit and improve the visual quality, but not sacrifice too much of our compression ratio.
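One way to picture the drift factor, as a hedged illustration rather than the exact formulation in the paper: a weight that grows as the next I-frame approaches, so the fraction of residual information we keep ramps up and the jump at the reset shrinks. The linear ramp and the names are my own placeholders:

    def residual_fraction(frames_since_iframe, gop_length, base=0.5, drift_factor=0.5):
        """Fraction of residual information to transmit for the current frame.

        Starts at `base` right after an I-frame and ramps toward 1.0 as the
        next I-frame approaches, controlled by `drift_factor`.
        """
        progress = frames_since_iframe / max(gop_length - 1, 1)
        return min(1.0, base + drift_factor * progress)

    for k in range(8):  # GOP of 8 frames as a toy example
        print(k, round(residual_fraction(k, 8), 3))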
So this is where it goes. There's a lot of other stuff here I'm not going to talk about.
But in the end, we get something that looks like this, again, measuring it in terms of SSIM, which is a quantitative metric.
You can see that with one method called dynamic quantization.
That was what I showed you before.
We get a very sudden jump in quality, which is visually perceivable.
Whereas if we gradually start to increase the transmission of residual information, the size of that jump when we get to the I-frame is much smaller.
Now, we didn't think that looking at SSIM was going to be a completely reliable predictor of visual quality, so we did some psychophysics.
And we did an experiment. We followed the recommendations of a standard. It's a double stimulus impairment scale.
We asked subjects to rate the video sequence on a five-point scale: imperceptible, perceptible but not annoying, slightly annoying, annoying, and very annoying.
And then we did some other things. We had 39 subjects: 27 were at Purdue and 12 were at the Boise site.
And so I'm going to show you the results from that in the form of pie charts.
So now this is the reference. So that's the standard encoder.
And what's interesting here is that even with the standard encoder, so we're showing two sequences in random order, sometimes we show the same sequence.
But they think they're different. Of course, they're not.
So 16% thought that there was a perceptible but not annoying difference between two sequences that were actually the same.
So that's basically noise in our measurement process that we have to take into account when we interpret the rest of this data.
So now if we do something called the dynamic tone map, which means we're modulating how much residual information is encoded based on spatial structure in the image,
5% found the difference between the reference and the further-compressed version to be annoying,
16% said it was slightly annoying, and 34% said the difference was perceptible but not annoying.
Now if we go to this drift correction, we decrease annoying from 5% to 1%, slightly annoying from 16% to 8%,
perceptible but not annoying didn't go down very much, just a tiny bit, and imperceptible went up from 43% to 59%.
Nobody found it to be very annoying; previously, 2% had said it was very annoying.
So we were pretty satisfied with that.
I have one more slide here to show some additional data.
So if we calculate something called percent favorable, which is imperceptible plus perceptible but not annoying, divided by the total number of observations,
99% found the reference to be favorable, and going from our dynamic tone map to our dynamic tone map with drift correction,
we increase the favorable response from 77% to 91%. There's no compression gain for the reference because we haven't done any further compression beyond it.
Here the dynamic tone map gives a compression gain of 58%; we lost a little bit with drift correction, and the compression gain went down to 43%.
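The percent-favorable number is just arithmetic on the category counts; here is a quick sketch with hypothetical counts chosen to echo the percentages I just quoted, not the actual experimental data:

    def percent_favorable(counts):
        """counts: dict mapping each impairment category to its number of responses."""
        favorable = counts["imperceptible"] + counts["perceptible but not annoying"]
        return 100.0 * favorable / sum(counts.values())

    counts = {"imperceptible": 59, "perceptible but not annoying": 32,
              "slightly annoying": 8, "annoying": 1, "very annoying": 0}
    print(percent_favorable(counts))  # 91.0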
So that's my final slide. The conclusion is that the conditional drift correction diminishes the appearance of a sudden, abrupt change at the I-frame reset.
This is compatible with the standard JM coder, and we did a psychometric experiment to confirm the validity of the results.
Okay, so that concludes the second of my three topics.
Now let's go to the third one, which is here.
Now this was also some recently presented work done by a former graduate student, Wu Chenglu.
We presented this at the International Conference on Image Processing in Phoenix back in September, so it's pretty current.
I've taken out all the math and just tried to stick with the essential concepts.
Here we're looking at the notion of two different types of image quality measures, one of which is general purpose,
and another which is focused on near-threshold distortion, which is what we care about if we're working in the domain of high-quality imaging.
Now one of the weaknesses of this slide is that I don't think many of you can see much difference between these different images here.
These are more distorted. These are near threshold. You might see some difference between these two and between these two, where you can see that this is more distorted than this one, and this is more distorted than that one.
So we're going to be focusing on developing a new way to predict the visibility of distortion in the near threshold domain.
So the concept is that there are a number of different aspects in the model, and I'll show you a block diagram of the model in a minute,
but the idea is that we have a reference image with some noise added to it, and the hypothesis, which is validated by ground truth,
is that people are more sensitive to noise in areas where the content is recognizable.
So in particular, in this image of this little girl, we know what a girl's face looks like, we know what her shoulders look like,
and we have an expectation as to what those should look like, so we're very sensitive to distortion there.
Whereas here, it's some kind of grass or weeds or something, and here's some water.
So this actually showed, now this water is fairly smooth, so there we actually are fairly sensitive to the distortion.
Now let me explain what this is. This is a threshold map.
So the lower the threshold, the more easily we see the distortion.
So blue means we see it more easily, and yellow, orange, and red mean we are less likely to see it.
So this part corresponds to that area right there where the girl's face was very sensitive to that distortion.
So now the question is, can we build a model that will take into account this structural nature of the image
and create a distortion map that more accurately models what we have here?
This is the model output, and we're going to compare it with ground truth.
So anyway, this shows in more detail, for those of you in the back of the room,
what the image actually looks like. It's a pretty low resolution, and actually it looks like this is a little bit closer to her shirt than her face.
But in any case, this is how the distortion prediction works.
We have a reference image and a distorted image, we break them up into patches,
and then within each patch we compute overlapping 16 by 16 sub-blocks, which overlap by 50%.
Then we can basically look at the difference between these, and we'll talk about the sub-block distortion model on the next slide.
That generates a distortion map, and then we pool that to get a single number.
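For the bookkeeping, here is a small sketch of pulling 16 by 16 sub-blocks with 50% overlap out of an 85 by 85 patch; this is illustrative helper code, not the actual implementation:

    import numpy as np

    def overlapping_blocks(patch, block=16, overlap=0.5):
        """Yield block x block sub-blocks with the given fractional overlap."""
        step = int(block * (1.0 - overlap))          # 8-pixel stride for 50% overlap
        h, w = patch.shape[:2]
        for i in range(0, h - block + 1, step):
            for j in range(0, w - block + 1, step):
                yield patch[i:i + block, j:j + block]

    patch = np.random.rand(85, 85)                   # stand-in for one 85 x 85 patch
    blocks = list(overlapping_blocks(patch))
    print(len(blocks))                               # number of 16 x 16 sub-blocks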
Now you may wonder, why are we looking at these individual patches?
Well that's because we're using a database in which subjects actually provided gradings of each individual 85 by 85 pixel patch in the image.
So we are going to train our model to mimic what the subject said.
And that data, oh god, now it's going out of my head.
Well I know the acronym for the database is here.
But anyway, here we have a log-Gabor filter bank. If you think about the model that I talked about earlier,
we have the color model, where we have the trichromatic stage,
then we have the color opponent stage, then we have the spatial filtering stage.
This is yet another stage in that model, very commonly used, in which we use the notion of receptive fields in the visual cortex
that are tuned to certain frequency bands at certain angles.
And we model that with a bank of log-Gabor filters.
So this is a very widely used approach.
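As a generic illustration of what one such filter looks like in the frequency domain, here is a minimal log-Gabor construction; the bandwidth constants and the bank layout are generic choices, not the ones in the paper:

    import numpy as np

    def log_gabor(size, f0, theta0, sigma_ratio=0.65, sigma_theta=np.pi / 8):
        """One log-Gabor filter (size x size, frequency domain), tuned to
        radial frequency f0 (cycles/sample) and orientation theta0 (radians)."""
        fy, fx = np.meshgrid(np.fft.fftfreq(size), np.fft.fftfreq(size), indexing="ij")
        radius = np.sqrt(fx ** 2 + fy ** 2)
        radius[0, 0] = 1.0                               # avoid log(0) at DC
        theta = np.arctan2(fy, fx)
        radial = np.exp(-(np.log(radius / f0) ** 2) / (2 * np.log(sigma_ratio) ** 2))
        radial[0, 0] = 0.0                               # zero DC response
        dtheta = np.arctan2(np.sin(theta - theta0), np.cos(theta - theta0))
        angular = np.exp(-(dtheta ** 2) / (2 * sigma_theta ** 2))
        return radial * angular

    # A small bank: a few frequency bands crossed with a few orientations
    bank = [log_gabor(64, f0, th)
            for f0 in (0.05, 0.1, 0.2)
            for th in np.linspace(0, np.pi, 4, endpoint=False)]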
And then we have some other operations, which I won't go into the details to keep this fairly on schedule.
And again, we have the reference patch and the distorted patch, and we look at sums and differences.
The sum gives us the denominator in this band-inhibition model,
and the numerator is the difference between the bands.
We pass that through a nonlinearity.
We pool the features.
The details can be found in the conference paper as well as in the dissertation.
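Schematically, that stage can be sketched as a divisive-inhibition style computation: band differences in the numerator, band sums plus a constant in the denominator, a pointwise nonlinearity, and pooling down to a scalar. The exponents and constants here are placeholders, not the values from the paper:

    import numpy as np

    def band_distortion(ref_band, dist_band, c=0.1, p=2.0):
        """Divisive-inhibition style distortion map for one sub-band pair."""
        numerator = np.abs(dist_band - ref_band) ** p
        denominator = np.abs(dist_band + ref_band) ** p + c
        return numerator / denominator

    def pooled_distortion(ref_bands, dist_bands):
        """Per-band nonlinearity, then pool across bands and space to one number."""
        maps = [band_distortion(r, d) for r, d in zip(ref_bands, dist_bands)]
        return float(np.mean([np.mean(m) for m in maps]))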
Now this is the part where we do the training.
So we have all these raw image sub-blocks, we extract the structure features.
That's what we hope will characterize what humans see and recognize in the image.
We fit a Gaussian mixture model to those.
And then we feed those model parameters in when we have an actual image for which we want to do a prediction.
We have, again, the reference and the distorted.
We're going to look at the contrast between them.
We do some light adaptation.
And then we do a mixture of different models with probability weights, the classical Gaussian mixture model,
in order to generate our final representation of the distortion.
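In rough scikit-learn terms, the mixture-model stage might look something like this, with placeholder feature dimensions and component counts:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Hypothetical structure features: one row per 16 x 16 sub-block
    features = np.random.rand(5000, 10)

    gmm = GaussianMixture(n_components=4, covariance_type="full", random_state=0)
    gmm.fit(features)

    # For a new sub-block, the posterior over components gives the probability
    # weights used to mix the per-component distortion predictions.
    new_block_features = np.random.rand(1, 10)
    weights = gmm.predict_proba(new_block_features)   # shape (1, 4), sums to 1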
Okay, now how did we evaluate this? We did 5-fold cross-validation.
This database is CSIQ, and I may or may not be able to remember the name of the professor who supervised the work.
He was at Oklahoma State University.
I think he has since moved to Japan.
Chandler, that's it, Chandler.
So there were 10,000 local patches, each of which had subject evaluation performed as to the visibility of the distortion between the original and the reproduction.
And these are the 30 images from which the patches are extracted.
So we have a training set, another set for validation.
We use the validation to refine the model.
And finally, we do testing with a different set.
It's a classical machine learning framework.
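Something along these lines, with placeholder arrays standing in for the 10,000 patch scores:

    import numpy as np
    from sklearn.model_selection import KFold

    X = np.random.rand(10000, 10)   # placeholder features, one row per patch
    y = np.random.rand(10000)       # placeholder visibility thresholds

    kfold = KFold(n_splits=5, shuffle=True, random_state=0)
    for fold, (train_idx, test_idx) in enumerate(kfold.split(X)):
        # fit the model on X[train_idx], y[train_idx]; evaluate on the held-out fold
        print(f"fold {fold}: {len(train_idx)} train patches, {len(test_idx)} test patches")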
And here's some more information about this.
Oh, these are the models we compare with.
A previous researcher had suggested 14 features, and we trained an SVM on those 14 features.
Beau Watson is somebody who has worked in this field for many years and done a lot of very good work;
he has done some of the basic multi-channel work.
So we look at his model, SSIM, that's the one I mentioned.
And then there's also a multi-scale version of SSIM.
These were both developed at the University of Texas.
And we do optimization on the validation set before we do any testing.
And I don't know if the slide says what kind of hold-out testing we did.
Anyway, okay, so now this is the data.
So this is a scale of root mean squared error in dB.
And this is our model.
This is the model based on just using the multi-channel.
This is the one that uses the features of the support vector machine.
This is our model.
And these are the Watson-Solomon model, SSIM, and multi-scale SSIM.
Oh, I know what we're looking at.
This is root mean squared error.
There should be a label up here.
This is correlation coefficient.
That's the blue.
And this is Spearman rank order correlation coefficient.
Both are popularly used measures.
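Both are easy to reproduce for any predictor; here is a quick sketch with SciPy on made-up predicted and ground-truth values, not our actual data:

    import numpy as np
    from scipy.stats import pearsonr, spearmanr

    ground_truth = np.random.rand(100)                     # placeholder threshold values
    predicted = ground_truth + 0.1 * np.random.randn(100)  # placeholder model output

    pearson_cc, _ = pearsonr(predicted, ground_truth)      # linear correlation coefficient
    spearman_cc, _ = spearmanr(predicted, ground_truth)    # rank-order correlation
    rmse = np.sqrt(np.mean((predicted - ground_truth) ** 2))
    print(pearson_cc, spearman_cc, rmse)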
So you can see that we want to be as close to one as possible.
And you can see that these two methods are very, very similar.
Ours is much less computationally intensive.
And we're doing significantly better than SSIM, especially the regular SSIM.
Down here we have root mean squared error.
And it's a similar story.
The whiskers show you the standard deviation across the data set.
And again, these two are very, very similar.
They're based on the model that we just talked about.
And then finally, we have here that the error is considerably higher for SSIM
and the other three basically.
The last slide I want to show you illustrates the predicted threshold map.
Remember, this is the thing we were trying to improve.
So we've highlighted here the areas in red that we think subjects will recognize
and the areas in orange where they don't recognize structure.
This is the actual ground-truth threshold map.
And now what we're trying to do is replicate that threshold map.
This is Watson's model and this is our model.
And so the question is, which of these two looks closer to that?
Same thing, which of these two looks closer to that and so forth.
So you do see some improvement in terms of the threshold map.
And again, you can clearly see areas where the subjects are recognizing structure
in the ground truth and we're trying to predict essentially that kind of behavior.
So that's my presentation.
Three different aspects of image quality.
And so again, I think the bottom line is that if you want to investigate image quality
and use image quality measures in your work,
you really should use measures that are based on models for the human visual system.
Each of the three approaches I talked about
has a strong basis in models of the human visual system.
Thank you very much.