Week 4 (June 6 - June 10, 2022)

This week, I continued investigating automated measures of visual complexity. Two of the measures I used to rate the complexity of images from the SAVOIAS and MS-COCO datasets were feature congestion and number of regions (the latter computed after segmenting each image with the mean-shift algorithm).

Feature congestion is a way of quantifying how “cluttered” an image is using a single number. A cluttered image has a “congested” feature space: it is difficult to add a new item to the image that would stand out (i.e., be visually “salient”), because the existing features are already highly variable. In other words, a cluttered image appears very “busy” or complex (e.g., very colorful, with variable textures). Feature congestion combines the local variability of three features (color, orientation, and luminance) across three scales into a single metric (Rosenholtz et al., 2007).
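To give a flavor of how this works, here is a rough sketch of the idea in Python. This is not the actual Rosenholtz et al. implementation (and not necessarily the code I used); it just pools the local variability of luminance, color, and edge orientation over a few image scales using OpenCV and NumPy.

```python
# Rough feature-congestion-style clutter proxy, NOT the official Rosenholtz
# et al. (2007) implementation: it naively sums the local standard deviation
# of luminance, color (CIELab a/b), and gradient orientation over a small
# Gaussian pyramid. Assumes OpenCV (cv2) and NumPy.
import cv2
import numpy as np

def local_std(channel, ksize=9):
    """Per-pixel standard deviation in a ksize x ksize neighborhood."""
    mean = cv2.blur(channel, (ksize, ksize))
    mean_sq = cv2.blur(channel * channel, (ksize, ksize))
    return np.sqrt(np.maximum(mean_sq - mean * mean, 0.0))

def clutter_proxy(bgr_image, n_scales=3):
    img = bgr_image.astype(np.float32) / 255.0
    score = 0.0
    for _ in range(n_scales):
        L, a, b = cv2.split(cv2.cvtColor(img, cv2.COLOR_BGR2Lab))
        lum_var = local_std(L).mean()                              # luminance variability
        col_var = (local_std(a).mean() + local_std(b).mean()) / 2  # color variability
        gx = cv2.Sobel(L, cv2.CV_32F, 1, 0)
        gy = cv2.Sobel(L, cv2.CV_32F, 0, 1)
        ori_var = local_std(np.arctan2(gy, gx)).mean()             # orientation variability
        score += lum_var + col_var + ori_var   # naive, unweighted pooling
        img = cv2.pyrDown(img)                 # move to the next (coarser) scale
    return score / n_scales
```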

I computed the correlations of feature congestion and number of regions per image with human judgements of visual complexity (provided by SAVOIAS for the SAVOIAS images, and by previous vislang experiments for the MS-COCO images). I also built webpages displaying the SAVOIAS and COCO images sorted from most to least complex according to each complexity metric.
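The correlation computation itself is simple; below is a minimal sketch with SciPy. Reporting both Spearman (rank) and Pearson correlations here is my own illustration rather than a record of exactly what I computed, and the arrays are hypothetical placeholders with one value per image.

```python
# Minimal sketch of correlating automated complexity metrics with human
# judgements; the arrays passed in are hypothetical (one value per image).
from scipy import stats

def report_correlations(human_scores, automated_metrics):
    """automated_metrics: dict mapping a metric name to its per-image scores."""
    for name, metric in automated_metrics.items():
        rho, p = stats.spearmanr(metric, human_scores)
        r, _ = stats.pearsonr(metric, human_scores)
        print(f"{name}: Spearman rho={rho:.3f} (p={p:.3g}), Pearson r={r:.3f}")

# Hypothetical usage:
# report_correlations(savoias_human_scores,
#                     {"feature congestion": congestion_scores,
#                      "number of regions": region_counts})
```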

Feature congestion and number of regions were strongly correlated with human complexity judgements for images in the SAVOIAS Interior Design category, but were noticeably worse at predicting complexity for images in the Objects and Scenes categories. I was surprised by how useful the pages of sorted images were for inferring why this might be. For instance, take the following photo of a bus in front of a building with many windows, from the MS-COCO dataset (Lin et al., 2014):

[Image: A bus in front of a building with many windows.]

This image was mean-shift segmented into a large number of regions, presumably due to the presence of the numerous windows on the building in the background, and also possibly due to the lettering/paint on the bus. Thus, if we used number of regions as our only measure of visual complexity, this image would be rated as highly complex. However, as a human being with commonsense knowledge of the world around me, I can easily identify the bus, building, pavement, etc., and am not so preoccupied with the number of windows on the building. I probably would not consider the above image to be exceptionally complex, because I see it primarily in terms of the building and the bus as objects, rather than considering each of their constituent parts (e.g., a single window). This example illustrates the challenge of quantifying visual complexity using low-level features like number of image regions.
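For context, counting regions after mean-shift segmentation can be done along the following lines. This is a generic scikit-learn/SciPy sketch, not necessarily the exact pipeline I used; it clusters pixels in a joint (position, Lab color) feature space and then counts spatially connected regions.

```python
# Generic sketch: mean-shift segmentation via scikit-learn, then a region
# count via connected components. Not necessarily the pipeline I used.
import numpy as np
from scipy import ndimage
from skimage import color, io
from sklearn.cluster import MeanShift, estimate_bandwidth

def count_meanshift_regions(path, max_side=128):
    img = io.imread(path)
    step = max(1, max(img.shape[:2]) // max_side)
    img = img[::step, ::step]          # crude downsampling to keep it fast
    lab = color.rgb2lab(img)
    h, w = lab.shape[:2]
    yy, xx = np.mgrid[0:h, 0:w]
    # One feature vector per pixel: spatial position plus Lab color.
    feats = np.column_stack([yy.ravel(), xx.ravel(), lab.reshape(-1, 3)])
    bandwidth = estimate_bandwidth(feats, quantile=0.1, n_samples=500)
    labels = MeanShift(bandwidth=bandwidth, bin_seeding=True).fit_predict(feats)
    labels = labels.reshape(h, w)
    # A cluster can cover several disconnected patches, so count connected
    # components per cluster rather than just the number of clusters.
    n_regions = 0
    for cluster in np.unique(labels):
        _, n = ndimage.label(labels == cluster)
        n_regions += n
    return n_regions
```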

During my meetings with Professor Ordóñez-Román this week, we therefore discussed alternative automated metrics for measuring visual complexity, the possibility of combining feature congestion and number of regions into a more robust metric, and newer algorithms than mean-shift for image segmentation.

Unfortunately, linearly combining feature congestion and the mean-shift region count into a single metric did not noticeably improve the correlation with human judgements of complexity. I’m now reading more papers and documentation to understand how we might use alternative algorithms and metrics to quantify visual complexity.
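For completeness, here is roughly how one could fit and evaluate such a linear combination; it reuses the hypothetical per-image arrays from the earlier correlation sketch and is not the exact combination I tried.

```python
# Sketch: fit human_scores ~ w1*congestion + w2*region_count and compare the
# held-out predictions against the human judgements, so the learned weights
# aren't scored on the data they were fit to.
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

def combined_metric_correlation(congestion, region_counts, human_scores, cv=5):
    X = np.column_stack([congestion, region_counts])
    combined = cross_val_predict(LinearRegression(), X, np.asarray(human_scores), cv=cv)
    rho, _ = stats.spearmanr(combined, human_scores)
    return rho
```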

In addition to meeting with Professor Ordóñez-Román this week, I also met with Ziyan, one of the PhD students in the vislang lab, to discuss how the human complexity scores for the MS-COCO data were generated experimentally by vislang. Basically, people were shown 10 images at a time and asked a fixed set of 3 questions about each image (e.g., are there people in this image?). People could repeat the task for new sets of images, and multiple people answered the questions for each of the 5,000 images. The complexity score for each image is a linear combination of how long people took to answer the questions and interrater agreement on the questions. Intuitively, a more complex image should result in longer answering times and lower interrater agreement.

Unfortunately, after computing correlations between these human judgement-based complexity scores and the automated metrics (feature congestion, number of regions), and after qualitatively inspecting the images ranked as most complex by the human judgement metric, it appears that the metric is strongly affected by the types of questions that were asked (and perhaps also by variability in answering times across raters).
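Purely to illustrate the idea (I don’t know the lab’s actual weights or normalization, so everything below is a guess at the general shape of such a score):

```python
# Hypothetical sketch of a complexity score built from answering time and
# interrater agreement; the real weights/normalization used by the lab are
# not reproduced here.
import numpy as np

def zscore(x):
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def human_complexity_score(mean_answer_time, interrater_agreement,
                           w_time=1.0, w_agree=1.0):
    # Longer answering times raise the score; higher agreement lowers it.
    return w_time * zscore(mean_answer_time) - w_agree * zscore(interrater_agreement)
```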

For example, when I wondered why the following image was rated as highly complex, Professor Ordóñez-Román speculated that it was because there are people in the background of the image who are hard to pick out, since distance and perspective make them appear tiny. One of the questions asked whether people were present in the image, so raters probably took longer to respond because it is hard to confirm whether the distant specks on the beach are indeed people, and a longer answering time increases the image’s complexity score.

[Image: A beach with a boat on the left, birds in the air, and people in the distance.]

Professor Ordóñez-Román recommended that we nail down an automated visual complexity metric by next week. Once that’s done, we can label COCO images as complex or non-complex according to that metric and train a text classifier on the corresponding COCO captions to distinguish the two classes. Basically, the automated visual complexity metric will act as our “ground truth.”
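As a rough preview of what that classifier might look like (the median split and all names here are my own placeholders, not a settled design):

```python
# Sketch of the planned caption classifier: threshold the automated complexity
# metric into complex / non-complex labels (median split is an assumption),
# then fit a simple bag-of-words model on the captions.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

def train_caption_classifier(captions, auto_scores):
    labels = (np.asarray(auto_scores) > np.median(auto_scores)).astype(int)
    X_train, X_test, y_train, y_test = train_test_split(
        captions, labels, test_size=0.2, random_state=0)
    clf = make_pipeline(TfidfVectorizer(min_df=2),
                        LogisticRegression(max_iter=1000))
    clf.fit(X_train, y_train)
    print("held-out caption-only accuracy:", clf.score(X_test, y_test))
    return clf
```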

On my plate for next week:

  • Testing this implementation of the Selective Search algorithm
  • Reading the paper and documentation on Faster R-CNN
  • Both of the above could be used to quantify visual complexity by counting the number of proposed object locations per image (see the sketch after this list).
  • Packing for & moving into my dorm at Rice!
  • (hopefully) getting the stitches from last week’s (minor) surgery taken out
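As a preview of the first two items above, counting proposed object locations per image might look something like the sketch below. It uses the Selective Search implementation in opencv-contrib-python (cv2.ximgproc), which may or may not be the specific implementation I end up testing; a Faster R-CNN-based count would follow the same pattern, using that model’s proposals or detections instead.

```python
# Sketch: count Selective Search proposals per image as a complexity proxy.
# Requires opencv-contrib-python; this may not be the exact implementation
# referenced above.
import cv2

def count_selective_search_proposals(path, fast=True):
    img = cv2.imread(path)
    ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
    ss.setBaseImage(img)
    if fast:
        ss.switchToSelectiveSearchFast()      # fewer, faster proposals
    else:
        ss.switchToSelectiveSearchQuality()   # more thorough, slower
    rects = ss.process()  # one (x, y, w, h) box per proposed object location
    return len(rects)
```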

It’s hard to believe that next week will mark the halfway point of my DREU experience, and that I’ve already learned so much!

References:

Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft COCO: Common Objects in Context. In D. Fleet, T. Pajdla, B. Schiele, & T. Tuytelaars (Eds.), Computer Vision – ECCV 2014 (Vol. 8693, pp. 740–755). Springer International Publishing. https://doi.org/10.1007/978-3-319-10602-1_48

Rosenholtz, R., Li, Y., & Nakano, L. (2007). Measuring visual clutter. Journal of Vision, 7(2), 17. https://doi.org/10.1167/7.2.17


Written on June 10, 2022