Structure & Context: The problem

The Effects of Congruency Between Structural & Contextual Dominance in Image Processing

The problem with pictures is that they are largely contextually defined rather than structurally defined. That is, it is much more common for a person to describe a picture in terms of the contextual objects (people, places, or things) within the picture than in terms of the visual elements that make up the picture (circles, squares, points, lines, colors, and shading). One might say, "Of course! It would be too cumbersome to describe a picture in those terms because they are not specific enough!" This statement is true if you are only considering the contextual meaning of a picture. Circles, lines, and points have no contextual meaning of and by themselves. However, if an artist wishes to create or modify a picture, the only elements available for this purpose are lines, circles, squares, color, and shading. There are no art brushes that paint faces automatically, such that when the brush is dragged across the canvas a face appears. Instead, the artist must vary lines, shapes, and points, combined with shading and textures, in such a way as to produce contextually recognizable elements in the picture. The question is then, how does an artist know how to vary these structural elements so as to produce the desired contextual meaning of a picture if these elements have no contextual meaning of and by themselves? Even more problematic is how an instructional designer specifies a necessary picture to an artist such that the appropriate contextual meaning is communicated by the picture the artist creates.

In an instructional message there are often subtle, yet critical, aspects that must also be emphasized in supportive graphics or images. The instructional designer must be able to describe the contextual emphasis such that an artist or graphic designer will be able to manipulate the structural elements of the picture in a corresponding fashion. The goal is to make the structural emphasis congruent with the contextual emphasis. The emphasis of the contextual message and the emphasis of the structure of a picture are referred to in this paper as contextual dominance and structural dominance, respectively.

In an instructional message the contextual dominance is most often conveyed in the form of printed or spoken sentences. Within any sentence used in conjunction with a picture are nouns or phrases that directly relate to contextual elements within the picture. For instance, if we were presented a picture and heard "The man walked through the door," we would expect to identify a man and a door in the picture. Both "man" and "door" may be called referents in the sentence since they refer to objects perceptible in the picture. The study described herein varied the dominance of referents used in a number of sentences and compared the patterns of subsequent observations of 15 pictures. The goal was to identify structural and/or contextual elements that stimulated consistent patterns of observation. A brief discussion of the literature, the methodology used, results for two pictures, and a summary of conclusions follow.

Synthesis of the Literature

Discussing how someone views an image depends also on describing the image itself. Most of the research to date uses contextual definitions to describe the images and not structural ones, but there have been some attempts to quantify the structure in terms of variations of color, complexity, and layout.

Color, Complexity, and Layout

In general, color has been found to be helpful when utilized to emphasize dominant contextual features, but it is a distractor when not relevant to context (Luder & Barber, 1984; Yarbus, 1967; Reid & Miller, 1980). For instance, Luder and Barber found that color could be utilized to highlight specific elements within a complex display, while Reid and Miller found that full color illustrations of anatomical drawings presented too many distractors from what was considered important in the drawing. Complexity and picture layout have been found to elicit different viewing strategies among students. Specifically, Malcolm Fleming utilized four "layouts", one with a caption preceding an illustration and another with it following. Each of these layouts also varied by complexity. Utilizing eye movement as a dependent variable, he found that gender and previous experience with the information were strong factors in which processing strategy was chosen: image first or caption first (Fleming, 1984).

Another attribute, which relates to complexity, is that of the "degree of realism" to which an image may be attributed. The "degree" represents a continuum from concrete, or realistic, to abstract, or non-realistic. Gavriel Salomon describes this continuum operationally in terms of the degree of "coding" one must do when encountering a visual representation of something.

"Certain representations appear to be more "realistic" because their symbolic form comes closer to the way users represent the depicted entity to themselves. The less recoding something requires, the mentally "easier" it is to experience and the more "real" it appears." (Salomon, 1981 pg. 201)

Relative to this "concrete to abstract continuum", concrete images are remembered better than abstract images (Paivio, 1983; Winn, 1982; Findahl & Hoijer, 1976; Wolf, 1970). On the other hand, abstract graphics have been found to be more successful in educational contexts due to fewer structural elements (Heuvelman, 1987).

Visual Design of an Image

One aspect left largely unstudied in most of the research in image utilization is that of the visual design of the stimuli being used. Visual design here refers to the structural relationships of visual elements within an image. Layout, mentioned earlier in the Fleming study of complexity, is the closest variable this researcher found in the literature to that of visual design. Nesbit captured this very common problem in the following statement.

"Little attempt was made to examine the design qualities of the picture itself or the component cues of the picture in terms of learning theory. All modes of pictorial representation were considered as infinitely large masses of stimuli and examined as such... no attention was given to isolating those elements which make an instructionally effective visual." (Nesbit, 1978 pg. 496)

The variable of style was of interest to Molner. He believed that traditional definitions of style, usually treated as purely affective and subjective, were identifiable in terms of structural attributes of an image. His eye movement study of Renaissance and Baroque paintings illustrates that the structural attributes of paintings may be utilized in such a stylistic manner that they can influence the viewer to scan an image at a certain speed and focus. He found that art of the Renaissance was viewed with large and slow eye movements, while Baroque art produced denser and shorter eye movements (Molner, 1981).

Image and Language Processing in a Context

Utilizing a "dual-coding information processing" model of learning, Kozma points out that if information in long-term memory may be stored, not only semantically but pictorially as well, then images can be retrieved into short-term memory in response to either nonverbal or verbal stimuli. He summarizes the research by stating that if the same information is stored both pictorially and verbally, it is more likely to be retrieved (Kozma, 1986). Just as printed text under an image provides contextual information utilized to view an image, so does spoken text during or immediately preceding an image affect the processing of that image. Many media today entail the learner processing audible textual information and visual information arriving at the same time. Very few studies have examined the intricate interactions involving the chunking, sequencing, and pacing of these dual-channel stimuli. But, it has been confirmed that the majority of the contextual effect is provided by the verbal text, and when verbal contextual cues are presented prior to viewing an image, a semantic attention to the image is evoked (Koroscik, Desmond, & Brandon, 1985).

This strategic focus supports previous prescriptive conclusions relating to color, complexity, and the schematic design of an image. Both point to contextual cues dictating structural prescriptions in the visual channel and specific reliance on the verbal channel for being the major carrier of contextual information.

Koroscik, Desmond, and Brandon examined this relationship among structural, semantic, and verbal contextual information. They began the study with the following hypotheses:

"It was speculated that the encoding and retention of art is subject to the type of contextual information given to viewers at presentation. Verbal labels with literal references to the objects, persons, or events depicted in an artwork ought to evoke semantic encodings that differ from those generated in response to verbal references pertaining to the work's expressive qualities and/or other non-literal aspects of the depicted content." (Koroscik, 1984 pg. 332)

By presenting first a verbal contextual cue and then presenting an image, they were able to measure the effect of the context on retention of the desired message. Since the contextual cues referred literally to structural elements within the images, some information was also gained about the degree of distraction of "non-pertinent" structural elements. Results also indicated that accurate interpretation of meaning was a function of the level of abstraction that characterized each artwork and of the type of contextual information given at input (Koroscik, 1984).

These results offer very few prescriptions to a message designer other than a sense that context and the structure of an image may somehow be interdependent. On one hand structural concerns are important for immediate interpretation and long-term memory, yet on the other there is, at some point, a shift of focus to contextual aspects of an image. Which is dominant, structure or context? What is the relationship between the abstract-to-concrete word continuum and an abstract-to-concrete image continuum? The concept of congruency begins to explore the relationships posed by these questions.

Congruency Between Images and Language

In his review of the contributions of eye movement studies to research, Marschalek states:

"...thus the compositional structure affects perception most dramatically when structure and meaning are united within the same areas of the picture." (Marschalek, 1986 pg. 135)

This succinct statement encapsulates the idea of congruency between words and images. The concept of congruency between the structural components of an image and the context presented is of extreme importance to the message designer. Congruency deals with the basic problem of linking words and images together. Many researchers have found that when structural features and contextual features are congruent, attention to the message is maximized (Heuvelman, 1987; Hsia, 1977; Marschalek, 1986; Miller, 1982; Nodine, 1982; Wember, 1976).

A growing number of cognitive scientists are specifically looking at the relationship between the cognitive processing of words and the processing of images. A Dual-Coding Theory was developed by Paivio which stated that:

"The verbalization of a picture's features increases the probability that two codes are activated in the formation of stimulus memories. The argument is that sensory features of pictures are stored in imagery codes, while the products of verbalization are retained as verbal or linguistic codes." (Madigan, 1982, pg. 80)

This statement is contrary to a "common code" view that says pictures somehow possess a faster access than words to a common conceptual system. The dual-coding view, on the other hand, sees picture-word latency differences as stemming from a time-consuming translation from one symbolic code to another and holds that semantic information required in the decision task is typically stored nonverbally. A number of other researchers have been testing similar concepts and have arrived at some meaningful conclusions. Segal and Fusella (1970) found, by using interference tests, that cognitive processing is modality-specific (visual or audio), i.e., the brain operates with a separate processing system for each modality. Nugent (1982) and Wickens (1984) found that "learners process pictorial and linguistic information through functionally independent, though interconnected, cognitive systems."

Whenever one deals with issues of context, they are subject to personal interpretation as to their meaning, importance, and dominance. For an individual, attribution of meaning will depend on knowledge the viewer already has, knowledge that can be associated with the incoming information (Heuvelman, 1987). This view is commonly held by many researchers, and recent language comprehension studies have indicated that language processing involves a context-dependent knowledge base that operates in an integrative and elaborative manner (Anderson & Ortony, 1975; Barclay, 1973; Bransford, Barclay, & Franks, 1972; Marschark & Paivio, 1977). Dillen (1983), Braden & Walker (1980), and Wise (1982) also point to prior knowledge of the individual as a variable which significantly determines how an image will be perceived. Craik and Lockhart place prior experience as a comparative referent within the memory storage system. These researchers go on to describe two stages of the memory formation process. They are:

"EARLY STAGES...: the analysis of physical features of incoming information...and
LATER STAGES...: concerned with matching input against stored knowledge from past learning, and with abstracting meaning."

Some researchers have consistently found that when images that include people are utilized, fixations center on their faces to the almost total exclusion of anything else in the image (Buswell, 1935; Chu & Schramm, 1975; Guba et al., 1964; Yarbus, 1967). The implication of this is that people know from experience that faces, and animate objects in general, are primary providers of contextual information.

Animate objects receive a higher density of fixations than inanimate objects when both are contained in the same picture. When considering portraits, the highest density of fixations occurs on the eyes, nose, and mouth because these areas of the face tend to convey information concerning emotion and the degree of physical attractiveness of the individual. (Yarbus, 1967 pg. 28)

Structural and Contextual Dominance

A primary concept within this study, in relation to the analysis of contextual and image structure, is that of dominance. It has been found that it is possible to define an image in terms of its physical structure apart from its contextual elements (Friedman & Polson, 1989; Koroscik, 1984; Marschalek, 1986). Likewise, it is proposed that linguistic analysis can focus on the structural aspects of a sentence apart from the contextual aspects (Mason, Kniseley, & Kendall, 1979; Smith & van Kleeck, 1986). In each of these four categories (image structure, image context, sentence structure, and sentence context) it is proposed that a dominance exists which may be utilized in comparing these categories, resulting in a description of congruence, non-congruence, or ambiguity. It is toward the definition of image structure, and specifically those structural elements which affect dominance, that this study is directed.
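As a rough illustration of how such a comparison might be operationalized, the sketch below labels a picture and sentence pair as congruent, non-congruent, or ambiguous. The numeric dominance scores, the `margin` threshold, and the rule that a near-tie yields ambiguity are assumptions introduced here for illustration only; they are not the scoring procedure used in this study.

```python
from typing import Dict, Optional


def dominant(scores: Dict[str, float], margin: float = 0.1) -> Optional[str]:
    """Return the highest-scoring element, or None when no element leads
    the runner-up by at least `margin` (an assumed tie rule)."""
    if not scores:
        return None
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) > 1 and ranked[0][1] - ranked[1][1] < margin:
        return None
    return ranked[0][0]


def congruence(structural: Dict[str, float],
               contextual: Dict[str, float],
               margin: float = 0.1) -> str:
    """Compare the structurally dominant element of a picture with the
    contextually dominant referent of its accompanying sentence."""
    s = dominant(structural, margin)
    c = dominant(contextual, margin)
    if s is None or c is None:
        return "ambiguous"
    return "congruent" if s == c else "non-congruent"


# Hypothetical scores for "The man walked through the door."
structural = {"man": 0.7, "door": 0.2}  # assumed visual emphasis
contextual = {"man": 0.8, "door": 0.3}  # assumed sentence emphasis
print(congruence(structural, contextual))
```

Treating a near-tie as ambiguity is only one possible reading of the third outcome; a different rule could be substituted without changing the overall comparison.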

Since no models emerge from the literature, we may turn to the discipline of graphic design for a description of visual form by Wucius Wong (Wong, 1972, pg. 6) which seems to fit a hierarchical description of visual form. An adaptation of his hierarchy appears below.

Visual Form:

as BASIC ELEMENTS: of Point, Line, Plane, Volume
as STRUCTURAL ELEMENTS: of Shape, Size, Color, Texture
as RELATIONAL ELEMENTS: of Contrast, Direction, Position, Space, Gravity
as CONTEXTUAL ELEMENTS: of Representation, Function, Meaning

These four levels of visual form give us not only a language to describe what we see in an image, but also a hierarchy which is harmonious with our previously stated levels of analysis. Describing visual form in this manner is no simple task, for in order to compare images reliably they must be described in terms of their attributes at each of these four levels. It is not long before one realizes that it does indeed take a thousand words to describe a picture. This overwhelming task of image description is reduced through the use of the concept of dominance.
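To make the scale of that descriptive task concrete, the four-level hierarchy adapted from Wong can be sketched as a simple record with one slot per attribute. The class name, method names, and sample notes below are hypothetical and serve only to show how a description at every level might be organized and checked for completeness.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# The four levels and their attributes, as listed in the adapted hierarchy.
LEVELS: Dict[str, List[str]] = {
    "basic": ["point", "line", "plane", "volume"],
    "structural": ["shape", "size", "color", "texture"],
    "relational": ["contrast", "direction", "position", "space", "gravity"],
    "contextual": ["representation", "function", "meaning"],
}


@dataclass
class ImageDescription:
    """Hypothetical container for notes on one image at each level."""
    title: str
    notes: Dict[str, Dict[str, str]] = field(
        default_factory=lambda: {level: {} for level in LEVELS}
    )

    def describe(self, level: str, attribute: str, note: str) -> None:
        """Record a note, rejecting attributes outside the hierarchy."""
        if attribute not in LEVELS.get(level, []):
            raise ValueError(f"{attribute!r} is not an attribute of {level!r}")
        self.notes[level][attribute] = note

    def missing(self) -> Dict[str, List[str]]:
        """List the attributes at each level that still lack a note."""
        return {
            level: [a for a in attrs if a not in self.notes[level]]
            for level, attrs in LEVELS.items()
        }


# Hypothetical picture accompanying "The man walked through the door."
picture = ImageDescription("man and door")
picture.describe("structural", "shape", "tall rectangle beside an upright figure")
picture.describe("contextual", "representation", "a man and a door")
print(sum(len(attrs) for attrs in picture.missing().values()))  # attributes left
```

Even with two attributes noted, fourteen remain undescribed for a single image, which suggests why a full description at every level quickly becomes unwieldy.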

In discussing the structural form of an image, the most appropriate discipline to draw from is art criticism or art history. Specialists in this field are skilled in using words to describe images. A common practice of many art historians is to reduce what may seem to be a very complex image to a few general shapes, colors and textures. An example of this is a description of Goya's "The 3rd of May 1808: The Execution of the Defenders of Madrid" (Fig. 1).

Figure 1

"Its organic structure, based on triangles and strong diagonals, is peculiarly fitting to the theme, and its neutral colors in grays and browns, with a splash of red in the pool of blood heightens its emotional impact." (Gardner, 1959, pg. 443)

One item to note from this description is the implication that these structural elements are arranged to produce an emotional impact and not a linguistic one. This underscores the dual-coding caveat of images being processed in the affective domain of cognition. This description is very brief and encapsulates hundreds of individual elements into gross generalizations. It is necessary to exclude lesser elements if we are to focus on the dominant ones. It is interesting to note that even this brief description alludes to the contextual elements of the "theme" and the "pool of blood" and is obviously formed as a caption intended to be read while viewing the painting. This underscores the difficulty in separating contextual elements from structural ones.

There are hundreds of individual structural elements including at least 22 people, 16 faces, all of their wearing apparel, weapons, the foreground elements, the city in the background and the dark night sky. That we can attribute meaning to the shapes and brush strokes alerts us to the fact that we are using contextual descriptions of structural elements. If we use entirely contextual descriptions Goya's painting becomes much simpler. We could describe it as an image consisting of three groups of people in front of a wall with a city in the background. Another description of the same work is presented to illustrate almost an entirely contextual description.

"Here the blazing color, broad fluid brush work, and dramatic nocturnal light are more emphatically Neo-Baroque than ever. The picture has all the emotional intensity of religious art, but these martyrs are dying for Liberty, not the kingdom of Heaven; and their executioners are not the agents of Satan but of political tyranny -- a formation of faceless automatons, impervious to their victims' despair and defiance. " (Janson, 1965, pg. 479)

It would seem to this researcher that Goya would support both this description and Gardner's earlier one because he chose structural elements and manipulated them in such a way that the latter contextual description would be perceived. He chose to manipulate the structural elements of grays and browns so that the red pool of blood would be dominant. The two overlapping triangular shapes point like arrows toward the two dominant groups of people. We view the scene just an instant before the pure, bright white shirt of the rebel is to become crimson from the lines of the rifles pointing directly at it. In both examples stated above, the structural features of the image are tied directly to the contextual features. The artist has selected and arranged structural elements in such a masterful way that the desired context is effectively communicated to the observer. Another way to state this is that there is congruence between the structural dominance of the image and the contextual dominance that the artist intended to communicate.

Based on the findings in the literature and translated into the terms of dominance, context, and congruency, the following hypotheses were adopted for this study, and the methodology was designed to test them.

H1.: Congruency between structurally dominant elements and contextually dominant elements will result in greater attention to these elements than in less-congruent situations.
H2.: As the complexity of an image increases, the structural dominance will decrease.
H3.: As structural dominance decreases, attention will be directed more to contextual elements.

http://silver.ucs.indiana.edu/~appelman/D_ONE.html