Perceiving visually presented objects: recognition, awareness, and modularity
Anne M Treisman* and Nancy G Kanwisher†
Object perception may involve seeing, recognition,
preparation of actions, and emotional responses: functions
that human brain imaging and neuropsychology suggest are
localized separately. Perhaps because of this specialization,
object perception is remarkably rapid and efficient.
Representations of componential structure and interpolation
from view-dependent images both play a part in object
recognition. Unattended objects may be implicitly registered,
but recent experiments suggest that attention is required to
bind features, to represent three-dimensional structure, and to
mediate awareness.
Addresses
*Department of Psychology, Princeton University, Princeton, New Jersey 08544-1010, USA; e-mail: treisman@phoenix.princeton.edu
†Department of Brain and Cognitive Sciences, E10-243, Massachusetts Institute of Technology, Cambridge, Massachusetts 02138, USA; e-mail: ngk@psyche.mit.edu
Current Opinion in Neurobiology 1998, 8:218-226
http://biomednet.com/elecref/0959438800800218
© Current Biology Ltd ISSN 0959-4388
Abbreviations
ERP   event-related potential
fMRI  functional magnetic resonance imaging
IT    inferotemporal cortex
Introduction
It is usually assumed that perception is mediated by
specific patterns of neural activity that encode a selective
description of what is seen, distinguishing it from other
similar sights. When we perceive an object, we may form
multiple representations, each specialized for a different
purpose and therefore selecting different properties to
encode at different levels of detail. There is empirical
evidence supporting the existence of six different types
of object representation. First, representation as an ‘object
token’: a conscious, viewpoint-dependent representation
of the object as currently seen. Second, as a ‘structural
description’: a non-visually-conscious, object-centered
representation from which the object’s appearance from other
angles and distances can be predicted. Third, as an
‘object type’: a recognition of the object’s identity (e.g. a
banana) or membership in one or more stored categories.
Fourth, a representation based on further knowledge
associated with the category (such as the fact that the
banana can be peeled and what it will taste like). Fifth, a
representation that includes a specification of its emotional
and motivational significance to the observer. Sixth, an
‘action-centered description’, specifying its “affordances”
[1], that is, the properties we need in order to program
appropriate motor responses to it, such as its location,
size and shape relative to our hands. These different
representations are probably formed in an interactive
fashion, with prior knowledge facilitating the extraction of
likely features and structure, and vice versa.
Evidence suggests that the first four types of encoding
depend primarily on the ventral (occipitotemporal) path-
way, the fifth on connections to the amygdala, and the
sixth on the dorsal (occipitoparietal) pathway; however,
object tokens have also been equated with action-centered
descriptions [2]. Dorsal representations appear to be
distinct from those that mediate conscious perception;
for example, grasping is unaffected by the Titchener
size illusion [3]. Emotional responses can also be evoked
without conscious recognition (e.g. see [4••]). Object
recognition models differ over whether the type or identity
of objects is accessed from the view-dependent token or
from a structural description; in some cases, it may also be
accessed directly from simpler features.
The goal of perception is to account for systematic
patterning of the retinal image, attributing features to their
real world sources in objects and in the current viewing
conditions. In order to achieve these representations,
multiple sources of information are used, such as color,
luminance, texture, relative size, dynamic cues from mo-
tion and transformations, and stereo depth; however, the
most important is typically shape. Many challenges arise in
solving the inverse problem of retrieving the likely source
of the retinal image: information about object boundaries
is often incomplete and noisy; three-dimensional
objects are seen from multiple views, producing different
two-dimensional projections on the retina; and objects in
normal scenes are often partially occluded. The visual
system has developed many heuristics for solving these
problems. Continuity is assumed rather than random varia-
tion. Regularities in the image are attributed to regularities
in the real world rather than to accidental coincidences.
Different types of objects and different levels of specificity
require diverse discriminations, making it likely that
specialized modules have evolved, or developed through
learning, to cope with the particular demands of tasks
such as face recognition, reading, finding our way through
places, manipulating tools, and identifying animals, plants,
minerals and artifacts.
Research on object perception over the past year has made
progress on a number of issues. Here, we will discuss
recent advances in our understanding of the speed of
object recognition, object types and tokens, and attention
and awareness in object recognition. In addition, we will
review evidence for cortical specializations for particular
components of visual recognition.
The speed of object recognition
Evolutionary pressures have given high priority to speed
of visual recognition, and there is both psychological and
neuroscientific evidence that objects are discriminated
within one or two hundred milliseconds. Behavioral
studies have demonstrated that we can recognize eight
or more objects per second, provided they are
presented sequentially at fixation, making eye movements
unnecessary [5]. Although rate measurements cannot tell
us the absolute amount of time necessary for an individual
object to be recognized, physiological recordings reveal
the latency at which two stimulus classes begin to
be distinguished. Thorpe et al. [6••] have demonstrated significant differences in event-related brain potential
(ERP) waveforms for viewing scenes containing animals
versus scenes not containing animals at 150 ms after stim-
ulus onset. Several other groups [7,8•,9-11] have found
face-specific ERPs and magnetoencephalography (MEG)
waveforms with latencies of 155-190 ms. DiGirolamo and
Kanwisher (G DiGirolamo, NG Kanwisher, abstract in
Psychonom Soc 1995, 305) found ERP differences for line
drawings of familiar versus unfamiliar three-dimensional
objects at 170 ms (see also [5]).
Parallel results were found in the stimulus selectivity
of early responses of cells in inferotemporal (IT) cortex
in macaques, initiated at latencies of 80-100 ms. On
the basis that IT cells are selective for particular faces
even in the first 50 ms of their response, Wallis and
Rolls [12] conclude that “visual recognition can occur
with largely feed-forward processing”. The duration of
responses by these face-selective cells was reduced from
250 ms to 25 ms by a backward mask appearing 20 ms
after the onset of the face, a stimulus onset asynchrony
at which human observers can still just recognize the
face. The data suggest that “a cortical area can perform
the computation necessary for the recognition of a visual
stimulus in 20-30 ms”. Thus, a consensus is developing
that the critical processes involved in object recognition
are remarkably fast, occurring within 100-200 ms of
stimulus presentation. However, it may take another
100 ms for subsequent processes to bring this information
into awareness.
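The distinction between recognition rate and recognition latency is worth making concrete. The toy simulation below is our own illustration, with arbitrary stage durations: it shows how a pipelined system could complete one object every 125 ms (eight per second) even though each individual object needs roughly 150 ms from onset to recognition.

```python
# Toy simulation (our own illustration; stage durations are arbitrary):
# a pipelined recognizer can complete one object every 125 ms (8 per
# second) even though each object needs ~150 ms from onset to recognition,
# because successive objects occupy different stages at the same time.

SOA_MS = 125                 # one new object every 125 ms = 8 per second
STAGES_MS = [50, 50, 50]     # assumed durations of three sequential stages

def recognition_times(n_objects):
    """Return (onset, completion) times in ms for each object."""
    stage_free = [0.0] * len(STAGES_MS)   # when each stage next becomes free
    times = []
    for i in range(n_objects):
        onset = i * SOA_MS                # stimulus onset
        t = onset
        for s, dur in enumerate(STAGES_MS):
            t = max(t, stage_free[s]) + dur
            stage_free[s] = t             # stage busy until this object leaves
        times.append((onset, t))
    return times

for onset, done in recognition_times(8):
    print(f"onset {onset:4.0f} ms -> recognized at {done:4.0f} ms "
          f"(latency {done - onset:.0f} ms)")
# Every object is recognized 150 ms after its own onset, yet one object
# completes every 125 ms: rate alone cannot reveal absolute latency.
```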
Object tokens
How then does the visual system solve the problems of
object perception with such impressive speed and accu-
racy? A first stage must be a preliminary segregation of the
sensory data that form separate candidate objects. Even
at this early level, familiarity can override bottom-up cues
such as common region and connectedness, supporting
an interactive cascade process in which “partial results of
the segmentation process are sent to higher level object
representations”, which, in turn, guide the segmentation
process [13•].
Kahneman, Treisman, and Gibbs [14] have proposed
that conscious seeing is mediated by episodic ‘object
files’ within which the object tokens defined earlier
are constructed. Information about particular instances
currently being viewed is selected from the sensory
array, accumulates over time, and is ‘bound’ together in
structured relations. Evidence for this claim came partly
from the observation of ‘object-specific’ priming, that
is, priming that occurs only, or more strongly, when the
prime and probe are seen as a single object. This occurs
even when they appear in different locations, if the
object is seen in real or apparent motion between the
two. Object-specific priming occurs between pictures and
names when these are perceptually linked through the
frames in which they appear (RD Gordon, DE Irwin,
personal communication), suggesting that object files
accumulate information not only about sensory features
but also about more abstract identities. However, priming
between synonyms or semantic associates is not object
specific [15], that is, it occurs equally whether they
are presented in the same perceptual object or in
different objects. It appears that object files integrate
object representations with their names, but maintain
a distinct identity from other semantically associated
objects. Priming at this level would be between object
types rather than tokens. Irwin [16] has reviewed evidence on transsaccadic integration, suggesting that it is limited to
about four object files.
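Purely as an illustration, the sketch below casts the object-file proposal as a data structure that accumulates bound features and, later, more abstract identity information for a single token; the class name, fields, and the banana example are our own, not part of the cited work.

```python
# Illustrative data structure only: an 'object file' as an episodic record
# that accumulates bound features and, later, identity information for one
# token. All names and fields here are our own invention.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ObjectFile:
    location: tuple                                   # current address
    features: dict = field(default_factory=dict)      # bound sensory features
    identity: Optional[str] = None                    # name/type, added later

    def update(self, new_location, **new_features):
        # the same file persists across motion and change: priming follows
        # the token, not the retinal position
        self.location = new_location
        self.features.update(new_features)

token = ObjectFile(location=(10, 20))
token.update((15, 20), color="yellow", shape="elongated")  # features bound
token.identity = "banana"        # picture-name link within the same file
print(token)
```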
A similar distinction between tokens and types has
emerged from the study of repetition blindness, a failure
to see a second token of the same type, which was
attributed to refractoriness in attaching a new token to
a recently instantiated type [17]. Recent research has
further explored this idea. One role of object tokens is
to maintain spatiotemporal continuity of objects across
motion and change. Chun and Cavanagh [18••] confirmed
that repetition blindness is greater when repeated items
are seen to occur within the same apparent motion
sequence and hence are integrated as the same perceived
object. They suggest that perception is biased to minimize
the number of different tokens formed to account for the
sensory data. Objects that appear successively are linked
whenever the spatial and temporal separations make
this physically plausible. This generally gives veridical
perception because in the real world, objects seldom
appear from nowhere or suddenly vanish. Arnell and
Jolicoeur [19] have demonstrated repetition blindness for
novel objects for which no pre-existing representations
existed. According to Kanwisher’s account [17], this
implies that a single presentation is sufficient to establish
an object type to which new tokens will be matched.
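This token-minimizing bias can be caricatured as a greedy correspondence rule. The sketch below is our own toy version, with an arbitrary plausibility threshold: each new appearance is linked to an existing token when the implied speed is physically plausible, and otherwise opens a new token.

```python
# Greedy toy version (our caricature; the speed threshold is arbitrary) of
# the bias to minimize tokens: link a new appearance to an existing token
# whenever the spatial and temporal gaps imply a physically plausible
# speed, otherwise posit a new object.

def assign_tokens(events, max_speed=0.3):
    """events: (time_ms, position) appearances in temporal order.
    Returns one token id per event."""
    tokens = []                      # (last_time, last_pos) per token
    labels = []
    for t, x in events:
        plausible = [i for i, (lt, lx) in enumerate(tokens)
                     if t > lt and abs(x - lx) <= max_speed * (t - lt)]
        if plausible:                # reuse the nearest plausible token
            i = min(plausible, key=lambda i: abs(x - tokens[i][1]))
            tokens[i] = (t, x)
        else:                        # nothing could have moved here in time
            tokens.append((t, x))
            i = len(tokens) - 1
        labels.append(i)
    return labels

print(assign_tokens([(0, 0), (100, 30)]))  # [0, 0]: linked as one moving object
print(assign_tokens([(0, 0), (100, 40)]))  # [0, 1]: too far, two objects
```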
The ‘attentional blink’ [20] describes a failure to de-
tect the second of two different targets when it is
presented soon after the first. Chun [21•] sees both
repetition blindness and the attentional blink as failures
of tokenization, although for different reasons, because
they can be dissociated experimentally. Attentional blinks
(reduced by target-distractor discriminability) reflect a [...]
(V Di Lollo, JT Enns, personal communication). The account proposed
is that awareness depends on a match between re-entrant
information and the current sensory input at early
visual levels. A mismatch erases the initial tentative
representation. “It is as though the visual system treats the
trailing configuration as a transformation or replacement
of the earlier one.” Conversely, repetition blindness for
locations (R Epstein, NG Kanwisher, abstract in Psychonom
Soc 1996, 593) may result when the representation of an
earlier-presented letter prevents the stable encoding of
a subsequently presented letter appearing at the same
location.
Attention and awareness in object perception
Attention seems, then, to be necessary for object tokens
to mediate awareness. However, there is evidence (see
[24•]) that objects can be identified without attention
and awareness. If this is so, do the representations differ
from those formed with attention? Activation (shown
by brain-imaging) in specialized regions of cortex for
processing faces [26] and visual motion [27] is reduced
when subjects direct attention away from the faces or
moving objects (respectively), even when eye movements
are controlled to guarantee identical retinal stimulation
(see also [28]), consistent with the effects of attention
on single units in macaque visual cortex. Unattended
objects are seldom reportable. However, priming studies
suggest that their shapes can be implicitly registered
[29,30••], although there are clear limits to the number of
unattended objects that will prime [31]. Representations
formed without attention may differ from those that
receive attention: they appear to be viewpoint-dependent
[32•], two-dimensional, with no interpretation of occlusion
or amodal completion [30••]. On the other hand, in
clinical neglect, the ‘invisible’ representations formed in
a patient’s neglected field include illusory contours and
filled-in surfaces [33•], suggesting that neglect arises at
stages of processing beyond those that are suppressed in
normal selective attention. With more extreme inattention,
little explicit information is available beyond simple
features such as location, color, size, and gross numerosity;
even these simple features may not be available, produc-
ing ‘inattentional blindness’ [34•]. Again, however, some
implicit information is registered: unseen words may prime
word fragment completion, and there is clear selectivity
for emotionally important objects such as the person’s own
name and happy (but not sad) faces.
Binding of features to objects is often inaccurate unless
attention is focused on the relevant locations [35].