Abstract
Multisensory integration has often been characterized as an automatic process. Recent findings indicate that multisensory integration can occur across various stages of stimulus processing that are linked to, and can be modulated by, attention. Stimulus-driven, bottom-up mechanisms induced by crossmodal interactions can automatically capture attention towards multisensory events, particularly when competition to focus elsewhere is relatively low. Conversely, top-down attention can facilitate the integration of multisensory inputs and lead to a spread of attention across sensory modalities. These findings point to a more intimate and multifaceted interplay between attention and multisensory integration than was previously thought. We review developments in the current understanding of the interactions between attention and multisensory processing, and propose a framework that unifies previous, apparently discordant, findings.
Attention involves mechanisms whereby processing resources are preferentially allocated toward particular locations, features or objects.
Attentional orienting refers to the process responsible for moving the focus of attention from one location, feature or object to another. Orienting can occur covertly, that is, in the absence of movements of the eyes or other sensory receptor surfaces (e.g. ears), as well as overtly, where the shift is accompanied by a reorienting of the sensory receptors (e.g. by a head turn) to the newly attended location or object.
Bottom-up selection is a form of stimulus-driven (see “Stimulus driven”) selection that is mainly determined by the ability of sensory events in the environment to summon processing resources. This type of selection is invoked relatively independently of voluntary control; rather, stimulus salience (see “Salience”) is the driving factor. Particularly salient stimuli (e.g. sudden motion in an otherwise still visual scene or loud sounds in an otherwise quiet room), or other stimuli for which an individual has a low detection threshold (e.g. one's own name), attract attention in a bottom-up fashion.
The McGurk illusion is an audiovisual illusion in which an auditory phoneme (e.g. /b/) dubbed onto incongruent visual lip movements (e.g. /g/) tends to lead to illusory auditory percepts that are typically intermediate between the actual visual and auditory inputs (i.e. /d/), are completely dominated by the visual input (i.e. /g/), or are a combination of the two (i.e. /bg/). The McGurk effect occurs in the context of isolated syllables, words or even whole sentences.
The modality appropriateness hypothesis is a proposed principle of multisensory integration based on the fact that some stimulus characteristics are processed more accurately in one sensory modality than in another. For instance, vision in general has a higher spatial resolution than audition, whereas audition has a higher temporal resolution than vision. According to this framework, information from visual stimuli tends to dominate the perceived spatial characteristics of audiovisual events (sometimes causing a shift of the apparent location of an auditory stimulus toward the location of the visual event). Conversely, the perceived temporal characteristics of an audiovisual event tend to be dominated by those of the auditory component.
Pop out is a perceptual phenomenon whereby a stimulus with a particularly distinctive feature relative to its surrounding background triggers quick attentional orienting and leads to rapid detection. The term is often used to describe the fact that finding such particularly distinctive objects within a visual display is highly efficient and not affected by the number of distractor elements in the scene. High stimulus salience (see “Salience”) leads to pop out.
The process of extracting relevant information from an attended stimulus.
Retinotopic organization is the spatial organization of a group of neurons based on a topographical arrangement whereby their responses map stimulus locations in the retina in a more or less orderly fashion across a brain area. In a retinotopically organized brain area, neurons involved in processing adjacent parts of the visual field are also located adjacently. This organization is most clearly seen in early (i.e. lower-level) areas of the visual pathway, but many higher-order cortical areas involved in processing visual information also show rough retinotopic organization.
Salience refers to a characteristic of an object or event that makes it stand out from its context. Visual objects are said to be highly salient when they have a particularly distinctive feature with respect to the neighboring items and the background, or if they occur suddenly. A bright light spot within an otherwise empty, dark context is highly salient. Salient stimuli are more likely to capture attention (see “Bottom-up” and “Pop out”).
The sound-induced double-flash illusion is an audiovisual illusion in which a single flash of light, presented concurrently with a train of several (two or three) short tone pips, is perceived as two (or more) flashes. This phenomenon is an example of the tendency of auditory stimuli to dominate in the perception of the temporal characteristics of an audiovisual event.
Stimulus congruence is the match of one or more features across two stimuli, stimulus components or stimulus modalities. Congruence can be defined in terms of temporal characteristics, spatial characteristics or higher-level informational content (such as semantics). In audiovisual speech perception, congruency typically refers to the matching or mismatching of a sequence of auditory speech sounds with respect to lip movements being concurrently presented. Incongruence underlies some multisensory phenomena, such as the McGurk illusion and the ventriloquist effect (Box 1).
A process is stimulus driven if it is triggered or dominated by current sensory input; stimulus-driven mechanisms are a defining feature of bottom-up processing (see “Bottom-up”).
Top-down attention is a mode of attentional orienting whereby processing resources are allocated according to internal goals or states of the observer. The term is often used to refer to selective processing and attentional orienting directed in a voluntary fashion.
The ventriloquist illusion is an audiovisual illusion in which an auditory stimulus is perceived as occurring at or towards the location of a spatially disparate visual stimulus that occurs at the same time.