Viseme

From Wikipedia, the free encyclopedia

This is an old revision of this page, as edited by AnomieBOT (talk | contribs) at 11:13, 2 January 2023 (Dating maintenance tags: {{No footnotes}}). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

A viseme is any of several speech sounds that look the same on the face, for example when lip reading (Fisher 1968).

Visemes and phonemes do not share a one-to-one correspondence. Several phonemes often correspond to a single viseme because they look the same on the face when produced: /k, ɡ, ŋ/ (viseme /k/), /t͡ʃ, ʃ, d͡ʒ, ʒ/ (viseme /ch/), /t, d, n, l/ (viseme /t/), and /p, b, m/ (viseme /p/). Words such as pet, bell, and men are therefore difficult for lip-readers to distinguish, as all look like /pet/. There may, however, be differences in timing and duration during actual speech, in the visual "signature" of a given gesture, that cannot be captured in a single photograph.

Conversely, some sounds that are hard to distinguish acoustically are clearly distinguished by the face (Chen 2001). For example, English /l/ and /r/ can be acoustically quite similar (especially in clusters, as in grass vs. glass), yet visual information shows a clear contrast; this is demonstrated by words being misheard more often on the telephone than in person. Some linguists have argued that speech is best understood as bimodal (aural and visual), and that comprehension can be compromised if either of these two channels is absent (McGurk and MacDonald 1976).
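The many-to-one mapping described above can be sketched as a simple lookup table. This is a minimal illustration using the consonant groupings listed in this article; the viseme labels, the ARPAbet-style phoneme spellings, and the single vowel entry are illustrative assumptions, not a standard viseme inventory.

```python
# Illustrative many-to-one phoneme-to-viseme table based on the
# groupings above. Labels and vowel entry are assumptions, not a
# standard inventory.
PHONEME_TO_VISEME = {
    # bilabials: lips close fully
    "p": "p", "b": "p", "m": "p",
    # alveolars: tongue-tip gestures, largely hidden from view
    "t": "t", "d": "t", "n": "t", "l": "t",
    # velars: articulated at the back of the mouth
    "k": "k", "g": "k", "ng": "k",
    # postalveolars: lips protrude
    "ch": "ch", "sh": "ch", "jh": "ch", "zh": "ch",
    # one illustrative vowel entry for the example words below
    "eh": "eh",
}

def to_visemes(phonemes):
    """Map a phoneme sequence to its viseme sequence."""
    return [PHONEME_TO_VISEME[p] for p in phonemes]

# "pet", "bell", and "men" collapse to the same viseme sequence,
# which is why they are hard to tell apart by lip reading alone.
pet  = to_visemes(["p", "eh", "t"])
bell = to_visemes(["b", "eh", "l"])
men  = to_visemes(["m", "eh", "n"])
```

Here all three words yield the viseme sequence /p eh t/, mirroring the observation that they all "look like" /pet/ to a lip-reader.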

Visemes can often be humorous, as in the phrase "elephant juice", which when lip-read appears identical to "I love you".

Applications for the study of visemes include speech processing, speech recognition, and computer facial animation.

References

  • Chen, T. and Rao, R. R. (1998). "Audio-visual integration in multi-modal communication". Proceedings of the IEEE 86, 837–852. doi:10.1109/5.664274.
  • Chen, T. (2001). "Audiovisual speech processing". IEEE Signal Processing Magazine 18, 9–21. doi:10.1109/79.911195.
  • Fisher, C. G. (1968). "Confusions among visually perceived consonants". Journal of Speech and Hearing Research 11(4), 796–804. doi:10.1044/jshr.1104.796.
  • McGurk, H. and MacDonald, J. (1976). "Hearing lips and seeing voices". Nature 264, 746–748. doi:10.1038/264746a0.
  • Lucey, P., Martin, T. and Sridharan, S. (2004). "Confusability of Phonemes Grouped According to their Viseme Classes in Noisy Environments". Tenth Australian International Conference on Speech Science & Technology, Macquarie University, Sydney, 8–10 December 2004.