Video Captioning & Accessibility

A 2024 U.S. Department of Justice rule under Title II of the Americans with Disabilities Act (ADA) sets specific requirements for making web content and mobile applications (apps) accessible to people with disabilities. This includes video.

Please review the points on this page closely to ensure that any video you upload to Kaltura/MediaSpace complies with WCAG 2.1 Level AA, the standard adopted under the ADA Title II rule. Any third-party video embedded on your site should also be accessible (hotlinked videos that aren’t produced by the U are an exception). This is your responsibility. Any video that is not compliant will be removed from the site so that the site achieves compliance by April 24, 2026.

Below is an example of a video that meets accessibility standards as outlined on this page.


Captioning Requirements by Content Type

Pre-recorded video
- Use closed captions (don’t burn in)
- Captions must be accurate, synced, and complete
- Include speaker IDs, sound effects, and music
- Provide audio descriptions for visuals essential to understanding
Live video (events, meetings)
- Provide real-time live captions
Audio-only content (podcasts)
- Provide a full transcript
- Transcript must be web-accessible (not PDF-only)

Caption Quality Standards

Accuracy
- Target 99%+ accuracy
- Caption verbatim (no paraphrasing)
- Use correct spelling, punctuation, and grammar
- Include nonverbal cues: [laughing], [shouting], [singing]
Completeness
- Include all meaningful speech, sound effects, and music
- Caption background sounds only when relevant
Equal Representation
- Preserve meaning, tone, emotion, accents, and profanity
- Do not omit, censor, or summarize
Consistency
- Use the same captioning style throughout
- Be consistent with speaker labels and sound formatting

Caption Formatting Rules

Use sans-serif fonts (Arial, Helvetica, Verdana)
Ensure high contrast (e.g., white text on dark background)
Place captions:
- Bottom of screen by default
- Move if covering important visuals
Max 2 lines per caption, 32–37 characters per line
Break lines at natural grammatical boundaries
Do not split:
- Names or titles
- Modifiers from the nouns they modify
- Phrases (break between phrases, not in the middle of one)
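If you prepare caption files yourself, the numeric limits above can be checked automatically. The Python sketch below is illustrative only and is not a Kaltura or MediaSpace feature: `wrap_caption` and `check_caption` are hypothetical helper names, and wrapping at 32 characters (the low end of the 32–37 range) is an assumption. It enforces only the counts; greedy wrapping cannot tell whether a break separates a name or splits a modifier from its noun, so every break still needs a human review.

```python
import textwrap

MAX_LINES = 2    # "Max 2 lines per caption"
MAX_CHARS = 37   # upper end of the 32–37 characters-per-line range
WRAP_WIDTH = 32  # assumption: wrap to the low end of the range


def wrap_caption(text: str, width: int = WRAP_WIDTH) -> list[str]:
    """Greedily wrap caption text at spaces; never breaks inside a word."""
    return textwrap.wrap(text, width=width,
                         break_long_words=False, break_on_hyphens=False)


def check_caption(lines: list[str], max_lines: int = MAX_LINES,
                  max_chars: int = MAX_CHARS) -> list[str]:
    """Return problems with one caption's lines (an empty list means it passes)."""
    problems = []
    if len(lines) > max_lines:
        problems.append(f"{len(lines)} lines (max {max_lines})")
    for number, line in enumerate(lines, start=1):
        if len(line) > max_chars:
            problems.append(f"line {number} has {len(line)} characters (max {max_chars})")
    return problems


lines = wrap_caption("Welcome to the spring orientation for new graduate students.")
print(lines)                 # ['Welcome to the spring', 'orientation for new graduate', 'students.']
print(check_caption(lines))  # ['3 lines (max 2)']
```

In this example the sentence needs three lines, so it has to be split across two captions. The first break also separates "spring" from "orientation", which is exactly the kind of modifier/noun split the rules above ask you to avoid.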

Timing & Readability

Captions must be synchronized with the speech
Keep each caption on screen for 1–5 seconds
Maximum reading speed: 160–180 words per minute
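The timing rules can be checked the same way. This hypothetical Python sketch assumes each caption (cue) is represented by its start and end times in seconds plus its text. It flags cues shown for less than 1 second or more than 5 seconds, and cues whose text would have to be read faster than 180 words per minute.

```python
from dataclasses import dataclass

MIN_SECONDS = 1.0  # "Keep on screen 1–5 seconds"
MAX_SECONDS = 5.0
MAX_WPM = 180      # upper end of the 160–180 words-per-minute guideline


@dataclass
class Cue:
    start: float  # seconds from the start of the video
    end: float    # seconds from the start of the video
    text: str


def timing_problems(cue: Cue, max_wpm: float = MAX_WPM) -> list[str]:
    """Return timing problems for one cue (an empty list means it passes)."""
    duration = cue.end - cue.start
    if duration <= 0:
        return ["end time is not after start time"]

    problems = []
    if duration < MIN_SECONDS:
        problems.append(f"on screen {duration:.2f}s (min {MIN_SECONDS:g}s)")
    if duration > MAX_SECONDS:
        problems.append(f"on screen {duration:.2f}s (max {MAX_SECONDS:g}s)")

    # Reading speed: words in the cue divided by minutes on screen.
    wpm = len(cue.text.split()) / (duration / 60)
    if wpm > max_wpm:
        problems.append(f"{wpm:.0f} wpm (max {max_wpm:g})")
    return problems


print(timing_problems(Cue(0.0, 2.0, "one two three four five six seven eight nine")))
# ['270 wpm (max 180)']
```

Words per minute measured on a single short cue is only a rough signal. Averaging over a longer passage gives a steadier picture of whether the captions are too fast overall.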

Sound, Music & Nonverbal Audio

Caption sound effects and music only when they are essential to understanding the program
Sound effects may be omitted if the source of the sound is clearly visible onscreen
Tone or manner of speech
- [sarcastically], [nervously], [in a British accent]
Sound Effects
- Use brackets and present tense for onscreen sounds: [door slams], [dog barks]
- Italicize offscreen background sounds when possible: [pig squealing]
- Be specific and objective: [robin singing] (not [bird singing])
- Place sound descriptions close to the sound’s source on screen
Music
- Caption instrumental/background music only if meaningful
- Do not caption music shorter than 5 seconds
- Italicize offscreen background music when possible
- Describe mood objectively: [upbeat piano music]
- Include performer/title when possible: [Aretha Franklin singing “Respect”]
- Caption lyrics verbatim using ♪ if supported: ♪ I’m pickin’ up good vibrations ♪♪

Speakers On Screen or Covered with Video

Speaker Identification
- Don’t name speakers until they have been introduced (in the audio, or by onscreen text or a graphic)
- No ID is needed if the speaker is visible onscreen
- Use parentheses, not brackets
- For offscreen speakers, put the name before the caption:
  - If name is known: (Jack)
  - If unknown: (speaker #1)
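The speaker-ID rules reduce to a small decision. This Python sketch is illustrative only: `label_caption` and its parameters are hypothetical, and placing the label on the same line as the caption text is an assumption.

```python
from typing import Optional


def label_caption(text: str, speaker_visible: bool, name: Optional[str] = None,
                  introduced: bool = False, speaker_number: int = 1) -> str:
    """Add a speaker ID to caption text following the rules above (a sketch)."""
    if speaker_visible:
        return text  # no ID needed when the speaker is visible onscreen
    if name and introduced:
        label = name  # known and already introduced: use the name
    else:
        label = f"speaker #{speaker_number}"  # not yet introduced: generic ID
    return f"({label}) {text}"  # parentheses, not brackets; the label comes first


print(label_caption("Can everyone hear me?", speaker_visible=False, name="Jack", introduced=True))
# (Jack) Can everyone hear me?
print(label_caption("Can everyone hear me?", speaker_visible=False, name="Jack"))
# (speaker #1) Can everyone hear me?
```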

Definitions

Closed captions (CC)

What it is: Text on screen that represents the audio in a video.
Includes: Spoken dialogue and relevant non-speech audio (e.g., [door slams], [music], [laughter]), plus speaker IDs when needed.
Primary audience/use: People who are Deaf or hard of hearing, or anyone viewing without sound.
Typical format: Time-synced captions that appear as the video plays.
Key point: Captions describe what you hear.

Audio description (AD) for visual-only (or highly visual) content

What it is: A narrated track (or additional narration) that describes important visual information that isn’t otherwise conveyed through audio.
Includes: Actions, scene changes, on-screen text, visual context, facial expressions, key visual cues.
Primary audience/use: People who are blind or have low vision.
Typical format: Spoken descriptions inserted during natural pauses in dialogue or audio (or provided as a separate described version).
Key point: Audio description describes what you see.

Where a user experiences audio description

1) As an alternate audio track in the player (most common for streaming/web video)

The video has two audio options:

Standard audio
Audio-described audio (the same audio, plus a narrator describing key visuals during pauses)

The viewer turns it on in the player’s Audio/Language menu (often labeled “English (Audio Description)” or similar).

2) As a separate “described video” version

Some sites provide a second video file (or link) that already includes the description narration mixed in. Users click the described version instead of toggling an audio track.

3) Broadcast/cable-style “secondary audio” (SAP)

On TVs/cable, audio description is often delivered as a Secondary Audio Program (SAP) track you enable in the TV/box audio settings.

4) As a “descriptions” track intended for speech output (HTML/video tech)

On the web, there’s also a standards-based approach: an HTML <track> element with kind="descriptions" (typically pointing to a WebVTT file), whose cues a user agent can present non-visually, for example through synthesized speech. Support varies by platform and player.
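As an illustration of that approach, the Python sketch below writes description cues into a WebVTT file that a page could reference from a `<track kind="descriptions">` element. The function names and sample cues are hypothetical, and the sketch covers only basic cue syntax (a `WEBVTT` header, then `start --> end` timing lines followed by text). Whether the descriptions are actually voiced still depends on the player.

```python
def vtt_timestamp(seconds: float) -> str:
    """Format a time in seconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def build_descriptions_vtt(cues: list[tuple[float, float, str]]) -> str:
    """Build a WebVTT file body from (start_seconds, end_seconds, description) cues."""
    parts = ["WEBVTT", ""]  # required header line, then a blank line
    for start, end, description in cues:
        parts.append(f"{vtt_timestamp(start)} --> {vtt_timestamp(end)}")
        parts.append(description)
        parts.append("")  # a blank line ends each cue
    return "\n".join(parts)


# Hypothetical cues, timed to natural pauses in the narration.
vtt = build_descriptions_vtt([
    (12.0, 15.5, "A bar chart shows enrollment rising each year."),
    (41.0, 44.0, "On screen: the Register button in the top-right corner."),
])
print(vtt)
```

A page would then point a `<track kind="descriptions" src="descriptions.vtt" srclang="en">` element at the saved file (the file name here is hypothetical). Keep each description short enough to fit the pause it is timed to.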

Transcript

What it is: A text document that presents the content of media in reading form.
Types:
- Audio-only transcript: Dialogue and meaningful audio (similar content to captions, but not necessarily time-synced).
- Video transcript (best practice): Dialogue + meaningful audio plus relevant visual information (i.e., it can include the same kinds of details as audio description, but in text form).
Primary audience/use: Useful for many people—screen reader users, people who prefer reading, searching/copying content, low bandwidth situations, and as a fallback alternative.
Typical format: Plain text or structured text; may include timestamps but often does not.
Key point: A transcript is a text alternative; it can cover audio, visuals, or both depending on how it’s written.
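Because captions already contain the dialogue and meaningful audio, a caption file is a reasonable starting point for an audio transcript. This Python sketch assumes a simple WebVTT file and uses a hypothetical `vtt_to_transcript` helper. It drops the header, notes, cue identifiers, and timestamps and keeps the cue text, including speaker IDs and sound tags.

```python
def vtt_to_transcript(vtt_text: str) -> str:
    """Turn simple WebVTT captions into plain transcript text (no timestamps)."""
    kept = []
    blocks = vtt_text.replace("\r\n", "\n").strip().split("\n\n")
    for block in blocks:
        lines = block.split("\n")
        if lines[0].startswith(("WEBVTT", "NOTE", "STYLE", "REGION")):
            continue  # file header and non-cue blocks carry no spoken content
        # An optional cue identifier may precede the timing line ("start --> end");
        # everything after the timing line is caption text.
        for i, line in enumerate(lines):
            if "-->" in line:
                kept.extend(t.strip() for t in lines[i + 1:] if t.strip())
                break
    return " ".join(kept)


captions = (
    "WEBVTT\n\n"
    "1\n00:00:01.000 --> 00:00:03.000\n(Jack) Welcome, everyone.\n\n"
    "00:00:03.500 --> 00:00:05.000\n[door slams]\nLet's begin.\n"
)
print(vtt_to_transcript(captions))
# (Jack) Welcome, everyone. [door slams] Let's begin.
```

The result is a single run-on paragraph. Break it into readable paragraphs, and for a video transcript add the relevant visual information by hand.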

Practical rule of thumb

If it’s a typical prerecorded video (training, marketing, demos, webinars) and important information is shown visually (on-screen text, charts, actions, UI steps) but not fully spoken:

Captions are required, audio description is required for Level AA (WCAG Success Criterion 1.2.5), and a transcript is optional.

If it’s a talking-head video where everything important is already said out loud:

Captions are required.
Audio description for prerecorded video (Success Criterion 1.2.5) is required only if visual information that isn’t conveyed in the audio is necessary to understand the content.
So, in this case, audio description is conditional, not automatic.
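The requirements summarized on this page can be expressed as a small decision function. This Python sketch is a simplification: the three content-type labels are hypothetical, it covers only what this page lists, and deciding whether essential visual information goes unspoken still takes human judgment. Optional extras such as a transcript for prerecorded video are not included in the result.

```python
def required_alternatives(content_type: str,
                          essential_visuals_not_spoken: bool = False) -> set[str]:
    """Return the alternatives this page requires for one piece of media."""
    if content_type == "audio-only":
        return {"transcript"}          # podcasts and other audio-only content
    if content_type == "live":
        return {"live captions"}       # events and meetings
    if content_type == "prerecorded":
        needs = {"captions"}           # always required for prerecorded video
        if essential_visuals_not_spoken:
            needs.add("audio description")  # WCAG 2.1 SC 1.2.5 (Level AA)
        return needs
    raise ValueError(f"unknown content type: {content_type!r}")


print(sorted(required_alternatives("prerecorded", essential_visuals_not_spoken=True)))
# ['audio description', 'captions']
```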