Does Granite Speech T&D Work IRL?
An appeal for discussion on real-world use of the Granite Speech 4.1 2b Plus Model.
100% Human-Written Report (on a robot-written website)
I was excited to see IBM’s Granite Speech 4.1 2b on the Hugging Face Open ASR Leaderboard. The Plus variant in particular offers an appealing alternative to the industry-standard WhisperX pipelines used for audio Transcription and Diarization (T&D). Appealing because it’s not a pipeline: WhisperX wraps Faster Whisper in Voice Activity Detection (VAD), phoneme alignment and pyannote.audio diarization to produce accurate word-level timestamps and speaker attribution, whereas Granite Speech Plus promises all of that natively. No need to wrap anything in anything, or wonder what obscure param in your pipeline is causing poor performance; ostensibly you should be able to put multi-speaker audio in and get timestamped, speaker-attributed transcriptions out. Enticing stuff.
I was encouraged in my enthusiasm by an array of blog posts, videos and tweets reviewing the model and pointing out the same benefits that I have. This wealth of positive, seemingly editorial content meant that when I tried out the model and found it performed poorly on my test set, I assumed I had made some trivial user error(s) and spent much longer than I had intended fiddling with it. Frustratingly, I wasn’t able to improve its performance, and I returned to the reviews with renewed skepticism, looking for specific advice. I couldn’t find any: no tuned params, implementation details, or crucially, any examples of it working on real audio clips.
The cynical explanation of this is that it is now so easy to produce editorial-seeming content that it encourages reviews which haven’t actually tested out the things they are recommending, and my poor results are an accurate reflection of the current model’s real world performance. I hope that isn’t the case, as I remain excited by the premise of this model, so I’m publishing my results here in the hope that they produce that magical phenomenon of instantly summoning a person to correct someone being Wrong On The Internet. If you have had good results with this model, please let me know how; I will be indebted to you for the help. And if you have struggled with it like me, then I hope you feel less alone seeing this.
Test Set Configuration
I used 9 conversations generated via Coval, one recorded by me and my long-suffering partner, and a bonus one made by just me. All except the last simulate phone calls between users and agents. The examples are a mixture of stereo and mono, all at 16 kHz. The task was to produce word-level timestamps and word-level speaker attributions. I wasn't worried about nice segmentation, or about punctuation (which the Granite Plus variant doesn't provide). I kept both models close to deterministic, with a temperature of 0 for WhisperX and argmax (greedy) token selection for Granite.
Operating Environment
The tests were run on a g5.xlarge AWS machine, with these NVIDIA specs:
nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.20             Driver Version: 570.133.20     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
|  0%   24C    P8              9W /  300W |       0MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
WhisperX Baseline
I ran a basic WhisperX pipeline, without VAD or alignment enabled, but with an initial prompt. Across the examples this gave the best baseline results without doing much tuning. For the stereo examples I ran two separate passes, one for each channel, and then merged them chronologically via their word-level timestamps. This allowed me to cheat on diarization, as each speaker has their own channel. For mono examples I used the default pyannote diarization with WhisperX (whisperx.DiarizationPipeline). I kept the native segmentation output by the model, which is arbitrary and needs post-processing to be useful.
The following is a stripped down version of my code, for reproducibility.
import io
import tempfile
from typing import Any


# This is run twice, once for each stereo channel, and the resultant segments
# are merged chronologically by timestamp for a unified transcription.
def diarize_and_transcribe_audio(
    audio_data: io.BytesIO,
) -> tuple[list[dict[str, Any]], str]:
    """
    Perform speaker diarization and transcription on audio data using WhisperX.

    Args:
        audio_data: BytesIO object containing WAV audio data.

    Returns:
        Raw segments with word-level timestamps, and the detected language.
    """
    model_name = "large-v3"
    model = load_whisperx_model(model_name)  # helper defined elsewhere (not shown)

    # WhisperX requires a file path, not BytesIO, so write the audio to a temporary file.
    audio_bytes = audio_data.read()
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp_file:
        tmp_file.write(audio_bytes)
        tmp_file.flush()
        audio_path = tmp_file.name

    initial_prompt = "This is a conversation transcript between a user and an agent."
    return _transcribe_audio_to_segments(
        model,
        audio_path,
        initial_prompt=initial_prompt,
    )
def _transcribe_audio_to_segments(
    model,
    audio_path: str,
    initial_prompt: str | None = None,
) -> tuple[list[dict[str, Any]], str]:
    """Transcribe audio into raw Whisper segments with word-level timestamps."""
    segments_generator, info = model.transcribe(
        audio_path,
        beam_size=5,  # Beam search width
        best_of=5,  # Consider 5 candidates for each beam
        temperature=0.0,  # Deterministic output (no randomness)
        repetition_penalty=1.1,
        condition_on_previous_text=False,
        initial_prompt=initial_prompt,
        word_timestamps=True,
    )
    raw_segments: list[dict[str, Any]] = []
    for segment in segments_generator:
        words = []
        for word in segment.words or []:
            # faster-whisper occasionally emits words without timestamps;
            # skip those since every downstream step needs start/end.
            if word.start is None or word.end is None:
                continue
            words.append(
                {
                    "word": word.word.strip(),
                    "start": float(word.start),
                    "end": float(word.end),
                }
            )
        raw_segments.append(
            {
                "start": float(segment.start),
                "end": float(segment.end),
                "text": segment.text.strip(),
                "words": words,
            }
        )
    return raw_segments, (info.language or "unknown")
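The chronological merge of the two per-channel passes isn't shown in the listing above. A minimal sketch of how it could look, assuming each pass returns segments in the shape produced by `_transcribe_audio_to_segments`; the function name `merge_channel_words` and the speaker labels are hypothetical, and this is an illustration rather than the exact code used for the results:

```python
from typing import Any


def merge_channel_words(
    segments_by_speaker: dict[str, list[dict[str, Any]]],
) -> list[dict[str, Any]]:
    """Flatten per-channel segments into one word list ordered by start time.

    Assumption: each stereo channel holds exactly one speaker, so the channel
    a word came from doubles as its speaker label.
    """
    merged: list[dict[str, Any]] = []
    for speaker, segments in segments_by_speaker.items():
        for segment in segments:
            for word in segment["words"]:
                # Copy the word so the per-channel results stay untouched.
                merged.append({**word, "speaker": speaker})
    # sorted() is stable, so words with identical timestamps keep channel order.
    return sorted(merged, key=lambda w: (w["start"], w["end"]))
```

Calling it as `merge_channel_words({"agent": left_segments, "user": right_segments})` gives a single word-level transcript; which channel maps to which speaker depends on how the call was recorded.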
Granite Test
For stereo examples, I downmixed to mono before inference. In this setup, Granite gets a single 16 kHz mono waveform rather than separate per-channel passes.
The illustrative baseline below is stripped down, but includes the chunking used in my backend experiment: longer audio is split into fixed windows, timestamp and speaker-attribution prompts are run per chunk, speakers are aligned to words inside each chunk, and then chunk-level timings are offset back into the full-conversation timeline. Notice that at 240s, chunks are longer than most of the examples, so chunk boundaries aren't the source of the timestamping inaccuracies. I kept the generation deterministic by using argmax token selection.
import torch
import torchaudio
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

MODEL = "ibm-granite/granite-speech-4.1-2b-plus"
SYSTEM_PROMPT = (
    "You are Granite, developed by IBM. You are a helpful AI assistant"
)
SAA_PROMPT = (
    "<|audio|> Speaker attribution: Transcribe and denote who is speaking by adding "
    "[Speaker 1]: and [Speaker 2]: tags before speaker turns."
)
TIMESTAMPS_PROMPT = (
    "<|audio|> Timestamps: Transcribe the speech. After each word, add a timestamp "
    "tag showing the end time from the begining of the audio clip in centiseconds, "
    "e.g. hello [T:45] world [T:82]"
)

device = "cuda" if torch.cuda.is_available() else "cpu"

# audio_path is the path to the input audio file, set by the caller.
waveform, sr = torchaudio.load(audio_path)  # [channels, samples]
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)  # downmix to mono
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, sr, 16000)
waveform = waveform.squeeze(0).float()

processor = AutoProcessor.from_pretrained(MODEL, trust_remote_code=True)
model = AutoModelForSpeechSeq2Seq.from_pretrained(MODEL, trust_remote_code=True).to(device).eval()

chunk_seconds = 240
chunk_samples = chunk_seconds * 16000
def generate(chunk_waveform, prompt, max_new_tokens):
    chat = [{"role": "system", "content": SYSTEM_PROMPT}, {"role": "user", "content": prompt}]
    text = processor.tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
    inputs = processor(text, chunk_waveform.cpu(), sampling_rate=16000, return_tensors="pt").to(device)
    # Greedy (argmax) decoding keeps the output deterministic.
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False, num_beams=1)
    # Decode only the newly generated tokens, skipping the prompt.
    return processor.tokenizer.decode(
        output[0, inputs["input_ids"].shape[-1] :],
        skip_special_tokens=True
    ).strip()
all_words, all_turns = [], []
for chunk_start in range(0, waveform.shape[-1], chunk_samples):
    chunk_end = min(chunk_start + chunk_samples, waveform.shape[-1])
    chunk_waveform = waveform[chunk_start:chunk_end]
    chunk_start_seconds = chunk_start / 16000.0

    timestamp_text = generate(chunk_waveform, TIMESTAMPS_PROMPT, max_new_tokens=10000)
    saa_text = generate(chunk_waveform, SAA_PROMPT, max_new_tokens=4096)

    # The timestamps wrap every 10s, so this function parses the text
    # and converts the timestamps to seconds from the start of the chunk.
    chunk_words = parse_timestamp_words(timestamp_text)
    chunk_turns = parse_speaker_turns(saa_text)

    # Greedy speaker assignment: timestamped words are the canonical transcript, and
    # speaker turns are aligned to those words. This is a simple approach, and could be
    # improved with more sophisticated alignment.
    chunk_words, chunk_turns = assign_speakers_to_words(chunk_words, chunk_turns)

    # Shift chunk-relative times back onto the full-conversation timeline.
    for w in chunk_words:
        w["start"] += chunk_start_seconds
        w["end"] += chunk_start_seconds
    for t in chunk_turns:
        t["start"] += chunk_start_seconds
        t["end"] += chunk_start_seconds

    all_words.extend(chunk_words)
    all_turns.extend(chunk_turns)

segments = build_segments_from_words(all_words)  # final diarized segments
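The parsing and alignment helpers are omitted from the listing above. A minimal sketch of what they could look like follows. It assumes the output formats named in the prompts (`word [T:45]` tags in centiseconds that reset every 10 s, and `[Speaker N]:` turn tags), approximates each word's start by the previous word's end since only end times are emitted, and aligns speakers by word count alone. The regexes, the `WRAP_CENTISECONDS` constant and the alignment strategy are all assumptions made for illustration, not the exact backend code:

```python
import re

# Assumed output formats, taken from the prompt wording rather than a spec.
TIMESTAMP_TAG = re.compile(r"(\S+)\s*\[T:(\d+)\]")
SPEAKER_TAG = re.compile(r"\[Speaker (\d+)\]:\s*")
WRAP_CENTISECONDS = 1000  # assumption: the counter resets every 10 s


def parse_timestamp_words(text):
    """Turn 'hello [T:45] world [T:82]' into words with start/end in seconds."""
    words, offset, prev_raw, prev_end = [], 0, 0, 0.0
    for token, raw in TIMESTAMP_TAG.findall(text):
        value = int(raw)
        if value < prev_raw:  # counter went backwards, so assume one 10 s wrap
            offset += WRAP_CENTISECONDS
        prev_raw = value
        end = (offset + value) / 100.0
        words.append({"word": token, "start": prev_end, "end": end})
        prev_end = end
    return words


def parse_speaker_turns(text):
    """Split '[Speaker 1]: hi [Speaker 2]: hello' into per-speaker word lists."""
    parts = SPEAKER_TAG.split(text)  # e.g. ['', '1', 'hi ', '2', 'hello']
    return [
        {"speaker": f"Speaker {parts[i]}", "words": parts[i + 1].split()}
        for i in range(1, len(parts) - 1, 2)
    ]


def assign_speakers_to_words(words, turns):
    """Naive greedy alignment: each turn claims the next len(turn) timed words."""
    cursor, timed_turns = 0, []
    for turn in turns:
        span = words[cursor : cursor + len(turn["words"])]
        cursor += len(span)
        if not span:
            continue  # no timed words left for this turn
        for word in span:
            word["speaker"] = turn["speaker"]
        timed_turns.append(
            {"speaker": turn["speaker"], "start": span[0]["start"], "end": span[-1]["end"]}
        )
    # Words beyond the last turn inherit the final speaker, if there is one.
    for word in words[cursor:]:
        word["speaker"] = timed_turns[-1]["speaker"] if timed_turns else None
    return words, timed_turns
```

Two limitations of this sketch are worth stating: a pause longer than 10 s would hide a wrap, and the count-based alignment drifts as soon as the two passes disagree on the number of words, so a text-aware alignment would be more robust.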
Results
Each sample shows Granite and WhisperX transcript outputs. Click any word to seek to the timestamp for the word given by the model (and hover over the word to see the timestamp).
Some general observations:
- Granite's transcription quality is pretty good (it hallucinates much less than WhisperX), but its timestamping and speaker attribution really suffer in comparison. Given that you also get punctuation and capitalization from Whisper, it doesn't seem worth modifying a WhisperX pipeline by swapping out the core Faster Whisper model for Granite.
- Granite is a lot slower than WhisperX, even when WhisperX runs with VAD and alignment turned on.
- Granite seems to struggle to maintain accurate timestamps at segment boundaries, or after silences, but not in a consistent way that I can identify.
- Similarly, Granite's speaker attribution is inconsistent, with no pattern that I can predict.
Conclusion
I hope that this report is useful to anyone considering using Granite Speech 4.1 2b Plus for real-world T&D tasks. Though its resilience to hallucination is impressive, I can't see how you could use it as a drop-in replacement for a WhisperX-like pipeline in its current form. If you have had success with it, please share your insights. I am happy to be wrong or to have made a silly mistake if it means I get to make practical use of this exciting model.