TL;DR: Polish is not supported as audio input for Gemma 4 12B.
Google has released Gemma 4 12B, a new variant of its open-weight LLM family. It’s a medium-sized model advertised to run locally on laptop-sized devices using just 16 GB of VRAM. It also features native audio input.
Audio: We simplified audio processing even further. We removed the audio encoder entirely and projected the raw audio signal into the same dimensional space as text tokens.
Source: Gemma 4 12B Announcement
I decided to dig deeper into the model’s automatic speech recognition capabilities. I wanted to know if it supports Polish, my native tongue.
First, I checked the model card:
Automatic speech recognition (ASR) and speech-to-translated-text translation across multiple languages.
Source: Gemma 4 model card
Multiple languages, aha! That’s not very revealing. Since the model card doesn’t say which languages, the only way forward is to check for myself.
I started by upgrading to Ollama 0.30.5, the first version advertised to work with Gemma 4 12B.
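If you want to confirm the server is new enough before pulling, a dotted-version comparison is enough. This is a minimal sketch, not part of the original setup: `version_ge` is a hypothetical helper name, it assumes a `sort` that supports `-V` (GNU coreutils and recent macOS), and the commented usage assumes Ollama's `/api/version` endpoint on the default port.

```shell
# version_ge A B: succeeds when dotted version A is greater than or equal to B.
# Relies on `sort -V` (version sort) putting the smaller version first.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical usage, left commented out because it needs a running server:
# installed=$(curl -s http://127.0.0.1:11434/api/version | jq -r .version)
# version_ge "$installed" 0.30.5 || echo "Ollama $installed is too old" >&2
```

Version sort matters here: a plain string comparison would rank 0.9.9 above 0.30.5.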
ollama pull gemma4:12b
ollama ps
# Prints:
# NAME ID SIZE PROCESSOR CONTEXT UNTIL
# gemma4:12b 4eb23ef187e2 8.2 GB 100% GPU 131072 4 minutes from now
# Let's do a text smoke test first
curl http://127.0.0.1:11434/api/chat \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gemma4:12b",
    "stream": false,
    "messages": [
      {
        "role": "user",
        "content": "Hello!"
      }
    ]
  }' -s | jq .message.content
# > "Hello! How can I help you today?"
Text generation works; let’s move to voice.
For voice, I like to use sag, which is a CLI for ElevenLabs inspired by macOS’s say.
Before checking Polish, let’s smoke-test ASR with English audio.
export ELEVENLABS_API_KEY="..."
# Listen to the clip before generating a version for the ASR
sag speak "The quick brown fox jumps over the lazy dog."
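# ... and let's get the version for the model
# (create the output directory first, in case sag does not create it):
mkdir -p /tmp/gemma4-12b-asr-polish
sag speak --no-play --no-stream \
  -o /tmp/gemma4-12b-asr-polish/english-smoke.mp3 \
  "The quick brown fox jumps over the lazy dog."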
# Convert it to the format Gemma 4 expects: 16 kHz (-ar 16000), mono (-ac 1) WAV
ffmpeg -y -hide_banner \
  -i /tmp/gemma4-12b-asr-polish/english-smoke.mp3 \
  -ar 16000 -ac 1 \
  /tmp/gemma4-12b-asr-polish/english-smoke-16k-mono.wav
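Before sending the file anywhere, it is cheap to check that the conversion really produced 16 kHz mono. The sketch below reads the two relevant fields straight from the WAV header with `od`. `wav_format` is a hypothetical helper name, and it rests on two assumptions: the `fmt ` chunk sits at byte 12 (the usual layout for plain PCM files) and the host is little-endian, because `od` decodes multi-byte values in host byte order.

```shell
# wav_format FILE: prints "<channels> <sample_rate>" from a RIFF/WAVE header.
# Assumes the "fmt " chunk starts at byte 12 and a little-endian host.
wav_format() {
  # Channel count: 2-byte unsigned integer at byte offset 22.
  channels=$(od -An -tu2 -j22 -N2 "$1" | tr -d '[:space:]')
  # Sample rate: 4-byte unsigned integer at byte offset 24.
  rate=$(od -An -tu4 -j24 -N4 "$1" | tr -d '[:space:]')
  printf '%s %s\n' "$channels" "$rate"
}

# Hypothetical usage:
# [ "$(wav_format english-smoke-16k-mono.wav)" = "1 16000" ] || echo "unexpected format" >&2
```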
Time to send it to the model:
AUDIO_B64=$(base64 -i /tmp/gemma4-12b-asr-polish/english-smoke-16k-mono.wav | tr -d '\n')
jq -n --arg audio "$AUDIO_B64" '{
  model: "gemma4:12b",
  stream: false,
  messages: [{
    role: "user",
    content: "Listen to the attached audio and write the exact English words spoken. Do not translate. Return only the English transcript.",
    images: [$audio]
  }]
}' | curl -sS http://127.0.0.1:11434/api/chat -H 'Content-Type: application/json' -d @- | jq '.message.content'
It printed the expected "The quick brown fox jumps over the lazy dog."
Notice we’re using the images field to send the base64-encoded audio. At the time of writing, this is the only way to pass multimodal input to Ollama; neither an audio nor an input_audio field is supported.
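One practical caveat with passing the audio through `--arg`: the whole base64 string travels as a single command-line argument, and some systems cap that size (Linux, as far as I know, limits one argument to 128 KiB, which 16 kHz mono audio reaches after roughly three seconds). A minimal sketch of an alternative is to feed the base64 text to `jq` on stdin instead. `build_audio_payload` is a hypothetical helper name; the payload shape mirrors the request used in this post.

```shell
# build_audio_payload MODEL PROMPT WAV: prints an /api/chat request body.
# The base64 text reaches jq on stdin (-R raw input, -s slurp into one string)
# rather than via --arg, so it never counts against argument-size limits.
build_audio_payload() {
  base64 < "$3" | tr -d '\n' | jq -Rs --arg model "$1" --arg prompt "$2" '{
    model: $model,
    stream: false,
    messages: [{role: "user", content: $prompt, images: [.]}]
  }'
}

# Hypothetical usage, left commented out because it needs a running server:
# build_audio_payload gemma4:12b "Return only the transcript." clip.wav \
#   | curl -sS http://127.0.0.1:11434/api/chat -H 'Content-Type: application/json' -d @-
```

Inside the filter, `.` is the slurped base64 string, so `images: [.]` produces the same field that `images: [$audio]` does.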
Now it’s time for the final test. This time, the fixture is the famous opening of Pan Tadeusz by Adam Mickiewicz:
Litwo, Ojczyzno moja! ty jesteś jak zdrowie;
Ile cię trzeba cenić, ten tylko się dowie,
Kto cię stracił. Dziś piękność twą w całej ozdobie
Widzę i opisuję, bo tęsknię po tobie.
sag produces a ~15-second clip from it.
I had to disable thinking for this clip (think: false in the request): with thinking enabled, the model was still generating after several minutes on my M2 MacBook Air.
AUDIO_B64=$(base64 -i /tmp/gemma4-12b-asr-polish/litwo-ojczyzno-moja-16k-mono.wav | tr -d '\n')
jq -n --arg audio "$AUDIO_B64" '{
  model: "gemma4:12b",
  stream: false,
  think: false,
  options: {
    num_ctx: 8192,
    num_predict: 1024
  },
  messages: [{
    role: "user",
    content: "Listen to the attached audio and write the exact Polish words spoken. Do not translate. Preserve Polish spelling, diacritics, capitalization, and punctuation. Return only the Polish transcript.",
    images: [$audio]
  }]
}' | curl -sS http://127.0.0.1:11434/api/chat -H 'Content-Type: application/json' -d @- | jq -r .message.content
Prints:
Litfoj czs no ma, ty jests jak zdrowie, i lecz trzeba cenić, ten tylko se dowje kto cz straciu. Dżisz pęngnost dwonw czalej ozdobie widze i opisuje. Bo tenskne potens nie potolisz.
:sadpanda:
I’m not even going to compute standard ASR metrics like WER/CER here because the output is plainly unusable.
For comparison, here is Whisper on the same clip:
> whisper /tmp/gemma4-12b-asr-polish/litwo-ojczyzno-moja-16k-mono.wav \
    --model large-v3-turbo \
    --language Polish \
    --task transcribe \
    --output_format txt \
    --output_dir /tmp/gemma4-12b-asr-polish \
    2>/dev/null
[00:00.000 --> 00:03.660] Litwo, ojczyzno moja, Ty jesteś jak zdrowie.
[00:04.340 --> 00:08.240] Ile Cię trzeba cenić, ten tylko się dowie, kto Cię stracił.
[00:09.480 --> 00:14.960] Dziś piękność Twą w całej ozdobie widzę i opisuję, bo tęsknię po Tobie.
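For anyone who does want to put a number on the gap, word error rate is the word-level edit distance between hypothesis and reference, divided by the number of reference words. The sketch below is a minimal version in `awk`, not a replacement for a proper evaluation toolkit: `wer` is a hypothetical helper name, words are split on whitespace and compared exactly, and there is no case or punctuation normalization, which a real evaluation would usually apply first.

```shell
# wer REF HYP: prints word error rate = word-level edit distance / reference length.
# Transcripts are passed through the environment so awk does not reinterpret
# backslashes the way it would for -v assignments.
wer() {
  WER_REF="$1" WER_HYP="$2" awk 'BEGIN {
    n = split(ENVIRON["WER_REF"], r, " ")   # reference words
    m = split(ENVIRON["WER_HYP"], h, " ")   # hypothesis words
    if (n == 0) exit 1                      # undefined for an empty reference
    for (i = 0; i <= n; i++) d[i, 0] = i    # delete i reference words
    for (j = 0; j <= m; j++) d[0, j] = j    # insert j hypothesis words
    for (i = 1; i <= n; i++) {
      for (j = 1; j <= m; j++) {
        cost = ((r[i] "") == (h[j] "")) ? 0 : 1          # force string comparison
        best = d[i-1, j-1] + cost                         # match or substitution
        if (d[i-1, j] + 1 < best) best = d[i-1, j] + 1    # deletion
        if (d[i, j-1] + 1 < best) best = d[i, j-1] + 1    # insertion
        d[i, j] = best
      }
    }
    printf "%.2f\n", d[n, m] / n
  }'
}

# Hypothetical usage, with the reference text and a model output saved to files:
# wer "$(cat reference.txt)" "$(cat gemma.txt)"
```

Because insertions count as errors, the result can exceed 1.00 when the hypothesis contains more wrong words than the reference has words.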