Gemma 4 12B and Polish audio input

TL;DR: Polish is not supported as audio input for Gemma 4 12B.

Google has released Gemma 4 12B, a new variant of its open-weight LLM family. It’s a medium-sized model advertised to run locally on laptop-sized devices using just 16 GB of VRAM. It also features native audio input.

Audio: We simplified audio processing even further. We removed the audio encoder entirely and projected the raw audio signal into the same dimensional space as text tokens.

Source: Gemma 4 12B Announcement

I decided to dig deeper into the model’s automatic speech recognition capabilities. I wanted to know if it supports Polish, my native tongue.

First, I checked the model card:

Automatic speech recognition (ASR) and speech-to-translated-text translation across multiple languages.

Source: Gemma 4 model card

Multiple languages, aha! That’s not very revealing. The model card doesn’t say which languages, so the only way forward is to check for myself.

I started by upgrading to Ollama 0.30.5, the first version advertised to work with Gemma 4 12B.

ollama pull gemma4:12b
ollama ps
# Prints:
# NAME          ID              SIZE      PROCESSOR    CONTEXT    UNTIL
# gemma4:12b    4eb23ef187e2    8.2 GB    100% GPU     131072     4 minutes from now

# Let's do a text smoke test first
curl http://127.0.0.1:11434/api/chat \
      -H 'Content-Type: application/json' \
      -d '{
    "model": "gemma4:12b",
    "stream": false,
    "messages": [
      {
        "role": "user",
        "content": "Hello!"
      }
    ]
  }' -s | jq .message.content
# > "Hello! How can I help you today?"

Text generation works; let’s move to voice.

For voice, I like to use sag, which is a CLI for ElevenLabs inspired by macOS’s say.

Before checking Polish, let’s smoke-test ASR with English audio.

export ELEVENLABS_API_KEY="..."
# Listen to the clip before generating a version for the ASR
sag speak "The quick brown fox jumps over the lazy dog."
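# ... and let's get the version for the model.
# Create the output directory first in case it does not exist yet.
mkdir -p /tmp/gemma4-12b-asr-polish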

sag speak --no-play --no-stream \
    -o /tmp/gemma4-12b-asr-polish/english-smoke.mp3 \
    "The quick brown fox jumps over the lazy dog."
# Convert it to 16 kHz mono WAV, the format Gemma 4 expects
ffmpeg -y -hide_banner  \
    -i /tmp/gemma4-12b-asr-polish/english-smoke.mp3 \
    -ar 16000 -ac 1 \
    /tmp/gemma4-12b-asr-polish/english-smoke-16k-mono.wav
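
Before sending the clip anywhere, it can be worth confirming that the conversion really produced 16 kHz mono. The sketch below reads the sample rate and channel count straight from the WAV header using only od. The wav_info name is hypothetical. It assumes the "fmt " chunk directly follows the RIFF header (which I believe is what ffmpeg writes) and a little-endian host.

```shell
# Hypothetical helper: print "<sample_rate> <channels>" read from a WAV header.
# Assumes the "fmt " chunk directly follows the RIFF header and a
# little-endian host (od decodes integers in host byte order).
wav_info() {
  # Bytes 22-23: channel count (unsigned 16-bit).
  channels=$(od -A n -t u2 -j 22 -N 2 "$1" | tr -d '[:space:]')
  # Bytes 24-27: sample rate in Hz (unsigned 32-bit).
  rate=$(od -A n -t u4 -j 24 -N 4 "$1" | tr -d '[:space:]')
  echo "$rate $channels"
}

# Example: warn if the converted clip is not 16 kHz mono.
# [ "$(wav_info /tmp/gemma4-12b-asr-polish/english-smoke-16k-mono.wav)" = "16000 1" ] \
#   || echo "unexpected audio format" >&2
```

If ffprobe is installed, it is the more robust choice because it parses the chunk layout instead of relying on fixed offsets.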

Time to send it to the model:

AUDIO_B64=$(base64 -i /tmp/gemma4-12b-asr-polish/english-smoke-16k-mono.wav | tr -d '\n')

jq -n --arg audio "$AUDIO_B64" '{
  model: "gemma4:12b",
  stream: false,
  messages: [{
    role: "user",
    content: "Listen to the attached audio and write the exact English words spoken. Do not translate. Return only the English transcript.",
    images: [$audio]
  }]
}' | curl -sS http://127.0.0.1:11434/api/chat -H 'Content-Type: application/json' -d @- | jq '.message.content'

It printed the expected "The quick brown fox jumps over the lazy dog."

Notice that we’re using the images field to send the base64-encoded audio. At the time of writing, this is the only way to pass multimodal input to Ollama; neither an audio nor an input_audio field is supported.
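
Since the same request shape is needed for every clip, it can be wrapped in a small function. This is only a sketch: the build_asr_payload name is hypothetical, and it swaps --arg for jq's --rawfile option (available in jq 1.6 and later, as far as I know). The reason for the swap is that some systems cap the size of a single command-line argument; on Linux the cap is typically around 128 KiB, which a base64-encoded clip exceeds quickly. Reading the audio from a temporary file avoids that limit.

```shell
# Hypothetical helper: build an Ollama /api/chat request body for one audio file.
# Usage: build_asr_payload MODEL PROMPT WAV_FILE
build_asr_payload() {
  model=$1
  prompt=$2
  wav=$3
  b64_file=$(mktemp)
  # Strip newlines so the encoded audio becomes a single JSON string.
  base64 < "$wav" | tr -d '\n' > "$b64_file"
  # --rawfile reads the string from disk instead of the command line.
  jq -n --arg model "$model" --arg prompt "$prompt" --rawfile audio "$b64_file" '{
    model: $model,
    stream: false,
    messages: [{role: "user", content: $prompt, images: [$audio]}]
  }'
  rm -f "$b64_file"
}

# Example, reusing the endpoint from the requests above:
# build_asr_payload gemma4:12b "Return only the English transcript." clip.wav \
#   | curl -sS http://127.0.0.1:11434/api/chat -H 'Content-Type: application/json' -d @- \
#   | jq -r .message.content
```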

Now it’s time for the final test. This time, the fixture is the famous opening of Pan Tadeusz by Adam Mickiewicz:

Litwo, Ojczyzno moja! ty jesteś jak zdrowie;
Ile cię trzeba cenić, ten tylko się dowie,
Kto cię stracił. Dziś piękność twą w całej ozdobie
Widzę i opisuję, bo tęsknię po tobie.

sag produces a ~15-second clip from it.

I had to disable thinking for this clip because, with thinking enabled, the model was still generating after several minutes on my M2 MacBook Air. I also capped the context window and the output length.

AUDIO_B64=$(base64 -i /tmp/gemma4-12b-asr-polish/litwo-ojczyzno-moja-16k-mono.wav | tr -d '\n')

jq -n --arg audio "$AUDIO_B64" '{
  model: "gemma4:12b",
  stream: false,
  think: false,
  options: {
    num_ctx: 8192,
    num_predict: 1024
  },
  messages: [{
    role: "user",
    content: "Listen to the attached audio and write the exact Polish words spoken. Do not translate. Preserve Polish spelling, diacritics, capitalization, and punctuation. Return only the Polish transcript.",
    images: [$audio]
  }]
}' | curl -sS http://127.0.0.1:11434/api/chat -H 'Content-Type: application/json' -d @- | jq -r .message.content

Prints:

Litfoj czs no ma, ty jests jak zdrowie, i lecz trzeba cenić, ten tylko se dowje kto cz straciu. Dżisz pęngnost dwonw czalej ozdobie widze i opisuje. Bo tenskne potens nie potolisz.

:sadpanda:

I’m not even going to compute standard ASR metrics like WER/CER here because the output is plainly unusable.
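
For readers who do want a number, WER (word error rate) is the word-level edit distance between the reference and the transcript, divided by the number of reference words. Below is a minimal sketch in awk. The wer function name is hypothetical, and it assumes both texts fit on a single line. It skips the lowercasing and punctuation stripping that real evaluations usually apply first.

```shell
# Hypothetical helper: word error rate of a hypothesis against a reference.
# Usage: wer "reference text" "hypothesis text"
wer() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    NR == 1 { n = split($0, ref, " ") }   # reference words
    NR == 2 { m = split($0, hyp, " ") }   # hypothesis words
    END {
      if (n == 0) { print "undefined"; exit }
      # d[i, j] = edit distance between the first i reference words
      # and the first j hypothesis words.
      for (i = 0; i <= n; i++) d[i, 0] = i
      for (j = 0; j <= m; j++) d[0, j] = j
      for (i = 1; i <= n; i++)
        for (j = 1; j <= m; j++) {
          # Force string comparison so numeric-looking words are not coerced.
          cost = ((ref[i] "") == (hyp[j] "")) ? 0 : 1
          best = d[i-1, j-1] + cost                       # match or substitution
          if (d[i-1, j] + 1 < best) best = d[i-1, j] + 1  # deletion
          if (d[i, j-1] + 1 < best) best = d[i, j-1] + 1  # insertion
          d[i, j] = best
        }
      printf "%.2f\n", d[n, m] / n
    }'
}

wer "the quick brown fox" "the quik brown"   # prints 0.50
```

One substituted word and one dropped word out of four reference words give 2 / 4 = 0.50.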

Let me just leave Whisper here for comparison:

whisper /tmp/gemma4-12b-asr-polish/litwo-ojczyzno-moja-16k-mono.wav \
    --model large-v3-turbo \
    --language Polish \
    --task transcribe \
    --output_format txt \
    --output_dir /tmp/gemma4-12b-asr-polish \
    2>/dev/null

[00:00.000 --> 00:03.660]  Litwo, ojczyzno moja, Ty jesteś jak zdrowie.
[00:04.340 --> 00:08.240]  Ile Cię trzeba cenić, ten tylko się dowie, kto Cię stracił.
[00:09.480 --> 00:14.960]  Dziś piękność Twą w całej ozdobie widzę i opisuję, bo tęsknię po Tobie.