From the model card, sounds interesting:

The “Unified” in Gemma 4 12B Unified refers to its encoder-free architecture. Other Gemma 4 models use dedicated encoders to process multimodal data before passing it to the LLM. Gemma 4 12B eliminates these encoders entirely, projecting raw image patches and audio waveforms directly into the LLM’s embedding space through lightweight linear layers. This unified approach means all modalities flow straight into a single decoder-only transformer, reducing multimodal latency and allowing the entire model to be fine-tuned in one pass.
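For anyone wondering what "projecting raw patches and waveforms through linear layers" could look like in practice, here is a rough toy sketch. It is not Gemma's actual code: the function names, the tiny dimensions, the non-overlapping audio frames, and the random weights standing in for learned ones are all assumptions for illustration.

```python
import random

def patchify(image, patch):
    """Split an H x W grayscale image (list of rows) into flattened patch vectors."""
    h, w = len(image), len(image[0])
    patches = []
    for top in range(0, h - patch + 1, patch):
        for left in range(0, w - patch + 1, patch):
            patches.append([image[top + i][left + j]
                            for i in range(patch) for j in range(patch)])
    return patches

def frame_audio(waveform, frame_len):
    """Chop a 1-D waveform into non-overlapping frames of raw samples."""
    return [waveform[s:s + frame_len]
            for s in range(0, len(waveform) - frame_len + 1, frame_len)]

def make_linear(in_dim, out_dim, rng):
    """Random weights standing in for a learned projection (hypothetical)."""
    return [[rng.gauss(0.0, 0.02) for _ in range(in_dim)] for _ in range(out_dim)]

def project(vectors, weights):
    """Apply one linear layer: each input vector becomes one D_MODEL-sized token."""
    return [[sum(w * x for w, x in zip(row, v)) for row in weights] for v in vectors]

rng = random.Random(0)
D_MODEL = 8  # toy embedding width, not the real model's

image = [[rng.random() for _ in range(8)] for _ in range(8)]  # 8x8 toy image
waveform = [rng.uniform(-1, 1) for _ in range(64)]            # 64 raw samples
text_tokens = [[rng.gauss(0, 0.02) for _ in range(D_MODEL)] for _ in range(3)]

# No separate encoder network: one linear map per modality, straight to D_MODEL.
image_tokens = project(patchify(image, 4), make_linear(16, D_MODEL, rng))
audio_tokens = project(frame_audio(waveform, 16), make_linear(16, D_MODEL, rng))

# Every modality is now a list of same-width rows, so they can be concatenated
# into one sequence for a single decoder-only transformer.
sequence = text_tokens + image_tokens + audio_tokens
print(len(sequence), len(sequence[0]))  # prints "11 8"
```

The point of the sketch is only the shape bookkeeping: 4 image patches, 4 audio frames, and 3 text embeddings all land in the same 8-wide space, which is what lets one transformer handle them without a per-modality encoder in front.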

The benchmarks put it closer to the 26B MoE than to the E variants of the Gemma 4 series, but mostly below Qwen3.5 9B.

Looking forward to giving it a shot.

  • mindbleach@sh.itjust.works
    4 days ago

    I’m enjoying the descent from clever structure to just… trusting backpropagation. The Bitter Lesson is being learned at every level. Make the training process better and faster because humans understand that part.