From the model card, sounds interesting:

The “Unified” in Gemma 4 12B Unified refers to its encoder-free architecture. Other Gemma 4 models use dedicated encoders to process multimodal data before passing it to the LLM. Gemma 4 12B eliminates these encoders entirely, projecting raw image patches and audio waveforms directly into the LLM’s embedding space through lightweight linear layers. This unified approach means all modalities flow straight into a single decoder-only transformer, reducing multimodal latency and allowing the entire model to be fine-tuned in one pass.
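
To make the "lightweight linear layers" idea concrete, here is a rough NumPy sketch of what encoder-free input could look like: image patches and audio frames are flattened, each multiplied by a single weight matrix, and concatenated with text embeddings into one sequence for the decoder. Everything in it is an assumption for illustration (the names `patchify`, `frame_audio`, `D_MODEL`, the patch and frame sizes, the random weights); none of it comes from the model card or reflects Gemma's actual dimensions or code.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    x = image.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (rows, cols, patch, patch, C)
    return x.reshape(-1, patch * patch * c)

def frame_audio(wave, frame):
    """Cut a 1-D waveform into fixed-size frames, dropping the remainder."""
    n = len(wave) // frame
    return wave[: n * frame].reshape(n, frame)

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only to keep the example small.
D_MODEL = 64   # embedding width of the decoder
PATCH = 16     # image patch side, in pixels
FRAME = 160    # audio samples per frame
VOCAB = 1000   # text vocabulary size

# One linear projection per modality; no encoder stack in between.
W_img = rng.normal(scale=0.02, size=(PATCH * PATCH * 3, D_MODEL))
W_aud = rng.normal(scale=0.02, size=(FRAME, D_MODEL))
embed = rng.normal(scale=0.02, size=(VOCAB, D_MODEL))

image = rng.random((64, 48, 3))    # stand-in for raw pixels
wave = rng.normal(size=1700)       # stand-in for a raw waveform
text_ids = np.array([1, 5, 42])    # stand-in for token ids

img_tokens = patchify(image, PATCH) @ W_img     # 12 patches -> (12, 64)
aud_tokens = frame_audio(wave, FRAME) @ W_aud   # 10 frames  -> (10, 64)
txt_tokens = embed[text_ids]                    # 3 tokens   -> (3, 64)

# All modalities end up in one sequence for a single decoder-only model.
sequence = np.concatenate([img_tokens, aud_tokens, txt_tokens], axis=0)
print(sequence.shape)  # (25, 64)
```

Since the projections are plain matrices sitting in the same graph as the decoder, gradients would reach them in the same backward pass, which is presumably what the card means by fine-tuning the entire model in one pass.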

The benchmarks put it closer to the 26B MoE than to the E variants of the Gemma 4 series, but mostly below Qwen3.5 9B.

Looking forward to giving it a shot.

  • Mwa@thelemmy.club
    1 day ago

    So Qwen 9B is more for asking questions (and getting good responses), and Gemma 12B is for audio and video input as well as roleplay and creative writing?