• 1 Post
  • 12 Comments
Joined 11 months ago
Cake day: March 22nd, 2024


  • It’s semantics, and a subject of ongoing debate.

    Per Wikipedia, I really like this proposal:

    Astronomer Jean-Luc Margot proposed a mathematical criterion that determines whether an object can clear its orbit during the lifetime of its host star, based on the mass of the planet, its semimajor axis, and the mass of its host star.[210] The formula produces a value called π that is greater than 1 for planets.[c] The eight known planets and all known exoplanets have π values above 100, while Ceres, Pluto, and Eris have π values of 0.1, or less. Objects with π values of 1 or more are expected to be approximately spherical, so that objects that fulfill the orbital-zone clearance requirement around Sun-like stars will also fulfill the roundness requirement[211] – though this may not be the case around very low-mass stars.

    It basically means a planet should be big enough to consolidate all the stuff in its orbital zone, rather than just being part of an asteroid belt. That makes sense to me (there are some rough numbers at the end of this comment, if you want to see how the formula shakes out).


    https://en.wikipedia.org/wiki/Dwarf_star

    “Dwarf” stars are even more confusing, as it’s basically a synonym for “normal,” as opposed to “giant” stars (which are relatively puffy and big for their mass/temperature), or more exotic stars. But the term is also used for special cases, like the relatively exotic white dwarfs (leftover cores of dead stars with very strange properties, extreme density, and no “burning” like a star traditionally does), or “barely a star” brown dwarfs.

    TL;DR: If an astronomer asks you to name something, you should say ‘absolutely not.’
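
    For the curious, here is a quick Python sketch of Margot’s criterion. If I’m reading the Wikipedia article right, the formula is roughly Π = k·m / (M^(5/2)·a^(9/8)), with m in Earth masses, M (the star) in solar masses, a in AU, and k ≈ 807 for clearing within a Sun-like star’s lifetime; the body masses and distances below are rounded and purely illustrative.

    ```python
    # Back-of-the-envelope version of Margot's planet criterion.
    # Pi = k * m / (M**2.5 * a**1.125), with m in Earth masses, M (the star)
    # in solar masses, a in AU, and k ~ 807 for Sun-like stars (assumed values).

    def margot_pi(m_earth, a_au, m_star_solar=1.0, k=807.0):
        """Pi > 1 means the body can clear its orbital zone within its star's lifetime."""
        return k * m_earth / (m_star_solar ** 2.5 * a_au ** 1.125)

    # (approximate mass in Earth masses, semimajor axis in AU)
    bodies = {"Earth": (1.0, 1.0), "Ceres": (1.6e-4, 2.77), "Pluto": (2.2e-3, 39.5)}

    for name, (m, a) in bodies.items():
        print(f"{name:6s} Pi ~ {margot_pi(m, a):.3g}")

    # Prints roughly: Earth ~ 807, Ceres ~ 0.04, Pluto ~ 0.03.
    # Planets land orders of magnitude above 1, dwarf planets well below it.
    ```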




  • You know what I mean: the rest of the world other than Russia, China, or the US.

    Even just going by that metric, that’s mostly India+neighbors (who do not like China, and are mixed on Russia I think?), plus populations that aren’t very sympathetic to any of the three veto powers anymore.

    That’s going to get more pronounced, as the US/China are not very climate conscious anymore, and Russia seems to low-key want climate change, while the rest of the world’s population tends to be in very vulnerable areas.



  • I mean, there’s a real issue.

    Say you were China, or the EU, or any other country/bloc, and basically your entire youth was addicted to Twitter, Facebook, or whatever, and officially manipulable by the US government… and you got into a real conflict. Maybe even a hot war.

    Wouldn’t you be worried about the US propagandizing your population?

    I would.

    The US government’s solution is completely dysfunctional and doesn’t get at the root of the issue, because they are afraid of reducing the power projection of big tech, among other things. But that doesn’t mean the core issue should be trivialized.


  • To go into more detail:

    • Exllama is faster than llama.cpp, all other things being equal.

    • exllama’s quantized KV cache implementation is also far superior: nearly lossless at Q4, while llama.cpp’s is nearly unusable at Q4 (and needs to be turned up to Q5_1/Q4_0 or Q8_0/Q4_1 for good quality). There are some rough cache-size numbers at the end of this comment.

    • With ollama specifically, you get locked out of a lot of knobs, like llama.cpp’s enhanced KV cache quantization, more advanced quantization (like iMatrix IQ quants or the ARM/AVX-optimized Q4_0_4_4/Q4_0_8_8 quants), advanced sampling like DRY, batched inference, and such.

    It’s not evidence or options… it’s missing features, and that’s my big issue with ollama. I simply get far worse, and far slower, LLM responses out of ollama than out of tabbyAPI/EXUI on the same hardware, and there’s no way around it.

    Also, I’ve been frustrated with implementation bugs in llama.cpp specifically, like how Llama 3.1 (for instance) was bugged past 8K context at launch because llama.cpp didn’t properly support its RoPE scaling. Ollama inherits all these quirks.

    I don’t want to go into the issues I have with the ollama devs’ behavior though, as that’s way more subjective.
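
    To put rough numbers on the KV cache point (back-of-the-envelope only; the model shape below is an assumption roughly in the ballpark of a 30B-class GQA model, not any specific one):

    ```python
    # Rough KV-cache footprint: 2 tensors (K and V) per layer, each holding
    # n_kv_heads * head_dim elements per token. The shape here is assumed.
    n_layers, n_kv_heads, head_dim = 64, 8, 128
    context = 65_536  # 64K tokens

    # Approximate bytes per element, including block-scale overhead for the
    # llama.cpp-style quantized cache formats.
    bytes_per_elem = {"fp16": 2.0, "q8_0": 1.0625, "q4_0": 0.5625}

    for fmt, b in bytes_per_elem.items():
        total = 2 * n_layers * n_kv_heads * head_dim * context * b
        print(f"{fmt}: ~{total / 2**30:.1f} GiB of KV cache at {context} tokens")

    # Roughly 16 GiB (fp16), 8.5 GiB (q8_0), 4.5 GiB (q4_0) for this shape,
    # which is why how lossy the cache quantization is matters so much at long context.
    ```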


  • It’s less optimal.

    On a 3090, I simply can’t run Command-R or Qwen 2.5 32B well at 64K-80K context with ollama (there’s some rough VRAM math at the end of this comment). It’s slow even at lower context, and the lack of DRY sampling and some other things majorly hit quality.

    Ollama is meant to be turnkey, and that’s fine, but LLMs are extremely resource-intensive. Sometimes the manual setup/configuration is worth it to squeeze out every ounce of extra performance and quantization quality.

    Even on CPU-only setups, you are missing out on (for instance) the CPU-optimized quantizations llama.cpp offers now, or the more advanced sampling kobold.cpp offers, or more fine-grained tuning of flash attention configs, or batched inference, just to start.

    And as I hinted at, I don’t like some other aspects of ollama, like how they “leech” off llama.cpp and kinda hide the association without contributing upstream, some hype and controversies in the past, and hints that they may be cooking up something commercial.
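
    For a sense of why 24GB gets so tight at long context, here is a rough budget sketch; every number in it is an illustrative assumption (quant level, per-token cache cost), not a measurement of any particular backend.

    ```python
    # Rough VRAM budget for a 24 GB card running a ~32B model at long context.
    # All numbers are assumptions for illustration; real usage depends on the
    # backend, quant format, flash attention, batch size, and overhead.
    params = 32e9                 # ~32B parameters
    bits_per_weight = 4.5         # a typical "fits on one card" quant level
    kv_gib_per_1k_tokens = 0.07   # assumed cost of a ~4-bit KV cache

    weights_gib = params * bits_per_weight / 8 / 2**30
    for ctx in (8_192, 32_768, 65_536):
        kv_gib = kv_gib_per_1k_tokens * ctx / 1024
        print(f"{ctx:>6} tokens: weights ~{weights_gib:.1f} GiB + cache ~{kv_gib:.1f} GiB"
              f" = ~{weights_gib + kv_gib:.1f} GiB, plus activations/overhead")

    # Around 64K context this already brushes up against 24 GB, which is why
    # squeezing quantization quality and cache settings matters so much.
    ```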


  • Your post is suggesting that the same models with the same parameters generate different results when run on different backends.

    Yes… sort of. Different backends support different quantization schemes, for both the weights and the KV cache (the context). There are all sorts of tradeoffs.

    There are even more exotic weight quantization schemes (AQLM, VPTQ) that are much more VRAM-efficient than llama.cpp’s or exllama’s, but I skipped mentioning them (unless someone asks) because they’re so clunky to set up.

    Different backends also support different samplers. exllama and kobold.cpp tend to be at the cutting edge of this, with things like DRY for better long-form generation, or grammar-constrained output. A rough sketch of what a DRY request can look like is below.
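
    As a concrete illustration, here is roughly what passing DRY settings to a local OpenAI-compatible server (tabbyAPI, kobold.cpp, etc.) can look like. The endpoint, model name, and dry_* parameter names here are assumptions following the common DRY sampler convention; check your backend’s docs for the exact fields it accepts.

    ```python
    import requests

    # Hypothetical local OpenAI-compatible completions endpoint; the URL,
    # model name, and dry_* field names are assumptions, not a specific API.
    resp = requests.post(
        "http://localhost:5000/v1/completions",
        json={
            "model": "my-local-model",  # placeholder
            "prompt": "Write a long story about a lighthouse keeper.",
            "max_tokens": 512,
            "temperature": 0.8,
            # DRY penalizes verbatim repetition of earlier token sequences,
            # which helps long-form generation avoid looping.
            "dry_multiplier": 0.8,
            "dry_base": 1.75,
            "dry_allowed_length": 2,
        },
        timeout=120,
    )
    print(resp.json()["choices"][0]["text"])
    ```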


  • So there are multiple ways to split models across GPUs (layer splitting, which uses one GPU and then another; expert parallelism, which puts different experts on different GPUs), but the one you’re interested in is “tensor parallelism.”

    This requires a lot of communication between the GPUs, and NVLink speeds that up dramatically. There’s a toy sketch of the split at the end of this comment.

    It comes down to this: If you’re more interested in raw generation speed, especially with parallel calls of smaller models, and/or you don’t care about long context (with 4K being plenty), use Aphrodite. It will ultimately be faster.

    But if you simply want to stuff the best/highest-quality model you can into VRAM, especially at longer context (>4K), use TabbyAPI. Its tensor parallelism only works over PCIe, so it will be a bit slower, but it will still stream text much faster than you can read. It can simply hold bigger, better models at higher quality in the same 48GB VRAM pool.
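
    If it helps to picture the difference, here is a toy numpy sketch of the tensor-parallel idea, with two “GPUs” simulated as array slices. This is only the concept, not how Aphrodite or TabbyAPI actually implement it.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal((1, 4096))      # activations for one token
    W = rng.standard_normal((4096, 4096))   # one layer's weight matrix

    # Tensor parallelism (toy version): split W's columns across two "GPUs".
    # Each device holds half the weights and computes half the output.
    W_gpu0, W_gpu1 = W[:, :2048], W[:, 2048:]
    y0 = x @ W_gpu0
    y1 = x @ W_gpu1

    # The halves then have to be exchanged/concatenated before the next layer.
    # That per-layer exchange is the traffic NVLink accelerates; over plain
    # PCIe it becomes the main overhead.
    y = np.concatenate([y0, y1], axis=1)
    assert np.allclose(y, x @ W)  # same result as the unsplit multiply

    # Layer splitting, by contrast, puts whole layers on each GPU and only
    # hands activations across once per boundary, so it needs far less bandwidth.
    ```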