• 2 Posts
  • 98 Comments
Joined 7 months ago
cake
Cake day: March 22nd, 2024

help-circle









  • Yep. 20GB is basically 24GB, though its too tight for 70B models.

    One quirk for 7900 owners is that installing flash attention for long context usage can be a pain. Apparently it is doable now, I need to dig up the link, but it might just be easier to use kobold.cpp rocm with its native flash attention.

    As for vision models, that is a whole different can of worms. Exllama does not support this, so you’d need a framework that does.

    If you are looking for niche models, check out MiniG (which is a continued pretrain of the already very excellent GLM4-9B): https://huggingface.co/bartowski/miniG-GGUF

    Llama.cpp support is recent, though I’m not 100% sure its completely fixed. It should work in Aphrodite as well.





  • Pretty much everything has an API :P

    ollama is OK because its easy and automated, but you can get higher performance, better vram efficiency, and better samplers from either kobold.cpp or tabbyAPI, with the catch being that more manual configuration is required. But this is good, as it “forces” you to pick and test an optimal config for your system.

    I’d recommend kobold.cpp for very short context (like 6K or less) or if you need to partially offload the model to CPU because your GPU is relatively low VRAM. Use a good IQ quantization (like IQ4_M, for instance).

    Otherwise use TabbyAPI with an exl2 quantization, as it’s generally faster (but GPU only) and much better at long context through its great k/v cache quantization.

    They all have OpenAI APIs, though kobold.cpp also has its own web ui.




  • I have an old Lenovo laptop with an NVIDIA graphics card.

    @Maroon@lemmy.world The biggest question I have for you is what graphics card, but generally speaking this is… less than ideal.

    To answer your question, Open Web UI is the new hotness: https://github.com/open-webui/open-webui

    I personally use exui for a lot of my LLM work, but that’s because I’m an uber minimalist.

    And on your setup, I would host the best model you can on kobold.cpp or the built-in llama.cpp server (just not Ollama) and use Open Web UI as your front end. You can also use llama.cpp to host an embeddings model for RAG, if you wish.

    This is a general ranking of the “best” models for document answering and summarization: https://huggingface.co/spaces/vectara/Hallucination-evaluation-leaderboard

    …But generally, I prefer to not mess with RAG retrieval and just slap the context I want into the LLM myself, and for this, the performance of your machine is kind of critical (depending on just how much “context” you want it to cover). I know this is !selfhosted, but once you get your setup dialed in, you may consider making calls to an API like Groq, Cerebras or whatever, or even renting a Runpod GPU instance if that’s in your time/money budget.



  • A letter seen by Reuters, sent by Vivaldi, Waterfox, and Wavebox, and supported by a group of web developers, also supports Opera’s move to take the EC to court over its decision to exclude Microsoft Edge from being subject to the Digital Markets Act (DMA).

    OK…

    Shouldn’t they be fighting Chrome, more than anything? Surely there’s a legal avenue for that, though I guess there’s a risk of getting deprioritized by Google and basically disappearing.


  • I somehow didnt’ get a notification for its post, but thats a terrible idea lol.

    We already have AI horde, and it has nothing to do with blockchain. We also have APIs and GPU services… that have nothing to do with blockchain, and have no need for blockchain.

    Someone apparently already tried the scheme you are describing, and absolutely no one in the wider AI community uses it.