brucethemoose

joined 1 year ago
[–] brucethemoose@lemmy.world 1 points 6 hours ago

I don't mean to be rude though, PM me about stuff if you want! But I make no promises about a timely response lol.

[–] brucethemoose@lemmy.world 3 points 7 hours ago (2 children)

Probably not going to happen in the House. The four Senate Republicans who voted that way are known breakaways/Trump haters:

Republican Sens. Susan Collins (Maine), Mitch McConnell (Ky.), Lisa Murkowski (Alaska) and Rand Paul (Ky.) joined Democrats in voting for the measure, which passed 51-48.

https://www.axios.com/2025/04/02/senate-repeal-trump-tariffs-canada

[–] brucethemoose@lemmy.world 5 points 7 hours ago* (last edited 7 hours ago)

Yeah. Valve's 30% cut is greed. So is their (alleged) anticompetitive behavior of forcing price parity with other stores (aka devs can't price things cheaper elsewhere than on Steam).

I mean, I like their store. I like most of their behavior, but I am also waiting for the hammer to drop, and everyone should.

[–] brucethemoose@lemmy.world 2 points 7 hours ago* (last edited 7 hours ago) (2 children)

Not particularly, just frequent posters I'm familiar with.

I plan to get more involved once I get some personal stuff straight.

[–] brucethemoose@lemmy.world 2 points 7 hours ago* (last edited 7 hours ago) (4 children)

Heh, I'm an EE dropout kinda in machine learning stuff now. Good luck, chemical engineering seems tough (but cool).

But yeah, on Lemmy the idea is to post in communities that fit your niches, rather than trying to follow people directly like on Mastodon, Twitter or whatever. Those are a bit slim but growing (for instance, there are some active science-focused communities/servers).

[–] brucethemoose@lemmy.world 1 points 8 hours ago* (last edited 8 hours ago)

Yeah.

And again, the US was a legendary tariff dodger back in the day. It's like our claim to fame. The irony is tremendous.

[–] brucethemoose@lemmy.world 4 points 9 hours ago (3 children)

Or "launder" goods through there, like China did with Mexico and some other countries. Or the US did earlier in its colonial history.

[–] brucethemoose@lemmy.world 13 points 9 hours ago

A quirk of the deficit / imports formula they used. Small countries with relatively small trade balances with the US can hit pretty wild values.
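For illustration, here's a hedged sketch of that formula in Python. The numbers are made up, and the exact formula is an assumption on my part (reports described the rate as roughly half the deficit-to-imports ratio with a 10% floor):

```python
# Hypothetical illustration of the deficit/imports quirk described above.
# Assumed formula: rate = max(10%, (trade deficit / imports) / 2).
def tariff_rate(deficit, imports, floor=0.10):
    return max(floor, (deficit / imports) / 2)

# A small country with tiny absolute trade but a lopsided ratio gets a wild rate:
print(f"{tariff_rate(180, 200):.0%}")   # $180M deficit on $200M imports -> 45%
# A large, fairly balanced partner just hits the floor:
print(f"{tariff_rate(50, 1000):.0%}")   # -> 10%
```

The ratio, not the absolute trade volume, drives the number, which is why small economies can hit such extreme values.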

[–] brucethemoose@lemmy.world 2 points 9 hours ago (1 children)

You can generally toggle LLM "grounding" features, aka inserting web searches into their context.

Modern LLMs have an information "cutoff" of a few months ago at the latest, so the base models will have zero awareness of this formula.
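To make the "grounding" idea concrete, here's a rough sketch of what those features do under the hood. Everything here is hypothetical (the snippets would come from whatever search backend the UI wires in), not any particular vendor's API:

```python
# Grounding in a nutshell: fetched web snippets get prepended to the prompt,
# so the model can answer about events past its training cutoff.
def build_grounded_prompt(question, snippets):
    context = "\n".join(f"- {s}" for s in snippets)
    return (
        "Use the following web results to answer.\n"
        f"Web results:\n{context}\n\n"
        f"Question: {question}"
    )

print(build_grounded_prompt(
    "What formula was used?",
    ["Axios: tariff rate derived from trade deficit / imports ..."],
))
```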

[–] brucethemoose@lemmy.world 13 points 9 hours ago* (last edited 9 hours ago)

It's not just math, but economic theory. There's a lot of historical context here, going back to mercantilism in the 1600s, when countries were obsessed with trying to maximize exports. You may remember this from history class, and how they figured out it was, ultimately, not the best idea.

https://en.wikipedia.org/wiki/Mercantilism

Anyway, ignore the Greek letters. The Trump administration is using the trade deficit (how much other countries buy from us vs. how much we buy from them) as the number that sets how much to tax those imports, the idea being that this tax will "punish" countries and incentivize them not to run such a big trade deficit with the US. Per mercantilism, buying more from someone than we sell to them is a "loss," as we are losing money to them. And US manufacturers will take up the slack.

...In practice, that's not how it works, as Europe learned in the 1600s/1700s and the US learned in the Great Depression, among many other times. There are a lot of fallacies, including:

  • "the popular folly of confusing wealth with money," aka assuming the trade deficit is unprofitable "loss."

  • Overestimating the US's importance. It's a big world with a lot of easy shipping, and countries have many other places to ship stuff if the US gives them a big enough middle finger.

  • Ramping up manufacturing locally is hard, depending on the industry. It could take years and billions, and in some cases is not practical at all. That's why we buy stuff from other countries where it's easier to make. It's like the core tenet of free trade.

  • Other factors are not static. Slap a gigantic tariff on something, and the supply/demand/pricing is not going to stay the same.

  • It's also ignoring how being the world's #1 consumer cemented the US's power across the world, and arguably stabilized a lot of geopolitics (with some unsavory complications, though). This was largely the idea behind the post-WWII world order.

[–] brucethemoose@lemmy.world 6 points 9 hours ago* (last edited 9 hours ago)

TBH it's probably human written.

I used to write small articles for a tech news outlet on the side (HardOCP), and the entire site went under well before the AI boom because no one can compete with conveyor belts of thoughtless SEO garbage, especially when Google promotes it.

Point being, this was a problem well before the rise of LLMs.

[–] brucethemoose@lemmy.world 7 points 9 hours ago (5 children)

In this case, it's as simple as "type it into ChatGPT, like the Reddit users did" :/

 

I see a lot of talk of Ollama here, which I personally don't like because:

  • The quantizations they use tend to be suboptimal

  • It abstracts away llama.cpp in a way that, frankly, leaves a lot of performance and quality on the table.

  • It abstracts away things that you should really know for hosting LLMs.

  • I don't like some things about the devs. I won't rant, but I especially don't like the hints that they're cooking up something commercial.

So, here's a quick guide to get away from Ollama.

  • First step is to pick your OS. Windows is fine, but if you're setting up something new, Linux is best. I favor CachyOS in particular for its great Python performance. If you use Windows, be sure to enable hardware-accelerated scheduling and disable shared memory.

  • Ensure the latest version of CUDA (or ROCm, if using AMD) is installed. Linux is great for this, as many distros package them for you.

  • Install Python 3.11.x or 3.12.x (or whatever your distro supports), plus git. If on Linux, also install your distro's "build tools" package.
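As a quick sanity check of the prerequisites above, here's a small hypothetical helper (not part of any of these tools) that verifies the interpreter version and looks for GPU toolkit binaries on the PATH:

```python
import shutil
import sys

def check_env(version=sys.version_info):
    """Return (python_ok, gpu_tools_found) for the setup steps above."""
    ok_python = (3, 11) <= version[:2]  # 3.11.x or newer
    # Any of these on PATH suggests CUDA/ROCm tooling is installed:
    tools = [t for t in ("nvcc", "nvidia-smi", "rocminfo") if shutil.which(t)]
    return ok_python, tools

ok, found = check_env()
print("Python OK:", ok, "| GPU tooling:", found or "none")
```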

Now for actually installing the runtime. There are a great number of inference engines supporting different quantizations; forgive the Reddit link, but see: https://old.reddit.com/r/LocalLLaMA/comments/1fg3jgr/a_large_table_of_inference_engines_and_supported/

As far as I am concerned, 3 matter to "home" hosters on consumer GPUs:

  • Exllama (and by extension TabbyAPI): a very fast, very memory-efficient "GPU only" runtime that supports AMD via ROCm and Nvidia via CUDA: https://github.com/theroyallab/tabbyAPI

  • Aphrodite Engine. While not quite as VRAM-efficient, it's much faster with parallel API calls, reasonably efficient at very short context, and supports just about every quantization under the sun and more exotic models than exllama. AMD/Nvidia only: https://github.com/PygmalionAI/Aphrodite-engine

  • This fork of kobold.cpp, which supports finer-grained kv cache quantization (we will get to that). It supports CPU offloading and, I think, Apple Metal: https://github.com/Nexesenex/croco.cpp

Now, there are also reasons I don't like llama.cpp, but one of the big ones is that its model implementations sometimes have... quality-degrading issues, or odd bugs. Hence I would generally recommend TabbyAPI if you have enough vram to avoid offloading to CPU and can figure out how to set it up.

Setting it up can go wrong; if anyone gets stuck, I can help with that.

  • Next, figure out how much VRAM you have.

  • Figure out how much "context" you want, aka how much text the LLM can ingest. If a model has a context length of, say, "8K", that means it can take 8K tokens as input, which is less than 8K words. Not all tokenizers are the same: some, like Qwen 2.5's, fit nearly a word per token, while others are more in the ballpark of half a word per token or less.
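As back-of-envelope math for that tokens-vs-words point (the ratios below are rough assumptions, not measured values):

```python
# How much actual text fits in a context window depends on the tokenizer's
# average tokens-per-word ratio.
def words_that_fit(context_tokens, tokens_per_word):
    return int(context_tokens / tokens_per_word)

print(words_that_fit(8192, 1.1))  # efficient tokenizer, ~1 token/word: 7447 words
print(words_that_fit(8192, 2.0))  # ~half a word per token: 4096 words
```

Same "8K" window, nearly a 2x difference in how much text it actually holds.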

  • Keep in mind that the actual context length of many models is an outright lie, see: https://github.com/hsiehjackson/RULER

  • Exllama has a feature called "kv cache quantization" that can dramatically shrink the VRAM the "context" of an LLM takes up. Unlike llama.cpp's, its Q4 cache is basically lossless, and on a model like Command-R, an 80K+ context can take up less than 4GB! It's essential to enable Q4 or Q6 cache to squeeze as much LLM as you can into your GPU.
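For a rough sense of why Q4 cache matters, here's generic transformer KV-cache math. The layer/head numbers below are a plausible GQA config I picked for illustration, not exact figures for any particular model:

```python
# KV cache size per token = 2 (K and V) * layers * kv_heads * head_dim * bytes/elem.
def kv_cache_gb(context, layers, kv_heads, head_dim, bits):
    per_token_bytes = 2 * layers * kv_heads * head_dim * bits / 8
    return context * per_token_bytes / 1e9

# Assumed GQA config (40 layers, 8 KV heads, head_dim 128) at Q4 (~4 bits/elem):
print(round(kv_cache_gb(81920, 40, 8, 128, 4), 2))   # ~3.36 GB for an 80K context
# Same context at FP16 (16 bits/elem) would be 4x that:
print(round(kv_cache_gb(81920, 40, 8, 128, 16), 2))
```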

  • With that in mind, you can search huggingface for your desired model. Since we are using tabbyAPI, we want to search for "exl2" quantizations: https://huggingface.co/models?sort=modified&search=exl2

  • There are all sorts of finetunes... and a lot of straight-up garbage. But I will post some general recommendations based on total vram:

  • 4GB: A very small quantization of Qwen 2.5 7B. Or maybe Llama 3B.

  • 6GB: IMO llama 3.1 8B is best here. There are many finetunes of this depending on what you want (horny chat, tool usage, math, whatever). For coding, I would recommend Qwen 7B coder instead: https://huggingface.co/models?sort=trending&search=qwen+7b+exl2

  • 8GB-12GB: Qwen 2.5 14B is king! Unlike its 7B counterpart, I find the 14B version of the model incredible for its size, and it will squeeze into this vram pool (albeit with very short context/tight quantization for the 8GB cards). I would recommend trying Arcee's new distillation in particular: https://huggingface.co/bartowski/SuperNova-Medius-exl2

  • 16GB: Mistral 22B, Mistral Coder 22B, and very tight quantizations of Qwen 2.5 32B are possible. Honorable mention goes to InternLM 2.5 20B, which is alright even at 128K context.

  • 20GB-24GB: Command-R 2024 35B is excellent for "in context" work, like asking questions about long documents, continuing long stories, anything involving working "with" the text you feed to an LLM rather than pulling from its internal knowledge pool. It's also quite good at longer contexts, out to 64K-80K more or less, all of which fits in 24GB. Otherwise, stick to Qwen 2.5 32B, which still has a very respectable 32K native context and a rather mediocre 64K "extended" context via YaRN: https://huggingface.co/DrNicefellow/Qwen2.5-32B-Instruct-4.25bpw-exl2

  • 32GB: same as 24GB, just with a higher-bpw quantization. But this is also the threshold where lower-bpw quantizations of Qwen 2.5 72B (at short context) start to make sense.

  • 48GB: Llama 3.1 70B (for longer context) or Qwen 2.5 72B (for 32K context or less)

Again, browse huggingface and pick an exl2 quantization that will cleanly fill your vram pool plus the amount of context you want to specify in TabbyAPI. Many quantizers, such as bartowski, will list how much space they take up, but you can also just look at the available file size.
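That sizing step boils down to simple arithmetic; here's a sketch (the headroom figure is my guess, not a measured number):

```python
# Does model file + KV cache fit in VRAM, with some headroom for activations
# and overhead? All sizes in GB.
def fits(vram_gb, model_file_gb, cache_gb, headroom_gb=1.0):
    return model_file_gb + cache_gb + headroom_gb <= vram_gb

# e.g. an ~18.5GB exl2 file plus a ~3.4GB quantized cache on a 24GB card:
print(fits(24, 18.5, 3.4))  # True
print(fits(16, 18.5, 3.4))  # False: pick a smaller bpw or less context
```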

  • Now... you have to download the model. Bartowski has instructions here, but I prefer to use this nifty standalone tool instead: https://github.com/bodaay/HuggingFaceModelDownloader

  • Put it in your TabbyAPI models folder, and follow the documentation on the wiki.

  • There are a lot of options. Some to keep in mind are chunk_size (higher than 2048 will process long contexts faster but take up lots of vram; less will save a little vram), cache_mode (use Q4 for long context, Q6/Q8 for short context if you have room), max_seq_len (this is your context length), tensor_parallel (for faster inference with 2 identical GPUs), and max_batch_size (parallel processing if you have multiple users hitting the tabbyAPI server, but more vram usage).
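To summarize those knobs in one place, a hypothetical sketch as a Python dict. TabbyAPI itself reads these from its config file; only the option names mentioned above are assumed real, and the values are example choices:

```python
# Example option choices for the settings discussed above (illustrative only;
# see the TabbyAPI wiki for the real config file layout).
tabby_options = {
    "max_seq_len": 32768,      # your context length
    "cache_mode": "Q4",        # Q4 for long context, Q6/Q8 for short
    "chunk_size": 2048,        # higher = faster long prompts, more vram
    "tensor_parallel": False,  # True with 2 identical GPUs
    "max_batch_size": 1,       # >1 for multiple parallel users, more vram
}
print(tabby_options)
```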

  • Now... pick your frontend. The tabbyAPI wiki has a good compilation of community projects, but Open Web UI is very popular right now: https://github.com/open-webui/open-webui I personally use exui: https://github.com/turboderp/exui

  • And be careful with your sampling settings when using LLMs. Different models behave differently, but one of the most common mistakes people make is using "old" sampling parameters for new models. In general, keep temperature very low (<0.1, or even zero) and repetition penalty low (1.01?) unless you need long, creative responses. If available in your UI, enable DRY sampling to tamp down repetition without "dumbing down" the model with too much temperature or repetition penalty. Always use a MinP of 0.05 or higher and disable other samplers. This is especially important for Chinese models like Qwen, as MinP cuts out "wrong language" answers from the response.
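Those settings map onto a request against a local OpenAI-compatible endpoint something like this. The URL, port, and model name are placeholders, and you should check your server's docs for which sampler fields it actually accepts:

```python
import json
import urllib.request

# Sampler settings from the advice above, as a chat completion request body.
payload = {
    "model": "my-exl2-model",                     # placeholder name
    "messages": [{"role": "user", "content": "Summarize this document..."}],
    "temperature": 0.1,          # very low for new models
    "repetition_penalty": 1.01,  # keep this low too
    "min_p": 0.05,               # cuts junk / wrong-language tokens
}
req = urllib.request.Request(
    "http://localhost:5000/v1/chat/completions",  # assumed local server URL
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# with urllib.request.urlopen(req) as r:          # uncomment with a running server
#     print(json.load(r)["choices"][0]["message"]["content"])
```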

  • Now, once this is all set up and running, I'd recommend throttling your GPU, as it simply doesn't need its full core speed to maximize inference speed while generating. For my 3090, I use something like sudo nvidia-smi -pl 290, which throttles it down from 420W to 290W.

Sorry for the wall of text! I can keep going, discussing kobold.cpp/llama.cpp, Aphrodite, exotic quantization and other niches like that if anyone is interested.

 

cross-posted from: https://lemmy.world/post/19242887

I can run the full 131K context with a 3.75bpw quantization, and still a very long one at 4bpw. And it should barely be fine-tunable in unsloth as well.

It's pretty much perfect! Unlike the last iteration, they're using very aggressive GQA, which makes the context's memory footprint small, and it feels really smart at long-context stuff like storytelling, RAG, document analysis and things like that (whereas Gemma 27B and Mistral Code 22B are probably better suited to short chats/code).
