this post was submitted on 18 Aug 2025
2 points (62.5% liked)

Large Language Models


A place to discuss large language models.

Rules

  1. Please tag [not libre software] and [never on-device] services as such (those not green in the License column here).
  2. Be useful to others

Resources

github.com/ollama/ollama
https://github.com/danny-avila/LibreChat
github.com/Aider-AI/aider
wikipedia.org/wiki/List_of_large_language_models

founded 2 years ago

Yesterday I had a brilliant idea: why not parse the wiki of my favorite tabletop roleplaying game into YAML via an LLM? I had tried the same with BeautifulSoup a couple of years ago, but the page is very inconsistent, which makes it quite difficult to parse with traditional methods.

However, my attempts with a local Mistral model (the one you get with ollama pull mistral) were not very successful: it first insisted on writing more than just the YAML code, and later had trouble with more complex pages like https://dsa.ulisses-regelwiki.de/zauber.html?zauber=Abvenenum So I thought I had to give it some examples in the system prompt, but while one example helped a little, when I included more, it sometimes just returned one of the examples I had given it via the system prompt.
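One common remedy when a model echoes back few-shot examples from the system prompt is to move them into alternating user/assistant turns instead. A minimal sketch against Ollama's /api/chat endpoint, using only the stdlib; the example HTML/YAML pairs are hypothetical placeholders, not the wiki's real markup:

```python
# Sketch: pass few-shot examples as alternating user/assistant turns
# rather than stuffing them all into the system prompt. The HTML/YAML
# pairs below are hypothetical placeholders.
import json
import urllib.request

SYSTEM = (
    "You convert spell pages from the wiki into YAML. "
    "Reply with YAML only, no prose, no code fences."
)

# Each example becomes one user turn (input HTML) followed by one
# assistant turn (the YAML we want back for that input).
EXAMPLES = [
    ("<b>Probe:</b> MU/KL/CH", "probe: MU/KL/CH"),
    ("<b>Wirkung:</b> Das Gift wird neutralisiert.",
     "wirkung: Das Gift wird neutralisiert."),
]

def build_messages(page_html: str) -> list[dict]:
    messages = [{"role": "system", "content": SYSTEM}]
    for html, yaml_out in EXAMPLES:
        messages.append({"role": "user", "content": html})
        messages.append({"role": "assistant", "content": yaml_out})
    messages.append({"role": "user", "content": page_html})
    return messages

def chat(messages: list[dict], model: str = "mistral") -> str:
    # Plain POST to a locally running Ollama server.
    body = json.dumps({"model": model, "messages": messages, "stream": False})
    req = urllib.request.Request(
        "http://localhost:11434/api/chat",
        data=body.encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]

msgs = build_messages("<b>Probe:</b> KL/IN/KO")
```

The actual call (`chat(msgs)`) requires Ollama to be running locally; building the message list works either way.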

To give some idea: the bold text should become keys in the YAML structure, and the text that follows should become the value. Sometimes values need a bit more parsing, like separating page numbers from book names - I would give examples for all of that.
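Some of that cleanup can also be done after the model replies instead of asking the model to do everything. A stdlib sketch, assuming a hypothetical "Book S. &lt;page&gt;" source format and markdown-fenced replies; adjust the patterns to what the wiki actually uses:

```python
# Sketch: post-process model output. extract_yaml() strips a markdown
# code fence if the model wrapped its answer in one; split_source()
# separates a book name from a trailing "S. <number>" page reference
# (the format is an assumption, not the wiki's confirmed layout).
import re

FENCE = re.compile(r"```(?:yaml)?\s*\n(.*?)```", re.S)

def extract_yaml(reply: str) -> str:
    m = FENCE.search(reply)
    return m.group(1).strip() if m else reply.strip()

def split_source(value: str) -> dict:
    m = re.match(r"^(?P<book>.+?)\s+S\.\s*(?P<page>\d+)$", value.strip())
    if not m:
        return {"book": value.strip(), "page": None}
    return {"book": m.group("book"), "page": int(m.group("page"))}

print(split_source("Aventurisches Kompendium II S. 35"))
# → {'book': 'Aventurisches Kompendium II', 'page': 35}
```

Doing the deterministic parts in code keeps the prompt shorter, which also helps the echoing problem.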

Any idea what model to use for that or how to improve results?

[–] sga@piefed.social 1 points 4 days ago

and also, how are you getting the wiki? i would first scrape it. if it is something like Fandom, then do not scrape directly; first host your own BreezeWiki instance (https://docs.breezewiki.com/Running.html), then use wget with an appropriate rate limit. using BreezeWiki will remove some junk, and you will get cleaner html to begin with.
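Whatever the scraping route, the remaining navigation and script noise can be trimmed before prompting. A minimal stdlib sketch; the set of tags to drop is an assumption, so check it against the actual pages:

```python
# Sketch: strip <script>/<style>/<nav>/<header>/<footer> content from
# scraped pages before feeding them to the model, using only the stdlib
# HTML parser. DROP is a guess at the noisy tags; adjust as needed.
from html.parser import HTMLParser

DROP = {"script", "style", "nav", "header", "footer"}

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0   # > 0 while inside a dropped element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in DROP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in DROP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.parts.append(data.strip())

def clean(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.parts)

print(clean("<nav>menu</nav><p>Probe: MU/KL/CH</p><script>x()</script>"))
# → Probe: MU/KL/CH
```

Less junk in the input means less for a small model to get distracted by.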

for small models, try to keep the total input (prompt plus data) small, as they generally cannot retain their smarts for long contexts (even if they advertise larger context windows).
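A crude guard for that is to budget the prompt in approximate tokens before sending it. A sketch using the rough 4-characters-per-token rule of thumb, which is an approximation, not a real tokenizer:

```python
# Sketch: rough context-budget check before prompting a small model.
# The 4-chars-per-token ratio is a crude heuristic, not a tokenizer;
# max_tokens should be set well below the model's advertised context.
def fits_budget(system_prompt: str, page_text: str, max_tokens: int = 2048) -> bool:
    approx_tokens = (len(system_prompt) + len(page_text)) // 4
    return approx_tokens <= max_tokens

def truncate_to_budget(text: str, max_tokens: int = 2048) -> str:
    # Hard cutoff as a last resort; prefer trimming junk first.
    max_chars = max_tokens * 4
    return text if len(text) <= max_chars else text[:max_chars]
```

Pages that fail the check are better split or cleaned further than blindly truncated, since the spell stats may sit anywhere on the page.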