this post was submitted on 26 May 2026

Learning vocab ahead of reading? (software for epub analysis) (sh.itjust.works)

submitted 1 week ago* (last edited 1 week ago) by schipelblorp@sh.itjust.works to c/languagelearning@sopuli.xyz

I just jailbroke my Kindle and have a few epubs, and thought maybe this would be a good time to change my approach to vocabulary.

What I'd like to do is learn the vocabulary for my reading before I read it, instead of after, or as I'm reading it.

My dream piece of software would do the following:

1. Resolve all words down to their most basic form (i.e., singular for nouns, infinitive for verbs, etc.). My language is French.
2. Count occurrences of each word.
3. Filter out words I already know.
4. Define the words with a bilingual dictionary to English, including the original context sentence.
5. Make Anki cards for me to study.
6. (God-tier programming: also include idiomatic expressions as vocabulary.)

Does this exist?

Edit: Or help me assemble a pipeline to get all these tasks done separately.
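The counting and filtering steps from the list above can be sketched with the Python standard library alone. This is a minimal sketch, not a finished tool: the function names are made up for illustration, and `lemmatize` is a placeholder hook that does nothing by default, so a real French lemmatizer would have to be plugged in to cover the base-form step.

```python
import re
from collections import Counter

# Runs of letters (accented ones included), with optional internal hyphens
# such as "peut-être". Apostrophes split words, so "l'homme" gives "l", "homme".
WORD_RE = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")


def tokenize(text):
    """Lowercase the text and return its words in order."""
    return WORD_RE.findall(text.lower())


def unknown_word_counts(text, known_words, lemmatize=lambda w: w):
    """Count each word form in text, skipping those listed in known_words.

    lemmatize is a hypothetical hook: by default it returns the word
    unchanged, so "chats" and "chat" are counted separately.
    """
    counts = Counter(lemmatize(w) for w in tokenize(text))
    for word in known_words:
        counts.pop(word, None)
    return counts


if __name__ == "__main__":
    sample = "Le chat dort. Le chien mange, le chat regarde."
    # Most frequent unknown words first.
    for word, n in unknown_word_counts(sample, {"le"}).most_common():
        print(n, word)
```

The known-word list is just a set here; it could be loaded from a text file or exported from an existing Anki deck.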

top 7 comments

[–] bluGill@fedia.io 4 points 1 week ago

If you need to learn the vocab first, then the text is too advanced. Pick easier works to read. As a beginner there is no way around lookups, but it shouldn't take too long before you can find something you can understand without looking up words.

[–] emb@lemmy.world 3 points 1 week ago (1 children)

JPDB.io does something like this for Japanese. Not sure you can really import books, but it basically combines some kind of parser with a dictionary API, an example sentence corpus, and its own spaced repetition system.

Gotta be something along those lines out there for most languages, but I can't say I know of the tools. Honestly, the breaking-down-into-a-base-word part of it is probably in the dictionary's domain. If you give it a conjugated verb, it should usually be able to tell. But then some ambiguities need context; not sure how to account for that.

AnkiConnect lets you tap into the Anki APIs, and Wiktionary or (from a quick search) Collins should have a dictionary API available for French-English. If the dictionary APIs are good, then you could probably get pretty far with basic sentence parsing.

But yeah, feels like there's gotta be something ready made for it, wish I knew and could point you in a direction.
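For the AnkiConnect part mentioned above, a card can be added by sending a JSON `addNote` request to the add-on's local HTTP server. The sketch below assumes Anki is running with AnkiConnect on its default port, that the stock "Basic" note type (fields "Front" and "Back") is used, and that the target deck already exists. The deck name and tag are placeholders chosen for illustration.

```python
import json
import urllib.request

ANKI_CONNECT_URL = "http://127.0.0.1:8765"  # AnkiConnect's default address


def build_add_note_request(word, definition, sentence, deck="French::Vocab"):
    """Build the JSON body for AnkiConnect's addNote action.

    The deck name and the "epub-vocab" tag are hypothetical examples.
    """
    return {
        "action": "addNote",
        "version": 6,
        "params": {
            "note": {
                "deckName": deck,
                "modelName": "Basic",
                "fields": {
                    "Front": word,
                    # Card fields are HTML, so <br> separates the two parts.
                    "Back": f"{definition}<br><br>{sentence}",
                },
                "tags": ["epub-vocab"],
            }
        },
    }


def send_to_anki(payload, url=ANKI_CONNECT_URL):
    """POST the payload to a running Anki; return the result or raise on error."""
    data = json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        reply = json.load(response)
    if reply.get("error"):
        raise RuntimeError(reply["error"])
    return reply["result"]
```

Calling `send_to_anki(build_add_note_request("chat", "cat", "Le chat dort."))` would create one card. The dictionary lookup that supplies the definition is left out, since it depends on which dictionary API ends up being used.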

[–] schipelblorp@sh.itjust.works 2 points 1 week ago (1 children)

I've only done enough programming to know this is very possible. A word count is probably all I'd need to do this manually. Just wondering if this is one of those things I do instead of learning, so the less time I spend on it, the better I'll feel.

[–] emb@lemmy.world 1 points 4 days ago* (last edited 4 days ago) (1 children)

Was messing around with Jiten.moe (spiritual successor to jpdb, again boasts the utility of ingesting a book or subtitle file and creating anki cards) and it made me think of this question. (And Jiten is actually open-source, so the repo's there with how they do it... but I'm pretty sure it's mostly just wrapping a bunch of Japanese-specific tools.)

Did a little looking. Tried checking https://github.com/keon/awesome-nlp and didn't see anything French-specific, but did come across https://github.com/french-ai/french-nlp which might have useful stuff. It sounds like a library called spaCy could be useful.

But then I ran across this tool, which might be pretty close to what you'd need? https://github.com/FreeLanguageTools/vocabsieve

VocabSieve is a companion program for language learning with Anki. Its primary function is sentence mining, in which sentences with vocabulary words are collected and added into Anki for long term retention. It aims to help intermediate learners gain vocabulary efficiently by allowing card creation with minimal friction. Possible use cases include sentence mining from videos, texts, asynchronously from ereader highlights, and even completely automatically from books or subtitles.

I haven't looked into exactly how the 'automatically from books' stuff would work or anything, but seems promising.

And I guess the elephant in the room: NLP is the kind of task LLMs are actually pretty good at, so there's also always that lazy-ish route: convert the book to text, feed it through an LLM, and ask it to identify important vocabulary words.

[–] schipelblorp@sh.itjust.works 2 points 4 days ago

Thanks! VocabSieve looks perfect (though experimental), and it works with KOReader, too. Fuck me, I'm running out of excuses.

[–] dragontamer@lemmy.world 2 points 1 week ago (1 children)

I feel like you're approaching this incorrectly. Do you have graded readers?

An A2 graded reader would assume you know all A2-level words and would provide definitions for the B1+ / B2 (or beyond) words in the text.

So instead of making software that does the work of making a graded reader, it is probably better to just start by using graded readers (where all this work has already been done).

[–] schipelblorp@sh.itjust.works 2 points 1 week ago

I feel like it's not that much work and the benefit is that it gives me a lot more freedom to read what appeals to me.

For instance, I found an unseeded torrent of 600 French epubs. Imagine being able to do something as simple as sorting them by lexical complexity: that is, do a unique word count and rank them from lowest unique word count to highest. Trivially simple to do, and it would yield me books that are constantly in my range of proximal learning.

But, yes, thank you for the suggestion! I'll look into some readers, depending if I feel more lazy than broke.
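The ranking idea described above (unique word count per epub, lowest first) can be sketched with the standard library, since an EPUB is a zip archive of HTML/XHTML files. This is a rough sketch under assumptions: the function names are invented for illustration, DRM-free files are assumed, and inflected forms are counted as separate words because no lemmatizer is involved.

```python
import re
import zipfile
from html.parser import HTMLParser

WORD_RE = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")


class _TextExtractor(HTMLParser):
    """Collects text nodes of an (X)HTML document, skipping script and style."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = False

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = False

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def epub_text(path):
    """Return the text of every HTML/XHTML file inside an EPUB archive."""
    parts = []
    with zipfile.ZipFile(path) as book:
        for name in book.namelist():
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                extractor = _TextExtractor()
                extractor.feed(book.read(name).decode("utf-8", errors="ignore"))
                parts.append(" ".join(extractor.chunks))
    return " ".join(parts)


def unique_word_count(text):
    """Number of distinct lowercase word forms in the text."""
    return len(set(WORD_RE.findall(text.lower())))


def rank_by_unique_words(paths):
    """Return (path, unique word count) pairs, smallest vocabulary first."""
    scored = [(path, unique_word_count(epub_text(path))) for path in paths]
    return sorted(scored, key=lambda pair: pair[1])
```

One caveat with this measure: the raw unique word count grows with the length of the book, so a short but difficult text can rank below a long easy one. Counting over a fixed-size sample of each book might give a fairer comparison.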