Automatic Speech Recognition in Albanian for subtitles of old Albanian movies

Hi! I have an itch to scratch: I’m learning Albanian and I want to watch the Kinostudio movies with subtitles in Albanian. Only a few of the movies are available on YouTube, and those have English subtitles.

The problem is twofold:

  1. There are no sources for subtitles, either from Arkivi Qendror Shtetëror i Filmit (AQSHF) (which should be in the public domain somehow) or anywhere online. I’ve desperately looked everywhere. AQSHF renovated its website last month, dropping the Flash-based system, and is restoring some of its movies.
  2. There are a few Speech to Text (STT) initiatives, but I couldn’t manage to run them locally. YouTube doesn’t offer automatic speech-to-text transcription in Albanian. Whisper by OpenAI works well for English, but is awful for Albanian: this is addressed here.

Here are the existing initiatives.

  1. Peshperima is a model available on Hugging Face. It’s from Nullius in Verba, a software company with AI expertise based in Tirana. It can be used with Whisper. Its last update was in April 2023. I opened an issue here and got an answer, but couldn’t make it work.
  2. AlbanianASR is an initiative by Florijan Qosja, born from his dissertation for his computer science degree at the University of Greenwich in April 2023. He’s the most transparent about the accuracy (46.3%). He built a small community platform called uneduashqiperine to grow his project DibraSpeaks (label/validate audio data).
  3. Mozilla Common Voice opened the Albanian language in spring 2023. I haven’t figured out how to use it locally, but given the amount of training data (2 hours), it’s not ready in any case.
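
To compare these initiatives on the same footing, one option is to transcribe a short clip by hand and score each model against it. Below is a minimal sketch of word error rate (WER), a common metric for speech recognition. I don't know how the 46.3% accuracy figure above was measured, so this is not necessarily the same metric; the function name and the lowercase/whitespace normalisation are my own assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption for this sketch: words are compared after lowercasing and
    splitting on whitespace; punctuation is not stripped.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # previous[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1       # reference word missing in output
            insertion = current[j - 1] + 1   # extra word in output
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)
```

A lower value is better, and the result can exceed 1.0 when the model outputs many extra words.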

There are also proprietary solutions:

  1. is a service by Quantix, a company based in Prishtina. It’s closed source.
  2. Google Cloud’s Speech to Text functionality in Albanian. I don’t know how good it is.

I’m convinced there is something to do here:

  1. Ask and help AQSHF to open up the subtitles (choosing a license, finding a place to host them):

    1. This promotes the amazing old Albanian cinema in general.
    2. This makes these movies more accessible for anyone who might be learning Albanian (foreigners, members of the diaspora…) and for Albanians with a hearing impairment.
    3. This would also ease translation into other languages.
  2. Implement and try out the different initiatives I mentioned above (which I tried to do but failed). All the open source projects seem promising (especially AlbanianASR, because of the documentation, transparency and accessibility). More generally, I think this is an essential tool to learn and understand Albanian (so it fulfills my direct need), and it could make Albanian culture more accessible.
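
Whichever model ends up working, its output still has to become a subtitle file. Here is a minimal sketch of that last step, assuming the model returns timed segments (start, end, text). The `Segment` class and function names are hypothetical and not taken from any of the projects above; the output follows the common SRT layout (index, `HH:MM:SS,mmm --> HH:MM:SS,mmm`, text, blank line).

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One transcribed chunk of speech (hypothetical structure for this sketch)."""
    start: float  # seconds from the start of the movie
    end: float    # seconds from the start of the movie
    text: str


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as the HH:MM:SS,mmm timestamp used in SRT files."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: list[Segment]) -> str:
    """Render timed segments as SRT text, numbering the cues from 1."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}\n"
            f"{seg.text.strip()}\n"
        )
    # Cues are separated by a blank line.
    return "\n".join(blocks)
```

Writing the result to a UTF-8 `.srt` file next to the video should be enough for most players to pick it up, and keeping the text separate from the timing also makes later manual correction or translation easier.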

I hope you’ll find this as exciting as I do!

PS: There are one or two initiatives about Albanian cinema that I’m a little confused about, but here they are: the Albanian Cinema Project, and the Albanian Film Culture initiative, which has a “contribute” button to upload subtitles.

Hi there, that seems like a very fun project!

At some point I was looking into these too, to double check the grammatical errors in SQ Wikipedia, and I came across Peshperima and AlbanianASR as well, but I was mostly interested in doing something with spaCy for this.

Could be a nice thing to try out. We can meet up and check these options.

Taking notes from here:

@Erik this is quite interesting. Shall we have a small meetup with people that might be interested in finding a workaround? Mid January would be a good option…