Sunday, June 1, 2025

Segmentation of Mahabharata sentences

Motivation:

The main problem I face in parsing the Mahabharata is my lacking Sanskrit skills. I'd say, thanks to my Marathi background, I have a basic understanding of the words and their meanings (they're pretty similar to Marathi words), but it's not as good as it should be.

As an example to illustrate my point, consider this word: samāsīnānabhyagacchadbrahmarṣīnsaṃśitavratān.
It's a mouthful, isn't it? It's much better understood in its unsandhied form:
[samāsīnān abhyagacchat brahmarṣīn saṃśita- vratān.]

  • samāsīnān = sitting together
  • abhyagacchat = he approached
  • brahmarṣīn = the brahmarishis
  • saṃśita- = firm
  • vratān = vows
"He approached the brahmarishis who were sitting firm in their vows"

I want each of my verses to have an unsandhied form accompanying it.

The problem is that, currently, there's no good Sanskrit sentence segmenter out there that can split sentences into words and desandhify words into their constituents. I looked around and found two options:

  1. Vidyut.Sandhi: https://vidyut.readthedocs.io/en/latest/sandhi.html
    It's a Python package built by the folks at Project Ambuda. I assumed it would do the job, but I saw a note somewhere on the site that the sandhi-split tool is deprecated and that they suggest using the Dharmamitra API instead.

  2. This brings me to Dharmamitra. I didn't use their API (for which one can refer to this Python package from one of the makers of Dharmamitra), but I did use their Sanskrit model to get each of the sentences parsed.

Execution:

In this section, I'll list out, in loose order, the steps I took and the problems I faced in getting this done.
  • My first instinct was to get it running locally on my home computer. The model seems small (~2.5GB), which is manageable for my machine. The problem is, the model page on Huggingface is quite bare in terms of documentation and doesn't provide enough information on how to run the model for inference: what format the input should be in, what format to expect the output in, and so on.
  • Fortunately, after a bit of stumbling around, I found this convenient repo by the model creator, which has example code for running the model.
  • I modified the code to test it on a single line and a batch of my Mahabharata data, which worked.
  • However, it wouldn't scale to the amount of data I had. At a max_length of 512, the model would take ~6 hrs to go through my entire dataset, with the GPU running at full capacity. I didn't want to worsen my GPU's health in any way, so I opted for Google's cloud platform.
  • Google offers $300 worth of free compute credits when starting out, but I couldn't get the Vertex AI platform running. In the end, I went for Colab Pro. (I tried the free version of Colab first, but it wasn't as fast, and it shut down my session midway for some reason.)
  • I got 100 compute units from Google for ~$11, which gave me access to an A100 GPU with 40GB of memory. I ran my code at a batch size of 1024 with a max_length of 512 tokens, and it finished in ~1.5 hrs. (A rough sketch of this inference loop follows this list. This is when it was successful; before reaching that point, I still faced some problems getting it to this stage, which are listed in the next sub-section. That list is just for posterity, so that I don't make the same mistakes again. You can skip it and move on to the one after.)
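For reference, here is a minimal sketch of the batched inference loop, assuming a standard seq2seq model served through the transformers library. The model id is a placeholder (substitute the actual Dharmamitra model on Huggingface), and any task-specific prompt formatting from the creator's example repo is omitted:

```python
# Sketch of batched seq2seq inference with Hugging Face transformers.
# MODEL_ID is a placeholder -- substitute the actual Dharmamitra model id.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_ID = "<dharmamitra-sanskrit-model>"  # placeholder, not the real id
BATCH_SIZE = 1024
MAX_LENGTH = 512   # max tokens for the encoded input AND the generated output

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID).to(device)
model.eval()

def analyze_batch(sentences):
    """Tokenize one batch of sentences, generate, and decode the outputs."""
    inputs = tokenizer(
        sentences,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=MAX_LENGTH,
    ).to(device)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_length=MAX_LENGTH)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

def analyze_all(sentences):
    """Walk the whole dataset in chunks of BATCH_SIZE."""
    results = []
    for i in range(0, len(sentences), BATCH_SIZE):
        results.extend(analyze_batch(sentences[i : i + BATCH_SIZE]))
    return results
```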
Things that didn't work:
Even after I had finally settled on the compute platform, I faced some problems getting the right output.
  • I tried to experiment with max_length, setting it to 100 (which would speed up the inference), since most input sentences were less than 100 characters long. Turns out, max_length is the max number of tokens in the encoded input **and output**, which is almost always greater than the number of characters in the input sentence. This resulted in most outputs being cut off and therefore unusable.
  • Second, the model gets messed up every time there's a ';' in the input sentence: it stops reading the rest of the sentence once it encounters the semicolon. I had already split the input verses on lines when creating the dataset; I also had to split them on semicolons (see the sketch after this list).
  • I tried splitting the sentences at some max character length, so that I could run the model at a lower max_length (token) setting, but that seemed too complicated, and at this point I'd already chosen to go with max_length=512, which did the job.
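As a small illustration of the preprocessing that fell out of these issues, the verse-splitting step looks roughly like this (the function name is just for this sketch):

```python
import re

def preprocess_verse(verse: str) -> list[str]:
    """Split a verse on newlines and semicolons, since the model stops
    reading at the first ';'. Empty fragments are dropped."""
    fragments = re.split(r"[;\n]", verse)
    return [frag.strip() for frag in fragments if frag.strip()]

# e.g. preprocess_verse("part one; part two\npart three")
#      -> ["part one", "part two", "part three"]
```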

Making the word dict:

My end goal was to get a dict of all Mahabharata words as keys with their split forms as values.
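Concretely, using the example word from the motivation section, the target shape is something like this (the variable name is just illustrative):

```python
# Keys are words as they appear in the text; values are their unsandhied splits.
word_dict = {
    "saṃśitavratān": ["saṃśita", "vratān"],
    "samāsīnānabhyagacchadbrahmarṣīnsaṃśitavratān": [
        "samāsīnān", "abhyagacchat", "brahmarṣīn", "saṃśita", "vratān",
    ],
}
```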

Once I had the analyzed output from the previous step, I went through the entire dataset one sentence at a time, split that sentence into words, and then wrote some logic to map each word to one or more split words in the analyzed version of the sentence.

This was the most brainy part of the project: getting that one-to-many relation right.

I used a two-pointer scheme: one pointer into the source unsplit word array (A1), and a second into the analyzed (with word-splits) word array (A2). Each word in A1 can match the currently-pointed-at word in A2, or an ordered combination of words in A2 starting from the currently-pointed-at word. For combining two or more words from A2, I used the Python package sandhi. The words don't always match each other exactly in terms of string similarity, but you can get away with them being almost similar.
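Here is a minimal sketch of that two-pointer mapping. The sandhi recombination is stood in for by plain concatenation (the real run used the sandhi package), and the "almost similar" check uses difflib purely as an illustration:

```python
from difflib import SequenceMatcher

def similar(a: str, b: str, threshold: float = 0.85) -> bool:
    """Fuzzy match: the source word and the recombined split words rarely
    match exactly, so 'almost similar' is good enough."""
    return SequenceMatcher(None, a, b).ratio() >= threshold

def combine(words: list[str]) -> str:
    """Stand-in for re-applying sandhi to a run of split words.
    The real code uses the `sandhi` package; concatenation is a rough proxy."""
    return "".join(w.rstrip("-") for w in words)

def map_words(source_words: list[str], analyzed_words: list[str]) -> dict[str, list[str]]:
    """Two-pointer mapping: each source word (A1) matches one or more
    consecutive analyzed words (A2), starting at the current A2 pointer."""
    mapping: dict[str, list[str]] = {}
    j = 0                                  # pointer into A2 (analyzed_words)
    for word in source_words:              # pointer over A1 (source_words)
        for span in range(1, len(analyzed_words) - j + 1):
            candidate = analyzed_words[j : j + span]
            if similar(word, combine(candidate)):
                mapping[word] = candidate
                j += span
                break
        else:
            mapping[word] = []             # nothing matched; A2 pointer stays put
    return mapping
```

This greedy version takes the first combination that clears the threshold; the actual logic needs a bit more care so that it doesn't stop one split word short of the best match.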


Finally, I have the word dict ready, with around 200K entries. I'd say it has ~90% accuracy in word splits, but that remains to be formally verified.

Future ideas:
I could build my own low-resource model that deals only with word splitting, based on the data from Ambuda. For example, look at this representation of the Ramayana, where they already provide word splits. I think I can use their data to train a model that does the job.

