Motivation:
The main problem I face in parsing the Mahabharata is my lacking Sanskrit skills. I'd say that, thanks to my Marathi background, I have a basic understanding of words and their meanings (they're pretty similar to Marathi words), but it's not as good as it should be.
As an example to illustrate my point, consider this word: samāsīnānabhyagacchadbrahmarṣīnsaṃśitavratān.
It's a mouthful, isn't it? It's much better understood in its unsandhied form:
[samāsīnān abhyagacchat brahmarṣīn saṃśita- vratān.]
- samāsīnān = sitting together
- abhyagacchat = he approached
- brahmarṣīn = the brahmarishis
- saṃśita- = firm
- vratān. = vows
I want each of my verses to have an unsandhied form accompanying it.
The problem is that currently there's no good Sanskrit sentence segmenter out there that can split sentences into words and desandhify words into their constituents. I looked around and found two options:
- Vidyut.Sandhi: https://vidyut.readthedocs.io/en/latest/sandhi.html
It's a Python package built by the folks at Project Ambuda. I assumed it would do the job, but I saw a note somewhere on the site that the sandhi-split tool is deprecated and that they suggest using the Dharmamitra API.
- This brings me to Dharmamitra. I didn't use their API (for which one can refer to this Python package from one of the makers of Dharmamitra), but I did use their Sanskrit model to get each of the sentences parsed.
- My first instinct was to get it running locally on my home computer. The model seems small (~2.5GB), which is manageable for my machine. The problem is that the model page on Hugging Face is quite bare in terms of documentation and doesn't provide enough information on how to run the model for inference - which format the input should be in, what format to expect the output in, and so on.
- Fortunately, after a bit of stumbling around, I found this convenient repo by the model creator, which has example code for running the model.
- I modified the code to test it on a single line and a batch of my Mahabharata data, which worked.
- However, it wouldn't scale to the amount of data I had. At a max_length of 512, the model would take ~6 hours to go through my entire dataset, with the GPU running at full capacity. I didn't want to worsen my GPU's health in any way, so I opted for Google's cloud platform.
- Google offers $300 worth of free compute credits when starting out, but I couldn't get the Vertex AI platform running. In the end, I went for Colab Pro. (I tried the free version of Colab first, but it wasn't as fast, and it shut down my session midway for some reason.)
- I got 100 compute units from Google for ~$11, which gave me access to an A100 GPU with 40GB of memory. I ran my code at a batch size of 1024 with a max_length of 512 tokens, and it finished in ~1.5 hours; a rough sketch of the inference loop is below. (This is when it was successful. Before reaching that point, I faced some problems getting it to this stage, which are listed in the next sub-section. This is just for posterity, so that I don't make the same mistakes again. You can skip it and move on to the next one.)
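
For my own future reference, here's a minimal sketch of the kind of batched inference loop I ended up with. This is not the exact code from the Dharmamitra example repo; the model ID is a placeholder, and the tokenizer/generation calls are just the standard transformers seq2seq pattern.

```python
# Minimal sketch of a batched seq2seq inference loop with transformers.
# NOTE: the model ID below is a placeholder, not the real Dharmamitra checkpoint;
# see the model card / example repo for the actual ID and input format.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_ID = "some-org/sanskrit-segmentation-model"  # placeholder
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID).to(DEVICE).eval()

def segment_batches(sentences, batch_size=1024, max_length=512):
    """Run sandhi segmentation over a list of sentences, one batch at a time."""
    outputs = []
    for i in range(0, len(sentences), batch_size):
        batch = sentences[i : i + batch_size]
        enc = tokenizer(
            batch,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=max_length,  # caps the encoded input
        ).to(DEVICE)
        with torch.no_grad():
            # max_length here also caps the *generated* output tokens,
            # which is why 100 was too small (see the notes below).
            gen = model.generate(**enc, max_length=max_length)
        outputs.extend(tokenizer.batch_decode(gen, skip_special_tokens=True))
    return outputs
```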
Even when I had finally settled on the compute platform, I faced some problems getting the right output.
- I tried experimenting with max_length by setting it to 100 (which would speed up inference), since most input sentences were less than 100 characters long. It turns out max_length is the maximum number of tokens in the encoded input **and output**, which is almost always greater than the number of characters in the input sentence. This resulted in most outputs being cut off and therefore unusable.
- Second, the model gets messed up every time there's a ';' in the input sentence: it stops reading the rest of the sentence once it encounters the semicolon. I had already split the input verses on lines when creating the dataset; I also had to split them on semicolons (see the sketch after this list).
- I tried splitting the sentences at some maximum character length so that I could run the model at lower max_length (token) settings, but that seemed too complicated, and at this point I'd already chosen to go with max_length=512, which did the job.
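
For posterity, a rough sketch of the splitting step, assuming the verses are just a list of raw strings; the function name and the (verse index, segment) layout are my own illustration, not the actual dataset code.

```python
# Sketch of the preprocessing: split each verse on newlines and on semicolons,
# and remember which verse every segment came from so that the model outputs
# can be stitched back onto their verses later.
def explode_verses(verses):
    """verses: list of raw verse strings -> list of (verse_index, segment)."""
    segments = []
    for verse_idx, verse in enumerate(verses):
        for line in verse.splitlines():
            for piece in line.split(";"):
                piece = piece.strip()
                if piece:
                    segments.append((verse_idx, piece))
    return segments
```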
This was the most brainy part of the project: getting that one-to-many relation right, where a single verse maps to multiple split segments whose outputs have to be recombined.
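
Concretely, the regrouping looks something like this (continuing the sketch above; explode_verses and segment_batches are the illustrative helpers from the earlier snippets):

```python
from collections import defaultdict

def regroup(segments, segmented_outputs):
    """Stitch model outputs back onto their source verses.

    segments:           list of (verse_index, segment) from explode_verses()
    segmented_outputs:  model outputs in the same order as segments
    """
    by_verse = defaultdict(list)
    for (verse_idx, _), out in zip(segments, segmented_outputs):
        by_verse[verse_idx].append(out)
    # One verse -> many segments -> one joined unsandhied string per verse.
    return {verse_idx: " ".join(parts) for verse_idx, parts in by_verse.items()}

# Usage (with the helpers sketched earlier):
#   segments = explode_verses(verses)
#   unsandhied = regroup(segments, segment_batches([s for _, s in segments]))
```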
I could build my own low-resource-cost model that only deals with word splitting, based on the data from Ambuda. For example, look at this representation of the Ramayana, where they already provide word splits. I think I can use their data to train a model that does the job.
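
A back-of-the-envelope sketch of what the training data for such a model could look like, reusing the word from the motivation section; the 'sandhied'/'split' field names are my assumption about the corpus format, not Ambuda's actual schema.

```python
# Hypothetical shape of the training pairs for a word-splitting-only model:
# map each sandhied string to its word-split form, then fine-tune a small
# seq2seq model on these (input, target) pairs.
def make_training_pairs(corpus):
    """corpus: iterable of dicts with 'sandhied' and 'split' fields (assumed)."""
    return [{"input": entry["sandhied"], "target": entry["split"]} for entry in corpus]

# e.g. one pair would look like:
# {"input":  "samāsīnānabhyagacchadbrahmarṣīnsaṃśitavratān",
#  "target": "samāsīnān abhyagacchat brahmarṣīn saṃśita-vratān"}
```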