I’ve been playing around with Common Crawl data lately - training some Hindi language models. So I had about 25 GB of text to process, which I suppose is tiny compared to English Common Crawl (in the TB range), but it’s right on the cusp of having to think a little about how to deal with it in the most efficient manner.
Picture the idyllic, perfect day. Warm, but not overly so. Sunny, but not enough to be oppressive.
Machine Learning is firmly grounded in math - but why does it never feel like that day to day? When I sit down to solve a math problem, not only is there a definitive answer, but more often than not there’s an ideal way to approach the question. This ideal way never changes. When I sit down to solve an ML problem on the other hand, I always have to go through the same steps-
The other day, I was helping out some people at my research lab write a couple of sentiment analysis classifiers for the Hindi language, and I was surprised to find that almost every famous English language model (GloVe, fastText, Word2vec) has a Hindi equivalent that is very easy to find. I was even more surprised to find that, despite this, there is a dearth of well put together, annotated datasets for modern NLP tasks.
For the second time in two years now, I’ve decamped and moved cities. Unexpectedly, there’s a fair amount of culture shock. In the past year I’ve gotten accustomed to being in a very quiet city and Delhi is certainly not that. Plus I’ve had relatives over, other relatives passing away and social obligations that I can’t postpone, so things have been fairly busy; a far cry from how introverted I’d become in Hyderabad.
Around this time last year, I had a few goals in mind for this year. However, I didn’t exactly write them down - and as most of my friends will attest, my memory isn’t the best, so I’m kinda winging it.
This week, two small changes to the way I built two models resulted in an absurd boost in the performance of both, to the extent that it doesn’t even make sense for the models to be this good. The obvious conclusion then becomes that I’ve made some mistake somewhere. Thing is, I have absolutely no idea where the mistake lies - but in the attempt to figure it out, I was forced to take a closer look at both the model and the data, aided by a healthy dose of skepticism towards what the models do and how good their performance can be.
A project underway at the lab in IIITDelhi that I’m volunteering at involves collecting a decently large collection of tweets. The bot needs to run at regular intervals, hit the search API with a series of keywords and store the returned tweets. More than that, it needs to be written completely in Python and be fairly difficult for a bored undergrad to break. I chose Gramex - since I work for the company that builds it, and I hadn’t had a chance to test our TwitterRESTHandler yet. The source code is available here, but the entire thing runs in 2 files - a gramex.yaml configuration file that configures the endpoints and the scheduler and a
This is probably anecdotal for the moment, but the majority of tech industry people I know are all focused on solving engineering problems for people outside this country.
I’m spending most of this long, discontinuous weekend running through all the fast.ai notebooks. While I find this course really helped me get a theoretical understanding of what neural nets are and how they work, it takes me an inordinate amount of time to produce actual code - probably because I never wrote anything while doing the course.
It’s been a fairly crazy year - which makes the theme of this post all the weirder. Between weird stuff going on with my family, ripping (and subsequently getting surgery for) my ACL and moving apartments in Hyderabad, a lot has happened recently.
This post is going to be a small repository of information about Yoon Kim’s paper on using Convolutional Neural Networks for Sentence Classification.