Levenshtein taking forever? Try mbleven if your Levenshtein computation is slow and your bounding parameter is less than 3.


Levenshtein distance is the minimum number of insertions, deletions, and symbol substitutions required to transform string a into string b.

Example: Consider string a: mouse & string b: morse

The Levenshtein distance between string a and string b is 1: substituting u with r transforms string a into string b. (If substitutions were not allowed, you would need 2 edits: delete u and insert r.)
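
The definition above can be sketched with the classic dynamic-programming recurrence. This is a minimal illustration, not the implementation from the article: the function name `levenshtein` and the `sub_cost` parameter are assumptions added here. With the default `sub_cost=1` a substitution counts as one edit; with `sub_cost=2` a substitution costs the same as a delete plus an insert.

```python
def levenshtein(a: str, b: str, sub_cost: int = 1) -> int:
    """Edit distance between a and b, computed with a rolling DP row."""
    # prev[j] is the distance between the prefix of a seen so far and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                                  # delete ca
                curr[j - 1] + 1,                              # insert cb
                prev[j - 1] + (0 if ca == cb else sub_cost),  # match or substitute
            ))
        prev = curr
    return prev[-1]


print(levenshtein("mouse", "morse"))              # 1 (substitute u with r)
print(levenshtein("mouse", "morse", sub_cost=2))  # 2 (delete u, insert r)
```

Keeping only two rows at a time means memory grows with the length of string b rather than with the product of both lengths.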

There is also one other modification to Levenshtein distance:

  1. Damerau-Levenshtein follows exactly the same approach as Levenshtein distance, but it also allows transpositions (swapping of adjacent symbols), so swapping two adjacent symbols counts as a single edit instead of two.
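
As a rough sketch of how transpositions fit into the same recurrence, the snippet below implements the restricted variant usually called optimal string alignment (OSA), which is an assumption on my part: it is the simplest form of Damerau-Levenshtein and does not allow a substring to be edited more than once. The function name `osa_distance` is hypothetical.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Transposition: the last two symbols of both prefixes are swapped.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("form", "from"))  # 1 (one swap; plain Levenshtein gives 2)
```

The only change from plain Levenshtein is the extra transposition check, so the running time stays proportional to the product of the two string lengths.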

In the above example, Damerau-Levenshtein…

An intuitive explanation to tackle overconfidence in Machine Learning.

A certain death of an artist is overconfidence — Robin Trower

Remember the time you switched off Google Maps because you were so sure you knew the way to your destination, only to find the road shut due to construction, and you had to use Google Maps after all? That was you being overconfident. How do we tackle overconfidence in machines? Sometimes it is easy to confuse one thing with another. Hence, it is better to be just a little less confident about the things you are 100% confident about.

source: unsplash

Label Smoothing…
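
The teaser is cut off here, so what follows is only a minimal sketch of the idea of being "a little less confident", assuming the common formulation of label smoothing: each one-hot target is mixed with a uniform distribution over the K classes. The function name `smooth_labels` and the default `epsilon=0.1` are illustrative assumptions.

```python
def smooth_labels(one_hot, epsilon=0.1):
    """Mix a one-hot target with a uniform distribution over the classes."""
    k = len(one_hot)  # number of classes
    # The true class keeps (1 - epsilon) of the mass plus its uniform share;
    # every other class receives epsilon / k.
    return [y * (1.0 - epsilon) + epsilon / k for y in one_hot]


# A 4-class target that was 100% confident in class 2.
smoothed = smooth_labels([0, 0, 1, 0], epsilon=0.1)
print(smoothed)  # roughly 0.925 for the true class and 0.025 for the others
```

The smoothed target still sums to 1, so it can be used wherever a probability distribution over classes is expected.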

Why is going back to the basics always helpful? The very basics of supervised and unsupervised learning.

With the help of machine learning, a system can make decisions comparable to the ones humans make. A machine learning system learns from the data it is given. For now, machine learning tries to imitate the way humans learn in a computationally efficient manner.

Supervised learning uses a dataset consisting of a set of features and the corresponding target label. These features are also known as predictors because they help predict the target label for each data sample.

For example: Imagine giving a kid an ice-cream whenever he finishes his…

Be truthful to your model and predictions by using LIME (Local Interpretable Model-agnostic Explanations)

Image created by author

We like to believe that this is an era of ML and AI, but what we are forgetting is that this is also an era of black-box models, which provide no justification or interpretation of the classifier and the decisions it makes.

One way to make a change is to audit black-box models. Wait… what are black-box models?

A black-box model is one whose inner workings are completely dark: one can only observe the input and output variables, not what is going on inside it. …

My father gave me an Excel sheet with his work details and asked me to infer anything and everything that I could. This is what I did.


Wikipedia says “EDA is an approach to analyzing datasets to summarize their main characteristics, often with visual methods”. It is an approach which will help you build a better relationship with the new dataset. If you use it wisely, half of your analysis is done.

For example: When you are buying something online, do you read the reviews? Do you see the price? Do you see the color? Do you swipe through the…

Random little things that I learnt during my internship (P.S. these are not just technical things)

Image via Author

Currently, I am a Master’s in Data Science Student at NYU, Center for Data Science. I completed my first ever internship in Data Science and I would love to tell you all about it.

I worked for a contextual-AI company, where we detect toxic behaviors from various platforms.

God! Did I have fun? The answer is — hell YES!

One major thing to take away is how Academic Data Science differs from Industry Data Science. But, the transition is what makes it so exciting.

A few random things I learnt during my first Data Science Internship —

  1. Ability to write Production-Level Code: Even though you have vast ML knowledge, you need to know…

Do not have enough resources to get your own labelled data for your classification task? Try thinking out-of-the-box and actively start learning “Active Learning”. Let your algorithm choose what to learn from.


We talk about working on data all the time to gain specific insights related to our projects, our experiences, maximising profits, increasing revenue and so many other reasons. This is what everyone thinks about all the time. But let us take a step back and think in terms of “actual” data by taking a break from our “hypothetical” data.

It is not always easy to get data. No company will give it to you for free. There are a lot of trades that take place. For instance, you help the other company monetarily and they will help you with the…

What makes your data insightful? How do you rank your insights?


What is the first word that comes to your mind when you hear the word — insight? As a Data Scientist, I refer to the word insight when I collect some useful information from my data. I believe there is an order which is followed by Data, Information and Insight.

There is a famous hierarchical structure that comes to mind after looking at these three words together:

Tweet Preprocessing!

Learn how to preprocess tweets using Python


Note from the editors: Towards Data Science is a Medium publication primarily based on the study of data science and machine learning. We are not health professionals or epidemiologists, and the opinions of this article should not be interpreted as professional advice. To learn more about the coronavirus pandemic, you can click here.

Just to give you a little background as to why I am preprocessing tweets: Given the current situation as of May, 2020, I am interested in the political discourse of the US Governors with respect to the ongoing pandemic. I would like to analyse how did the…

Parthvi Shah

Masters in Data Science @ NYU

