Capstone Success

I started teaching myself Python almost five years ago. I have used it in professional settings, for personal projects, and for school projects. One thing I had never used it for, however, was machine learning.

My group’s project this term involves using a vector database to provide recommendations to startup founders and funders based on the similarity of their profiles. While learning about the vector database technology that we are using, I implemented functionality to vectorize strings without the help of the vector database, using a Python library called sentence-transformers.

Figuring out how to use a machine learning library was the biggest success in the course for me because it opened a door I didn’t realize was there. I enjoyed working with the library even though I don’t have any formal education in machine learning, or a deep understanding of how the underlying algorithms work. In the current moment, when the world is changing because of AI technology that depends on representing language with vectors, it feels like a breakthrough to have used that technology to accomplish a task, even in some small way.

Now I’ll provide some of the code that I wrote to create the vectors and compare their similarity.

Before that, a quick note about terminology: in Natural Language Processing (NLP), vectors that represent the semantic meaning of language are called embeddings, so I’ll use this term going forward. A vector is simply a point in multi-dimensional space, so all embeddings are vectors, but not all vectors are embeddings.
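
To make the terminology concrete, here’s a tiny illustration of the math behind cosine similarity, using made-up 3-dimensional vectors rather than real embeddings (real embeddings from the model we’ll load below have hundreds of dimensions). This is the same quantity that scikit-learn’s cosine_similarity computes for us later on.

import numpy as np

# Two made-up 3-dimensional vectors standing in for embeddings.
v1 = np.array([0.2, 0.1, 0.9])
v2 = np.array([0.3, 0.0, 0.8])

# Cosine similarity is the dot product divided by the product of the norms,
# i.e. the cosine of the angle between the two vectors.
similarity = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(similarity)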

First we’ll import the relevant libraries.

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

sentence-transformers provides pretrained models that create the embeddings, so we’ll tell the library which one to load.

model = SentenceTransformer("paraphrase-MiniLM-L6-v2")

Now we’ll create a function to generate the embeddings.

def get_embeddings(model: SentenceTransformer, strs: list[str]):
    # model.encode returns one embedding (a numpy array) per input string
    if strs:
        return model.encode(strs)
    return None
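
As a quick sanity check, we can call the function with a couple of strings and look at the shape of the result. The strings here are just examples I made up; the embedding dimension depends on the model (it should be 384 for paraphrase-MiniLM-L6-v2).

embeddings = get_embeddings(model, ["startup founder", "venture capital investor"])
# One embedding per input string, so we expect a 2 x 384 array here.
print(embeddings.shape)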

Once we have embeddings for a pair of inputs (each either a string or a list of strings), we can average the embeddings for each input down to a single vector and compute the cosine similarity between the two vectors to check how semantically similar the inputs are.

def check_similarity(s1: str | list[str], s2: str | list[str]):
    """
    Returns the cosine similarity of two strings (or lists of strings).
    """
    e1 = get_embeddings(model, [s1] if isinstance(s1, str) else s1)
    e2 = get_embeddings(model, [s2] if isinstance(s2, str) else s2)
    # Average the embeddings so each input collapses to a single vector.
    ae1 = np.mean(e1, axis=0)
    ae2 = np.mean(e2, axis=0)
    return cosine_similarity([ae1], [ae2])

This functionality is useful for comparing profiles property by property, as sketched below. More generally, we can use it to check the semantic similarity of any two strings.
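
For example, here is a rough sketch of how we could apply it to two profiles property by property. The profile fields and values are made up for illustration and aren’t our project’s actual data model.

# Hypothetical profiles; real profiles would have more (and different) properties.
founder_profile = {
    "industry": "machine learning tools for healthcare",
    "stage": "early stage, pre-seed",
}
funder_profile = {
    "industry": "healthcare and biotech software",
    "stage": "invests in pre-seed and seed rounds",
}

for prop in founder_profile:
    score = check_similarity(founder_profile[prop], funder_profile[prop])
    print(prop, score)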

In the IPython shell, we can run this function interactively to see which strings return a strong cosine similarity. A cosine similarity closer to 1 means the strings are more similar.

In [3]: check_similarity("I love you", "I strongly dislike you")
Out[3]: array([[0.40226793]], dtype=float32)

In [16]: check_similarity("I love you", "I care about you very deeply")
Out[16]: array([[0.5438149]], dtype=float32)

In [19]: check_similarity("I love you a lot", "I love you very much")
Out[19]: array([[0.9389606]], dtype=float32)

As you can see, the model is quite good at picking up on the semantic similarity of phrases. However, it isn’t without its limitations.

In [20]: check_similarity("I'm taking my car to the shop later", "I don't need to take my car to the shop later")
Out[20]: array([[0.7792265]], dtype=float32)

As you can see, the model may return a high similarity score for phrases with opposite meanings. This is likely because the model is picking up on the lexical similarity of the words in the two sentences without fully accounting for the negation signaled by “I don’t need.” The paraphrase-MiniLM-L6-v2 model we’re using is a smaller, more efficient model, so a larger, more sophisticated model would likely handle this case better. This highlights one of the long-standing challenges in NLP and machine learning, a challenge that the field has recently made dramatic progress on with transformer-based models such as BERT, and which has made technologies such as ChatGPT possible.
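
If we wanted to experiment with a larger model, we could swap one in by name when loading SentenceTransformer; for example, all-mpnet-base-v2 is one of the larger general-purpose models the library offers. I haven’t benchmarked it on the examples above, so whether it actually handles the negation case better would need to be checked empirically.

larger_model = SentenceTransformer("all-mpnet-base-v2")

# Re-run the negation example with the larger model to compare scores.
e1 = larger_model.encode(["I'm taking my car to the shop later"])
e2 = larger_model.encode(["I don't need to take my car to the shop later"])
print(cosine_similarity(e1, e2))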
