Learning vector databases by doing

I was assigned the InvestorMatch.ai project. I think that was my 2nd or 3rd pick, so I’m happy about it. I’m excited by the toolset, and I’m looking forward to adding this project and tools to my resume. Who knows, maybe I’ll work with them at my next job.

How are you using AI in your project?

The goal of our project is to match start-up founders with vendors who can offer them services. As the name of the project suggests, we are using AI tools to implement this matching service. In fact, we are really only using one AI tool at the moment: Weaviate, an open-source vector database. Its client feels like a suite of AI tools in itself, because the database comes with a set of helpful modules that make it useful.

Without getting too specific (NDA), we are taking unstructured data (as well as important structured data) and vectorizing it. Weaviate lets you include modules for this, some of them for vectorizing common media types, including the text2vec modules for text data. Then we use Weaviate’s similarity search operators to find other vector objects (that is, start-up founders or service vendors) that are nearby in vector space. And therein lies the crux of the functionality our service should provide, all from one tool.
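To make "nearby in vector space" concrete, here is a toy sketch in plain Python. It is not our pipeline and it does not call Weaviate at all: the vendor names, the three-dimensional vectors, and the `nearest` helper are all made up for illustration. A real vectorizer module would produce vectors with hundreds of dimensions, and the database would use an index rather than a brute-force scan.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical, hand-written "embeddings" standing in for vectorized vendor data.
VENDORS = {
    "legal-services": [0.9, 0.1, 0.0],
    "cloud-hosting": [0.1, 0.9, 0.2],
    "bookkeeping": [0.7, 0.2, 0.4],
}


def nearest(query, objects, limit=2):
    """Brute-force search: rank every object by distance to the query vector."""
    ranked = sorted(objects, key=lambda name: cosine_distance(query, objects[name]))
    return ranked[:limit]


# Hypothetical founder vector; in practice this would come from the same vectorizer.
founder = [0.8, 0.2, 0.1]
matches = nearest(founder, VENDORS)
print(matches)  # ['legal-services', 'bookkeeping']
```

The founder vector points in nearly the same direction as the legal-services vector, so that vendor ranks first. That ranking step is the part we hand off to the database.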

This is a convenient and lightweight way to prototype our data pipeline at this point. Beyond that, Weaviate suits our future needs too. Our project sponsor wants to experiment with various distance metrics. Distance metrics are used to measure the similarity of data in vector space. Weaviate uses cosine distance by default, which measures the angle between two vectors and ignores their lengths. However, there is a menagerie of other distance metrics, ranging from intuitive to novel. Weaviate lets you configure its distance metric to any of five.
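For a feel of how the metrics differ, here is a small sketch that computes five of them on the same pair of vectors. The names match the five options as I understand them from the Weaviate documentation (cosine, dot, l2-squared, manhattan, hamming), but the implementations are my own plain-Python approximations, not Weaviate’s code, and the `METRICS` dictionary is just a convenience I made up.

```python
import math


def cosine(a, b):
    """1 - cosine similarity: depends only on the angle between the vectors."""
    dot_product = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot_product / norms


def dot(a, b):
    """Negative dot product, so that a smaller value still means 'more similar'."""
    return -sum(x * y for x, y in zip(a, b))


def l2_squared(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def manhattan(a, b):
    """Sum of absolute differences along each dimension."""
    return sum(abs(x - y) for x, y in zip(a, b))


def hamming(a, b):
    """Number of dimensions in which the two vectors differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


METRICS = {
    "cosine": cosine,
    "dot": dot,
    "l2-squared": l2_squared,
    "manhattan": manhattan,
    "hamming": hamming,
}

# Same direction, different length: cosine sees no difference, the others do.
a = [1.0, 0.0]
b = [2.0, 0.0]
results = {name: metric(a, b) for name, metric in METRICS.items()}
for name, value in results.items():
    print(f"{name}: {value}")
```

The two vectors point the same way, so the cosine distance is 0 while the squared Euclidean, Manhattan, and Hamming distances are all 1. That is the kind of difference our sponsor wants to experiment with.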

What are some pros and cons?

The pros are more numerous than the cons. For one, Weaviate is built to scale to large datasets without much manual effort. This is an integral feature of a database tool used for machine learning and data science, both of which rely on large datasets. Relatedly, Weaviate makes it easy to plug in a lot of popular machine learning tools, for example, the text2vec vectorizer modules. It also has an automatic classification feature that can categorize your data based on learned patterns. The latter is not something we’ve needed to use at this point, but I would like to try it.

As for the cons: Weaviate is a bit resource-intensive. I won’t be doing my work on my 12-year-old Lubuntu machine. Also, Weaviate is not the most popular vector database tool. Elasticsearch, Milvus, FAISS, Pinecone, and Annoy are all more widely used and known. Consequently, there are larger communities surrounding those tools, and more resources. However, the Weaviate documentation is pretty good, so we haven’t felt the consequences too harshly. Lastly, another possible con is that Weaviate allows for a great deal of customization. This usually comes at the cost of simplicity and ease of use. Weaviate may not be as intuitive or user-friendly as, say, Pinecone. But so far it has been entirely cromulent with regard to the learning curve.

Has it made you a better programmer?

Well, certainly in the sense that the primary way a programmer improves is by learning, I have improved. Weaviate and vector databases in general are new territory to me. I wouldn’t say Weaviate itself has made me a better programmer. But I will say that I am using other AI tools, namely ChatGPT, to direct my research, and it’s a fantastic way to learn quickly. Using a tool’s documentation in combination with ChatGPT sometimes gives me Neo feelings: “I know kung fu.” I know that ChatGPT is often spoken of in the context of academic dishonesty and the ethical dilemmas it raises. And those are interesting questions and situations. But as with any powerful tool, it matters how you use it.
