Building a Vector-Based Index for Semantic Queries

A Smarter Book Database: OpenAI Embeddings + Pinecone.

A common limitation of traditional databases is that search matches only exact keywords, not semantic meaning. This is limiting, because natural language is dense with nuance: the kind a human would pick up on in conversation. As technology continues to bridge that communication gap, it finds its solution in vectors.

A vector is a list of numbers (1,536 of them for the model used here) that encodes meaning, produced by passing text through an embedding model. The embedding is semantic: it captures not just the surface words but their context, concepts, themes, and tone. The resulting vector is stored in the Pinecone database alongside its metadata, carrying meaning as rich as the information used to create it. Writing a record this way is called "upserting", a blend of "update" and "insert": if a vector with the same ID already exists it is overwritten, otherwise a new one is added, so records can be continually refreshed as their metadata changes.

For my use case, I put together a list of 100 books published 100 or more years ago, with help from ChatGPT. Each book was then described with the following metadata fields:

  • Title
  • Author
  • Publication year
  • Genre
  • Themes
  • Summary
  • Influences
  • Impact
  • Source language
  • Word count
  • Public domain status
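
As a sketch, one record with these fields might look like the Python dictionary below. The field names and values are illustrative, not entries from the actual dataset.

```python
# One illustrative record; values are examples, not the real dataset.
record = {
    "id": "book-001",
    "metadata": {
        "title": "Pride and Prejudice",
        "author": "Jane Austen",
        "publication_year": 1813,
        "genre": "Novel of manners",
        "themes": ["marriage", "class", "first impressions"],
        "summary": "Elizabeth Bennet navigates courtship, family pressure, "
                   "and her own prejudices in Regency-era England.",
        "influences": "18th-century epistolary fiction",
        "impact": "A touchstone of English literature, endlessly adapted",
        "source_language": "English",
        "word_count": 122000,
        "public_domain": True,
    },
}
```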

An index named "100-classic-books" was created in Pinecone, and the records were processed in Python in three steps:

  1. Embed each record with OpenAI's "text-embedding-3-small" model
  2. Upsert the resulting vectors into the Pinecone index
  3. Query the database with natural language
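
Steps 1 and 2 can be sketched as below, assuming the `openai` and `pinecone` client libraries and API keys in environment variables. The helper names and the batching choice are my own assumptions, not the original project's code.

```python
import os

def to_pinecone_vector(book_id, embedding, metadata):
    """Shape one record for Pinecone's upsert() call: id, values, metadata."""
    return {"id": book_id, "values": embedding, "metadata": metadata}

def embed_and_upsert(books):
    """Embed each book's text with text-embedding-3-small and upsert the batch."""
    from openai import OpenAI       # pip install openai
    from pinecone import Pinecone   # pip install pinecone

    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    index = Pinecone(api_key=os.environ["PINECONE_API_KEY"]).Index("100-classic-books")

    vectors = []
    for book in books:
        # Concatenate a few descriptive fields into the text to be embedded.
        text = f"{book['title']} by {book['author']}: {book['summary']}"
        resp = client.embeddings.create(model="text-embedding-3-small", input=text)
        vectors.append(to_pinecone_vector(book["id"], resp.data[0].embedding, book))
    index.upsert(vectors=vectors)
```

Embedding the title, author, and summary together gives the vector more context than any single field alone, which is what makes the later queries semantically rich.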

Each query returns the five most relevant results, and the impact of semantic search is apparent immediately. The hidden associations the embedding model captures, though not perfect, are a step closer to natural understanding and communication.
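
The query step can be sketched the same way: embed the question with the same model, then ask Pinecone for the nearest vectors. The question text and helper names below are illustrative assumptions.

```python
import os

def format_matches(matches):
    """Render Pinecone matches as 'title (score)' lines."""
    return [f"{m['metadata']['title']} ({m['score']:.3f})" for m in matches]

def search(question, top_k=5):
    """Embed a natural-language question and return the top-k matches."""
    from openai import OpenAI       # pip install openai
    from pinecone import Pinecone   # pip install pinecone

    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    index = Pinecone(api_key=os.environ["PINECONE_API_KEY"]).Index("100-classic-books")

    emb = client.embeddings.create(
        model="text-embedding-3-small", input=question
    ).data[0].embedding
    result = index.query(vector=emb, top_k=top_k, include_metadata=True)
    return format_matches(result["matches"])
```

A call like `search("a tragic love story shaped by social class")` would return five titles ranked by semantic similarity rather than keyword overlap.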
