🎉 🎉 Exciting news from INFINI Labs! We’ve officially open-sourced our products on GitHub. 👉 Check it out here: http://github.com/infinilabs
Build a Vector Extension for Postgres - Introduction

Build a Vector Extension for Postgres - Introduction

Table of Contents

Why and What

Vector databases are really hot topics nowadays. I have always been curious about what they are and how they work under the hood, so let’s build one ourselves. Building a whole new database from scratch is not practical, we need some building blocks, or, just a real database system. Postgres has a long-standing reputation for its extensibility, which makes it a perfect fit for our needs, and projects like pgvector have already demonstrated it is viable to add vector support to Postgres as an extension.

We are going to implement vector support for Postgres, but, what are the detailed features to implement? This is not a hard question, the definition of Vector database from Wikipedia shows us the right direction:

A vector database, vector store or vector search engine is a database that can store vectors (fixed-length lists of numbers) along with other data items. Vector databases typically implement one or more Approximate Nearest Neighbor algorithms so that one can search the database with a query vector to retrieve the closest matching database records

Alright, so we need to enable Postgres to store vectors, and be able to perform Top-K queries, i.e., for a given input vector, Postgres should return the K vectors that are most similar (or closest) to it. If we express them in SQL, it would look like this:

-- Create a table, which has a column of type `vector(3)`, 3 is the dimension of the vector
CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));

-- Insert vectors, Postgres should store them!
INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]');

-- Now, Postgres should return the Top-5 vectors that are most similar to
-- [3, 1, 2]
SELECT * FROM items ORDER BY embedding <=> '[3,1,2]' LIMIT 5;

Example SQL makes things clear, to summarize, we need to:

  1. Implement a vector type for Postgres, it should accept a dimension parameter
  2. Implement that <=> binary operator, which should calculate the similarity of the given 2 vectors and return it

Set up the environment

I will use the Rust language and a library called pgrx, to install Rust you simply need to follow the instructions here, then you run this command to set up cargo-pgrx, a cargo sub-command to manage everything related to pgrx:

$ cargo install --locked cargo-pgrx
$ cargo pgrx --version # to verify that it gets installed

Now we need a Postgres server to run and test our project, I would just let pgrx install a brand new Postgres for me to make things easier. At the time of writing, Postgres 17 is the latest version, so I will use it.

pgrx builds Postgres from source, so you need to ensure these requirements are satisfied. pgrx also has a page about the system requirements, but Postgres is really well-documented, it deserves a read. Once you have everything set up, run:

$ cargo pgrx init --pg17 download

The initial commit

Now let’s write some code, cargo pgrx, just like cargo, provides a new sub-command to create new projects, say we call our project pg_vector_ext, run:

$ cargo pgrx new pg_vector_ext
$ cd pg_vector_ext
$ tree .
pg_vector_ext/
├── Cargo.toml
├── pg_vector_ext.control
├── sql
└── src
    ├── bin
    │   └── pgrx_embed.rs
    └── lib.rs

4 directories, 4 files

From this, we can see, pgrx creates some template files for us. For now, we only care about the src/lib.rs file.

$ bat src/lib.rs
───────┬───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
       │ File: src/lib.rs
───────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1   │ use pgrx::prelude::*;
   2   │
   3   │ ::pgrx::pg_module_magic!();
   4   │
   5   │ #[pg_extern]
   6   │ fn hello_pg_vector_ext() -> &'static str {
   7   │     "Hello, pg_vector_ext"
   8   │ }
   9   │
  10   │ #[cfg(any(test, feature = "pg_test"))]
  11   │ #[pg_schema]
  12   │ mod tests {
  13   │     use pgrx::prelude::*;
  14   │
  15   │     #[pg_test]
  16   │     fn test_hello_pg_vector_ext() {
  17   │         assert_eq!("Hello, pg_vector_ext", crate::hello_pg_vector_ext());
  18   │     }
  19   │
  20   │ }
  21   │
  22   │ /// This module is required by `cargo pgrx test` invocations.
  23   │ /// It must be visible at the root of your extension crate.
  24   │ #[cfg(test)]
  25   │ pub mod pg_test {
  26   │     pub fn setup(_options: Vec<&str>) {
  27   │         // perform one-off initialization when the pg_test framework starts
  28   │     }
  29   │
  30   │     #[must_use]
  31   │     pub fn postgresql_conf_options() -> Vec<&'static str> {
  32   │         // return any postgresql.conf settings that are required for your tests
  33   │         vec![]
  34   │     }
  35   │ }
───────┴───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

Ignore the tests module (as it is for testing), we can see that pgrx creates a function hello_pg_vector_ext(), this is something callable in SQL, if we run the project via:

Before running it, you need to edit the Cargo.toml file, within the features section, change the default feature to pg17, and optionally, you can remove the pg* features other than pg17 as they won’t be used:

[features]
default = ["pg17"]
pg17 = ["pgrx/pg17", "pgrx-tests/pg17" ]
pg_test = []
$ cargo pgrx run

It will start the Postgres 17 instance and connect to it via psql, we can install our extension and run the function:

pg_vector_ext=# CREATE EXTENSION pg_vector_ext;
CREATE EXTENSION
pg_vector_ext=# SELECT hello_pg_vector_ext();
 hello_pg_vector_ext
----------------------
 Hello, pg_vector_ext
(1 row)

This is our first attempt at pgrx and also our first commit to the project. In the next post, I will implement the vector type so that Postgres can store vectors.