CozoDB: Database for AI applications

Volodymyr Pavlyshyn
Jun 10, 2024

--

CozoDB is a hidden gem for AI-powered applications: a hippocampus for AI, with embedded Datalog. It is a FOSS embeddable, transactional, relational-graph-vector database that works across platforms and languages, with time-travelling capability, perfect as long-term memory for LLMs and AI. The key is that it combines the critical data structures for any AI- and LLM-based application:

  • vector search and embeddings,
  • knowledge graphs,
  • classical relational databases.

Embedded Models

A database is embedded if it runs in the same process as your main program. This is in contradistinction to client-server databases, where your program connects to a database server (maybe running on a separate machine) via a client library. Embedded databases generally require no setup and can be used in a much wider range of environments.

We say CozoDB is embeddable instead of embedded since you can also use it in client-server mode, which can make better use of server resources and allow much more concurrency than in embedded mode.

Datalog

Recursion is especially important for graph queries. CozoDB’s dialect of Datalog supercharges it even further by allowing recursion through a safe subset of aggregations, and by providing extremely efficient canned algorithms (such as PageRank) for the kinds of recursions frequently required in graph analysis.

As you learn Datalog, you will discover that the rules of Datalog are like functions in a programming language. Rules are composable, and decomposing a query into rules can make it clearer and more maintainable, with no loss in efficiency. This is unlike the monolithic approach taken by SQL's nested select-from-where forms, which can sometimes read like code golf.
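As a sketch of this composability (the relation and data here are made up for illustration), a grandparent query can be split into named rules that read like small functions:

```
parent[] <- [['tom', 'bob'], ['tom', 'liz'], ['bob', 'ann']]
grandparent[g, c] := parent[g, p], parent[p, c]
?[g, c] := grandparent[g, c] # 'tom' is the grandparent of 'ann'
```

Each named rule can be reused by other rules in the same query, and the engine plans across all of them together.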

Joins with Datalog

r1[] <- [[1, 'a'], [2, 'b']]
r2[] <- [[2, 'B'], [3, 'C']]
?[l1, l2] := r1[a, l1],
             r2[a, l2] # the reused `a` performs the join: only a = 2 matches, giving 'b', 'B'

Graph model on top of relational data

Most existing graph databases start by requiring you to shoehorn your data into the labelled-property graph model. We don’t go this route because we think the traditional relational model is much easier to work with for storing data, much more versatile, and can deal with graph data just fine. Even more importantly, the most piercing insights about data usually come from graph structures implicit several levels deep in your data. The relational model, being an algebra, can deal with it just fine. The property graph model, not so much, since that model is not very composable.

Declaring a graph with the relational model

?[loving, loved] <- [['alice', 'eve'],
                     ['bob', 'alice'],
                     ['eve', 'alice'],
                     ['eve', 'bob'],
                     ['eve', 'charlie'],
                     ['charlie', 'eve'],
                     ['david', 'george'],
                     ['george', 'george']]
:replace love {loving, loved}

The graph we have created reads like "Alice loves Eve", "Bob loves Alice", "nobody loves David; David loves George, but George only loves himself", and so on. Here we used :replace instead of :create. The difference is that if love already exists, it is wiped and replaced with the new data given.
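With love stored, ordinary queries read from it using the *love syntax. For instance, to ask who loves Alice (from the rows above, the answer is Bob and Eve):

```
?[who] := *love[who, 'alice']
```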

We can investigate competing interests:

?[loved_by_b_e] := *love['eve', loved_by_b_e],
                   *love['bob', loved_by_b_e] # both love 'alice', so the result is 'alice'

One quite powerful feature is recursion in queries:

alice_love_chain[person] := *love['alice', person]
alice_love_chain[person] := alice_love_chain[in_person],
                            *love[in_person, person]
?[chained] := alice_love_chain[chained] # yields 'eve', 'alice', 'bob', 'charlie'

Vector Search
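Before turning to MinHash-LSH, it is worth sketching vector search itself. CozoDB supports HNSW indices over vector columns; the sketch below follows the syntax of the CozoDB manual, with a hypothetical doc relation and a 128-dimensional embedding column (the exact parameter values are illustrative assumptions, not tuned recommendations):

```
:create doc {id: Int => text: String, embedding: <F32; 128>}

::hnsw create doc:semantic {
    dim: 128,
    dtype: F32,
    fields: [embedding],
    distance: Cosine,
    m: 50,
    ef_construction: 20,
}

?[dist, id, text] := ~doc:semantic{id, text |
                         query: q,
                         k: 5,
                         ef: 50,
                         bind_distance: dist,
                     }, q = vec($query_embedding)

:order dist
```

As with the LSH queries below, the index probe (~doc:semantic) is just another atom in the rule body, so it composes with joins and recursion like any other relation.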

MinHash-LSH indices

Let’s say you collect news articles from the Internet. There will be duplicates, but these are not exact duplicates. How do you deduplicate them? Simple. Let’s say your article is stored thus:

:create article{id: Int => content: String}

To find the duplicates, you create an LSH index on it:

::lsh create article:lsh {
    extractor: content,
    tokenizer: Simple,
    n_gram: 7,
    n_perm: 200,
    target_threshold: 0.7,
}

Now if you do this query:

?[id, content] := ~article:lsh {id, content | query: $q }

then articles whose content is about 70% or more similar to the passed-in text in $q will be returned to you.

If you want, you can also mark the duplicates at insertion time. For this, use the following schema:

:create article{id: Int => content: String, dup_for: Int?}

Then at insertion time, use the query:
{
    ?[id, dup_for] := ~article:lsh {id, dup_for | query: $q, k: 1}
    :create _existing {id, dup_for}
}
%if _existing
%then {
    ?[id, content, dup_for] := *_existing[eid, edup],
                               id = $id,
                               content = $content,
                               dup_for = edup ~ eid # `~` coalesces: use edup unless it is null, else eid
    :put article {id => content, dup_for}
}
%else {
    ?[id, content, dup_for] <- [[$id, $content, null]]
    :put article {id => content, dup_for}
}
%end

For our own use-case, this achieves about 20x speedup compared to using the equivalent Python library. And we are no longer bound by RAM.

As with vector search, LSH-search integrates seamlessly with Datalog in CozoDB.

