Database

RUM: improved inverted index for full-text search based on GIN index

RUM is an extension which adds a RUM index to Postgresql.

RUM index is based on GIN that stores additional per-entry information in a posting tree. For example, positional information of lexemes or timestamps. In comparison to GIN it can use this information to make faster index-only scans for:

  • Phrase search
  • Text search with ranking by text distance operator
  • Text SELECTs with ordering by some non-indexed additional column e.g. by timestamp.

RUM works best in scenarios when the possible keys are highly repeatable. I.e. all texts are composed of a limited amount of words, so per-lexeme indexing gives significant speed-up in searching texts containing word combinations or phrases.

Main operators for ordering are:

tsvector <=> tsquery | float4 | Distance between tsvector and tsquery. value <=> value | float8 | Distance between two values.

Where value is timestamp, timestamptz, int2, int4, int8, float4, float8, money and oid

Usage

Enable the extension

You can get started with rum by enabling the extension in your Supabase dashboard.

Syntax

For type: tsvector

To understand the following you may need first to see Official PostgreSQL documentation on text search

rum_tsvector_ops


_11
CREATE TABLE test_rum(t text, a tsvector);
_11
_11
CREATE TRIGGER tsvectorupdate
_11
BEFORE UPDATE OR INSERT ON test_rum
_11
FOR EACH ROW EXECUTE PROCEDURE tsvector_update_trigger('a', 'pg_catalog.english', 't');
_11
_11
INSERT INTO test_rum(t) VALUES ('The situation is most beautiful');
_11
INSERT INTO test_rum(t) VALUES ('It is a beautiful');
_11
INSERT INTO test_rum(t) VALUES ('It looks like a beautiful place');
_11
_11
CREATE INDEX rumidx ON test_rum USING rum (a rum_tsvector_ops);

And we can execute tsvector selects with ordering by text distance operator:


_10
SELECT t, a `<=>` to_tsquery('english', 'beautiful | place') AS rank
_10
FROM test_rum
_10
WHERE a @@ to_tsquery('english', 'beautiful | place')
_10
ORDER BY a `<=>` to_tsquery('english', 'beautiful | place');
_10
t | rank
_10
---------------------------------+---------
_10
It looks like a beautiful place | 8.22467
_10
The situation is most beautiful | 16.4493
_10
It is a beautiful | 16.4493
_10
(3 rows)

rum_tsvector_addon_ops


_10
CREATE TABLE tsts (id int, t tsvector, d timestamp);
_10
CREATE INDEX tsts_idx ON tsts USING rum (t rum_tsvector_addon_ops, d)
_10
WITH (attach = 'd', to = 't');

Now we can execute the selects with ordering distance operator on attached column:


_10
SELECT id, d, d `<=>` '2016-05-16 14:21:25' FROM tsts WHERE t @@ 'wr&qh' ORDER BY d `<=>` '2016-05-16 14:21:25' LIMIT 5;
_10
id | d | ?column?
_10
-----+---------------------------------+---------------
_10
355 | Mon May 16 14:21:22.326724 2016 | 2.673276
_10
354 | Mon May 16 13:21:22.326724 2016 | 3602.673276
_10
371 | Tue May 17 06:21:22.326724 2016 | 57597.326724
_10
406 | Wed May 18 17:21:22.326724 2016 | 183597.326724
_10
415 | Thu May 19 02:21:22.326724 2016 | 215997.326724
_10
(5 rows)

For type: anyarray

rum_anyarray_ops

This operator class stores anyarray elements with length of the array. It supports operators &&, @>, <@, =, % operators. It also supports ordering by <=> operator.


_10
CREATE TABLE test_array (i int2[]);
_10
INSERT INTO test_array VALUES ('{}'), ('{0}'), ('{1,2,3,4}'), ('{1,2,3}'), ('{1,2}'), ('{1}');
_10
CREATE INDEX idx_array ON test_array USING rum (i rum_anyarray_ops);

Now we can execute the query using index scan:


_10
SELECT * FROM test_array WHERE i && '{1}' ORDER BY i `<=>` '{1}' ASC;
_10
i
_10
-----------
_10
{1}
_10
{1,2}
_10
{1,2,3}
_10
{1,2,3,4}
_10
(4 rows)

rum_anyarray_addon_ops

The does the same with anyarray index as rum_tsvector_addon_ops i.e. allows to order select results using distance operator by attached column.

Limitations

RUM has slower build and insert times than GIN due to:

  1. It is bigger due to the additional attributes stored in the index.
  2. It uses generic WAL records.

Resources