[![Build Status](https://drone.friedl.net/api/badges/incubator/bytetrie/status.svg)](https://drone.friedl.net/incubator/bytetrie)

# Bytetrie
A fast, dependency-free, self-compressing trie with radix 256 in pure Python.

![](trie.png)

Bytetrie allows fast prefix search in a large corpus of keys. Each key can be
associated with arbitrary data. It features fast lookup times at the cost of
expensive insertion. A Bytetrie is best used if it can be pre-filled with data.
However, due to its in-band compression it can also be used for on-the-fly
updates.
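
To illustrate what compressing during insertion can look like, here is a minimal sketch of a radix tree whose edges carry whole byte strings and are split only when a new key diverges. It is a toy under stated assumptions, not bytetrie's actual implementation: `RadixNode`, `radix_insert`, `common_len` and `iter_keys` are hypothetical names made up for this example.

```python
class RadixNode:
    """Toy node: each edge is labelled with a byte string, not a single byte."""
    def __init__(self, value=None, terminal=False):
        self.edges = {}          # first byte of label -> (label, child node)
        self.value = value
        self.terminal = terminal

def common_len(a: bytes, b: bytes) -> int:
    """Length of the shared prefix of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def radix_insert(root: RadixNode, key: bytes, value) -> None:
    node = root
    while True:
        if not key:
            # Key fully consumed: mark this node as a terminal.
            node.terminal, node.value = True, value
            return
        edge = node.edges.get(key[0])
        if edge is None:
            # No edge starts with this byte: store the rest as one compressed edge.
            node.edges[key[0]] = (key, RadixNode(value, True))
            return
        label, child = edge
        common = common_len(label, key)
        if common < len(label):
            # Key diverges inside the label: split the edge at the divergence point.
            mid = RadixNode()
            mid.edges[label[common]] = (label[common:], child)
            node.edges[key[0]] = (label[:common], mid)
            child = mid
        node, key = child, key[common:]

def iter_keys(node: RadixNode, prefix: bytes = b""):
    """Yield every stored key by concatenating edge labels."""
    if node.terminal:
        yield prefix
    for label, child in node.edges.values():
        yield from iter_keys(child, prefix + label)

root = RadixNode()
for word in (b"Hallo", b"Hello", b"Hal"):
    radix_insert(root, word, word.decode())
print(sorted(iter_keys(root)))  # [b'Hal', b'Hallo', b'Hello']
```

The split step is where insertion gets expensive, while lookups benefit from following fewer, longer edges.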

## Keys
Keys are byte strings. Therefore, each node in the trie can have up to 256
children (the radix). Keys work well with UTF-8 and other encodings as long
as the encoding is consistent and deterministic. That is, the same grapheme
cluster is always encoded to the same byte sequence, even if the standard allows
for ambiguity. Usually that's a non-issue as long as the same encoder is used for
insertion and lookup.
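
One place where the same visible text can yield different bytes is Unicode composition. The sketch below uses only the standard library to normalize text to NFC before encoding. `to_key` is a hypothetical helper for this example and not part of bytetrie.

```python
import unicodedata

def to_key(text: str) -> bytes:
    """Normalize to NFC, then encode as UTF-8, so equal-looking text gives equal bytes."""
    return unicodedata.normalize("NFC", text).encode("utf-8")

composed = "Hall\u00f3"      # 'ó' as a single code point
decomposed = "Hallo\u0301"   # 'o' followed by a combining acute accent

# Without normalization the two spellings are different byte strings.
raw_equal = composed.encode("utf-8") == decomposed.encode("utf-8")
# With normalization they map to the same key.
key_equal = to_key(composed) == to_key(decomposed)
print(raw_equal, key_equal)  # False True
```

Applying the same helper on both insertion and lookup keeps keys comparable.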

Since prefix search in Unicode strings is one of the most common use cases of
bytetrie, a Unicode layer on top of bytetrie is [planned](TODO.md).

## Data
Bytetrie can associate arbitrary Python objects with keys. Data (or rather a
reference to it) is kept in-tree. No further processing is done.

In addition, bytetrie allows multi-valued tries. Every key is then associated with
a sequence of arbitrary objects.

## Performance
Despite being written in pure Python, bytetrie is _fast_. Sifting through the full
[geonames](http://download.geonames.org/export/dump/) "allCountries" dataset for
places starting with `Vienna` takes a mere 512µs. That's not even a
millisecond for searching through 12,041,359 places. For comparison, a warmed-up
ripgrep search through the same dataset takes about 400ms on the same machine,
roughly three orders of magnitude longer.

On the downside, building the trie takes about 20 minutes and considerable
memory. Also, lookup time is dominated by the time it takes to collect
terminal nodes. The higher up in the trie the search ends (and hence the more
results the prefix search yields), the longer it takes. There is still some
low-hanging fruit, and further performance improvements are in the
[pipeline](TODO.md).
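
Why collecting terminals dominates can be seen in a plain, uncompressed byte trie. The descent costs one step per prefix byte, while the collection has to visit the whole subtree below the prefix. This is a toy model for illustration only, not how bytetrie is implemented. `PlainNode`, `plain_insert` and `plain_find` are hypothetical names.

```python
class PlainNode:
    """Toy trie node with up to 256 children, one per byte value."""
    def __init__(self):
        self.children = {}   # byte value (0-255) -> PlainNode
        self.terminal = False
        self.value = None

def plain_insert(root: PlainNode, key: bytes, value) -> None:
    node = root
    for b in key:
        node = node.children.setdefault(b, PlainNode())
    node.terminal = True
    node.value = value

def plain_find(root: PlainNode, prefix: bytes) -> list:
    # Phase 1: descend along the prefix, one step per byte.
    node = root
    for b in prefix:
        node = node.children.get(b)
        if node is None:
            return []
    # Phase 2: collect every terminal below; cost grows with the subtree size.
    results = []
    stack = [(node, prefix)]
    while stack:
        current, key = stack.pop()
        if current.terminal:
            results.append((key, current.value))
        for b, child in current.children.items():
            stack.append((child, key + bytes([b])))
    return results

trie = PlainNode()
for key, lang in [(b"Hallo", "German"), (b"Hello", "English"),
                  (b"Hei", "Romanian"), (b"Helo", "Welsh"), (b"Hug", "Gaelic")]:
    plain_insert(trie, key, lang)

# A shorter prefix ends higher up and has more terminals to collect.
print(len(plain_find(trie, b"H")), len(plain_find(trie, b"He")))  # 5 3
```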

## Dependencies
None. That's the point.

# Getting started
Install bytetrie via [pip](https://pip.pypa.io/en/stable/quickstart/).
```
pip install -U bytetrie
```

The public interface is `ByteTrie` with the two methods `insert` and `find`.
`find` returns a list of `Terminals` from which the `key` and the `value` of the
node can be retrieved.

```python
from bytetrie import ByteTrie

t = ByteTrie(multi_value=True)
t.insert(b"Hallo", "Dutch")
t.insert(b"Hello", "English")
t.insert(b"Hug", "Gaelic")
t.insert(b"Hallo", "German")
t.insert("Hē".encode("utf-8"), "Hindi")
t.insert("Halló".encode("utf-8"), "Icelandic")
t.insert(b"Hej", "Polish")
t.insert(b"Hei", "Romanian")
t.insert(b"Hujambo", "Swahili")
t.insert(b"Hej", "Swedish")
t.insert(b"Helo", "Welsh")

print("Where to say 'Hi' with 'He'?") 
print(f"{[(n.key(), n.value()) for n in t.find(b'He')]}")
# [(b'Hei', ['Romanian']), (b'Hej', ['Swedish', 'Polish']), (b'Helo', ['Welsh']), (b'Hello', ['English'])]

print("Where to say 'Hi' with 'Ha'?") 
print(f"{[(n.key().decode(), n.value()) for n in t.find(b'Ha')]}")
# [('Halló', ['Icelandic']), ('Hallo', ['German', 'Dutch'])]

print("Where to say 'Hi' with 'Hē'?")
print(f"{[(n.key().decode(), n.value()) for n in t.find('Hē'.encode())]}")
# [('Hē', ['Hindi'])]
```

# Contribute
If you want to contribute to `bytetrie`, feel free to send patches to
dev[at]friedl[dot]net. Alternatively, you can open a pull request on GitHub,
which will be cherry-picked into my tree. If you plan significant long-term
contributions, drop me a mail for access to the incubator repository.

# GitHub Users
If you are visiting this repository on GitHub, you are on a mirror of
https://git.friedl.net/incubator/bytetrie. This mirror is regularly updated
along with my other GitHub mirrors.

As with my other incubator projects, once I consider `bytetrie` reasonably
stable the main tree will move to GitHub.