Add drone build, polish README
All checks were successful
continuous-integration/drone/push Build is passing
parent b7d8355614
commit 92077efb43
4 changed files with 93 additions and 26 deletions
22 .drone.yml Normal file
@@ -0,0 +1,22 @@
+kind: pipeline
+type: docker
+name: default
+
+steps:
+- name: validate
+  image: python:3
+  commands:
+  - pip install mypy
+  - mypy bytetrie/bytetrie.py
+
+- name: publish
+  image: python:3
+  environment:
+    TWINE_USERNAME: __token__
+    TWINE_PASSWORD:
+      from_secret: pypi_test_token
+  commands:
+  - pip install twine setuptools wheel
+  - python setup.py sdist bdist_wheel
+  - twine check dist/*
+  - twine upload --repository testpypi dist/*
|
68 README.md
@@ -10,8 +10,8 @@ updates.
 ## Keys
 Keys are byte strings. Therefore, each node in the trie can have up to 256
 children (the radix). Keys do work well with utf-8 and other encodings as long
-as the encoding is consistent and deterministic. That is, a grapheme clusters
-are always encoded to the same byte sequence. Even if the standard allows for
+as the encoding is consistent and deterministic. That is, grapheme clusters
+are always encoded to the same byte sequence -- even if the standard allows for
 ambiguity. Usually that's a non-issue as long as the same encoder is used for
 insertion and lookup.
 
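Review note on the Keys hunk above: the point about consistent encoding can be illustrated with a small sketch. It uses only the standard library; the helper name `to_key` and the choice of NFC normalization are assumptions made for this illustration and are not part of bytetrie.

```python
import unicodedata


def to_key(text: str, form: str = "NFC") -> bytes:
    """Normalize text before encoding so equal-looking strings share one key.

    Hypothetical helper for illustration; bytetrie itself takes raw bytes.
    """
    return unicodedata.normalize(form, text).encode("utf-8")


composed = "Hall\u00f3"     # 'ó' stored as a single code point
decomposed = "Hallo\u0301"  # 'o' followed by a combining acute accent

# Encoded as-is, the two spellings of the same word give different bytes,
# so a key inserted with one spelling would not be found with the other.
raw_composed = composed.encode("utf-8")
raw_decomposed = decomposed.encode("utf-8")
print(raw_composed == raw_decomposed)  # False

# Normalizing on both insertion and lookup yields one byte sequence.
print(to_key(composed) == to_key(decomposed))  # True
```

Whichever normalization form is chosen, the same one would have to be applied on insertion and on lookup.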
@@ -19,24 +19,24 @@ Since prefix search in unicode strings is one of the most common use-cases of
 bytetrie, a unicode layer on top of bytetrie is [planned](TODO.md).
 
 ## Data
-Bytetrie can associate arbitrary data (python objects) with keys. Data (or
-rather a reference thereof) is kept in-tree. No further processing is done.
+Bytetrie can associate arbitrary python objects with keys. Data (or rather a
+reference thereof) is kept in-tree. No further processing is done.
 
-In addition bytrie allows multi-valued tries. Every key is then associated with
+In addition, bytrie allows multi-valued tries. Every key is then associated with
 a sequence of arbitrary objects.
 
 ## Performance
 Despite being in pure python bytetrie is _fast_. Sifting through the full
 [geonames](http://download.geonames.org/export/dump/) "allCountries" dataset for
-places starting with `Vienna` takes a mere 512µs. That's not even one
-millisecond for searching through 12,041,359 places. For comparison a warmed-up
+places starting with `Vienna` takes a mere 512µs. That's not even a
+millisecond for searching through 12,041,359 places. For comparison, a warmed-up
 ripgrep search through the same dataset takes three orders of magnitude (400ms)
 longer on the same machine.
 
-On the downside building the trie takes about 20 minutes and considerable
-memory. Also the performance is mostly trumped by the time it takes to collect
-terminal nodes. That is, the higher up the trie the search ends (and hence the
-more results the prefix search yields) the longer it takes. There are several
+On the downside, building the trie takes about 20 minutes and considerable
+memory. Also, the performance is mostly trumped by the time it takes to collect
+terminal nodes. The higher up the trie the search ends (and hence the more
+results the prefix search yields) the longer it takes. There are several
 low-hanging fruits left and further performance improvements are in the
 [pipeline](TODO.md).
 
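Review note on the Performance hunk above: the remark that collecting terminal nodes dominates the search time can be sketched with a toy trie. This is not bytetrie's implementation. `ToyTrie` and `ToyNode` are hypothetical names, and the sketch uses one byte per edge, whereas the nodes in this commit carry multi-byte labels. It only shows that a shorter prefix ends higher up and leaves a larger subtree to walk.

```python
from typing import Any, Dict, List, Tuple


class ToyNode:
    """One node per byte; values make a node terminal (hypothetical stand-in)."""

    def __init__(self) -> None:
        self.children: Dict[int, "ToyNode"] = {}
        self.values: List[Any] = []


class ToyTrie:
    def __init__(self) -> None:
        self.root = ToyNode()

    def insert(self, key: bytes, value: Any) -> None:
        node = self.root
        for byte in key:
            node = node.children.setdefault(byte, ToyNode())
        node.values.append(value)  # multi-valued: keep every value per key

    def find(self, prefix: bytes) -> List[Tuple[bytes, Any]]:
        # Step 1: descend along the prefix; cost depends on the prefix length.
        node = self.root
        for byte in prefix:
            if byte not in node.children:
                return []
            node = node.children[byte]
        # Step 2: collect everything below; cost depends on the subtree size.
        results: List[Tuple[bytes, Any]] = []
        stack = [(node, prefix)]
        while stack:
            current, key = stack.pop()
            results.extend((key, value) for value in current.values)
            for byte, child in current.children.items():
                stack.append((child, key + bytes([byte])))
        return results


t = ToyTrie()
for key, language in [(b"Hallo", "Dutch"), (b"Hello", "English"),
                      (b"Hallo", "German"), (b"Hej", "Polish"),
                      (b"Hej", "Swedish")]:
    t.insert(key, language)

print(len(t.find(b"H")))      # 5
print(len(t.find(b"Hallo")))  # 2
```

The descent in step 1 touches at most one node per prefix byte, while step 2 visits every node below the match, which is why broad prefixes are the slow case.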
@@ -44,7 +44,46 @@ low-hanging fruits left and further performance improvements are in the
 None. That's the point.
 
 # Getting started
-TODO
+Install bytetrie via [pip](https://pip.pypa.io/en/stable/quickstart/).
+```
+pip install -U bytetrie
+```
+
+The public interface is `ByteTrie` with the two methods `insert` and `find`.
+Find returns a list of `Terminals` from which the `key` and the `value` of the
+node can be retrieved.
+
+```python
+from bytetrie import ByteTrie
+
+t = ByteTrie(multi_value=True)
+t.insert(b"Hallo", "Dutch")
+t.insert(b"Hello", "English")
+t.insert(b"Hug", "Gaelic")
+t.insert(b"Hallo", "German")
+t.insert("Hē".encode("utf-8"), "Hindi")
+t.insert("Halló".encode("utf-8"), "Icelandic")
+t.insert(b"Hej", "Polish")
+t.insert(b"Hei", "Romanian")
+t.insert(b"Hujambo", "Swahili")
+t.insert(b"Hej", "Swedish")
+t.insert(b"Helo", "Welsh")
+
+print("Where to say 'Hi' with 'He'?")
+print(f"{[(n.key(), n.value()) for n in t.find(b'He')]}")
+
+print("Where to say 'Hi' with 'Ha'?")
+print(f"{[(n.key().decode('utf-8'), n.value()) for n in t.find(b'Ha')]}")
+
+print("Where to say 'Hi' with 'Hē'?")
+print(f"Say 'Hi' with utf-8: {[(n.key().decode('utf-8'), n.value()) for n in t.find('Hē'.encode('utf-8'))]}")
+```
+
+# Contribute
+If you want to contribute to `bytetrie` feel free to send patches to
+dev[at]friedl[dot]net. Alternatviely, you can issue a pull request on GitHub
+which will be cherry picked into my tree. If you plan significant long-term
+contributions drop me a mail for access to the incubator repository.
 
 # Github Users
 If you are visiting this repository on GitHub, you are on a mirror of
@@ -53,8 +92,3 @@ with my other GitHub mirrors.
 
 Like with my other incubator projects, once I consider `bytetrie` reasonable
 stable the main tree will move to GitHub.
-
-If you want to contribute to `bytetrie` feel free to send patches to
-dev[at]friedl[dot]net. Alternatviely, you can issue a pull request on GitHub
-which will be cherry picked into my tree. If you plan significant long-term
-contributions drop me a mail for access to the incubator repository.
@@ -6,7 +6,7 @@ import logging
 log = logging.getLogger(__name__)
 
 class ByteTrie:
-    def __init__(self, multi_value=False):
+    def __init__(self, multi_value:bool=False):
         self.root = Root([])
         self.multi_value = multi_value
 
@@ -63,9 +63,9 @@ class ByteTrie:
         ancestor.put_child(node)
         return terminal
 
-    def find(self, prefix):
+    def find(self, prefix: ByteString) -> Sequence[Terminal]:
         node = self._find(self.root, prefix)
-        return self._get_terminals(node, prefix)
+        return self._get_terminals(node)
 
     def _find(self, node, prefix, collector=""):
         cutoff = node.cut_from(prefix)
@@ -84,15 +84,14 @@ class ByteTrie:
         log.debug(f"Found node {child} in {node} for {cutoff}. Traversing down.")
         return self._find(child, cutoff)
 
-    def _get_terminals(self, node, label_builder):
+    def _get_terminals(self, node):
         if not node: return []
 
         collector = []
         if isinstance(node, Terminal):
-            collector.append((node, label_builder))
+            collector.append((node))
         for child in node.children:
-            l = child.extend(label_builder)
-            collector.extend(self._get_terminals(child, l))
+            collector.extend(self._get_terminals(child))
         return collector
 
     def to_dot(self) -> str:
@@ -275,6 +274,17 @@ class Terminal(Child):
             return t
         return cls(child.label, content, child.parent, child.children, multi_value)
 
+    def key(self) -> ByteString:
+        l = bytes(self.label)
+        parent = self.parent
+        while isinstance(parent, Child):
+            l = bytes(parent.label) + l
+            parent = parent.parent
+        return l
+
+    def value(self) -> Any:
+        return self.content
+
     def to_dot(self) -> str:
         s = super().to_dot()
         s += f"{self.dot_id()} [color=blue]\n"
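Review note on the new `key()` method above: it rebuilds the full key by walking parent links and prepending each label, which replaces the `label_builder` argument that `_get_terminals` used to thread through the recursion. The sketch below shows the same idea in isolation. `SketchNode`, `full_key` and `full_key_joined` are hypothetical names, and the root here is simply a node with an empty label and no parent, whereas the commit stops once the parent is no longer a `Child`.

```python
from typing import List, Optional


class SketchNode:
    """Hypothetical stand-in holding only what the key reconstruction needs."""

    def __init__(self, label: bytes,
                 parent: Optional["SketchNode"] = None) -> None:
        self.label = label
        self.parent = parent


def full_key(node: SketchNode) -> bytes:
    # Same shape as the commit's loop: prepend each ancestor's label.
    key = node.label
    parent = node.parent
    while parent is not None:
        key = parent.label + key
        parent = parent.parent
    return key


def full_key_joined(node: SketchNode) -> bytes:
    # Variant that gathers the labels first and joins once, which avoids
    # copying the growing byte string at every step on deep paths.
    labels: List[bytes] = []
    current: Optional[SketchNode] = node
    while current is not None:
        labels.append(current.label)
        current = current.parent
    return b"".join(reversed(labels))


root = SketchNode(b"")
h = SketchNode(b"H", root)
e = SketchNode(b"e", h)
llo = SketchNode(b"llo", e)

print(full_key(llo))         # b'Hello'
print(full_key_joined(llo))  # b'Hello'
```

Computing the key on demand means nothing is built for terminals whose key is never read. Whether the joined variant is measurably faster on real data is untested here.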
5 setup.py
@@ -1,11 +1,12 @@
 import setuptools
 
+
 with open("README.md", "r") as fh:
     long_description = fh.read()
 
-setup(
+setuptools.setup(
     name="bytetrie",
-    version="0.0.1",
+    version="0.0.2",
     url="https://git.friedl.net/incubator/bytetrie",
     license="MIT",
     author="Armin Friedl",