First release
parent 2b29d8495e
commit b7d8355614
6 changed files with 426 additions and 3 deletions
2
LICENSE
@@ -1,4 +1,4 @@
-MIT License Copyright (c) <year> <copyright holders>
+MIT License Copyright (c) 2020 Armin Friedl <dev@friedl.net>
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
61
README.md
@@ -1,3 +1,60 @@
-# bytetrie
-
-A self-compressing radix trie with radix 256 in pure python
+# Bytetrie
+A fast, dependency-free, self-compressing trie with radix 256 in pure Python.
+
+Bytetrie allows fast prefix search in a large corpus of keys. Each key can be
+associated with arbitrary data. It features fast lookup times at the cost of
+expensive insertion. A Bytetrie is best used if it can be pre-filled with data.
+However, thanks to its in-band compression, it can also be used for on-the-fly
+updates.

## Keys
Keys are byte strings. Therefore, each node in the trie can have up to 256
children (the radix). Keys work well with UTF-8 and other encodings as long
as the encoding is consistent and deterministic. That is, a grapheme cluster
is always encoded to the same byte sequence, even if the standard allows for
ambiguity. Usually that's a non-issue as long as the same encoder is used for
insertion and lookup.

Since prefix search in Unicode strings is one of the most common use cases of
bytetrie, a Unicode layer on top of bytetrie is [planned](TODO.md).

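The ambiguity mentioned above can be made concrete with Unicode normalization forms. The following is a minimal sketch, not part of bytetrie: `to_key` is a hypothetical helper that normalizes to NFC before encoding, so that visually identical strings map to the same byte key.

```python
import unicodedata


def to_key(text: str) -> bytes:
    """Hypothetical helper: normalize to NFC, then encode as UTF-8."""
    return unicodedata.normalize("NFC", text).encode("utf-8")


composed = "\u00e9"      # 'é' as a single code point
decomposed = "e\u0301"   # 'e' followed by a combining acute accent

# Both render the same, but their raw UTF-8 encodings differ ...
assert composed.encode("utf-8") != decomposed.encode("utf-8")

# ... while the normalized keys are identical.
assert to_key(composed) == to_key(decomposed)
```

Using one such function for both insertion and lookup is one way to satisfy the "consistent and deterministic" requirement.
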
## Data
Bytetrie can associate arbitrary data (Python objects) with keys. Data (or
rather a reference to it) is kept in-tree. No further processing is done.

In addition, bytetrie allows multi-valued tries. Every key is then associated
with a sequence of arbitrary objects.

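The difference between single-valued and multi-valued tries can be sketched as follows. `TinyTrie` is a hypothetical, uncompressed stand-in written only for this illustration; it is not bytetrie's implementation and skips label compression entirely.

```python
from typing import Any, Dict, List


class TinyTrie:
    """Uncompressed byte trie, for illustration only (not bytetrie's code)."""

    def __init__(self, multi_value: bool = False):
        self.multi_value = multi_value
        self.children: Dict[int, "TinyTrie"] = {}
        self.terminal = False  # True if a key ends at this node
        self.content: Any = None

    def insert(self, key: bytes, content: Any) -> None:
        node = self
        for byte in key:  # iterating over bytes yields ints in range(256)
            node = node.children.setdefault(byte, TinyTrie(self.multi_value))
        if self.multi_value:
            if not node.terminal:
                node.content = []
            node.content.append(content)  # multi-valued: keep every value
        else:
            node.content = content  # single-valued: later inserts overwrite
        node.terminal = True

    def find(self, prefix: bytes) -> List[Any]:
        node = self
        for byte in prefix:
            if byte not in node.children:
                return []
            node = node.children[byte]
        found, stack = [], [node]
        while stack:  # collect the content of every terminal below the prefix
            current = stack.pop()
            if current.terminal:
                found.append(current.content)
            stack.extend(current.children.values())
        return found


single = TinyTrie()
single.insert(b"Hello", "P1")
single.insert(b"Hello", "P2")  # overwrites "P1"

multi = TinyTrie(multi_value=True)
multi.insert(b"Hello", "P1")
multi.insert(b"Hello", "P2")  # kept alongside "P1"
```

In this sketch, `single.find(b"He")` yields one value for the key, while `multi.find(b"He")` yields one list holding both values.
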
## Performance
Despite being written in pure Python, bytetrie is _fast_. Sifting through the full
[geonames](http://download.geonames.org/export/dump/) "allCountries" dataset for
places starting with `Vienna` takes a mere 512µs. That's not even one
millisecond for searching through 12,041,359 places. For comparison, a warmed-up
ripgrep search through the same dataset takes about 400ms on the same machine,
almost three orders of magnitude longer.

On the downside, building the trie takes about 20 minutes and considerable
memory. Also, lookup time is dominated by the time it takes to collect
terminal nodes. That is, the higher up in the trie the search ends (and hence
the more results the prefix search yields), the longer it takes. There is
still some low-hanging fruit left, and further performance improvements are in
the [pipeline](TODO.md).

## Dependencies
None. That's the point.

# Getting started
TODO

# GitHub Users
If you are visiting this repository on GitHub, you are on a mirror of
https://git.friedl.net/incubator/bytetrie. This mirror is regularly updated
along with my other GitHub mirrors.

As with my other incubator projects, once I consider `bytetrie` reasonably
stable the main tree will move to GitHub.

If you want to contribute to `bytetrie`, feel free to send patches to
dev[at]friedl[dot]net. Alternatively, you can open a pull request on GitHub,
which will be cherry-picked into my tree. If you plan significant long-term
contributions, drop me a mail for access to the incubator repository.
30
TODO.md
Normal file
@@ -0,0 +1,30 @@
# Benchmarking
- Gather some general benchmarks and performance behavior
- Compare with other implementations:
  - https://github.com/jfjlaros/dict-trie
  - https://github.com/dcjones/hat-trie
  - https://github.com/fnl/patricia-trie
  - https://github.com/pytries/marisa-trie
  - https://github.com/soumasish/poetries
  - https://github.com/mischif/py-fast-trie

# Profiling
Find optimization possibilities

- Memory usage looks too high. Find out whether something leaks references and
  cannot be garbage collected.
- Initial creation is expected to take the most time, since the trie optimizes
  for retrieval. But if there is low-hanging fruit left, it could be picked.
- Lookup time is dominated by gathering the terminals, especially for nodes high
  up in the trie. To some extent this is expected and unavoidable; still, any
  optimizations there are highly desired. Additionally, a limit search could be
  introduced to stop gathering after x terminals.

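One possible shape for such a limit search is to gather terminals lazily and stop once enough have been seen. This is only a sketch under assumptions: `Node`, `iter_terminals` and `find_limited` are hypothetical names, and the node class is a simplified stand-in rather than bytetrie's `Child`/`Terminal` classes.

```python
from itertools import islice
from typing import Iterator, List, Optional, Tuple


class Node:
    """Hypothetical stand-in for a trie node; content is None for inner nodes."""

    def __init__(self, label: bytes, content: Optional[str] = None):
        self.label = label
        self.content = content
        self.children: List["Node"] = []


def iter_terminals(node: Node, prefix: bytes = b"") -> Iterator[Tuple[bytes, str]]:
    """Lazily yield (key, content) for every terminal at or below node."""
    key = prefix + node.label
    if node.content is not None:
        yield key, node.content
    for child in node.children:
        yield from iter_terminals(child, key)


def find_limited(node: Node, limit: int) -> List[Tuple[bytes, str]]:
    """Stop walking the subtree as soon as limit terminals were gathered."""
    return list(islice(iter_terminals(node), limit))


# Small compressed trie holding the keys "Hi", "Hela" and "Hello".
root = Node(b"H")
inner = Node(b"el")
inner.children = [Node(b"a", "P3"), Node(b"lo", "P1")]
root.children = [Node(b"i", "P2"), inner]

first_two = find_limited(root, 2)
```

Because the generator is consumed through `islice`, subtrees beyond the limit are never visited, which is what keeps searches that end high up in the trie cheap.
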
# Testing and Verification
Tests and correctness verification are currently glaringly lacking. Any
improvements there are highly desired.

# Future Extensions
- A unicode layer on top of bytetrie for simpler handling of string keys
- Key deletion
- Make insertion multi-threaded
29
bytetrie/__init__.py
Normal file
@@ -0,0 +1,29 @@
"""
Bytetrie
========
A fast, dependency-free implementation of a compressed trie with radix 256.

Bytetrie allows fast prefix search in a large corpus of keys. Each key can
be associated with arbitrary data. The fast lookup times come at the cost of
expensive insertion. A Bytetrie is best used if it can be pre-loaded with data.

Keys and Data
-------------
Keys are byte strings. Therefore, each node in the trie can have up to 256
children. Keys work well with UTF-8 and other encodings as long as the
encoding is consistent and deterministic. I.e. a given grapheme cluster is
always encoded to the same byte sequence. Every key can be associated with
arbitrary data. Multi-valued bytetries allow associating a sequence of
arbitrary data with every key. Order is not guaranteed.

Usage
-----
.. code :: python
    t = ByteTrie()
    t.insert(b"Hello", "P1")
    t.insert(b"Hi", "P2")
    t.insert(b"Hela", "P3")
    [node.content for node, _ in t.find(b"He")] # "P1" and "P3", in no guaranteed order
"""

from .bytetrie import ByteTrie
281
bytetrie/bytetrie.py
Normal file
@@ -0,0 +1,281 @@
from __future__ import annotations
from typing import Sequence, MutableSequence, ByteString, Any, Optional
from abc import ABC, abstractmethod

import logging
log = logging.getLogger(__name__)

class ByteTrie:
    def __init__(self, multi_value=False):
        self.root = Root([])
        self.multi_value = multi_value

    def insert(self, label: ByteString, content: Any):
        log.info(f"Inserting {label} into Trie")
        start = self.root.child_by_common_prefix(label)
        if not start:
            log.debug(f"Creating new terminal for {label} at root")
            new_node = Terminal(label, content, self.root, [], self.multi_value)
            self.root.put_child(new_node)
            return new_node
        log.debug(f"Found match {start} for {label}. Traversing down")
        return self._insert(start, label, content)

    def _insert(self, node, label, content):
        log.info(f"Inserting {label} into Trie at {node}")
        if node.has_label(label):
            log.debug(f"{node} equals {label}. Wrapping node as Terminal.")
            if isinstance(node, Terminal) and not self.multi_value:
                log.warning(f"{node} is already a Terminal. Content will be overwritten.")
            terminal = Terminal.from_child(node, content, self.multi_value)
            node.replace_with(terminal)
            return terminal

        if node.is_prefix_of(label):
            log.debug(f"{node} is prefix of {label}")
            cutoff = node.cut_from(label)
            next_node = node.child_by_common_prefix(cutoff)
            if not next_node:
                log.debug(f"No matching child found for {cutoff}. Creating new child terminal.")
                terminal = Terminal(cutoff, content, node, [], self.multi_value)
                node.put_child(terminal)
                return terminal
            else:
                log.debug(f"Found match {next_node} for {cutoff}. Traversing down.")
                return self._insert(next_node, cutoff, content)

        if node.starts_with(label):
            log.debug(f"{label} is part of {node}. Creating new parent from {label}")
            new_node = Terminal(label, content, node.parent, [], self.multi_value)
            node.replace_with(new_node)
            node.strip_prefix(label)
            new_node.put_child(node)
            return new_node

        log.debug(f"{label} and {node} have a common ancestor")
        common_prefix = node.common_prefix(label)
        log.debug(f"Creating new ancestor for {common_prefix}")
        ancestor = Child(common_prefix, node.parent, [])
        node.replace_with(ancestor)
        terminal = Terminal(cut_off_prefix(common_prefix, label), content, ancestor, [], self.multi_value)
        node.strip_prefix(common_prefix)
        ancestor.put_child(terminal)
        ancestor.put_child(node)
        return terminal

    def find(self, prefix):
        node, label = self._find(self.root, prefix)
        return self._get_terminals(node, label)

    def _find(self, node, prefix, path=b""): # path is the full label of node
        cutoff = node.cut_from(prefix)
        log.debug(f"Searching for {cutoff} in {node}")
        child = node.child_by_prefix_match(cutoff)
        if not child and not cutoff:
            return node, path
        elif not child and cutoff:
            log.debug(f"Leftover cutoff {cutoff}. Trying to find node with prefix {cutoff}")
            child = node.child_by_common_prefix(cutoff)
            if not child or not child.starts_with(cutoff):
                return None, path
            log.debug(f"Found child {child} starting with {cutoff}")
            return child, child.extend(path) # label may reach beyond the searched prefix
        else: # child must be not None
            log.debug(f"Found node {child} in {node} for {cutoff}. Traversing down.")
            return self._find(child, cutoff, child.extend(path))

    def _get_terminals(self, node, label_builder):
        if node is None: return []

        collector = []
        if isinstance(node, Terminal):
            collector.append((node, label_builder))
        for child in node.children:
            label = child.extend(label_builder)
            collector.extend(self._get_terminals(child, label))
        return collector

    def to_dot(self) -> str:
        return "graph {\n\n"+self.root.to_dot()+"\n}"

def has_common_prefix(label: ByteString, other_label: ByteString) -> bool:
    """ Whether label and other_label have a prefix in common. """
    assert label and other_label
    return label[0] == other_label[0]

def common_prefix(label: ByteString, other_label: ByteString) -> ByteString:
    """ Get the common prefix of label and other_label. """
    buffer = bytearray()
    for (a,b) in zip(label, other_label):
        if a == b: buffer.append(a)
        else: break
    return bytes(buffer) # immutable and hashable, unlike bytearray

def is_prefix_of(prefix: ByteString, label: ByteString) -> bool:
    """ Whether label starts with prefix """
    if len(prefix) > len(label):
        return False
    for (a,b) in zip(prefix, label):
        if a != b: return False
    return True

def find_first(predicate, iterable):
    """ Return the first element in iterable that satisfies predicate or None """
    try: return next(filter(predicate, iterable))
    except StopIteration: return None

def cut_off_prefix(prefix: ByteString, label: ByteString) -> ByteString:
    """ Cut prefix from start of label. Return rest of label. """
    assert is_prefix_of(prefix, label)
    return bytes(label[len(prefix):])

class Node(ABC):
    def __init__(self, children: MutableSequence[Child]):
        self.children = children

    def child_by_common_prefix(self, label: ByteString) -> Optional[Child]:
        """ Return Child that has a common prefix with label if one exists. """
        def by_common_prefix(child: Child):
            return has_common_prefix(child.label, label)
        return find_first(by_common_prefix, self.children)

    def child_by_prefix_match(self, label: ByteString) -> Optional[Child]:
        """ Return Child whose label is a prefix of the given label if one exists. """
        def by_prefix_match(child: Child):
            return is_prefix_of(child.label, label)
        return find_first(by_prefix_match, self.children)

    def put_child(self, child: Child):
        """ Put child into this node's children, replacing an existing child with the same label. """
        if child in self.children:
            log.warning(f"Replacing child {child.label}")
            self.remove_child(child)
        child.parent = self
        self.children.append(child)

    def replace_child(self, child: Child, replacement: Child):
        """ Remove child from this node's children and add replacement. """
        self.remove_child(child)
        self.put_child(replacement)

    def remove_child(self, child: Child):
        """ Remove child from this node's children """
        if child not in self.children:
            log.warning(f"Trying to delete {child.label} but it does not exist.")
        else: self.children.remove(child) # list.remove would raise ValueError for a missing child

    @abstractmethod
    def dot_label(self) -> str:
        """ Readable label for this node in a dot graph """
        ...

    @abstractmethod
    def dot_id(self) -> str:
        """ Technical id for this node in a dot graph. Must be unique. """
        ...

    @abstractmethod
    def cut_from(self, label: ByteString) -> ByteString:
        """ Cut off node's label considered as prefix from label. """
        ...

    def to_dot(self) -> str:
        s = f'{self.dot_id()} [label="{self.dot_label()}"]\n'
        for child in self.children:
            s += f"{self.dot_id()} -- {child.dot_id()}\n"
            s += child.to_dot()
        return s

class Root(Node):
    def cut_from(self, label: ByteString) -> ByteString:
        return label

    def dot_label(self):
        return "root"

    def dot_id(self):
        return "root"

class Child(Node):
    def __init__(self, label: ByteString, parent: Node, children: MutableSequence[Child]):
        super().__init__(children)
        self.label = label
        self.parent = parent

    def __eq__(self, other_child):
        return (isinstance(other_child, Child)
                and self.label == other_child.label)

    def __hash__(self):
        return hash(self.label)

    def __str__(self):
        return self.label.decode('utf-8', 'replace').replace('"', '\\"')

    def dot_label(self):
        return self.label.decode('utf-8', 'replace').replace('"', '\\"')

    def dot_id(self):
        return str(id(self))

    def has_label(self, label):
        return self.label == label

    def is_prefix_of(self, label):
        return is_prefix_of(self.label, label)

    def replace_with(self, new_child: Child):
        new_child.parent = self.parent
        self.parent.replace_child(self, new_child)

    def starts_with(self, label: ByteString) -> bool:
        return is_prefix_of(label, self.label)

    def cut_from(self, label: ByteString) -> ByteString:
        """ Cut node's label from (start of) label """
        return cut_off_prefix(self.label, label)

    def strip_prefix(self, prefix: ByteString):
        """ Cut off prefix from node's label """
        self.label = cut_off_prefix(prefix, self.label)

    def extend(self, label: ByteString) -> ByteString:
        """ Extend label by node's label """
        return bytes(label) + bytes(self.label)

    def split_label_at(self, index):
        return (self.label[:index], self.label[index:])

    def contains(self, label):
        if len(label) > len(self.label):
            return False
        for (a,b) in zip(self.label, label):
            if a != b: return False
        return True

    def common_prefix(self, label):
        return common_prefix(self.label, label)

class Terminal(Child):
    def __init__(self, label: ByteString, content: Any, parent: Node, children: MutableSequence[Child], multi_value: bool):
        super().__init__(label, parent, children)
        self.multi_value = multi_value
        self.content = [content] if multi_value else content

    @classmethod
    def from_child(cls, child: Child, content: Any, multi_value: bool):
        # multi_value param has no effect if already a Terminal. I.e.
        # from_child cannot change the multi-value state of a child that
        # is already a Terminal
        if isinstance(child, Terminal) and child.multi_value:
            # Create a new Terminal instance. Although not needed, this is what is expected
            # and compatible with the non-multi-value behaviour.
            t = cls(child.label, content, child.parent, child.children, child.multi_value)
            t.content.extend(child.content) # add back original content
            return t
        return cls(child.label, content, child.parent, child.children, multi_value)

    def to_dot(self) -> str:
        s = super().to_dot()
        s += f"{self.dot_id()} [color=blue]\n"
        return s
26
setup.py
Normal file
@@ -0,0 +1,26 @@
import setuptools

with open("README.md", "r") as fh:
    long_description = fh.read()

setuptools.setup(
    name="bytetrie",
    version="0.0.1",
    url="https://git.friedl.net/incubator/bytetrie",
    license="MIT",
    author="Armin Friedl",
    author_email="dev@friedl.net",

    description="A self-compressing radix trie with radix 256 in pure python",
    long_description=long_description,
    long_description_content_type="text/markdown",

    packages=setuptools.find_packages(exclude=("tests",)),
    include_package_data=True,

    classifiers=[
        "License :: OSI Approved :: MIT License",
        "Programming Language :: Python :: 3",
        "Programming Language :: Python :: 3.7",
    ]
)