First release

Armin Friedl 2020-10-08 04:52:30 +02:00
parent 2b29d8495e
commit b7d8355614
6 changed files with 426 additions and 3 deletions


@ -1,4 +1,4 @@
MIT License Copyright (c) <year> <copyright holders>
MIT License Copyright (c) 2020 Armin Friedl <dev@friedl.net>
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal


@ -1,3 +1,60 @@
# bytetrie
# Bytetrie
A fast, dependency-free, self-compressing trie with radix 256 in pure python.
A self-compressing radix trie with radix 256 in pure python
Bytetrie allows fast prefix search in a large corpus of keys. Each key can be
associated with arbitrary data. It features fast lookup times at the cost of
expensive insertion. A Bytetrie is best used if it can be pre-filled with data.
However, due to its in-band compression it can also be used for on-the-fly
updates.
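As a rough illustration of what compression means here: when two keys share a prefix, the shared part can live in one node and only the differing tails need their own nodes. The sketch below is self-contained and does not use bytetrie itself; `split_on_common_prefix` is a hypothetical helper written for this example.

```python
from typing import Tuple


def split_on_common_prefix(a: bytes, b: bytes) -> Tuple[bytes, bytes, bytes]:
    """Return (shared prefix, rest of a, rest of b)."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n], a[n:], b[n:]


# The shared prefix would become one inner node, the two tails its children.
print(split_on_common_prefix(b"Hello", b"Hela"))  # (b'Hel', b'lo', b'a')
```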
## Keys
Keys are byte strings. Therefore, each node in the trie can have up to 256
children (the radix). Keys work well with UTF-8 and other encodings as long
as the encoding is consistent and deterministic. That is, a grapheme cluster
is always encoded to the same byte sequence, even if the standard allows for
ambiguity. Usually that's a non-issue as long as the same encoder is used for
insertion and lookup.
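To make the ambiguity concrete, here is a minimal sketch using only the standard library. The same visible character can be written as one code point or as a base letter plus a combining mark, and the two forms encode to different bytes. `to_key` is a hypothetical helper for this example, not part of bytetrie.

```python
import unicodedata

composed = "\u00e9"     # 'é' as a single code point
decomposed = "e\u0301"  # 'e' followed by a combining acute accent

# Both render as the same grapheme, but their UTF-8 bytes differ.
print(composed.encode("utf-8"))    # b'\xc3\xa9'
print(decomposed.encode("utf-8"))  # b'e\xcc\x81'


def to_key(text: str) -> bytes:
    """Hypothetical helper: normalize to NFC, then encode as UTF-8."""
    return unicodedata.normalize("NFC", text).encode("utf-8")


# After normalization both spellings map to the same key.
print(to_key(composed) == to_key(decomposed))  # True
```

Applying one such function to every key on insertion and on lookup is one way to keep the byte sequences consistent.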
Since prefix search in unicode strings is one of the most common use-cases of
bytetrie, a unicode layer on top of bytetrie is [planned](TODO.md).
## Data
Bytetrie can associate arbitrary data (python objects) with keys. Data (or
rather a reference to it) is kept in-tree. No further processing is done.
In addition, bytetrie allows multi-valued tries. Every key is then associated
with a sequence of arbitrary objects.
## Performance
Despite being in pure python bytetrie is _fast_. Sifting through the full
[geonames](http://download.geonames.org/export/dump/) "allCountries" dataset for
places starting with `Vienna` takes a mere 512µs. That's not even one
millisecond for searching through 12,041,359 places. For comparison, a warmed-up
ripgrep search through the same dataset takes about 400ms on the same machine,
almost three orders of magnitude longer.
On the downside, building the trie takes about 20 minutes and considerable
memory. Also, lookup time is mostly dominated by the time it takes to collect
terminal nodes. That is, the higher up the trie the search ends (and hence the
more results the prefix search yields) the longer it takes. There is still
low-hanging fruit left and further performance improvements are in the
[pipeline](TODO.md).
## Dependencies
None. That's the point.
# Getting started
TODO
# GitHub Users
If you are visiting this repository on GitHub, you are on a mirror of
https://git.friedl.net/incubator/bytetrie. This mirror is regularly updated
together with my other GitHub mirrors.
Like with my other incubator projects, once I consider `bytetrie` reasonably
stable the main tree will move to GitHub.
If you want to contribute to `bytetrie` feel free to send patches to
dev[at]friedl[dot]net. Alternatively, you can issue a pull request on GitHub
which will be cherry-picked into my tree. If you plan significant long-term
contributions, drop me a mail for access to the incubator repository.

TODO.md Normal file

@ -0,0 +1,30 @@
# Benchmarking
- Gather some general benchmarks and performance behavior
- Compare with other implementations:
  - https://github.com/jfjlaros/dict-trie
  - https://github.com/dcjones/hat-trie
  - https://github.com/fnl/patricia-trie
  - https://github.com/pytries/marisa-trie
  - https://github.com/soumasish/poetries
  - https://github.com/mischif/py-fast-trie
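A possible starting point for such benchmarks is a small `timeit` harness. The sketch below is an assumption about how measurements could be taken, not an existing part of this repository. `linear_prefix_scan` and `best_time` are hypothetical names, and the linear scan only serves as a baseline that a trie lookup would be compared against.

```python
import timeit


def linear_prefix_scan(keys, prefix):
    """Baseline: check every key; a trie lookup would replace this call."""
    return [k for k in keys if k.startswith(prefix)]


def best_time(func, *args, repeat=5, number=10):
    """Best average seconds per call over several repeats."""
    timer = timeit.Timer(lambda: func(*args))
    return min(timer.repeat(repeat=repeat, number=number)) / number


# Synthetic keys, made up for this example only.
keys = [f"place-{i}".encode("utf-8") for i in range(10_000)]

seconds = best_time(linear_prefix_scan, keys, b"place-99")
print(f"{seconds * 1e6:.1f} µs per lookup")
```

Taking the minimum over several repeats reduces noise from other processes on the machine.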
# Profiling
Find optimization possibilities
- Memory usage looks too high. Find if something leaks references and cannot be
garbage collected.
- Initial creation is expected to take most time. The trie optimizes for
retrieval. But if there are low hanging fruits left they could be picked up.
- Lookup time is dominated by gathering the terminals, especially for nodes high
up in the trie. To some extent this is expected and unavoidable; still, any
possible optimizations there are highly desired. Additionally, a limit search
could be introduced to stop gathering after x terminals.
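One way the limit search from the last item could look is to gather terminals lazily with a generator and cut off after the first x results. This is only a sketch under assumptions: `SketchNode`, `iter_terminals` and `find_limited` are hypothetical names, and `SketchNode` is a stand-in, not bytetrie's `Node` class.

```python
from dataclasses import dataclass, field
from itertools import islice
from typing import Iterator, List, Optional, Tuple


@dataclass
class SketchNode:
    """Hypothetical stand-in for a trie node."""
    label: bytes
    content: Optional[str] = None  # None means "not a terminal"
    children: List["SketchNode"] = field(default_factory=list)


def iter_terminals(node: SketchNode, key: bytes = b"") -> Iterator[Tuple[bytes, str]]:
    """Lazily yield (key, content) for every terminal below node, depth first."""
    key += node.label
    if node.content is not None:
        yield key, node.content
    for child in node.children:
        yield from iter_terminals(child, key)


def find_limited(node: SketchNode, limit: int) -> List[Tuple[bytes, str]]:
    """Stop gathering after `limit` terminals instead of walking the whole subtree."""
    return list(islice(iter_terminals(node), limit))


tree = SketchNode(b"He", children=[
    SketchNode(b"l", children=[SketchNode(b"lo", "P1"), SketchNode(b"a", "P3")]),
    SketchNode(b"y", "P4"),
])
print(find_limited(tree, 2))  # [(b'Hello', 'P1'), (b'Hela', 'P3')]
```

Because the generator is consumed on demand, subtrees after the cutoff are never visited.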
# Testing and Verification
Tests and correctness verification are currently glaringly lacking. Any
improvements there are highly desired.
# Future Extensions
- A unicode layer on top of bytetrie for simpler handling of string keys
- Key deletion
- Make insertion multi-threaded

bytetrie/__init__.py Normal file

@ -0,0 +1,29 @@
"""
Bytetrie
========
A fast, dependency-free implementation of a compressed trie with radix 256.
Bytetrie allows fast prefix search in a large corpus of keys. Each key can
be associated with arbitrary data. The fast lookup times come at the cost of
expensive insertion. A Bytetrie is best used if it can be pre-loaded with data.
Keys and Data
-------------
Keys are byte strings. Therefore, each node in the trie can have up to 256
children. Keys work well with UTF-8 and other encodings as long as the
encoding is consistent and deterministic. I.e. a certain grapheme cluster is
always encoded to the same byte sequence. Every key can be associated with
arbitrary data. Multi-valued bytetries allow associating a sequence of
arbitrary data with every key. Order is not guaranteed.
Usage
-----
.. code :: python
    t = ByteTrie()
    t.insert(b"Hello", "P1")
    t.insert(b"Hi", "P2")
    t.insert(b"Hela", "P3")
    t.find(b"He")  # (Terminal, key) tuples whose content is "P1" and "P3"
"""
from .bytetrie import ByteTrie

bytetrie/bytetrie.py Normal file

@ -0,0 +1,281 @@
from __future__ import annotations
from typing import Sequence, MutableSequence, ByteString, Any, Optional
from abc import ABC, abstractmethod
import logging
log = logging.getLogger(__name__)
class ByteTrie:
    def __init__(self, multi_value=False):
        self.root = Root([])
        self.multi_value = multi_value

    def insert(self, label: ByteString, content: Any):
        log.info(f"Inserting {label} into Trie")
        start = self.root.child_by_common_prefix(label)
        if not start:
            log.debug(f"Creating new terminal for {label} at root")
            new_node = Terminal(label, content, self.root, [], self.multi_value)
            self.root.put_child(new_node)
            return new_node
        log.debug(f"Found match {start} for {label}. Traversing down")
        # Return the created terminal here as well, like the branch above does.
        return self._insert(start, label, content)

    def _insert(self, node, label, content):
        log.info(f"Inserting {label} into Trie at {node}")
        # Case 1: node's label equals the remaining label
        if node.has_label(label):
            log.debug(f"{node} equals {label}. Wrapping node as Terminal.")
            if isinstance(node, Terminal) and not self.multi_value:
                log.warning(f"{node} is already a Terminal. Content will be overwritten.")
            terminal = Terminal.from_child(node, content, self.multi_value)
            node.replace_with(terminal)
            return terminal
        # Case 2: node's label is a proper prefix of the remaining label
        if node.is_prefix_of(label):
            log.debug(f"{node} is prefix of {label}")
            cutoff = node.cut_from(label)
            next_node = node.child_by_common_prefix(cutoff)
            if not next_node:
                log.debug(f"No matching child found for {cutoff}. Creating new child terminal.")
                terminal = Terminal(cutoff, content, node, [], self.multi_value)
                node.put_child(terminal)
                return terminal
            else:
                log.debug(f"Found match {next_node} for {cutoff}. Traversing down.")
                return self._insert(next_node, cutoff, content)
        # Case 3: the remaining label is a proper prefix of node's label
        if node.starts_with(label):
            log.debug(f"{label} is part of {node}. Creating new parent from {label}")
            new_node = Terminal(label, content, node.parent, [], self.multi_value)
            node.replace_with(new_node)
            node.strip_prefix(label)
            new_node.put_child(node)
            return new_node
        # Case 4: both share only a prefix, so split at a new common ancestor
        log.debug(f"{label} and {node} have a common ancestor")
        common_prefix = node.common_prefix(label)
        log.debug(f"Creating new ancestor for {common_prefix}")
        ancestor = Child(common_prefix, node.parent, [])
        node.replace_with(ancestor)
        terminal = Terminal(cut_off_prefix(common_prefix, label), content, ancestor, [], self.multi_value)
        node.strip_prefix(common_prefix)
        ancestor.put_child(terminal)
        ancestor.put_child(node)
        return terminal
    def find(self, prefix):
        node, key = self._find(self.root, prefix, b"")
        return self._get_terminals(node, key)

    def _find(self, node, prefix, key):
        # key is the full key of node. It is tracked separately because the
        # matched node's label may extend beyond the searched prefix.
        cutoff = node.cut_from(prefix)
        log.debug(f"Searching for {cutoff} in {node}")
        child = node.child_by_prefix_match(cutoff)
        if not child and not cutoff:
            return node, key
        elif not child and cutoff:
            log.debug(f"Leftover cutoff {cutoff}. Trying to find node with prefix {cutoff}")
            child = node.child_by_common_prefix(cutoff)
            if not child or not child.starts_with(cutoff):
                return None, key
            log.debug(f"Found child {child} starting with {cutoff}")
            return child, child.extend(key)
        else:  # child must be not None
            log.debug(f"Found node {child} in {node} for {cutoff}. Traversing down.")
            return self._find(child, cutoff, child.extend(key))

    def _get_terminals(self, node, label_builder):
        if not node:
            return []
        collector = []
        if isinstance(node, Terminal):
            collector.append((node, label_builder))
        for child in node.children:
            child_label = child.extend(label_builder)
            collector.extend(self._get_terminals(child, child_label))
        return collector

    def to_dot(self) -> str:
        return "graph {\n\n" + self.root.to_dot() + "\n}"
def has_common_prefix(label: ByteString, other_label: ByteString) -> bool:
    """ Whether label and other_label have a prefix in common. """
    assert label and other_label
    return label[0] == other_label[0]


def common_prefix(label: ByteString, other_label: ByteString) -> ByteString:
    """ Get the common prefix of label and other_label. """
    buffer = bytearray()
    for (a, b) in zip(label, other_label):
        if a == b:
            buffer.append(a)
        else:
            break
    # Return immutable bytes so node labels stay hashable
    return bytes(buffer)


def is_prefix_of(prefix: ByteString, label: ByteString) -> bool:
    """ Whether label starts with prefix """
    if len(prefix) > len(label):
        return False
    for (a, b) in zip(prefix, label):
        if a != b:
            return False
    return True


def find_first(predicate, iterable):
    """ Return the first element in iterable that satisfies predicate or None """
    return next(filter(predicate, iterable), None)


def cut_off_prefix(prefix: ByteString, label: ByteString) -> ByteString:
    """ Cut prefix from start of label. Return rest of label. """
    assert is_prefix_of(prefix, label)
    return bytes(label[len(prefix):])
class Node(ABC):
    def __init__(self, children: MutableSequence[Child]):
        self.children = children

    def child_by_common_prefix(self, label: ByteString) -> Optional[Child]:
        """ Return Child that has a common prefix with label if one exists. """
        def by_common_prefix(child: Child):
            return has_common_prefix(child.label, label)
        return find_first(by_common_prefix, self.children)

    def child_by_prefix_match(self, label: ByteString) -> Optional[Child]:
        """ Return Child whose label is a prefix of the given label if one exists. """
        def by_prefix_match(child: Child):
            return is_prefix_of(child.label, label)
        return find_first(by_prefix_match, self.children)

    def put_child(self, child: Child):
        """ Put child into this node's children, replacing an existing equal child. """
        if child in self.children:
            log.warning(f"Replacing child {child.label}")
            self.remove_child(child)
        child.parent = self
        self.children.append(child)

    def replace_child(self, child: Child, replacement: Child):
        """ Remove child from this node's children and add replacement. """
        self.remove_child(child)
        self.put_child(replacement)

    def remove_child(self, child: Child):
        """ Remove child from this node's children """
        if child not in self.children:
            log.warning(f"Trying to delete {child.label} but it does not exist.")
            return  # list.remove would raise ValueError for a missing child
        self.children.remove(child)

    @abstractmethod
    def dot_label(self) -> str:
        """ Readable label for this node in a dot graph """
        ...

    @abstractmethod
    def dot_id(self) -> str:
        """ Technical id for this node in a dot graph. Must be unique. """
        ...

    @abstractmethod
    def cut_from(self, label: ByteString) -> ByteString:
        """ Cut off node's label considered as prefix from label. """
        ...

    def to_dot(self) -> str:
        s = f'{self.dot_id()} [label="{self.dot_label()}"]\n'
        for child in self.children:
            s += f"{self.dot_id()} -- {child.dot_id()}\n"
            s += child.to_dot()
        return s


class Root(Node):
    def cut_from(self, label: ByteString) -> ByteString:
        return label

    def dot_label(self):
        return "root"

    def dot_id(self):
        return "root"
class Child(Node):
    def __init__(self, label: ByteString, parent: Node, children: MutableSequence[Child]):
        self.label = label
        self.parent = parent
        self.children = children

    def __eq__(self, other_child):
        return (isinstance(other_child, Child)
                and self.label == other_child.label)

    def __hash__(self):
        return hash(self.label)

    def __str__(self):
        return self.label.decode('utf-8', 'replace').replace('"', '\\"')

    def dot_label(self):
        return self.label.decode('utf-8', 'replace').replace('"', '\\"')

    def dot_id(self):
        # Node.dot_id is declared to return str
        return str(id(self))

    def has_label(self, label):
        return self.label == label

    def is_prefix_of(self, label):
        return is_prefix_of(self.label, label)

    def replace_with(self, new_child: Child):
        new_child.parent = self.parent
        self.parent.replace_child(self, new_child)

    def starts_with(self, label: ByteString) -> bool:
        return is_prefix_of(label, self.label)

    def cut_from(self, label: ByteString) -> ByteString:
        """ Cut node's label from (start of) label """
        return cut_off_prefix(self.label, label)

    def strip_prefix(self, prefix: ByteString):
        """ Cut off prefix from node's label """
        self.label = cut_off_prefix(prefix, self.label)

    def extend(self, label: ByteString) -> ByteString:
        """ Extend label by node's label """
        return bytes(label) + bytes(self.label)

    def split_label_at(self, index):
        return (self.label[:index], self.label[index:])

    def contains(self, label):
        if len(label) > len(self.label):
            return False
        for (a, b) in zip(self.label, label):
            if a != b:
                return False
        return True

    def common_prefix(self, label):
        return common_prefix(self.label, label)
class Terminal(Child):
    def __init__(self, label: ByteString, content: Any, parent: Node, children: MutableSequence[Child], multi_value: bool):
        super().__init__(label, parent, children)
        self.multi_value = multi_value
        self.content = [content] if multi_value else content

    @classmethod
    def from_child(cls, child: Child, content: Any, multi_value: bool):
        # multi_value param has no effect if already a Terminal. I.e.
        # from_child cannot change the multi-value state of a child that
        # is already a Terminal
        if isinstance(child, Terminal) and child.multi_value:
            # Create a new Terminal instance. Although not needed this is what is expected
            # and compatible with the non-multi-value behaviour.
            t = cls(child.label, content, child.parent, child.children, child.multi_value)
            t.content.extend(child.content)  # add back original content
            return t
        return cls(child.label, content, child.parent, child.children, multi_value)

    def to_dot(self) -> str:
        s = super().to_dot()
        s += f"{self.dot_id()} [color=blue]\n"
        return s

setup.py Normal file

@ -0,0 +1,26 @@
import setuptools

# Read the README explicitly as UTF-8; it contains non-ASCII characters.
with open("README.md", "r", encoding="utf-8") as fh:
    long_description = fh.read()

# Only the setuptools module is imported, so setup must be called through it.
setuptools.setup(
    name="bytetrie",
    version="0.0.1",
    url="https://git.friedl.net/incubator/bytetrie",
    license="MIT",
    author="Armin Friedl",
    author_email="dev@friedl.net",
    description="A self-compressing radix trie with radix 256 in pure python",
    long_description=long_description,
    long_description_content_type="text/markdown",
    packages=setuptools.find_packages(exclude=("tests",)),
    include_package_data=True,
    classifiers=[
        "License :: OSI Approved :: MIT License",
        "Programming Language :: Python :: 3",
        "Programming Language :: Python :: 3.7",
    ]
)