RangeMap: A Simple Interval Query Datastructure

Maps are a fundamental data structure. Their prevalence is a testament to their importance. Indeed, many search problems can be reduced to the construction of an appropriate map. However, a search problem occasionally arises that is difficult to solve, at least directly, with a map. The interval query problem is one such problem.

Continue reading “RangeMap: A Simple Interval Query Datastructure”

TVLib: A C++ Text Vectorization Library with Python Bindings

I am a big fan of the CountVectorizer class in scikit-learn [1]. With a robust and easy interface that produces (sparse!) matrices, what’s not to love? Well, it’s… pretty… slow…

The performance is okay for 10s of MB of text, but GBs take minutes or more. It terms out that CountVectorizer is implemented in pure Python. The functions are single threaded too. It seems like low-hanging fruit. Just whip up some parallel C++, right? Well, not quite, but I’m getting ahead of myself.

Continue reading “TVLib: A C++ Text Vectorization Library with Python Bindings”