broca
At this year's OpenNews Code Convening, Alex Spangher of the New York Times and I worked on broca, a Python library for rapidly experimenting with new NLP approaches.
Conventional NLP methods - bag-of-words vector space representations of documents, for example - generally work well, but sometimes not well enough, or worse yet, not well at all. At that point, you might want to try out a lot of different methods that aren't available in popular NLP libraries.
Prior to the Code Convening, broca was little more than a hodgepodge of algorithms I'd implemented for various projects. During the Convening, we restructured the library, added some examples and tests, and implemented the key piece of broca: pipelines.
Pipelines
The core of broca is organized around pipes, which take some input and produce some output. Pipes are then chained into pipelines.
Pipes represent different stages of an NLP process - for instance, your first stage may involve preprocessing or cleaning up the document, the next may be vectorizing it, and so on.
In broca, this would look like:
    from broca.pipeline import Pipeline
    from broca.preprocess import Cleaner
    from broca.vectorize import BoW

    docs = [
        # ...
        # some string documents
        # ...
    ]

    pipeline = Pipeline(
        Cleaner(),
        BoW()
    )

    vectors = pipeline(docs)
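Conceptually, chaining pipes amounts to function composition: each pipe's output becomes the next pipe's input. The following is a minimal sketch of that idea only, not broca's implementation; `run_pipeline`, `clean`, and `count_words` are hypothetical names made up for illustration.

```python
from functools import reduce


def run_pipeline(pipes, data):
    """Feed data through each callable in order, passing results along."""
    return reduce(lambda acc, pipe: pipe(acc), pipes, data)


def clean(docs):
    # Hypothetical preprocessing stage: trim whitespace and lowercase.
    return [d.strip().lower() for d in docs]


def count_words(docs):
    # Hypothetical "vectorizing" stage: one number per document.
    return [len(d.split()) for d in docs]


result = run_pipeline([clean, count_words], ['  Hello world ', 'NLP'])
print(result)  # [2, 1]
```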
Since broca is built for rapid prototyping, it makes it easy to try several pipelines at once that differ in only a few components:
    from broca.vectorize import DCS

    pipeline = Pipeline(
        Cleaner(),
        [BoW(), DCS()]
    )
This would produce a multi-pipeline consisting of two pipelines: one which vectorizes using BoW, the other using DCS.
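One way to picture how a list of alternatives turns into several linear pipelines is as a Cartesian product over the stages. The sketch below illustrates that expansion with plain strings standing in for pipes; `expand_branches` is a hypothetical helper, and broca may well build its multi-pipelines differently.

```python
from itertools import product


def expand_branches(stages):
    """Expand stages into linear pipelines; a list marks alternatives."""
    # Wrap single stages so that every stage is a list of options.
    options = [s if isinstance(s, list) else [s] for s in stages]
    # Picking one option per stage yields one linear pipeline.
    return [list(combo) for combo in product(*options)]


pipelines = expand_branches(['Cleaner', ['BoW', 'DCS']])
print(pipelines)  # [['Cleaner', 'BoW'], ['Cleaner', 'DCS']]
```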
Multi-pipelines often have shared components. In the example above, Cleaner() is in both pipelines. To avoid redundant processing, the output of each pipe in a broca pipeline is "frozen" to disk.
These frozen outputs are identified by a hash derived from the input data and other factors. If frozen output exists for a pipe and its input, that frozen output is "defrosted" and returned, saving unnecessary processing time.
This way, you can tweak different components of the pipeline without worrying about needing to re-compute a lot of data. Only the parts that have changed will be re-computed.
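The freeze/defrost idea can be sketched as a small disk cache keyed by a hash of the pipe and its input. This is an illustration under assumptions, not broca's code: `frozen_call`, `CACHE_DIR`, and the choice of SHA-256 over the pickled input are all hypothetical, and the "other factors" that go into broca's hash are not modeled here.

```python
import hashlib
import os
import pickle
import tempfile

CACHE_DIR = tempfile.mkdtemp()  # hypothetical cache location


def frozen_call(name, func, data):
    """Run func(data), reusing a pickled result on disk when one exists."""
    # Key the cache on the pipe's name plus its pickled input (a simplification).
    key = hashlib.sha256(name.encode('utf-8') + pickle.dumps(data)).hexdigest()
    path = os.path.join(CACHE_DIR, key + '.pkl')
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f)  # "defrost" the stored output
    result = func(data)
    with open(path, 'wb') as f:
        pickle.dump(result, f)  # "freeze" the output for next time
    return result


calls = []


def clean(docs):
    calls.append(1)  # record each real computation
    return [d.strip().lower() for d in docs]


first = frozen_call('clean', clean, ['  Hello ', 'World'])
second = frozen_call('clean', clean, ['  Hello ', 'World'])
print(first == second, len(calls))  # True 1
```

The second call never runs `clean`; it reads the stored result instead. Changing either the input or the pipe name produces a different key and triggers a fresh computation.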
Included pipes
broca includes a few pipes:
broca.tokenize includes various tokenization methods, using lemmas and a few different keyword extractors.
broca.vectorize includes a traditional bag-of-words vectorizer, an implementation of "disambiguated core semantics", and Doc2Vec.
broca.preprocess includes common preprocessors - cleaning punctuation, HTML, and a few others.
Other tools
Not everything in broca is a pipe. Also included are:
broca.similarity includes similarity methods for terms and documents.
broca.distance includes string distance methods (this may be renamed later).
broca.knowledge includes some tools for dealing with external knowledge sources (e.g. other corpora or Wikipedia).
At some point, though, these may also become pipes.
Give us your pipes!
We made it really easy to implement your own pipes. Just inherit from the Pipe class, specify the class's input and output types, and implement the __call__ method (that's what's called for each pipe).
For example:
    from broca.pipeline import Pipe

    class MyPipe(Pipe):
        input = Pipe.type.docs
        output = Pipe.type.vecs

        def __init__(self, some_param):
            self.some_param = some_param

        def __call__(self, docs):
            # do something with docs to get vectors
            # (make_vecs_func is a placeholder for your own logic)
            vecs = make_vecs_func(docs, self.some_param)
            return vecs
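One plausible use for declared input and output types is checking that adjacent pipes fit together before any data is processed. Whether broca validates pipelines this way is an assumption on my part; the sketch below uses hypothetical stand-ins (`DataType`, `Stage`, `check_chain`) rather than broca's own classes.

```python
from enum import Enum


class DataType(Enum):
    """Hypothetical stand-in for the types exposed as Pipe.type."""
    docs = 'docs'
    vecs = 'vecs'


class Stage:
    """Hypothetical minimal pipe: a name plus declared input/output types."""

    def __init__(self, name, input, output):
        self.name = name
        self.input = input
        self.output = output


def check_chain(stages):
    """Raise TypeError if a stage's output does not match the next input."""
    for prev, nxt in zip(stages, stages[1:]):
        if prev.output != nxt.input:
            raise TypeError('%s outputs %s but %s expects %s' % (
                prev.name, prev.output.name, nxt.name, nxt.input.name))
    return True


cleaner = Stage('Cleaner', DataType.docs, DataType.docs)
vectorizer = Stage('MyPipe', DataType.docs, DataType.vecs)

check_chain([cleaner, vectorizer])  # compatible: docs -> docs -> vecs
```

Reversing the order, `check_chain([vectorizer, cleaner])`, raises a `TypeError` because vectors would be fed into a stage that expects documents.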
We hope that others will implement their own pipes and submit them as pull requests - it would be great if broca became a repository of sundry NLP methods that makes it easy to quickly try a battery of techniques on a problem.
broca is available on GitHub and also via pip:
pip install broca