broca

· 07.31.2015 · etc

Broca's Area

At this year's OpenNews Code Convening, Alex Spangher of the New York Times and I worked on broca, which is a Python library for rapidly experimenting with new NLP approaches.

Conventional NLP methods - bag-of-words vector space representations of documents, for example - generally work well, but sometimes not well enough, or worse yet, not well at all. At that point, you might want to try out a lot of different methods that aren't available in popular NLP libraries.

Prior to the Code Convening, broca was little more than a hodgepodge of algorithms I'd implemented for various projects. During the Convening, we restructured the library, added some examples and tests, and implemented the key piece of broca: pipelines.

Pipelines

The core of broca is organized around pipes. Each pipe takes some input and produces some output, and pipes are chained together into pipelines.

Pipes represent different stages of an NLP process - for instance, your first stage may involve preprocessing or cleaning up the document, the next may be vectorizing it, and so on.

In broca, this would look like:

from broca.pipeline import Pipeline
from broca.preprocess import Cleaner
from broca.vectorize import BoW

docs = [
    # ...
    # some string documents
    # ...
]

pipeline = Pipeline(
        Cleaner(),
        BoW()
)

vectors = pipeline(docs)

Because broca is built for rapid prototyping, it makes it very easy to try several pipelines at once which differ in only a few components:

from broca.vectorize import DCS

pipeline = Pipeline(
        Cleaner(),
        [BoW(), DCS()]
)

This would produce a multi-pipeline consisting of two pipelines: one which vectorizes using BoW, the other using DCS.
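To make the branching concrete, here is a minimal sketch of how a stage list with a nested list could be expanded into separate linear pipelines. It does not use broca: `expand_branches` is a hypothetical helper, strings stand in for pipe instances, and broca's own implementation may well work differently.

```python
from itertools import product


def expand_branches(stages):
    """Expand a stage list into every linear pipeline it describes.

    A stage that is itself a list is treated as a branch point, so one
    pipeline is produced per alternative. Illustrative only; this is
    not broca's actual implementation.
    """
    # Wrap single stages so every position is a list of alternatives.
    options = [s if isinstance(s, list) else [s] for s in stages]
    # The Cartesian product picks one alternative per position.
    return [list(combo) for combo in product(*options)]


# Strings stand in for pipe instances such as Cleaner(), BoW() and DCS().
pipelines = expand_branches(["clean", ["bow", "dcs"]])
print(pipelines)
```

With more than one branch point, the same approach yields every combination of alternatives, which is why sharing work between the resulting pipelines matters.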

Multi-pipelines often have shared components. In the example above, Cleaner() is in both pipelines. To avoid redundant processing, broca's pipelines "freeze" the output of each pipe to disk.

These frozen outputs are identified by a hash derived from the input data and other factors. If frozen output exists for a pipe and its input, that frozen output is "defrosted" and returned, saving unnecessary processing time.

broca's Cryo

This way, you can tweak different components of the pipeline without worrying about needing to re-compute a lot of data. Only the parts that have changed will be re-computed.
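The freeze-and-defrost idea can be sketched in plain Python. This is an illustration under assumptions, not broca's code: `FrozenStep` is a hypothetical wrapper, the hash here is derived only from the step's function name and its pickled input (broca's hash also covers "other factors" that this sketch leaves out), and the cache lives in a temporary directory.

```python
import hashlib
import os
import pickle
import tempfile


class FrozenStep:
    """Wrap a function so its output is cached on disk, keyed by its input.

    Hypothetical helper for illustration; not part of broca.
    """

    def __init__(self, func, cache_dir):
        self.func = func
        self.cache_dir = cache_dir
        self.calls = 0  # how many times func actually ran

    def _path(self, data):
        # Key on the step's name plus the pickled input. Pickled bytes are
        # stable for simple data like lists of strings, which is all this
        # sketch assumes.
        payload = self.func.__name__.encode() + pickle.dumps(data)
        digest = hashlib.sha256(payload).hexdigest()
        return os.path.join(self.cache_dir, digest + ".pkl")

    def __call__(self, data):
        path = self._path(data)
        if os.path.exists(path):
            # "Defrost": reuse the stored output instead of recomputing.
            with open(path, "rb") as f:
                return pickle.load(f)
        self.calls += 1
        result = self.func(data)
        # "Freeze": store the output for the next call with the same input.
        with open(path, "wb") as f:
            pickle.dump(result, f)
        return result


def lowercase(docs):
    return [d.lower() for d in docs]


cache_dir = tempfile.mkdtemp()
step = FrozenStep(lowercase, cache_dir)
first = step(["Hello World", "NLP"])
second = step(["Hello World", "NLP"])  # served from disk, func is not re-run
```

Because the key depends on the input, changing an upstream step changes what flows into the steps after it, so those are recomputed while untouched steps are read back from disk.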

Included pipes

broca includes a few pipes:

  • broca.tokenize includes various tokenization methods, using lemmas and a few different keyword extractors.
  • broca.vectorize includes a traditional bag-of-words vectorizer, an implementation of "disambiguated core semantics", and Doc2Vec.
  • broca.preprocess includes common preprocessors - cleaning punctuation, HTML, and a few others.

Other tools

Not everything in broca is a pipe. Also included are:

  • broca.similarity includes similarity methods for terms and documents.
  • broca.distance includes string distance methods (this may be renamed later).
  • broca.knowledge includes some tools for dealing with external knowledge sources (e.g. other corpora or Wikipedia).

At some point these may also become pipes.

Give us your pipes!

We made it really easy to implement your own pipes. Just inherit from the Pipe class, specify the class's input and output types, and implement the __call__ method (that's what's called for each pipe).

For example:

from broca.pipeline import Pipe

class MyPipe(Pipe):
    input = Pipe.type.docs
    output = Pipe.type.vecs

    def __init__(self, some_param):
        self.some_param = some_param

    def __call__(self, docs):
        # do something with docs to get vectors;
        # make_vecs_func is a placeholder for your own vectorizing function
        vecs = make_vecs_func(docs, self.some_param)
        return vecs
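This post doesn't spell out how the declared input and output types are used. One plausible use, sketched below purely as an assumption, is checking that adjacent pipes fit together before anything runs. `DataType`, `Stage` and `check_chain` are hypothetical names for illustration and are not broca's API.

```python
from enum import Enum


class DataType(Enum):
    # Hypothetical stand-ins for values like Pipe.type.docs and Pipe.type.vecs.
    DOCS = "docs"
    TOKENS = "tokens"
    VECS = "vecs"


class Stage:
    """Minimal stand-in for a pipe that declares what it consumes and produces."""

    def __init__(self, name, in_type, out_type):
        self.name = name
        self.input = in_type
        self.output = out_type


def check_chain(stages):
    """Raise TypeError if a stage's output does not match the next stage's input."""
    for prev, nxt in zip(stages, stages[1:]):
        if prev.output != nxt.input:
            raise TypeError(
                f"{prev.name} produces {prev.output.value}, "
                f"but {nxt.name} expects {nxt.input.value}"
            )
    return True


cleaner = Stage("cleaner", DataType.DOCS, DataType.DOCS)
vectorizer = Stage("vectorizer", DataType.DOCS, DataType.VECS)

check_chain([cleaner, vectorizer])  # passes: docs -> docs -> vecs
```

Swapping the two stages would raise a TypeError, since the vectorizer's vectors are not the documents the cleaner expects.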

We hope that others will implement their own pipes and submit them as pull requests - it would be great if broca became a repository of sundry NLP methods that makes it super easy to quickly try a battery of techniques on a problem.

broca is available on GitHub and also via pip:

pip install broca