Reel Two Classification System compares favorably with popular document search systems.
Jan 1, 2003
Text Mining: Help is On the Way
Poor Dataslave. We've all been there,
trying to cull a few interesting documents from a large pile of
boring ones. The only solution, until recently, was brain grease:
read every document - the abstracts, anyway - and make a judgment.
But that's starting to change. Software
vendors have felt our pain and are busily at work crafting new
products that automate literature searching. There are many choices:
Some products try to understand the content; others try to learn
what you're looking for by watching you classify a small number of
examples; some analyze the entire literature, creating yet another
database for you to search; others work on subsets that you select;
others build on top of these products and let you combine analyses
from multiple tools. There are some open-source packages, as well.
Dataslave needed some help, so I grabbed
one product and two open-source tools and went to work. Before
long, I had the answer and some confidence that it was right. The
software helped and I'm glad I had it, but it wasn't as easy as I
had hoped.
Where to Start
The starting point for any literature
search is, for most of us, Medline. I generally access Medline
through the Entrez query system operated by NCBI. The European
Bioinformatics Institute also provides access through SRS, a data
query and integration system originally developed at EBI and now
commercialized by Lion Bioscience.
The US National Library of Medicine's
Medline is an indexed database of biomedical articles and abstracts.
"Indexed" is used here in its traditional literary sense: human
experts assign index terms, such as "mice" or "Huntington's
disease," to each citation. Readers can retrieve articles by index
terms, as well as by full-text search and by other fields. The index
terms are drawn from a hierarchical, controlled vocabulary called
Medical Subject Headings.
NCBI offers PubMed, a superset of
Medline that includes some non-biomedical articles from journals
that are indexed by Medline, articles from journals that submit full
text to PubMedCentral, and articles from journals now covered by
Medline that were published before coverage began.
In simple cases, the query syntax of
Entrez / PubMed looks a lot like a web search engine: just type in
some words and it'll do something sensible. But beware. There's more
going on than meets the eye.
The first step is automatic term
recognition: the software looks in a database of common biomedical
phrases and translates your query into what it thinks you meant. In
my experience, this usually does the right thing. But it doesn't
always work as expected, and you have to be fairly attentive to
avoid turning the feature on or off inadvertently in complex
queries. Fortunately, PubMed provides a "details" button you can hit
to see how the software translated your query.
PubMed also lets you specify query terms
by field - to search for articles by a particular author or to look
for words that appear in the title or abstract. You can also combine
search terms using AND, OR, NOT, and parentheses. There's a great
PubMed tutorial online with many more details.
An important subtlety is that the
Medline indexers, while quite good, do mis-index some documents. A
related problem is that documents can sit in the database for a
while before the indexers get to them. To capture such documents,
you have to add search terms that look for words that you expect to
find in the documents you want. This is definitely a double-edged
sword, as such words will often pick up documents that are only
marginally related to your question.
SRS processes queries differently from
PubMed and often gives different results for seemingly equivalent
queries.
Grabbing the Big Pile
The query I used for this article is:
Huntingtons disease AND (mice [Title/Abstract] OR mouse
[Title/Abstract] OR murine [Title/Abstract]). The leading phrase
"Huntingtons disease" triggers automatic term recognition and is
translated into "Huntingtons disease [MeSH Terms] OR Huntingtons
disease [Text Word]."
The overall query finds documents that
are indexed under the term "Huntington's disease" or that contain
the phrase "Huntington's disease" in the title, abstract, or other
text fields of the entry, and that contain the words "mice,"
"mouse," or "murine" in the title or abstract. This query found 261
documents when I ran it to prepare this article.
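
The structure of that query (one free-text phrase ANDed with a group of field-tagged alternatives) can be sketched in a few lines of Python. This is only an illustration of how the pieces fit together; the helper names tagged, any_of, and all_of are made up for this example and are not part of any PubMed or NCBI library.

```python
def tagged(term, field=None):
    """Attach a PubMed-style field tag such as [Title/Abstract] to a term."""
    return f"{term} [{field}]" if field else term


def any_of(terms):
    """Join alternatives with OR and wrap them in parentheses."""
    return "(" + " OR ".join(terms) + ")"


def all_of(terms):
    """Join required clauses with AND."""
    return " AND ".join(terms)


# The three species words are restricted to the title or abstract.
species = any_of(tagged(w, "Title/Abstract") for w in ("mice", "mouse", "murine"))

# The leading phrase is left untagged so automatic term recognition can act on it.
query = all_of(["Huntingtons disease", species])
print(query)
```

Leaving the disease phrase untagged is deliberate: as described above, that is what lets PubMed expand it into the MeSH and text-word searches.
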
This proved to be a good starting point
for Dataslave's task, far better than the clunker Carbonoid dropped
on him. Naturally, queries that differ even slightly may produce
radically different answers. I tried lots of formulations before
settling on the one above.
For verification purposes, I also worked
with a more expansive query - (Huntingtons disease OR Huntingtons
[Title/Abstract]) AND (mice OR mouse [Title/Abstract] OR murine
[Title/Abstract]) - that retrieved 445 documents.
I downloaded the search results from
PubMed using the "save" button they so nicely provide. This produces
a text file containing the basic citation information for each
document.
The next step was to get the abstracts.
I planned to use BioPerl for this, but the PubMed interface wasn't
quite ready when I needed it. Fortunately, NCBI provides an easy way
to download batch datasets from Entrez, called Entrez Utilities
(E-Utilities). You just have to cobble together a URL that lists the
identifiers of the entries you want, send it to NCBI (via the Perl
GET utility or LWP module, for example), and they send back your
data in XML, HTML, or text.
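
As a rough sketch of what "cobbling together a URL" looks like, here is a Python version of the same idea (the work described here was done in Perl). It assumes the present-day EFetch endpoint at eutils.ncbi.nlm.nih.gov and the db, id, and retmode parameters; the two identifiers are placeholders, not real results from the search, and build_efetch_url is a name invented for this example.

```python
from urllib.parse import urlencode

# Assumed endpoint: the current public address of the EFetch utility.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(pmids, retmode="xml"):
    """Build a URL that asks EFetch for the given PubMed identifiers."""
    params = {
        "db": "pubmed",          # which Entrez database to query
        "id": ",".join(pmids),   # comma-separated list of identifiers
        "retmode": retmode,      # format of the reply, e.g. xml or text
    }
    # safe="," keeps the commas readable instead of percent-encoding them.
    return EFETCH_URL + "?" + urlencode(params, safe=",")


# Placeholder identifiers, for illustration only.
url = build_efetch_url(["11111111", "22222222"])
print(url)

# Sending the request is then one more call (not run here):
#   from urllib.request import urlopen
#   xml_text = urlopen(url).read().decode("utf-8")
```

For a few hundred identifiers it is kinder to the server to split the list into batches and pause between requests.
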
I got the abstracts in XML and processed
them with BioPerl's Biblio module. Biblio can split the XML stream
into separate records for each abstract, and provides functions for
grabbing fields like authors, title, body of the abstract, etc. I
wrote a little script that created a separate text file for each
abstract containing the information I cared about.
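
The splitting step can be sketched with nothing but Python's standard library. This is not the BioPerl Biblio script described above, just an equivalent under stated assumptions: the element names (PubmedArticle, MedlineCitation, PMID, ArticleTitle, AbstractText, Author) follow the PubMed XML layout, and the two sample records are invented placeholders.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

# Invented sample data in the general shape of a PubMed XML download.
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle><MedlineCitation>
    <PMID>11111111</PMID>
    <Article>
      <ArticleTitle>Placeholder title one</ArticleTitle>
      <Abstract><AbstractText>Placeholder abstract one.</AbstractText></Abstract>
      <AuthorList>
        <Author><LastName>Doe</LastName><Initials>J</Initials></Author>
      </AuthorList>
    </Article>
  </MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation>
    <PMID>22222222</PMID>
    <Article><ArticleTitle>Placeholder title two</ArticleTitle></Article>
  </MedlineCitation></PubmedArticle>
</PubmedArticleSet>"""


def parse_records(xml_text):
    """Split the XML into one dictionary per article."""
    records = []
    for article in ET.fromstring(xml_text).iter("PubmedArticle"):
        citation = article.find("MedlineCitation")
        authors = [
            f"{a.findtext('LastName', '')} {a.findtext('Initials', '')}".strip()
            for a in citation.iter("Author")
        ]
        # An abstract may have several sections, or be missing entirely.
        abstract = " ".join(
            "".join(part.itertext()).strip() for part in citation.iter("AbstractText")
        )
        records.append({
            "pmid": citation.findtext("PMID"),
            "title": citation.findtext("Article/ArticleTitle", ""),
            "authors": authors,
            "abstract": abstract,
        })
    return records


def to_text(record):
    """Lay out the fields of one record as plain text."""
    header = [
        f"PMID: {record['pmid']}",
        f"Title: {record['title']}",
        f"Authors: {', '.join(record['authors'])}",
    ]
    return "\n".join(header + ["", record["abstract"]]) + "\n"


records = parse_records(SAMPLE_XML)
out_dir = Path(tempfile.mkdtemp())   # one text file per abstract goes here
for record in records:
    (out_dir / f"{record['pmid']}.txt").write_text(to_text(record), encoding="utf-8")
print(sorted(p.name for p in out_dir.iterdir()))
```

Records without an abstract still get a file, with an empty body, so they can be spotted and handled by hand.
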
Finding the Gems
This is the point where I turned to a
fancy new knowledge-mining tool, Classification System from Reel
Two. My goal was to divide the big pile of documents in two - into a
small pile of papers about mouse drug studies in Huntington's
disease, and a larger pile about other things.
I should mention that although this
example only involves binary classification - drug studies vs. other
- Classification System is more general. It can handle any number of
classes, the classes can be arranged hierarchically, and documents
can be placed into multiple classes.
Reel Two's CS uses an approach called
machine learning - a rather grand term for a simple idea. You give
the program examples of documents that fall into each class, and it
builds a mathematical model that can reproduce this categorization.
Then you feed in new documents, and the program uses the model to
put each document into the best class. The initial examples are
called the training set and the new ones are called the test
set.
I also tried two open-source packages
based on similar methods: Rainbow by Andrew McCallum, now at the
University of Massachusetts at Amherst, and AI::Categorize by Ken
Williams, a Perl guru and the original Dr. Math of the Math Forum.
The central modeling method in all three
programs is naive Bayesian inference. Here's the basic idea: To
build the model, the program calculates a word profile for each
class that tells how often each word appears in the class's
documents. To use the model, the program finds the class whose
profile best matches the profile of a new document, using Bayes's
theorem to calculate "best." It takes a lot of tricks to turn this
basic idea into a practical program.
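
To make the basic idea concrete, here is a toy naive Bayes classifier in Python. It is a minimal sketch, not the method used by any of the three programs: splitting on whitespace, add-one smoothing (one of the simpler "tricks"), ignoring unseen words, and the five made-up training sentences are all assumptions chosen to keep the example short.

```python
import math
from collections import Counter


def train(examples):
    """Build a word profile (word counts) for each class from (label, text) pairs."""
    word_counts = {}
    doc_counts = Counter()
    for label, text in examples:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, doc_counts, vocab


def classify(model, text):
    """Return the class whose profile best matches the text, per Bayes's theorem."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        # Start from the log prior: how common the class is in the training set.
        score = math.log(doc_counts[label] / total_docs)
        for word in text.lower().split():
            if word not in vocab:
                continue  # words never seen in training carry no evidence here
            # Add-one smoothing keeps a zero count from vetoing the whole class.
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Made-up training examples, for illustration only.
training_set = [
    ("positive", "drug treatment improved survival in transgenic mice"),
    ("positive", "creatine treatment delayed symptoms in mice"),
    ("negative", "gene expression changes in transgenic mice"),
    ("negative", "protein aggregates observed in mouse brain"),
    ("negative", "huntingtin mutation alters gene expression"),
]
model = train(training_set)
print(classify(model, "drug treatment delayed symptoms"))
print(classify(model, "gene expression in mouse brain"))
```

Working in log probabilities avoids multiplying many tiny numbers together, which matters once documents run to hundreds of words.
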
CS is a much more polished program than
the other two. It's written in Java and has an easy-to-use,
point-and-click user interface. Rainbow, written in C, is the most
full-functioned of the bunch, providing a wide range of modeling
methods (only some of which seem to work), and the most detailed
output reports as to what the model is doing. AI::Categorize is a
Perl module and is the easiest choice if you're looking for
something to plug into a Perl script.
I trained each program on eight positive
papers provided by a colleague and 25 arbitrarily chosen negatives.
I used the same negatives for each program and, of course, manually
checked to make sure they were true negatives.
I also went through the entire dataset
by hand so I would know the correct answer. I found 17 papers that
clearly belonged to the positive class, and another nine that were
marginal in that they were about non-drug therapeutics -
transplantation, gene therapy, and environmental enrichment. I
decided to expand my positive definition to include all 26 of these
papers.
To verify that my initial PubMed search
was reasonable, I ran the more expansive query mentioned above and
manually classified the extra 184 documents it found. Only four of
these fit the positive definition, and only one was really relevant.
(The other three were commentaries and such.)
I then ran the entire dataset through
each program to see how many documents it could classify correctly.
I quickly learned that none of the programs could do a complete job
in one try.
The typical outcome was to predict 10 or
20 new positives, of which maybe half were correct. This isn't what
I had hoped for, but on reflection I realized it wasn't so bad.
Since only 10 percent of the documents in the initial dataset were
true positives, having the program find a class that's 50 percent
positive is a considerable step forward.
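
The arithmetic behind that claim is short enough to spell out. The figures come from the counts above (261 documents, 26 positives); the 0.5 is the rough "maybe half" estimate, not a measured value.

```python
total_docs = 261            # documents returned by the initial query
true_positives = 26         # positives found by reading everything by hand

# Chance that a document picked at random from the pile is a positive.
base_rate = true_positives / total_docs

# Rough fraction of the program's predicted positives that were correct.
predicted_precision = 0.5

# How much richer the predicted class is than the original pile.
enrichment = predicted_precision / base_rate

print(f"base rate: {base_rate:.1%}")
print(f"enrichment: about {enrichment:.0f}x")
```

In other words, reading the predicted class turns up a relevant paper about five times as often as reading the pile at random.
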
So, I switched to an iterative strategy.
I trained each program as above. Then I ran the program to get an
enriched class of predicted positives. Then I manually classified
the predicted positives as true positives or true negatives, adding
each to the training set. Then I retrained the program and did it
all again, and again and again until the results no longer changed.
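
The loop can be sketched as follows. Everything in this example is a stand-in: the "classifier" is a deliberately crude word-overlap rule rather than any of the three programs, the six one-line documents are invented, and the human judgment is simulated with a lookup table. Stopping when a round predicts no new positives is one reading of "until the results no longer changed."

```python
def train(texts, labels):
    """Stand-in model: words seen in positive examples but never in negatives."""
    pos, neg = set(), set()
    for doc_id, text in texts.items():
        (pos if labels[doc_id] else neg).update(text.split())
    return pos - neg


def predict(model, text):
    """Predict positive if the document shares any word with the model."""
    return bool(model & set(text.split()))


def refine(documents, seed_labels, ask_human, max_rounds=20):
    """Retrain, predict, hand-check the predictions, and repeat."""
    labels = dict(seed_labels)
    rounds = 0
    while rounds < max_rounds:
        rounds += 1
        model = train({d: documents[d] for d in labels}, labels)
        predicted = [d for d in documents
                     if d not in labels and predict(model, documents[d])]
        if not predicted:
            break  # nothing new was predicted, so the results have settled
        for doc_id in predicted:
            # Each checked prediction joins the training set, right or wrong.
            labels[doc_id] = ask_human(doc_id)
    return labels, rounds


# Invented documents; "a" and "b" are the initial training examples.
documents = {
    "a": "creatine treatment in mice",
    "b": "gene expression in mice",
    "c": "creatine diet extends survival",
    "d": "riluzole extends survival",
    "e": "treatment guidelines review",
    "f": "protein aggregates in brain",
}
truth = {"c": True, "d": True, "e": False, "f": False}  # simulated human judgment

labels, rounds = refine(documents, {"a": True, "b": False}, lambda d: truth[d])
positives = {d for d, is_positive in labels.items() if is_positive}
print(sorted(positives), rounds)
```

Note that document "d" is only found in the second round, after "c" has been confirmed and added to the training set. That is the point of iterating: each round of hand-checking teaches the model vocabulary it could not have learned from the original examples.
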
Using this approach, Reel Two's product
got to the correct answer in four iterations. Neither open-source
program got all the way; each got stuck at 16 answers.
This new literature-searching software
is not a panacea, but it certainly helps. I was hoping for an
automatic solution, but instead had to resort to an iterative
approach. Reel Two's Classification System outperformed the two
open-source packages, getting the correct answer after a few
iterations. It was harder than I wanted, but at least it
worked.
The literature is a very messy dataset.
I suspect the products will need a few more revs to really get their
arms around it, and I hope the vendors have enough patience (and
money) to stick with it. In the meantime, Dataslave - and many of us
- can look forward to reading lots of boring abstracts.