Linguistics Meets Linux: Morphix-NLP
Natural Language Processing
NLP (Natural Language Processing) has been an important research field for
about 40 years. Combining methods from linguistics and computer science, this
broad topic has many different application areas as well as important problems to solve.
Analysing huge files in different languages, discovering patterns, making transformations,
categorizing text, analysing and synthesizing speech, making computers understand
the semantics of language data, etc. are all related to NLP in one way or another.
As universal data processing machines, computers are an essential part of linguistic
analysis. Many different kinds of software have been developed for diving into the
deep structures of words, sentences, stories, etc. Anybody interested in this
kind of study is going to need some kind of integrated platform to start with.
The traditional method is to search for software and then, after finding the best and most
up-to-date tools of the trade, to download, compile and configure them. Even the beginning
of this process, namely the search, can be a burden in itself, let alone downloading,
compiling and configuring the tools one by one. However, life can be easier for NLP newbies. Much easier.
Booting the PC For Language Analysis
Zhang Le, a Chinese scientist working on NLP, has decided to pack the most important
language analysis and processing applications into a single bootable CD: Morphix-NLP.
More than 640 MB of NLP-specific software is included and there's still a lot of room left
on the CD, which uses a compressed filesystem to bring us the best of both worlds.
The CD is based on Morphix, which is based on the famous Knoppix, which in turn is based on
the Debian GNU/Linux distribution. What does that mean? It means that after downloading
the ISO and burning the CD you can simply boot your machine with it. Within a few
seconds you'll find yourself at the familiar bash command line. A few lines of text
printed above the prompt give you hints about how to read the documentation for the
CD and how to enter the GUI, which is a nice-looking and easy-to-use XFce
desktop environment.
After starting the shiny XFce GUI by issuing the startx command you'll be
looking at the main screen, which is similar to this one. Morphix-NLP
has adequate documentation in HTML, which can be viewed by simply clicking on the book
icon at the leftmost part of the icon panel at the bottom. The Mozilla web browser
(actually Phoenix) will guide you through the rest of the documentation.
If you are one of those impatient people who like to discover systems by quick and
dirty methods like trial and error, you'll notice that the XFce icon panel contains
some linguistic applications that can be invoked directly.
These are the Link Grammar Parser, AntConc 2.2, the WordNet Browser and
the CMU Festival Speech Synthesis System.
To get a taste of what these do, just enter an English sentence into the Link
Grammar Parser window and watch it in action, parsing your sentence into subjects,
verbs, objects, relative clauses, etc. Then visit the WordNet browser window, enter
some word to search for (e.g. "language" ;-) and, after finding its meanings, start
an interesting search for the other concepts that this word is related to, simply
by clicking on the buttons below the search box. "What are the parts of an engine?",
"What takes an engine as a part of itself?": these kinds of questions and many others
can be answered easily by using WordNet, a famous cross-linked, associative system
which is used in many different projects, including ones related to Artificial Intelligence,
Cognitive Science, etc. This graphical WordNet browser is only one of the applications.
Since Morphix-NLP includes programming environments like Python, Perl and Tcl/Tk, you have
the chance to experiment with the system from your own programs using one of those
programming languages.
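To make the "part of" questions above concrete, here is a minimal Python sketch of how such a relation can be stored and queried in both directions. It does not use WordNet itself: the PART_OF table and the function names are made up for illustration, and the handful of entries are sample data, not taken from the WordNet database.

```python
# Hypothetical sample data: each part maps to the wholes it belongs to.
# These entries are illustrative only and are not extracted from WordNet.
PART_OF = {
    "piston": ["engine"],
    "cylinder": ["engine"],
    "engine": ["car", "aeroplane"],
    "wheel": ["car"],
}


def parts_of(whole):
    """Answer "What are the parts of X?" by scanning the table."""
    return sorted(part for part, wholes in PART_OF.items() if whole in wholes)


def wholes_of(part):
    """Answer "What takes X as a part of itself?" by direct lookup."""
    return sorted(PART_OF.get(part, []))


if __name__ == "__main__":
    print("parts of engine:", parts_of("engine"))
    print("engine is part of:", wholes_of("engine"))
```

The real WordNet stores many more relation types (synonyms, more general and more specific terms, and so on), but the idea of following labelled links between concepts is the same.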
We all know that language is not limited to writing and reading; everything
started with "speech", which made us "human", and Morphix-NLP stresses this point
by including a famous software-based speech synthesizer developed at the University of Edinburgh.
Simply clicking on the loudspeaker icon on the panel will invoke the system and it
will greet you by speaking (!) to you and saying "Welcome to the Morphix NLP Live CD" ;-)
Dive into this simple-looking but very complex and capable system to learn how you
can integrate smooth-voiced dialogs into your games, user interfaces, etc.
What you've experienced up to this point is just the tip of the iceberg. The real
treasure is not placed under fancy icons, so you'll have to browse the
documentation and explore what lies beneath. Your efforts will be rewarded.
The journey into the real world of NLP starts with tokenisers. A tokeniser is a piece of
software that splits a text into its component elements. These are typically individual
words, but also punctuation marks and other symbols which are not normally considered to
be words. The collective term for these elements that make up a text is tokens. So, the
tokeniser takes a text as input and splits it into its tokens. This is usually done
by inserting separators, either blank spaces or line breaks, so that subsequent
programs (like a part-of-speech tagger) can easily read in the tokens
and process them further.
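As a rough illustration of the idea (not the tokeniser shipped on the CD), a few lines of Python are enough for a first attempt. The regular expression and the function names below are my own assumptions; real tokenisers deal with abbreviations, numbers, URLs and many other special cases.

```python
import re

# A token is either a word (optionally with internal apostrophes or hyphens,
# as in "isn't" or "left-corner") or a single non-space symbol.
TOKEN_PATTERN = re.compile(r"\w+(?:['-]\w+)*|[^\w\s]")


def tokenise(text):
    """Split a text into a list of word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


def one_token_per_line(text):
    """Separate the tokens with line breaks, ready for the next program."""
    return "\n".join(tokenise(text))


if __name__ == "__main__":
    print(one_token_per_line("Hello, world! Isn't NLP fun?"))
```

Writing one token per line is a common convention because the next tool in the chain can then read its input line by line without knowing anything about punctuation rules.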
Next come part-of-speech taggers. They are programs that read text and, for each token
in the text, return its part of speech (e.g. noun, verb, punctuation, etc.). And you are
not limited to English. These programs can actually be adapted to any language you
work on, and you'll find working examples for English, German and Italian.
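To show what goes in and what comes out of a tagger, here is a deliberately naive Python sketch. It is only a toy: the tiny LEXICON, the tag names and the suffix rules are assumptions made for this example, whereas the taggers on the CD are trained on annotated text and are far more accurate.

```python
import string

# Hypothetical mini-lexicon; a real tagger learns this from a tagged corpus.
LEXICON = {
    "the": "DET",
    "a": "DET",
    "dog": "NOUN",
    "barks": "VERB",
}


def tag_token(token):
    """Guess a part of speech for one token: lexicon first, then crude rules."""
    lower = token.lower()
    if lower in LEXICON:
        return LEXICON[lower]
    if all(ch in string.punctuation for ch in token):
        return "PUNCT"
    if lower.endswith("ly"):
        return "ADV"
    if lower.endswith("ing") or lower.endswith("ed"):
        return "VERB"
    return "NOUN"  # the most common open class is a reasonable default


def tag(tokens):
    """Return a list of (token, tag) pairs."""
    return [(token, tag_token(token)) for token in tokens]


if __name__ == "__main__":
    for token, pos in tag(["The", "dog", "barks", "loudly", "."]):
        print(token, pos)
```

Adapting such a tagger to another language means replacing the lexicon and the rules, which is roughly what training on a corpus of that language does automatically.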
At this point things are going to get more complex. The third part of the documentation
describes a few very important language parsing systems. You've already played
with one of them: the Link Grammar Parser. But you're not limited to that. There are
many different grammar theories that try to formalise the structure of languages like
English, and you can get an idea of some of them using Morphix-NLP. As an example
you can experiment with LoPar: a left corner parser for head-lexicalised probabilistic
context-free grammars. What? Didn't understand a single word? Hmm, well I thought
you were interested in linguistics! ;-) Go and start reading some Chomsky.
A scientific field without any statistics is a pretty rare animal and linguistic analysis is
no exception in this respect. So the fourth part of the documentation is about
the statistical tools for analysing linguistic data, the art of looking for
patterns and deducing rules to generalise them. Word frequency lists, vocabularies,
word bigram and trigram counts, and vocabulary-specific word bigram and trigram counts will
be what you're looking for while you are involved with this kind of software. "What
words come in pairs?", "What is the correlation between them?", "Does the word 'patient'
augur the imminent appearance of 'drug'?" will be the kind of questions you'll ask
and try to answer by analysing documents consisting of millions of words.
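The counting itself is simple enough to sketch in Python. The example below counts adjacent word pairs (bigrams) and computes pointwise mutual information (PMI), one common way of measuring how strongly two words are associated. Choosing PMI, and the function names, are my own assumptions; the tools on the CD may use other measures.

```python
import math
from collections import Counter


def bigram_counts(tokens):
    """Count every pair of adjacent tokens."""
    return Counter(zip(tokens, tokens[1:]))


def pmi(tokens, first, second):
    """Pointwise mutual information of the bigram (first, second), in bits.

    Positive values mean the pair occurs more often than expected if the two
    words were independent. Unseen pairs get minus infinity.
    """
    unigrams = Counter(tokens)
    bigrams = bigram_counts(tokens)
    pair_count = bigrams[(first, second)]
    if pair_count == 0:
        return float("-inf")
    p_pair = pair_count / (len(tokens) - 1)
    p_first = unigrams[first] / len(tokens)
    p_second = unigrams[second] / len(tokens)
    return math.log2(p_pair / (p_first * p_second))


if __name__ == "__main__":
    text = "the patient took the drug and the patient slept"
    tokens = text.split()
    print(bigram_counts(tokens).most_common(3))
    print(pmi(tokens, "the", "patient"))
```

On a nine-word sentence the numbers mean very little; the same code becomes interesting when the token list comes from a corpus of millions of words. The 'patient' and 'drug' question would normally be asked about words appearing near each other rather than strictly side by side, which needs a window wider than the adjacent pairs counted here.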
And the final part is dedicated to one of the most ambitious goals of NLP: machine
learning. This is where you'll find the tools to create systems that can
learn how to classify thousands of text documents into sensible categories, handle word
sense disambiguation, evaluate text data according to your criteria, etc.
Here you'll see that Zhang Le is not only the creator of the CD but also the developer
of an important machine learning toolkit: the Maximum Entropy Modeling Toolkit.
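To give a feeling for what maximum entropy text classification does, here is a small self-contained Python sketch. It is not the API of Zhang Le's toolkit: the function names, the bag-of-words features and the plain gradient ascent training loop are all assumptions chosen to keep the example short. The model is a conditional maximum entropy model (also known as multinomial logistic regression), where each (word, label) pair has one weight.

```python
import math
from collections import defaultdict


def predict_proba(weights, labels, words):
    """Return P(label | words) for every label under the current weights."""
    scores = {
        label: sum(weights.get((word, label), 0.0) for word in words)
        for label in labels
    }
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {label: math.exp(score - top) for label, score in scores.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


def train_maxent(samples, epochs=200, learning_rate=0.5):
    """Fit weights by gradient ascent on the conditional log-likelihood.

    samples is a list of (list_of_words, label) pairs.
    """
    labels = sorted({label for _, label in samples})
    weights = defaultdict(float)
    for _ in range(epochs):
        gradient = defaultdict(float)
        for words, gold in samples:
            probs = predict_proba(weights, labels, words)
            for label in labels:
                # Gradient = observed feature count minus expected count.
                delta = (1.0 if label == gold else 0.0) - probs[label]
                for word in words:
                    gradient[(word, label)] += delta
        for key, value in gradient.items():
            weights[key] += learning_rate * value / len(samples)
    return weights, labels


def classify(weights, labels, words):
    """Pick the most probable label for a list of words."""
    probs = predict_proba(weights, labels, words)
    return max(probs, key=probs.get)


if __name__ == "__main__":
    training = [
        ("goal match team score".split(), "sport"),
        ("team wins match".split(), "sport"),
        ("parser grammar sentence".split(), "linguistics"),
        ("grammar word sentence tagger".split(), "linguistics"),
    ]
    weights, labels = train_maxent(training)
    print(classify(weights, labels, ["grammar", "parser"]))
```

Production toolkits use faster optimisers and regularisation, but the principle is the same: weights are adjusted until the feature counts the model expects match the counts observed in the training data.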
Conclusion: A Few Points To Be Considered
Zhang Le has done a cool job of bringing us a lot of important natural language processing
software in an easy to use and evaluate format under the name of Morphix-NLP.
The documentation is the most important guide for this CD, and even though I think its
structure is clear and concise, I'm sure there's room for improvement. Giving hints about
running every program is a nice idea, and I believe putting pointers to related documents
on the CD and explaining to people how to view them (or making them viewable directly in
the browser) will bring it one step closer to perfection. During our personal
communication Zhang said that Wiki-style, interactively developed live documentation
could be a good idea, and I completely agree with that.
A very important type of software that seems to be missing from this system is speech recognition.
It is cool to hear my PC speak to me, but the real fun begins when I can speak back
and see it transcribing my speech, taking actions accordingly, etc. "Hey, computer, can
you hear me?" ;-)
Another NLP software category is NL to SQL translation systems. I mean natural language
query systems that accept questions in one of the natural languages, map this input to
a valid SQL statement and bring back the result from a database. "Hey computer, what are
the names of students that passed ling-101 course last semester?"
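A toy version of such a system can be sketched in Python with the standard sqlite3 module. Everything specific here is hypothetical: the results table, its columns, the sample rows and the single question pattern are made up for illustration, and the table is assumed to hold only last semester's results. A real natural language query system would need a parser and a much richer mapping to the database schema.

```python
import re
import sqlite3

# The single question shape this toy understands (hypothetical pattern).
QUESTION = re.compile(
    r"what are the names of students that passed "
    r"(?P<course>[\w-]+) course last semester\??",
    re.IGNORECASE,
)

# Parameterised SQL for that question; the schema is an assumption.
SQL = "SELECT name FROM results WHERE course = ? AND passed = 1 ORDER BY name"


def question_to_sql(question):
    """Map a question to an SQL statement and its parameters."""
    match = QUESTION.fullmatch(question.strip())
    if match is None:
        raise ValueError("question not understood")
    return SQL, (match.group("course").lower(),)


def build_demo_db():
    """Create an in-memory database with made-up sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE results (name TEXT, course TEXT, passed INTEGER)")
    conn.executemany(
        "INSERT INTO results VALUES (?, ?, ?)",
        [
            ("Ayse", "ling-101", 1),
            ("Bob", "ling-101", 0),
            ("Chen", "ling-101", 1),
            ("Dana", "math-201", 1),
        ],
    )
    return conn


def answer(conn, question):
    """Translate the question, run the query and return the names."""
    sql, params = question_to_sql(question)
    return [row[0] for row in conn.execute(sql, params)]


if __name__ == "__main__":
    db = build_demo_db()
    print(answer(db, "What are the names of students that passed "
                     "ling-101 course last semester?"))
```

Passing the course name as a query parameter instead of pasting it into the SQL string keeps user input from being interpreted as SQL.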
I'd like to thank Zhang Le for this innovative idea and for spending the time to create
a system dedicated to NLP research. I also want to thank him for the nice CD cover design,
which makes the CD case look quite cool. While I'm buried deep in the applications provided
with Morphix-NLP, I'll be waiting for the next version.
Emre "FZ" Sevinc
2003-12-08