Linguistics Meets Linux: Morphix-NLP

Natural Language Processing

NLP (Natural Language Processing) has been an important research field for about 40 years. Combining methods from linguistics and computer science, this broad field has many application areas as well as important problems still to be solved. Analysing huge files in different languages, discovering patterns, transforming and categorising text, analysing and synthesising speech, making computers understand the semantics of language data: all of these are related to NLP in one way or another.

As universal data processing machines, computers are an essential part of linguistic analysis. Many different kinds of software have been developed for diving into the deep structures of words, sentences, stories, etc. Anybody interested in this kind of study is going to need some kind of integrated platform to start with. The traditional method is to search for software and then, after finding the best and most up-to-date tools of the trade, to download, compile and configure them. Even the first step, the search itself, can be a burden, let alone downloading, compiling and configuring the tools one by one. However, life can be easier for NLP newbies. Much easier.

Booting the PC For Language Analysis

Zhang Le, a Chinese scientist working on NLP, has decided to pack the most important language analysis and processing applications into a single bootable CD: Morphix-NLP. More than 640 MB of NLP-specific software is included and there's still a lot of room left on the CD, which uses a compressed filesystem to bring us the best of both worlds.

The CD is based on Morphix, which is based on the famous Knoppix, which in turn is based on the Debian GNU/Linux distribution. What does that mean? It means that after downloading the ISO and burning the CD you can simply boot your machine with it. Within a few seconds you'll find yourself at the familiar bash command line. A few directions printed on the lines above the prompt will give you hints about how to read the documentation for the CD and how to enter the GUI, which is a nice-looking and easy-to-use XFce desktop environment.

After starting the shiny XFce GUI by issuing the startx command you'll be looking at the main screen, which is similar to this one. Morphix-NLP has adequate documentation in HTML, which can be viewed by simply clicking on the book icon at the leftmost part of the icon panel at the bottom. The Mozilla web browser (actually Phoenix) will guide you through the rest of the documentation.

If you are one of those impatient people who like to discover systems by quick and dirty methods like trial and error, then you'll notice that the XFce icon panel contains some linguistic applications that can be invoked directly. These are the Link Grammar Parser, AntConc 2.2, the WordNet Browser and the CMU Festival Speech Synthesis System.

To get a taste of what these mean, just enter an English sentence into the Link Grammar Parser window and watch it in action, parsing your sentence into subjects, verbs, objects, relative clauses, etc. Then visit the WordNet browser window, enter some word to search for (e.g. "language" ;-) and, after finding its meanings, start an interesting search through the other concepts that this word is related to, simply by clicking on the buttons below the search box. "What are the parts of an engine?", "What takes an engine as a part of itself?": these kinds of questions and many others can be answered easily with WordNet, a famous cross-linked, associative system which is used in many different projects, including ones related to Artificial Intelligence, Cognitive Science, etc. This graphical WordNet browser is only one of the applications. Since Morphix-NLP includes programming environments like Python, Perl and Tcl/Tk, you have the chance to experiment with the system from your own programs in any of those languages.

We all know that language is not limited to writing and reading: everything started with "speech", which made us "human". Morphix-NLP stresses this point by including a famous software-based speech synthesizer developed at the University of Edinburgh. Simply clicking on the loudspeaker icon on the panel will invoke the system, and it will greet you by speaking (!) to you and saying "Welcome to the Morphix NLP Live CD" ;-) Dive into this simple-looking but very complex and capable system to learn how you can integrate smooth-voiced dialogs into your games, user interfaces, etc.

What you've experienced up to this point is just the tip of the iceberg. The real treasure is not hidden behind fancy icons, and you'll have to browse the documentation and explore what lies beneath. Your efforts will be rewarded.

The journey into the real world of NLP starts with tokenisers. A tokeniser is a piece of software that splits a text into its component elements. These are typically individual words, but also punctuation marks and other symbols which are not normally considered to be words. The collective term for the elements that make up a text is tokens. So the tokeniser takes a text as input and splits it into its tokens. This is usually done by inserting separators, either blank spaces or line breaks, so that subsequent programs (like a part-of-speech tagger) can easily read in the tokens and process them further.
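
To make the idea concrete, here is a minimal sketch of such a tokeniser in Python, one of the languages shipped on the CD. It is not one of the tokenisers included in Morphix-NLP; the regular expression and the function names are my own assumptions, and real tokenisers handle many more cases (abbreviations, numbers, hyphenated words, other scripts).

```python
import re

# Assumed, simplified rule: a token is either a run of word characters
# (optionally with inner apostrophes, as in "don't") or a single
# non-space symbol such as a punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)*|[^\w\s]")


def tokenise(text):
    """Split a text into a list of word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


def one_token_per_line(text):
    """Insert line breaks as separators, one token per line."""
    return "\n".join(tokenise(text))


if __name__ == "__main__":
    print(tokenise("Hey, computer, can you hear me?"))
    # ['Hey', ',', 'computer', ',', 'can', 'you', 'hear', 'me', '?']
```

The one-token-per-line output of one_token_per_line() is the kind of format that a later program, such as a tagger, can read without having to worry about punctuation glued to words.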

Next come part-of-speech taggers. These are programs that read a text and, for each token in it, return the part of speech (e.g. noun, verb, punctuation, etc.). And you are not limited to English. These programs can be adapted to any language you work on, and you'll find working examples for English, German and Italian.
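
What a tagger gives back can be sketched with a deliberately naive toy in Python. This is only a dictionary lookup, not how the taggers on the CD work (those are trained on annotated text); the tiny lexicon, the tag names and the "unknown words are nouns" fallback are all assumptions made for illustration.

```python
# Hypothetical mini-lexicon mapping lower-cased words to tags.
LEXICON = {
    "the": "DET",
    "a": "DET",
    "computer": "NOUN",
    "sentence": "NOUN",
    "parses": "VERB",
    "reads": "VERB",
}


def tag(tokens):
    """Return a list of (token, tag) pairs for already tokenised text."""
    tagged = []
    for token in tokens:
        if not any(ch.isalnum() for ch in token):
            # No letters or digits at all: treat it as punctuation.
            tagged.append((token, "PUNCT"))
        else:
            # Unknown words fall back to NOUN, a common naive baseline.
            tagged.append((token, LEXICON.get(token.lower(), "NOUN")))
    return tagged


if __name__ == "__main__":
    for token, pos in tag(["The", "computer", "parses", "a", "sentence", "."]):
        print(token, pos)
```

Adapting such a toy to another language would mean swapping the lexicon; adapting a real tagger means retraining it on text in that language.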

At this point things are going to get more complex. The third part of the documentation describes a few very important language parsing systems. You've already played with one of them: the Link Grammar Parser. But you're not limited to that. There are many different grammar theories that try to formalise the structure of languages like English, and you can get an idea of some of them using Morphix-NLP. As an example you can experiment with LoPar: a left corner parser for head-lexicalised probabilistic context-free grammars. What? Didn't understand a single word? Hmm, well I thought you were interested in linguistics! ;-) Go and start reading some Chomsky.

A scientific field without any statistics is a pretty rare animal, and linguistic analysis is no exception in this respect. So the fourth part of the documentation is about the statistical tools for analysing linguistic data: the art of looking for patterns and deducing rules to generalise them. Word frequency lists, vocabularies, word bigram and trigram counts, and vocabulary-specific word bigram and trigram counts are what you'll be looking at while you work with this kind of software. "What words come in pairs?", "What is the correlation between them?", "Does the word 'patient' augur the imminent appearance of 'drug'?" are the kind of questions you'll ask and try to answer by analysing documents consisting of millions of words.
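
The counting itself is simple enough to sketch in a few lines of Python. This is not one of the statistical toolkits on the CD, just an illustration using the standard library; the function names and the nine-word sample "corpus" are made up for the example, and real work would feed in millions of tokens.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))


def count_ngrams(tokens, n):
    """Count how often each n-gram occurs (n=1 gives a word frequency list)."""
    return Counter(ngrams(tokens, n))


if __name__ == "__main__":
    tokens = "the patient took the drug and the patient slept".split()
    print(count_ngrams(tokens, 1).most_common(2))  # most frequent words
    print(count_ngrams(tokens, 2).most_common(1))  # most frequent word pair
    print(len(ngrams(tokens, 3)))                  # number of trigrams
```

Answering the "patient" and "drug" question properly takes one more step: comparing how often the two words occur together with how often you would expect that by chance, given their individual frequencies.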

And the final part is dedicated to one of the most ambitious goals of NLP: machine learning. This is where you'll find the tools to create systems that can learn how to classify thousands of text documents into sensible categories, handle word sense disambiguation, evaluate text data according to your criteria, etc. Here you'll see that Zhang Le is not only the creator of the CD but also the developer of an important machine learning toolkit: the Maximum Entropy Modeling Toolkit.
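
To give a feeling for what "learning to classify documents" means, here is a toy text classifier in Python. Please note that it is a naive Bayes classifier with add-one smoothing, which I chose because it fits in a few lines; it is not maximum entropy modelling and has nothing to do with the API of Zhang Le's toolkit. The function names, the labels and the four training "documents" are all invented for the example.

```python
import math
from collections import Counter, defaultdict


def train(labelled_docs):
    """Build a model from (label, list_of_tokens) pairs."""
    word_counts = defaultdict(Counter)  # per-label word frequencies
    doc_counts = Counter()              # number of documents per label
    vocab = set()
    for label, tokens in labelled_docs:
        doc_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, doc_counts, vocab


def classify(tokens, model):
    """Return the label with the highest log-probability for the tokens."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokens:
            # Add-one smoothing keeps unseen words from zeroing the score.
            count = word_counts[label][token] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    model = train([
        ("medicine", "the patient took the drug".split()),
        ("medicine", "the doctor saw the patient".split()),
        ("computing", "the parser read the sentence".split()),
        ("computing", "the kernel booted the machine".split()),
    ])
    print(classify("patient drug".split(), model))
    print(classify("kernel parser".split(), model))
```

The overall shape (train on labelled examples, then ask for the most probable label of a new document) is the same for a maximum entropy model; what differs is how the probabilities are estimated.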

Conclusion: A Few Points To Be Considered

Zhang Le has done a cool job of bringing us a lot of important natural language processing software in an easy to use and evaluate format under the name of Morphix-NLP.

The documentation is the most important guide for this CD and, even though I think its structure is clear and concise, I'm sure there's room for improvement. Giving hints about running every program is a nice idea, and I believe that putting pointers to related documents on the CD and explaining to people how to view them (or making them viewable directly in the browser) would bring it one step closer to perfection. During our personal communication Zhang said that wiki-style, interactively developed live documentation could be a good idea, and I completely agree with that.

A very important type of software that seems to be missing from this system is speech recognition. It is cool to hear my PC speak to me, but the real fun begins when I can speak back and see it transcribing my speech, taking actions accordingly, etc. "Hey, computer, can you hear me?" ;-) Another missing category of NLP software is NL-to-SQL translation systems. I mean natural language query systems that accept questions in a natural language, map this input to a valid SQL statement and bring back the result from a database. "Hey computer, what are the names of the students who passed the ling-101 course last semester?"

I'd like to thank Zhang Le for this innovative idea and for spending the time to create a system dedicated to NLP research. I also want to thank him for the nice CD cover design, which makes the CD case look quite cool. While I'm buried deep in the applications provided with Morphix-NLP I'll be waiting for the next version.

Emre "FZ" Sevinc
2003-12-08