Friday, May 09, 2008
What is NLP?
  Contact  

To contact us or to submit a new resource, please send an email to webmaster@proxem.com.

    
  What is Natural Language Processing?  

“Natural Language Processing (NLP) is a subfield of artificial intelligence and computational linguistics. It studies the problems of automated generation and understanding of natural human languages. […] Natural language understanding systems convert samples of human language into more formal representations that are easier for computer programs to manipulate.”

(This quotation, like some other text fragments on this page, comes from Wikipedia, the free encyclopedia.)

    
  Some NLP applications  

Machine Translation

This is one of the most important applications of Natural Language Processing. Translation is an activity comprising the interpretation of the meaning of a text in one language (the source text) and the production, in another language, of a new and equivalent text (the target text): the translation. Traditionally, translation has been a human activity, although attempts have been made to automate and computerize the translation of natural-language texts (machine translation) or to use computers as an aid to translation (computer-assisted translation).

Information Retrieval

Information retrieval (IR) is the science of searching for information in documents, searching for documents themselves, searching for metadata that describe documents, or searching within databases, whether relational stand-alone databases or hypertext networked databases such as the Internet or intranets, for text, sound, images or data.
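
As an illustration of searching for documents, here is a minimal sketch of an inverted index, one common data structure for text search. It is not taken from the text above: the function names `build_index` and `search`, the whitespace tokenization, and the sample documents are all assumptions made for the example.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercase token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical sample collection, for illustration only.
DOCS = {
    1: "machine translation of text",
    2: "speech recognition of voice",
    3: "machine recognition of speech",
}
INDEX = build_index(DOCS)
print(search(INDEX, "speech recognition"))
```

Real retrieval systems add ranking, stemming and much more; this sketch only answers "which documents contain all of these words?".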

Information Extraction

Information extraction (IE) is a type of information retrieval whose goal is to automatically extract structured or semi-structured information from unstructured machine-readable documents. A typical example is the extraction of information on corporate merger events, whereby instances of the relation “MERGE (company1, company2, date)” are extracted from online news (“Yesterday, New York-based Foo Inc. announced their acquisition of Bar Corp.”).
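
The merger example can be sketched with a hand-written pattern. This is only a toy, under strong assumptions: company names are capitalized words ending in "Inc." or "Corp.", the event is always phrased "announced their/its acquisition of", and the date is left as the raw expression found in the text (resolving "Yesterday" would require the publication date). The names `COMPANY`, `PATTERN` and `extract_mergers` are hypothetical.

```python
import re

# Assumption: a company name is one or more capitalized words plus "Inc." or "Corp."
COMPANY = r"(?:[A-Z]\w*\s)+(?:Inc|Corp)\."

PATTERN = re.compile(
    rf"(?P<when>Yesterday|Today)?.*?"
    rf"(?P<company1>{COMPANY}) announced (?:their|its) acquisition of "
    rf"(?P<company2>{COMPANY})"
)

def extract_mergers(text):
    """Return (company1, company2, raw date expression) tuples found in text."""
    return [
        (m["company1"], m["company2"], m["when"])
        for m in PATTERN.finditer(text)
    ]

NEWS = "Yesterday, New York-based Foo Inc. announced their acquisition of Bar Corp."
print(extract_mergers(NEWS))
```

Practical systems typically rely on named entity recognition and learned models rather than a single regular expression, which breaks as soon as the wording changes.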

A typical subtask of IE is Named Entity Recognition: the recognition of entity names (for people and organizations), place names, temporal expressions, and certain types of numerical expressions (currency amounts…).

Automatic Summarization

Automatic summarization is the creation of a shortened version of a text by a computer program. The product of this procedure still contains the most important points of the original text.
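
One simple way to approximate this is extractive summarization: keep the sentences whose words are most frequent in the whole text. The sketch below is an assumption-laden illustration, not a description of any particular product: the function name `summarize`, the crude sentence splitting, and the "ignore words of three letters or fewer" filter are all choices made for the example.

```python
import re
from collections import Counter

def summarize(text, n_sentences=1):
    """Keep the n highest-scoring sentences, in their original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Word frequencies over the whole text; short words are ignored
    # as a crude stand-in for a stopword list.
    words = re.findall(r"[a-z']+", text.lower())
    freq = Counter(w for w in words if len(w) > 3)

    def score(sentence):
        return sum(freq[w] for w in re.findall(r"[a-z']+", sentence.lower()))

    top = sorted(sentences, key=score, reverse=True)[:n_sentences]
    return " ".join(s for s in sentences if s in top)

TEXT = "Cats sleep a lot. Cats chase mice and cats climb trees. Dogs bark."
print(summarize(TEXT))
```

Note that this scoring favors long sentences; normalizing by sentence length is a common refinement.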

Speech Recognition

Speech recognition is the process of converting a speech signal (i.e. voice) to a set of words, by means of an algorithm implemented as a computer program.

    
  Some problems faced by NLP systems  

Sentence Boundary Disambiguation

What is a sentence? At first glance, it is a sequence of words ending with a dot. However, in a filename with an extension (foo.pdf) or an IP address (127.0.0.1), this rule is not accurate enough. A better rule says that a sentence-final dot is either the last character of the input or followed by a space. Nevertheless, this improved rule fails on “John F. Kennedy”. Sentence boundary disambiguation is therefore a not-so-trivial task.
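
The two rules discussed above can be sketched in a few lines. The function names `split_naive` and `split_improved` and the sample sentences are assumptions made for the illustration.

```python
import re

def split_naive(text):
    """Rule 1: every dot ends a sentence."""
    return [s.strip() for s in text.split(".") if s.strip()]

def split_improved(text):
    """Rule 2: a dot ends a sentence only when followed by whitespace
    (a dot at the very end of the input simply closes the last sentence)."""
    return [s for s in re.split(r"(?<=\.)\s+", text.strip()) if s]

SAMPLE = "Open foo.pdf now. Ping 127.0.0.1 later."

print(split_naive(SAMPLE))     # breaks inside foo.pdf and 127.0.0.1
print(split_improved(SAMPLE))  # two sentences, as intended
print(split_improved("John F. Kennedy was president."))  # wrongly split after "F."
```

Handling the last case usually requires extra knowledge, such as a list of abbreviations or a trained classifier.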

Lexical ambiguity

Many words have more than one meaning; we have to select the meaning that makes the most sense in context. (A “bank” can be a financial institution or the side of a river.) The task of finding the right sense within a context is called Word Sense Disambiguation. Lexical ambiguity may also involve homonyms, or a single word that can belong to two parts of speech. (In “Time flies like an arrow”, “flies” can be a verb or a noun.)
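
A classic heuristic for Word Sense Disambiguation is to pick the sense whose dictionary definition shares the most words with the context (a simplified, Lesk-style overlap). The sketch below assumes hand-written glosses for “bank” and a tiny stopword list; `BANK_SENSES`, `STOPWORDS` and `pick_sense` are hypothetical names, and real systems use a full lexical resource.

```python
# Assumption: a tiny stopword list, just large enough for the examples below.
STOPWORDS = {"a", "an", "the", "of", "to", "and", "or", "that", "on", "in",
             "she", "he", "we"}

# Assumption: hand-written glosses, not taken from any real dictionary.
BANK_SENSES = {
    "financial institution":
        "an institution that accepts deposits of money and makes loans",
    "river side":
        "the sloping land beside a river or lake",
}

def content_words(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return {w for w in text.lower().split() if w not in STOPWORDS}

def pick_sense(senses, context):
    """Return the sense whose gloss overlaps most with the context words."""
    context_set = content_words(context)
    best_sense, best_overlap = None, -1
    for sense, gloss in senses.items():
        overlap = len(context_set & content_words(gloss))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

print(pick_sense(BANK_SENSES, "she went to the bank to deposit money"))
print(pick_sense(BANK_SENSES, "we sat on the bank of the river and fished"))
```

When no gloss overlaps with the context, this sketch simply returns the first sense listed, which shows how fragile pure word overlap is.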

Syntactic ambiguity

The grammars of natural languages are ambiguous, i.e. there are often multiple possible parse trees for a given sentence. (“The man saw the boy with the telescope”: is it the man or the boy who has the telescope?) Choosing the most appropriate parse usually requires semantic and contextual information.
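
To make the ambiguity concrete, the sketch below counts the parse trees of the telescope sentence with a CKY-style chart. The toy grammar (`BINARY_RULES`, `LEXICON`) is an assumption written for this one sentence, not a realistic grammar of English.

```python
from collections import defaultdict

# Toy grammar in Chomsky normal form: (parent, left child, right child).
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # the PP attaches to the verb phrase
    ("NP", "Det", "N"),
    ("NP", "NP", "PP"),   # the PP attaches to the noun phrase
    ("PP", "P", "NP"),
]
LEXICON = {
    "the": ["Det"], "man": ["N"], "boy": ["N"], "telescope": ["N"],
    "saw": ["V"], "with": ["P"],
}

def count_parses(words):
    """Count the distinct parse trees rooted in S that cover all the words."""
    n = len(words)
    chart = defaultdict(int)  # (start, end, symbol) -> number of trees
    for i, word in enumerate(words):
        for tag in LEXICON[word]:
            chart[i, i + 1, tag] += 1
    for span in range(2, n + 1):
        for start in range(n - span + 1):
            end = start + span
            for mid in range(start + 1, end):
                for parent, left, right in BINARY_RULES:
                    chart[start, end, parent] += (
                        chart[start, mid, left] * chart[mid, end, right]
                    )
    return chart[0, n, "S"]

print(count_parses("the man saw the boy with the telescope".split()))
```

The two trees differ only in where “with the telescope” attaches: to the verb (the man uses the telescope) or to “the boy” (the boy has it). The grammar alone cannot choose between them.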

Semantic ambiguity

Understanding the context is even more necessary to resolve semantic ambiguities. For instance, incorrect anaphora resolution can lead to misunderstanding (in “every farmer who owns a donkey beats it”, does the pronoun “it” refer to “farmer” or to “donkey”?). Identifying the logical subject is also sometimes difficult (in “John asks his mother to do that”, is it John or his mother who is supposed to perform the action?).