Friday, July 03, 2009
Resources
  Contact  

To contact us or to submit a new resource, please send an email to webmaster@proxem.com.

    
  Resources  

Lexical level resources

WordNet

WordNet
WordNet® is an online lexical reference system whose design is inspired by current psycholinguistic theories of human lexical memory. English nouns, verbs, adjectives and adverbs are organized into synonym sets, each representing one underlying lexical concept. Different relations link the synonym sets.

Word Sense Disambiguation in a Slot Grammar Framework
This is a preliminary report on a system for word sense disambiguation (WSD) for unrestricted vocabulary, which requires no training on tagged text. Disambiguation is done to WordNet word senses. The “disambiguating power” of the system comes from three sources: (A) Parsing by English Slot Grammar (ESG), (B) the WordNet relation system, and (C) the WordNet sense frequency data.

Mapping of EuroWordnet Top Ontology to Upper Cyc Ontology
A mapping of EuroWordnet Top Ontology into Upper Cyc Ontology is presented. The mapping is expressed in terms of a CycL microtheory encoding of the EuroWordnet Top Ontology, because it cannot be expressed by equivalence and subsumption relations alone.

WordNet::Similarity
This is a CPAN module that implements a variety of semantic similarity measures that can be used in conjunction with WordNet. In particular, it supports the measures of Resnik, Lin, Jiang-Conrath, Leacock-Chodorow, Hirst-St.Onge, Wu-Palmer, Banerjee-Pedersen, and Patwardhan-Pedersen.
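To give a feel for what such a measure computes, here is a minimal Python sketch of the Wu-Palmer score on a toy hypernym tree. It does not use the Perl module or WordNet itself: the `PARENT` table and all function names are assumptions made for illustration, and depth is counted in nodes so that the root has depth 1.

```python
# Toy hypernym tree (child -> parent). Hypothetical data, not WordNet.
PARENT = {
    "animal": "entity",
    "artifact": "entity",
    "dog": "animal",
    "cat": "animal",
    "car": "artifact",
}


def path_to_root(node):
    """Return [node, parent, ..., root]."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node))


def lowest_common_subsumer(a, b):
    """Deepest node that is an ancestor of (or equal to) both a and b."""
    ancestors_of_a = set(path_to_root(a))
    for node in path_to_root(b):
        if node in ancestors_of_a:
            return node
    raise ValueError("no common ancestor")


def wu_palmer(a, b):
    """Wu-Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    lcs = lowest_common_subsumer(a, b)
    return 2 * depth(lcs) / (depth(a) + depth(b))
```

In this toy tree, "dog" and "cat" share the subsumer "animal" and therefore score higher than "dog" and "car", which only share the root.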

eXtended WordNet
The goal of this project is to develop a tool that takes as input the current or future versions of WordNet and automatically generates an eXtended WordNet that provides several important enhancements intended to remedy the present limitations of WordNet.

NameNet: a Self-Improving Resource for Name Classification
This paper presents a semantically structured resource of more than 1,600 Name Classes. This structure is based on the noun hyperonymy hierarchies in WordNet, expanded and validated by corpus evidence collected from the World Wide Web. The set of seed examples provided by WordNet is bootstrapped and then used to automatically construct an annotated training corpus for each Name Class. The resulting Named Entity resource enables a supervised Named Entity Recognizer to identify all the encoded Name Classes with high accuracy and without any human intervention.

BalkaNet
The Balkan WordNet aims at the development of a multilingual lexical database comprising individual WordNets for the Balkan languages. The most ambitious feature of BalkaNet is its attempt to represent semantic relations between words in each Balkan language and link them together in order to develop an online multilingual semantic network. The main objective is the development of each language's WordNet from available resources covering the general vocabulary of that language. Semantic relations will be classified in the independent WordNets according to a shared ontology. Then, all individual WordNets will be organized into a common database providing links across them. Each of the WordNets will be structured along the same lines as EuroWordNet through a WordNet Management System. This project is an excellent opportunity to explore the less studied Balkan languages and to combine and compare them cross-linguistically.

WordNet.Net
WordNet.Net library - the .Net Framework library for WordNet.

WordNet-based semantic similarity measurement
Semantic similarity is a confidence score that reflects the semantic relation between the meanings of two sentences. It is difficult to gain a high accuracy score because the exact semantic meanings are completely understood only in a particular context.


MindNet

MindNet
MindNet is a knowledge representation project that uses our broad-coverage parser to build semantic networks from dictionaries, encyclopedias, and free text. MindNets are produced by a fully automatic process that takes the input text, sentence-breaks it, parses each sentence to build a semantic dependency graph (Logical Form), aggregates these individual graphs into a single large graph, and then assigns probabilistic weights to subgraphs based on their frequency in the corpus as a whole. The project also encompasses a number of mechanisms for searching, sorting, and measuring the similarity of paths in a MindNet. We believe that automatic procedures such as MindNets provide the only credible prospect for acquiring world knowledge on the scale needed to support common-sense reasoning.
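The aggregation and weighting steps described above can be pictured with a small Python sketch. This is only an illustration under strong simplifying assumptions: each sentence graph is reduced to a list of (head, relation, dependent) triples, weights are plain relative frequencies of single edges rather than of subgraphs, and the relation labels and function name are hypothetical, not MindNet's actual format or API.

```python
from collections import Counter


def aggregate(sentence_graphs):
    """Merge per-sentence edge lists into one graph with frequency weights.

    Each graph is a list of (head, relation, dependent) triples. The weight
    of an edge is its share of all edge occurrences in the corpus.
    """
    counts = Counter(edge for graph in sentence_graphs for edge in graph)
    total = sum(counts.values())
    return {edge: n / total for edge, n in counts.items()}


# Hypothetical per-sentence graphs; the relation names are illustrative.
GRAPHS = [
    [("bird", "Hyp", "animal"), ("bird", "Part", "wing")],
    [("bird", "Part", "wing")],
    [("car", "Part", "wheel")],
]

WEIGHTS = aggregate(GRAPHS)
```

Edges that recur across sentences, such as the bird/wing edge here, end up with a larger weight than edges seen only once.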


Antonymy resources

Vecteurs conceptuels et fonctions lexicales : application à l'antonymie (Conceptual vectors and lexical functions: application to antonymy)
This thesis deals with representing the thematic aspect of text segments (documents, paragraphs, phrases, etc.). We rely on a mixed approach (symbolic and vector-based) that aims to combine the information that can be deduced from syntactic structures with the information provided by lexical-semantic representations. Certain syntactic forms indirectly carry meaning and can, in general, be modeled using Meaning-Text Theory and lexical functions. Negation, which is very frequent in texts, can among other things help avoid repetition, or produce utterances whose form is not lexically attested, such as the phrases "il n'est pas sérieux" ("he is not serious") and "il n'est pas aimable" ("he is not kind"). The words "sérieux" and "aimable" have no well-attested opposites; the terms "léger" ("frivolous") and "désagréable" ("unpleasant") are at best approximations. Negation does not always mean the opposite of an affirmation, as in the sentence "elle n'est pas belle, elle est superbe" ("she is not beautiful, she is gorgeous"). By contrast, in "il n'est pas mort" ("he is not dead"), negation expresses, a priori, the opposite idea "il est vivant" ("he is alive"), subject however to the problems of polysemy and figurative senses: "vivant" can also be used in the sense of cheerful or lively.

Antonymy and Semantic Range in English
This dissertation investigates what makes two words antonyms. Previous research has not adequately explained why some words seem to contrast in meaning but are still not considered antonyms (e.g. large and little) nor can it explain why some words have two antonyms (e.g., happy/sad and happy/unhappy). An explanation is given here using the notion of "semantic range" (a description of a word's typical collocation patterns); antonyms are shown to be words which have a great deal of semantic range in common.


Similarity measure between words

WordNet::Similarity
This is a CPAN module that implements a variety of semantic similarity measures that can be used in conjunction with WordNet. In particular, it supports the measures of Resnik, Lin, Jiang-Conrath, Leacock-Chodorow, Hirst-St.Onge, Wu-Palmer, Banerjee-Pedersen, and Patwardhan-Pedersen.


Other resources on lexical level

HyperLex algorithm for automatic discrimination of word uses in a textual database
Jean Véronis describes the HyperLex algorithm for automatic discrimination of word uses in a textual database. The algorithm does not require a dictionary. It detects high density components in the word-cooccurrence graph, and, contrary to previous methods (word vectors), enables the recognition of very low frequency uses. HyperLex is associated with a graphic representation technique that makes it possible to navigate through the lexicon and explore visually the various themes corresponding to the discriminated uses.
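As a rough illustration of the idea of finding dense regions in a co-occurrence graph, the Python sketch below builds a graph from the contexts of an ambiguous word and greedily picks high-degree "hub" words, removing each hub together with its neighbours. It is a deliberately simplified sketch, not Véronis's published procedure: the function names, the `min_degree` threshold, and the toy contexts for the word "bar" are all assumptions.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_graph(contexts):
    """Link every pair of distinct words that appear in the same context."""
    graph = defaultdict(set)
    for words in contexts:
        for a, b in combinations(sorted(set(words)), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def find_hubs(graph, min_degree=2):
    """Greedily pick the highest-degree word, then drop it and its neighbours."""
    remaining = {word: set(neighbours) for word, neighbours in graph.items()}
    hubs = []
    while remaining:
        # Sorting first makes ties resolve alphabetically.
        hub = max(sorted(remaining), key=lambda w: len(remaining[w]))
        if len(remaining[hub]) < min_degree:
            break
        hubs.append(hub)
        removed = remaining[hub] | {hub}
        for word in removed:
            remaining.pop(word, None)
        for word in remaining:
            remaining[word] -= removed
    return hubs


# Hypothetical contexts in which the ambiguous word "bar" occurred.
CONTEXTS = [
    ["drink", "beer", "pub"],
    ["beer", "pub", "music"],
    ["drink", "pub"],
    ["steel", "metal", "rod"],
    ["metal", "rod", "iron"],
]

HUBS = find_hubs(cooccurrence_graph(CONTEXTS))
```

Each hub stands for one discriminated use; on the toy data, one hub comes from the drinking contexts and one from the metalworking contexts.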

COMLEX
COMLEX Syntax is a monolingual English Dictionary consisting of 38,000 head words intended for use in natural language processing.



Ontologies

SUMO

SUMO
The Suggested Upper Merged Ontology (SUMO) and its domain ontologies form the largest formal public ontology in existence today. They are being used for research and applications in search, linguistics and reasoning. SUMO is the only formal ontology that has been mapped to all of the WordNet lexicon. SUMO is written in the SUO-KIF language. SUMO is free and owned by the IEEE.

KIF
Knowledge Interchange Format (KIF) is a computer-oriented language for the interchange of knowledge among disparate programs. It has declarative semantics (i.e. the meaning of expressions in the representation can be understood without appeal to an interpreter for manipulating those expressions); it is logically comprehensive (i.e. it provides for the expression of arbitrary sentences in the first-order predicate calculus); it provides for the representation of knowledge about the representation of knowledge; it provides for the representation of nonmonotonic reasoning rules; and it provides for the definition of objects, functions, and relations.


Other resources on ontologies

Sites Relevant to Ontologies and Knowledge Sharing
A list of resources on Ontologies and Knowledge Sharing.

John Bateman's ontology portal
This page is a collection of starting points for information on ontologies gathered together for ease of reference for our own ontology-related projects.

Fine-Grained Proper Noun Ontologies for Question Answering
The WordNet lexical ontology, which is primarily composed of common nouns, has been widely used in retrieval tasks. Here, we explore the notion of a fine-grained proper noun ontology and argue for the utility of such an ontology in retrieval tasks. To support this claim, we build a fine-grained proper noun ontology from unrestricted news text and use this ontology to improve performance on a question answering task.

An introduction to Ontology by John F. Sowa
Ontology is the study of existence. An ontology is a system of categories for classifying and talking about the things that are assumed to exist. This directory contains a summary of the ontology developed and used in the KR book by John Sowa.

KBS / Ontology Projects Worldwide
Some ongoing KBS/Ontology projects and groups.

OWL Web Ontology Language Reference
The Web Ontology Language OWL is a semantic markup language for publishing and sharing ontologies on the World Wide Web. OWL is developed as a vocabulary extension of RDF (the Resource Description Framework) and is derived from the DAML+OIL Web Ontology Language.


Gellish

Gellish - A Generic Extensible Ontological Language
A Generic Extensible Ontological Language

Design and Application of a Universal Data Structure
Thesis describing Gellish. The research asks whether it is possible to provide a formal, generic artificial language for the unambiguous description of reality that is based on natural language, is defined in a formal ontology, and is practically applicable, at least for technical artifacts, so that it is suitable for expressing and exchanging information as electronic data in a structure that is independent of any particular system or natural language.


Named Entity Hierarchies

Sekine's Extended Named Entity Hierarchy
The Extended Named Entity Hierarchy is designed and developed to meet the increasing need for a wider range of NE types. It originates from the first Named Entity set defined by MUC (Grishman et al., 1996), the Named Entity set developed by IREX (Sekine et al., 2000), and the Extended Named Entity hierarchy, which contains approximately 150 NE types (Sekine et al., 2002). It has now been extended again to 200 NE types. The applications include Question Answering (Q&A) systems that analyze general texts such as newspaper articles, as well as Information Extraction (IE), Machine Translation (MT), Summarization and Information Retrieval (IR) systems that serve a variety of NLP applications. We designed the Extended Named Entity Hierarchy for Q&A and IE systems, assuming that the information one wants to know is basically in the form of a noun phrase with a specific name, time expression or numerical value.

NameNet: a Self-Improving Resource for Name Classification
This paper presents a semantically structured resource of more than 1,600 Name Classes. This structure is based on the noun hyperonymy hierarchies in WordNet, expanded and validated by corpus evidence collected from the World Wide Web. The set of seed examples provided by WordNet is bootstrapped and then used to automatically construct an annotated training corpus for each Name Class. The resulting Named Entity resource enables a supervised Named Entity Recognizer to identify all the encoded Name Classes with high accuracy and without any human intervention.



POS tagger

Eric Brill's Tagger

Eric Brill's trainable rule-based part of speech tagger
The NLP programs that you can download: a supervised part of speech tagger, an unsupervised part of speech tagger, and a prepositional phrase attachment program. This tagger is based on transformation-based error-driven learning, a technique that has been effective in a number of natural language applications, including part of speech and word sense tagging, prepositional phrase attachment, and syntactic parsing.
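The core loop of transformation-based error-driven learning can be sketched in a few lines of Python: start from an initial tagging, try candidate rewrite rules, and keep the rule that leaves the fewest errors against a gold standard. This is a minimal sketch, not Brill's implementation; the single rule template ("change tag X to Y when the previous tag is Z"), the function names, and the toy data are assumptions.

```python
def apply_rule(tags, rule):
    """Apply 'change from_tag to to_tag when the previous tag is prev_tag'."""
    from_tag, to_tag, prev_tag = rule
    out = list(tags)
    for i in range(1, len(tags)):
        # Contexts are read from the input tagging, not the partial output.
        if tags[i] == from_tag and tags[i - 1] == prev_tag:
            out[i] = to_tag
    return out


def errors(tags, gold):
    """Number of positions where the tagging disagrees with the gold tags."""
    return sum(t != g for t, g in zip(tags, gold))


def best_rule(tags, gold, candidates):
    """Pick the candidate rule whose application leaves the fewest errors."""
    return min(candidates, key=lambda rule: errors(apply_rule(tags, rule), gold))


# "I want to race the race": the initial tagger calls both "race" tokens NN.
INITIAL = ["PRP", "VBP", "TO", "NN", "DT", "NN"]
GOLD = ["PRP", "VBP", "TO", "VB", "DT", "NN"]
CANDIDATES = [("NN", "VB", "TO"), ("NN", "VB", "DT")]

LEARNED = best_rule(INITIAL, GOLD, CANDIDATES)
```

A full learner would repeat this selection, re-tagging the corpus with each learned rule, until no candidate reduces the error count any further.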


TreeTagger - a language independent part-of-speech tagger

TreeTagger
The TreeTagger is a tool for annotating text with part-of-speech and lemma information which has been developed within the TC project at the Institute for Computational Linguistics of the University of Stuttgart. The TreeTagger has been successfully used to tag German, English, French, Italian, Greek and Old French texts and is easily adaptable to other languages if a lexicon and a manually tagged training corpus are available.


SVMTagger

SVMTool
The SVMTool is a simple and effective generator of sequential taggers based on Support Vector Machines. We have applied the SVMTool to the problem of part-of-speech tagging. By means of a rigorous experimental evaluation, we conclude that the proposed SVM-based tagger is robust and flexible for feature modelling (including lexicalization), trains efficiently with almost no parameters to tune, and is able to tag thousands of words per second, which makes it really practical for real NLP applications. Regarding accuracy, the SVM-based tagger significantly outperforms the TnT tagger under exactly the same conditions, and achieves a very competitive accuracy of 97.2% for English on the Wall Street Journal corpus, which is comparable to the best taggers reported to date.


Stanford Tagger

Stanford Log-linear POS Tagger download
This is a Java implementation of the log-linear part-of-speech (POS) taggers.


SS Tagger

SS Tagger - a part-of-speech tagger for English
Tagging speed is crucial in large-scale information extraction and real-time NLP applications. This part-of-speech (POS) tagger offers fast tagging (2400 tokens/sec) with state-of-the-art accuracy (97.10% on the WSJ corpus). The tagger uses an extension of Maximum Entropy Markov Models (MEMM), in which tags are determined in an easiest-first manner.



Chunkers

CASS chunker

CASS chunker
Cass. A fast, robust partial parser developed by Steven Paul Abney. CASS is a partial parser designed for use with large amounts of noisy text.


Ramshaw and Marcus BaseNP chunker

Noun Phrase Chunker
This application is a Java implementation of the Ramshaw and Marcus BaseNP chunker (in fact the files in the resources directory are taken straight from their original distribution), which attempts to insert brackets marking noun phrases in text that has been marked with POS tags in the same format as the output of Eric Brill's transformational tagger. The output from this version should be identical to the output of the original C++/Perl version released by Ramshaw and Marcus. A wrapper is also included which allows the easy use of this chunker within the GATE framework.
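To show the input and output shape of such a chunker, here is a naive Python sketch that brackets maximal runs of determiner, adjective and noun tags in `word/TAG` text. It is a hand-written heuristic for illustration only, not the learned method of the chunker described above; the tag set in `NP_TAGS`, the function name, and the bracket style are assumptions.

```python
# Penn Treebank tags treated as "inside a base NP" by this naive heuristic.
NP_TAGS = {"DT", "JJ", "NN", "NNS", "NNP", "NNPS"}


def bracket_base_nps(tagged_text):
    """Insert [ ... ] around maximal runs of NP_TAGS in word/TAG text."""
    out, in_np = [], False
    for token in tagged_text.split():
        # rpartition keeps words that contain a slash intact.
        _word, _, tag = token.rpartition("/")
        if tag in NP_TAGS and not in_np:
            out.append("[")
            in_np = True
        elif tag not in NP_TAGS and in_np:
            out.append("]")
            in_np = False
        out.append(token)
    if in_np:
        out.append("]")
    return " ".join(out)


SENTENCE = "The/DT quick/JJ fox/NN jumped/VBD over/IN the/DT dog/NN ./."
CHUNKED = bracket_base_nps(SENTENCE)
```

On the sample sentence this places one bracket pair around "The quick fox" and another around "the dog", leaving the verb, preposition and full stop outside.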


YamCha

YamCha: Yet Another Multipurpose CHunk Annotator
YamCha is a generic, customizable, and open source text chunker aimed at many NLP tasks, such as POS tagging, Named Entity Recognition, base NP chunking, and Text Chunking. YamCha uses a state-of-the-art machine learning algorithm called Support Vector Machines (SVMs), first introduced by Vapnik in 1995.


Papers about chunking

A Divide-and-Conquer Strategy for Parsing
In this paper, we propose a novel strategy which is designed to enhance the accuracy of the parser by simplifying complex sentences before parsing. This approach involves the separate parsing of the constituent sub-sentences within a complex sentence. To achieve that, the divide-and-conquer strategy first disambiguates the roles of the link words in the sentence and segments the sentence based on these roles. The separate parse trees of the segmented sub-sentences and the noun phrases within them are then synthesized to form the final parse. To evaluate the effects of this strategy on parsing, we compare the original performance of a dependency parser with the performance when it is enhanced with the divide-and-conquer strategy. When tested on 600 sentences of the IPSM'95 data sets, the enhanced parser saw a considerable error reduction of 21.2% in its accuracy.



Syntactic level resources

Stanford Parser

Stanford Parser
Statistical parsers use knowledge of language gained from hand-parsed sentences to try to produce the correct analysis of new sentences. These parsers still make some mistakes, but commonly work rather well. Their development was one of the biggest breakthroughs in natural language processing in the last decade. This package is a Java implementation of probabilistic natural language parsers, both highly optimized PCFG and dependency parsers, and a lexicalized PCFG parser.


Link Grammar Parser

Link Grammar Parser
The Link Grammar Parser is a syntactic parser of English, based on link grammar, an original theory of English syntax. Given a sentence, the system assigns to it a syntactic structure, which consists of a set of labeled links connecting pairs of words. The parser also produces a "constituent" representation of a sentence (showing noun phrases, verb phrases, etc.).
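A linkage of this kind can be represented as a list of labeled word-index pairs. The Python sketch below does that for a short sentence and checks that no two links cross when drawn above the words, a condition link grammar imposes on valid linkages. It does not call the Link Grammar Parser; the tuple layout, the function names, and the simplified labels (D for determiner, S for subject, O for object) are assumptions for illustration.

```python
from itertools import combinations


def links_cross(link_a, link_b):
    """True if two (label, i, j) links cross when drawn above the sentence."""
    (a1, a2), (b1, b2) = sorted(link_a[1:]), sorted(link_b[1:])
    return a1 < b1 < a2 < b2 or b1 < a1 < b2 < a2


def is_planar(links):
    """A linkage is planar when no pair of its links crosses."""
    return not any(links_cross(a, b) for a, b in combinations(links, 2))


WORDS = ["the", "cat", "chased", "a", "mouse"]

# Each link is (label, left word index, right word index).
LINKAGE = [("D", 0, 1), ("S", 1, 2), ("O", 2, 4), ("D", 3, 4)]

PLANAR = is_planar(LINKAGE)
```

Links that merely share a word, such as the S and O links meeting at "chased", do not count as crossing.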


Probabilistic Dependency Parser for English

A low-complexity, broad-coverage probabilistic Dependency Parser for English
Large-scale parsing is still a complex and time-consuming process, often so much so that it is infeasible in real-world applications. The parsing system described here addresses this problem by combining finite-state approaches, statistical parsing techniques and engineering knowledge, thus keeping parsing complexity as low as possible at the cost of a slight decrease in performance. The parser is robust and fast and at the same time based on strong linguistic foundations.


MINIPAR

MINIPAR
MINIPAR is a broad-coverage parser for the English language. An evaluation with the SUSANNE corpus shows that MINIPAR achieves about 88% precision and 80% recall with respect to dependency relationships. MINIPAR is very efficient: on a Pentium II 300 with 128MB memory, it parses about 300 words per second.
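Precision and recall over dependency relationships can be computed by comparing the set of relations a parser outputs with a gold-standard set. The Python sketch below shows the arithmetic on a toy example; the (head, relation, dependent) triple format, the relation names, and the data are assumptions, not MINIPAR output or SUSANNE annotations.

```python
def precision_recall(predicted, gold):
    """Precision and recall of predicted dependency triples against gold."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Toy (head, relation, dependent) triples for "John saw the dog yesterday".
GOLD = {
    ("saw", "subj", "John"),
    ("saw", "obj", "dog"),
    ("dog", "det", "the"),
    ("saw", "mod", "yesterday"),
}
PREDICTED = {
    ("saw", "subj", "John"),
    ("saw", "obj", "dog"),
    ("dog", "mod", "the"),  # wrong relation label
}

PRECISION, RECALL = precision_recall(PREDICTED, GOLD)
```

Here two of the three predicted triples are correct (precision 2/3) and two of the four gold triples are recovered (recall 1/2).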


HPSG (Head-Driven Phrase Structure Grammar)

HPSG
This page provides information about Head-Driven Phrase Structure Grammar (HPSG) related activities at the Center for the Study of Language and Information (CSLI) at Stanford University, and pointers to other resources on the web.


RRG (Role and Reference Grammar)

RRG
Role and Reference Grammar [RRG] (Van Valin 1993) takes language to be a system of communicative social action, and accordingly, analyzing the communicative functions of grammatical structures plays a vital role in grammatical description and theory from this perspective.


AGFL

AGFL
The goal of the AGFL-project (Affix Grammars over a Finite Lattice) is the development of a technology for Natural Language Processing available in the public domain.
The AGFL formalism for the syntactic description of Natural Languages has been developed by the Computer Science Department of the Radboud University of Nijmegen. It is a formalism in which large context free grammars can be described in a compact way. AGFLs belong to the family of two level grammars, along with attribute grammars: a first, context-free level is augmented with set-valued features for expressing agreement between parts of speech.
The AGFL parser generation system for Natural Languages generates efficient parsers from AGFL grammars. It includes a lexicon system suitable for the large lexica needed in real-life NLP applications.


XTAG Project

XTAG
XTAG is an on-going project to develop a wide-coverage grammar for English using a lexicalized Tree Adjoining Grammar (TAG) formalism. XTAG also serves as a system for the development of TAGs and consists of a parser, an X-windows grammar development interface and a morphological analyzer.


Other resources on syntactic level

GRAC: GRAmmar Checker
GRAC is a GRAmmar Checker written in Python. It is based on learning algorithms: it needs a tagged text corpus and a mistake-free corpus to learn grammar rules, and then detects grammar errors in a sentence or text you give it.

Terminologie : des Théories Linguistiques à l'Extraction Automatique (Terminology: from Linguistic Theories to Automatic Extraction)
How can the automatic extraction of terminology from a text corpus be improved? What can linguistic theories contribute in this area?



Semantic level resources

eXtended WordNet

eXtended WordNet
The goal of this project is to develop a tool that takes as input the current or future versions of WordNet and automatically generates an eXtended WordNet that provides several important enhancements intended to remedy the present limitations of WordNet.


Conceptual graphs

An introduction to Conceptual graphs by John Sowa
Conceptual graphs (CGs) are a system of logic based on the existential graphs of Charles Sanders Peirce and the semantic networks of artificial intelligence. They express meaning in a form that is logically precise, humanly readable, and computationally tractable. With a direct mapping to language, conceptual graphs serve as an intermediate language for translating computer-oriented formalisms to and from natural languages. With their graphic representation, they serve as a readable, but formal design and specification language. CGs have been implemented in a variety of projects for information retrieval, database design, expert systems, and natural language processing.


Deep understanding

Story understanding resources
A list of resources on story understanding

Story understanding through multi-representation model construction
We present an implemented model of story understanding and apply it to the understanding of a children’s story. We argue that understanding a story consists of building multirepresentation models of the story and that story models are efficiently constructed using a satisfiability solver. We present a computer program that contains multiple representations of commonsense knowledge, takes a narrative as input, transforms the narrative and representations of commonsense knowledge into a satisfiability problem, runs a satisfiability solver, and produces models of the story as output. The narrative, models, and representations are expressed in the language of Shanahan’s event calculus.

Understanding script-based stories using commonsense reasoning
This paper investigates the use of commonsense reasoning to understand texts involving stereotypical activities or scripts. We present a system that understands news stories involving four terrorism scripts. The system (1) builds a commonsense reasoning problem given an information extraction template representing a terrorist incident, and (2) uses commonsense reasoning and a commonsense knowledge base to build a model of the terrorist incident. The reasoning problem, commonsense knowledge base, and model are expressed in the classical logic event calculus. The system was developed using the MUC3 and MUC4 development data set. We present the results of running the system on the MUC3 and MUC4 test data sets, using manually generated answer key templates and templates generated automatically by two MUC4 information extraction systems. We present a detailed analysis of the models produced by the system given automatically generated templates. We present methods for answering questions based on the models produced by our system. We assess the portability of the system by extending it to handle 10 scripts frequent in Project Gutenberg American literature texts.

Prospects for in-depth story understanding by computer (Erik T. Mueller - November 29, 1999)
While much research on the hard problem of in-depth story understanding by computer was performed starting in the 1970s, interest shifted in the 1990s to information extraction and word sense disambiguation. Now that a degree of success has been achieved on these easier problems, I propose it is time to return to in-depth story understanding. In this paper I examine the shift away from story understanding, discuss some of the major problems in building a story understanding system, present some possible solutions involving a set of interacting understanding agents, and provide pointers to useful tools and resources for building story understanding systems.

The Plots of Children and Machines: The Statistical and Symbolic Semantic Analysis of Narratives
This thesis presents a method of automatic plot analysis of narrative texts that uses both components of traditional symbolic analysis of natural language and statistical machine-learning. In particular, we are investigating the story rewriting task. In the story rewriting task, an exemplar story is read to the pupils and the pupils rewrite the story in StoryStation, which allows them to concentrate more on diction and grammar than on content creation. However, often in the process of content creation the pupil improperly recalls the story. Our method of automatic plot analysis should allow the tutoring system to automatically analyze the plot of the story and provide relevant feedback to both the pupil and teacher.
(Harry Reeves Halpin, Master of Science - School of Informatics - University of Edinburgh, 2003)


Natural Semantic Metalanguage (NSM)

The Natural Semantic Metalanguage homepage
This site contains information and resources about the 'natural semantic metalanguage' (NSM) approach to semantic analysis, which can lay claim to being the most well-developed, comprehensive and practical approach to cross-cultural semantics on the contemporary scene.
The approach is based on evidence that there is a small core of basic, universal meanings, known as semantic primes, which can be found as words or other linguistic expressions in all languages. This common core of meaning can be used as a tool for linguistic and cultural analysis: to explicate complex and culture-specific words and grammatical constructions, and to articulate culture-specific values and attitudes (cultural scripts), in terms which are maximally clear and translatable. The theory also provides a semantic foundation for universal grammar and for linguistic typology. It has applications in intercultural communication, lexicography (dictionary making), language teaching, the study of child language acquisition, legal semantics, and other areas.
The main author is Anna Wierzbicka, who is the originator of the theory, but she has many colleagues and collaborators whose works are also listed here.

Semantics: Primes and Universals (Anna Wierzbicka)
Conceptual primitives and semantic universals are the cornerstones of a semantic theory which Anna Wierzbicka has been developing for many years. Semantics: Primes and Universals is a major synthesis of her work, presenting a full and systematic exposition of that theory in a non-technical and readable way. It delineates a full set of universal concepts, as they have emerged from large-scale investigations across a wide range of languages undertaken by the author and her colleagues. On the basis of empirical cross-linguistic studies it vindicates the old notion of the "psychic unity of mankind", while at the same time offering a framework for the rigorous description of different languages and cultures.

Definition of "Natural semantic metalanguage"
From Wikipedia, the free encyclopedia.



Pragmatic level resources

ConceptNet : A Very-Large Semantic Network of Common Sense Knowledge

ConceptNet
ConceptNet is a freely available commonsense knowledgebase and natural-language-processing toolkit which supports many practical textual-reasoning tasks over real-world documents right out-of-the-box (without additional statistical training).


Open Mind Commonsense

Open Mind Commonsense
Computers today are just plain dumb! The Open Mind Commonsense project is an attempt to make computers smarter by making it easy and fun for people all over the world to work together to give computers the millions of pieces of ordinary knowledge that constitute "common-sense", all those aspects of the world that we all understand so well we take them for granted. This repository of knowledge will enable us to create more intelligent and sociable software, build human-like robots, and better understand the structure of our own minds. We hope you will join us by registering below!

Resources
Various software modules and data sets that are/were used in Rada Mihalcea's research at University of North Texas.



Multi-level resources

FrameNet

FrameNet
The Berkeley FrameNet project is a lexicon-building effort in which we
(1) study words;
(2) describe the frames or conceptual structures which underlie these;
(3) examine sentences, using a very large corpus of contemporary English that contains these words;
and (4) record the ways in which information from the associated frames is expressed in these sentences.


VerbNet

VerbNet
VerbNet is a class-based verb lexicon.


PropBank frames

PropBank Frames
Scheme of rolesets for English verbs. A roleset, defined for a verb acting in one of its verb senses, specifies the arguments that a verb accepts, and defines which semantic role each argument is playing with respect to the verb.
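A roleset can be pictured as a small record that maps numbered argument labels to role descriptions for one sense of a verb. The Python sketch below is illustrative only: the `Roleset` class and `describe` function are hypothetical, and the "give.01" entry is written from memory in the style of PropBank rather than copied from the actual frame files, so it should be checked against them before use.

```python
from dataclasses import dataclass


@dataclass
class Roleset:
    """One verb sense plus the semantic role played by each numbered argument."""
    roleset_id: str
    roles: dict


# Illustrative entry in the style of a PropBank frame (not copied from one).
GIVE_01 = Roleset(
    roleset_id="give.01",
    roles={"Arg0": "giver", "Arg1": "thing given", "Arg2": "entity given to"},
)


def describe(roleset, arguments):
    """Map each argument filler to the role its label has in the roleset."""
    return {filler: roleset.roles[label] for label, filler in arguments.items()}


# "Mary gave John a book."
LABELED = describe(GIVE_01, {"Arg0": "Mary", "Arg1": "a book", "Arg2": "John"})
```

Because the roles are defined per verb sense, another sense of the same verb would get its own roleset with its own argument descriptions.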


MIKROKOSMOS

MIKROKOSMOS
A full NLP project including an ontology and Text Meaning Representations (TMRs).


ThoughtTreasure

ThoughtTreasure
ThoughtTreasure is a commonsense knowledge base and architecture for natural language processing that uses multiple representations including logic, finite automata, grids, and scripts.


Ellogon

Ellogon Language Engineering Platform
NLP is an extremely exciting and useful task, but it can be really complicated, and building NLP systems can be absolutely, outrageously, often unaffordably expensive.
Ellogon is an effort that tries to keep all the excitement and reduce all the complexity... Ellogon is different from other similar software. First of all, it respects the user's time by offering a simple and user-friendly graphical interface. But beneath this simple appearance hides a powerful engine that has proven able to support a wide range of uses, from simple research prototypes to commercial applications.
Ellogon is licensed under the GNU LGPL license, is easy to install and administer and is reliable. Running under all major operating systems, Ellogon offers a comfortable environment for computational linguists, language engineers or plain users.


GATE - General Architecture for Text Engineering

GATE
GATE is one of the most widely used human language processing systems in the world. It is a tool for:
scientists performing experiments that involve processing human language;
companies developing applications with language processing components;
teachers and students of courses about language and language computation.
GATE comprises an architecture, framework (or SDK) and graphical development environment, and has been built over the past eight years in the Sheffield NLP group. The system has been used for many language processing projects; in particular for Information Extraction in many languages. The system supports the full lifecycle of language processing components, from corpus collection and annotation through system evaluation. GATE is funded by the EPSRC and the EU.


McCullough Knowledge Explorer

McCullough Knowledge Explorer and the MKR language
MKR is a user-friendly RI ("Real Intelligence") language which combines the best features of English, UNIX shell, Unicon and CycL. MKR propositions have a terse English-like format which helps a human user focus on essential characteristics and avoid floating abstractions. MKR is a very-high-level knowledge representation language with a rigorous epistemological foundation including context, genus-differentia definitions, ECP hierarchies (knits) and a unique characterization of the changes associated with actions.


OpenCyc

OpenCYC
OpenCyc is the open source version of the Cyc technology, the world's largest and most complete general knowledge base and commonsense reasoning engine. Cycorp, the builders of Cyc, have set up an independent organization, OpenCyc.org, to disseminate and administer OpenCyc, and have committed to a pipeline through which all current and future Cyc technology will flow into ResearchCyc (available for R&D in academia and industry) and then OpenCyc.
NB: the version 0.9 that can be downloaded on opencyc.org using BitTorrent had a bug that kept OpenCyc from running on about half of the Windows machines that tried it. This bug is fixed in version 0.9.5 that can be downloaded on https://sourceforge.net/projects/opencyc.

Mapping of EuroWordnet Top Ontology to Upper Cyc Ontology
A mapping of EuroWordnet Top Ontology into Upper Cyc Ontology is presented. The mapping is expressed in terms of a CycL microtheory encoding of the EuroWordnet Top Ontology, because it cannot be made by means of equivalence and subsumption relations alone.


KIM (Knowledge and Information Management) Platform

Ontotext KIM
The KIM Platform provides a novel Knowledge and Information Management (KIM) infrastructure and services for automatic semantic annotation, indexing, and retrieval of unstructured and semi-structured content. The most direct applications of KIM are:
Generation of meta-data for the Semantic Web, which allows hyper-linking and advanced visualization and navigation;
Knowledge Management, enhancing the efficiency of the existing indexing, retrieval, classification and filtering applications.
Ontotext is a Sirma laboratory for R&D related to knowledge representation, linguistics, and web services. We provide core technology with applications in Knowledge Management, Semantic Web, and integration. Ontotext is proven to be knowledgeable, reliable, and cost-effective in:
development of tools and solutions: knowledge management; language engineering; semantic web services; custom reasoning services;
ontology design, evaluation, and mapping: domain analysis and modelling; application-specific ontologies.
Our most popular product is the KIM platform for semantic annotation, indexing and retrieval.


Disambiguation

SensEval
There are now many computer programs for automatically determining the sense of a word in context (Word Sense Disambiguation or WSD). The purpose of Senseval is to evaluate the strengths and weaknesses of such programs with respect to different words, different varieties of language, and different languages.

Word Sense Disambiguation in a Slot Grammar Framework
This is a preliminary report on a system for word sense disambiguation (WSD) for unrestricted vocabulary, which requires no training on tagged text. Disambiguation is done to WordNet word senses. The “disambiguating power” of the system comes from three sources: (A) Parsing by English Slot Grammar (ESG), (B) the WordNet relation system, and (C) the WordNet sense frequency data.


MontyLingua

MontyLingua V.2.1 (Python and Java)
MontyLingua is a free, commonsense-enriched, end-to-end natural language understander for English. Feed raw English text into MontyLingua, and the output will be a semantic interpretation of that text. Perfect for information retrieval and extraction, request processing, and question answering. From English sentences, it extracts subject/verb/object tuples, extracts adjectives, noun phrases and verb phrases, and extracts people's names, places, events, dates and times, and other semantic information. MontyLingua makes traditionally difficult language processing tasks trivial!


NooJ

NooJ
NooJ is a linguistic development environment that includes large-coverage dictionaries and grammars, and parses corpora in real time. NooJ includes tools to create and maintain large-coverage lexical resources, as well as morphological and syntactic grammars. Dictionaries and grammars are applied to texts in order to locate morphological, lexical and syntactic patterns and tag simple and compound words. NooJ can build complex concordances, with respect to all types of Finite State and Context-Free patterns. NooJ users can easily develop extractors to identify semantic units in large texts, such as names of persons, locations, dates, technical expressions of finance, etc.


Thesis & Papers

A syntax / semantic interface using broad-coverage resources in English
In Natural Language Processing, we must first compute a semantic representation of a text prior to “understanding” it. We describe here how to pass from a syntactic structure (generated by a syntactic parser of English) to a semantic form (in the form of predicates and relations between these predicates).
Our approach is based on the interoperability between several resources, covering syntactical (Link Grammar Parser), lexical (WordNet) and semantic (VerbNet) aspects of English. The joint use of these broad-coverage resources leads to encouraging results on lexical and syntactical disambiguation. That also makes it possible to assign a “semantic probability” to each interpretation of a sentence.
(MSc dissertation of François-Régis Chaumartin)

A Practical Semantic Representation For Natural Language Parsing
This thesis deals with the problem of building fast, accurate and portable parsers for natural language understanding. Our focus is a multi-domain dialogue system in which we need a deep linguistically-motivated parser to produce the representations of the input suitable for reasoning. In this dissertation, we are concerned with building parsers which have the wide coverage and portability offered by a general syntactic grammar without sacrificing parsing speed and accuracy.

Synchronisation des connaissances syntaxiques et sémantiques pour l'analyse d'énoncés en langage naturel à l'aide des grammaires d'arbres adjoints lexicalisées - Djamé Seddah
An interface between syntax and semantics aims to propose a logical formalization of the relations between the parts of a sentence. This thesis is a proposal based upon the analysis of problematic linguistic phenomena in the Lexicalized Tree Adjunct Grammars (LTAG) framework. LTAG is a linguistic formalism which provides two structures of representation, the derived tree and the derivation tree. The latter is an almost perfect structure to be used as a canvas for semantic analysis. However, the derivation tree cannot represent coindexations in an autonomous way. We based our proposition upon the study of linguistic phenomena induced by control verbs. In order to allow their treatment and their complete formalization, we modify the initial LTAG formalism by introducing a new kind of lexical information: the control canvas. Its purpose is to integrate inference of missing argumental links into a synchronous course of derived trees and derivation trees via a shared forest. We propose a dynamic reconstruction algorithm based on inference rules. These rules are executed during the derivation tree extraction process from the shared forest. As we use tabular techniques, we can extract, into a dependency graph, all the argumental relations described by one shared forest.



Knowledge management

KIF (Knowledge Interchange Format)

KIF
Knowledge Interchange Format (KIF) is a computer-oriented language for the interchange of knowledge among disparate programs. It has declarative semantics (i.e. the meaning of expressions in the representation can be understood without appeal to an interpreter for manipulating those expressions); it is logically comprehensive (i.e. it provides for the expression of arbitrary sentences in the first-order predicate calculus); it provides for the representation of knowledge about the representation of knowledge; it provides for the representation of nonmonotonic reasoning rules; and it provides for the definition of objects, functions, and relations.

SUMO
The Suggested Upper Merged Ontology (SUMO) and its domain ontologies form the largest formal public ontology in existence today. They are being used for research and applications in search, linguistics and reasoning. SUMO is the only formal ontology that has been mapped to all of the WordNet lexicon. SUMO is written in the SUO-KIF language. SUMO is free and owned by the IEEE.


GKB / GFP

GKB-Editor (Generic Knowledge Base Editor)
The GKB-Editor (Generic Knowledge Base Editor) is a tool for graphically browsing and editing knowledge bases across multiple Frame Representation Systems (FRSs) in a uniform manner. It offers an intuitive user interface, in which objects and data items are represented as nodes in a graph, with the relationships between them forming the edges. Users edit a KB through direct pictorial manipulation, using a mouse or pen. A sophisticated incremental browsing facility allows the user to selectively display only that region of a KB that is currently of interest, even as that region changes.

Generic Frame Protocol (GFP)
The Generic Frame Protocol (GFP), jointly developed at SRI International and Stanford University, provides a set of functions that support a generic interface to underlying frame representation systems (FRSs). The interface layer allows an application some independence from the idiosyncrasies of specific FRS software and enables the development of generic tools that operate on many FRSs.


Other resources on Knowledge Management

Knowledge Representation - Lecture Listing
Knowledge representation: a course by Chris Thornton

Sites Relevant to Ontologies and Knowledge Sharing
A list of resources on Ontologies and Knowledge Sharing.



Question Answering / Information Extraction

Webclopedia

Webclopedia
Webclopedia - Targeted Delivery of Multilingual Information
Webclopedia is a multilingual natural-language processing research project that aims to answer questions based on texts available on the Web or other text collections.


AnswerFinder

AnswerFinder
This is a general purpose open-domain question answering system (written in Java) that draws its answers from the Internet.


Search engine API

Google Web APIs
With the Google Web APIs service, software developers can query billions of web pages directly from their own computer programs. Google uses the SOAP and WSDL standards so a developer can program in his or her favorite environment - such as Java, Perl, or Visual Studio .NET.

Yahoo! Search Web Services
Yahoo! Search Web Services allow you to access Yahoo content and services in your favorite programming languages. This means you can now build Yahoo directly into your own applications.

MSN Web Search SDK
The MSN Search SDK provides documentation that describes the core concepts, requirements, development guidelines, and class library for the MSN Search Web Service. The SDK also contains sample code that demonstrates application development techniques using the MSN Search Web Service.


Advanced Web search engine

AnswerBus
AnswerBus is an open-domain question answering system (QA) based on intelligent information retrieval. It accepts your questions in natural language and extracts answers from the Web. Currently, you can ask questions in English, German, French, Italian, Spanish and Portuguese.

KartOO
KartOO is a metasearch engine that presents its results as a map. The sites found are represented by pages of varying size according to their relevance. Between these sites appear topics that you can click to refine your search.

WebClust
WebClust is a meta search engine based on a technology called "Document clustering": the automatic organization of documents into meaningful groups. WebClust queries one or more web search engines, parses their result pages to extract the documents (titles, URLs, and short descriptions) and groups the documents based on this information. This process presents the best results of the web in a "horizontal" topical arrangement in addition to a single vertical list.


Search engine

DotLucene
DotLucene is a powerful open-source search engine for .NET.

Lucene Java
The Apache Lucene project develops open-source search software, including Lucene Java, our flagship sub-project, which provides Java-based indexing and search technology.


Books on IE

Extraction automatique d'information, du texte brut au web sémantique
Companies and individuals face an ever-growing mass of information. Starting from this observation, many systems have been designed to filter, sort and categorize information. By contrast, far fewer offerings address the analysis of content. Extraction automatique d'information - du texte brut au web sémantique presents recent advances in information extraction and text understanding. Research carried out in recent years in natural language processing makes it possible to annotate documents semantically, to extract relevant information and to build structured knowledge bases from natural-language texts. The book reviews the major research trends that have shaped the field of automatic text understanding. It continues with a detailed presentation of a system called SEMTEX, which is applied to a wide variety of texts and situations. The applications described offer perspectives on the Semantic Web and knowledge engineering.


FASTUS

FASTUS - Extracting Information from Real-World Texts
FASTUS is a (slightly permuted) acronym for Finite State Automata-based Text Understanding System. It is a system for extracting information from free text. Currently English and Japanese versions of the system exist. Typical applications mark text with annotations that indicate items of interest, such as names of people or companies, or it fills database templates with information that could be then entered into a relational database.
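
As a rough illustration of the idea described above (finite-state patterns that mark items of interest and fill a template), here is a minimal Python sketch. It is not FASTUS itself: the two regular expressions, the `fill_template` helper and the sample sentence are all hypothetical, and a real system would use far richer cascaded patterns.

```python
import re

# Hypothetical finite-state patterns (regular expressions), for illustration only.
# A company: one or more capitalized words followed by a company suffix.
COMPANY = re.compile(r"\b((?:[A-Z][a-z]+ )+(?:Inc|Corp|Ltd))\b")
# A person: a title followed by one or more capitalized words.
PERSON = re.compile(r"\b(?:Mr|Ms|Dr)\. ([A-Z][a-z]+(?: [A-Z][a-z]+)*)")


def fill_template(text):
    """Return a flat record of the items of interest found in the text."""
    return {
        "companies": COMPANY.findall(text),
        "people": PERSON.findall(text),
    }


record = fill_template("Dr. Jane Smith joined Acme Widgets Inc last year.")
print(record)
```

The resulting dictionary plays the role of a filled template whose fields could then be written to a database row.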


Other resources on Question Answering / Information Extraction

Sekine's Extended Named Entity Hierarchy
The Extended Named Entity Hierarchy is designed and developed to meet increasing needs for a wider range of NE types. It originates from the first Named Entity set defined by MUC (Grishman et al., 1996), the Named Entity set developed by IREX (Sekine et al., 2000), and the Extended Named Entity hierarchy which contains approximately 150 NE types (Sekine et al., 2002). It has now been extended again to 200 NE types. The applications include Question Answering (Q&A) systems that analyze general texts such as newspaper articles, as well as Information Extraction (IE), Machine Translation (MT), Summarization and Information Retrieval (IR) systems that serve a variety of NLP applications. We designed the Extended Named Entity Hierarchy for Q&A and IE systems, assuming that the information one wants to know is basically in the form of a noun phrase with specific names, time expressions or numerical values.



Semantic Web

OWL

OWL Web Ontology Language Reference
The Web Ontology Language OWL is a semantic markup language for publishing and sharing ontologies on the World Wide Web. OWL is developed as a vocabulary extension of RDF (the Resource Description Framework) and is derived from the DAML+OIL Web Ontology Language.


Jena 2 - A Semantic Web Framework

Jena 2
Jena is a Java framework for writing Semantic Web applications.

Jena 2 source code
Jena is a Java framework for writing Semantic Web applications.

Wicked Cool Java: Crawling the Semantic Web (Get started with RDF)
Brian Eubanks explains how Java developers can participate in the Semantic Web, a project that strives to create a universal medium for information exchange by linking concepts together. He introduces the Resource Description Framework standard and presents some APIs that aid in producing or consuming content.


TAP

TAP - Building the Semantic Web
The TAP KB is a shallow but broad knowledge base containing basic lexical and taxonomic information about a wide range of popular objects. Our goal is to bootstrap the Semantic Web by providing a comprehensive source of basic information about popular objects.


Semantic MediaWiki

Semantic MediaWiki
The WikiProject "Semantic MediaWiki" provides a common platform for discussing extensions of the MediaWiki software that allow for simple, machine-based processing of Wiki-content. This usually requires some form of "semantic annotation," but the special Wiki environment and the multitude of envisaged applications impose a number of additional requirements. The overall objective of the project is to develop a single solution for semantic annotation that fits the needs of most Wikimedia projects and still meets the Wiki-specific requirements of usability and performance. It is understood that ad hoc implementations (i.e. "hacks") may sometimes solve single problems, but agreeing on common editing syntax, underlying technology, exchange formats, etc. bears huge advantages for all participants.


Papers on Semantic Web

Introduction to Semantic Web Technologies
This is intended to give someone new to the Semantic Web a basic overview of the technologies involved, and a guide to where to go to find out more.

What is the Semantic Web?
Currently the focus of a W3C working group, the Semantic Web vision was conceived by Tim Berners-Lee, the inventor of the World Wide Web. The World Wide Web changed the way we communicate, the way we do business, the way we seek information and entertainment – the very way most of us live our daily lives. Calling it the next step in Web evolution, Berners-Lee defines the Semantic Web as “a web of data that can be processed directly and indirectly by machines.”

Cerebra Technologies
A list of technologies used by the Cerebra Semantic Web product.



Encyclopedia - Specialized dictionaries

Encyclopedia

Encyclopedia.com
Encyclopedia.com, the Internet's premier free encyclopedia, provides users with more than 57,000 frequently updated articles from the Columbia Encyclopedia, Sixth Edition. Each article is enhanced with links to newspaper and magazine articles as well as pictures and maps - all provided by HighBeam Research.

Wikipedia
Wikipedia is a free-content encyclopedia that anyone can edit.

Probert Encyclopaedia
The Probert Encyclopaedia is a truly independent reference work covering all aspects of human knowledge. Because we only use independent researchers, not sponsored by industry, corporations, governments or advertisers, our data is reliable and unbiased, unlike many other sources, including the so-called 'experts' one reads, sees and hears in the media. The data within The Probert Encyclopaedia is arranged into concise articles which are classified by their scope: people, places, nature, food and drink, costume &c. These articles are fully inter-linked, allowing additional data to be quickly and easily obtained as required, and most articles have a research link that retrieves all related data with a single mouse click. You can easily search for a specific topic or subject, or if you prefer browse through the over 235,000 fully inter-linked articles to discover specific information about everything imaginable, whether it's a famous actor or a particular warship, revolver, phobia or medical complaint.

Probert Encyclopaedia former (freeware) HTML content
This text-only encyclopedia (not a program) is comprehensive enough to be useful. It is divided into major topical sections, within which entries are sorted alphabetically. There are some areas covered which some "popular" CD-ROM dictionaries often neglect. This encyclopedia has become a shareware / commercial product – the final freeware versions listed here are still available, but are no longer being updated.

Columbia Encyclopedia
The Columbia Electronic Encyclopedia contains almost 52,000 entries (marshalling six and one-half million words on a vast range of topics), with more than 84,000 hypertext cross-references. Columbia Encyclopedia is among the most complete and up-to-date electronic encyclopedias ever produced.

Ethnologue - languages of the world
An encyclopedic reference work cataloging all of the world’s 6,912 known living languages.

Britannica Concise Encyclopedia
A one-volume encyclopedia that includes 25,000 short entries.


General dictionaries

Wordsmyth
Wordsmyth is a dictionary that has several important and distinctive qualities. Chief among the distinctive features are (1) clarity, simplicity, and precision of style resulting in definitions that are more accessible than those of American college dictionaries; and (2) the integration of dictionary and thesaurus data, so that only one entry is required instead of both dictionary and thesaurus entries.

Edventures Term Browser
Look up tricky math and science terms

Smartpedia.com
An encyclopedia licensed under the GNU Free Documentation License (GFDL).

WordReference.com
The WordReference Dictionaries are free online translation dictionaries. Type in a word in the forms to the left for a quick translation or definition.

The Free Dictionary
English, Medical, Legal, Financial, and Computer Dictionaries, Thesaurus, Acronyms, Encyclopedia, a Literature Reference Library, and a Search Engine all in one!

OneLook
OneLook brings together various general and specialized dictionaries in a single search tool. The subjects cover almost every field. It includes nearly 1,000 dictionaries with more than 6,000,000 words.

Cambridge Dictionaries Online
Dictionaries published for people learning English all around the world.

UltraLingua
Contains over 120,000 definitions & 80,000 synonyms. The online interface allows you to search for words or parts of words, search within definitions to find headwords (Reverse Dictionary), and search for words that sound alike (Phonetic Dictionary).

Longman Dictionary of Contemporary English (LDOCE)
You can use the Longman Web Dictionary to look up ANY word on the Web. It contains over 80,000 words and phrases, including 15,000 references to people, places, events and organizations.

The American Heritage Dictionary of the English Language
The American Heritage® Dictionary of the English Language, Third Edition, contains more than 350,000 entries and meanings. Word definitions are illustrated by more than 34,000 usage examples, more than 500 usage notes, and a recently revised appendix of Indo-European roots.


Specialized dictionaries

Webopedia
The only online dictionary and search engine you need for computer and Internet technology definitions.

FOLDOC
Free On-Line Dictionary Of Computing

Glossary of legal terms
Based on Merriam-Webster's Dictionary of Law 2001.

EconomicExpert.com
This site is intended as a resource for those working or interested in working on macroeconomic research, training, education and economic development. We provide a comprehensive and searchable reference tool on the web. Our website is completely free and non-profit.

Dictionary of legal terms
An online English dictionary of legal terms, with clear explanations of 3,000 common legal terms.

The WorldWideWeb Acronym and Abbreviation Server
You can search here for acronyms and for words used in acronyms. An acronym is a label formed from the beginnings of words (Greek: acro [head] and nym [word]) -- or very rarely, from letters in the middle of words. There is no requirement that an acronym be pronounceable as a normal word (this is a curious myth perpetuated by American dictionaries): IBM is just as much an acronym as LASER.

Find out what those acronyms and abbreviations stand for
The web's most comprehensive dictionary of acronyms, abbreviations, and initialisms (414,000+ definitions).

Find the meanings of military terms and acronyms
With Military Words, you can search for military/government acronyms and abbreviations (powered by Acronym Finder) and military terms from the US DoD Joint Publication

MedTerms
The MedTerms Medical Dictionary is somewhat different from the traditional medical dictionary. Since this Medical Dictionary was first conceived some years ago, the medical staff of MedicineNet.com has added (and subtracted) entries almost daily. We have also revised existing entries on an ongoing basis. The MedTerms Medical Dictionary is an online publication with the advantages of this electronic medium.

Merriam-Webster medical dictionary
A very comprehensive dictionary of medical English.

BioTech life science dictionary
Currently, most of our 8300+ terms deal with biochemistry, biotechnology, botany, cell biology and genetics. We also have some terms relating to ecology, limnology, pharmacology, toxicology and medicine.

The CMU Pronouncing Dictionary
The Carnegie Mellon University Pronouncing Dictionary is a machine-readable pronunciation dictionary for North American English that contains over 125,000 words and their transcriptions. This format is particularly useful for speech recognition and synthesis, as it has mappings from words to their pronunciations in the given phoneme set. The current phoneme set contains 39 phonemes, for which the vowels may carry lexical stress.
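
To make the word-to-pronunciation mapping concrete, the following Python sketch parses one dictionary entry. The line layout is an assumption not stated in the description above: a word followed by space-separated phoneme symbols, with a trailing digit on each vowel as its stress marker. The `parse_entry` helper and the sample line are hypothetical.

```python
def parse_entry(line):
    """Split one dictionary line into (word, phonemes, stresses).

    Assumed layout: WORD  PH1 PH2 ...  where vowels end in a stress digit.
    """
    word, *symbols = line.split()
    phonemes, stresses = [], []
    for sym in symbols:
        if sym[-1].isdigit():
            # Vowel: strip the stress digit and record it separately.
            phonemes.append(sym[:-1])
            stresses.append(int(sym[-1]))
        else:
            # Consonant: no stress marker.
            phonemes.append(sym)
    return word, phonemes, stresses


word, phonemes, stresses = parse_entry("HELLO  HH AH0 L OW1")
print(word, phonemes, stresses)
```

Keeping the stress digits in a separate list makes it easy to compare stress patterns between words, for example when looking for rhymes.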



Programming languages

PROLOG

P#
P# is a compiler which facilitates interoperation between a concurrent superset of the Prolog programming language and C#. This enables Prolog to be used as a native implementation language for Microsoft's .NET platform. P# compiles a linear logic extension of Prolog to C# source code.

PROLOG tutorial in French
These course notes correspond to a 20-hour module intended for second-year IUT students.

On-line guide to PROLOG programming
A contribution to the evolving area of logic programming languages, and PROLOG in particular.

PROLOG theorem solver
PROLOG theorem solver.

Solutions for "The Zebra Puzzle"
This is an example of a completely specified solution which doesn't appear to be specified at all. The constraints are such that the answer is unique, but they are stated in such a way that it is not at all obvious (to this human, at least) what the answer is.
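
The point about constraints pinning down a unique answer can be seen on a much smaller puzzle. The sketch below is not the Zebra Puzzle and not one of the linked solutions; it is a hypothetical three-house puzzle solved in Python by brute-force search over all arrangements, with two invented constraints.

```python
from itertools import permutations

COLORS = ("red", "green", "blue")


def solutions():
    """Return every left-to-right house ordering that satisfies the constraints."""
    found = []
    for order in permutations(COLORS):
        pos = {color: i for i, color in enumerate(order)}
        # Constraint 1: the green house is immediately to the right of the red one.
        if pos["green"] != pos["red"] + 1:
            continue
        # Constraint 2: the red house is not the first house.
        if pos["red"] == 0:
            continue
        found.append(order)
    return found


answers = solutions()
print(answers)
```

Neither constraint mentions the blue house, yet exactly one arrangement survives, which is the same effect the full puzzle relies on at a larger scale.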

The µ-TBL Homepage
The µ-TBL system represents an attempt to use the search and database capabilities of the Prolog programming language to implement a generalized form of transformation-based learning.


Java

OpenCCG: The OpenNLP CCG Library
OpenCCG, the OpenNLP CCG Library, is an open source natural language processing library written in Java, which provides parsing and realization services based on Mark Steedman's Combinatory Categorial Grammar (CCG) formalism.

AnswerFinder
This is a general purpose open-domain question answering system (written in Java) that draws its answers from the Internet.

Instance-Based Learning: A Java Implementation
Instance-Based Learning (IBL) is defined as the generalizing of a new instance (target) to be classified from the stored training examples. Training examples are processed when a new instance arrives. Instance-Based Learning methods are sometimes called Lazy Learning because they delay the processing until a new instance must be classified. Each time a new query instance is encountered, its relationship to the previously stored examples is examined to assign a target function value for the new instance.
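
The lazy-learning behaviour described above can be sketched with a k-nearest-neighbour classifier, one common instance-based method. This is a minimal Python illustration rather than the linked Java implementation: the `NearestNeighbors` class, the Euclidean distance and the toy data are assumptions made for the example.

```python
import math
from collections import Counter


class NearestNeighbors:
    """Lazy learner: stores examples and defers all work to query time."""

    def __init__(self, k=3):
        self.k = k
        self.examples = []  # list of (features, label) pairs

    def add(self, features, label):
        # "Training" is just storing the example.
        self.examples.append((features, label))

    def classify(self, query):
        # Compare the query with every stored example, keep the k closest,
        # and return the majority label among them.
        nearest = sorted(
            self.examples, key=lambda ex: math.dist(ex[0], query)
        )[: self.k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]


learner = NearestNeighbors(k=3)
for features, label in [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b"),
]:
    learner.add(features, label)

prediction = learner.classify((1.1, 1.0))
print(prediction)
```

Because nothing is generalized ahead of time, adding a new example is cheap, while each query costs a pass over all stored examples.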

Selection Engine
Selection Engine is a Java Case-Based-Reasoning (CBR) Tool

Java KBtextmaster Natural Language Processing Toolkit
Utilities for reading a variety of file formats (e.g., Microsoft Word, PowerPoint, PDF, OpenOffice.org, AbiWord), part-of-speech tagging, automatic categorization, extraction of human and place names from text, automatic summarization, document clustering, full indexing and search (using Lucene), etc.

GATE
GATE is one of the most widely used human language processing systems in the world. It is a tool for:
scientists performing experiments that involve processing human language;
companies developing applications with language processing components;
teachers and students of courses about language and language computation.
GATE comprises an architecture, framework (or SDK) and graphical development environment, and has been built over the past eight years in the Sheffield NLP group. The system has been used for many language processing projects; in particular for Information Extraction in many languages. The system supports the full lifecycle of language processing components, from corpus collection and annotation through system evaluation. GATE is funded by the EPSRC and the EU.

Noun Phrase Chunker
This application is a Java implementation of the Ramshaw and Marcus BaseNP chunker (in fact the files in the resources directory are taken straight from their original distribution) which attempts to insert brackets marking noun phrases in text which has been marked with POS tags in the same format as the output of Eric Brill's transformational tagger. The output from this version should be identical to the output of the original C++/Perl version released by Ramshaw and Marcus. A wrapper is also included which allows the easy use of this chunker within the GATE framework.

Jena 2 source code
Jena is a Java framework for writing Semantic Web applications.

Lucene Java
The Apache Lucene project develops open-source search software, including Lucene Java, our flagship sub-project, which provides Java-based indexing and search technology.

PowerLoom Knowledge Representation System
PowerLoom™ is the successor to the Loom™ knowledge representation system. It provides a language and environment for constructing intelligent applications. PowerLoom uses a fully expressive, logic-based representation language (a variant of KIF), and it uses a natural-deduction-style backward and forward chainer as its inference engine. The inference engine is not a complete first-order theorem prover, but it can handle complex rules, negation, equality reasoning, subsumption, and restricted forms of higher order reasoning. PowerLoom has a classifier that is able to classify descriptions expressed in full first order predicate calculus [See paper]. PowerLoom uses modules as a structuring device for knowledge bases, and lightweight worlds for classification and hypothetical reasoning. To implement PowerLoom we developed a new programming language called STELLA, which is a Strongly Typed, Lisp-like LAnguage that can be translated into Lisp, C++ and Java. PowerLoom is written in STELLA and therefore available in Common-Lisp, C++ and Java versions.

Wicked Cool Java: Crawling the Semantic Web (Get started with RDF)
Brian Eubanks explains how Java developers can participate in the Semantic Web, a project that strives to create a universal medium for information exchange by linking concepts together. He introduces the Resource Description Framework standard and presents some APIs that aid in producing or consuming content.

Stanford Log-linear POS Tagger download
This is a Java implementation of the log-linear part-of-speech (POS) taggers.


.NET (C#, VB.NET, Delphi.NET…)

Artificial Mind : .NET SDK for Artificial Intelligence
ArtificialMind is a free Artificial Intelligence platform (SDK) that provides the following services: Search Algorithms for problem solving, Genetic Algorithms and Artificial Neural Networks.

Nsolver
NSolver is a powerful programming language extension for ECMA CLS-compliant languages. It adds constraint programming capabilities to CLS-compliant languages.

NxBRE
NxBRE is a lightweight Business Rule Engine (aka Rule Based Engine) for the .NET platform, composed of a forward-chaining inference engine and an XML-driven flow control engine. It supports RuleML 0.86 Naf Datalog and Visio 2003 modeling.

P#
P# is a compiler which facilitates interoperation between a concurrent superset of the Prolog programming language and C#. This enables Prolog to be used as a native implementation language for Microsoft's .NET platform. P# compiles a linear logic extension of Prolog to C# source code.

DotLucene
DotLucene is a powerful open-source search engine for .NET.

FLUtE (Fuzzy Logic Ultimate Engine)
FLUtE (Fuzzy Logic Ultimate Engine) is a library released under the LGPL license that lets users strengthen their projects with fuzzy logic techniques.

NooJ
NooJ is a linguistic development environment that includes large-coverage dictionaries and grammars, and parses corpora in real time. NooJ includes tools to create and maintain large-coverage lexical resources, as well as morphological and syntactic grammars. Dictionaries and grammars are applied to texts in order to locate morphological, lexical and syntactic patterns and tag simple and compound words. NooJ can build complex concordances, with respect to all types of Finite State and Context-Free patterns. NooJ users can easily develop extractors to identify semantic units in large texts, such as names of persons, locations, dates, technical expressions of finance, etc.

WordNet.Net
WordNet.Net library - the .Net Framework library for WordNet.

WordNet-based semantic similarity measurement
Semantic similarity is a confidence score that reflects the semantic relation between the meanings of two sentences. It is difficult to gain a high accuracy score because the exact semantic meanings are completely understood only in a particular context.


C / C++

Link Grammar Parser
The Link Grammar Parser is a syntactic parser of English, based on link grammar, an original theory of English syntax. Given a sentence, the system assigns to it a syntactic structure, which consists of a set of labeled links connecting pairs of words. The parser also produces a "constituent" representation of a sentence (showing noun phrases, verb phrases, etc.).

CASS chunker
CASS is a fast, robust partial parser developed by Steven Paul Abney. It is designed for use with large amounts of noisy text.

YamCha: Yet Another Multipurpose CHunk Annotator
YamCha is a generic, customizable, and open source text chunker oriented toward many NLP tasks, such as POS tagging, Named Entity Recognition, base NP chunking, and Text Chunking. YamCha uses a state-of-the-art machine learning algorithm called Support Vector Machines (SVMs), first introduced by Vapnik in 1995.

SS Tagger - a part-of-speech tagger for English
Tagging speed is crucial in large-scale information extraction and real-time NLP applications. This part-of-speech (POS) tagger offers fast tagging (2400 tokens/sec) with state-of-the-art accuracy (97.10% on the WSJ corpus). The tagger uses an extension of Maximum Entropy Markov Models (MEMM), in which tags are determined in an easiest-first manner.

Eric Brill's trainable rule-based part of speech tagger
The NLP programs that you can download: a supervised part of speech tagger, an unsupervised part of speech tagger, and a prepositional phrase attachment program. This tagger is based on transformation-based error-driven learning, a technique that has been effective in a number of natural language applications, including part of speech and word sense tagging, prepositional phrase attachment, and syntactic parsing.
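As a rough illustration of the transformation-based idea described above, the sketch below applies an ordered list of already-learned rewrite rules to an initial tag sequence. It is a simplified sketch only, not Brill's code: the `Rule` type, the `apply_rules` function, and the single example rule are hypothetical, and the error-driven learning step that selects the rules is not shown.

```python
from typing import List, NamedTuple


class Rule(NamedTuple):
    """Hypothetical rule: change from_tag to to_tag when the previous tag is prev_tag."""
    from_tag: str
    to_tag: str
    prev_tag: str


def apply_rules(tags: List[str], rules: List[Rule]) -> List[str]:
    """Apply each rule in order, scanning the tag sequence left to right."""
    out = list(tags)  # copy so the initial tagging is left untouched
    for rule in rules:
        for i in range(1, len(out)):
            if out[i] == rule.from_tag and out[i - 1] == rule.prev_tag:
                out[i] = rule.to_tag
    return out


# Initial tagging: each word gets its most frequent tag (assumed here).
words = ["to", "race"]
initial_tags = ["TO", "NN"]

# One illustrative transformation: a noun right after "to" becomes a verb.
rules = [Rule(from_tag="NN", to_tag="VB", prev_tag="TO")]

corrected = apply_rules(initial_tags, rules)
```

In this toy run the initial guess `NN` for "race" is rewritten to `VB` because the preceding tag is `TO`. A real tagger would learn many such rules from an annotated corpus by repeatedly picking the rule that removes the most errors.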

SVMTool
The SVMTool is a simple and effective generator of sequential taggers based on Support Vector Machines. We have applied the SVMTool to the problem of part-of-speech tagging. By means of a rigorous experimental evaluation, we conclude that the proposed SVM-based tagger is robust and flexible for feature modelling (including lexicalization), trains efficiently with almost no parameters to tune, and is able to tag thousands of words per second, which makes it really practical for real NLP applications. Regarding accuracy, the SVM-based tagger significantly outperforms the TnT tagger under exactly the same conditions, and achieves a very competitive accuracy of 97.2% for English on the Wall Street Journal corpus, which is comparable to the best taggers reported to date.

NeoClassic: The C++ Version of Classic
Classic is a family of knowledge representation (KR) systems designed for applications where only limited expressive power is necessary, but rapid responses to questions are essential. The Classic systems are based on description logics (DLs), which gives them an object-centered flavor, and thus most of the features available in semantic networks are also available in Classic. Classic has a framework that allows users to represent descriptions, concepts, roles, individuals and rules. Classic allows for both primitive concepts, similar to the classes and frames of other knowledge representation systems and object-oriented programming languages, and defined concepts, i.e. concepts that have both necessary and sufficient conditions for membership. Concepts are automatically organized into a generalization taxonomy and objects are automatically made instances of all concepts for which they pass the membership test. Another type of reasoning that Classic does is to detect inconsistencies in information that it is told. In the presence of defined concepts these operations are non-trivial and useful.

ThoughtTreasure
ThoughtTreasure is a commonsense knowledge base and architecture for natural language processing that uses multiple representations including logic, finite automata, grids, and scripts.

PowerLoom Knowledge Representation System
PowerLoom™ is the successor to the Loom™ knowledge representation system. It provides a language and environment for constructing intelligent applications. PowerLoom uses a fully expressive, logic-based representation language (a variant of KIF), and it uses a natural-deduction-style backward and forward chainer as its inference engine. The inference engine is not a complete first-order theorem prover, but it can handle complex rules, negation, equality reasoning, subsumption, and restricted forms of higher order reasoning. PowerLoom has a classifier that is able to classify descriptions expressed in full first order predicate calculus [See paper]. PowerLoom uses modules as a structuring device for knowledge bases, and lightweight worlds for classification and hypothetical reasoning. To implement PowerLoom we developed a new programming language called STELLA, which is a Strongly Typed, Lisp-like LAnguage that can be translated into Lisp, C++ and Java. PowerLoom is written in STELLA and therefore available in Common-Lisp, C++ and Java versions.

MINIPAR
MINIPAR is a broad-coverage parser for the English language. An evaluation with the SUSANNE corpus shows that MINIPAR achieves about 88% precision and 80% recall with respect to dependency relationships. MINIPAR is very efficient: on a Pentium II 300 with 128MB memory, it parses about 300 words per second.


Description logics languages

KRHyper
This is the homepage of the new implementation of KRHyper in OCaml. KRHyper is a first-order logic theorem proving and model generation system based on the hyper tableau calculus.

PowerLoom Knowledge Representation System
PowerLoom™ is the successor to the Loom™ knowledge representation system. It provides a language and environment for constructing intelligent applications. PowerLoom uses a fully expressive, logic-based representation language (a variant of KIF), and it uses a natural-deduction-style backward and forward chainer as its inference engine. The inference engine is not a complete first-order theorem prover, but it can handle complex rules, negation, equality reasoning, subsumption, and restricted forms of higher order reasoning. PowerLoom has a classifier that is able to classify descriptions expressed in full first order predicate calculus [See paper]. PowerLoom uses modules as a structuring device for knowledge bases, and lightweight worlds for classification and hypothetical reasoning. To implement PowerLoom we developed a new programming language called STELLA, which is a Strongly Typed, Lisp-like LAnguage that can be translated into Lisp, C++ and Java. PowerLoom is written in STELLA and therefore available in Common-Lisp, C++ and Java versions.

NeoClassic: The C++ Version of Classic
Classic is a family of knowledge representation (KR) systems designed for applications where only limited expressive power is necessary, but rapid responses to questions are essential. The Classic systems are based on description logics (DLs), which gives them an object-centered flavor, and thus most of the features available in semantic networks are also available in Classic. Classic has a framework that allows users to represent descriptions, concepts, roles, individuals and rules. Classic allows for both primitive concepts, similar to the classes and frames of other knowledge representation systems and object-oriented programming languages, and defined concepts, i.e. concepts that have both necessary and sufficient conditions for membership. Concepts are automatically organized into a generalization taxonomy and objects are automatically made instances of all concepts for which they pass the membership test. Another type of reasoning that Classic does is to detect inconsistencies in information that it is told. In the presence of defined concepts these operations are non-trivial and useful.

RACER
RacerPro is an OWL reasoner and inference server for the Semantic Web.

The Description Logic Handbook : Theory, Implementation and Applications
Description Logics are knowledge representation languages that have been studied extensively in artificial intelligence over the last two decades. This Handbook covers all aspects of research in this field; including theory, implementation, and applications. Its appeal is broad, ranging from more theoretically-oriented readers, to those with more practically-oriented interests who need a sound and modern understanding of knowledge representation systems based on Description Logics. The chapters by some of the most prominent researchers in the field first introduce the basic technical material before addressing the current state of the subject. This unique reference can also be used for self-study or in conjunction with knowledge representation and artificial intelligence courses.

Automated reasoning tools directory
A full list of theorem provers and satisfiability solvers.



Annotations / Corpus

Gutenberg

Project Gutenberg
There are 17,000 free books in the Project Gutenberg Online Book Catalog.


Corpora lists

Corpora and other Language and Speech Data under DICE
A full list of public corpora, some of them being freely available.

Downloadable Research Resources
Many (freely available) annotated corpora of spoken or written English (SUSANNE, CHRISTINE, LUCY), from Geoffrey Sampson.


AGTK: Annotation Graph Toolkit

AGTK
Annotation Graphs are a formal framework for representing linguistic annotations of time series data. Annotation graphs abstract away from file formats, coding schemes and user interfaces, providing a logical layer for annotation systems.
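To make the idea concrete, here is a minimal sketch of an annotation graph as a plain data structure: nodes that may carry a time offset, and labeled arcs between them. This is not the AGTK API; the class names (`Node`, `Arc`, `AnnotationGraph`), the field names, and the sample data are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Node:
    """A point in the signal; the time offset (in seconds) may be unknown."""
    id: str
    offset: Optional[float] = None


@dataclass(frozen=True)
class Arc:
    """A labeled annotation spanning two nodes."""
    start: Node
    end: Node
    type: str   # annotation layer, e.g. "word" or "speaker"
    label: str  # annotation content


@dataclass
class AnnotationGraph:
    arcs: List[Arc] = field(default_factory=list)

    def add(self, start: Node, end: Node, type: str, label: str) -> Arc:
        arc = Arc(start, end, type, label)
        self.arcs.append(arc)
        return arc

    def labels(self, type: str) -> List[str]:
        """Labels of one layer, ordered by start offset (unknown offsets last)."""
        layer = [a for a in self.arcs if a.type == type]
        layer.sort(
            key=lambda a: a.start.offset if a.start.offset is not None else float("inf")
        )
        return [a.label for a in layer]


# Two word annotations and one speaker annotation over the same time span.
n0, n1, n2 = Node("n0", 0.0), Node("n1", 0.32), Node("n2", 0.61)
graph = AnnotationGraph()
graph.add(n1, n2, "word", "world")
graph.add(n0, n1, "word", "hello")
graph.add(n0, n2, "speaker", "A")

word_labels = graph.labels("word")
```

Because each layer is just a set of arcs over shared nodes, different annotation schemes can coexist on the same signal without depending on any particular file format.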


Reuters Corpora

Reuters Corpus, Volume 1 (RCV1)
In 2000, Reuters Ltd made available a large collection of Reuters News stories for use in research and development of natural language processing, information retrieval, and machine learning systems. This corpus, known as "Reuters Corpus, Volume 1" or RCV1, is significantly larger than the older, well-known Reuters-21578 collection heavily used in the text classification community.

Reuters-21578
The data was originally collected and labeled by Carnegie Group, Inc. and Reuters, Ltd. in the course of developing the CONSTRUE text categorization system. The collection is available here as a gzipped tar archive (8.2 MB; 28.0 MB uncompressed).


Text Encoding Initiative

Text Encoding Initiative
Initially launched in 1987, the TEI is an international and interdisciplinary standard that helps libraries, museums, publishers, and individual scholars represent all kinds of literary and linguistic texts for online research and teaching, using an encoding scheme that is maximally expressive and minimally obsolescent.


Corpus extracts

Brown Corpus (extract)
Here are 30 files from the Brown corpus.


English/US first and last names

1990 census name files
Source: U.S. Census Bureau, Population Division, Population Analysis & Evaluation Staff


Annotated Corpus

British National Corpus (BNC)
The British National Corpus (BNC) is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of current British English, both spoken and written.


Moby project

Moby lexicon project
Moby Words is part of the Moby Project, a large collection of lists of words and phrases, and works of literature (contents are now in the public domain). Partial contents of Moby Words:

  • Over 354,000 single words, excluding proper names, acronyms, or compound words and phrases. This list does not exclude archaic words or significant variant spellings.
  • 74,550 common dictionary words. A list of words in common with two or more published dictionaries. This gives the developer of a custom spelling checker a good beginning pool of relatively common words.
  • 4,946 female names. Frequent given names of females in English speaking countries.
  • 3,897 male names. Frequent given names of males in English speaking countries.
  • 21,986 names. This database contains the most common names used in the United States and Great Britain. Spelling checkers may want to supplement their basic word list with this one.



Online courses and tutorials

NLP courses

Multilingual information search
This list of links covers several aspects of multilingual information retrieval (MLIR). Techniques and approaches for translation in MLIR are presented first. Some systems are then listed, along with a list of sites related to their evaluation. Disambiguation tools are also described, as well as some organizations working in natural language processing. Finally, articles relevant to multilingual information retrieval and a list of software are suggested.

Introduction à la sémantique
Introduction to semantics: a first-year lecture course by Benoît Habert (LIMSI) at Université Paris X Nanterre, Sciences du Langage (Language Sciences).


AI courses

EPFL Artificial Intelligence course
A course taught in LISP.

Computers and Thought: A practical Introduction to Artificial Intelligence
The aim of this book is to introduce people with little or no computing background to artificial intelligence (AI) and cognitive science. It emphasizes the psychological, social, and philosophical implications of AI and, by means of an extended project to design an Automated Tourist Guide, makes the connection between the details of an AI programming language and the `magic' of artificial intelligence programs, which converse in English, solve problems, and offer reasoned advice.

Knowledge Representation - Lecture Listing
Knowledge representation: a course by Chris Thornton.


Grammar courses

GrammarStation.com
Come, explore and learn the English language with ease and understand the correct grammar and its usage using GrammarStation.

Coherence: Anaphora and reference
Many examples of anaphora.

Online Writing Lab : Grammar, Punctuation, and Spelling
In this section of our site, we offer you handouts and exercises on grammar, spelling, and punctuation. We also have PowerPoint presentations related to grammar, and we have an entire section of handouts and resources for English as a Second Language learners that might also prove useful.

English Grammar: explanations and exercises - by Mary Ansell
All of the essential points of English grammar are covered.
Each point of grammar is clearly explained, and is illustrated by examples.
For every important point of grammar, one or more exercises are provided, to make it easier to learn and remember the material.
Answers for the exercises are provided.
A summary of the uses and formation of the English verb tenses is given for easy reference.
Grammatically determined rules for spelling, pronunciation, and punctuation are included.
The grammar of North American English is emphasized.
Grammatical differences between formal and informal English are pointed out.

Daily Grammar
Teachers have our permission to duplicate and use the lessons in their classrooms so long as the copyright information is preserved.

Grammar Bytes! Grammar Instruction with Attitude
Find detailed definitions of common grammar terms--everything from abstract nouns to verbs!

Glossary of English Grammar Terms
The grammar glossary is a comprehensive site with very clear definitions.

French interactive grammar
An interactive French grammar with several hundred practical reference sheets.

Online English course
An online course by Kevin Halion.

Grammars and Language Courses
Here you will find grammars of over 100 languages where you can look up the rules of a language. Language courses that teach you foreign languages are also linked here, whether on line or on the shelf. Additional language resources such as newspapers, online radio stations, and our dictionaries are linked to each language.

UltraLingua grammar reference
Complete on-line grammar references for many languages (English, French, Spanish, German).


Linguistics courses

Canadian linguistic courses
Courses by Françoise Labelle.

Linguistics course by Véronique GENDNER
A lecture given as part of the IREX DESS program, CCI Bourges.

Glossary of linguistic terms
A comprehensive glossary of linguistic terms, organized as a taxonomy.



Agents

Intelligent agents

AgentLand
The intelligent agents portal.



Text generation

RAGS

RAGS
A Reference Architecture for Generation Systems



Directories

AI directory

aboutAI.net
All about Artificial Intelligence on the Net

Automated reasoning tools directory
A full list of theorem provers and satisfiability solvers.

KBS / Ontology Projects Worldwide
Some ongoing KBS/Ontology projects and groups.

Sites Relevant to Ontologies and Knowledge Sharing
A list of resources on Ontologies and Knowledge Sharing.

Fuzzy Logic
Fuzzy Logic is a departure from classical two-valued sets and logic, that uses "soft" linguistic (e.g. large, hot, tall) system variables and a continuous range of truth values in the interval [0,1], rather than strict binary (True or False) decisions and assignments.
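The contrast with two-valued logic can be sketched in a few lines. The example below is only illustrative: the `ramp` membership function, its thresholds for "hot" and "tall", and the choice of the min/max/complement operators (one common convention among several) are assumptions, not part of any listed resource.

```python
def ramp(x: float, low: float, high: float) -> float:
    """Membership degree in [0, 1]: 0 below low, 1 above high, linear in between."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)


def fuzzy_and(a: float, b: float) -> float:
    """Conjunction taken as the minimum of the two truth values."""
    return min(a, b)


def fuzzy_or(a: float, b: float) -> float:
    """Disjunction taken as the maximum of the two truth values."""
    return max(a, b)


def fuzzy_not(a: float) -> float:
    """Negation taken as the complement with respect to 1."""
    return 1.0 - a


# Hypothetical thresholds: "hot" ramps from 25 to 35 degrees Celsius,
# "tall" ramps from 160 to 190 centimetres.
hot = ramp(30.0, 25.0, 35.0)
tall = ramp(190.0, 160.0, 190.0)

hot_and_tall = fuzzy_and(hot, tall)
hot_or_tall = fuzzy_or(hot, tall)
not_hot = fuzzy_not(hot)
```

A temperature of 30 degrees is neither strictly "hot" nor "not hot" here; it is hot to degree 0.5, and the combined statements inherit intermediate truth values instead of being forced to True or False.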


NLP directory

Natural Language Software Registry
The Natural Language Software Registry (NLSR) is a concise summary of the capabilities and sources of a large amount of natural language processing (NLP) software available to the NLP community. It comprises academic, commercial and proprietary software with specifications and terms on which it can be acquired clearly indicated.

Natural Language Processing / Information Retrieval Software Repository
A very good list of corpora, grammars, lexicons, tools and libraries.

Statistical natural language processing and corpus-based computational linguistics: An annotated list of resources
A very good list of POS taggers, parsers and other tools.

OpenNLP
OpenNLP is an organizational center for open source projects related to natural language processing. Its primary role is to encourage and facilitate the collaboration of researchers and developers on such projects.

General Resources
Resources on
(1) - Corpora and Corpus Linguistics.
(2) - Multilingual and Parallel Corpora.
(3) - Electronic Literary Text Archives.
(4) - References, Standards & Educational Resources.
(5) - Tools.



People / Workgroups / Companies

Workgroups working on NLP

Microsoft Research
At Microsoft Research, we have an insatiable curiosity and the desire to create new technology that will help define the computing experience. Whether inspired by a suggestion from a customer or simply the search for a better way, we’re driven to innovate and push the state-of-the-art in computer science as far as our imaginations can reach. To that end, we collaborate with universities, submit papers for peer review, and partner with product groups to bring our research to you. Read on to discover what we’re doing to improve Microsoft products in the next two to ten years.

Google Labs
A partial list of papers written by people now at Google, showing the range of backgrounds of people in Google Engineering.

IBM Research - Computer Science - Natural Language Processing
Natural Language Processing at IBM is a dynamic research area spanning a wide range of topics vital for the development of cutting-edge applications of language engineering. Our mission is to offer speech and language technologies that form the core of current and future products and solutions for processing natural language. We work on theoretical issues of computational linguistics and develop technologies such as speech processing, machine translation, universal and application-specific dialog engines, information retrieval, text mining and hypertext databases, automatic text summarization, natural language understanding and generation, to mention just a few. One key goal is to provide advanced NLP software for multiple languages and modalities exploited in business applications. Another fundamental goal is to provide the sophisticated NLP technologies required to linguistically enable human-computer interfaces.

TALANA
Research activities focus on computational linguistics, and in particular:
- Text generation
- Semantics-syntax-prosody interactions
- Linguistic modeling and dependency
- Semantic modeling
Talana is headed by Laurence Danlos, Professor, Université Paris 7.

ATALA
ATALA has been dedicated since 1959 to the development of computational linguistics in France. ATALA takes part in the Technolangue portal.

GREYC
The Groupe de Recherche en Informatique, Image, Automatique et Instrumentation de Caen goes by the acronym GREYC. Here, the Y stands for the three I's (Informatique, Image, Instrumentation) and also, with a little imagination (!) and once turned upside down, the A of Automatique.

ISI - The University of Southern California
Natural Language Processing at USC/ISI. USC/ISI is an academic research institute that is part of USC's School of Engineering. Many ISI researchers are also on the faculty in computer science, and likewise, many CS graduate students do their dissertation research at ISI. In Business Week's recent survey of academic information technology research, USC ranked fifth overall, and ISI was called "a star" in its survey of international research institutions. ISI's Intelligent Systems Division is one of the largest university-based artificial intelligence groups in the world.


Companies working on NLP

Sinequa
Specialized in corporate applications, Sinequa focuses on providing means to find, understand and seamlessly use textual information through intelligent and intuitive access. Thanks to its expertise in natural language processing, the company has developed on top of its patent protected semantic technology a flagship product called Intuition. This platform is both a search engine with advanced linguistic functionalities enabling true understanding of the meaning of both queries and documents, and an access and navigation tool facilitating each user corporate-wide access to information truly critical for day to day work.

PERTIMM
Pertimm is a publisher and integrator of high-value-added information retrieval solutions.

SYSTRAN
SYSTRAN is the leading provider of the world's most scalable and modular translation architecture. Its core technology powers revolutionary translation solutions for the Internet, PCs and network infrastructures that facilitate communication in 36 language pairs and in 20 specialized domains.

Connexor
Connexor provides linguistic technologies and expertise to software houses and solution providers who tackle the challenge of how to derive useful information from unstructured digital text for different kinds of consumers and analysts.

ONTOLOGOS CORP
Changes in markets and technologies have profoundly transformed our societies, now readily described as "information societies," in which the new economic challenge is the mastery of knowledge and know-how. This mastery first requires building business terminologies for the enterprise that are consensual, consistent, shareable, and reusable; hence the need to introduce ontologies as representations of the meaning of terms for indexing, searching, routing, matching, and mapping information.

Synapse
Synapse Développement is a Toulouse-based software publisher founded in 1994. Its mission is to develop applications that integrate linguistic and artificial intelligence techniques applied to language processing, such as spelling and syntax correction, language analysis, translation, and natural language processing (NLP).

Softissimo
To help you understand or translate documents and web pages, for many language combinations and in a variety of domains, Softissimo offers a complete range of translation software: Reverso.

VirtuOz
VirtuOz puts artificial intelligence at the service of your customers and prospects. With the DialogServer and StudiOz software suite, VirtuOz finally brings conversational agent technology to businesses. Able to simulate human dialogue, our solutions make your web applications more human and more efficient: resolving customer problems, presenting products, communicating key messages, running marketing surveys...

MemoData
MemoData is the European leader in linguistic databases for natural language processing. Developed over 15 years, MemoData's databases cover six European languages. The French dictionary contains more than 185,000 word senses, corresponding to about 700,000 inflected forms. It covers 55,000 nouns, 25,000 verbs, 25,000 adjectives, and 10,000 adverbs. In addition, the database contains more than 280,000 ontological links (a florist sells flowers, an animal eats, a cat is an animal...).

Cognitive Relation
Cognitive Relation is the only software solution on the market that both reasons and writes. Cognitive Relation offers your enterprise the opportunity to:
reproduce any type of intelligent and recurring reasoning process, be it business-related or administrative;
with the same skill and quality of reply;
for any contact channel used.

TextAnalysis
Text Analysis International's flagship product is VisualText, a comprehensive integrated development environment for NLP. VT uses the NLP++ general programming language with specializations for natural language processing. VT integrates the Conceptual Grammar KBMS for ontology and semantics. Analyzers built with VT blend grammars, patterns, keyword, and statistical paradigms in a multi-pass framework.

Lingway
Lingway offers a text mining solution. Its technology consists of a multilingual natural language search engine, categorization and coding tools, software for generating an XML structure from textual documents, as well as information extraction and document visualization functions.

Basis Technology
Basis Technology provides software solutions for extracting meaningful intelligence from unstructured text in Asian, European and Middle Eastern languages.

