Saturday, May 10, 2008
Resources
  Contact  

To contact us or submit a new resource, please email us at webmaster@proxem.com.

    
  Resources  

Lexical level resources

WordNet

WordNet
WordNet® is an online lexical reference system whose design is inspired by current psycholinguistic theories of human lexical memory. English nouns, verbs, adjectives and adverbs are organized into synonym sets, each representing one underlying lexical concept. Different relations link the synonym sets.
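
As a rough illustration of the organization described above, the sketch below models synonym sets and a relation between them with plain Python data structures. The synset identifiers, word lists and the `hypernym` link are toy data chosen for illustration, not records taken from the WordNet database, and the names `Synset`, `synsets_for` and `related` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    """One synonym set: several word forms sharing one lexical concept."""
    ident: str
    pos: str
    words: list
    # relation name -> list of target synset identifiers
    relations: dict = field(default_factory=dict)


# Toy data for illustration only (not actual WordNet records).
SYNSETS = {
    "car.n": Synset("car.n", "noun", ["car", "auto", "automobile"],
                    {"hypernym": ["motor_vehicle.n"]}),
    "motor_vehicle.n": Synset("motor_vehicle.n", "noun",
                              ["motor vehicle", "automotive vehicle"]),
}


def synsets_for(word):
    """Return every synset that lists the given word form."""
    return [s for s in SYNSETS.values() if word in s.words]


def related(synset, relation):
    """Follow one named relation from a synset to its target synsets."""
    return [SYNSETS[i] for i in synset.relations.get(relation, [])]
```

Looking up `synsets_for("auto")` finds the `car.n` set, and `related(..., "hypernym")` follows its link to the more general set.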

Word Sense Disambiguation in a Slot Grammar Framework
This is a preliminary report on a system for word sense disambiguation (WSD) for unrestricted vocabulary, which requires no training on tagged text. Disambiguation is done to WordNet word senses. The “disambiguating power” of the system comes from three sources: (A) Parsing by English Slot Grammar (ESG), (B) the WordNet relation system, and (C) the WordNet sense frequency data.

Mapping of EuroWordnet Top Ontology to Upper Cyc Ontology
A mapping of EuroWordnet Top Ontology into Upper Cyc Ontology is presented. The mapping is expressed in terms of a CycL microtheory encoding of the EuroWordnet Top Ontology, because it cannot be made by means of equivalence and subsumption relations alone.

WordNet::Similarity
This is a CPAN module that implements a variety of semantic similarity measures that can be used in conjunction with WordNet. In particular, it supports the measures of Resnik, Lin, Jiang-Conrath, Leacock-Chodorow, Hirst-St.Onge, Wu-Palmer, Banerjee-Pedersen, and Patwardhan-Pedersen.

eXtended WordNet
The goal of this project is to develop a tool that takes as input the current or future versions of WordNet and automatically generates an eXtended WordNet that provides several important enhancements intended to remedy the present limitations of WordNet.

NameNet: a Self-Improving Resource for Name Classification
This paper presents a semantically structured resource of more than 1,600 Name Classes. This structure is based on the noun hyperonymy hierarchies in WordNet, expanded and validated by corpus evidence collected from the World Wide Web. The set of seed examples provided by WordNet is bootstrapped and then used to automatically construct an annotated training corpus for each Name Class. The resulting Named Entity resource enables a supervised Named Entity Recognizer to identify all the encoded Name Classes with high accuracy and without any human intervention.

Balkanet
The Balkan WordNet aims at the development of a multilingual lexical database comprising individual WordNets for the Balkan languages. The most ambitious feature of BalkaNet is its attempt to represent semantic relations between words in each Balkan language and link them together in order to develop an online multilingual semantic network. The main objective is the development of each language's WordNet from available resources covering the general vocabulary of that language. Semantic relations will be classified in the independent WordNets according to a shared ontology. Then, all individual WordNets will be organized into a common database providing links across them. Each of the WordNets will be structured along the same lines as EuroWordNet through a WordNet Management System. This project is an excellent opportunity to explore the less studied Balkan languages and to combine and compare them cross-linguistically.

WordNet.Net
WordNet.Net library - the .Net Framework library for WordNet.

WordNet-based semantic similarity measurement
Semantic similarity is a confidence score that reflects the semantic relation between the meanings of two sentences. It is difficult to gain a high accuracy score because the exact semantic meanings are completely understood only in a particular context.


MindNet

MindNet
MindNet is a knowledge representation project that uses our broad-coverage parser to build semantic networks from dictionaries, encyclopedias, and free text. MindNets are produced by a fully automatic process that takes the input text, sentence-breaks it, parses each sentence to build a semantic dependency graph (Logical Form), aggregates these individual graphs into a single large graph, and then assigns probabilistic weights to subgraphs based on their frequency in the corpus as a whole. The project also encompasses a number of mechanisms for searching, sorting, and measuring the similarity of paths in a MindNet. We believe that automatic procedures such as MindNets provide the only credible prospect for acquiring world knowledge on the scale needed to support common-sense reasoning.
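
The aggregation and weighting steps described above can be sketched roughly as follows. This is a simplified illustration, not MindNet's actual procedure: the per-sentence graphs are hand-written stand-ins for parser output, weights are assigned to single edges rather than to larger subgraphs, and plain relative frequency is assumed as the weighting scheme. All names and relation labels are hypothetical.

```python
from collections import Counter

# Hand-written stand-ins for per-sentence dependency graphs, each given as
# a list of (head, relation, dependent) edges. Not real parser output.
SENTENCE_GRAPHS = [
    [("drive", "Subj", "person"), ("drive", "Obj", "car")],
    [("drive", "Subj", "person"), ("drive", "Obj", "truck")],
    [("car", "Hyp", "vehicle")],
]


def aggregate(graphs):
    """Merge the individual graphs into one graph with edge counts."""
    counts = Counter()
    for graph in graphs:
        counts.update(graph)
    return counts


def edge_weights(counts):
    """Weight each edge by its relative frequency in the whole corpus."""
    total = sum(counts.values())
    return {edge: n / total for edge, n in counts.items()}
```

Edges seen in several sentences, such as the subject edge of "drive" here, end up with a higher weight than edges seen only once.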


Antonymy resources

Vecteurs conceptuels et fonctions lexicales : application à l'antonymie (Conceptual vectors and lexical functions: application to antonymy)
This thesis deals with representing the thematic aspect of textual segments (documents, paragraphs, phrases, etc.). We rely on a mixed approach (symbolic and vector-based) that aims to combine the information deducible from syntactic structures with the information drawn from lexical-semantic representations. Certain syntactic forms indirectly carry meaning and can, in general, be modeled using Meaning-Text Theory and lexical functions. Negation, which is very frequent in texts, makes it possible, among other things, to avoid repetition or to produce statements whose form is not lexically attested, such as the phrases "il n'est pas sérieux" ("he is not serious") and "il n'est pas aimable" ("he is not kind"). The words "sérieux" and "aimable" have no well-attested opposites; the terms "léger" ("frivolous") and "désagréable" ("unpleasant") are approximations at best. Negation does not always mean the opposite of an affirmation, as in the sentence "elle n'est pas belle, elle est superbe" ("she is not beautiful, she is gorgeous"). By contrast, in "il n'est pas mort" ("he is not dead"), negation expresses, a priori, the opposite idea "il est vivant" ("he is alive"), subject however to the problems of polysemy and figurative senses: "vivant" can also be used in the sense of cheerful or lively.

Antonymy and Semantic Range in English
This dissertation investigates what makes two words antonyms. Previous research has not adequately explained why some words seem to contrast in meaning but are still not considered antonyms (e.g. large and little) nor can it explain why some words have two antonyms (e.g., happy/sad and happy/unhappy). An explanation is given here using the notion of "semantic range" (a description of a word's typical collocation patterns); antonyms are shown to be words which have a great deal of semantic range in common.


Similarity measure between words

WordNet::Similarity
This is a CPAN module that implements a variety of semantic similarity measures that can be used in conjunction with WordNet. In particular, it supports the measures of Resnik, Lin, Jiang-Conrath, Leacock-Chodorow, Hirst-St.Onge, Wu-Palmer, Banerjee-Pedersen, and Patwardhan-Pedersen.
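
To give a feel for what such a measure computes, here is a minimal Python sketch of one common formulation of the Wu-Palmer measure, 2 * depth(lcs) / (depth(a) + depth(b)), where lcs is the deepest concept that subsumes both inputs. It does not use the Perl module or its API: the taxonomy is a toy tree invented for illustration, the root is assumed to have depth 1, and the function names are hypothetical.

```python
# Toy is-a taxonomy (child -> parent); invented for illustration only.
PARENT = {
    "entity": None,
    "animal": "entity",
    "dog": "animal",
    "cat": "animal",
    "vehicle": "entity",
    "car": "vehicle",
}


def path_to_root(node):
    """Return the chain of concepts from node up to the root."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def depth(node):
    """Depth counted in nodes, so the root has depth 1 (an assumption)."""
    return len(path_to_root(node))


def lowest_common_subsumer(a, b):
    """Deepest concept that is an ancestor of (or equal to) both inputs."""
    ancestors_a = set(path_to_root(a))
    for node in path_to_root(b):
        if node in ancestors_a:
            return node
    raise ValueError("concepts share no ancestor")


def wu_palmer(a, b):
    """Wu-Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    lcs = lowest_common_subsumer(a, b)
    return 2 * depth(lcs) / (depth(a) + depth(b))
```

In this toy tree, "dog" and "cat" share the subsumer "animal" and score higher than "dog" and "car", which only share the root.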


Other resources on lexical level

HyperLex algorithm for automatic discrimination of word uses in a textual database
Jean Véronis describes the HyperLex algorithm for automatic discrimination of word uses in a textual database. The algorithm does not require a dictionary. It detects high density components in the word-cooccurrence graph, and, contrary to previous methods (word vectors), enables the recognition of very low frequency uses. HyperLex is associated with a graphic representation technique that makes it possible to navigate through the lexicon and explore visually the various themes corresponding to the discriminated uses.

COMLEX
COMLEX Syntax is a monolingual English Dictionary consisting of 38,000 head words intended for use in natural language processing.



Ontologies

SUMO

SUMO
The Suggested Upper Merged Ontology (SUMO) and its domain ontologies form the largest formal public ontology in existence today. They are being used for research and applications in search, linguistics and reasoning. SUMO is the only formal ontology that has been mapped to all of the WordNet lexicon. SUMO is written in the SUO-KIF language. SUMO is free and owned by the IEEE.

KIF
Knowledge Interchange Format (KIF) is a computer-oriented language for the interchange of knowledge among disparate programs. It has declarative semantics (i.e. the meaning of expressions in the representation can be understood without appeal to an interpreter for manipulating those expressions); it is logically comprehensive (i.e. it provides for the expression of arbitrary sentences in the first-order predicate calculus); it provides for the representation of knowledge about the representation of knowledge; it provides for the representation of nonmonotonic reasoning rules; and it provides for the definition of objects, functions, and relations.


Other resources on ontologies

Sites Relevant to Ontologies and Knowledge Sharing
A list of resources on Ontologies and Knowledge Sharing.

John Bateman's ontology portal
This page is a collection of starting points for information on ontologies gathered together for ease of reference for our own ontology-related projects.

Fine-Grained Proper Noun Ontologies for Question Answering
The WordNet lexical ontology, which is primarily composed of common nouns, has been widely used in retrieval tasks. Here, we explore the notion of a fine-grained proper noun ontology and argue for the utility of such an ontology in retrieval tasks. To support this claim, we build a fine-grained proper noun ontology from unrestricted news text and use this ontology to improve performance on a question answering task.

An introduction to Ontology by John F. Sowa
Ontology is the study of existence. An ontology is a system of categories for classifying and talking about the things that are assumed to exist. This directory contains a summary of the ontology developed and used in the KR book by John Sowa.

KBS / Ontology Projects Worldwide
Some ongoing KBS/Ontology projects and groups.

OWL Web Ontology Language Reference
The Web Ontology Language OWL is a semantic markup language for publishing and sharing ontologies on the World Wide Web. OWL is developed as a vocabulary extension of RDF (the Resource Description Framework) and is derived from the DAML+OIL Web Ontology Language.


Gellish

Gellish - A Generic Extensible Ontological Language
A Generic Extensible Ontological Language

Design and Application of a Universal Data Structure
Thesis describing Gellish. The problem statement of this research is the question whether it is possible to provide a formal generic artificial language for an unambiguous description of reality, that is based on natural language, is defined in a formal ontology, and is practically applicable, at least for technical artifacts such that it is suitable to express and exchange information in the form of electronic data in a structure that is system and natural language independent.


Named Entity Hierarchies

Sekine's Extended Named Entity Hierarchy
The Extended Named Entity Hierarchy is designed and developed to meet increasing needs for a wider range of NE types. It originates from the first Named Entity set defined by MUC (Grishman et al., 1996), the Named Entity set developed by IREX (Sekine et al., 2000), and the Extended Named Entity hierarchy, which contains approximately 150 NE types (Sekine et al., 2002). It has now been extended again to 200 NE types. The applications include Question Answering (Q&A) systems that analyze general texts such as newspaper articles, as well as Information Extraction (IE), Machine Translation (MT), Summarization and Information Retrieval (IR) systems that serve a variety of NLP applications. We designed the Extended Named Entity Hierarchy for Q&A and IE systems, assuming that the information one wants to know is basically in the form of a noun phrase with specific names, time expressions or numerical values.

NameNet: a Self-Improving Resource for Name Classification
This paper presents a semantically structured resource of more than 1,600 Name Classes. This structure is based on the noun hyperonymy hierarchies in WordNet, expanded and validated by corpus evidence collected from the World Wide Web. The set of seed examples provided by WordNet is bootstrapped and then used to automatically construct an annotated training corpus for each Name Class. The resulting Named Entity resource enables a supervised Named Entity Recognizer to identify all the encoded Name Classes with high accuracy and without any human intervention.



POS tagger

Eric Brill's Tagger

Eric Brill's trainable rule-based part of speech tagger
The NLP programs that you can download: a supervised part of speech tagger, an unsupervised part of speech tagger, and a prepositional phrase attachment program. This tagger is based on transformation-based error-driven learning, a technique that has been effective in a number of natural language applications, including part of speech and word sense tagging, prepositional phrase attachment, and syntactic parsing.
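
The core idea of transformation-based error-driven learning can be sketched as follows: start from an initial tagging, then repeatedly pick the rewrite rule that removes the most errors against a hand-tagged reference. The sketch shows a single selection step with one rule template ("change tag A to B when the previous tag is Z"). It is an illustration under those assumptions, not the code of the tagger itself; the names `Rule`, `apply_rule` and `best_rule` are hypothetical.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    from_tag: str
    to_tag: str
    prev_tag: str  # context: tag of the preceding token


def apply_rule(tags, rule):
    """Rewrite from_tag to to_tag wherever the previous tag matches."""
    out = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == rule.from_tag and tags[i - 1] == rule.prev_tag:
            out[i] = rule.to_tag
    return out


def errors(tags, gold):
    """Count positions where the current tagging disagrees with the reference."""
    return sum(t != g for t, g in zip(tags, gold))


def best_rule(tags, gold, candidates):
    """Pick the candidate (non-empty list) that lowers the error count most."""
    base = errors(tags, gold)
    scored = [(base - errors(apply_rule(tags, r), gold), r) for r in candidates]
    gain, rule = max(scored, key=lambda pair: pair[0])
    return rule if gain > 0 else None


# "They want to race": the initial tagger wrongly tags "race" as a noun.
initial = ["PRP", "VBP", "TO", "NN"]
gold = ["PRP", "VBP", "TO", "VB"]
candidates = [Rule("NN", "VB", "DT"), Rule("NN", "VB", "TO")]
```

Here `best_rule(initial, gold, candidates)` selects the rule that retags a noun as a verb after "to"; a full learner would apply it and repeat until no rule yields a gain.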


TreeTagger - a language independent part-of-speech tagger

TreeTagger
The TreeTagger is a tool for annotating text with part-of-speech and lemma information which has been developed within the TC project at the Institute for Computational Linguistics of the University of Stuttgart. The TreeTagger has been successfully used to tag German, English, French, Italian, Greek and Old French texts and is easily adaptable to other languages if a lexicon and a manually tagged training corpus are available.


SVMTagger

SVMTool
The SVMTool is a simple and effective generator of sequential taggers based on Support Vector Machines. We have applied the SVMTool to the problem of part-of-speech tagging. By means of a rigorous experimental evaluation, we conclude that the proposed SVM-based tagger is robust and flexible for feature modelling (including lexicalization), trains efficiently with almost no parameters to tune, and is able to tag thousands of words per second, which makes it really practical for real NLP applications. Regarding accuracy, the SVM-based tagger significantly outperforms the TnT tagger under exactly the same conditions, and achieves a very competitive accuracy of 97.2% for English on the Wall Street Journal corpus, which is comparable to the best taggers reported to date.


Stanford Tagger

Stanford Log-linear POS Tagger download
This is a Java implementation of the log-linear part-of-speech (POS) taggers.


SS Tagger

SS Tagger - a part-of-speech tagger for English
Tagging speed is crucial in large-scale information extraction and real-time NLP applications. This part-of-speech (POS) tagger offers fast tagging (2400 tokens/sec) with state-of-the-art accuracy (97.10% on the WSJ corpus). The tagger uses an extension of Maximum Entropy Markov Models (MEMM), in which tags are determined in an easiest-first manner.



Chunkers

CASS chunker

CASS chunker
Cass. A fast, robust partial parser developed by Steven Paul Abney. CASS is a partial parser designed for use with large amounts of noisy text.


Ramshaw and Marcus BaseNP chunker

Noun Phrase Chunker
This application is a Java implementation of the Ramshaw and Marcus BaseNP chunker (in fact the files in the resources directory are taken straight from their original distribution), which attempts to insert brackets marking noun phrases in text that has been marked with POS tags in the same format as the output of Eric Brill's transformational tagger. The output from this version should be identical to the output of the original C++/Perl version released by Ramshaw and Marcus. A wrapper is also included which allows the easy use of this chunker within the GATE framework.
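
To make the input and output of such a chunker concrete, the sketch below brackets noun phrases in `word/TAG` text. It deliberately swaps in a much simpler technique than the learned chunker described above: a fixed heuristic that groups maximal runs of determiner, adjective and noun tags. The tag set, the bracket layout and the function name are assumptions made for illustration.

```python
# Penn Treebank style tags treated as "inside a base NP" (a crude assumption).
NP_TAGS = {"DT", "JJ", "NN", "NNS", "NNP", "NNPS", "PRP$"}


def bracket_base_nps(tagged):
    """Insert [ and ] around maximal runs of NP-like tags in word/TAG text."""
    out, in_np = [], False
    for token in tagged.split():
        tag = token.rsplit("/", 1)[-1]
        if tag in NP_TAGS and not in_np:
            out.append("[")
            in_np = True
        elif tag not in NP_TAGS and in_np:
            out.append("]")
            in_np = False
        out.append(token)
    if in_np:  # close a phrase that runs to the end of the input
        out.append("]")
    return " ".join(out)
```

A learned chunker would handle cases this heuristic gets wrong, such as adjacent noun phrases that should be split.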


YamCha

YamCha: Yet Another Multipurpose CHunk Annotator
YamCha is a generic, customizable, and open source text chunker suited to many NLP tasks, such as POS tagging, Named Entity Recognition, base NP chunking, and Text Chunking. YamCha uses a state-of-the-art machine learning algorithm called Support Vector Machines (SVMs), first introduced by Vapnik in 1995.


Papers about chunking

A Divide-and-Conquer Strategy for Parsing
In this paper, we propose a novel strategy which is designed to enhance the accuracy of the parser by simplifying complex sentences before parsing. This approach involves the separate parsing of the constituent sub-sentences within a complex sentence. To achieve that, the divide-and-conquer strategy first disambiguates the roles of the link words in the sentence and segments the sentence based on these roles. The separate parse trees of the segmented sub-sentences and the noun phrases within them are then synthesized to form the final parse. To evaluate the effects of this strategy on parsing, we compare the original performance of a dependency parser with the performance when it is enhanced with the divide-and-conquer strategy. When tested on 600 sentences of the IPSM'95 data sets, the enhanced parser showed a considerable error reduction of 21.2%.



Syntactic level resources

Stanford Parser

Stanford Parser
Statistical parsers use knowledge of language gained from hand-parsed sentences to try to produce the correct analysis of new sentences. These parsers still make some mistakes, but commonly work rather well. Their development was one of the biggest breakthroughs in natural language processing in the last decade. This package is a Java implementation of probabilistic natural language parsers, both highly optimized PCFG and dependency parsers, and a lexicalized PCFG parser.


Link Grammar Parser

Link Grammar Parser
The Link Grammar Parser is a syntactic parser of English, based on link grammar, an original theory of English syntax. Given a sentence, the system assigns to it a syntactic structure, which consists of a set of labeled links connecting pairs of words. The parser also produces a "constituent" representation of a sentence (showing noun phrases, verb phrases, etc.).


Probabilistic Dependency Parser for English

A low-complexity, broad-coverage probabilistic Dependency Parser for English
Large-scale parsing is still a complex and time-consuming process, often so much so that it is infeasible in real-world applications. The parsing system described here addresses this problem by combining finite-state approaches, statistical parsing techniques and engineering knowledge, thus keeping parsing complexity as low as possible at the cost of a slight decrease in performance. The parser is robust and fast and at the same time based on strong linguistic foundations.


MINIPAR

MINIPAR
MINIPAR is a broad-coverage parser for the English language. An evaluation with the SUSANNE corpus shows that MINIPAR achieves about 88% precision and 80% recall with respect to dependency relationships. MINIPAR is very efficient: on a Pentium II 300 with 128MB memory, it parses about 300 words per second.
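
For readers unfamiliar with how precision and recall over dependency relationships are computed, here is a minimal sketch. It assumes each dependency is represented as a (head, relation, dependent) triple and that a predicted dependency counts as correct only if it matches a reference triple exactly; the actual evaluation protocol used with SUSANNE may differ. The function name and the sample triples are hypothetical.

```python
def precision_recall(predicted, gold):
    """Precision and recall of predicted dependency triples against a reference."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Toy triples for one sentence, invented for illustration.
gold = [
    ("saw", "subj", "John"), ("saw", "obj", "dog"), ("dog", "det", "the"),
    ("saw", "mod", "yesterday"), ("dog", "mod", "big"),
]
predicted = [
    ("saw", "subj", "John"), ("saw", "obj", "dog"), ("dog", "det", "the"),
    ("saw", "obj", "yesterday"),
]
```

With these toy sets, three of the four predicted dependencies are correct and three of the five reference dependencies are found.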


HPSG (Head-Driven Phrase Structure Grammar)

HPSG
This page provides information about Head-Driven Phrase Structure Grammar (HPSG) related activities at the Center for the Study of Language and Information (CSLI) at Stanford University, and pointers to other resources on the web.


RRG (Role and Reference Grammar)

RRG
Role and Reference Grammar [RRG] (Van Valin 1993) takes language to be a system of communicative social action, and accordingly, analyzing the communicative functions of grammatical structures plays a vital role in grammatical description and theory from this perspective.


AGFL

AGFL
The goal of the AGFL-project (Affix Grammars over a Finite Lattice) is the development of a technology for Natural Language Processing available in the public domain.
The AGFL formalism for the syntactic description of Natural Languages has been developed by the Computer Science Department of the Radboud University of Nijmegen. It is a formalism in which large context free grammars can be described in a compact way. AGFLs belong to the family of two level grammars, along with attribute grammars: a first, context-free level is augmented with set-valued features for expressing agreement between parts of speech.
The AGFL parser generation system for Natural Languages generates efficient parsers from AGFL grammars. It includes a lexicon system suitable for the large lexica needed in real-life NLP applications.


XTAG Project

XTAG
XTAG is an on-going project to develop a wide-coverage grammar for English using a lexicalized Tree Adjoining Grammar (TAG) formalism. XTAG also serves as a system for the development of TAGs and consists of a parser, an X-windows grammar development interface and a morphological analyzer.


Other resources on syntactic level

GRAC: GRAmmar Checker
GRAC is a GRAmmar Checker written in Python. It is based on learning algorithms: it needs a tagged text corpus and a mistake-free corpus to learn grammar rules, and then detects grammar errors in a sentence or text you supply.

Terminologie : des Théories Linguistiques à l'Extraction Automatique (Terminology: from Linguistic Theories to Automatic Extraction)
How can the automatic extraction of terminology from a text corpus be improved? What can linguistic theories contribute in this area?



Semantic level resources

eXtended WordNet

eXtended WordNet
The goal of this project is to develop a tool that takes as input the current or future versions of WordNet and automatically generates an eXtended WordNet that provides several important enhancements intended to remedy the present limitations of WordNet.


Conceptual graphs

An introduction to Conceptual graphs by John Sowa
Conceptual graphs (CGs) are a system of logic based on the existential graphs of Charles Sanders Peirce and the semantic networks of artificial intelligence. They express meaning in a form that is logically precise, humanly readable, and computationally tractable. With a direct mapping to language, conceptual graphs serve as an intermediate language for translating computer-oriented formalisms to and from natural languages. With their graphic representation, they serve as a readable, but formal design and specification language. CGs have been implemented in a variety of projects for information retrieval, database design, expert systems, and natural language processing.


Deep understanding

Story understanding resources
A list of resources on story understanding

Story understanding through multi-representation model construction
We present an implemented model of story understanding and apply it to the understanding of a children’s story. We argue that understanding a story consists of building multirepresentation models of the story and that story models are efficiently constructed using a satisfiability solver. We present a computer program that contains multiple representations of commonsense knowledge, takes a narrative as input, transforms the narrative and representations of commonsense knowledge into a satisfiability problem, runs a satisfiability solver, and produces models of the story as output. The narrative, models, and representations are expressed in the language of Shanahan’s event calculus.

Understanding script-based stories using commonsense reasoning
This paper investigates the use of commonsense reasoning to understand texts involving stereotypical activities or scripts. We present a system that understands news stories involving four terrorism scripts. The system (1) builds a commonsense reasoning problem given an information extraction template representing a terrorist incident, and (2) uses commonsense reasoning and a commonsense knowledge base to build a model of the terrorist incident. The reasoning problem, commonsense knowledge base, and model are expressed in the classical logic event calculus. The system was developed using the MUC3 and MUC4 development data set. We present the results of running the system on the MUC3 and MUC4 test data sets, using manually generated answer key templates and templates generated automatically by two MUC4 information extraction systems. We present a detailed analysis of the models produced by the system given automatically generated templates. We present methods for answering questions based on the models produced by our system. We assess the portability of the system by extending it to handle 10 scripts frequent in Project Gutenberg American literature texts.

Prospects for in-depth story understanding by computer (Erik T. Mueller - November 29, 1999)
While much research on the hard problem of in-depth story understanding by computer was performed starting in the 1970s, interest shifted in the 1990s to information extraction and word sense disambiguation. Now that a degree of success has been achieved on these easier problems, I propose it is time to return to in-depth story understanding. In this paper I examine the shift away from story understanding, discuss some of the major problems in building a story understanding system, present some possible solutions involving a set of interacting understanding agents, and provide pointers to useful tools and resources for building story understanding systems.

The Plots of Children and Machines: The Statistical and Symbolic Semantic Analysis of Narratives
This thesis presents a method of automatic plot analysis of narrative texts that uses both components of traditional symbolic analysis of natural language and statistical machine-learning. In particular, we are investigating the story rewriting task. In the story rewriting task, an exemplar story is read to the pupils and the pupils rewrite the story in StoryStation, which allows them to concentrate more on diction and grammar than on content creation. However, often in the process of content creation the pupil improperly recalls the story. Our method of automatic plot analysis should allow the tutoring system to automatically analyze the plot of the story and provide relevant feedback to both the pupil and teacher.
(Harry Reeves Halpin, Master of Science - School of Informatics - University of Edinburgh, 2003)


Natural Semantic Metalanguage (NSM)

The Natural Semantic Metalanguage homepage
This site contains information and resources about the 'natural semantic metalanguage' (NSM) approach to semantic analysis, which can lay claim to being the most well-developed, comprehensive and practical approach to cross-cultural semantics on the contemporary scene.
The approach is based on evidence that there is a small core of basic, universal meanings, known as semantic primes, which can be found as words or other linguistic expressions in all languages. This common core of meaning can be used as a tool for linguistic and cultural analysis: to explicate complex and culture-specific words and grammatical constructions, and to articulate culture-specific values and attitudes (cultural scripts), in terms which are maximally clear and translatable. The theory also provides a semantic foundation for universal grammar and for linguistic typology. It has applications in intercultural communication, lexicography (dictionary making), language teaching, the study of child language acquisition, legal semantics, and other areas.
The main author is Anna Wierzbicka, who is the originator of the theory, but she has many colleagues and collaborators whose works are also listed here.

Semantics: Primes and Universals (Anna Wierzbicka)
Conceptual primitives and semantic universals are the cornerstones of a semantic theory which Anna Wierzbicka has been developing for many years. Semantics: Primes and Universals is a major synthesis of her work, presenting a full and systematic exposition of that theory in a non-technical and readable way. It delineates a full set of universal concepts, as they have emerged from large-scale investigations across a wide range of languages undertaken by the author and her colleagues. On the basis of empirical cross-linguistic studies it vindicates the old notion of the "psychic unity of mankind", while at the same time offering a framework for the rigorous description of different languages and cultures.

Definition of "Natural semantic metalanguage"
From Wikipedia, the free encyclopedia.



Pragmatic level resources

ConceptNet : A Very-Large Semantic Network of Common Sense Knowledge

ConceptNet
ConceptNet is a freely available commonsense knowledgebase and natural-language-processing toolkit which supports many practical textual-reasoning tasks over real-world documents right out-of-the-box (without additional statistical training).


Open Mind Commonsense

Open Mind Commonsense
Computers today are just plain dumb! The Open Mind Commonsense project is an attempt to make computers smarter by making it easy and fun for people all over the world to work together to give computers the millions of pieces of ordinary knowledge that constitute "common-sense", all those aspects of the world that we all understand so well we take them for granted. This repository of knowledge will enable us to create more intelligent and sociable software, build human-like robots, and better understand the structure of our own minds. We hope you will join us by registering below!

Resources
Various software modules and data sets that are/were used in Rada Mihalcea's research at University of North Texas.



Multi-level resources

FrameNet

FrameNet
The Berkeley FrameNet project is a lexicon-building effort in which we
(1) study words;
(2) describe the frames or conceptual structures which underlie these;
(3) examine sentences, using a very large corpus of contemporary English that contains these words;
and (4) record the ways in which information from the associated frames is expressed in these sentences.


VerbNet

VerbNet
Verbnet is a class-based verb lexicon.


PropBank frames

PropBank Frames
Scheme of rolesets for English verbs. A roleset, defined for a verb acting in one of its verb senses, specifies the arguments that a verb accepts, and defines which semantic role each argument is playing with respect to the verb.
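
A roleset as described above can be pictured as a small record that maps numbered arguments to role descriptions. The sketch below is modeled on the commonly cited `give.01` example; the role descriptions are paraphrased and should be treated as illustrative rather than as the text of the actual frame file. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Roleset:
    """One sense of a verb together with the arguments it accepts."""
    ident: str
    gloss: str
    roles: dict  # numbered argument -> role description


# Illustrative roleset, paraphrased for this example.
GIVE_01 = Roleset(
    ident="give.01",
    gloss="transfer",
    roles={"Arg0": "giver", "Arg1": "thing given", "Arg2": "entity given to"},
)


def label_arguments(roleset, arguments):
    """Attach role descriptions to the numbered arguments of one verb instance."""
    unknown = set(arguments) - set(roleset.roles)
    if unknown:
        raise KeyError(f"{roleset.ident} has no roles {sorted(unknown)}")
    return {roleset.roles[arg]: text for arg, text in arguments.items()}
```

For "Mary gave John a book", passing `{"Arg0": "Mary", "Arg1": "a book", "Arg2": "John"}` labels Mary as the giver and John as the entity given to.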


MIKROKOSMOS

MIKROKOSMOS
A full NLP project including an ontology and Text Meaning Representations (TMRs).


ThoughtTreasure

ThoughtTreasure
ThoughtTreasure is a commonsense knowledge base and architecture for natural language processing that uses multiple representations including logic, finite automata, grids, and scripts.


Ellogon

Ellogon Language Engineering Platform
NLP is an extremely exciting and useful task, but it can be really complicated, and building NLP systems can be absolutely, outrageously, often unaffordably expensive.
Ellogon is an effort that tries to keep all the excitement and reduce all the complexity... Ellogon is different from other similar software. First of all, it respects the user's time by offering a simple and user friendly graphical interface. But beneath this simple appearance a powerful engine is hidden, that has been proved to be able to support a wide range of uses, from simple research prototypes to commercial applications.
Ellogon is licensed under the GNU LGPL license, is easy to install and administer and is reliable. Running under all major operating systems, Ellogon offers a comfortable environment for computational linguists, language engineers or plain users.


GATE - General Architecture for Text Engineering

GATE
GATE is one of the most widely used human language processing systems in the world. It is a tool for:
scientists performing experiments that involve processing human language;
companies developing applications with language processing components;
teachers and students of courses about language and language computation.
GATE comprises an architecture, framework (or SDK) and graphical development environment, and has been built over the past eight years in the Sheffield NLP group. The system has been used for many language processing projects; in particular for Information Extraction in many languages. The system supports the full lifecycle of language processing components, from corpus collection and annotation through system evaluation. GATE is funded by the EPSRC and the EU.


McCullough Knowledge Explorer

McCullough Knowledge Explorer and the MKR language
MKR is a user-friendly RI ("Real Intelligence") language which combines the best features of English, UNIX shell, Unicon and CycL. MKR propositions have a terse English-like format which helps a human user focus on essential characteristics and avoid floating abstractions. MKR is a very-high-level knowledge representation language with a rigorous epistemological foundation including context, genus-differentia definitions, ECP hierarchies (knits) and a unique characterization of the changes associated with actions.


OpenCyc

OpenCYC
OpenCyc is the open source version of the Cyc technology, the world's largest and most complete general knowledge base and commonsense reasoning engine. Cycorp, the builders of Cyc, have set up an independent organization, OpenCyc.org, to disseminate and administer OpenCyc, and have committed to a pipeline through which all current and future Cyc technology will flow into ResearchCyc (available for R&D in academia and industry) and then OpenCyc.
NB: version 0.9, which can be downloaded from opencyc.org using BitTorrent, had a bug that kept OpenCyc from running on about half of the Windows machines that tried it. This bug is fixed in version 0.9.5, which can be downloaded from https://sourceforge.net/projects/opencyc.

Mapping of EuroWordnet Top Ontology to Upper Cyc Ontology
A mapping of EuroWordnet Top Ontology into Upper Cyc Ontology is presented. The mapping is expressed as a CycL microtheory encoding of the EuroWordnet Top Ontology, because it cannot be expressed by equivalence and subsumption relations alone.


KIM (Knowledge and Information Management) Platform

Ontotext KIM
The KIM Platform provides a novel Knowledge and Information Management (KIM) infrastructure and services for automatic semantic annotation, indexing, and retrieval of unstructured and semi-structured content. The most direct applications of KIM are:
Generation of meta-data for the Semantic Web, which allows hyper-linking and advanced visualization and navigation;
Knowledge Management, enhancing the efficiency of the existing indexing, retrieval, classification and filtering applications.
Ontotext is a Sirma laboratory for R&D related to knowledge representation, linguistics, and web services. We provide core technology with applications in Knowledge Management, Semantic Web, and integration. Ontotext is proven to be knowledgeable, reliable, and cost-effective in:
development of tools and solutions: knowledge management; language engineering; semantic web services; custom reasoning services;
ontology design, evaluation, and mapping: domain analysis and modelling; application-specific ontologies.
Our most popular product is the KIM platform for semantic annotation, indexing and retrieval.


Disambiguation

SensEval
There are now many computer programs for automatically determining the sense of a word in context (Word Sense Disambiguation or WSD). The purpose of Senseval is to evaluate the strengths and weaknesses of such programs with respect to different words, different varieties of language, and different languages.

Word Sense Disambiguation in a Slot Grammar Framework
This is a preliminary report on a system for word sense disambiguation (WSD) for unrestricted vocabulary, which requires no training on tagged text. Disambiguation is done to WordNet word senses. The “disambiguating power” of the system comes from three sources: (A) Parsing by English Slot Grammar (ESG), (B) the WordNet relation system, and (C) the WordNet sense frequency data.


MontyLingua

MontyLingua V.2.1 (Python and Java)
MontyLingua is a free, commonsense-enriched, end-to-end natural language understander for English. Feed raw English text into MontyLingua, and the output will be a semantic interpretation of that text. Perfect for information retrieval and extraction, request processing, and question answering. From English sentences, it extracts subject/verb/object tuples, extracts adjectives, noun phrases and verb phrases, and extracts people's names, places, events, dates and times, and other semantic information. MontyLingua makes traditionally difficult language processing tasks trivial!


NooJ

NooJ
NooJ is a linguistic development environment that includes large-coverage dictionaries and grammars, and parses corpora in real time. NooJ includes tools to create and maintain large-coverage lexical resources, as well as morphological and syntactic grammars. Dictionaries and grammars are applied to texts in order to locate morphological, lexical and syntactic patterns and tag simple and compound words. NooJ can build complex concordances, with respect to all types of Finite State and Context-Free patterns. NooJ users can easily develop extractors to identify semantic units in large texts, such as names of persons, locations, dates, technical expressions of finance, etc.
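
As a rough illustration of what a concordance is, the following Python sketch lists each match of a pattern together with its left and right context (a keyword-in-context listing). It is not NooJ's engine or API: the `concordance` function, the regular-expression pattern and the sample text are all hypothetical, and only simple regular patterns are covered.

```python
import re

def concordance(text, pattern, width=20):
    """Return (left context, match, right context) triples for each match."""
    rows = []
    for m in re.finditer(pattern, text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        rows.append((left, m.group(0), right))
    return rows

# Made-up sample text, used only for illustration.
sample = "The bank raised rates. A river bank flooded. The bank closed early."
for left, hit, right in concordance(sample, r"\bbank\b", width=12):
    print(f"{left:>12} | {hit} | {right}")
```

Aligning the matches in a central column makes it easy to compare the contexts in which a word or pattern occurs.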


Thesis & Papers

A syntax / semantic interface using broad-coverage resources in English
In Natural Language Processing, we must compute a semantic representation of a text before “understanding” it. We describe here how to pass from a syntactic structure (generated by a syntactic parser of English) to a semantic form (in the form of predicates and relations between these predicates).
Our approach is based on the interoperability between several resources, covering syntactical (Link Grammar Parser), lexical (WordNet) and semantic (VerbNet) aspects of English. The joint use of these broad-coverage resources leads to encouraging results on lexical and syntactical disambiguation. That also makes it possible to assign a “semantic probability” to each interpretation of a sentence.
(MSc dissertation of François-Régis Chaumartin)

A Practical Semantic Representation For Natural Language Parsing
This thesis deals with the problem of building fast, accurate and portable parsers for natural language understanding. Our focus is a multi-domain dialogue system in which we need a deep linguistically-motivated parser to produce the representations of the input suitable for reasoning. In this dissertation, we are concerned with building parsers which have the wide coverage and portability offered by a general syntactic grammar without sacrificing parsing speed and accuracy.

Synchronisation des connaissances syntaxiques et sémantiques pour l'analyse d'énoncés en langage naturel à l'aide des grammaires d'arbres adjoints lexicalisées (Synchronizing syntactic and semantic knowledge for the analysis of natural language utterances using Lexicalized Tree Adjoining Grammars) - Djamé Seddah
An interface between syntax and semantics aims to propose a logical formalization of the relations between the parts of a sentence. This thesis is a proposal based upon the analysis of problematic linguistic phenomena in the Lexicalized Tree Adjoining Grammars (LTAG) framework. LTAG is a linguistic formalism which provides two structures of representation, the derived tree and the derivation tree. The latter is an almost perfect structure to be used as a canvas for semantic analysis. However, the derivation tree cannot represent coindexations in an autonomous way. We based our proposition upon the study of linguistic phenomena induced by control verbs. In order to allow their treatment and their complete formalization, we modify the initial LTAG formalism by introducing a new piece of lexical information: the control canvas. Its purpose is to integrate inference of missing argumental links into a synchronous course of derived trees and derivation trees via a shared forest. We propose a dynamic reconstruction algorithm based on inference rules. These rules are executed during the derivation tree extraction process from the shared forest. As we use tabular techniques, we can extract, into a dependency graph, all the argumental relations described by one shared forest.



Knowledge management

KIF (Knowledge Interchange Format)

KIF
Knowledge Interchange Format (KIF) is a computer-oriented language for the interchange of knowledge among disparate programs. It has declarative semantics (i.e. the meaning of expressions in the representation can be understood without appeal to an interpreter for manipulating those expressions); it is logically comprehensive (i.e. it provides for the expression of arbitrary sentences in the first-order predicate calculus); it provides for the representation of knowledge about the representation of knowledge; it provides for the representation of nonmonotonic reasoning rules; and it provides for the definition of objects, functions, and relations.
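
To make the idea of a declarative, first-order notation more concrete, here is a minimal Python sketch that reads a KIF-style prefix expression into nested lists. The sentence is an illustrative example, and the `tokenize` and `parse` helpers are hypothetical: they handle only parentheses and bare symbols, not the full KIF syntax, and they assume balanced input.

```python
def tokenize(source):
    """Split a parenthesized prefix expression into tokens."""
    return source.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens):
    """Build nested lists from a token list (consumes the list in place)."""
    token = tokens.pop(0)
    if token == "(":
        expr = []
        while tokens[0] != ")":
            expr.append(parse(tokens))
        tokens.pop(0)  # drop the closing parenthesis
        return expr
    return token

# Illustrative sentence: "every human is mortal".
sentence = "(forall (?x) (=> (human ?x) (mortal ?x)))"
tree = parse(tokenize(sentence))
print(tree)
```

Because the meaning sits in the expression itself, any program that can read this structure can work with the sentence without calling a particular interpreter.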

SUMO
The Suggested Upper Merged Ontology (SUMO) and its domain ontologies form the largest formal public ontology in existence today. They are being used for research and applications in search, linguistics and reasoning. SUMO is the only formal ontology that has been mapped to all of the WordNet lexicon. SUMO is written in the SUO-KIF language. SUMO is free and owned by the IEEE.


GKB / GFP

GKB-Editor (Generic Knowledge Base Editor)
The GKB-Editor (Generic Knowledge Base Editor) is a tool for graphically browsing and editing knowledge bases across multiple Frame Representation Systems (FRSs) in a uniform manner. It offers an intuitive user interface, in which objects and data items are represented as nodes in a graph, with the relationships between them forming the edges. Users edit a KB through direct pictorial manipulation, using a mouse or pen. A sophisticated incremental browsing facility allows the user to selectively display only that region of a KB that is currently of interest, even as that region changes.

Generic Frame Protocol (GFP)
The Generic Frame Protocol (GFP), jointly developed at SRI International and Stanford University, provides a set of functions that support a generic interface to underlying frame representation systems (FRSs). The interface layer allows an application some independence from the idiosyncrasies of specific FRS software and enables the development of generic tools that operate on many FRSs.
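
The idea of a generic interface layer can be sketched in Python as follows. This is only an illustration under stated assumptions: the class and method names (`FrameStore`, `get_slot_values`, `put_slot_value`) and the two toy backends are hypothetical and are not the actual GFP function set.

```python
from abc import ABC, abstractmethod

class FrameStore(ABC):
    """Hypothetical generic interface; not the actual GFP function set."""

    @abstractmethod
    def get_slot_values(self, frame, slot): ...

    @abstractmethod
    def put_slot_value(self, frame, slot, value): ...

class DictStore(FrameStore):
    """Toy backend that keeps frames as nested dictionaries."""

    def __init__(self):
        self._frames = {}

    def get_slot_values(self, frame, slot):
        return list(self._frames.get(frame, {}).get(slot, []))

    def put_slot_value(self, frame, slot, value):
        self._frames.setdefault(frame, {}).setdefault(slot, []).append(value)

class TripleStore(FrameStore):
    """Toy backend that keeps (frame, slot, value) triples in a flat list."""

    def __init__(self):
        self._triples = []

    def get_slot_values(self, frame, slot):
        return [v for f, s, v in self._triples if (f, s) == (frame, slot)]

    def put_slot_value(self, frame, slot, value):
        self._triples.append((frame, slot, value))

def describe(store, frame, slot):
    """A generic tool: works with any backend that implements FrameStore."""
    return f"{frame}.{slot} = {store.get_slot_values(frame, slot)}"

for backend in (DictStore(), TripleStore()):
    backend.put_slot_value("Canary", "color", "yellow")
    print(describe(backend, "Canary", "color"))
```

The `describe` function never looks at how a backend stores its data, which is the kind of independence from a specific FRS that the description above refers to.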


Other resources on Knowledge Management

Knowledge Representation - Lecture Listing
Knowledge representation - course by Chris Thornton

Sites Relevant to Ontologies and Knowledge Sharing
A list of resources on Ontologies and Knowledge Sharing.



Question Answering / Information Extraction

Webclopedia

Webclopedia
Webclopedia - Targeted Delivery of Multilingual Information
Webclopedia is a multilingual natural-language processing research project that aims to answer questions based on texts available on the Web or other text collections.


AnswerFinder

AnswerFinder
This is a general-purpose open-domain question answering system (written in Java) that draws its answers from the Internet.


Search engine API

Google Web APIs
With the Google Web APIs service, software developers can query billions of web pages directly from their own computer programs. Google uses the SOAP and WSDL standards so a developer can program in his or her favorite environment - such as Java, Perl, or Visual Studio .NET.

Yahoo! Search Web Services
Yahoo! Search Web Services allow you to access Yahoo content and services in your favorite programming languages. This means you can now build Yahoo directly into your own applications.

MSN Web Search SDK
The MSN Search SDK provides documentation that describes the core concepts, requirements, development guidelines, and class library for the MSN Search Web Service. The SDK also contains sample code that demonstrates application development techniques using the MSN Search Web Service.


Advanced Web search engine

AnswerBus
AnswerBus is an open-domain question answering (QA) system based on intelligent information retrieval. It accepts your questions in natural language and extracts answers from the Web. Currently, you can ask questions in English, German, French, Italian, Spanish and Portuguese.

KartOO
KartOO is a meta search engine that presents its results as a map. The sites found are represented by pages that are larger or smaller depending on their relevance. Between these sites appear themes, which you can click to refine your search.

WebClust
WebClust is a meta search engine based on a technology called "Document clustering": the automatic organization of documents into meaningful groups. WebClust queries one or more web search engines, parses their result pages to extract the documents (titles, URLs, and short descriptions) and groups the documents based on this information. This process presents the best results of the web in a "horizontal" topical arrangement in addition to a single vertical list.
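
The grouping step can be illustrated with a small Python sketch. This is not WebClust's actual algorithm: it assumes the search results have already been fetched and parsed, uses made-up sample results, and swaps in a simple greedy grouping based on Jaccard similarity of word sets. The names `terms`, `jaccard` and `cluster` are hypothetical.

```python
import re

STOPWORDS = {"a", "an", "and", "the", "of", "for", "in", "to", "on", "with"}

def terms(text):
    """Lowercased word set of a result's title and description, minus stopwords."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster(results, threshold=0.2):
    """Greedy single pass: join the first similar-enough group, else start a new one."""
    clusters = []  # each cluster holds its pooled terms and its documents
    for doc in results:
        doc_terms = terms(doc["title"] + " " + doc["snippet"])
        for c in clusters:
            if jaccard(doc_terms, c["terms"]) >= threshold:
                c["docs"].append(doc)
                c["terms"] |= doc_terms
                break
        else:
            clusters.append({"terms": set(doc_terms), "docs": [doc]})
    return [c["docs"] for c in clusters]

# Made-up results standing in for parsed search engine output.
results = [
    {"url": "http://example.com/1", "title": "Jaguar cars",
     "snippet": "Luxury cars and sports cars from Jaguar"},
    {"url": "http://example.com/2", "title": "Jaguar animal",
     "snippet": "The jaguar is a big cat of the Americas"},
    {"url": "http://example.com/3", "title": "Jaguar car dealers",
     "snippet": "Find luxury Jaguar cars near you"},
    {"url": "http://example.com/4", "title": "Big cat facts",
     "snippet": "The jaguar is the largest cat in the Americas"},
]
groups = cluster(results)
for group in groups:
    print([doc["url"] for doc in group])
```

With this sample, the car-related results and the animal-related results end up in separate groups, which is the kind of topical arrangement the description refers to.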