From: www.itworld.com
March 29, 2001
Ontology: A formal, explicit specification of how to represent the objects, concepts, and other entities in a particular system, as well as the relationships between them.
Natural-language processing (NLP) is an area of artificial intelligence research that attempts to reproduce the human interpretation of language. NLP methodologies and techniques assume that the patterns in grammar and the conceptual relationships between words in language can be articulated scientifically. The ultimate goal of NLP is to determine a system of symbols, relations, and conceptual information that can be used by computer logic to implement artificial language interpretation.
Natural-language processing has its roots in semiotics, the study of signs. Semiotics was developed by Charles Sanders Peirce (a logician and philosopher) and Ferdinand de Saussure (a linguist). Semiotics is broken up into three branches: syntax, semantics, and pragmatics.
A complete natural-language processor extracts meaning from language on at least seven levels. However, we'll focus on the four main levels.
Morphological: A morpheme is the smallest part of a word that can carry a discrete meaning. Morphological analysis works with words at this level. Typically, a natural-language processor knows how to understand multiple forms of a word: its plural and singular, for example.
Syntactic: At this level, natural-language processors focus on structural information and relationships.
Semantic: Natural-language processors derive an absolute (dictionary definition) meaning from context.
Pragmatic: Natural-language processors derive knowledge from external commonsense information.
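As a rough illustration of the morphological level, the sketch below maps a few regular English plurals to their singular form so that both forms can be treated as one word. The names singularize and same_lexeme, and the suffix rules themselves, are assumptions made for this example. A real morphological analyzer relies on a lexicon and also handles irregular forms.

```python
def singularize(word):
    """Naively map a regular English plural to its singular form.

    Illustrative only: irregular plurals such as "children" are not handled.
    """
    lower = word.lower()
    if lower.endswith("ies") and len(lower) > 4:
        return lower[:-3] + "y"          # "queries" -> "query"
    if lower.endswith(("ches", "shes", "xes", "sses")):
        return lower[:-2]                # "boxes" -> "box"
    if lower.endswith("s") and not lower.endswith("ss"):
        return lower[:-1]                # "films" -> "film"
    return lower


def same_lexeme(a, b):
    """Treat two surface forms as the same word if they share a singular."""
    return singularize(a) == singularize(b)
```

With these rules, same_lexeme("film", "Films") holds, while a word such as "glass" is left unchanged.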
A practical reality?
The realization of a fully communicating artificial intelligence was long considered a science fiction fantasy. However, with the advent of the World Wide Web, XML, and the World Wide Web Consortium's (W3C) RDF, NLP could become a pervasive reality. With powerful Web crawlers needing to index an exponentially growing collection of resources, it's no surprise that information management and data querying are areas that might benefit immensely from NLP.
So, why hasn't NLP escaped a backdrop of impractical artificial intelligence software implementations? How does XML technology fit into all this?
Natural-language limitations
One of the major limitations of modern NLP is that most linguists approach NLP at the pragmatic level by gathering huge amounts of information into large knowledge bases that describe the world in its entirety. These academic knowledge repositories are defined in ontologies that take on a life of their own and never end up in practical, widespread use. There are various knowledge bases, some commercial and some academic. The largest and most ambitious is the Cyc Project. The Cyc Knowledge Server is a monstrous inference engine and knowledge base. Even natural-language modules that perform specific, limited, linguistic services aren't financially feasible for use by the average developer.
In general, NLP faces the following challenges:
Addressing the limitations
The W3C's Resource Description Framework (RDF) was developed to enable the automated processing of Web resources by providing a means of defining metadata about those resources. RDF addresses the physical limitation of memory space by allowing a natural-language processor to access resources in a distributed environment. A networked computer processor can access RDF models on various other processors in a standard way.
RDF provides a unifying ontological syntax for defining knowledge bases. RDF is expressed in XML, a markup language designed to cleanly separate data formatting from data semantics. Because XML is extensible (a document must be well-formed and, where a DTD or schema applies, valid), many categories of information can be expressed very clearly using XML and RDF.
RDF is by no means the perfect ontological syntax. For instance, there are five semantic principles: existence, coreference, relation, conjunction, and negation. RDF doesn't inherently support conjunctions and negations. At its core, RDF allows users to define statements in a simple format about network resources. The statements are of the form:
subject | predicate | object.
For example:
http://www.microsoft.com | author | Microsoft.
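To make the statement form concrete, here is a minimal sketch that reads the pipe-separated notation shown above into (subject, predicate, object) triples and filters them. It handles only this plain-text notation, not RDF's XML serialization, and it does not use 4RDF. The names Triple, parse_statement, and match are hypothetical helpers introduced for this illustration.

```python
from collections import namedtuple

# A statement is a (subject, predicate, object) triple.
Triple = namedtuple("Triple", ["subject", "predicate", "object"])


def parse_statement(text):
    """Parse 'subject | predicate | object.' into a Triple."""
    body = text.strip()
    if body.endswith("."):
        body = body[:-1]                 # drop the terminating period
    parts = [part.strip() for part in body.split("|")]
    if len(parts) != 3:
        raise ValueError("expected 'subject | predicate | object': %r" % text)
    return Triple(*parts)


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every field that is not None."""
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.object == obj)
    ]


statements = [parse_statement("http://www.microsoft.com | author | Microsoft.")]
authors = match(statements, predicate="author")
```

Leaving a field as None acts as a wildcard, so the last line selects every statement whose predicate is author, whatever its subject and object.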
RDF also supports RDF schemas, which define a hierarchy of resources and the properties/predicates that can be asserted about them. The semantics defined this way are only as useful as the resources are well defined and the RDF schemas are complete.
However, RDF has the advantage of almost inevitable adoption on the World Wide Web. The impact that the semantic Web will have on search engine technology and knowledge management is evident.
RDF separates the process of defining metadata about resources from the resources themselves. This brings the World Wide Web a step closer to being a usable knowledge repository, because precise RDF models can add more semantics to existing resources in a standardized way.
Thought Treasure
Now that we've outlined how RDF can affect NLP and knowledge management in general, let's take a closer look at a practical example. Thought Treasure is a powerful, open source natural-language processor developed by Erik T. Mueller. It is capable of interpreting natural-language queries, responding to queries in natural language, extending its knowledge base, and identifying emotions in conversation. It provides developers with powerful ways to extend its knowledge base and invoke its various services. Thus, I could easily extend it to handle RDF using 4RDF.
Thought Treasure comprises seven modules, but we'll focus on five of them.
Representation agency: The Thought Treasure database format can be described by the following example:
=media-object/information/
==advertisement//
==art//
==computer-program//
==dance//
==datafeed//
==film//
===film-genre//
====comedy-film//
====documentary-film//
====drama-film//
====fantasy-film//
====horror-film//
====musical-film//
====mystery-film//
==genetic-code//
==opera//
==play//
==text//
===book//
===magazine//
Hierarchy is signified by the indentation level of the leading = characters. In this case, for example, an advertisement is a kind of media-object. The hierarchy consists of concrete concepts, which break down into entities, situations, states, actions, relations, attributes, and enumerations.
Objects can have explicit parents signified by a list of concepts, separated by a / after the indentation. For example, media-objects aren't arranged below a concept of lower indentation, but their parent is identified as the information concept. Concepts at the same hierarchical level are considered equivalent; an opera, for example, is equivalent to a play under the concept of a media-object.
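The parent rules just described can be sketched in a few lines of Python. This is not Thought Treasure's own loader. It is a minimal reading of the notation based only on the description above, and it makes three assumptions: each concept sits on its own line, lines starting with ; are comments, and at most one explicit parent appears between the slashes. The function name parse_hierarchy is hypothetical.

```python
def parse_hierarchy(text):
    """Return {concept: parent} for a listing in the '=' notation.

    A concept's parent is its explicit parent if one is given between the
    slashes; otherwise it is the nearest preceding concept with fewer
    leading '=' characters.
    """
    parents = {}
    stack = []  # (depth, name) pairs for the current chain of ancestors
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(";"):   # assumed comment syntax
            continue
        depth = len(line) - len(line.lstrip("="))
        if depth == 0:
            continue                            # not a concept line
        fields = line[depth:].split("/")
        name = fields[0]
        explicit = fields[1] if len(fields) > 1 else ""
        while stack and stack[-1][0] >= depth:  # leave deeper or equal levels
            stack.pop()
        if explicit:
            parents[name] = explicit
        elif stack:
            parents[name] = stack[-1][1]
        else:
            parents[name] = None
        stack.append((depth, name))
    return parents


sample = """\
=media-object/information/
==film//
===film-genre//
====comedy-film//
==opera//
==play//
"""
tree = parse_hierarchy(sample)
```

Here tree maps media-object to information (its explicit parent), comedy-film to film-genre, and both opera and play to media-object.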
One large concept in the Thought Treasure ontology is a relation. A relation can comprise subclasses, enabling users to add domain-specific relations to their proprietary ontologies. For example:
=media-object-relation/relation/
==author-of//
==composer-of//
==newscaster-of//
==viewer-of//
==actor-of//
==cinematographer-of//
==director-of//
==language-of//
==MPAA-rating-of//
==producer-of//
==writer-of//
Here, the user defines a media-object-relation concept, which is implicitly declared a relation. The user then defines other media-object-relations that will be used to describe media-objects.
Users can then make assertions about how a concept relates to another. For example:
==comedy-film//
==mystery-film//
===the-big-lebowski/|director-of=MALE:"Joel Coen"|
|actor-of=MALE:"John Goodman"|
|actor-of=MALE:"Jeff Bridges"|
|actor-of=MALE:"Steve Buscemi"|
|actor-of=FEMALE:"Julianne Moore"|
Assertions take the form | relation=concept | and apply to the concept on whose entry they appear (in this case the film The Big Lebowski). This is shorthand notation for the fully expanded form [relation concept1 concept2], which can be interpreted as "concept1 relates to concept2 by the relation relation."
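The expansion from shorthand to the full form can be sketched as follows. This is an illustration, not Thought Treasure code. It assumes the concept being described fills the concept1 slot, and it keeps values such as MALE:"Joel Coen" verbatim rather than interpreting them. The names ASSERTION, expand_assertions, and format_assertion are hypothetical.

```python
import re

# Matches one |relation=value| assertion; the value is kept verbatim.
ASSERTION = re.compile(r"\|\s*([^=|\s]+)\s*=\s*([^|]+?)\s*\|")


def expand_assertions(concept, shorthand):
    """Expand |relation=value| shorthand into (relation, concept, value)."""
    return [
        (relation, concept, value)
        for relation, value in ASSERTION.findall(shorthand)
    ]


def format_assertion(assertion):
    """Render a triple in the bracketed [relation concept1 concept2] form."""
    return "[%s %s %s]" % assertion


shorthand = '|director-of=MALE:"Joel Coen"| |actor-of=MALE:"John Goodman"|'
expanded = expand_assertions("the-big-lebowski", shorthand)
lines = [format_assertion(a) for a in expanded]
```

Each entry in lines is one fully expanded assertion about the-big-lebowski, with the relation first and the original value last.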
Lexical component: Part of the database format includes the ability to mark words as belonging to specific parts of speech. For instance:
====medium#A-length# film*.z//
This entry specifies that the phrase medium-length film is broken into an adjective (A) and two nouns (the default part of speech). The # signifies that the word is used as is (other grammatical forms are not to be used interchangeably). The * signifies that other forms and conjugations of the word can be used in place of the specified one. For instance, medium-length films and medium-length film are considered equivalent concepts.
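The marker rules for this entry can be sketched in Python. This reading is an assumption drawn only from the explanation above: each word is followed by # or *, an optional capital letter after the marker names a part of speech, and noun is the default. The trailing .z is ignored because the text does not explain it, and any code other than A is passed through uninterpreted. The names ENTRY_WORD and parse_lexical_entry are hypothetical.

```python
import re

# A word, its marker (# = fixed form, * = inflects), and an optional code.
ENTRY_WORD = re.compile(r"([A-Za-z]+)([#*])([A-Z]?)")
PARTS_OF_SPEECH = {"A": "adjective", "": "noun"}  # noun is the default


def parse_lexical_entry(entry):
    """Split a lexical entry into words with part of speech and inflection."""
    words = []
    for word, marker, code in ENTRY_WORD.findall(entry):
        words.append({
            "word": word,
            "part_of_speech": PARTS_OF_SPEECH.get(code, code),
            "inflects": marker == "*",
        })
    return words


words = parse_lexical_entry("medium#A-length# film*.z")
```

For the entry above, words describes medium as a fixed adjective, length as a fixed noun, and film as a noun whose other forms are accepted.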
Text agency: The text agency is responsible for scanning natural-language text for predefined words and phrases. It then creates parse nodes for every possible alternative grammatical form of the words and phrases in the text. The text agency then tags the nodes with parts of speech and sends the results to the syntactic parser. Other agents can process the input for names, dates, and other nomenclature.
Syntactic component: This component is responsible for creating parse trees out of the parse nodes generated by the text agency. It uses various production rules and parts-of-speech filters to accomplish this.
Semantic component: This all-important component consists of a semantic parser, which is responsible for converting the syntactic parse tree into an assertion in Thought Treasure. It uses various high-level grammatical constructs (adjuncts, copulas, relative clauses, relations, and so on) to resolve the parse tree information into assertions. It also comprises a model of English and French verb tenses.
A useful example
Thought Treasure is a great starting point for any practical NLP/RDF development, mostly because it's a very robust natural-language processor but also because it's written in ANSI C and can be built into binaries on various platforms, including Linux.
For this example, I'll use a component of Fourthought's 4Suite: 4RDF. We'll run the Python 1.5 code under Linux.
First, take a look at this RDF schema.
Now let's look at a Thought Treasure ontology representing the predicates in the RDF schema:
;----------Predicates----------
=rdf-relation/relation/
==URI-of//URI.
Unix Insider