OLAC Vocabulary for Language Technology Functionality

Date issued:	2002-12-09
Status of document:	Draft Standard. This is only a preliminary draft that is still under development; it has not yet been presented to the whole community for review.
This version:	http://www.language-archives.org/OLAC/functionality-20021209.html
Latest version:	http://www.language-archives.org/OLAC/functionality.html
Previous version:	http://www.language-archives.org/OLAC/functionality-20021202.html
Abstract:	This document specifies the controlled vocabulary used by OLAC in the description of language technology functionality. The vocabulary describes the functionality in particular of software according to the functional categories provided by the HLT Survey version 2.
Editors:	Baden Hughes mailto:baden@compuling.net
Changes since previous version:	20021209: added some synonyms and definitions; divided elements by section; added references to HLT Survey website for elements

Copyright © Baden Hughes. This material may be distributed only subject to the terms and conditions set forth in the Open Publication License, v1.0 or later (the latest version is presently available at http://www.opencontent.org/openpub/).

Introduction
Information Extraction
Information Retrieval
Authoring Tools
Language Analysis
Language Understanding
Knowledge Representation and Discovery
Spoken Language Input
Written Language Input
Natural Language Generation
Spoken Output Technologies
- spoken_output_technologies/text-to-speech_synthesis
- spoken_output_technologies/spoken_language_generation
Multilinguality
Multimodality
- multimodality/representations_of_space_and_time
- multimodality/modality_integration_facial_movement_and_speech
Coding and Compression
Mathematical Methods
Discourse and Dialogue
Language Resources
Evaluation
Conclusion

1. Introduction

This document specifies the controlled vocabulary used by OLAC in the description of language technology functionality. The vocabulary describes the functionality in particular of software according to the functional categories provided by the HLT Survey.

Any single piece of language technology software may have one or more functionality descriptions; these will usually be closely related.

2. Information Extraction

information_extraction/information_extraction

Name	Information Extraction
Definition	The goal of information extraction (IE) is to build systems that find and link relevant information from natural language text while ignoring irrelevant information. The information of interest is typically pre-specified in the form of uninstantiated frame-like structures, also called templates. The templates are domain- and task-specific. The major task of an IE system is then to identify the relevant parts of the text, which are used to fill a template's slots.
Comments

information_extraction/relation_extraction

Name	Relation Extraction
Definition	Automated or human-assisted acquisition of relations between concepts from textual or other data, usually within a selected domain.
Comments

information_extraction/text_data_mining

Name	Text Data Mining (TM)
Definition	Text data mining concerns the application of data mining (knowledge discovery in databases, KDD) to unstructured textual data. The goal of data mining is to discover or derive new information from data, finding patterns across datasets, and/or separating signal from noise. Core text mining algorithms decompose text into meaningful chunks that can then be used for true data mining purposes.
Comments

information_extraction/summarization

Name	Summarization
Definition	Text summarization is the process of distilling the most important information from a source (or sources) to produce an abridged version for a particular user (or users) and task (or tasks).
Comments

information_extraction/answer_extraction

Name	Answer Extraction (Textual Question Answering)
Definition	Answer extraction (AE) aims at retrieving those exact passages of a document that directly answer a given user question. AE is more ambitious than information retrieval and information extraction in that the retrieval results are phrases, not entire documents, and in that the queries may be arbitrarily specific. It is less ambitious than full-fledged question answering in that the answers are not generated from a knowledge base but looked up in the text of documents.
Comments

information_extraction/named_entity_recognition

Name	Named Entity Recognition (NERC)
Definition	Named entity (NE) recognition is a form of information extraction in which the major task is to identify in natural language text every word or sequence of words that is a person name, organization, location, date, time, monetary value or percentage expression, and to classify it accordingly. NE recognition has a high impact on a number of applications, e.g. Internet search engines, text data mining or answer extraction.
Comments
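As a rough illustration of the task described in this definition, the following sketch tags three of the listed entity classes (date, monetary value, percentage expression) with regular expressions. The patterns, the class labels and the function name `tag_entities` are assumptions made for this example only. Recognizing person names, organizations and locations generally requires lexicons or statistical models rather than patterns of this kind.

```python
import re

# Hypothetical patterns for three of the entity classes named in the definition.
PATTERNS = [
    ("DATE", re.compile(r"\b\d{4}-\d{2}-\d{2}\b")),
    ("MONEY", re.compile(r"\$\d+(?:\.\d+)?")),
    ("PERCENT", re.compile(r"\b\d+(?:\.\d+)?%")),
]


def tag_entities(text):
    """Return (start, end, label, surface) tuples sorted by position."""
    found = []
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            found.append((match.start(), match.end(), label, match.group()))
    return sorted(found)


if __name__ == "__main__":
    for entity in tag_entities("On 2002-12-09 the price rose 5% to $12.50."):
        print(entity)
```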

information_extraction/multimedia_information_extraction

Name	Multimedia Information Extraction
Definition	Multimedia Information Extraction is the extraction of useful information from multimedia document collections.
Comments

3. Information Retrieval

information_retrieval/information_retrieval

Name	Information Retrieval (IR)
Definition	Information Retrieval is the process of locating information that fits a user's requirements, where the requirements are usually expressed as a search query. The fit of the retrieved information with the information need is referred to as "relevance". The information can be retrieved from databases (data retrieval) or from document collections (document retrieval), where documents can either be text documents or other media (audio, video, semi-structured data, multimedia). Success in information retrieval is generally defined by retrieving as much relevant information as possible (measured by "recall") while minimising the irrelevant information retrieved (measured by "precision"). The most widely used information retrieval systems today are Internet search engines.
Comments
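The two measures named in this definition can be computed from a set of retrieved items and a set of relevant items. The sketch below uses the standard set-based definitions; the function name and the document identifiers are assumptions made for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant retrieved items / all retrieved items
    recall    = relevant retrieved items / all relevant items
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    # Four documents retrieved, three relevant overall, two in common.
    print(precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"]))
```

In this example two of the four retrieved documents are relevant (precision 0.5), and two of the three relevant documents were found (recall 2/3).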

information_retrieval/topic_detection

Name	Topic Detection (TD)
Definition	Detection of the topic of a document or of a segment in a stream of natural language data.
Comments

information_retrieval/multilingual_information_retrieval

Name	Multilingual Information Retrieval (CLIR)
Definition	Cross-language information retrieval means using queries in one language to search for documents in a different language. Multilingual information retrieval is a broader term, which includes the case where queries in different languages are used, but only for searching documents in the same language.
Comments	Synonyms: cross-language information retrieval, translingual information retrieval, sprachubergreifendes Information Retrieval

information_retrieval/categorization

Name	Categorization
Definition	The categorization task is to assign a new data type (e.g. a document) to one, or more, of a pre-existing set of classes (e.g. document classes). By contrast, the task of clustering (e.g. document clustering) is to create, or discover, a reasonable set of clusters for a given set of data types (e.g. documents).
Comments	Synonyms: Classifikation, Kategorisierung, Klassifizierung

information_retrieval/relevance_ranking

Name	Relevance Ranking
Definition	Queries given to search engines or other retrieval systems are often not very specific, and lead to a large number of matching documents. In these cases the retrieval system should have a good estimate of the relevance of the documents to the user's needs, so that "good" documents show up early in the enumeration. A large number of factors should enter into a good ranking method, including the positions of the query terms in the document, linguistic context of the matches, link popularity, classification of the documents, user models etc.
Comments
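A minimal sketch of one possible ranking method is given below. It uses only term statistics (TF-IDF weighting) and none of the other factors listed in this definition, such as term positions, link popularity or user models. The function name, whitespace tokenization and sample documents are assumptions made for this example.

```python
import math
from collections import Counter


def rank(query, documents):
    """Return document indices ordered by descending TF-IDF score."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for index, tokens in enumerate(tokenized):
        tf = Counter(tokens)
        score = sum(
            tf[term] * math.log(n_docs / df[term])
            for term in query.lower().split()
            if df[term]  # skip query terms unseen in the collection
        )
        scores.append((score, index))
    # Highest score first; ties keep the original document order.
    return [index for _, index in sorted(scores, key=lambda p: (-p[0], p[1]))]


if __name__ == "__main__":
    docs = [
        "speech recognition for dictation",
        "speech retrieval of broadcast archives",
        "handwriting recognition",
    ]
    print(rank("speech retrieval", docs))
```

The second document matches both query terms, including the rarer term "retrieval", so it is listed first.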

information_retrieval/speech_retrieval

Name	Speech Retrieval
Definition	Speech Retrieval is the process of retrieving spoken audio material (documents) in response to a search query. Search queries can be spoken or textual. Speech retrieval makes use of techniques from speech recognition, natural language understanding and information retrieval. Possible applications are the indexing of archives of broadcast material, and monitoring of telephone conversations.
Comments	Synonyms: Spoken Document Retrieval, Audio Retrieval, Audio Mining, Speech Mining

information_retrieval/clustering

Name	Clustering
Definition	Clustering algorithms partition a set of objects into groups or clusters. The task of clustering (e.g. document clustering) is to create, or discover, a reasonable set of clusters for a given set of data types (e.g. documents). By contrast, the categorization task is to assign a new data type (e.g. a document) to one, or more, of a pre-existing set of classes (e.g. document classes).
Comments	Synonyms: grouping, category induction, Clustering-Verfahren, Klassifikationsverfahren

information_retrieval/presentation_and_visualisation

Name	Presentation and Visualisation
Definition	TBD
Comments	Synonyms: Visualisierung

information_retrieval/multimedia_retrieval

Name	Multimedia Retrieval
Definition	Multimedia Retrieval is a variant of information retrieval on multimedia document collections.
Comments

4. Authoring Tools

authoring_tools/spell_checking

Name	Spell Checking
Definition	Techniques for the identification of spelling or typing errors in textual documents, which may be applied interactively during the creation of the document, or off-line for existing documents. Spelling correction is an extension in which for each assumed error one or several hypothetical corrections are suggested.
Comments	Synonyms: Spelling Correction, Rechtschreibkorrektur
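The distinction between detecting an error and suggesting corrections can be sketched as follows. The example assumes a word list as the lexicon and uses Levenshtein edit distance to rank suggestions; both are common choices but are assumptions here, as are the function names and the distance threshold.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution or match
            ))
        previous = current
    return previous[-1]


def suggest(word, lexicon, max_distance=2):
    """Return lexicon entries within max_distance of word, closest first."""
    if word in lexicon:
        return []  # the word is known, so no error is assumed
    scored = sorted((edit_distance(word, entry), entry) for entry in lexicon)
    return [entry for distance, entry in scored if distance <= max_distance]


if __name__ == "__main__":
    print(suggest("informtion", ["information", "retrieval"]))
```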

authoring_tools/automatic_hyperlinking

Name	Automatic Hyperlinking
Definition	TBD
Comments

authoring_tools/language_checking

Name	Language Checking (LC)
Definition	Language Checking comprises technologies used to detect and/or correct erroneous or inconsistent language use in documents. The scope of language checking technology ranges from general error correction, as performed by spell checkers and grammar checkers, to the implementation of corporate styles and terminology control (controlled language). Benefits of controlled languages are the enhancement of consistency within and across documents and the reduction of ambiguity and vagueness, yielding documents which are easier to process by both humans and machines.
Comments	Synonyms: controlled language checking, grammar checking

authoring_tools/structure-based_authoring_assistants

Name	Structure-based Authoring Assistants (CL)
Definition	Structure-based authoring assistants: Software for supporting the distributed creation of consistent, high-quality information on an industrial scale. Key components include terminology extraction for legacy information, terminology checking and hyperlinking integrated in standard authoring environments, as well as structural (syntactic) checking of texts to ensure readability, consistency and translatability.
Comments	Synonyms: controlled language tools

5. Language Analysis

language_analysis/tokenization_and_segmentation

Name	Tokenization and Segmentation
Definition	Tokenization is commonly seen as an independent process of linguistic analysis, in which the input stream of characters is segmented into an ordered sequence of word-like units, usually called tokens, which function as input items for subsequent steps of linguistic processing. Tokens may correspond to words, numbers, punctuation marks or even proper names. The recognized tokens are usually classified according to their syntax. Since the notion of tokenization seems to have different meanings to different people, some tokenization tools fulfil additional tasks such as isolation of sentences, handling of end-of-line hyphenations or conjoined clitics and contractions.
Comments	Synonyms: Word boundary detection, Tokenisierung
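A minimal sketch of the segmentation step is shown below: a single regular expression splits a character stream into an ordered sequence of tokens, and each token receives a coarse class. The pattern, the three class labels and the function name are assumptions made for this example; it does not handle sentence isolation or hyphenation.

```python
import re

# Numbers (with decimal separators), words (with internal hyphens or
# apostrophes), or any single non-space punctuation character.
TOKEN = re.compile(r"\d+(?:[.,]\d+)*|\w+(?:[-']\w+)*|[^\w\s]")


def tokenize(text):
    """Return (surface, class) pairs in the order they occur in text."""
    tokens = []
    for match in TOKEN.finditer(text):
        surface = match.group()
        if surface[0].isdigit():
            kind = "NUMBER"
        elif surface[0].isalnum() or surface[0] == "_":
            kind = "WORD"
        else:
            kind = "PUNCT"
        tokens.append((surface, kind))
    return tokens


if __name__ == "__main__":
    print(tokenize("It costs 3.5 euros, doesn't it?"))
```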

language_analysis/shallow_parsing

Name	Shallow Parsing
Definition	TBD
Comments	Synonyms: Chunk Parsing, Partial Parsing, (NP) Chunking

language_analysis/grammar_models_and_formalisms

Name	Grammar Models and Formalisms
Definition	TBD
Comments

language_analysis/head-driven_phrase_structure_grammar

Name	Head-driven Phrase Structure Grammar (HPSG)
Definition	HPSG is a constraint-based, lexicalist approach to grammatical theory that seeks to model human languages as systems of constraints on typed feature structures. Lexical information is organized in terms of multiple inheritance hierarchies that allow complex properties of words to be derived from the logic of the lexicon. Phrasal types are also treated in terms of multiple inheritance hierarchies that allow generalizations about diverse construction types to be factored into various cross-cutting dimensions. See also the corresponding pages of the HLT-Survey Section: http://www.lt-world.org/HLT_Survey/ltw-chapter3-3.pdf
Comments

language_analysis/government_and_binding_theory_minimalist_framework

Name	Government and Binding Theory / Minimalist Framework (GB Theory / Minimalism)
Definition	Minimalism is the latest development of Transformational Generative Grammar, which was initiated by Chomsky in the 1950s, and further developed into the Principles and Parameters (or Government and Binding) Theory of Syntax in the 1980s. The fundamental idea of Transformational Generative Syntax is that a sentence is produced from an abstract structural representation, which is sequentially altered by structure-dependent derivations, following universal principles and language-specific parameter settings. The Minimalist Program maintains that derivations and representations be minimal, according to principles of economy.
Comments	Synonyms: Principles and Parameters Theory of Syntax, Generative Syntax, Minimalismus

language_analysis/lexicons_for_constraint-based_grammars

Name	Lexicons for Constraint-Based Grammars (CbG-Lex)
Definition	Lexicons that provide rich information about the morphological, syntactic and semantic properties of words, developed in unification- and constraint-based grammar formalisms. These formalisms encode lexical descriptions as feature structures, which have a clear mathematical and computational interpretation and constitute ideal data structures for encoding complex word knowledge. See also the related HLT-Survey Section: http://www.lt-world.org/HLT_Survey/ltw-chapter3-4.pdf
Comments	Synonyms: lexicons in unification-based grammar formalisms, lexicons in lexicalist theories

language_analysis/part-of-speech_tagging

Name	Part-of-speech Tagging (POS Tagging)
Definition	The technologies for or the process of determining the correct part-of-speech tag for a word given its local context. The task comprises disambiguation of multiple part-of-speech tags and guessing of the correct part-of-speech tag for unknown words. Part-of-speech tagging is frequently used as a preprocessing step for shallow and deep parsers. See also the related HLT-Survey Section: http://www.lt-world.org/HLT_Survey/ltw-chapter3-2.pdf
Comments	Synonyms: Wortartenzuweisung, Etiqueteurs de Parties du Discours

language_analysis/probabilistic-context-free_grammars

Name	Probabilistic Context-free Grammars (PCFG)
Definition	A context-free grammar augmented with non-negative weights for all grammar rules resulting in a probability distribution for both the syntax trees and the language of the grammar. Probabilistic context-free grammars may be used (i) to disambiguate the analyses of a given sentence, and (ii) in language modeling. See also the related HLT-Survey Section on Robust Parsing: http://www.lt-world.org/HLT_Survey/ltw-chapter3-7.pdf
Comments	Synonyms: stochastic context-free grammars, probabilistische kontextfreie Grammatiken, stochastische kontextfreie Grammatiken
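The use of rule weights for disambiguation can be sketched as follows: the probability of a syntax tree is the product of the probabilities of the rules used to build it, and competing analyses of a sentence can then be compared by this value. The toy grammar, the tuple encoding of trees and the function name are assumptions made for this example.

```python
from math import prod

# Toy grammar: probabilities of rules sharing a left-hand side sum to 1.
RULES = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("she",)): 0.4,
    ("NP", ("fish",)): 0.6,
    ("VP", ("V", "NP")): 0.7,
    ("VP", ("V",)): 0.3,
    ("V", ("eats",)): 1.0,
}


def tree_probability(tree):
    """Probability of a parse tree: the product of its rule probabilities.

    A tree is a tuple (label, child, ...); a terminal is a plain string.
    """
    if isinstance(tree, str):  # a terminal contributes no rule
        return 1.0
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    return RULES[(label, rhs)] * prod(tree_probability(c) for c in children)


if __name__ == "__main__":
    tree = ("S", ("NP", "she"), ("VP", ("V", "eats"), ("NP", "fish")))
    print(tree_probability(tree))
```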

language_analysis/categorial_grammar

Name	Categorial Grammar (CG)
Definition	Categorial Grammar is a lexical approach in which expressions are assigned categories that specify how to combine with expressions to create larger expressions. An analysis of an expression proceeds by inference over the categories assigned to its individuatable parts, trying to assign a given goal-category to the expression. In the type-logical variant of categorial grammar, a semantic representation is built compositionally in parallel to the categorial inference. See also the corresponding pages in the HLT-Survey Section: http://www.lt-world.org/HLT_Survey/ltw-chapter3-6.pdf
Comments	Synonyms: type-logical grammar, multimodal logical grammar, categorial type logic

language_analysis/lexical_functional_grammar

Name	Lexical-Functional Grammar (LFG)
Definition	Lexical-Functional Grammar is a lexicalist, nontransformational theory of grammar which is built on a powerful and mathematically well-defined grammar formalism, designed for typologically diverse, configurational and non-configurational languages. LFG models different levels of linguistic description in a functional correspondence architecture. C-structure encodes constituency and surface order, which radically differ across typologically distinct languages. F-structure encodes functional syntactic information, which is largely shared between typologically distinct languages. LFG grammars are declarative, and therefore reversible for generation.
Comments	Synonyms: Lexikalisch-funktionale Grammatik

language_analysis/systemic_functional_linguistics

Name	Systemic Functional Linguistics (SFL)
Definition	Systemic-Functional Linguistics (SFL) is a theory of language centred around the notion of language function. While SFL accounts for the syntactic structure of language, it places the function of language as central (what language does, and how it does it), in preference to more structural approaches, which place the elements of language and their combinations as central. SFL starts at social context, and looks at how language both acts upon, and is constrained by, this social context. A central notion is 'stratification', such that language is analysed in terms of four strata: Context, Semantics, Lexico-Grammar and Phonology-Graphology.
Comments	Synonyms: Systemic Functional Grammar, Systemic Grammar, SFG, Systemisch Funktionale Grammatik, Systemisch Funktionale Linguistik, Linguistique Systemique Fonctionelle

language_analysis/morphological_analysis

Name	Morphological Analysis
Definition	The technologies for or the process of tracing the inflectional, derivational, and compounding processes in the formation of a given word in order to determine properties such as stem form, part-of-speech and inflectional information. As a crucial preprocessing step, morphological analysis is used in virtually all fields of natural language processing. See also the corresponding pages in the HLT-Survey Section: http://www.lt-world.org/HLT_Survey/ltw-chapter3-2.pdf
Comments	Synonyms: morphology, morphologische Analyse, analyse morphologique

language_analysis/natural_language_parsing

Name	Natural Language Parsing (NL Parsing)
Definition	Parsing (from Latin "pars orationis" = part of speech) is the syntactic analysis of languages. Natural Language Parsing is the syntactic analysis of natural languages, such as Finnish or Chinese. The objective of Natural Language Parsing is to determine the parts of sentences (such as verbs, noun phrases, or relative clauses), and the relationships between them (such as subject or object). Unlike parsing of formally defined artificial languages (such as Java or predicate logic), parsing of natural languages presents problems due to ambiguity, and the productive and creative use of language. See also the corresponding HLT-Survey Section: http://www.lt-world.org/HLT_Survey/ltw-chapter3-6.pdf
Comments	Synonyms: syntactic analysis

language_analysis/dependency_grammar

Name	Dependency Grammar (DG)
Definition	"Dependency Grammar" stands for a collection of approaches to natural language grammar sharing the following fundamental characteristics: the distinction between heads and dependents; the immediate modification of a head by a dependent (i.e. without intervening nonterminals as in phrase-structure grammar); and the naming of the relation between a head and a dependent. Approaches can be differentiated mainly according to whether they consider grammatical or semantic relations (e.g. "subject" vs. "Actor"), and whether the grammar describes tree-structures or graphs. See also the corresponding HLT-Survey Section: http://www.lt-world.org/HLT_Survey/ltw-chapter3-3.pdf
Comments	Synonyms: Dependenzgrammatik, Afhankelijkheidsgrammatika

language_analysis/tree_adjoining_grammar

Name	Tree Adjoining Grammar (TAG)
Definition	Tree-Adjoining Grammars (TAGs) are tree rewriting systems. The grammar components are (lexicalised) elementary trees, which are composed by substitution and adjunction operations. The syntactic representation consists of the constituent tree built by composition of elementary trees, and a derivation tree, which records the dependencies between elementary trees as established by substitution and adjunction operations. A basic linguistic assumption underlying TAG is that elementary trees encode basic predicate-argument structure. This extends to long-distance dependencies, which are "localised" in (lexicalised) elementary trees.
Comments	Synonyms: LTAG, FTAG, MCTAG

language_analysis/optimality_theory_in_syntax

Name	Optimality Theory in Syntax (OT in Syntax)
Definition	Optimality Theory (OT) is a recent development in theoretical linguistics. OT deviates from more traditional linguistic frameworks in that it assumes grammatical constraints to be (a) universal, (b) violable, and (c) ranked. Assumption (a) means that constraints are maximally general, i.e., they contain no exceptions or disjunctions, and there is no parametrization across languages. Highly general constraints will inevitably conflict, therefore assumption (b) allows constraints to be violated, even in a grammatical structure, while assumption (c) stipulates that some constraint violations are more relevant than others. In this setting, a structure is grammatical if it is optimal in the sense of violating the least highly ranked constraints compared with other possible candidate structures. Which candidate is optimal depends on how the constraints in the grammar are ranked, thus crosslinguistic variation can be accounted for via variation in the constraint ranks. Optimality Theory is widely used in phonology, based on Prince and Smolensky's (1993) seminal work. In syntax, the OT paradigm is less popular, but there have been interesting attempts to combine OT with LFG. The OT literature also includes important computational contributions (especially as regards OT models of language acquisition).
Comments	Synonyms: OT Syntax
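The evaluation step described in this definition can be sketched as a lexicographic comparison: each candidate's violation counts are listed in ranking order, and the candidate with the smallest profile is optimal. The candidate names, the constraint names `C1` and `C2`, and the function name are placeholders invented for this example, not constraints from the OT literature.

```python
def optimal(candidates, ranking):
    """Return the candidate that is best under a strict constraint ranking.

    candidates maps a candidate name to {constraint: violation count};
    ranking lists constraint names from highest to lowest ranked.
    """
    def profile(name):
        violations = candidates[name]
        return tuple(violations.get(constraint, 0) for constraint in ranking)

    # Tuples compare lexicographically, so a violation of a higher-ranked
    # constraint outweighs any number of lower-ranked violations.
    return min(candidates, key=profile)


CANDIDATES = {
    "cand_a": {"C1": 0, "C2": 2},
    "cand_b": {"C1": 1, "C2": 0},
}

if __name__ == "__main__":
    print(optimal(CANDIDATES, ["C1", "C2"]))  # ranking of one hypothetical language
    print(optimal(CANDIDATES, ["C2", "C1"]))  # the same constraints re-ranked
```

Re-ranking the same two constraints selects a different candidate, which is how variation in constraint ranks accounts for crosslinguistic variation.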

6. Language Understanding

language_understanding/word_sense_disambiguation

Name	Word Sense Disambiguation (WSD)
Definition	Word Sense Disambiguation is a subtask of semantic tagging, which consists of assigning a semantic class (sense) to a lexical item as specified by a semantic lexicon. If the semantic lexicon specifies more than one sense for a particular lexical item, a disambiguation procedure is needed to decide upon the most appropriate sense(s) for any given instance of the lexical item in text. WSD is not a self-contained application, but it may be included as an integrated part of a semantic processor.
Comments	Synonyms: lexical sense resolution, sense discrimination, Lesartdesambiguierung
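One simple disambiguation procedure, sketched below, chooses the sense whose lexicon gloss shares the most words with the context of the instance (a simplified form of the Lesk approach). The two-sense lexicon, the sense labels, the stopword list and the function name are assumptions made for this example.

```python
STOPWORDS = frozenset({"a", "an", "and", "the", "of", "in", "on", "to", "or", "as"})

# Hypothetical semantic lexicon: sense label -> gloss.
SENSES = {
    "bank/finance": "an institution that accepts deposits and lends money",
    "bank/river": "sloping land beside a body of water such as a river",
}


def disambiguate(context, senses):
    """Return the sense whose gloss overlaps most with the context words."""
    context_words = set(context.lower().split()) - STOPWORDS

    def overlap(sense):
        gloss_words = set(senses[sense].lower().split()) - STOPWORDS
        return len(context_words & gloss_words)

    return max(senses, key=overlap)


if __name__ == "__main__":
    print(disambiguate("she sat on the bank of the river near the water", SENSES))
    print(disambiguate("he deposits money in the bank", SENSES))
```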

language_understanding/computational_psycholinguistics

Name	Computational Psycholinguistics
Definition	Computational models of the architectures and mechanisms which underlie human language processing. Computational psycholinguistics aims to develop predictive computational theories of mind that explicitly characterize how people both use and acquire knowledge of language. Models are evaluated in terms of their ability to account for human linguistic performance in tasks such as incremental ambiguity resolution, language acquisition, and production.
Comments	Synonyms: sentence processing

language_understanding/computational_semantics

Name	Computational Semantics
Definition	TBD
Comments

language_understanding/computational_pragmatics

Name	Computational Pragmatics
Definition	Pragmatics studies language use in relation to context, and particularly linguistic communication. For communication to work, hearers must recognize speakers' communicative intentions, whereby the connection between intentions and sentences relies on a shared system of beliefs and inferences. Communication is also a social affair, relying on a shared conception of the context situation. Current computational applications are mostly dialogue systems and text generation systems.
Comments	Synonyms: computational discourse processing

7. Knowledge Representation and Discovery

knowledge_representation_and_discovery/ontologies

Name	Ontologies
Definition	What are Ontologies? And why are they important for NLP? From a theoretical point of view, ontology is the metaphysical study of the nature of being and existence. In practice, an ontology is normally viewed as a formal representation of all semantic objects and their connections in a Universe of Discourse. Mapping these semantic objects onto language units (words, phrases, text segments, etc.) is the task of semantic processing in NLP.
Comments

knowledge_representation_and_discovery/automatic_hyperlinking

Name	Automatic Hyperlinking
Definition	TBD
Comments

knowledge_representation_and_discovery/knowledge_discovery

Name	Knowledge Discovery
Definition	Generally, knowledge discovery / data mining is the process of analyzing data from different perspectives and summarizing it into useful information - information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.
Comments	Synonyms: Data Mining

knowledge_representation_and_discovery/semantic_web

Name	Semantic Web
Definition	The Semantic Web is a W3C-based initiative for representing knowledge on the World Wide Web in a machine-readable fashion, such that it can be understood and used by machines for intelligent applications.
Comments

8. Spoken Language Input

spoken_language_input/speech_recognition

Name	Speech Recognition (ASR)
Definition	Automatic Speech Recognition deals with automatically transcribing spoken language as text, which is then further processed in application-dependent ways. Important applications are dictation, control of machines and devices by speech, information systems, speech translation, and aids for disabled persons.
Comments	Synonyms: Spracherkennung

spoken_language_input/acoustic_modelling_in_speech_recognition

Name	Acoustic Modelling in Speech Recognition
Definition	Modelling of basic recognition units in the microphone signal. These units are often phones (esp. if a large vocabulary is used), while systems with a small vocabulary sometimes use larger units like words. The acoustic signal is not used directly, but represented by spectral parameters derived from it. Spectral parameters that are often used are mel-frequency cepstral coefficients (MFCCs) or RASTA PLP coefficients (noise-robust linear predictive coding parameters), although many other parameter types, including parameters based on auditory processing or phonetic features, are also sometimes used. The models in most state-of-the-art systems are obtained through hidden Markov modelling (HMM), although dynamic time warping and neural nets are also used for acoustic modelling (the latter also in combination with HMM). A limited number of systems exist in which the acoustic modelling is not stochastic, but knowledge-based.
Comments
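Of the techniques named in this definition, dynamic time warping is the simplest to sketch. The example below aligns two sequences of scalar values and returns the accumulated alignment cost. Actual systems compare sequences of spectral parameter vectors; the scalar features, the absolute-difference local cost and the function name are assumptions made for this example.

```python
def dtw_distance(a, b):
    """Accumulated cost of the best monotonic alignment of sequences a and b."""
    inf = float("inf")
    # cost[i][j] = best cost of aligning the first i items of a
    # with the first j items of b.
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b item is repeated
                cost[i][j - 1],      # b advances, a item is repeated
                cost[i - 1][j - 1],  # both advance
            )
    return cost[-1][-1]


if __name__ == "__main__":
    # The second sequence is a time-stretched version of the first.
    print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

Because the warping absorbs differences in duration, a time-stretched copy of a sequence aligns at zero cost.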

spoken_language_input/spoken_language_understanding

Name	Spoken Language Understanding
Definition	The analysis of spoken language for an application. Spoken language understanding can involve dealing with multiple recognition hypotheses from ASR, taking prosodic properties of utterances into account and having to deal with fragmentary and grammatically incorrect utterances. Commercial ASR products often are accompanied by analysis tools.
Comments

spoken_language_input/signal_analysis_and_representation

Name	Signal Analysis and Representation
Definition	In acoustic phonetics, the speech signal is represented as a waveform (amplitude curve over time). Through subsequent frequency analysis (e.g., using an FFT), a spectrogram (frequency distribution over time) is generated. For automatic speech processing (e.g., recognition, synthesis), further derived and discretised representations are required, e.g. mel-cepstrum coefficients (see also DSP Techniques).
Comments	Synonyms: Signalanalyse und Reprasentation, Analyse et representation du signal

spoken_language_input/language_modelling

Name	Language Modelling
Definition	Statistical Language Models define probability distributions over sequences of words, and can be used to select the best transcription of an utterance in a speech recognizer. Other applications include spelling correction, natural language generation, and machine translation. The parameters of statistical language models are estimated from a set of training examples. Useful techniques range from simple models based on trigram frequencies up to hybrid models that involve linguistic and world knowledge that may be optimized using sophisticated machine learning approaches.
Comments	Synonyms: Language Modeling, Statistical Language Modelling
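A minimal sketch of parameter estimation from training examples follows. For brevity it uses bigram rather than trigram frequencies, with unsmoothed relative-frequency estimates, so unseen word pairs receive probability zero. The training sentences, the boundary markers and the function names are assumptions made for this example.

```python
from collections import Counter


def train_bigram(sentences):
    """Estimate P(word | previous word) by relative frequency."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        contexts.update(tokens[:-1])  # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))

    def probability(previous, word):
        if not contexts[previous]:
            return 0.0
        return bigrams[(previous, word)] / contexts[previous]

    return probability


def sentence_probability(model, sentence):
    """Probability of a word sequence as a product of bigram probabilities."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    result = 1.0
    for previous, word in zip(tokens, tokens[1:]):
        result *= model(previous, word)
    return result


if __name__ == "__main__":
    model = train_bigram(
        ["recognize speech", "recognize speech now", "wreck a nice beach"]
    )
    # Two competing transcription hypotheses for the same utterance.
    for hypothesis in ("recognize speech", "recognize beach"):
        print(hypothesis, sentence_probability(model, hypothesis))
```

A recognizer could use such scores to prefer the hypothesis that the model considers more probable.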

spoken_language_input/emotion_recognition

Name	Emotion Recognition
Definition	The recognition of emotions from text, speech, facial expressions, gestures and/or physiological measures. A key challenge is the appropriate representation of emotional states.
Comments	Synonyms: Emotionserkennung, Reconnaissance des emotions, Reconocimiento de las emociones

spoken_language_input/prosody_information_processing

Name	Prosody Information Processing
Definition	Prosody can be defined as a feature of speech which extends over more than one segment and is often synonymous with 'suprasegmentals'. Prosodic features include fundamental frequency (F0), relative duration and intensity, and spectral quality. They determine the rhythm and intonation of utterances.
Comments	Synonyms: Prosodic Analysis, Prosodie

spoken_language_input/speaker_recognition

Name	Speaker Recognition
Definition	Speaker recognition, which can be classified into identification and verification, is the process of automatically recognizing who is speaking on the basis of individual information included in speech waves. This technique makes it possible to use the speaker's voice to verify their identity and control access to services such as voice dialing, banking by telephone, telephone shopping, database access services, information services, voice mail, security control for confidential information areas, and remote access to computers.
Comments	Synonyms: speaker verification, speaker identification, voice recognition

9. Written Language Input

written_language_input/document_image_analysis

Name	Document Image Analysis (DIA)
Definition	TBD
Comments	Synonyms: Document Image Decoding

written_language_input/ocr_print

Name	OCR: Print (OCR)
Definition	Automatic transformation of bitmap images of printed textual documents into machine editable text documents.
Comments	Synonyms: Optical Character Recognition

written_language_input/ocr_handwriting

Name	OCR: Handwriting (ICR)
Definition	Automatic transformation of hand-written text into machine editable text.
Comments	Synonyms: Handwriting recognition, Cursive character recognition, Intelligent character recognition

10. Natural Language Generation

natural_language_generation/natural_language_generation

Name	Natural Language Generation (NLG)
Definition	The field of Natural Language Generation (NLG) is concerned with building computer software systems which can produce meaningful texts in human languages from some underlying non-linguistic representation of information. For document production, NLG systems use knowledge about human languages and possibly the application domain. NLG components are used in, for example, automatic report generation, authoring, concept-to-speech, and machine translation systems.
Comments	Synonyms: Human Language Generation, natuerlichsprachliche Generierung, generation du langage naturel

natural_language_generation/deep_generation

Name	Deep Generation
Definition	A knowledge-based approach to natural language generation that stresses theoretical motivation and re-usability of technology and knowledge sources.
Comments

natural_language_generation/shallow_generation

Name	Shallow Generation
Definition	An approach to natural language generation in which the generator is tailored to the specific needs of the given application.
Comments

natural_language_generation/syntactic_generation

Name	Syntactic Generation (how-to-say)
Definition	Generation of a syntactically well-formed natural language utterance from a given representation of its meaning, typically guided by a grammar that encodes the relevant syntactic and semantic constraints.
Comments	Synonyms: NLG, syntaktische Generierung

11. Spoken Output Technologies

spoken_output_technologies/text-to-speech_synthesis

Name	Text-to-speech Synthesis (TTS)
Definition	The generation of synthetic speech from text. Typically, a text-to-speech synthesis system performs a text analysis using natural language processing techniques; determines the appropriate phonetic string and prosodic features; and generates a speech signal by employing a concatenative or rule-based synthesis method.
Comments	Synonyms: speech synthesis, synthetic speech generation, Sprachsynthese, Text-to-Speech Synthese, Synthese de la parole, Habla sintetica

spoken_output_technologies/spoken_language_generation

Name	Spoken Language Generation
Definition	Whereas the generation of spoken language from semantic representations can be sequentialized into generation of text followed by text-to-speech, using text as an intermediate representation may lose information that was available in the original input. An integrated solution can avoid this problem and thereby lead to improved quality and/or simpler system architecture.
Comments	Synonyms: Concept-to-Speech Generation, Meaning-to-Speech System

12. Multilinguality

multilinguality/machine_translation

Name	Machine Translation (MT)
Definition	TBD
Comments

multilinguality/multilingual_information_retrieval

Name	Multilingual Information Retrieval (CLIR)
Definition	Cross-language information retrieval means using queries in one language to search for documents in a different language. Multilingual information retrieval is a broader term, which includes the case where queries in different languages are used, but only for searching documents in the same language. See also the corresponding HLT-Survey Section: http://www.lt-world.org/HLT_Survey/ltw-chapter8-5.pdf
Comments	Synonyms: cross-language information retrieval, translingual information retrieval, sprachübergreifendes Information Retrieval

multilinguality/example-based_translation_and_translation_memories

Name	Example-Based Translation and Translation Memories (TM)
Definition	Translation memories and example-based MT are techniques that reuse parts of existing translations to simplify the translation of new text from the same domain. Whereas translation memories concentrate on the reuse of translated sentences, example-based MT applies this idea to finer units like phrases, terms, and constructions.
Comments
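
The reuse of translated sentences described above can be sketched as a fuzzy lookup in a stored set of sentence pairs. This is only an illustration under stated assumptions: the memory contents, the function name tm_lookup, the 0.7 threshold, and the choice of difflib's character-level similarity ratio are all hypothetical and are not taken from the HLT Survey.

```python
from difflib import SequenceMatcher

# Hypothetical translation memory: source sentence -> stored translation.
MEMORY = {
    "Press the red button.": "Appuyez sur le bouton rouge.",
    "Close the door.": "Fermez la porte.",
}

def tm_lookup(sentence, memory=MEMORY, threshold=0.7):
    """Return (stored_source, stored_translation, score) for the closest
    stored sentence, or None if nothing is similar enough."""
    best = None
    for source, target in memory.items():
        # Character-level similarity in [0, 1]; an assumed stand-in for
        # whatever matching metric a real system would use.
        score = SequenceMatcher(None, sentence, source).ratio()
        if best is None or score > best[2]:
            best = (source, target, score)
    if best is not None and best[2] >= threshold:
        return best
    return None

exact = tm_lookup("Press the red button.")
fuzzy = tm_lookup("Press the green button.")
miss = tm_lookup("Completely unrelated text")
```

A fuzzy match returns the stored translation for a human translator to adapt. Example-based MT would go further and recombine sub-sentence units, which this sketch does not attempt.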

multilinguality/human-aided_machine_translation

Name	Human-Aided Machine Translation (HAMT)
Definition	Human-aided Machine Translation covers all systems and techniques which rely on real automation of the translation function when translating a text from one language to another. As opposed to full Machine Translation, human-aided MT does not rely fully on computational translation, but supports it with pre-editing and post-editing steps, and possibly with interactive human intervention to steer the process or select among alternative translations. Translation of real-time spoken language, by contrast, does not allow for human intervention, except for negotiation functions, such as clarification dialogues. See also the corresponding HLT-Survey Section: http://www.lt-world.org/HLT_Survey/ltw-chapter8-2.pdf
Comments

multilinguality/multilingual_speech_processing

Name	Multilingual Speech Processing
Definition	Speech systems able to understand more than one language. Either the speaker's language is recognised automatically by a language identifier, or multiple language-specific input channels are employed.
Comments

multilinguality/statistical_machine_translation

Name	Statistical Machine Translation (SMT)
Definition	Techniques for machine translation that combine a stochastic model of the target language with a stochastic relation between target and source language. Translation is seen as a decoding task similar to speech recognition. Both types of models can be built automatically from suitable training data.
Comments
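
The combination of the two stochastic models can be sketched as a decoding step that picks the target sentence e maximising P(e) * P(f | e) for a source sentence f. The probabilities, the candidate list, and the function name decode below are invented for illustration. Scoring a fixed candidate list stands in for the search over possible translations that a real decoder performs.

```python
import math

# Toy target language model P(e); values are invented for illustration.
LANGUAGE_MODEL = {"the house": 0.6, "house the": 0.05, "the home": 0.35}

# Toy translation model P(f | e), keyed by (source, target); also invented.
TRANSLATION_MODEL = {
    ("das Haus", "the house"): 0.5,
    ("das Haus", "house the"): 0.5,
    ("das Haus", "the home"): 0.3,
}

def decode(source, candidates):
    """Return the candidate target sentence with the highest
    log P(e) + log P(f | e)."""
    def log_score(target):
        p_lm = LANGUAGE_MODEL.get(target, 0.0)
        p_tm = TRANSLATION_MODEL.get((source, target), 0.0)
        if p_lm == 0.0 or p_tm == 0.0:
            return float("-inf")  # impossible under one of the two models
        return math.log(p_lm) + math.log(p_tm)
    return max(candidates, key=log_score)

best = decode("das Haus", ["house the", "the home", "the house"])
```

Here the translation model alone cannot separate "the house" from "house the"; the language model supplies the preference for fluent word order.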

multilinguality/machine-aided_human_translation

Name	Machine-Aided Human Translation (MAHT)
Definition	Techniques that help to increase the productivity of human translators via suitable computational infrastructure, including translation memories, terminology management, partial machine translation, online lexicons, or other techniques that automate parts of the translator's work, such as speech recognition or accelerated typing techniques applied to human translations. See also the corresponding HLT-Survey Section: http://www.lt-world.org/HLT_Survey/ltw-chapter8-4.pdf
Comments

multilinguality/automatic_language_identification

Name	Automatic Language Identification (LI)
Definition	Automatic Language Identification (LID) is the problem of identifying the language of a sample of speech or written text by an unknown person. Several important applications already exist for LID, for example as a front-end to a call router in a telephone-based application or to a multilingual speech recognition system. See also the corresponding HLT-Survey Section: http://www.lt-world.org/HLT_Survey/ltw-chapter8-7.pdf
Comments	Synonyms: language identification, Sprachenidentifizierung, Automatic Language Identification (speech and text)

multilinguality/multilingual_generation

Name	Multilingual Generation
Definition	TBD
Comments

13. Multimodality

multimodality/representations of_space_and_time

Name	Representations of Space and Time
Definition	Semantic representation and logic of temporal and spatial expressions and concepts in natural languages.
Comments

multimodality/modality_integration_facial_movement_and_speech

Name	Modality Integration: Facial Movement and Speech
Definition	Multimodal fusion combines the output of speech understanding, gesture recognition and facial expression recognition (if available) into a uniform representation of the user's intention.
Comments	Synonyms: multimedia fusion, multimodal fusion

14. Coding and Compression

coding_and_compression/speech_coding

Name	Speech Coding
Definition	Coding algorithms seek to minimize the bit rate in the digital representation of a signal without an objectionable loss of signal quality in the process. High quality is attained at low bit rates by exploiting signal redundancy as well as the knowledge that certain types of coding distortion are imperceptible because they are masked by the signal. Models of signal redundancy and distortion masking are becoming increasingly sophisticated, leading to continuing improvements in the quality of low bit rate signals. See also the corresponding HLT-Survey Section: http://www.lt-world.org/HLT_Survey/ltw-chapter10-2.pdf
Comments

coding_and_compression/text_compression

Name	Text Compression
Definition	Methods for text compression identify and exploit redundancy in text documents in order to obtain a more condensed representation of the information, from which the original data can be recovered without modification (lossless compression). In theory, there is a close relation between compression and prediction: The better a statistical language model can estimate the probability of a word, given some context, the more the text as a whole can be compressed.
Comments
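
The relation between prediction and compression can be illustrated by computing the ideal code length of a text, i.e. the sum of -log2 p(c) over its symbols, under two models. This is a sketch, not a compressor: the sample text and the helper name ideal_code_length are made up, and the unigram model is estimated from the very text being coded, so the cost of transmitting the model itself is ignored.

```python
import math
from collections import Counter

def ideal_code_length(text, prob):
    """Bits an ideal entropy coder would need if symbol c has probability prob(c)."""
    return sum(-math.log2(prob(c)) for c in text)

text = "abracadabra abracadabra"
alphabet = sorted(set(text))
counts = Counter(text)

# Model 1: no prediction at all, every symbol of the alphabet equally likely.
uniform_bits = ideal_code_length(text, lambda c: 1 / len(alphabet))

# Model 2: unigram frequencies, which predict frequent symbols better.
unigram_bits = ideal_code_length(text, lambda c: counts[c] / len(text))
```

The better-predicting unigram model yields the smaller bit count. A model that conditions on context would be expected to lower it further.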

coding_and_compression/speech_nhancement

Name	Speech Enhancement
Definition	The improvement of speech intelligibility by removing background noise from the speech signal. Due to the complexity of speech acoustics and perception, no simple mathematical error criterion can be applied; instead, algorithms and measures need to be developed which accommodate human perception. See also the corresponding HLT-Survey Section: http://www.lt-world.org/HLT_Survey/ltw-chapter10-3.pdf
Comments	Synonyms: noise cancellation, noise removal

coding_and_compression/text_encryption

Name	Text Encryption
Definition	A cryptosystem or cipher system is a method of disguising messages so that only certain people can see through the disguise. Cryptography is the art of creating and using cryptosystems. Cryptanalysis is the art of breaking cryptosystems, that is, seeing through the disguise even when you're not supposed to be able to. Cryptology is the study of both cryptography and cryptanalysis.
Comments

coding_and_compression/speech_encryption

Name	Speech Encryption
Definition	Application of encryption technology to the transmission of speech signals in real time.
Comments	Synonyms: Voice Encryption, Voice Scrambling, Speech Scrambling

15. Mathematical Methods

mathematical_methods/natural_language_parsing

Name	Natural Language Parsing (NL Parsing)
Definition	Parsing (from Latin "pars orationis" = parts of speech) is the syntactic analysis of languages. Natural Language Parsing is the syntactic analysis of natural languages, such as Finnish or Chinese. The objective of Natural Language Parsing is to determine parts of sentences (such as verbs, noun phrases, or relative clauses), and the relationships between them (such as subject or object). Unlike parsing of formally defined artificial languages (such as Java or predicate logic), parsing of natural languages presents problems due to ambiguity, and the productive and creative use of language. See also the corresponding HLT-Survey Section: http://www.lt-world.org/HLT_Survey/ltw-chapter3-6.pdf
Comments	Synonyms: syntactic analysis

mathematical_methods/statistical_modeling_and_classification

Name	Statistical Modeling and Classification
Definition	In most applications of human language technology some tasks cannot be solved by purely deductive (rule-based) approaches, but need quantitative mechanisms to pick the most plausible out of a larger set of potential outcomes, or rank a set of possibilities. Often, the required preferences can be extracted from training examples by suitable statistical techniques. Statistical language modeling for speech recognition and text retrieval and categorization have been among the earliest applications. Recent work in many subfields of HLT focusses on the integration of statistical (implicit) and rule-based (explicit) knowledge. See also the corresponding HLT-Survey Section: http://www.lt-world.org/HLT_Survey/ltw-chapter11-2.pdf
Comments

mathematical_methods/optimization_and_search_in_speech_and_language_processing

Name	Optimization and Search in Speech and Language Processing
Definition	Optimization and search are vital to modern speech and natural language processing systems, as speech recognition and parsing are combinatorial optimization problems, in which from a large number of potential analyses the best ones (those with highest overall probability, smallest number of assumed errors, best fit with contextual expectation...) need to be identified. See also the corresponding HLT-Survey Section: http://www.lt-world.org/HLT_Survey/ltw-chapter11-7.pdf
Comments

mathematical_methods/maximum_entropy_methods

Name	Maximum Entropy Methods (ME, MEM, MEMD, MaxEnt)
Definition	Maximum entropy methods are techniques for the estimation of probability distributions that pick the "most uniform" distribution compatible with the observed statistics. The maximum entropy formulation has a unique solution which can be found by iterative scaling algorithms. Maximum entropy models have been applied to NLP-related tasks such as text segmentation and classification, language modeling, part-of-speech tagging, parsing, and machine translation.
Comments

mathematical_methods/latent_semantic_analysis

Name	Latent Semantic Analysis (LSA)
Definition	Latent Semantic Analysis is a technique for indexing and retrieval that takes advantage of implicit higher-order structure in the association of terms with documents ("semantic structure") in order to improve the detection of relevant documents on the basis of terms found in queries. It uses statistical techniques for dimensionality reduction such as singular-value decomposition or unsupervised soft clustering.
Comments	Synonyms: Latent Semantic Indexing

mathematical_methods/language_modeling

Name	Language Modelling
Definition	Statistical Language Models define probability distributions over sequences of words, and can be used to select the best transcription of an utterance in a speech recognizer. Other applications include spelling correction, natural language generation, and machine translation. The parameters of statistical language models are estimated from a set of training examples. Useful techniques range from simple models based on trigram frequencies up to hybrid models that involve linguistic and world knowledge that may be optimized using sophisticated machine learning approaches. See also the corresponding HLT-Survey Section: http://www.lt-world.org/HLT_Survey/ltw-chapter1-5.pdf
Comments	Synonyms: Language Modeling, Statistical Language Modelling
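
How a language model selects the best transcription can be sketched with a very small model. The sketch below uses bigrams rather than the trigrams mentioned above, to keep it short, and uses add-one smoothing. The training sentences, candidate transcriptions, and function names are invented for illustration.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count contexts and bigrams over sentences padded with <s> and </s>."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
        vocab.update(tokens[1:])
    # One extra vocabulary slot is reserved for words never seen in training.
    return contexts, bigrams, len(vocab) + 1

def log_prob(sentence, model):
    """Log probability of a sentence under the add-one smoothed bigram model."""
    contexts, bigrams, vocab_size = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size))
        for prev, word in zip(tokens, tokens[1:])
    )

model = train_bigram(["we recognize speech", "they recognize speech", "we like the beach"])
candidates = ["we recognize speech", "we wreck a nice beach"]
best = max(candidates, key=lambda s: log_prob(s, model))
```

In a speech recognizer this score would be combined with an acoustic score. Here the language model alone prefers the candidate whose word sequence resembles the training data.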

mathematical_methods/dsp_techniques

Name	DSP Techniques
Definition	A collective term for algorithms that analyse, modify, or code a signal. DSP techniques are employed in most speech technology applications that analyse, generate, or transmit a speech signal. See also the corresponding HLT-Survey Section: http://www.lt-world.org/HLT_Survey/ltw-chapter11-3.pdf
Comments	Synonyms: Digital Signal Processing, Digitale Signalverarbeitung

mathematical_methods/conditional_random_fields

Name	Conditional Random Fields (CRF)
Definition	A framework for building probabilistic models to segment and label sequence data. According to their inventors, CRFs offer advantages over hidden Markov models, stochastic grammars, and maximum entropy Markov models.
Comments

mathematical_methods/support_vector_machines

Name	Support Vector Machines (SVM)
Definition	Support Vector Machines are machine learning algorithms for binary classification based on recent advances in statistical learning theory. The input is mapped into a high dimensional feature space, in which a linear classifier is constructed that maximizes the margin between the classes and hence generalizes well to unseen data. Learning requires only information about the relative distances of the training instances, so it can be performed for arbitrary distance metrics (called kernels) that may be specific to the application domain. These generalized SVMs are called kernel machines.
Comments	Synonyms: Kernel Machines

mathematical_methods/emerging_computing_paradigms

Name	Emerging Computing Paradigms
Definition	Emergent computation is a type of computation that is bottom-up and is neither globally nor totally programmed. Only local information, or a very limited amount of information, is used for a unit of computation. However, a certain global information structure, which is often unexpected, emerges from this computation. For the field of natural language processing, researchers are concerned with the question of the origin and evolution of language.
Comments

mathematical_methods/finite_state_technology

Name	Finite State Technology (FST)
Definition	Finite-state devices such as finite-state automata and finite-state transducers have been known since the emergence of computer science and have recently been used extensively in many areas of natural language processing. Their use is motivated by their time and space efficiency and the fact that many relevant local language phenomena can be easily and intuitively expressed as finite-state devices. See also the corresponding HLT-Survey Section: http://www.lt-world.org/HLT_Survey/ltw-chapter11-5.pdf
Comments	Synonyms: Finite-State Technology
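
As an illustration of a local language phenomenon expressed as a finite-state device, the sketch below encodes a two-state transducer for a toy rewrite rule, "n becomes m before p". The rule, the state numbering, and all names are assumptions chosen for the example and are not drawn from the HLT Survey.

```python
OTHER = "?"  # stands for any input symbol without an explicit transition

# (state, input symbol) -> (next state, output template).
# "{}" in a template is filled with the current input symbol.
TRANSITIONS = {
    (0, "n"): (1, ""),         # hold back the n until the next symbol is known
    (0, OTHER): (0, "{}"),     # copy any other symbol unchanged
    (1, "p"): (0, "mp"),       # a pending n before p surfaces as m
    (1, "n"): (1, "n"),        # emit the pending n, hold back the new one
    (1, OTHER): (0, "n{}"),    # emit the pending n, then copy the symbol
}

# Output emitted when the input ends in a given state.
FINAL_OUTPUT = {0: "", 1: "n"}

def transduce(word):
    """Run the transducer over word and return the output string."""
    state, out = 0, []
    for symbol in word:
        key = (state, symbol) if (state, symbol) in TRANSITIONS else (state, OTHER)
        state, template = TRANSITIONS[key]
        out.append(template.format(symbol))
    out.append(FINAL_OUTPUT[state])
    return "".join(out)

result = transduce("inpossible")
```

The device needs only one table lookup per input symbol and one stored state, which is the kind of time and space efficiency the definition refers to.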

mathematical_methods/connectionist_techniques

Name	Connectionist Techniques (PDP)
Definition	Connectionist techniques are modelled on biological brains, whose higher-order cognitive processes appear to emerge from the interplay of large numbers of simple processing units, the neurons. Rather than being used as a substrate in which to implement known elements playing known roles, neural networks are left to evolve by themselves: they gradually adapt to the environment through a modification of inter-neural connection strengths, which come to reflect the neurons' history of co-activities. Typically, the emerging network represents objects, symbols, attributes, etc. (if at all) in states involving larger numbers of neurons. Connectionism is a field of machine learning and has an affinity to statistics, fuzzy logic, and genetic programming. See also the corresponding HLT-Survey Section: http://www.lt-world.org/HLT_Survey/ltw-chapter11-5.pdf
Comments	Synonyms: connectionism, parallel distributed processing, neurocomputing, neural networks, Konnektionismus, Neuronale Netzwerke, Neurale Netzwerke

mathematical_methods/hmm_methods

Name	HMM Methods (HMM)
Definition	Probabilistic modeling of sequential data by assuming underlying (hidden) state sequences that produce observed (visible) sequences. See also the related HLT-Survey Section on HMM Methods in Speech Recognition: http://www.lt-world.org/HLT_Survey/ltw-chapter1-5.pdf
Comments
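
The idea of hidden state sequences producing visible sequences can be made concrete with the forward algorithm, which sums the probability of an observation sequence over all hidden state paths. The two-state tagging model below, including every probability in it, is invented for illustration.

```python
def forward(observations, states, start_p, trans_p, emit_p):
    """Probability of the observation sequence, summed over all hidden state paths."""
    # alpha[s] = probability of the observations so far, ending in state s.
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {
            s: emit_p[s][obs] * sum(alpha[prev] * trans_p[prev][s] for prev in states)
            for s in states
        }
    return sum(alpha.values())

# Toy model: hidden tags N and V emit the words "can", "fish" and "swim".
STATES = ("N", "V")
START_P = {"N": 0.6, "V": 0.4}
TRANS_P = {"N": {"N": 0.7, "V": 0.3}, "V": {"N": 0.4, "V": 0.6}}
EMIT_P = {
    "N": {"can": 0.1, "fish": 0.4, "swim": 0.5},
    "V": {"can": 0.6, "fish": 0.3, "swim": 0.1},
}

likelihood = forward(("can", "fish", "swim"), STATES, START_P, TRANS_P, EMIT_P)
```

Replacing the sum over previous states with a maximum, and recording which state achieved it, would give the Viterbi search for the single most probable hidden sequence.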

mathematical_methods/inductive_logic_programming

Name	Inductive Logic Programming (ILP)
Definition	Inductive Logic Programming (ILP) is a research area formed at the intersection of Machine Learning and Logic Programming. ILP systems develop predicate descriptions from examples and background knowledge. The examples, background knowledge and final descriptions are all described as logic programs. A unifying theory of Inductive Logic Programming is being built up around lattice-based concepts such as refinement, least general generalisation, inverse resolution and most specific corrections. In addition to a well established tradition of learning-in-the-limit results, some results within Valiant's PAC-learning framework have been demonstrated for ILP systems. U-learnability, a new model of learnability, has also been developed.
Comments	Synonyms: induktive Logikprogrammierung

16. Discourse and Dialogue

discourse_and_dialogue/discourse_and_dialogue

Name	Discourse and Dialogue
Definition	A Discourse is a piece of language including more than one sentence. A Dialogue is a linguistic exchange involving more than one participant. Discourse and dialogue therefore encompass almost all non-local phenomena in language, but in particular discourse coherence, anaphoric dependencies, dialogue structure and the relation between questions and answers. The most obvious practical application in this area is dialogue systems and, more recently, spoken dialogue systems.
Comments	Synonyms: Dialogue and Discourse, Diskurs und Dialog

discourse_and_dialogue/spoken_language_dialogue

Name	Spoken Language Dialogue (SLD)
Definition	Spoken Language Dialogue covers man-machine dialogue systems that use speech as the main means of user interaction, as well as systems that analyse spoken dialogues between humans. Such systems presuppose speech recognition, speech synthesis and spoken language understanding.
Comments

discourse_and_dialogue/discourse_modeling

Name	Discourse Modeling
Definition	Discourse modeling describes all aspects of the relations between groups of sentences in monologue (text) or dialogue, e.g. text coherence, rhetorical relations, intentional and attentional state, centering, dialogue moves, dialogue acts, and reference phenomena, to name just a few.
Comments	Synonyms: Discourse Modelling, Diskursmodellierung

discourse_and_dialogue/spoken_dialogue_systems

Name	Spoken Dialogue Systems (SDS)
Definition	Spoken Dialogue Systems are automatic systems that interact with humans (or other systems) by accepting spoken language input and producing spoken language output. Spoken language input is handled by speech recognition, and language analysis and understanding components. Spoken language output is achieved by playback of recorded human speech, or by speech synthesis. Spoken dialogue systems include a component for dialogue control, which may make use of artificial intelligence techniques. Spoken dialogue systems are widely used for information and transaction systems, such as stock market information or travel reservation.
Comments	Synonyms: speech dialogue systems

discourse_and_dialogue/dialogue_modeling

Name	Dialogue Modeling
Definition	TBD
Comments

17. Language Resources

language_resources/written_language_corpora

Name	Written Language Corpora
Definition	Any collection of more than one text can be called a corpus (corpus being Latin for "body", hence a corpus is any body of text). But the term "corpus", when used in the context of modern linguistics, means a machine-readable text collection which is representative of the language use under investigation.
Comments	Synonyms: written language resources, text corpora, Korpora der geschriebenen Sprache, Textkorpora, corpus de la langue ecrite, corpus textuel

language_resources/linguistically_annotated_corpora

Name	Linguistically Annotated Corpora
Definition	Linguistically annotated corpora are text collections which are enriched with linguistic information.
Comments	Synonyms: annotated corpora, linguistically interpreted corpora, linguistically enriched corpora, annotierte Korpora, linguistisch annotierte Korpora, corpus annote, corpus linguistiquement annote

language_resources/thesauri_wordnets

Name	Thesauri WordNets
Definition	TBD
Comments

language_resources/spoken_language_corpora

Name	Spoken Language Corpora
Definition	Spoken language corpora are collections of recorded spoken language, generally associated with transcriptions of speech and noises, and with annotations at different linguistic levels. Speech corpora can contain read speech, spontaneous speech, dialogues and may be recorded under different conditions with regard to microphones, environment (e.g., lab, office, background noise), and transmission channel (e.g., telephone, broadcast). Speech corpora are used for different purposes, including training and evaluation of speech recognisers, phonetic and phonological research, dialect research, dialogue research, and speech synthesis.
Comments	Synonyms: speech corpora

language_resources/lexicons

Name	Lexicons
Definition	TBD
Comments

language_resources/grammars

Name	Grammars
Definition	TBD
Comments

language_resources/multilingual_corpora

Name	Multilingual Corpora
Definition	Any collection of more than one text in more than one language can be called a multilingual corpus (corpus being Latin for "body", hence a multilingual corpus is any body of multilingual texts). But the term "multilingual corpus", when used in the context of modern linguistics, means a machine-readable collection of multilingual texts which is representative of the language use under investigation.
Comments	Synonyms: multilinguale Korpora, corpus multilingue

language_resources/terminology

Name	Terminology
Definition	TBD
Comments

language_resources/standards

Name	Standards
Definition	Standards provide a common framework for the creation, maintenance and exchangeability of linguistic resources.
Comments	Synonyms: Standards, standards

18. Evaluation

evaluation/evaluation_of_machine_translation_and_translation_tools

Name	Evaluation of Machine Translation and Translation Tools
Definition	Evaluation of MT systems depends strongly on many factors, including whether the system is used for information dissemination, for assimilation, or in a conversational context; the types of texts to be translated; and whether there is a well-defined and limited application domain. The growing number of MT systems on the market that span a wide range of quality along these dimensions has motivated activities towards evaluation standards from national and international organizations.
Comments

evaluation/human_factors_and_user_acceptability

Name	Human Factors and User Acceptability
Definition	TBD
Comments

evaluation/usability_and_interface_design

Name	Usability and Interface Design
Definition	TBD
Comments

evaluation/evaluation_of_broad-coverage_natural-language_parsers

Name	Evaluation of Broad-Coverage Natural-Language Parsers
Definition	Measuring the success of (stochastic or symbolic) parsers.
Comments	Synonyms: Parser Evaluation, Parser-Evaluierung

evaluation/speech_input_assessment_and_evaluation

Name	Speech Input - Assessment and Evaluation
Definition	Assessment and evaluation are concerned with the global quantification and detailed measurement of system performance. Assessment is the process of system appraisal which leads to a global, overall quantification of performance. Evaluation involves the analytic description of system performance in terms of defined factors.
Comments	Synonyms: ASR Evaluation

evaluation/information_retrieval_evaluation

Name	Information Retrieval Evaluation
Definition	TBD
Comments

evaluation/deep_parser_performance_evaluation

Name	Deep Parser Performance Evaluation
Definition	TBD
Comments

evaluation/speech_synthesis_evaluation

Name	Speech Synthesis Evaluation
Definition	Evaluation of speech synthesis traditionally considers intelligibility and naturalness. More recently, expressivity has become an issue with the increasing demand for expressive voices. Due to the multitude of aspects involved, there is no agreed standard for evaluation of speech synthesis systems.
Comments	Synonyms: Evaluation von Sprachsynthese, Evaluation de systemes de synthese de la parole, Evaluacion de sistemas de sintetizacion del habla

19. Conclusion

References

[HLTv1]	Language Technology - A Survey of the State of the Art (First Edition 1997) <http://www.lt-world.org/HLT_Survey/Edit_Board/>
[HLTv2]	Language Technology - A Survey of the State of the Art (Second Edition 2003 in preparation) <http://www.lt-world.org/HLT_Survey/Edit_Board/>
[LT-World]	LT-World <http://www.lt-world.org/>
[OLAC-MS]	OLAC Metadata Set <http://www.language-archives.org/OLAC/olacms.html>