Latest page update: 1997 January 30.
0-1 vector: any vector in which each component is either 0 or 1.
absolute term frequency: the raw count of the number of times that a term occurs in a document or document collection.
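As a minimal sketch of absolute term frequency, the counts for a single document might be computed as follows. The tokenization (lowercasing and splitting on whitespace) and the function name are assumptions made for illustration, not part of the definition.

```python
from collections import Counter

def absolute_term_frequencies(text):
    """Count raw occurrences of each term in a text.

    Assumption: terms are whitespace-separated tokens, compared
    case-insensitively. Real systems usually tokenize more carefully.
    """
    return Counter(text.lower().split())

counts = absolute_term_frequencies("The cat saw the other cat")
print(counts["the"], counts["cat"], counts["saw"])
```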
abstract: any brief one or two paragraph description of the contents of a document, usually by the author.
ACM: Association for Computing Machinery, a professional organization.
ad hoc query: any query that is asked once, requiring search of an entire database.
adaptive model: any data compression method in which the encoding changes or adapts as the statistical properties of an individual document are determined.
agglutinative language: any language in which syntactic relationships are expressed by distinct suffixes.
Aho-Corasick algorithm: a string matching algorithm that uses multiple finite state recognizers for simultaneous matching of several substrings.
algorithm: the specification of a method by which an information system accomplishes a given task.
animation: the presence of motion in a document such as a videotape.
ANSI: American National Standards Institute, the U.S. authority for data encoding and other standards; a text encoding standard.
antonym: a word meaning the opposite of a given word.
approximate match: any matching technique retrieving documents that are similar to, but may not exactly match, the query specification.
arithmetic code: any data compression method that represents an entire document by a single number computed adaptively from the frequencies of letters or pixels within the document.
Arithmetic Mean Coefficient: a similarity measure based on the arithmetic mean.
array: any rectangular arrangement of data, usually of numbers.
ASCII: American Standard Code for Information Interchange, a method for encoding alphanumeric data.
atomic data: data that are not subdivided into smaller units.
automatic indexing: indexing that is performed according to an algorithm, without human intervention.
average information content: a measure of how much information is contained in a typical message from a given set of messages.
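One common formalization of average information content is Shannon entropy, measured in bits. The entry above gives no formula, so the choice of entropy and the function name below are assumptions; this is a sketch, not necessarily the book's definition.

```python
import math

def average_information(probabilities):
    """Shannon entropy in bits: -sum(p * log2(p)) over messages with p > 0.

    Assumption: `probabilities` lists the probability of each message
    in the set and sums to 1.
    """
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

uniform = [0.25, 0.25, 0.25, 0.25]   # four equally likely messages
skewed = [0.5, 0.25, 0.125, 0.125]   # one message dominates

print(average_information(uniform))
print(average_information(skewed))
```

Under this measure, a set of equally likely messages carries more information per message than a set in which one message dominates.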
average precision: a value computed by averaging the precision values at several different recall levels, typically three or eleven levels.
average recall: a value computed by averaging the recall values at several different precision levels, typically three or eleven levels.
average similarity: the average of the similarities of document pairs within a collection.
balance: in an indexing method, having the subcollections identified by index terms be of approximately uniform size.
base representation: representation of a document as a vector of numbers related to every term in the vocabulary used for a collection of documents.
basis vector: one of a set of vectors from which all vectors within a given vector space can be defined.
Bead: a visual information retrieval interface using a landscape metaphor.
bibliography: the list of documents cited by a given document, also called a reference list.
bilevel image: any image in which a pixel has only two values, typically black or white.
binary measure: any measure having only two values.
binary search: a search technique that iteratively discards half of a given set in an effort to locate a desired item. The technique requires a sorted set stored as an array.
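The halving step in binary search can be sketched as below. The function name and the convention of returning -1 for a missing item are illustrative assumptions.

```python
def binary_search(sorted_items, target):
    """Return the index of target in sorted_items, or -1 if it is absent.

    Assumption: sorted_items is sorted in ascending order and supports
    indexing, as the definition requires.
    """
    lo, hi = 0, len(sorted_items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_items[mid] == target:
            return mid
        if sorted_items[mid] < target:
            lo = mid + 1      # discard the lower half
        else:
            hi = mid - 1      # discard the upper half
    return -1

print(binary_search([2, 5, 8, 13, 21], 13))
```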
BIRD: a visual information retrieval interface utilizing a separator array to effect sequential development of a Boolean query two terms at a time.
bit: the smallest unit of data, having only two possible values.
bit map: a representation of a 0-1 vector in which each component is represented by a single bit. Used to represent a set, with 1 representing an element (from a universal set) that is included in the set and 0 representing an element that is not in the set.
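A minimal sketch of a bit map, assuming a small hypothetical universal set and using a Python integer to hold the bits. Set intersection and union then reduce to bitwise AND and OR.

```python
# Hypothetical universal set; position i corresponds to bit i.
UNIVERSE = ["apple", "banana", "cherry", "date"]

def to_bitmap(subset):
    """Encode a subset of UNIVERSE as an integer with one bit per element."""
    bits = 0
    for position, element in enumerate(UNIVERSE):
        if element in subset:
            bits |= 1 << position
    return bits

def from_bitmap(bits):
    """Decode an integer bit map back into a set of elements."""
    return {e for i, e in enumerate(UNIVERSE) if (bits >> i) & 1}

a = to_bitmap({"apple", "cherry"})
b = to_bitmap({"cherry", "date"})
print(from_bitmap(a & b))   # elements in both sets
print(from_bitmap(a | b))   # elements in either set
```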
BMG algorithm: see Boyer-Moore-Galil algorithm.
BookHouse: a visual information retrieval interface using a library metaphor.
Boolean algebra: an algebra based on a certain set of arithmetic rules, used for logical computations. The number of elements in a finite Boolean algebra is always a power of 2. The most common Boolean algebra has only two elements, 0 and 1, and differs from ordinary algebra in that 1 + 1 = 1.
Boolean point: in a visual information retrieval interface, any point representing a Boolean combination of reference terms.
Boolean query: any query in which the individual terms are combined with Boolean or logical connectives.
Boolean retrieval system: any retrieval system using Boolean queries.
Boyer-Moore-Galil algorithm: a string matching algorithm based on matching substrings from the right hand end, rather than the left hand end. This is an O(n) algorithm that in the best case may be faster than other O(n) algorithms by a factor of five or more.
branching factor: in a tree or hierarchical file organization, the maximum number of subunits that a given unit can have.
breeding pair: in a genetic algorithm, two variants that are associated for possible crossover operations.
breeding population: in a genetic algorithm, the replicated population from which the population of variants for the next generation is formed.
broader term: in a thesaurus, any term whose interpretation includes a given term and similar or related terms.
Brown corpus: a well-known frequency study of American texts of various types.
byte: a sequence of eight bits, hence a unit of data having 256 possible values.
caption: any brief text describing a figure in a document.
Cassini oval model: a model of interaction among query terms, in which the similarity of a document to several reference points is computed using the product of the distances from the document to each reference point.
cell: in general, an element position in an array. Specifically, one of four element positions in the separator array for BIRD.
characteristic function: a function whose value is 1 for elements of a given set and 0 for elements not in the set. It is used, for example, to separate documents satisfying a query (1) from those not satisfying it (0).
citation index: an index that lists documents citing a given document.
citation processing: any retrieval technique in which documentary citations are traced to identify documents related to a given one.
city block distance: a distance computed using the sum of the absolute values of distance changes in each direction, so called because it counts the number of blocks traversed in moving from one location to another in a city; L1.
classification bin: in BIRD, a bin in which a selected subset of documents can be stored.
cluster point: any point that represents a cluster of documents.
clustered file: any file in which the data elements are organized by a clustering technique.
clustering technique: any technique by which relationships among data elements such as documents are determined and closely related elements are grouped into clusters.
CNF: see conjunctive normal form.
co-citation: the phenomenon of two documents being cited by a given document, used as a measure of similarity of the two documents.
coefficient of association: any measure of similarity between two documents.
co-filter: use of a user profile as a second reference point in conjunction with a query.
collision: in hashing, the situation in which two data items are assigned to the same location.
communication theory: see transmission theory.
component: an individual element in a vector.
concept: an idea within a document, in contrast to the specific terms used to express that idea.
concordance: an inverted index identifying all occurrences of each term within a body of text.
conditional probability: the probability that a given event occurs, assuming that another event has occurred.
Conditional Probability Coefficient: a similarity measure based on conditional probability.
conjunct: in the disjunctive normal form, a group of individual terms joined by AND; in the conjunctive normal form, a group of disjuncts joined by AND.
conjunctive model: a model of interaction among query terms, in which the similarity of a document to several reference points is computed using the maximum of the distances from the document to each reference point.
conjunctive normal form: a standard form for logical expressions, in which individual terms are joined by OR, and such groups of terms are joined by AND.
conjunctive query: any Boolean query using only AND and NOT.
content-bearing words: words that are deemed to relate to the concepts in a document. See also stop list.
content search: any search to locate a record or document having a specific content.
context-dependent: any character or term whose interpretation depends on the context within which it occurs.
context encoding: any encoding method in which the code for a given symbol depends on the context within which it occurs.
contingency table: any array in which the cells represent specific combinations of conditions, often a 2 x 2 table in which each of two conditions may or may not occur.
continuous tone image: any image in which each pixel may have any of a range of values. Typically the value of an individual pixel is closely related to the values of surrounding pixels.
contour: within a document space, the boundary of a region containing documents to be retrieved.
controlled vocabulary: any restricted set of words and phrases that are used to describe documents within a given set.
copyright: the right of an individual or corporation to receive credit for, and benefit from, published works.
Cosine Coefficient: a similarity measure based on the cosine of the angle between two documents as represented by term weight vectors.
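The usual formula for the cosine of the angle between two term weight vectors is the dot product divided by the product of the vector lengths. The sketch below assumes both vectors list weights for the same terms in the same order; returning 0.0 for an all-zero vector is a convention chosen here, not something the definition specifies.

```python
import math

def cosine_coefficient(u, v):
    """Cosine of the angle between two equal-length term weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention: an empty document matches nothing
    return dot / (norm_u * norm_v)

# Documents sharing no terms score 0; proportional vectors score about 1.
print(cosine_coefficient([1, 0], [0, 1]))
print(cosine_coefficient([1, 2, 3], [2, 4, 6]))
```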
cosine measure: the cosine of the vector angle between two documents, used as a measure of similarity.
coverage ratio: the proportion of the relevant documents known to the user that are actually retrieved.
cross referencing: in a thesaurus, reference to terms related to the given term.
crossover: in a genetic algorithm, any method of exchanging portions of two variants to create two new variants.
crossover rate: in a genetic algorithm, the fraction of breeding pairs that are chosen for a crossover operation.
current awareness system: any information retrieval system in which users are automatically notified of any new documents that may relate to their interests; also called selective dissemination of information, and routing system.
data: the documents received, stored and retrieved by an information endosystem.
data compression: the encoding of data in less than one byte per character for text, and in as little as one or two bits per pixel for images.
data fusion: the merging of search results from several different databases, possibly using several different search techniques.
data model: in data compression, the model, adaptive or static, used to represent the data.
deep structure: the structure of a sentence related to its meaning, independently of the specific syntax used.
default bus: in a finite state recognizer, a bus that is used for any unspecified character.
deleted average similarity: average similarity of documents within a collection, computed on the assumption that occurrences of a given term have been deleted from the computation.
DeMorgan's Laws: logical laws governing the interaction of AND, OR, and NOT.
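In their standard form, the two laws state that NOT (A AND B) equals (NOT A) OR (NOT B), and NOT (A OR B) equals (NOT A) AND (NOT B). A short sketch that checks both over every truth assignment (the function name is illustrative):

```python
from itertools import product

def de_morgan_holds():
    """Check both De Morgan laws for every combination of truth values."""
    for a, b in product([False, True], repeat=2):
        if (not (a and b)) != ((not a) or (not b)):
            return False
        if (not (a or b)) != ((not a) and (not b)):
            return False
    return True

print(de_morgan_holds())
```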
deterministic: any algorithm or automaton such that for any given set of data each step has only one possible successor.
device: any computer or other tool used to process information.
Dewey Decimal classification: a system of classifying documents according to contents.
Dice's coefficient: a similarity measure developed by Dice.
dimensional compatibility: in an abbreviated vector representation of documents, the concept that a given position must refer to the same term in each of the documents, whether or not that term occurs.
direct file: any file of documents without an index into it.
direct search: any search of each document within a file to locate those containing a given term.
discriminant function: in probabilistic retrieval, a function that determines whether a given document should be retrieved.
disjunct: in the conjunctive normal form, a group of individual terms joined by OR; in the disjunctive normal form, a group of conjuncts joined by OR.
disjunctive model: a model of interaction among query terms, in which the similarity of a document to several reference points is computed using the minimum of the distances from the document to each reference point.
disjunctive normal form: a standard form for logical expressions, in which individual terms are joined by AND, and such groups of terms are joined by OR.
disjunctive query: any Boolean query using only OR and NOT.
dissimilarity measure: any measure in which high values represent documents that are dissimilar and low values such as 0 represent documents that are similar.
distance measure: any measure relating two entities that satisfies certain conditions: zero distance only between an entity and itself, non-negativity, symmetry, and the triangle inequality.
distance space: any space in which documents are positioned according to their distances from given reference points.
DNF: see disjunctive normal form.
doctrine of fair use: the concept in copyright law that limited individual use of any document is permitted without specific permission of the copyright holder.
document: any stored data record in any form.
document analysis: the process of analyzing a scanned document to determine its components such as headings, paragraphs, and figures.
document cluster: any group of related documents.
document-document matrix: an array used to compare documents within a collection according to a given criterion.
document identifier: any number or other code uniquely identifying a document.
document reference number: the identifier by which an information system refers to a document.
document space: a conceptual space in which documents are distributed according to given characteristics, often term occurrences.
document surrogate: any limited representation of a full document.
EBCDIC: Extended Binary Coded Decimal Interchange Code, a method for encoding alphanumeric data, now largely obsolete.
economy: how well the information system meets the economic goals of the funder.
ectosystem: those system factors that are not under control of the designer, including the people who are involved with the system, the forms in which information is available, and the equipment and technology available for the system.
effectiveness: the quality of the information system response to the information need.
efficiency: the time and effort required for the information system to respond to the information need.
eleven-point average: computation of average precision or average recall from the values at eleven recall or precision points, respectively, namely, at 0.0, 0.1, ..., 1.0.
ellipsoidal model: a model of interaction among query terms, in which the similarity of a document to several reference points is computed using the sum of the distances from the document to each reference point.
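The Cassini oval, conjunctive, disjunctive, and ellipsoidal models defined in this glossary differ only in how the distances to the reference points are combined (product, maximum, minimum, and sum, respectively). The sketch below assumes Euclidean distance and returns the combined distance value; how that value is turned into a similarity score is left open, and all names are illustrative.

```python
import math

def distances_to(document, reference_points):
    """Euclidean distance from a document vector to each reference point (assumed metric)."""
    return [
        math.sqrt(sum((d - r) ** 2 for d, r in zip(document, point)))
        for point in reference_points
    ]

def combine(dists, model):
    """Combine per-reference-point distances according to the named model."""
    if model == "conjunctive":
        return max(dists)
    if model == "disjunctive":
        return min(dists)
    if model == "ellipsoidal":
        return sum(dists)
    if model == "cassini":
        result = 1.0
        for d in dists:
            result *= d
        return result
    raise ValueError("unknown model: " + model)

dists = distances_to((0, 0), [(3, 4), (6, 8)])
for model in ("conjunctive", "disjunctive", "ellipsoidal", "cassini"):
    print(model, combine(dists, model))
```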
endosystem: those system factors that the designer can specify and control, such as the equipment, algorithms, and procedures used.
Euclidean distance: ordinary straight line distance; L2.
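For comparison with city block distance (L1), both measures can be sketched for two points given as equal-length coordinate sequences. The function names are illustrative.

```python
import math

def city_block_distance(u, v):
    """L1: sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))

def euclidean_distance(u, v):
    """L2: square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Moving 3 units one way and 4 the other: 7 blocks walked, 5 in a straight line.
print(city_block_distance((0, 0), (3, 4)))
print(euclidean_distance((0, 0), (3, 4)))
```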
exact match: any document that exactly matches the terms and criteria in a query.
exclusive OR: interpretation of "or" as meaning either one or the other but not both.
exhaustivity: the extent to which a given set of index terms covers all topics and concepts met in a document set.
expected search length: the average number of documents to be examined to locate a given number of relevant documents.
expert system: any inferential information system built on a knowledge base.
extended Boolean query: see weighted Boolean query.
extended user profile: any user profile that includes user characteristics that cannot be directly related to terms in documents, such as levels of education and experience.
extract: any brief description of a document formed by selecting certain sentences from the document.
extrinsic measure: any document similarity measure that depends on reference to some point independent of the two documents.
failure link: in the KMP algorithm, a link to be taken when a required character is not present.
fallout: the proportion of non-relevant documents that are retrieved.