Automatic Methods for Determining the Semantic Difference Between
Collections of Documents
William M. Pottenger and Dmitry Zelenko
8:45-9:30
Abstract:
A standard approach to semantic retrieval performs statistical correlations
on each subject area to support deeper retrieval. However, true information
retrieval involves entire sessions, not individual queries. For example,
a user first chooses subjects of interest and then retrieves pertinent documents.
Next, the user repartitions the collection for further queries more closely
related to the desired topic. An unaided human user can perform this task
for tens or possibly hundreds of items, but automated assistance is clearly
required to refine a query on a set of tens or hundreds of thousands of
items, as are routinely found on the World Wide Web.
We have developed automatic techniques for clustering categories from a collection and then indexing concepts from each of the identified categories. The clustering techniques produce a Category Map [Orwig, Chen and Nunamaker 1995] and the indexing techniques produce a Concept Space [Chen, Schatz, Martinez and Ng 1995]. Collectively we refer to these techniques as semantic indexing.
Concept Spaces are collections of abstract concepts which are generated from concrete objects. The concepts are typically labeled with text phrases and the collections have traditionally been text documents. However, except for the generation operations, the logical concepts are independent of the physical objects they represent.
A Concept Space thus summarizes the statistically derived semantics of a given collection. In light of this fact, we have developed automatic techniques which enable us to perform a "Semantic Difference" (or semantic 'diff') between two collections. This gives us a quantitative measure for determining the difference, or semantic distance, between two collections.
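
As a rough illustration only, the sketch below (in Python) computes one possible
semantic diff: one minus the cosine similarity between two concept spaces, each
reduced to a mapping of concept phrases to weights. The phrases, weights, and the
choice of cosine similarity are invented for illustration; the abstract does not
specify the actual measure.

    import math

    def semantic_distance(space_a, space_b):
        """Distance in [0, 1] between two concept spaces, each a dict
        mapping a concept phrase to its weight in the collection."""
        vocabulary = set(space_a) | set(space_b)
        dot = sum(space_a.get(t, 0.0) * space_b.get(t, 0.0) for t in vocabulary)
        norm_a = math.sqrt(sum(w * w for w in space_a.values()))
        norm_b = math.sqrt(sum(w * w for w in space_b.values()))
        if norm_a == 0.0 or norm_b == 0.0:
            return 1.0
        return 1.0 - dot / (norm_a * norm_b)

    # Two small, invented collections indexed by a handful of concept phrases.
    physics = {"quantum computing": 0.9, "superconductivity": 0.7, "lasers": 0.4}
    biology = {"gene expression": 0.8, "protein folding": 0.6, "lasers": 0.3}
    print(semantic_distance(physics, biology))   # ~0.9: semantically distant
    print(semantic_distance(physics, physics))   # ~0.0: identical collections
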
One of the key problems that can be addressed with this technology is the evaluation of Concept Spaces and Category Maps, particularly in that we are experimenting with alternative algorithms and implementations for computing indexes of this nature.
In this seminar, we present the research results of our experiments in the automatic comparison and validation of Concept Spaces.
Cross-media validation in a multimedia retrieval system
Michael Ortega, Kaushik Chakrabarti, Kriengkrai Porkaew and Sharad
Mehrotra
9:30-10:15
Abstract:
Conventional retrieval systems are commonly measured by their performance
in terms of precision and recall. While the merit of these metrics is under
debate, they enjoy widespread use. A central problem with these metrics
is the implicit requirement of a small collection for which few experts
can develop queries and determine relevant and non-relevant sets. Increased
collection sizes force us to investigate alternate metrics to determine
the quality of retrieval. The MARS research group at the University of Illinois
explores multimedia retrieval. While information retrieval is not a solved
problem, some domains have seen significant progress in retrieval quality. We
propose to exploit spatial and temporal associations between media, using one
medium as a reference and another as the medium under test. An automatic
technique can compare the results and determine how much agreement exists
between the test and reference media, thus validating retrieval on the test
medium under the assumption that the reference results are correct.
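
As a minimal sketch of this validation idea (assuming, purely for illustration,
that results obtained through the reference medium are taken as the relevant
set), the Python fragment below scores test-medium retrieval with ordinary
precision and recall. The object identifiers and the choice of measure are
hypothetical, not the MARS group's actual procedure.

    def agreement(reference_results, test_results):
        """Precision and recall of test-medium results, scored against the
        result set obtained through the reference medium."""
        relevant = set(reference_results)
        retrieved = set(test_results)
        hits = len(relevant & retrieved)
        precision = hits / len(retrieved) if retrieved else 0.0
        recall = hits / len(relevant) if relevant else 0.0
        return precision, recall

    # Objects retrieved via text (reference) versus via image features (test).
    by_text = ["obj3", "obj7", "obj9", "obj12"]
    by_image = ["obj7", "obj9", "obj4"]
    print(agreement(by_text, by_image))   # roughly (0.67, 0.5): partial agreement
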
Measures for Evaluating Database Selection Techniques
James C. French, Allison L. Powell, Charles L. Viles, Travis Emmitt
and Kevin J. Prey
10:30-11:15
Abstract:
There have been a number of research efforts focusing on database selection
and distributed searching; however, the variety of test environments and
evaluation measures employed by these researchers has made it difficult
to compare the results. We have created a testbed for evaluating database
selection techniques and have conducted a study to examine the effectiveness
of one specific technique. We will describe the evaluation measures that
we are currently employing in our ongoing investigation and discuss their
effectiveness in terms of the specific study.
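
As one hedged example of the kind of measure such a testbed might employ (the
abstract does not name the specific measures used in the study), the sketch below
compares an estimated ranking of databases against the best possible ranking,
using the number of relevant documents reachable through the top-n selected
databases. All names and counts are invented.

    def recall_at_n(estimated_order, relevant_counts, n):
        """Relevant documents reachable via the top-n databases chosen by the
        selection technique, divided by those reachable via the best possible
        top-n databases."""
        best_order = sorted(relevant_counts, key=relevant_counts.get, reverse=True)
        achieved = sum(relevant_counts[db] for db in estimated_order[:n])
        ideal = sum(relevant_counts[db] for db in best_order[:n])
        return achieved / ideal if ideal else 1.0

    # Three hypothetical databases holding 10, 3, and 0 relevant documents.
    counts = {"dbA": 10, "dbB": 3, "dbC": 0}
    print(recall_at_n(["dbB", "dbA", "dbC"], counts, 1))   # 0.3: poor first pick
    print(recall_at_n(["dbA", "dbB", "dbC"], counts, 1))   # 1.0: optimal first pick
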
DL Metrics: Web Characterization and 4S
Edward A. Fox, Ghaleb Abdulla, and Neill Kipp
11:15-12:00
Abstract:
This presentation will provide perspective on DL metrics from three different
sources. First, there will be a brief report on the work of the W3C Web
Characterization Group, which has developed metrics and scenarios to characterize
the Web. Second, there will be a summary of the recently completed dissertation
of Ghaleb Abdulla, who studied Web traffic and measured certain aspects
of digital library usage. Third, there will be an explanation of how the
4S model can provide a framework for describing and studying digital libraries,
based on the doctoral studies of Neill Kipp.
Dealing with Complex Evaluations of Digital Libraries
Paul Kantor, Rutgers.
1:30-2:15
Abstract:
The "performance" of a digital library is a complex relation among
inputs, outputs and constraints. Realistic evaluation of alternative technologies
and designs for digital libraries will result in a wealth of data. Typically,
configuration A will seem better than configuration B with regard to some,
but not other, scales or criteria. This will be true whether the measures
are measures of input (cost, bandwidth, operating expense) or of output
(activity, timeliness, reliability, usability, ...). In such situations
it is not possible to arrive at a single realistic measure of value that
takes all of these aspects into account. On the other hand, there is a rigorous
method for determining that some configurations are "dominant" in the sense
that no other configuration is definitely better. For configurations that
are not dominant, it is possible to find the comparison set that exposes
their shortcomings, and even to put a number to the amount of the shortfall.
This technology, called Data Envelopment Analysis, is an essential tool
in the Rutgers project to understand and take advantage of the "Human
in the Loop" in the design of digital libraries to serve widely diverse
populations.
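
The notion of dominance invoked above can be illustrated with a small sketch:
a configuration is dominant if no other configuration is at least as good on
every criterion and strictly better on at least one. Full Data Envelopment
Analysis goes further, solving a linear program to quantify each non-dominant
configuration's shortfall; that step is omitted here, and the scores below are
invented.

    def dominates(a, b):
        """True if configuration a dominates b (higher scores are better)."""
        return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

    def dominant_set(configs):
        """Names of configurations not dominated by any other configuration."""
        return {name for name, scores in configs.items()
                if not any(dominates(other, scores)
                           for other_name, other in configs.items()
                           if other_name != name)}

    # Invented scores per configuration: (timeliness, reliability, usability).
    configs = {"A": (0.9, 0.7, 0.8), "B": (0.6, 0.7, 0.5), "C": (0.8, 0.9, 0.6)}
    print(dominant_set(configs))   # {'A', 'C'}: B is dominated by A on every scale
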
New Types of Metrics for the Changing Landscape of Visual Dominant Interfaces
Jim Thomas
2:15-3:00
Abstract:
We are on the verge of the discovery of a new human information discourse. This
new discourse will enable people to view, search, retrieve, manage, transform,
and further present masses of information such as those found in the digital
libraries of the future. Understanding the foundations of this new human information
discourse will help guide our selection and refinement of the appropriate metrics.
It is no longer the visual paradigm of point-and-click interaction that is the
dominant factor. What will have to be measured are the human-to-information-space
bindings facilitated by a suite of high-order interaction techniques.
This presentation will give a vision of this new discourse and the fundamental
advances that are envisioned to change how we interact with our personal to
public information spaces.