Journal of the American Society for Information Science (JASIS) -- Table of Contents
American Society for Information Science
Silver Spring, Maryland, USA
VOLUME 51, NUMBER 7 and NUMBER 8
CONTENTS (Number 7)
In this issue
Bert R. Boyce
We begin this issue with four diverse papers on clustering as a retrieval method and end with three even more diverse papers on user study.
- Order-Theoretical Ranking
Claudio Carpineto and Giovanni Romano
First we have Carpineto and Romano, who make use of a clustered document file based upon set inclusion relations among terms, merge queries into the clustered document space and consider the shortest path between a query and document as the basis of a retrieval status value. Typical hierarchical clustering methods do not produce all likely clusters due to arbitrary tie breaking, and fail to discriminate between documents with significantly different degrees of similarity to a query. In their concept lattice ranking (CLR), a lattice is built on the basis of term co-occurrence in documents and supplemented rather than totally re-computed with the addition of each new document or query.
Using the CACM and CISI collections and queries, weighted term vectors were computed for use in best match retrieval and in a hierarchical single link clustering using cosine ranking, for comparison with CLR. Lattice construction took 15 minutes for CACM and 2 hours for CISI. Both best match and CLR return better precision and recall measures than hierarchical clustering, but little difference appears between the two. A comparison of CLR and hierarchical clustering on unmatched documents was then carried out using expected search length as a measure. CLR outperforms hierarchical clustering and may be useful in discovering non-matching relevant documents.
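As an illustration of the best match baseline mentioned above, the following minimal sketch ranks documents by the cosine between weighted term vectors. The toy documents, weights, and function names are hypothetical and are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine between two sparse term-weight vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector matches nothing
    return dot / (norm_u * norm_v)

def rank(query, docs):
    """Return document ids sorted by decreasing cosine similarity to the query."""
    return sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)

# Hypothetical toy collection: document id -> {term: weight}
docs = {
    "d1": {"lattice": 1.0, "ranking": 1.0},
    "d2": {"cluster": 1.0, "ranking": 1.0},
    "d3": {"image": 1.0},
}
query = {"lattice": 1.0, "ranking": 1.0}
ranking = rank(query, docs)
```

Here `d3` shares no term with the query and so receives a score of zero, which is the kind of non-matching document that a pure best match ranking cannot distinguish.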
- A Linear Algebra Measure of Cluster Quality
Laura A. Mather
Mather proposes a new measure of cluster effectiveness that is independent of retrieval measures computed for queries on the clustered file, and is based on the theory that the clustering quality of a term document matrix is determined by the disjointedness of the terms across the clusters. The ideal clustering case is that where terms which occur in one cluster occur only in that cluster, that is to say, are mutually exclusive across clusters. Such clusters occur if and only if the matrix is ``block diagonal,'' that is to say, has rows and columns that can be permuted to produce a matrix in which some set of blocks on the diagonal contain the nonzero elements, while the remainder of the matrix contains zero elements. When terms are disjoint, the singular values of the blocks of a block diagonal matrix, taken together, are the same as the singular values of the whole matrix; as the structure diverges from block diagonal and more term intersection occurs, the two sets of singular values diverge. A measure of the distance between the singular values of the term document matrix and those of the cluster matrices indicates cluster value, but is difficult to interpret. By taking random permutations of the matrix and creating clusters one can approximate the mean and standard deviation of the measure for random clusterings; by subtracting that mean from the value for the observed clustering and dividing by the standard deviation of the samples, one can produce the number of standard deviations from a random clustering for the observation. These values can be compared to indicate the best clustering. The computation of the singular values of many large matrices is required and would be expensive. Experimentally the metric correlates significantly with Shaw's F and with the precision measure, increasing as these measures increase.
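The standardization step described above (subtract the mean over random clusterings, divide by their standard deviation) can be sketched as follows. This is only an illustration under stated assumptions: the score used here, the share of terms confined to a single cluster, is a simple stand-in chosen so the example runs with the standard library, and is not Mather's singular value distance. All names and the toy data are hypothetical.

```python
import random
import statistics

def exclusive_fraction(docs, labels):
    """Stand-in quality score: share of terms that occur in exactly one cluster."""
    clusters_of = {}
    for terms, label in zip(docs, labels):
        for term in terms:
            clusters_of.setdefault(term, set()).add(label)
    return sum(len(c) == 1 for c in clusters_of.values()) / len(clusters_of)

def standardized_score(docs, labels, samples=200, seed=0):
    """Standard deviations separating the observed score from random clusterings."""
    rng = random.Random(seed)
    observed = exclusive_fraction(docs, labels)
    shuffled = list(labels)
    random_scores = []
    for _ in range(samples):
        rng.shuffle(shuffled)  # random reassignment, cluster sizes preserved
        random_scores.append(exclusive_fraction(docs, shuffled))
    mean = statistics.mean(random_scores)
    sd = statistics.stdev(random_scores)
    if sd == 0:
        return 0.0  # degenerate case: every random clustering scored the same
    return (observed - mean) / sd

# Hypothetical documents as term sets; the first three and last three share no terms.
docs = [{"a", "b"}, {"a", "c"}, {"b", "c"}, {"x", "y"}, {"x", "z"}, {"y", "z"}]
z = standardized_score(docs, [0, 0, 0, 1, 1, 1])
```

Competing clusterings of the same collection can then be compared by their standardized values, as the summary describes.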
- A Unified Mathematical Definition of Classical Information Retrieval
Dominich reviews the basic retrieval models concentrating upon the vector space and probabilistic representations. He shows that these retrieval models define systems of vicinities of documents around queries which can both be represented by a similarity space and thus have a unified mathematical definition.
- Validating a Geographical Image Retrieval System
Bin Zhu and Hsinchun Chen
Zhu and Chen compare the performance of their Geographical Knowledge Representation System with image retrieval by human subjects. Gabor filters are used to extract low level features from 128 x 128 pixel tiles cut from aerial photograph images. A vector of 60 features describes each tile and a Euclidean distance similarity measure is used to sort the tile images by least distance. Adjacent similar tiles are grouped to create regions which in turn are represented with derived vectors. A Kohonen Self Organizing Map (SOM) is created showing tiles representing the textures to be found in the data. Clicking on these displays the tiles in the same category.
Thirty human subjects were assigned an image and six randomly selected reference tiles to score for similarity to each of the 192 tiles in the image. A second group of ten subjects were asked to draw lines around areas they found similar to the reference tiles. A third group of ten subjects were given the SOM selected reference tiles and asked to categorize each tile in the whole image into categories represented by these reference tiles. The system exhibited no significant difference in precision from the human subjects but performed less well on recall. Humans selected more tiles viewed as similar and the top 5 system and subject tiles were consistently different. Both had difficulty with tiles where texture alone did not distinguish one from another. In tile groupings into regions, humans outperformed the system on both measures but in image categorization no significant difference existed. Adding features other than texture may help performance which is close to inexpert human performance.
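The distance-based sorting of tiles described in this summary can be sketched as below. The three-value vectors stand in for the 60 Gabor features per tile, and the tile names and values are invented for illustration.

```python
import math

def rank_tiles(reference, tiles):
    """Sort tile ids by increasing Euclidean distance to the reference vector."""
    return sorted(tiles, key=lambda name: math.dist(reference, tiles[name]))

# Hypothetical feature vectors (real tiles would carry 60 features each)
tiles = {
    "t1": (0.10, 0.20, 0.30),
    "t2": (0.90, 0.80, 0.70),
    "t3": (0.20, 0.20, 0.30),
}
reference = (0.10, 0.20, 0.25)
ordered = rank_tiles(reference, tiles)
```

The first entries of `ordered` are the tiles whose texture features lie closest to the reference tile.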
- How Can We Investigate Citation Behavior? A Study of Reasons for Citing Literature in Communication
Donald O. Case and Georgeann M. Higgins
Case and Higgins review the previous studies providing lists of reasons for authors' citing behavior, and studies using these categories where investigators classify citation behavior on the basis of content analysis. They also reexamine the smaller set of studies involving surveys of authors as to the reasons for their behavior. The two most highly cited authors appearing in both of two recent studies of the Communication literature were chosen, and all citations to their work in the years 1995 and 1996 were collected. 133 unique citers were identified and sent 32 item questionnaires with the questions from a recent study in the Psychology literature. Returns from 56 were received, 31 for author A and 25 for author B, and responses for the two authors were not significantly different. No new reasons for citation were identified. The top reasons were a review of past work, acting as a representative of a genre of studies, and as a source of a method. Negative citation is quite rare. Twenty five non-redundant items with some indication of importance were subjected to a factor analysis. Seven factors explain 69% of the variance: classic citation, social reasons, negative citation, creative citation, contrasting citation, similarity citation, and cite of a review. Factors predicting citation are perception of novelty and representation of a genre, perception that citation will promote cognitive authority of the citing work, and perception that the cited item deserves criticism.
- Children's Use of the Yahooligans! Web Search Engine: I. Cognitive, Physical, and Affective Behaviors on Fact-Based Search Tasks
In the Bilal study twenty two middle school students were assigned a question to search in Yahooligans! as part of their Science class. The teacher provided ratings of the children's topic knowledge, general science knowledge, and reading ability. A quiz administered to the students indicated knowledge of the Internet and of Yahooligans! in particular. Lotus ScreenCam was used to record 18 of the student-system interactions. Students' transcribed moves were classified and counted with a score of one (relevant) for selection of a link that appears appropriate and leads to the desired information; .05 for the selection of a link that appears appropriate but is not successful, and 0 for the selection of links that give no indication of information leading to success. Weighted effectiveness and efficiency scores are then computed.
Thirty six percent initially browsed subject categories while the rest entered single or multi-word concepts. Key words and in some cases natural language were used in subsequent moves despite the fact that Yahooligans! does not support natural language search. Subsequent activity mixed browsing with term search. Looping and backtracking were very common but the go button using the search history links was unused. Most children scrolled, but seldom through the complete page. Half were successful but all were inefficient.
- Ethnomethodologically Informed Ethnography and Information System Design
Andy Crabtree, David M. Nichols, Jon O'Brien, Mark Rouncefield, and Michael B. Twidale
Crabtree et al. object to traditional ethnographic analysis as applied to information problems on the basis that the application of pre-defined rules and procedures yields an organization of the activity observed from the point of view of the analyst rather than that of the participants. Such a ``constructive analysis'' approach does not describe the actual activities, but in the name of objectivity imposes a structure which obscures the real world practices through which subjects make sense of their surroundings, and produce information.
Ethnomethodology emphasizes rigorous thick description of local practices by assembling concrete cases of performed activity as the direct units of analysis. EM analysis attempts to generate a description in great detail of how the described activity could be reproduced in and through the same practices. Such description provides a sense of the real world aspects of a socially organized setting to systems designers and thus provides the exceptions, contradictions, and contingencies of the activities that otherwise might not be evident. Practitioners of ethnography and computer system design have quite different cultures but communication can lead to far better design practices.
- Annual Review of Information Science and Technology, Vol. 33, 1998, by Martha E. Williams
- IT Investment in Developing Countries: An Assessment and Practical Guideline, by Sam Lubbe
Queen Esther Booker
- Information Brokering, by Florence M. Mason and Chris Dobson
James J. Sempsey
- Information Management for the Intelligent Organization: The Art of Scanning the Environment, by Chun Wei Choo
Donald R. Smith
CONTENTS (Number 8)
In this issue
Bert R. Boyce
- An Evaluation of Retrieval Effectiveness Using Spelling-Correction and String-Similarity Matching Methods on Malay Texts
Zainab Abu Bakar, Tengku Mohd T. Sembok, and Mohammed Yusoff
We begin this issue with Bakar et alia's evaluation of string matching methods on Malay texts. Much of current post-1960 Malay text is in the Rumi alphabet, a romanised system based on English phonemes. English conflation algorithms can be used effectively. Because of prefixes and infixes, stemming alone is not effective, and the addition of n-gram matching is required. Using a data set with 5085 unique Malay words and 84 query words, eight phonetic code lists were created using four coding methods from stemmed and unstemmed dictionaries. One hundred words surrounding a matched key are chosen, equally above and below unless too close to the top or bottom of the list. Stemming proves to be very helpful, as does phonetic coding. It seems that smaller key sizes perform better. Digram, an existing string matching algorithm, gave the best relevant and retrieved results.
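To make the idea of n-gram matching concrete, here is a minimal sketch using two-character n-grams. Scoring the overlap with the Dice coefficient is an assumption made for this example, since the summary does not say which similarity formula the authors used; the word list is illustrative only.

```python
def digrams(word):
    """Set of adjacent two-character substrings of a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def dice(a, b):
    """Dice coefficient on the digram sets of two words (assumed scoring)."""
    ga, gb = digrams(a), digrams(b)
    if not ga and not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))

def best_matches(query, vocabulary, top=3):
    """Vocabulary words ordered by decreasing digram similarity to the query."""
    return sorted(vocabulary, key=lambda w: dice(query, w), reverse=True)[:top]

# Illustrative words: prefixed and suffixed forms still share most digrams.
vocabulary = ["minum", "memakan", "makanan"]
matches = best_matches("makan", vocabulary, top=2)
```

Because affixed forms keep most of the digrams of their root, this kind of matching can recover variants that stemming alone would miss.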
- Managing Heterogeneous Information Systems through Discovery and Retrieval of Generic Concepts
Uma Srinivasan, Anne H.H. Ngu, and Tom Gedeon
Within application domains users with common objectives create heterogeneous databases to store and manage similar data types. Usage patterns indicate the knowledge of the users. The notion, for Srinivasan et alia, is to create a ``middle layer'' of concepts extracted from similar patterns in existing systems and from the use of these systems, which can wrap the existing databases and provide a common access mechanism. Entities defined in existing systems as sets of variables are extracted and classed using similarity measures based on commonality in structure and use patterns. Those classed together represent a common application specific generic concept.
For each class user group pair a ``group data object'' is created. A tree of ``group data objects'' that represents user types at different levels of specificity is generated from user supplied terms and query extracted terms from each user type. A user is mapped into a user type and then the appropriate group data objects are generated and their labels displayed to the user for selection. Selection generates the extractors from each database for that user type in that group data object. Three medical databases clustered yielded eight concept classes and multiple user objects were created. Tests showed varied query production in the same concept classes for the various groups.
- Raising Reliability of Web Search Tool Research through Replication and Chaos Theory
After reviewing the literature of evaluative web search tool research, Nicholson replicates the 1996 Ding and Marchionini search service study ten times during the summer of 1998. Previous work finds replication yields significantly different results over time. The first twenty pages returned by Infoseek, Lycos, Alta Vista and Excite for the five queries were examined and ranked between 0 and 5 for relevance. Differing engine rankings for each replication are the rule. Using two queries, one designed to have a stable answer and another a dynamic answer over time, the four systems were tried again on five successive weeks. New pages appearing in the first 20 pages in each successive week were counted, as were pages that changed ranked position. Both queries showed considerable change week to week. The results were aggregated, and the frequency with which each engine found the highest number of relevant documents was found to show a replicable pattern over all weeks, the odd weeks, and the even weeks. This pattern provides a clear ranking of the engines, which was not determinable from the individual replications.
- The Personal Construction of Information Space
An information space, according to McKnight, is just the objects, real or virtual, an individual uses to acquire information. A repertory grid is a means of externalizing a person's view of the world where a triad of elements is presented and the subject asked to find how two are the same and the third different. The focus that makes this possible is given a rating scale with extremes for both poles, and called a construct. Multiple constructs with element ratings provide an individual's view of a domain. Eleven information sources were elicited from a university lecturer and presented as triads. Ten constructs were elicited and the elements rated on the constructs. A cluster analysis reorders the grid so similarly rated elements and similarly used constructs are adjacent. Both construct and element clusters seem to make sense and likely reflect the subject's views of his information space. It remains to be seen if parts can be shared with other subjects.
- Time-Line Interviews and Inductive Content Analysis: Their Effectiveness for Exploring Cognitive Behaviors
Schamber uses her weather information data collected by time-line interview techniques and content analysis to address the effectiveness of these techniques. By soliciting a sequence of events where weather information was needed and sought, and then the one event in the sequence where information was most actively sought (the key event), that event and those before and after it could be studied in some detail. The time-line technique provides an unobtrusive means of collecting data on perceptions and yields rich data. It is, however, a labor intensive method. The content analysis was also unobtrusive and effective, but also very labor intensive. In this framework criteria are best defined from users' perceptions, which are indicated with validity from self reports.
- Abstracts Produced Using Computer Assistance
Timothy C. Craven
Craven evaluates abstracts produced with the assistance of TEXNET, an experimental system which provides the abstractor with text words and phrases extracted by frequency after a stop-list pass. Three texts of approximately 2000 words each were chosen and for each text a set of 20 different subjects drawn by advertisement within a University community created abstracts using TEXNET. Half got a display of keywords occurring eight or more times, and half got a display of phrases of the same occurrence. All subjects were surveyed as to background and reaction to the software, provided with a demonstration of the software, and told their abstract should not exceed 250 words. Nine of these, including the author abstract, were read by three raters again recruited by advertisement. Analysis shows no correlation between keywords or phrases and quality ratings or usefulness judgements by subjects. Experience did not lead to conciseness, originality or approximation of the author abstract. Female gender correlated positively with length and use of words from the text. Subjects wanted to view text and emerging abstract simultaneously, easy scrolling, standard black on white screens, a dynamic word count and spell checker.
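The kind of keyword display described above (words extracted by frequency after a stop-list pass, shown when they occur eight or more times) might be approximated as follows. This is a sketch under assumptions, not TEXNET's actual code: the tokenization rule, function name, and sample text are hypothetical.

```python
import re
from collections import Counter

def frequent_keywords(text, stop_words, min_count=8):
    """Words outside the stop list that occur at least min_count times."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in stop_words)
    return {w: n for w, n in counts.items() if n >= min_count}

# Hypothetical input: "abstract" occurs 8 times, "text" only 3 times.
sample = "the abstract " * 8 + "of the text " * 3
keywords = frequent_keywords(sample, stop_words={"the", "of"})
```

Only words that clear both the stop list and the frequency threshold would be offered to the abstractor.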
- Encounters with the OPAC: On-Line Searching in Public Libraries
Deborah J. Slone
Slone looks at the behavior of OPAC users conducting known item, area (broad search with most refinement off line), and unknown item searches in a public library. Thirty six participants, who approached the terminals and agreed, answered a pre-search questionnaire on OPAC experience, reason for coming, and length of time spent planning their search. They were then observed, and their searching terms, comments, reactions, age, gender, time on line, and outcome logged. Feelings were inferred from observation and noted except that confidence level was solicited in the questionnaire. Twenty eight began confident, but only 14 displayed confidence during their search. Successful unknown item searchers began broadly, and focused with terms selected from initial results. Area searchers searched broadly for a general area and focused at the shelves using minimal computer resources. Known item searches were quickly effective at the terminal. Frustration, anxiety and disappointment abounded during unknown item searches.
- Using Clustering Strategies for Creating Authority Files
James C. French, Allison L. Powell, and Eric Schulman
When disparate bibliographic databases are integrated, different authority conventions prevent physical combination and require a mapping that hides the heterogeneity from users. French, Powell, and Schulman advance automated techniques for the assistance of those maintaining authority for author affiliations in the Astrophysics Data System. Strings were extracted, clustered, reviewed by a domain expert and iterated to a final form. Concentration was on an ideal set of 38 institutions represented by 1,745 variant strings, with a goal of properly clustering these while excluding the other 12,139 identified strings from the ideal clusters. First a lexical cleanup was run, removing uppercase, country designations at the end of a string, ZIP codes, and state abbreviations, and expanding a list of abbreviations. Then string and frequency of occurrence pairs are sorted and, beginning with the most common string, its distance to all other strings is computed; those within a threshold are clustered with the most common item and removed from consideration. The process is iterated until the file is exhausted. Tested distance measures are: edit distance, i.e. the number of simple operations of four types required for transforming one string to another; edit distance with words rather than characters; and the Jaccard coefficient. Allowing the threshold to be some fraction of the length of the shorter string improves results over a fixed threshold, but the higher thresholds required to cluster all variants still result in significant errors. Required human effort rises with the number of misplaced strings, but such effort is reduced roughly by half by the clustering procedure.
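The iterative procedure summarized above can be sketched roughly as follows, reading the clustering rule as "strings whose distance to the current most common string is within the threshold." Two assumptions are made for this example: the edit distance is plain Levenshtein (insertion, deletion, substitution), which is not necessarily the four-operation variant the authors tested, and the threshold fraction and the affiliation strings are invented.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

def cluster(counts, fraction=0.25):
    """Greedy clustering seeded by the most frequent remaining string."""
    remaining = sorted(counts, key=counts.get, reverse=True)
    clusters = []
    while remaining:
        seed, rest = remaining[0], remaining[1:]
        members, leftover = [seed], []
        for s in rest:
            # Threshold is a fraction of the shorter string's length.
            limit = fraction * min(len(seed), len(s))
            if edit_distance(seed, s) <= limit:
                members.append(s)
            else:
                leftover.append(s)
        clusters.append(members)
        remaining = leftover
    return clusters

# Hypothetical cleaned affiliation strings with occurrence counts
counts = {
    "university of virginia": 40,
    "nasa goddard": 25,
    "university of virgina": 5,
    "universty of virginia": 3,
    "nasa godard": 2,
}
groups = cluster(counts)
```

Raising `fraction` pulls in more distant variants at the cost of more misplaced strings, which mirrors the trade-off the summary reports.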
- Inventing the Internet, by Jane Abbate
Cheryl Knott Malone
- Internet Policy Handbook for Libraries, by Mark Smith
Janie L. Hassard Wilkins