March 8, 2007
These are notes consolidated from postings on Karen Coyle's blog. I was typing these notes in real time, so they are fairly crude.
It probably helps to be an insider. This report on the LoC page for the Working Group does a better job of bringing together the messages of that day than I did.
Based on the meeting presentations and comments, two main information user and use environments for bibliographic data are apparent: a consumer environment and a management environment. The consumer environment relates to the end-user of the bibliographic data, the information consumer, as described by Karen Markey and Timothy Burke, and services that are designed to assist the end-user in finding relevant information, from search engines to specialized catalog interfaces. The management environment pertains to resource collection management. Although these two environments represent different perspectives of bibliographic data, they are interrelated...
This tension between the user view and the management view is something that I keep coming up against. Whenever the question comes up of how we will define the library catalog of the future, most librarians exhibit an interesting schizophrenia, trying simultaneously to satisfy the library management need of inventory control and the much broader needs of users who simply want the best information now, no matter who owns it. We really must resolve this conflict if we are to move forward.
This meeting was announced about two weeks ago, catching many of us by surprise. As I noted in my setup post, the meeting was originally intended to have an "invitation only" audience. The switch to "open to the public" may have come late in the planning. There were about 50 people there, most of them from the immediate area. The members of the LoC committee were also there.
This committee has a huge task: to define the future of "bibliographic control." No one defined the term bibliographic control during this meeting, and in fact it was rarely voiced as a term. That may be for the better, because it describes something that libraries have traditionally done, and at least some people are suggesting that we shouldn't do it in the future. Thus the "future of bibliographic control" may be an oxymoron.
By the end of the day, however, none of us in the audience could have made a clear statement about the day's topic. The speaker who seemed most on track (and who was the most interesting, IMO) was Timothy Burke, professor of history. Burke talked about how he searches for information, but most importantly he talked about why he searches for information. Some examples he gave were:
He also talked about the sociology of knowledge, that is, needing to know who is authoritative, what work has influenced other work, what the opposing camps are in a field, and how a line of thought has developed. In the discussion afterward, libraries were described as static while information is social and dynamic. Later, Lorcan Dempsey summarized this with a concept from Eric Hellman: the difference between lakes and rivers. Libraries are lakes; a little comes in, a little goes out, but it pretty much stays the same. Information as we use it in the networked world is a river: fast moving, and you never step into the same place twice. Burke offered that libraries could decide that they will specialize in the static, stable part of our information use, and leave the rest to others, but he acknowledged that would not be a good idea. (Unfortunately, that is our status today.)
When Andrew Pace showed the NCSU Endeca catalog, I could see some of Burke's dynamism taking place in the ability to get the information from different angles. Pace, however, began his talk by explaining that the catalog was designed to work with the data that they had, that is, standard MARC records. See his wishlist at the end of his talk.
Two speakers made specific comments about problems with MARC. Bernie Hurley showed that much of the detail of MARC is never used and, at the same time, that creating indexes from the fields that are used is very complex because the data for a single index is scattered over many fields and subfields. (Think "title.") Oren Beit-Arie of Ex Libris had a list of MARC problems, including the resource types (scattered throughout the LDR, 006, 007, 008), uniform titles (which he thinks are not working as they should), and internationalization, which MARC handles insufficiently.
There was some interesting discussion about full text. Dan Clancy, of Google Book Search, talked about the difficulties of doing ranking with full text books. He stated that Google does not organize web information - the web contains its own organization in the form of links and link text, which give you both the connection between documents and the context for that connection. The main revelation in this talk for me was that they are experimenting with full text scans for de-duplication. This is intriguing when you think about how you could map "likeness" when you have the full text and the images in a large body of books.
Some brave statements were made:
Burke: we may have to forget about backward compatibility
It was a provocative day, and although there wasn't a lot that was really new, it was interesting to see that there is some commonality of thought coming from what are essentially different perspectives. As I process more of this, I will add ideas from this day to the futurelib wiki so we can work with them there.
Note: this was really the stellar talk in this meeting. Not only was this guy the only non-librarian, he was the thoughtful user that we all hope to meet.
Dr. Burke is Associate Professor in the Department of History at Swarthmore College. He wrote a piece in 2004 titled "Burn the Catalog." In this he says:
I’m to the point where I think we’d be better off to just utterly erase our existing academic catalogs and forget about backwards-compatibility, lock all the vendors and librarians and scholars together in a room, and make them hammer out electronic research tools that are Amazon-plus, Amazon without the intent to sell books but with the intent of guiding users of all kinds to the books and articles and materials that they ought to find, a catalog that is a partner rather than an obstacle in the making and tracking of knowledge.
Burke presents himself as "the outsider." An academic, but not in the library or in information fields. His talk (excellent) was about how he gets/uses/searches for information. He started with a story about helping a student search in an area in which he wasn't terribly familiar. The topic was about economics, politics, and China. He said that they began with a World Bank report that had some citations. But they needed some context: who is a trusted source in this area? Who is authoritative? They tried the library catalog and LCSH, and finally went to Amazon for a current book on Chinese economy. Why Amazon? It was the easiest place to find what's new and what people are reading. Then from there they went into articles with author names, and only then did they turn to Google because they needed some knowledge about the topic in order to interpret the "torrent of results" that would be retrieved.
How/why he searches: (He has obviously thought about this a lot)
The tools he needs:
What's not out there?
What search can't do and shouldn't try to do: tell me in advance the key words I need to do my searches. A necessary permanent feature is that search is a multi-step practice; search teaches you something.
Tony does technology development with the Nature Publishing Group, was on the NISO OpenURL standard committee, and is the creator of Connotea, a social tagging system for scientists.
Talk title: Agile Descriptions
Tony's talk was a review of the various Web 2.0 microformats available. He refers to this as "Rivers of Metadata." He distinguishes between "Markup of documents (semantics)" and "Exposed metadata (microformats)."
Exposed metadata (microformats) includes:
Exposed metadata could replace custom APIs for metadata exchange. Pages that are marked up with microformats can be turned into RDF for use in the semantic web.
Here he goes through various microformats, some of which connect to the kinds of things that Burke was talking about: hCard, hReview, hCite -- which allow one to make connections between things. xFolk for bookmarks. Although all of these can be used to make connections, it isn't always clear what the connection is. This came up in the discussion after Burke's talk, which is that ranking things by popularity can be mis-used... but Burke pointed out that popularity, even if you don't know WHY, tells you something about the sociology of the knowledge.
He describes tags (as in social tagging) as "simple labels" and as personal "aides-mémoire". Burke talked about how some of the searching he does is to confirm a memory -- we seem to do this a lot, we leave bread crumbs all over, but generally they don't connect to each other. Microformats are turning into usable bread crumb paths.
Now he's showing a topic map based on the author-assigned keywords from some Nature journals. In the topic map, the tag "pediatric urology" is a larger blob than "urology." He explains this by saying that "tags are created in a context." You can see this with Flickr -- the tags something is given are within the context of the person putting the picture on the site. At the time, they are looking just at that one photo. They aren't making connections in the sense that Burke wanted, and the tags probably only make sense in that context -- but the context is not knowable to anyone but the tagger. The upshot is, however, that a topic map made from tags will not look like a topic map done as a general exercise or using a normal topical hierarchy.
Andrew Pace is Head of Information Technology, North Carolina State University Libraries, the folks who created one of the first faceted library user interfaces using Endeca technology.
Title: The Promise and Paradox of Bibliographic Control
Pace starts off with "Rumsfeld's law" (which he claims he will now retire): You search the data you have, not the data you want to have. (I didn't get that right - Andrew, please correct)
The now famous NCSU/Endeca catalog was designed to overcome some "regular" library catalog problems:
Andrew quoted Roy Tennant saying that the library catalog "should be removed from public view."
Catalogs are going to change more frequently than they have in the past, and have to adapt to new technology, including different kinds of screen technologies. They need to be flexible.
The "next gen" catalog is really responding to "this gen" users. (By the time we get to "next" we'll be waaaaaay behind.)
Data Reality Check
In the "old" catalog, 80 MARC fields were indexed in the keyword index -- 33 of those are not publicly displayed. There are 37 different labels in the display. In Endeca they indexed 50 MARC fields.
Simple data are the best. They were thinking of going to XML, but Endeca preferred a flat file, basically a display form of the MARC record. They removed punctuation.
With the Endeca system they were able to re-index their entire database every night, without bringing down the existing system. This meant that they were able to tweak relevance algorithms many times to get it right. (How many of us don't think of a "re-index" as a two-week job?) This kind of ability to manipulate the data makes a huge difference in how we can perfect what we do.
Andrew then gave the usual, impressive demo of the NCSU catalog, and the facets. It's easy to see how far superior this is to the standard library catalog.
How to Relevance Rank
Slide: Relevance ranking TF/IDF not adequate (Andrew, what does TF/IDF mean?) (Andrew: TF-IDF stands for "term frequency/inverse document frequency". The article in Wikipedia on it is pretty good.)
Basically, we haven't really figured out how to do ranking with library metadata. The NCSU catalog used some dynamic ranking (phrase, rank of the field, weights), plus static ordering based on pub date.
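To make the ranking idea concrete, here is a minimal sketch of what "dynamic ranking plus static ordering" could look like: a TF-IDF (term frequency, inverse document frequency) score weighted by field, with ties broken by publication date. This is not the NCSU/Endeca implementation, which the talk did not describe in detail. The field names, the weights, and the sample records are all assumptions made up for illustration.

```python
import math
from collections import Counter

# Hypothetical field weights: a match in the title counts more than one in notes.
FIELD_WEIGHTS = {"title": 3.0, "subject": 2.0, "notes": 1.0}

# Illustrative records only; not real catalog data.
SAMPLE_RECORDS = [
    {"id": "a", "title": "Revolutionary war letters",
     "subject": "united states history", "notes": "", "year": 1990},
    {"id": "b", "title": "Naval history",
     "subject": "", "notes": "covers the revolutionary war", "year": 2005},
    {"id": "c", "title": "Revolutionary war letters",
     "subject": "united states history", "notes": "", "year": 2001},
    {"id": "d", "title": "Gardening",
     "subject": "plants", "notes": "", "year": 2010},
]


def tokenize(text):
    """Lowercase and split on whitespace (deliberately naive)."""
    return text.lower().split()


def rank(query, records):
    """Return record ids ordered by weighted TF-IDF, then by newest year."""
    n = len(records)
    # Document frequency: how many records contain each term in any field.
    df = Counter()
    for rec in records:
        terms = set()
        for field in FIELD_WEIGHTS:
            terms.update(tokenize(rec.get(field, "")))
        df.update(terms)

    scored = []
    for rec in records:
        total = 0.0
        for field, weight in FIELD_WEIGHTS.items():
            tf = Counter(tokenize(rec.get(field, "")))
            for term in tokenize(query):
                if tf[term]:
                    idf = math.log(n / df[term]) + 1.0
                    total += weight * tf[term] * idf
        scored.append((total, rec["year"], rec["id"]))

    # Dynamic score first, then static ordering by publication year.
    scored.sort(key=lambda item: (-item[0], -item[1]))
    return [rec_id for total, _, rec_id in scored if total > 0]
```

With the sample data, `rank("revolutionary war", SAMPLE_RECORDS)` puts the two title matches first (newer edition ahead of the older one), the notes-only match last, and drops the gardening book entirely.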
Andrew gave some interesting statistics about use of the catalog:
The two most frequently used "options" are LC classification and subject headings. Subject-based navigation is nearly 1/2 of the navigation. It doesn't appear that the order of the dimensions (facets) determines usage. The statistics from the NCSU catalog show that users are selecting facets that appear lower in the page.
Most searches in the catalog are keyword searches. Subject searches are very small (4%). Author searches only 8%. [Note: later in the day, someone suggested that the committee should gather stats about actual use of catalogs to inform the discussion. Duh!]
The definition of "most popular" (which is an option selected 12% of the time) is based on circulation figures. Call number search, title search, and author search are each used about the same amount, around 10%.
We still have a natural language problem -- and LCSH isn't very good for this. Andrew gave the example of the common term "Revolutionary War" vs. an LC subject heading that reads: United States-History-Revolution-1775--1783. [Look this up in any library catalog -- the dates vary so it's really hard to tell what subject heading defines this topic.]
The new discovery tools point out inadequacies in the data. What could replace LCSH? User tagging is interesting, but there's the difficulty that the same tag gets added to many items, and the retrieved set is huge.
Will we be able to make sense out of full text? Right now our store of digital materials is incomplete so it is very hard to draw any conclusions from the full text works that we have.
Andrew presented a wish list:
Note: Anurag Acharya could not be here so Dan Clancy from Google Book Search is taking Anurag's place.
In the early Internet days, Yahoo started out emulating a traditional catalog with its subject categories, but people seem to prefer the search method. The search method works because the web itself provides the organization through links. Google doesn't organize the web, instead it makes use of the organization that web pages provide. Google makes heavy use of anchor text that defines links. These anchors provide the meaning behind the link; essentially, aboutness. A link is an assertion about the relationship. It is also a kind of metadata.
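The "anchor text as metadata" point can be illustrated with a small sketch: collect the link text that pages use when pointing at a target, and treat that text as a description of the target. This uses only Python's standard `html.parser`. It is an illustration of the general idea, not a description of how Google actually processes links, and the names `AnchorCollector` and `anchor_text_index` are made up here.

```python
from collections import defaultdict
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects (href, link text) pairs from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace inside the link text.
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def anchor_text_index(pages):
    """Map each link target to the anchor texts that pages use for it."""
    index = defaultdict(list)
    for html in pages:
        parser = AnchorCollector()
        parser.feed(html)
        parser.close()
        for href, text in parser.links:
            index[href].append(text)
    return dict(index)
```

Each entry in the resulting index is a set of assertions by other people about what the target is "about," which is exactly the kind of description a scanned book does not arrive with.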
Google Book Search currently relies on things like the title for ranking, not links. On the web, people consider search to work well, but without those links, search is not a "solved problem."
One of the questions that a system like Google must address is: What is an object? The answer is not simply: "a web page is an object." There are many "same" pages on the web, so even the web needs to be de-duped. How do you determine "sameness"? It's not pure equivalence; sameness is a fuzzy function. In the end, things are determined to be "effectively equivalent."
Apply this to books. It depends on the context. Google needs to algorithmically determine equivalence.
Authority: Who is an Expert?
Authority used to be easier to determine -- professors, where they work, what degrees they have. Doesn't work on the web. He calls the web a "democracy." The only way to get authority is to take advantage of the masses; there's too much stuff for you to be able to make determinations any other way.
The cost of asserting opinions determines value: it costs more to maintain a web page than to write a blog; it costs more to write a blog than to tag photos in flickr.
Searching things other than the web.
How to decide?: Listen to your users. Let the users tell you what they want to do. That doesn't mean that you can't also serve minority groups. (kc: This implies that "average" users like Google Scholar, while specialist users prefer library or vendor databases.)
Clancy gives some examples in a demo:
Google Book Search
Metadata problems are a big issue (?) Didn't say much to support this (we should get him to elaborate)
How do you determine if these two books are the same? (Two books from different libraries) It's easier to figure out once you've scanned the same book twice. (This implies that they use the scans to determine duplicates. Intriguing idea!)
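Clancy did not say how the scans are compared, so the following is only a sketch of one well-known way to treat "sameness" as a fuzzy function over full text: break each text into overlapping word windows (shingles) and measure the overlap with Jaccard similarity. The window size `k=3` and the `0.8` threshold are arbitrary assumptions, not values from the talk.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of overlap over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def effectively_equivalent(text_a, text_b, threshold=0.8):
    """Treat two texts as 'the same' when their shingle overlap is high enough."""
    return jaccard(shingles(text_a), shingles(text_b)) >= threshold
```

Because punctuation and case are discarded before shingling, two scans of the same book that differ only in OCR punctuation would still score as equivalent, while unrelated texts share almost no shingles. A real system working at library scale would need something cheaper than comparing every pair of books.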
How do you create an ontology of non-web objects?
Google Books is an Opportunity to help users. "We have the opportunity to help users find this wealth [in libraries]"
Roy Tennant asks: what do *you* mean by metadata?
Answer: "Things that describe this book."
Person from Bowker asks: ISTC is coming along; will Google use it?
Clancy: sure, if it helps users.
Lorcan Dempsey asks: Do they have any authority files for persons, places, etc.?
Clancy: We probably do not use authority files to the extent we should. We mainly work with the text.
Bernie Hurley is the Director for Library Technology at the UC Berkeley library.
Opening screen: ;-)
Title: 245 00 $aBibliographic control $h[electronic resource] $bA perspective from a Research Library $cBernard J Hurley
Most of what bibliographic control is to libraries is: MARC.
"I'm desensitized to MARC. Or thought I was; I actually have some deep feelings about it. The title [of this talk] is a metaphor."
Title of Talk: Metadata Needs for Research Libraries
The purpose of a university is to confer tenure. This means: teaching, research, and publishing.
Metadata is used for
How we index things:
2/3 of searching is done in 3 indexes: title keyword, personal author, subject keyword. Limit by "location" is the most frequent in the UCB catalog.
We are maintaining access points that are rarely used. This is a question of where we put our resources -- we should put more energy into keyword indexes.
MARC is not only encoding, but what we encode. MARC 245 has information about the title, but also information about the author, dates, medium form, version. This makes indexing complex, as indexes pull from individual subfields all over the record.
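To show why this makes indexing fiddly, here is a rough sketch of building a single "title" keyword index. The tags are real MARC 21 title fields (130, 240, 245, 246, 740), but which fields and subfields a given catalog actually pulls into its title index is a local decision, so the `TITLE_SOURCES` mapping below is an assumption. The record is a simplified list of tuples rather than a parsed MARC file, and the sample values are illustrative.

```python
# Hypothetical mapping of MARC tags to the subfields that feed a "title" index.
TITLE_SOURCES = {
    "130": "a",   # main entry, uniform title
    "240": "a",   # uniform title
    "245": "ab",  # title statement: title proper and remainder of title
    "246": "ab",  # varying form of title
    "740": "a",   # added entry, uncontrolled related/analytical title
}

# Illustrative record: a list of (tag, [(subfield_code, value), ...]) tuples.
SAMPLE_RECORD = [
    ("100", [("a", "Hurley, Bernard J.")]),
    ("245", [
        ("a", "Bibliographic control"),
        ("h", "[electronic resource] :"),
        ("b", "a perspective from a research library /"),
        ("c", "Bernard J Hurley."),
    ]),
    ("246", [("a", "Metadata needs for research libraries")]),
]


def title_index_terms(record):
    """Collect title keywords from the scattered title fields of one record."""
    terms = []
    for tag, subfields in record:
        wanted = TITLE_SOURCES.get(tag)
        if not wanted:
            continue  # not a title-bearing field
        for code, value in subfields:
            if code in wanted:
                # Drop the ISBD punctuation that trails MARC subfields.
                cleaned = value.strip(" /:;,.")
                terms.extend(cleaned.lower().split())
    return terms
```

Even in this toy version, the 245 field has to be picked apart: `$a` and `$b` go into the title index, while `$h` (medium) and `$c` (statement of responsibility) sit in the same field but belong elsewhere.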
Simple displays use very few fields. Our catalog displays 75 of the 175 MARC fields; it maps those into 27 labels. Display loses a lot of detail.
digital: 856 with the URL works pretty well; but the 856 also has lots of other information
Print: leads users to shelf
There is a mismatch between the richness of MARC and how we serve our users
Can we make it work harder? Maybe MARC isn't the *right* metadata. (Oh, horrors!)
It's expensive to create MARC records. It's expensive to create the MARC format. MARC sucks up all of the resources available for metadata creation. At Berkeley, the technical services staff doesn't have time to do metadata creation for digital library, so digital library is setting up its own metadata creation function.
The UC Bibliographic Services Task Force Report
MARC isn't flexible - it's hard to integrate new metadata into MARC. Things like faceted browsing, full indexing, etc. are hard to do with MARC. We need to radically simplify MARC - we aren't using most of it. It could be used with other metadata, like DC, ONIX, LOM. METS already packages these together. It's not just MARC anymore.
Best quote: "Research libraries are spending a fortune on creating metadata that is mismatched to our users' needs."
New services make the MARC mis-match worse; we can't fit new stuff into MARC.
Oren Beit-Arie is Chief Strategy Officer at Ex Libris, one of the key (and ever diminishing group of) library systems vendors. Oren created the first OpenURL resolution service that was offered commercially, and has been active in metasearch, electronic resource management, and other ILS developments.
He began by saying: "I'm just glad that there still is something called the library vendor."
Vendors are affected by the changing economics of libraries and the information area. Libraries are focusing more on their core role, not just their core competencies. Focusing on what they "ought" to do.
Uses: Discovery & Delivery
Solutions need to take into account other languages and other cultures; differences in workflows. What we don't do well:
Economics -- we need more mid-level collaboration
Library catalogs and other services are in decline:
There is more content and there are more content types. Libraries can't be isolated - they have to interoperate (and that needs to go both ways).
Role of the library
Challenge: End-user services are tightly tied to back-office operations. This isn't going to work. Overall architecture has to change. We need to decouple the user experience from the back-office operations.
Role of metadata
What you see is NOT what you get; what you get is NOT what you see. Decoupling complexity and user view.
Parts of MARC that just don't work:
Full text adds a lot to the mix, but it's a very different beast. It isn't clear what the role of metadata is in the full text world. There seems to still be room for manual processes to clarify semantics. How can libraries benefit from full text without taking on the whole expense of storage and organization?
Lower barriers have a better chance for success; but some radical change can be handled.