Archive for the ‘cataloging’ Category

Thursday, February 5th, 2015

Subjects and the Ship of Theseus

I thought I might take a break to post an amusing photo of something I wrote out today:


The photo is a first draft of a database schema for a revamp of how LibraryThing will do library subjects. All told, it has 26 tables. Gulp.

About eight of the tables do what a good cataloging system would do:

  • Distinguishes the various subject systems (LCSH, Medical Subjects, etc.)
  • Preserves the semantic richness of subject cataloging, including the stuff that never makes it into library systems.
  • Breaks subjects into their facets (e.g., “Man-woman relationships — Fiction” has two subject facets).

Most of the tables, however, satisfy LibraryThing’s unusual core commitments: to let users do their own thing, like their own little library, but also to let them benefit from and participate in the data and contributions of others.(1) So it:

  • Links to subjects from various “levels,” including book-level, edition-level, ISBN-level and work-level.
  • Allows members to use their own data, or “inherit” subjects from other levels.
  • Allows for members to “play librarian,” improving good data and suppressing bad data.(2)
  • Allows for real-time, fully reversible aliasing of subjects and subject facets.
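To make the two lists above a little more concrete, here is a minimal sketch of level-based inheritance and reversible aliasing. It is not the actual LibraryThing schema: the class, the method names, and the ordering of levels from most to least specific are assumptions made for illustration. The idea is that aliases are resolved at read time, so undoing one never requires rewriting stored links.

```python
from dataclasses import dataclass, field

# Level names come from the post; treating this as a most-specific-first
# ordering is an assumption of the sketch.
LEVELS = ["book", "edition", "isbn", "work"]


@dataclass
class SubjectStore:
    # (level, item id) -> subject ids attached directly at that level
    links: dict = field(default_factory=dict)
    # alias subject id -> target subject id; deleting an entry undoes the alias
    aliases: dict = field(default_factory=dict)

    def attach(self, level, item_id, subject_id):
        self.links.setdefault((level, item_id), []).append(subject_id)

    def alias(self, from_id, to_id):
        self.aliases[from_id] = to_id

    def unalias(self, from_id):
        self.aliases.pop(from_id, None)

    def resolve(self, subject_id):
        # Follow alias chains when reading; the seen set guards against loops.
        seen = set()
        while subject_id in self.aliases and subject_id not in seen:
            seen.add(subject_id)
            subject_id = self.aliases[subject_id]
        return subject_id

    def subjects_for(self, ids_by_level):
        # Use the most specific level that has data; otherwise inherit upward.
        for level in LEVELS:
            key = (level, ids_by_level.get(level))
            if key in self.links:
                resolved = [self.resolve(s) for s in self.links[key]]
                return list(dict.fromkeys(resolved))  # de-duplicate, keep order
        return []


store = SubjectStore()
store.attach("work", "w1", "cookery")
store.attach("work", "w1", "cooking")
book = {"book": "b1", "edition": "e1", "isbn": "i1", "work": "w1"}

store.alias("cookery", "cooking")
print(store.subjects_for(book))  # ['cooking']

store.unalias("cookery")
print(store.subjects_for(book))  # ['cookery', 'cooking']
```

Because the book in this example has no subjects of its own, it inherits from the work level. Attaching a subject at the book level would override that without touching the shared data.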

The last is perhaps the hardest. Nine years ago (!) I compared LibraryThing to the “Ship of Theseus,” a ship which is “preserved” although its components are continually changed. The same goes for much of its data, although “shifting sands” might be a better analogy. Accounting for this makes for some interesting database structures, and interesting programming. Not every system at LibraryThing does this perfectly. But I hope this structure will help us do that better for subjects.(3)

Weird as all this is, I think it’s the way things are going. At present most libraries maintain their own data, which, while generally copied from another library, is fundamentally siloed. Like an evolving species, library records descend from each other; they aren’t dynamically linked. The data inside the records are siloed as well, trapped in a non-relational model. The profession that invented metadata, and indeed invented sharing metadata, is, at least as far as its catalogs go, far behind.

Eventually that will end. It may end in a “Library Goodreads,” every library sharing the same data, with global changes possible, but reserved for special catalogers. But my bet is on a more LibraryThing-like future, where library systems will both respect local cataloging choices and, if they like, benefit instantly from improvements made elsewhere in the system.

When that future arrives, we’ve got the schema!

1. I’m betting another ten tables are added before the system is complete.
2. The system doesn’t presume whether changes will be made unilaterally, or voted on. Voting, like much else, exists in a separate system, even if it ends up looking like part of the subject system.
3. This is a long-term project. Our first steps are much more modest–the tables have an order-of-use, not shown. First off we’re going to duplicate the current system, but with appropriate character sets and segmentation by thesaurus and language.

Labels: cataloging, subjects

Tuesday, March 15th, 2011

VIAF, OCLC and open data

Yesterday I released a service called “LC AuthoritiesThing.” The service solved a problem many have had with the LC Authorities website. Although a fine searchable resource, LC Authorities does not have stable URLs. Links die after a short period and are tied to sessions in a way that prevents sharing URLs during that period. LC AuthoritiesThing provides a window into the LC Authorities site which allows hard, reliable links. Various catalogers have thanked us for making the service, as it will allow them to refer to authority records more easily.

As an update to the post I took notice of VIAF, the Virtual Authority File, recommended to me as a substitute by a cataloger on Twitter. I assumed (apparently wrongly) that VIAF would at some point supersede LC Authorities. And I wrote that VIAF wasn’t a good substitute because it is an OCLC project, and encumbered by licensing restrictions.

Since then, I have received a variety of communications telling me that I am wrong. Although its data is hosted by OCLC, and its services were developed and are served by OCLC, VIAF is not an OCLC project, and the project has no access terms. Thomas Hickey from OCLC even wrote on this blog that full dumps are also available, although they must be approved somehow by project leaders.

This is welcome news. LibraryThing will be submitting a request for a full VIAF dump, and we’ll see where that goes. We will also look into automated harvesting of the website, or at least the LC portion of the data.

So far so good. But the situation is illustrative. Select people within the library community may believe that VIAF is free. But every public indication is that it is not free.

These indications include:

  1. OCLC copyright notices on every single page, and all VIAF-related pages on
  2. Links to the OCLC Terms and Conditions from multiple pages, including the Privacy page.
  3. A robots.txt file that prohibits automated access to result pages.
  4. The “About VIAF” project page prominently states “Use of our prototypes is subject to OCLC’s terms and conditions. By continuing past this point, you agree to abide by these terms.”
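For anyone weighing the automated harvesting mentioned above, a robots.txt file can be checked programmatically before a single page is fetched. The sketch below uses Python's standard urllib.robotparser. The rules, the user-agent name, and the example.org URLs are all hypothetical, since the actual VIAF robots.txt is not quoted here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; not the real VIAF robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""


def may_fetch(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A result page under /search is off limits; other pages are not.
print(may_fetch(SAMPLE_ROBOTS_TXT, "example-harvester",
                "http://example.org/search?query=twain"))  # False
print(may_fetch(SAMPLE_ROBOTS_TXT, "example-harvester",
                "http://example.org/about"))  # True
```

A robots.txt file only says what a crawler is asked not to do. It is separate from the licensing questions raised by the terms and conditions.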

As all catalogers surely know, the OCLC Terms and Conditions are lengthy and explicit. Among other things they prohibit commercial use, automated use, storage of data, and use of the data for cataloging (!). They state that OCLC has sole and arbitrary discretion to discontinue access to anyone for any reason. They state that exceptions to the terms require permission in writing from OCLC.

Meanwhile, apart from a blog comment from Thom Hickey, I can find no assertions that OCLC terms don’t apply to VIAF, no mention of dumps or of a process to get them.

VIAF is to be commended for its openness and lack of terms. This is a great move forward for open bibliographic data. But it needs to make greater efforts to make others aware of this state of affairs, and define the level and character of openness. (It’s still unclear to me whether VIAF asserts any ownership, or whether it is all in the public domain.) And VIAF should make efforts to remove multiple statements asserting that OCLC terms apply to VIAF data.

Labels: cataloging, oclc

Sunday, March 13th, 2011

LC AuthoritiesThing: Permanent links to LC Authority records

Library catalogs are notorious for their URL structure. More than a decade after the rest of the web decided on solid, permanent links, most library systems continue to generate ephemeral, usually session-based ones. Sometimes catalogs have a syntax for permanent links, but they’re a special, added feature.

The problem is at its worst with the Library of Congress Authorities system, used by catalogers and librarians the world over. The core of authority control is a stable identifier, in this case the LCCN, but the LC Authorities catalog can neither be searched by nor linked to by that identifier. No matter what URL you find, it dies when the session dies. You can’t even link to searches. What ought to be a rock is a puff of smoke.

The problem was solved for Subject Authority files when the Library of Congress released the Authorities and Vocabularies website, which allows linking to subjects by their LCCN (e.g., sh85026719). But name-authority files (i.e., authors) have received no similar treatment.
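The difference between a session-bound URL and an identifier-based one is easy to show in code: a permanent link needs nothing but the stable identifier. The sketch below is an illustration only. The URL template points at a placeholder domain, and the LCCN pattern is a simplified assumption (one to three letters followed by eight or ten digits), not the full Library of Congress normalization rules.

```python
import re

# Simplified, assumed shape of a normalized LCCN: a 1-3 letter prefix
# followed by 8 or 10 digits. Real LCCN normalization has more cases.
LCCN_PATTERN = re.compile(r"^[a-z]{1,3}\d{8}(\d{2})?$")

# Hypothetical URL template on a placeholder domain.
PERMALINK_TEMPLATE = "https://authorities.example.org/{lccn}"


def normalize_lccn(raw):
    """Lowercase and remove whitespace so 'sh 85026719' and 'SH85026719' match."""
    return re.sub(r"\s+", "", raw).lower()


def permalink(raw_lccn):
    """Build a link that depends only on the identifier, never on a session."""
    lccn = normalize_lccn(raw_lccn)
    if not LCCN_PATTERN.match(lccn):
        raise ValueError(f"not a recognizable LCCN: {raw_lccn!r}")
    return PERMALINK_TEMPLATE.format(lccn=lccn)


print(permalink("sh 85026719"))  # https://authorities.example.org/sh85026719
```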

LC AuthoritiesThing is a partial and tentative solution to that problem, a window into the Library of Congress Authorities catalog that allows permanent linking. Search for a name (or subject) and, when you find it, the page will have a tiny link icon which serves as the permalink for the page.


It took a little magic to get it to work, but it does.* For now at least, you can’t link to records you haven’t found. If there’s interest, I will inject Simon Spero’s ingenious screen-scrape dump of LC Authority files, which will give me the necessary link between 001 and 035 fields.

For now, it’s just an experiment. Will anyone find it useful? Is it worth putting on its own domain? What would make it better? I know, anyway, that it can be of some use to LibraryThing. In the near future I plan to bolt it to LibraryThing itself, so members can link authors to their LC Authority number, when the link will help clarify things.

If you have any thoughts, discuss them here.

Update: It’s been objected that LC Authorities has been or will be superseded by VIAF, the Virtual International Authority File, an aggregate of authority files from libraries around the world. Unfortunately, VIAF is another OCLC project, studded on every side by copyright assertions, EULAs, use restrictions and licensing terms. As with most everything else OCLC does, the core information was created at taxpayer expense, and is legally impossible to copyright. The rest was created by libraries with no intention of creating a proprietary resource. And the result is another proprietary, restricted and nigh-inescapable data monopoly.

*Behind the scenes it’s doing both proxied requests and stepping through pages as if it were. If anyone can come up with a better way, I’m all ears.

Labels: cataloging

Tuesday, February 8th, 2011

LibraryThing and FRBR?

Jeremy and I just finished writing a long post, LibraryThing dives into editions and expressions, laying out our plans to move LibraryThing to a new structure reminiscent in some ways of the FRBR system familiar to many librarians. Anyone interested in FRBR and cataloging might be interested in checking it out.

LibraryThing has long had a FRBR-like system, with three rather than four “levels,” and some differences in how the levels are conceived. The system is managed by members, and has achieved remarkable results. We believe, for example, that our ThingISBN service produces better other-edition data for a book than OCLC’s xISBN service, which lacks user input. (Also, ours is free; they charge—but I digress.)

It’s time, however, to move to a more complex system, which can do everything members want to do. Go ahead and check out the discussion.

I posted here because I think the question should engage the larger library world. LibraryThing is a unique test-bed for ideas, and a potential source of both inspiration and actual organization for libraries.

Some questions for librarians:

1. How do you see the system agreeing with or differing from FRBR?
2. What FRBR-related ideas should we take a look at?
3. Which will happen first, RDA or LibraryThing’s new system? (joke)

Labels: cataloging, frbr

Friday, October 29th, 2010

Better German cataloging from open data

University of Konstanz (Wikimedia Commons)

Casey has just finished loading 1.38 million library MARC records from Konstanz University into LibraryThing’s search index, Overcat.

While Overcat isn’t the only way to find German items–you can search libraries directly–it has become many members’ first source. At 35.2 million items, it’s now considerably larger than any remote source, as well as faster and more diverse. The Konstanz University records jump it up significantly as a German-language source.

Adding the records was possible because Konstanz chose to release the records as “CC-0,” essentially “public domain.” Inasmuch as OCLC has convinced (or intimidated) much of the library world into acting as if library records were private property, this was a brave move.(1) You can read more about the release on the Open Knowledge Foundation blog. It’s notable they originally opted for a more restricted, non-commercial license, but, under prompting from German librarians, opened it up all the way.

And what will we do with these records? Evil things! Hardly. LibraryThing has never sold library records and we never will. But the records will make a small percentage of members happy, as their German books suddenly got easier to catalog. These records, in turn, will serve as a scaffold to add other cataloging-like data—what we call Common Knowledge (CK)—all of which is released under a Creative Commons Attribution license. In this way open data improves open data, and everyone is the richer.

1. Their action is especially notable in that German governmental agencies aren’t required to disclaim copyright, as US ones are. Locking up free US government or government-funded library data, as OCLC does, is obnoxious and legally dubious, but Germany has different rules–including a true “database copyright” the United States lacks.

Labels: cataloging, open data, openness

Saturday, January 31st, 2009

Cataloging and fun

On Thursday we introduced a silly new “meme” page called “Dead or Alive?” which listed your authors by their mortal status–alive, dead, unknown or “not a person.” (See the blog post or check out yours.) The feature drew on the birth and death dates of the authors in our Common Knowledge system, a free (Creative Commons) “fielded wiki” for miscellaneous “cataloging” information (think “Wikipedia for book info”). To move an author from the “unknown” column, members had to find their dates and enter them into Common Knowledge.

Here’s a chart of Common Knowledge contributions over the last month.* Can you spot the day “Dead or Alive?” went live?

As you can see, birth, death and gender edits (gender is where you mark an author as “not a person”) went through the roof when the feature was announced—from an average of 143 edits per day, to 3731 and 3584, 25 times the average. Other edits went up too—a 30% increase.

A few members joked that it was a plot to encourage contributions to Common Knowledge. It wasn’t that. I just thought it was a funny idea, but I wasn’t unaware that it would have that effect. Indeed, the upshot shows again something of a LibraryThing finding—that regular people will contribute cataloging information if you make it meaningful to them. That is, whatever incentive there is to add author information, the incentive is increased when they’re your authors, and increased again when that information does something for you. Of course, even if incentive is personal, the effect is general; you update the author because you have his or her book, but everyone else shares in the value of that update.

The way this works undercuts a common myth of “Web 2.0”—that there are all these people out there adding “user-generated content” out of altruism or an extreme mismatch between time and exciting things to do. And it cuts against an older myth, that cataloging is so boring you have to pay people to do it.

We’ve seen the same jump every time we introduced a new Common Knowledge category, and again when we made that category “come alive” in some way for members. And although the short-term jump will surely level out, the overall rate of “dead-or-alive” entries certainly will not. You get more changes when the changes do something for people.

Now, of course, there’s a whole list of things this doesn’t mean. It doesn’t mean that LibraryThing members are doing their job well (although I suspect they are). It doesn’t mean the same would apply to much more difficult forms of cataloging, or to forms that generally presuppose professional training (i.e., LCSH). And it doesn’t mean that regular people will get to the “rare stuff”; indeed, it probably means that average cataloging attention is directly related to popularity of the underlying item.

Even so, pretty cool. Oh, by the way, I’m adding a feature allowing you to compare yourself to other members, which should inflame the other great motive for personal metadata—competition. After all, my library has a higher dead/alive ratio than yours!

UPDATE: Here’s the current chart, without day-norming. Notice how everything went up.

*The numbers are normed against day-related changes. Basically, we smoothed out that many more edits are made on Monday than Saturday.
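For the curious, one simple way to do this kind of day-norming is to divide each day's count by the average count for that weekday. The sketch below is only an illustration under that assumption, not necessarily the method behind the chart, and the edit counts in it are made up.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def day_normed(counts_by_date):
    """Divide each day's count by the mean count for its weekday."""
    by_weekday = defaultdict(list)
    for day, count in counts_by_date.items():
        by_weekday[day.weekday()].append(count)
    weekday_mean = {wd: mean(vals) for wd, vals in by_weekday.items()}
    return {
        # Guard against a weekday whose counts are all zero.
        day: (count / weekday_mean[day.weekday()]
              if weekday_mean[day.weekday()] else 0.0)
        for day, count in counts_by_date.items()
    }


# Made-up counts: Mondays are busier than Saturdays in raw terms.
edits = {
    date(2009, 1, 5): 200,   # Monday
    date(2009, 1, 12): 200,  # Monday
    date(2009, 1, 10): 100,  # Saturday
    date(2009, 1, 17): 300,  # Saturday
}

normed = day_normed(edits)
print(normed[date(2009, 1, 12)])  # 1.0
print(normed[date(2009, 1, 17)])  # 1.5
```

In raw terms the second Saturday (300) looks only somewhat busier than a Monday (200). After norming, it stands out as 1.5 times a typical Saturday, while both Mondays sit at exactly 1.0.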

Labels: cataloging, dead or alive, Social Cataloging, zombies

Thursday, June 26th, 2008

The Future of Cataloging at ALA

If you’re at ALA in Anaheim, have nothing to do Sunday morning and are interested in the future of cataloging—and who isn’t?—you might be interested in the following panel:

ALA Annual Conference
Sunday, June 29, 2008 from 8:00 a.m. – 12:00 noon
Anaheim Convention Center, Rm. 204B

The panelists include Roy Tennant, Jennifer Bowen, Martha Yee, Diane Hillmann—and (gulp) me!

The moderator, Robert Wolven of Columbia*, is promising to keep it snappy, with brief presentations and oodles of time to discuss the big issues.

I don’t know all the panelists, but I know we include some very different visions of the future. There may be fireworks! (I won’t be attacking OCLC as much as I otherwise might. Roy could disarm Rambo.)

My mini-presentation is titled “UGC: The Next Sharp Stick?” UGC is, of course, User Generated Content. And the “Next Sharp Stick?” is a reference to John Hodgman’s humorous one-act play “Fire: The Next Sharp Stick?” The play ends with the fire-promoting caveman being killed, of course.

What can I say? They didn’t ask me on to be the conservative straight man.

*No “primary link” I can find, but see this for starters.

Labels: ala, ala anaheim, ala2008, cataloging