Cleaning up the tags

As you probably know, the best way to get a bargain on eBay is to stumble on an item that has been poorly described. People searching for a pirate doll, for example, will not find an item that has been described as "priate doll" and hence its eventual sale price will probably be much lower than an item correctly described. Is the same true of information? In an article in D-Lib online magazine (a DARPA-sponsored publication about digital librarianship), two digital librarians discuss whether we should make any effort to "tidy up" folksonomies.

"On this scene enter – winged, horned, and spined – A longlegs, a moth, and a dumbledore

[Hardy] might have instead written "A crane fly, a moth and a bee", had he been willing to foresake the opportunity to instill a little local colour, but his choice to use dialect or common names was inspired, and the poem benefits from it. However, a search engine would not."*

It's undeniably true that the way items are tagged on sites like Flickr and del.icio.us is very haphazard. This is often because words are mis-spelled due to carelessness or a particular idioglossary, and any such tags are unlikely to be useful to other people unless the mis-spelling is common enough to be statistically significant. Another reason for the variation is that there is no widespread convention on whether to use singular or plural words for a tag; looking for "goose" on Flickr will not find images tagged only with "geese".

The third common reason for the variation in tags for a particular thing is the one referred to in the example of Thomas Hardy's poem. Searching for "dumbledore" will give you a lot of hits about Harry Potter and some, but far fewer, about bees. Tags are applied in many languages, and even though most are in English, English has such an enormous vocabulary that most words have synonyms.

The librarians who wrote the article recognise the difficulty of educating or coercing all users to use more useful tags, although they believe that regular users will naturally tend to use the same tags as each other because these are the tags they see most frequently in searches (a Power Law effect). My own opinion is that user behaviour is unlikely to get any better than it is today. In fact, the more people use tagging systems, the higher the proportion of naive tags will be.

However, Flickr has shown that value can be added to tags by using statistical methods to enhance searching. Search for "apple" and you will get results divided automatically into images of computers and images of fruit, simply because of the additional tags that are commonly applied to those two discrete sets of images. If you Google "geese" you will get some pages that only contain the word "goose". Google is sophisticated enough to know that the two words are closely related. We can learn to deliver useful search results based on tags of relatively poor quality.
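
To make the "apple" example a little more concrete, here is a toy sketch in Python of how results might be split using nothing but the other tags people happen to apply. It is only an illustration of the idea of grouping by co-occurring tags, and certainly not a description of how Flickr actually does it. The photo ids, the sample tags and the function name group_by_cooccurrence are all invented for the example, and a real system would need proper statistical thresholds rather than "shares at least one tag".

```python
from collections import defaultdict


def group_by_cooccurrence(photos, query):
    """Split the photos tagged `query` into groups that share other tags.

    `photos` maps a (hypothetical) photo id to its list of tags.
    Two photos land in the same group if they are linked, directly or
    through other photos, by at least one tag other than the query.
    """
    # Keep only the matching photos, and drop the query tag itself.
    matches = {pid: set(tags) - {query}
               for pid, tags in photos.items() if query in tags}

    # Index: companion tag -> ids of matching photos carrying it.
    by_tag = defaultdict(set)
    for pid, tags in matches.items():
        for tag in tags:
            by_tag[tag].add(pid)

    groups, seen = [], set()
    for start in sorted(matches):
        if start in seen:
            continue
        # Walk outwards from this photo through shared companion tags.
        group, stack = set(), [start]
        while stack:
            pid = stack.pop()
            if pid in seen:
                continue
            seen.add(pid)
            group.add(pid)
            for tag in matches[pid]:
                stack.extend(by_tag[tag] - seen)
        groups.append(sorted(group))
    return groups


# Made-up sample data, purely for illustration.
photos = {
    "p1": ["apple", "mac", "laptop"],
    "p2": ["apple", "laptop", "desk"],
    "p3": ["apple", "fruit", "orchard"],
    "p4": ["apple", "fruit", "pie"],
    "p5": ["goose", "pond"],
}

print(group_by_cooccurrence(photos, "apple"))
# prints [['p1', 'p2'], ['p3', 'p4']]
```

Nobody told the code what a computer or a piece of fruit is; the two groups fall out of the companion tags alone. A single stray tag shared between the two sets would merge them, which is exactly why a real implementation would weigh how often tags appear together rather than treating any overlap as a link.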
All that is needed is a critical mass of tags to begin with. Let's start tagging stuff!

* Interesting to note that even in this short extract there are two words, "foresake" and "instill", that my spell checker quibbles with. Even librarians have cultural differences.

Dominic Sayers
