Wednesday, January 17, 2007

Findability: Concept-Mapping Blog Archives by Content

Morville book
“A wealth of information creates a poverty of attention.”
 — Herbert Simon, Nobel Laureate Economist, 1916-2001.

CMT: At the digital media workshop at this past weekend’s CMA meeting, WGBH’s Annie Valva asked developers to please make their chamber music media retrievable “by content”. What did she mean, and how do we do that?

DSM: She was not referring to “music information retrieval” (MIR), in the sense of retrieving mp3 or other music by its musical incipits or motives or orchestrational attributes. She was referring to more ordinary web content and Web 2.0 usability features. She mainly illustrated her comments during the CMA workshop by describing textual tagging of website and blog and podcast content with keyword tags. Textual tagging allows users to search and retrieve things that are of interest to them by entering tag words into HTML search input forms, or by examining “tagcloud” areas on the website, or by setting up RSS feed criteria using their preferred tag words so that content that matches those criteria is automatically pushed to them by the aggregators/syndicators.
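As a rough illustration of the tag-based retrieval described above, here is a minimal sketch of a tag index that a search form could query. The post titles, tags, and function names (`build_tag_index`, `find_posts`) are all hypothetical, invented for this example; this is not how Blogger or any aggregator actually implements tag search.

```python
from collections import defaultdict

def build_tag_index(posts):
    """Map each lowercase tag to the set of post titles that carry it."""
    index = defaultdict(set)
    for title, tags in posts.items():
        for tag in tags:
            index[tag.strip().lower()].add(title)
    return index

def find_posts(index, *wanted_tags):
    """Return titles tagged with every one of the wanted tags, sorted."""
    if not wanted_tags:
        return []
    matches = [index.get(tag.strip().lower(), set()) for tag in wanted_tags]
    return sorted(set.intersection(*matches))

# Hypothetical posts and tags, invented for illustration only.
posts = {
    "Funding for small ensembles": ["policy", "funding"],
    "Bowing choices in late Haydn": ["performance practice", "strings"],
    "Grant writing for quartets": ["funding", "strings"],
}
index = build_tag_index(posts)
print(find_posts(index, "funding"))
print(find_posts(index, "Funding", "strings"))
```

Requiring every requested tag to match (set intersection) narrows results as the reader adds tags; a union would broaden them instead, and either choice is reasonable depending on the site.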

CMT: But, even if that was the limit of what Annie had time to discuss with us, I was left with the impression that she wasn’t restricting her request to doing just those things. Aren’t there more techniques that can and should be used, to enable people to quickly find what they want or need?

DSM: I agree that Annie was implying something broader than the time available to her in the meeting allowed. Given, too, that WGBH’s programming is substantially ‘educational’ programming, I think it’s likely that Annie was alluding to some of the so-called ‘concept mapping’ techniques that have been used in teaching/pedagogical applications over the past ten years or so. I’ve put some books and resources in our list at the bottom of this post, if you’re interested . . .

CMT: Well, what can we do here to operationalize what Annie was implying?

DSM: We could do several things. One of the reader comments we received last month asked about organizing the posts so that similar things are “close together,” regardless of how far apart the similar posts may be in time. The native “Archives” and “Topics” capabilities of Blogger and other blog authoring tools really can’t help much. The topical scope of what we have been covering in this CMT blog is pretty diverse—ranging from broad social and policy-oriented topics to very narrow individual performance practice topics, and ranging from abstract ideas to very concrete ones. It’s hard to capture and index all of what each post is about with tags, in part because Blogger only lets us enter about 20 tags and is limited to at most 200 characters for the whole tag list per post. What might be better is for us to use the Resource Description Framework (RDF) and Web Ontology Language (OWL) to drive a visualization that quantitatively positions links to the CMT posts, and clusters things that are conceptually similar close together.

CMT: Okay. You’ve created this ‘principal components analysis’ (PCA) or ‘factor analysis’ (FA) statistical coding of the CMT content. And you’ve put it over to the right, in the sidebar that contains our CMT Archives. It’s simple and a bit klugey with about 100KB of CSS code. But, surprisingly, it does accurately position things in constellations or clusters that make sense. Did you specify the two axes or dimensions yourself, or does the PCA software generate those?

DSM: The Statistica PCA software automatically establishes the two “factors,” which are ‘composite’ variables composed of subsets of the concept tagging variables we use. I just specified that I wanted Statistica to run the analysis and retain two factors, not three or more. It seemed to me that a two-dimensional map would be best in terms of usability. What’s more, we’re only using 12 main concept tagging variables and we only have three dozen posts at this time. So trying to do 3-D or other fancier things would’ve been over-fitting the post data. The data are reasonably well fitted in the 2-D PCA, though.
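For readers who want to see the mechanics, here is a minimal sketch of a two-component PCA on a posts-by-tags matrix, using NumPy's singular value decomposition. It is not the Statistica procedure used for the CMT sidebar, and the six posts and four tag names are hypothetical toy data, far smaller than the 12 variables and three dozen posts mentioned above.

```python
import numpy as np

def pca_2d(tag_matrix):
    """Project posts (rows) onto the first two principal components."""
    X = np.asarray(tag_matrix, dtype=float)
    X_centered = X - X.mean(axis=0)          # center each tag column
    # SVD of the centered matrix: rows of Vt are the principal axes.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    scores = U[:, :2] * S[:2]                # 2-D map coordinates per post
    loadings = Vt[:2].T                      # weight of each tag on each axis
    explained = (S ** 2) / np.sum(S ** 2)    # share of variance per component
    return scores, loadings, explained[:2]

# Hypothetical toy data: 6 posts (rows) x 4 concept tags (columns),
# 1 = the post carries the tag, 0 = it does not.
tags = ["policy", "education", "performance", "instruments"]
posts = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 1, 1],
])
scores, loadings, explained = pca_2d(posts)
print(np.round(scores, 2))
print(np.round(explained, 2))
```

Each row of `scores` is an (x, y) position that could be scaled into sidebar coordinates. Posts with identical tags land on the same point, and the first axis separates the policy-oriented posts from the performance-oriented ones in this toy example.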

CMT: But what about the legends you’ve put next to the X axis and the Y axis? Did the PCA software come up with those concept names?

DSM: No. The PCA software only optimizes the allocation of the input variables to unnamed latent “factors” or “principal components.” It doesn’t infer what those factors or components mean or give names to them. But by examining the map and seeing where along each dimension’s range of values the various items fall, it’s often possible for a human to say what each factor or dimension means and to give it a name. This is standard practice in statistical factor analysis, using methods that have been in common use for fifty years. We just happen to be trying the method out for chamber music blog content. It doesn’t replace the blog-search or archives index or other features. It just supplements them with a slightly different kind of ‘ambient findability’: concept clusters.
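One common aid to that human naming step is to list which input variables weigh most heavily at each end of each axis. The sketch below assumes a small, hypothetical loadings matrix (rows are tags, columns are the two components); the numbers and tag names are invented for illustration and are not the CMT blog's actual loadings.

```python
import numpy as np

def describe_axes(loadings, tag_names, top_k=2):
    """List the tags with the most negative and most positive loading per axis."""
    summary = []
    for axis in range(loadings.shape[1]):
        order = np.argsort(loadings[:, axis])     # ascending by loading
        negative = [tag_names[i] for i in order[:top_k]]
        positive = [tag_names[i] for i in order[::-1][:top_k]]
        summary.append({"negative_end": negative, "positive_end": positive})
    return summary

# Hypothetical loadings: 4 tags x 2 components, invented for illustration.
tag_names = ["policy", "education", "performance", "instruments"]
loadings = np.array([
    [ 0.6, 0.1],
    [ 0.4, 0.7],
    [-0.6, 0.0],
    [-0.4, 0.6],
])
axes = describe_axes(loadings, tag_names)
for number, ends in enumerate(axes, start=1):
    print(f"Axis {number}: {ends['negative_end']} <-> {ends['positive_end']}")
```

With these made-up numbers, a person might label the first axis something like “performance-oriented versus policy-oriented.” The code only surfaces the candidates; the name itself remains a judgment call.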


Morville's feet and their implicit semantic connections

