Tuesday, February 23, 2010

Distribution of Dictionary Content on the Web

Dictionary publishing has been behind the times long enough. At no other time in the history of dictionary making have publishers been so privy to what dictionary content people actually use and how they use it. Who among us has not looked up a word online? I've read reports by publishers who have taken stabs at answering questions of dictionary usage, but by and large these attempts came out smelling like random polls or marketing surveys. Despite their scientific slant, I doubt they adequately ascertained how, why, and for what purpose users reach for a dictionary. And research conducted every few years doesn't, in my mind, constitute ongoing research. No doubt valuable data was gathered from such practices, compiled, duly noted. In rare instances, I'm sure surveys and polls bore real fruit, in the form of overdue editorial decisions to make dictionaries easier to understand, handier, or simply more useful to us users. But all the user data that research has collected pales in comparison to the raw, hard data that flows daily through the pipes of the top reference websites.

On these sites millions of queries pound the dictionary databases per day. Four million queries a day from one site alone, I've heard, and it's not even in the top three. For over ten years data has been collected, all of it bound up in log files. Astounding to anyone who glances at it, for sure, but I'll go out on a limb here and tell you that no one I know of has bothered to give such usage the attention it deserves. Lexicographers have told me outright that it's all garbage: “the top words are the swear words, day in day out” and “the list is filled with misspelled words, gobbledegook and worse.” Publishers insist they already know what people need; they point to the frequency data every publisher keeps on words in the English language. Frequency data in turn informs the wordlists used for the different sizes of dictionaries they publish, from the wordlist of the smallest mass-market paperback to that of the largest unabridged dictionary sitting on a wooden stand in front of the librarian's desk.

Online dictionary usage follows patterns similar to those in publishers' frequency data, but it's more like iTunes set to play on random. You'll notice, the longer you listen, that iTunes seems to play the Greatest Hits more often. So too with online dictionaries: people look up the Greatest Hits every day without fail. But here's the kicker: with lookup lists that are millions of words deep, and so much “gobbledegook” mixed in, it's too hard even for the trained lexicographical mind and editorial eye to separate the word meat from the word bones. Computational analysis, so rarely applied outside corpus work, has to my knowledge never been applied to the log files, the dregs of actual usage left by the unwashed user. To further compound any science involved in digging deep into such data, the majority of reference website traffic flows through sites that have licensed publisher content but are not required to let publishers know what words get looked up and when. Still, some publisher-controlled sites receive enough traffic to produce interesting data, so therein lies some hope for change and insightful development.
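To make that concrete, here's a minimal sketch, in Python, of the sort of computational analysis I have in mind. Everything in it is hypothetical: the log file name, the one-lookup-per-line tab-separated format, and the crude "word-like" filter are stand-ins, not any publisher's actual setup.

```python
# Minimal sketch: count lookup frequencies from a hypothetical query log.
# Assumed log format (illustrative only): timestamp<TAB>query, one per line.
from collections import Counter
import re

# Crude filter for "word-like" queries; everything else is the "gobbledegook".
WORDISH = re.compile(r"^[a-z][a-z'-]*$")

def lookup_frequencies(log_path):
    """Count how often each query string appears in a lookup log."""
    counts = Counter()
    with open(log_path, encoding="utf-8") as log:
        for line in log:
            parts = line.rstrip("\n").split("\t")
            if len(parts) < 2:
                continue  # skip malformed lines
            query = parts[1].strip().lower()
            if WORDISH.match(query):
                counts[query] += 1
    return counts

if __name__ == "__main__":
    freqs = lookup_frequencies("lookups.log")  # hypothetical file name
    # The "Greatest Hits": the most looked-up words, day in, day out.
    for word, n in freqs.most_common(20):
        print(f"{word}\t{n}")
```

Run against a real log, the top of that list would show the Greatest Hits; the long tail, filtered or not, is where the interesting editorial questions live.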

Before I forget to mention it, this blog post is really about distribution of dictionary content via the web. Getting publishers caught up on what hard usage data can illustrate is just one part of building a better editorial process for dictionary content, and it touches only tangentially on issues of distribution. Getting usage data at all these days from content you control and distribute is a tricky feat, one most publishers may believe is achievable only by building a destination website along the lines of Dictionary.com or Merriam-Webster.com. I hope not every traditional reference publisher is thinking that, but I fear most are. Publishers who have already built websites drawing hundreds of thousands or even millions of database queries per day may even be happy with their usage numbers, although they've probably locked their minds onto how those database hits affect the value of ads or how they might turn users into customers of their print or electronic wares. That's a major mistake.

Digital publishing is, thus far in man's evolution, the broadest and easiest form of distribution ever conceived. No dictionary publisher has had this fact drilled into its skull yet, but there are a few so-called internet companies that fully comprehend the kind of distribution reference publishers must come to fundamentally understand, appreciate and perhaps even leverage. Here's a super short list of factoids to keep in mind as you mull all this over:

  1. 50 million tweets flow through Twitter's servers a day, though tomorrow that number will be higher and in six months it will be three times that number. Those are rather high numbers, huh? Given the shift under way that gives real-time search such high marks as the Business of the Future, Twitter is sitting pretty. It is in fact the model for all others to follow: broad user adoption around the world, a low threshold for participation, growing like crazy.
  2. 30% of Ask.com traffic is reference-based, and if Google or Microsoft gave out such figures, a similar percentage would likely hold for Google and Bing search traffic too. (There's good reason for that dictionary link on nearly every Google search results page.)

Where people communicate, where they consume content of any type, and where they search for answers, dictionary content seems to play an important role in people's comprehension and understanding of one another. Honestly, digital content doesn't need to reside on sites one or two clicks away. It needs to be built in everywhere. Because of the traditional business of the dictionary publishing industry, digital reference content has mirrored its physical print parent in terms of distribution: locked in, locked down, shipped once. That practice is not gonna fly much longer. Businesses built on those models are already crumbling. Print dictionaries are rapidly on their way out; I've heard talk from the industry leaders of 60% drops in print sales this past year. When that segment of their business is reduced to a niche, publishers will have some serious decisions to make beyond just that of early retirement.


So, to avoid going down with the ship, what follows is my bit of free advice for dictionary publishers. The first two numbered points are critical; they need to be properly understood. The bullet points that follow are forward-facing details on how to take those two crucial points to heart and build a business from them.

  1. Be everywhere, because that's the only place dictionary content really matters. Remember, no more sitting moldy on a shelf. If a product you can make goes stale the moment it leaves an editor's hands, think twice about making it, and then just don't. Living, breathing reference looks far different from what publishers and editors already know and probably love too much. No one is actually doing living, breathing, real-time, everywhere dictionary reference yet, but that doesn't mean it can't be done or that business models can't spring into existence to support such a thing. Do it justice, dictionary publishers. C'mon! You have it in your means to really rock the word. Besides, people really like you. No one wants to see you fade away like the covers on your books.
  2. What do dictionary publishers have against making their content broadly available, let alone broadly available for free? What do they fear? Will they see further losses of print sales? Will electronic licensing and royalties evaporate? Publishers are missing the massive forest ecosystem for a couple of handsome trees. There are millions of potential users in the world, ready to look up a word in a quality dictionary. Some want print and will always buy it and use it more than anything else, no matter what. Some will use the dictionary that came with their Kindle when they're reading on their Kindle. Still others, particularly in Asia, prefer a handheld electronic device. The vast majority, though, are going to want their dictionary in more places than one. Being human, their needs will shift from moment to moment. The internet is the only way to deliver content to nearly all of them. For applications or devices that live offline, there will always be licensing deals; no serious dictionary publishing house should be without the sweet side dish that is its licensing business.
  • Build scalable distribution around an open API. Build a few toy models for fun, on the web or as desktop software, to illustrate key features the API makes available to developers (a toy sketch follows this list). An open API will let web and software developers pull dictionary content into any application with access to the web.
  • Publishers who want some semblance of control can make developers register for an API key. Such a low barrier won't keep serious developers away.
  • Popping Fresh Data. The API feeds client applications on an as-needed basis, which means editors can be working on content one moment and push it out to the world the very next. Yes, they'll be constantly working; no one said real-time would be easy. New editorial and lexicographical workflows will obviously need to emerge, not only for content going out over the API but also, and this brings us back full circle to usage data, for content coming in: consumer usage data (anonymized, of course) can be analyzed by lexicographers, studied by licensees, and put to use in existing sales cycles.
  • What's the business on the table with an open API? Well, recall that roughly 30% of search engine traffic is looking for reference content. That's a start! The open API model lets developers reach users more easily and give those users what they really want: high-quality dictionary content, free of charge. Publishers needn't charge developers to access the API; they should instead focus on making money on the incidentals. The more use dictionary content gets, the better positioned the publisher will be to monetize it. This was Google's modus operandi in its meteoric rise.
  • Launch with content that's already been published. Don't hesitate. Make everything of quality available through the API.
  • Iterate often. This could be the most important point after the open API itself. If at first you don't succeed, try a different angle. You will fail once or twice or many times; learn from the failures. Keeping agile, sharp development teams on staff will help.
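
To make the API bullets above a little more tangible, here's a toy sketch in Python. It leans on Flask purely for illustration, and everything specific to it is made up: the /v1/entries/<headword> endpoint shape, the demo API key, and the single hard-coded entry. A real service would sit in front of the publisher's editorial database and proper key management, but the bones are this simple.

```python
# Toy sketch of an open dictionary API with registered developer keys.
# All names, routes, and data are hypothetical stand-ins.
from flask import Flask, jsonify, request, abort

app = Flask(__name__)

# Registered developer keys: a low barrier, just enough for attribution and metrics.
API_KEYS = {"demo-key-123": "example-developer"}

# Stand-in for the editorial database; editors push updates, the API serves them.
ENTRIES = {
    "serendipity": {
        "headword": "serendipity",
        "pos": "noun",
        "sense": "the faculty of making happy and unexpected discoveries",
    },
}

@app.route("/v1/entries/<headword>")
def get_entry(headword):
    if request.args.get("key") not in API_KEYS:
        abort(401)  # unregistered callers are turned away
    entry = ENTRIES.get(headword.lower())
    if entry is None:
        abort(404)
    # Every successful lookup is also a usage data point for the analysis above.
    app.logger.info("lookup\t%s\t%s", request.args["key"], headword.lower())
    return jsonify(entry)

if __name__ == "__main__":
    app.run()
```

A developer's call is then just GET /v1/entries/serendipity?key=demo-key-123, and every such call doubles as a usage data point for the kind of log analysis described earlier.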