Google is making it easier to search for instances of key words and phrases used by the US presidential candidates in videos uploaded to YouTube.
With the help of our speech recognition technologies, videos from YouTube’s Politicians channels are automatically transcribed from speech to text and indexed. Using the gadget you can search not only the titles and descriptions of the videos, but also their spoken content. Additionally, since speech recognition tells us exactly when words are spoken in the video, you can jump right to the most relevant parts of the videos you find.
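To make the idea of jumping to the spoken phrase concrete, here is a minimal sketch of searching a word-level, time-stamped transcript. It is not Google's implementation. The transcript layout, the `SAMPLE` data, the video id and the `search_phrase` function are all hypothetical, made up purely for illustration.

```python
# Hypothetical transcript format: video id -> list of (word, seconds_from_start).
# The words and timestamps below are invented sample data, not real recogniser output.
SAMPLE = {
    "obama_at_google": [
        ("we", 11.8), ("visited", 12.1), ("sri", 12.5), ("lanka", 12.9),
        ("last", 13.3), ("year", 13.6),
    ],
}

def search_phrase(transcripts, phrase):
    """Return (video_id, start_seconds) for every place the phrase is spoken."""
    target = phrase.lower().split()
    if not target:
        return []
    hits = []
    for video_id, words in transcripts.items():
        tokens = [word.lower() for word, _ in words]
        # Slide a window over the transcript and compare it to the phrase.
        for i in range(len(tokens) - len(target) + 1):
            if tokens[i:i + len(target)] == target:
                # The timestamp of the first matching word is the jump point.
                hits.append((video_id, words[i][1]))
    return hits
```

Calling `search_phrase(SAMPLE, "Sri Lanka")` returns the video id together with the offset of the word "sri", which is all a player needs to seek to the right moment. Of course, the result is only as good as the recognised words it searches.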
Thing is, it’s not quite there yet. Searching for Sri Lanka, for example, brings up Obama’s well-known reference to the country during his appearance at Google, but also brings up a wholly unrelated video:
In late 2006, during Strong Angel III, I saw real-time video transcription for the first time, and it was far more impressive: its accuracy was (at the time) higher than one would have expected. For archived video, I would have thought that, with all the computing power Google has at its disposal, the accuracy would be that much better.
It’s not as if the technology is new either. Podzinger used to do it, and very well, with audio (they’ve since made it into a full-blown corporate product). Blinkx, started by Sri Lankan Suranga Chandratillake, also seems to be doing it, though I’ve not used it myself. Search engines like PodScope are also based on similar technologies (the algorithms for text extraction from audio or video should be more or less the same).
As Google admits:
Speech recognition is a difficult problem that hasn’t yet been completely solved, but we’re constantly working to refine our algorithms and improve the accuracy and relevance of these transcribed results.
My interest in this type of technology is to resolve (data) conflicts by holding politicians, amongst other influential figures in authority, accountable for what they say by making it easier to search through archives of their public statements (Ameritocracy is a great text-only example of this). However, this technology can also give rise to conflict. The heading for this post comes from an error in the speech recognition that attributes “shit” to Barack Obama when he actually says “shift” (again during the Q & A session at Google HQ around 8 months ago).
Given that Barack speaks with an American accent and that the Google Gadget only indexes American politicians / public figures, I expected the algorithms to catch the nuances of inflection and delivery unique to the regions of that country.
Perhaps an idea would be to make this participatory, so that the search results can be qualified by users, thereby increasing the accuracy of the engine over time?
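As a rough sketch of what that participatory step might look like, the snippet below accepts a user-suggested word only once enough people agree on it. Everything here is an assumption for illustration: the `apply_corrections` function, the vote format and the `min_votes` threshold are hypothetical and say nothing about how Google's gadget actually works.

```python
from collections import Counter

def apply_corrections(words, votes, min_votes=3):
    """Return a copy of the transcript with well-supported user corrections applied.

    words: list of recognised words.
    votes: dict mapping a word position to the list of words users suggested for it.
    min_votes: hypothetical threshold a suggestion must reach before it is accepted.
    """
    corrected = list(words)
    for position, suggestions in votes.items():
        # Ignore empty vote lists and positions outside the transcript.
        if not suggestions or not 0 <= position < len(corrected):
            continue
        best, count = Counter(suggestions).most_common(1)[0]
        if count >= min_votes:
            corrected[position] = best
    return corrected

# The misrecognised word from this post, fixed once three users agree on it.
recognised = ["a", "shit", "in", "policy"]
user_votes = {1: ["shift", "shift", "ship", "shift"]}
fixed = apply_corrections(recognised, user_votes)
```

A threshold like this is the simplest guard against a single mischievous user rewriting what a candidate said. A real system would presumably need something sturdier.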
Using Safari on Leopard, I also found that I couldn’t go directly to the point in any video where the phrase was “found”, though this may be a kink with YouTube’s Flash player.
Google’s on to something here, and though it is certainly useful, there’s still a long way to go. And with the gazillion servers they have at their beck and call, I can’t think of any better company than Google to crack the audio / video transcription problem.
In the meantime, let’s just hope that McCain’s camp doesn’t get too worked up over what Barack didn’t say…