The team at the Internet Archive has taken on a project to capture, in near real time, the chyrons (the narrative text that appears at the bottom of the TV screen) of news networks such as CNN, FOX News, BBC and MSNBC. Amazingly, across just those networks, 4 million of these snippets were collected in only two weeks. But what is really brilliant is how they are actually collecting these snippets:
The work of the Internet Archive’s TV architect Tracey Jaquith, the Third Eye project applies OCR (ed: Optical Character Recognition) to the “lower thirds” of TV cable news screens to capture the text that appears there. The chyrons are not captions, which provide the text for what people are saying on screen, but rather are text narrative that accompanies news broadcasts.
Created in real-time by TV news editors, chyrons sometimes include misspellings. The OCR process also frequently adds another element where text is not rendered correctly, leading to entries that may be garbled. To make sense out of the noise, Jaquith applies algorithms that choose the most representative chyrons from each channel collected over 60-second increments. This cleaned-up feed is what fuels the Twitter bots that post which chyrons are appearing on TV news screens.
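The quote does not say which algorithms Jaquith uses, so the following is only a rough sketch of one way to pick a "most representative" chyron per channel per 60-second window. It assumes the OCR output arrives as hypothetical `(channel, unix_timestamp, text)` tuples, and it swaps in a simple stand-in technique: within each window, keep the string that is most similar on average to all the others (a medoid), using Python's standard `difflib`. The function names and record layout are illustrative, not part of the Third Eye project.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def most_representative(texts):
    """Return the text with the highest total similarity to the other texts.

    This is a medoid-style choice: a garbled OCR read resembles few of its
    neighbours, so it scores low and is unlikely to be picked.
    """
    def total_similarity(i):
        return sum(
            SequenceMatcher(None, texts[i], other).ratio()
            for j, other in enumerate(texts)
            if j != i
        )

    best_index = max(range(len(texts)), key=total_similarity)
    return texts[best_index]


def summarize(records, window_seconds=60):
    """Group OCR'd chyrons by (channel, time window) and keep one per group.

    `records` is an iterable of (channel, unix_timestamp, text) tuples.
    This layout is an assumption made for the sketch.
    """
    buckets = defaultdict(list)
    for channel, timestamp, text in records:
        text = text.strip()
        if text:  # skip empty OCR reads
            buckets[(channel, int(timestamp) // window_seconds)].append(text)

    return {
        key: most_representative(texts)
        for key, texts in sorted(buckets.items())
    }


if __name__ == "__main__":
    sample = [
        ("CNN", 0, "SENATE VOTES ON HEALTH BILL"),
        ("CNN", 10, "SENATE V0TES ON HEALTH BILL"),  # OCR misread O as 0
        ("CNN", 20, "SENATE VOTES ON HEALTH BILL"),
        ("CNN", 30, "xj#k 2l"),                      # garbled frame
        ("BBC", 5, "STORM HITS COAST"),
        ("CNN", 65, "MARKETS CLOSE HIGHER"),         # falls in the next window
    ]
    for (channel, window), text in summarize(sample).items():
        print(channel, window, text)
```

With the sample records, the clean reading of the CNN chyron wins its window because it agrees with more of its neighbours than the misread or garbled versions do. A production pipeline would likely need something more robust, but the grouping-then-voting shape is the same.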
The Internet Archive team has opened the collected chyrons up through an API for all to use, and has also turned them into a Twitter feed.
Someone is going to have a ton of fun with this.