The NewsReader project team at Vrije Universiteit Amsterdam set out to develop an architecture that can process daily news items as soon as they arrive. The team aims to extract information about events automatically and to link today’s news streams to information collected from earlier news articles and from complementary resources such as encyclopaedias or company profiles.

To this end, they apply state-of-the-art language technology to process texts from daily incoming news. This provides the basis for building a ‘history recorder’. With current technology, however, keeping track of the vast amount of information coming in daily takes a great deal of time. Processing one article typically takes about 6 minutes on one standard machine. With approximately 1 million new articles per day for English alone (from LexisNexis), on top of a historical backlog of millions of articles, this adds up to an enormous amount of data to process. One of the main challenges in this project therefore lies in scaling up linguistic processing and maximising the use of computational resources to manage the daily stream of incoming information.
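To give a feel for the scale, the figures above can be turned into a rough back-of-the-envelope estimate. The sketch below is only an illustration: it assumes that articles can be processed fully in parallel, that every machine runs around the clock with no overhead, and that the historical backlog is ignored. The function name `machines_needed` is hypothetical and not part of the NewsReader project.

```python
# Figures taken from the text above.
MINUTES_PER_ARTICLE = 6        # about 6 minutes per article on one standard machine
ARTICLES_PER_DAY = 1_000_000   # approx. 1 million new English articles per day

# Assumption: a machine is available 24 hours a day, with no idle time.
MINUTES_PER_DAY = 24 * 60


def machines_needed(articles_per_day: int, minutes_per_article: int) -> int:
    """Return the number of machines needed to keep up with one day's intake.

    Assumes perfect parallelism and ignores the historical backlog.
    """
    total_minutes = articles_per_day * minutes_per_article
    # Integer ceiling division: round up to a whole number of machines.
    return -(-total_minutes // MINUTES_PER_DAY)


total_machine_minutes = ARTICLES_PER_DAY * MINUTES_PER_ARTICLE
machines = machines_needed(ARTICLES_PER_DAY, MINUTES_PER_ARTICLE)

print(f"Machine-minutes per day: {total_machine_minutes:,}")
print(f"Machines needed to keep up: {machines:,}")
```

Under these simplifying assumptions, the daily English stream alone amounts to 6 million machine-minutes, or more than four thousand machines working continuously, before any of the backlog is touched.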

Read on the In The Field blog how SURFnet supported the team.