Architecture
Our technology has evolved over the last 20 years to become a robust and trusted platform for discovering, indexing and searching news content. It is widely used by thousands of users on a daily basis.
Challenges
Unlike traditional search, indexing
news and stories has different challenges. Content must be
discovered and indexed as close to real-time as possible.
Accuracy is a key challenge. The
platform must accurately identify and retrieve items in a
timely fashion without impacting the resources of other web
properties. Once the item is retrieved it must be deconstructed
to simple plain text by removing boilerplate templates. Meta
information such as the title, publication date, article text
is precisely extracted from a range of HTML sources. Not all
HTML is equal. The structure and standards used in pages can
vary greatly.