Machine Learning

Extracting Stories from News Websites for HOME

HOME wanted to develop a machine learning solution to extract stories from news websites and use these insights to inform marketing strategies and optimise bid management. Datatonic worked with them to improve their Object Detection model and create an end-to-end, fully automated, scalable GCP solution to drive their centralised search engine.

Impact
  • Improved model performance from ~59% to ~75% mAP@.50
  • Reduced training time from hours to <100 minutes

The Challenge

HOME is a strategic marketing agency that grows brands by helping them gain and maintain attention. HOME does this by blending business consultancy know-how with marketing agency delivery.

As part of their ongoing efforts to bring state-of-the-art technology to their clients, HOME is developing a platform (called ‘Sensible’) which renders external data, such as media pages or real-world images from multiple sources, into a centralised search engine. The original prototype for this platform focuses on extracting news stories, with their title and image content, from news websites via Computer Vision. The insights collected by this machine learning process will be consumed internally by the strategy teams and delivered directly to clients and to the DoubleClick media platform to optimise bid management.

HOME wanted to build out the current prototype Computer Vision model into an automated, performant and scalable machine learning pipeline on Google Cloud Platform, which would allow them to move quickly towards a deployed MVP for Sensible. HOME had successfully built a draft TensorFlow Object Detection model, despite the absence of in-house ML expertise. However, this model was not production-ready, and none of the surrounding processes, from data quality and augmentation to model training and serving, had been defined. They needed to move fast from this R&D environment to an improved, production-ready solution to successfully drive the first release of Sensible.

Our Solution

HOME wanted to leverage our expertise in developing AI pipelines in GCP to quickly create a state-of-the-art Object Detection solution which would require minimal maintenance, while still allowing for iterative improvements and additions. To help HOME achieve their goals, the Datatonic team developed an end-to-end, fully automated model training and serving pipeline in 3 weeks, with training orchestrated by Google Cloud Composer and serving by Google Cloud Functions.

The training framework enables the HOME team to automatically re-train the model every week on newly predicted data hosted on Google Cloud Storage.

The serving framework enables the HOME team to automatically label and serve new screenshots of newspapers’ main pages as they land on Google Cloud Storage, gathering all relevant information for serving in a BigQuery table. For each story, this information includes: detection bounding boxes, width, height, area, prominence score, prediction certainty value, origin website, image attached to article header, text, keywords and sentiment. The last five elements are derived with Cloud APIs: the Cloud Vision API is used to extract the origin website, text and presence of an image; the Cloud Natural Language API is used to extract keywords and sentiment.
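
To make the per-story record more concrete, here is a minimal Python sketch of how a single detection could be turned into a flat row carrying the geometry fields listed above. It is an illustration under stated assumptions only: the `StoryRow` class, the field names and, in particular, the prominence formula (share of the screenshot covered by the box, weighted towards the top of the page) are hypothetical and are not the definitions used in the project.

```python
from dataclasses import dataclass, asdict


@dataclass
class StoryRow:
    """One detected story as a flat row (hypothetical schema)."""
    origin_website: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    width: float
    height: float
    area: float
    prominence: float
    certainty: float


def build_story_row(box, certainty, page_width, page_height, origin_website):
    """Derive width, height, area and an assumed prominence score.

    `box` is a pixel bounding box (x_min, y_min, x_max, y_max).
    """
    x_min, y_min, x_max, y_max = box
    width = x_max - x_min
    height = y_max - y_min
    area = width * height

    # Assumed prominence: fraction of the page covered by the box, scaled
    # down the further the box starts from the top of the screenshot.
    # This formula is a placeholder, not the real definition.
    area_share = area / (page_width * page_height)
    top_weight = 1.0 - (y_min / page_height)
    prominence = area_share * top_weight

    return StoryRow(origin_website, x_min, y_min, x_max, y_max,
                    width, height, area, prominence, certainty)


row = build_story_row(
    box=(0, 0, 600, 400),
    certainty=0.92,
    page_width=1200,
    page_height=4000,
    origin_website="example.com",
)
print(asdict(row))
```

A dictionary produced by `asdict(row)` is the kind of JSON-like record that could then be streamed into a BigQuery table, for example with the Python client's `insert_rows_json` method; the remaining fields (text, keywords, sentiment and so on) would be added from the Cloud API responses.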

The key component of the solution is the fine-tuning of the TF Object Detection model: the performance boost from 59% to 75% mAP@.50 is mainly due to optimised training parameter selection and an improved data augmentation procedure. Both were products of extensive R&D, which also reduced training time from hours to under 100 minutes.
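
For readers unfamiliar with the metric: mAP@.50 counts a predicted box as correct when its intersection-over-union (IoU) with a ground-truth box is at least 0.5, and then averages the resulting per-class average precision. The sketch below shows the IoU test and a simple single-class, single-image average precision in plain Python. The function names, the greedy matching rule and the toy boxes are illustrative assumptions; this is not the evaluation code used in the project.

```python
def iou(a, b):
    """Intersection-over-union of two boxes (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def average_precision(predictions, ground_truth, iou_threshold=0.5):
    """Single-class AP for one image.

    predictions: list of (score, box); ground_truth: list of boxes.
    """
    if not ground_truth:
        return 0.0

    # Greedily match predictions, highest score first, to unmatched boxes.
    matched = set()
    hits = []
    for _score, box in sorted(predictions, key=lambda p: p[0], reverse=True):
        best_iou, best_idx = 0.0, None
        for idx, gt in enumerate(ground_truth):
            if idx in matched:
                continue
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        is_hit = best_idx is not None and best_iou >= iou_threshold
        if is_hit:
            matched.add(best_idx)
        hits.append(is_hit)

    # Precision and recall after each ranked prediction.
    precisions, recalls = [], []
    true_positives = 0
    for rank, hit in enumerate(hits, start=1):
        true_positives += hit
        precisions.append(true_positives / rank)
        recalls.append(true_positives / len(ground_truth))

    # Interpolate: make precision non-increasing as recall grows.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Area under the interpolated precision-recall curve.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


ground_truth = [(0, 0, 100, 100), (200, 200, 300, 300)]
predictions = [
    (0.9, (0, 0, 100, 80)),        # overlaps the first story
    (0.8, (400, 400, 500, 500)),   # overlaps nothing
    (0.7, (210, 210, 300, 300)),   # overlaps the second story
]
ap = average_precision(predictions, ground_truth)
print(round(ap, 4))
```

Averaging this quantity over all classes and images gives the mAP figure; raising `iou_threshold` makes the metric stricter about how tightly a predicted box must fit.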

What's Next?

The team at HOME has successfully developed a first MVP for the Sensible platform, and has since built upon the model thanks to the expertise gathered through our policy of collaboration, knowledge-sharing and cross-fertilisation of approaches taken and decisions made.

Testimonial

Phill Midwinter, Technology Director at HOME, commented:

“The work delivered by Datatonic has been first rate. They took our Proof of Concept to the next level and delivered far better results than we’d hoped for in an extremely short period of time. At each step they helped us to realise the potential of new approaches within Google Cloud Platform. That training has been just as important as the project’s delivery. Since their direct involvement finished we’ve pushed performance to 96% based upon their direction. We’ve also been able to bring the skills in this project to bear with other clients across GCP.”