Big Data Experiment

May 1, 2015 Uncategorized by colinalb

inFood – Using public data to uncover nutrition secrets

Comparing food products by looking at the package information is a bit harder than it might seem

Background

Everyone’s talking Big Data, Hadoop, Data Warehouses, Business Intelligence and AI, but what is it all about, really? I have managed IT projects in this field (successfully), but I wanted to add depth to my knowledge and, at the same time, have a play with some of the technologies, just to see what I could achieve without big expense.

I homed in on the fact that the nutritional information on food packaging is somewhat misleading. It is actually very hard for the average person to get a good understanding of whether one thing is healthier than another. I searched the web and found a few sites, but not one could answer these sorts of questions:

The Types of Questions to be Answered:

Compare nutritional value of products in your shopping basket
Which “drink” has the most selenium/aspartame/sugar?
Which ginger biscuits have the least sugar?
Which cereal is healthiest?
Can I compare apples with oranges?

The Mission

So I set about building a website/application that would bring together food data from different sources, so that anyone could put together a side-by-side, visually informative comparison of the nutritional values of similar and dissimilar food products.

It’s all about the data

Getting, cleaning, normalising, validating, organising and refreshing the data

I have opinions on many things but as I know nothing about nutrition, the data would have to speak for itself.

Getting hold of the data turned out to be less than straightforward. Government-provided data has come from two sources so far: the UK Food Standards Agency and the US Food and Drug Administration. But most people want to compare supermarket foods. I concluded that the most effective way to get the supermarket data is to scrape it off the websites. This has been a journey of several wrong turns, but I got there eventually. I have put together a fast and scalable distributed web scraper. (I’ll post about this another day.)

Data Consistency

The consistency of nutritional information is a big problem: even two companies selling the same product may display the details differently. One company might include saturated fat details when another does not, even for the same product.

Labelling

Two companies will label the same attributes on the same product inconsistently.
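
As a rough illustration of one way to tame inconsistent labels (a sketch only, not the code behind inFood), each scraped label can be mapped onto a single canonical attribute name. The synonym table, the canonical names and the function names below are all assumptions made up for this example.

```python
# Hypothetical synonym table: these labels and canonical names are
# illustrative assumptions, not taken from any real retailer's site.
CANONICAL_LABELS = {
    "fat": "fat",
    "total fat": "fat",
    "saturates": "saturated_fat",
    "of which saturates": "saturated_fat",
    "saturated fat": "saturated_fat",
    "carbohydrate": "carbohydrate",
    "carbohydrates": "carbohydrate",
    "sugars": "sugars",
    "of which sugars": "sugars",
    "fibre": "fibre",
    "fiber": "fibre",
    "salt": "salt",
}


def canonical_label(raw_label):
    """Return the canonical attribute name, or None if the label is unknown."""
    # Lower-case, treat hyphens as spaces, collapse whitespace, drop colons.
    key = " ".join(raw_label.lower().replace("-", " ").split()).strip(" :")
    return CANONICAL_LABELS.get(key)


def normalise_labels(raw_attributes):
    """Split scraped attributes into recognised and unrecognised labels."""
    known, unknown = {}, {}
    for raw_label, value in raw_attributes.items():
        name = canonical_label(raw_label)
        if name is None:
            unknown[raw_label] = value  # keep for manual review
        else:
            known[name] = value
    return known, unknown
```

Keeping the unrecognised labels, rather than silently dropping them, makes it easier to spot new spellings that should be added to the table.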

Normalisation

I looked at how to normalise the data so that side-by-side nutritional comparisons of dissimilar products would be meaningful. To make it harder, one biscuit vendor might publish only the sugar content of a single 25g biscuit, whereas another will only publish the values per 100g.
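
The biscuit case comes down to simple rescaling. Here is a minimal sketch, assuming every value is given in grams and the basis weight is known; the function name and the sugar figures are invented for illustration.

```python
def per_100g(values, basis_grams):
    """Rescale nutrient amounts stated per `basis_grams` of product to per 100 g."""
    if basis_grams <= 0:
        raise ValueError("basis_grams must be positive")
    factor = 100.0 / basis_grams
    return {name: round(amount * factor, 2) for name, amount in values.items()}


# Made-up figures: vendor A publishes per 25 g biscuit, vendor B per 100 g.
biscuit_a = per_100g({"sugars": 6.0}, basis_grams=25)
biscuit_b = per_100g({"sugars": 21.5}, basis_grams=100)
```

Once both products are on the same 100g basis, the two sugar figures can be compared directly.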

Portion sizes

The food manufacturers are sometimes a bit crafty with their portion sizes and how these are displayed on the back of the pack. I haven’t solved this one fully yet, but I am getting there. A tougher one is when cereal manufacturers only report the nutritional content of their product when combined with milk.

Data Reputation

There is published government data (from most countries) on food content, and I felt it was necessary to include this. Again, normalisation involved some data manipulation.

Data Recency

Fortunately, food ingredient data changes infrequently, so I am not under pressure to update the data every few days. However, if the data is not fresh enough, the perceived value of the inferences is undermined.
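
One simple way to keep an eye on freshness is to flag any product whose data is older than some threshold. The sketch below assumes each product record carries the date it was last scraped; the 90-day limit, the field names and the function names are assumptions for illustration, not how inFood actually does it.

```python
from datetime import date, timedelta

MAX_AGE = timedelta(days=90)  # assumed threshold; tune to taste


def is_stale(last_scraped, today, max_age=MAX_AGE):
    """True if the record was last refreshed more than `max_age` ago."""
    return today - last_scraped > max_age


def stale_products(products, today):
    """Return the names of products that are due a refresh."""
    return [p["name"] for p in products if is_stale(p["last_scraped"], today)]
```

The list of stale names could then be fed back to the scraper as its next batch of work.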

Data Provenance

It is important to show the provenance of the data, to reassure users that this is not just being made up.
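
In practice this can be as simple as storing where and when each value was obtained, and showing that next to the figures. A minimal sketch, with a hypothetical record layout and field names of my own choosing:

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Provenance:
    """Where a product's nutritional values came from (hypothetical layout)."""
    source_name: str    # e.g. the agency or retailer that published the data
    source_type: str    # assumed categories: "government" or "retailer"
    retrieved_on: date  # the day the data was downloaded or scraped

    def citation(self):
        """One-line attribution suitable for display beside a chart."""
        return (f"Source: {self.source_name} ({self.source_type}), "
                f"retrieved {self.retrieved_on.isoformat()}")
```

For example, `Provenance("UK Food Standards Agency", "government", date(2015, 4, 1)).citation()` gives a short attribution line that can sit under a comparison chart.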

Technologies Used

PHP
MySQL
Zurb Foundation responsive web framework
nodejs
Redis
Google Charts and Highcharts

Techniques Used

Focus on answering the questions that nobody else is able to answer.
Do so with as few clicks as possible.
Develop for all three screen sizes simultaneously (desktop, tablet, phone).
Extensive automated unit test scripts that run across all of the data.
Heavy use of existing libraries to avoid writing too much code myself.
Continuous code refactoring – and keep the code DRY (don’t repeat yourself)
Measure code complexity, keep McCabe complexity below 10.
Have fun with it!
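
To give a flavour of the automated tests that run across all of the data, here is a sketch of the kind of sanity check that can be applied to every product. It assumes the values have already been normalised to grams per 100g and use the attribute names shown; the checks, the tolerance and the function names are illustrative assumptions rather than the actual inFood test suite.

```python
TOLERANCE_G = 0.5  # assumed allowance for rounding on the label


def validate_per_100g(record):
    """Return a list of problems found in one per-100 g nutrition record."""
    problems = []
    for name, grams in record.items():
        if grams < 0:
            problems.append(f"{name} is negative")
    # A sub-component cannot exceed the total it belongs to.
    if record.get("sugars", 0) > record.get("carbohydrate", float("inf")) + TOLERANCE_G:
        problems.append("sugars exceed carbohydrate")
    if record.get("saturated_fat", 0) > record.get("fat", float("inf")) + TOLERANCE_G:
        problems.append("saturated fat exceeds fat")
    # 100 g of food cannot contain more than 100 g of macronutrients.
    macros = sum(record.get(k, 0) for k in ("fat", "carbohydrate", "protein"))
    if macros > 100 + TOLERANCE_G:
        problems.append("macronutrients exceed 100 g per 100 g")
    return problems


def validate_all(products):
    """Run the checks over every product; return only those with problems."""
    failures = {}
    for name, record in products.items():
        problems = validate_per_100g(record)
        if problems:
            failures[name] = problems
    return failures
```

Checks like these will not prove that a value is right, but they do catch scraping and normalisation mistakes before a user ever sees them.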

Lessons Learned

I stuck with Highcharts for a while, but when I came across a problem with time-based data, I ended up using Google Charts. I now have both.
Effective web scraping for big data is not for the faint-hearted – it takes perseverance.
Tuning the data search configuration is a journey, not a task.

Conclusion So Far

It is all about the data, and there is real insight to be gained from presenting existing data in a new way. The data must be accurate and complete or user confidence is undermined. At this stage I only have 22,000 products, hardly big data, but the data that I do show is correct.

There are some extremely powerful tools available for negligible cost to help process and present big data in a way that can deliver considerable value. If this experiment changes a few people’s dietary habits towards better foods, then I will be delighted.

What next?

I just learned that Google (quietly) launched a food comparison capability in 2014. At first I was disheartened by this, but my view changed when I learned that a) they did so because of the large volume of pertinent search traffic (people often search for ‘compare food A with food B’) and b) it only uses the government-sourced data. We shall see what happens as I continue to improve the user experience and the range of food data. I have been asked by a few people about monetisation. We shall just have to see whether this experiment becomes a lucrative endeavour or not…