Gaussian Mixture Models to Help Solve America's Hunger Problem
In April 2019, 84.51° was named a finalist for Fast Company's World Changing Ideas in the AI+Data Category for its data science behind Kroger's Zero Hunger | Zero Waste initiative. This award celebrates businesses, policies, and nonprofits that are poised to help shift society toward a more sustainable and equitable future. Read more about the award and see the full list of finalists, honorable mentions, and judges here. -Natalia Connolly, Data Scientist
There is a strange imbalance going on within the food supply and demand system. According to the UN’s Food and Agriculture Organization, roughly one third of food is thrown away around the world – by grocery stores, restaurants, and consumers like you and me. At the same time, an estimated 42 million Americans are food insecure, and 1 in 6 children go to bed hungry, per the Department of Agriculture. Something simply does not add up!
In response to this issue, Kroger recently launched the Zero Hunger Zero Waste initiative; in pursuit of limiting waste and solving hunger, both Kroger and 84.51˚ are using big data tools to better understand what’s really happening. In August 2017, analysts from Kroger and 84.51˚ got together for a hackathon to work toward this goal. A dozen or so teams attended, each producing a unique analysis and proposing a different solution.
I participated in this hack day as part of a team that approached the problem using a Gaussian mixture algorithm. We modeled both need (food shortage) and waste (food surplus) by location, centering Gaussian distributions on the latitude/longitude coordinates of each county in the data.
To begin, let’s look at a sample of our food shortage data. We have information on each county’s area in square meters, population, and food insecurity rates:
Need data sample
The mean of each Gaussian distribution was a latitude, longitude pair from the table above. The covariance of each distribution was set by the county’s area in square meters, and the normalization by the county’s food insecurity rate.
If you’re not familiar with covariance and normalization, let me explain. For a one-dimensional normal distribution, variance is a measure of the spread of a set of numbers around their mean. For two-dimensional (or multivariate in general) normal distributions, variance generalizes to a covariance matrix, which describes both the spread of each variable and how the variables vary together. Normalization is a constant that controls the “height” of the bell curve.
Generally speaking, a two-dimensional Gaussian looks like this:
Example of a 2D Gaussian
So we create one of these Gaussians for each county in our data and aggregate them over a grid covering the entire United States. We now have a probability density function (PDF) that is a sum of normal distributions, constructed as above, and we can evaluate it across the grid.
Here’s how this works: imagine that you have a one-dimensional normal distribution centered on zero (mean = 0) with a variance of 1. If you wanted to visualize it (make a curve y(x)), you would specify some range of x’s (e.g., from -5 to 5) and then step across that range with some increment (e.g., 0.1) and calculate the corresponding y value from the distribution. You would eventually get your familiar bell-shaped curve with low values of y at large x’s and high values of y near x = 0.
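The one-dimensional stepping procedure described above can be sketched in a few lines of NumPy (the range, increment, and parameters are just the illustrative values from the paragraph, not the hackathon's actual code):

```python
import numpy as np

def gaussian_1d(x, mean=0.0, var=1.0):
    """Evaluate a one-dimensional normal PDF at x."""
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Step across the range [-5, 5] with an increment of 0.1
xs = np.arange(-5, 5.1, 0.1)
ys = gaussian_1d(xs)

# Plotting ys against xs gives the familiar bell curve:
# low y at large |x|, peaking at x = 0
```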
In the same way we can figure out the shape of a multi-dimensional Gaussian distribution. In our case, we have two dimensions; so for every point (x,y) – which corresponds to a geographical location somewhere in the United States – we get a value from our probability density function. Instead of a single bell curve that peaks at x = mean, we now get a surface that bulges up near the means of the constituent Gaussians. For each point in the grid we evaluate the resulting probability density function. For example, if we only had two counties, the PDF might look something like this:
Example of adding two 2D Gaussians
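A minimal sketch of the two-county case using `scipy.stats.multivariate_normal` might look like the following. The county centers, covariances, and weights here are invented for illustration (loosely Detroit-like and Houston-like coordinates), not values from our actual data:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical counties: (lon, lat) center, covariance scale, and a
# food-insecurity-rate weight as the normalization -- illustrative only
counties = [
    {"center": [-83.0, 42.3], "cov": 0.5, "weight": 0.18},
    {"center": [-95.4, 29.8], "cov": 0.8, "weight": 0.15},
]

# A grid covering the region of interest
lons = np.linspace(-100, -80, 200)
lats = np.linspace(25, 50, 200)
lon_grid, lat_grid = np.meshgrid(lons, lats)
points = np.dstack([lon_grid, lat_grid])  # shape (200, 200, 2)

# Sum the weighted Gaussians into one aggregate PDF surface
pdf = np.zeros(lon_grid.shape)
for c in counties:
    rv = multivariate_normal(mean=c["center"], cov=c["cov"] * np.eye(2))
    pdf += c["weight"] * rv.pdf(points)

# pdf now bulges up near each county center; plotting it as a surface
# or heatmap reproduces the two-bump picture described above
```

The same loop extends directly to every county in the dataset; the surface simply gains one bump per county.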
Using this approach on the entire map, we highlighted peaks in the distribution across the country. These peaks represent areas of greatest need.
Food scarcity map
The highlighted areas make sense: Detroit, Houston, and Atlanta, among others. All are areas where one would expect to find high levels of food scarcity, as they are major metropolitan areas with significant socio-economic challenges.
Now let’s see which Kroger stores have the largest amount of surplus food. Here is the dataframe for the latitude and longitude of the stores, as well as the amount of waste in pounds:
Food waste data sample
The normalizations of the Gaussians are now the waste fractions, and for the covariance we assume that each Kroger store can service an area of 30 miles x 30 miles. The map of food surplus then looks like this:
Food surplus map
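One practical detail is turning a 30-mile service area into a covariance expressed in degrees of latitude and longitude. Here is one way to sketch that conversion (the helper name and the choice to treat the service radius as one standard deviation are my assumptions, not the hackathon's exact code):

```python
import numpy as np

def service_cov(lat_deg, service_miles=30.0):
    """Approximate a diagonal covariance (in squared degrees) for a
    store's service area. Hypothetical helper: one degree of latitude
    is ~69 miles, and longitude degrees shrink with latitude."""
    miles_per_deg_lat = 69.0
    miles_per_deg_lon = 69.0 * np.cos(np.radians(lat_deg))
    # Treat the service radius as one standard deviation per axis
    sigma_lat = service_miles / miles_per_deg_lat
    sigma_lon = service_miles / miles_per_deg_lon
    return np.diag([sigma_lon ** 2, sigma_lat ** 2])

cov = service_cov(42.3)  # e.g. a store at Detroit's latitude
```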
One can clearly see Los Angeles showing up as a significant source of food surplus. Interestingly, many of the areas where we saw the largest need are also at least somewhat close to the areas of the biggest food surplus.
Given this information, one could now easily connect areas of need with nearby areas of surplus. One could go even further: given the cost to transport a vehicle load of food, what is the surplus-to-scarcity flow as a function of cost and fuel?
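As a first cut, connecting need with surplus could be as simple as matching each scarcity peak to its nearest surplus store by great-circle distance. The coordinates and store names below are hypothetical placeholders, purely to illustrate the matching step:

```python
import numpy as np

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/lon points."""
    r = 3958.8  # mean Earth radius in miles
    p1, p2 = np.radians(lat1), np.radians(lat2)
    dlat = p2 - p1
    dlon = np.radians(lon2 - lon1)
    a = np.sin(dlat / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin(dlon / 2) ** 2
    return 2 * r * np.arcsin(np.sqrt(a))

# Hypothetical need peaks and surplus stores as (lat, lon) pairs
need_peaks = {"Detroit": (42.33, -83.05), "Houston": (29.76, -95.37)}
surplus_stores = {"store_A": (42.50, -83.20), "store_B": (29.90, -95.60)}

# Match each need peak to the closest surplus store
matches = {}
for city, (nlat, nlon) in need_peaks.items():
    matches[city] = min(
        surplus_stores,
        key=lambda s: haversine_miles(nlat, nlon, *surplus_stores[s]),
    )
```

Extending this from nearest-neighbor matching to a transport-cost-weighted flow would be the natural next step.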
While we did not have enough time during the Zero Hunger Zero Waste hackathon to pursue this much further, the conversation is ongoing. As a first step, the Zero Hunger Zero Waste team selected Detroit as a test market based on the resulting 3-dimensional map. Other markets under consideration are Atlanta and Houston.
There has been a lot of interest in building on this model by incorporating additional relevant data inputs. These inputs include Kroger customer and associate government assistance rates, as well as pantry locations within the Feeding America network to build out the cost-to-transport component. The next step is to leverage an intern project this summer specifically dedicated to ZHZW.