By Adam Grossman.
How we trained our robots to clean radar images for Dark Sky.
Removing noise from radar images has turned out to be one of the biggest technical challenges we've faced while developing Dark Sky's forecasting system.
We generate our short-term rain predictions using data from the National Weather Service's network of 150 NEXRAD radar stations, spread across the US. This data arrives in a raw format, and the very first thing we do after grabbing it is convert it to images. The problem is that the data is filled with noise -- false data that looks like rain, but is really something else.
The noise can be caused by many factors: ground clutter, hazy / humid conditions, or even swarms of birds and insects. Oftentimes most of the radar image is noise rather than actual precipitation. If we didn't remove it, we'd be predicting rain for thousands of square miles that were actually dry.
We recently made some changes to how we identify and remove noise from radar images, so now might be a good time to explain how it works...
Step 1 - Is it even supposed to be raining?
With the launch of Forecast, we have a powerful new tool at our disposal. Forecast aggregates a large number of weather models from around the world, and can give us a good sense of where in the world we should expect rain for the next week. We use this data to generate a map of where rain is likely to be occurring right now:
Blue regions are areas where rain is likely to be occurring, yellow regions are areas where rain is unlikely but may occur, and clear regions are areas where we are confident no rain is currently happening. If a radar station is located in a clear region, we can aggressively remove everything from the image, since we know anything that shows up is likely to be noise. If the radar station overlaps a blue or yellow region, we move on to the next step...
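To make the gating idea concrete, here is a minimal Python sketch. It is not our production code: the `RainPrior` categories, the grid-of-cells map, and the `footprint` list of cells covered by a station are all names invented for illustration.

```python
from enum import Enum


class RainPrior(Enum):
    CLEAR = 0     # confident that no rain is happening
    UNLIKELY = 1  # "yellow": rain is unlikely but may occur
    LIKELY = 2    # "blue": rain is likely


def first_pass(image, prior_map, footprint):
    """Blank the radar image if the station's whole footprint is CLEAR.

    image      -- 2D list of pixel intensities (0 means no echo)
    prior_map  -- 2D list of RainPrior values (hypothetical layout)
    footprint  -- iterable of (row, col) cells of prior_map that the
                  station covers (an assumption of this sketch)

    Returns (image, needs_classification).
    """
    all_clear = all(prior_map[r][c] is RainPrior.CLEAR for r, c in footprint)
    if all_clear:
        # Nothing here should be rain, so everything that shows up is noise.
        return [[0] * len(row) for row in image], False
    # Some rain is plausible: keep the image and go on to segmentation.
    return image, True
```

Only stations that come back with `needs_classification` set to `True` need the more expensive steps that follow.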
Step 2 - Blob segmentation
The first thing we need to do is take the image and isolate all the radar blobs. Each blob is a self-contained region of either "signal" (i.e. precipitation), or "noise". At this stage, we're not concerned with identifying which is which; we simply want to split up the image into chunks that we can deal with individually.
Segmentation is complicated by the fact that you'll often find noise regions overlapping actual storms, which means we can't simply isolate the blobs based on whether or not they're surrounded by black:
Over the past year, we've developed a bunch of heuristics for separating out the blobs, and they generally work pretty well. We also rely heavily on the computer vision library OpenCV for a number of algorithms pertaining to contour identification.
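For reference, the naive baseline (splitting the image wherever blobs are surrounded by black) can be sketched in plain Python as connected-component labelling. This is only the starting point that our heuristics improve on, not the heuristics themselves, and the function name and `threshold` parameter are made up for the example.

```python
from collections import deque


def label_blobs(image, threshold=0):
    """Label 8-connected regions of pixels brighter than `threshold`.

    Returns (labels, count): `labels` has the same shape as `image`,
    with 0 for background and 1..count for each blob.
    """
    rows, cols = len(image), len(image[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if image[r][c] <= threshold or labels[r][c]:
                continue  # background, or already part of a blob
            count += 1
            labels[r][c] = count
            queue = deque([(r, c)])
            while queue:  # breadth-first flood fill from the seed pixel
                y, x = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] > threshold
                                and not labels[ny][nx]):
                            labels[ny][nx] = count
                            queue.append((ny, nx))
    return labels, count
```

A noise region that touches a real storm comes out of this as a single blob, which is exactly the case where extra heuristics are needed.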
Step 3 - Neural net classification
Once we've got a bunch of individual blobs, we can look at each one and determine whether or not it is likely to be noise. We do this by collecting a wide range of statistics on the blob: how fast it's moving (using OpenCV's optical flow algorithms), the temperature and humidity in the area, the storm intensity (i.e. how bright the pixels are), a histogram of intensity, whether or not the blob overlaps the center of the radar station (an area where most noise is found), etc.
We even use a metric I call the "gerry index", which is normally used to determine whether or not congressional districts have been gerrymandered, by measuring how squiggly their artificial boundaries are. We've found that rain blobs tend to be more squiggly than noise blobs, thus more closely resembling districts that have been re-zoned by corrupt politicians.
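As an illustration, one common compactness measure from the gerrymandering literature is the Polsby-Popper score, 4πA/P², where A is the area and P the perimeter. Whether the "gerry index" is exactly this formula is an assumption of the sketch below; the function name and the representation of a blob as a set of `(row, col)` cells are also invented for the example.

```python
import math


def polsby_popper(cells):
    """Compactness of a blob given as an iterable of (row, col) cells.

    Area is the number of cells; perimeter is the number of cell edges
    that face a cell outside the blob. The score falls towards 0 as the
    outline gets squigglier. Because the outline is measured along cell
    edges, scores are only meaningful relative to each other.
    """
    cells = set(cells)
    if not cells:
        raise ValueError("blob must contain at least one cell")
    area = len(cells)
    perimeter = sum(
        1
        for (r, c) in cells
        for (dr, dc) in ((1, 0), (-1, 0), (0, 1), (0, -1))
        if (r + dr, c + dc) not in cells
    )
    return 4 * math.pi * area / perimeter ** 2


square = {(0, 0), (0, 1), (1, 0), (1, 1)}  # compact outline
line = {(0, 0), (0, 1), (0, 2), (0, 3)}    # same area, longer outline
print(polsby_popper(square) > polsby_popper(line))
```

Under this measure a lower score means a squigglier blob, so rain blobs would tend to score lower than noise blobs.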
Once we've compiled these stats, we feed them into a neural network (built with the FANN C Library): a tiny little 25-neuron "brain" that has been trained to classify radar blobs as either signal or noise. In order for this to work, we need to train the neural net by hand, using thousands of blobs that we've classified ourselves. This is an ongoing process, and I review hundreds of images every couple of weeks to keep the dumb little robot as up-to-date as possible (the most boring part of my job, by far!).
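To show what "feeding stats into a neural net" amounts to, here is a minimal forward pass for a one-hidden-layer network in plain Python. It does not use FANN, and the layer layout, sigmoid activations, and 0.5 cut-off are assumptions made for the sketch. In practice the weights would come from training on the hand-labelled blobs.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def noise_probability(features, hidden_weights, hidden_biases,
                      output_weights, output_bias):
    """Run one blob's feature vector through a small feedforward net.

    hidden_weights -- one list of weights per hidden neuron
    output_weights -- one weight per hidden neuron
    Returns a value in (0, 1); higher means "more likely noise".
    """
    hidden = [
        sigmoid(sum(w * f for w, f in zip(weights, features)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden))
                   + output_bias)


def is_noise(features, net, cutoff=0.5):
    """`net` is a tuple of the four weight/bias arguments above."""
    return noise_probability(features, *net) > cutoff
```

The feature vector would hold the per-blob statistics described above (speed, intensity histogram, overlap with the station center, and so on), scaled to comparable ranges.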
If we determine that a blob is noise, we remove it from the image. Here is a before and after composite image of the entire Continental United States:
Step 4 - Almost done
After we do this, there's still some amount of noise left in the image, mostly stray random speckles or Evil Death Rays caused by something momentarily obstructing the radar beam. We have separate little algorithms for cleaning up each of these, and we're constantly tweaking them or adding more as time marches on.
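As an example of the kind of clean-up pass this involves, the sketch below removes stray speckles by dropping any lit pixel with too few lit neighbours. It is a simple stand-in rather than our actual algorithm, and the function name and `min_neighbors` parameter are made up for illustration.

```python
def remove_speckles(image, min_neighbors=1):
    """Zero out lit pixels with fewer than `min_neighbors` lit neighbours.

    Neighbours are the up-to-8 surrounding pixels. Counts are taken from
    the original image, so the result does not depend on scan order.
    """
    rows, cols = len(image), len(image[0])

    def lit_neighbors(r, c):
        return sum(
            1
            for dr in (-1, 0, 1)
            for dc in (-1, 0, 1)
            if (dr or dc)
            and 0 <= r + dr < rows
            and 0 <= c + dc < cols
            and image[r + dr][c + dc] > 0
        )

    return [
        [
            image[r][c]
            if image[r][c] > 0 and lit_neighbors(r, c) >= min_neighbors
            else 0
            for c in range(cols)
        ]
        for r in range(rows)
    ]
```

Raising `min_neighbors` makes the filter more aggressive, at the cost of nibbling away at the edges of real rain.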
After all the cleaning is done, we pass the image on down the pipeline to be analysed by the actual forecasting system.
This entire process has to happen quickly; we constantly pull in new data, and we need to make sure it gets processed in real time. In all, it takes between 1 and 2 seconds to convert a raw radar data file to a cleaned image, and we dedicate 16 cores' worth of virtual servers to the task.
It never works absolutely perfectly, and it's a constant battle to make sure we aren't leaving too much noise or -- worse -- removing actual rain, but after nearly two years of constant refinement I think we've made significant progress. After all, none of Dark Sky would be possible without this crucial step!