Finding sources of data: extracting data from the published literature

Connecting a mathematical model to reality usually involves comparing the model to data, and finding the parameter values that make the model most closely match the observations. And of course, statistical models are developed entirely from data.

Becoming adept at finding sources of data relevant to a model you are studying is a learned skill, but unfortunately one that isn’t taught in any textbook!

One thing to keep in mind is that any data that appears in a journal publication is fair game to use, even if it appears in graphical format only.  If the data is in graphical format, there are free programs, such as DataThief, that can be used to extract the data into a numerical file.

Let’s take an example: in the paper “Comparative estimation of the reproduction number for pandemic influenza from daily case notification data” by Chowell et al., Figure 1 shows daily case notification data for Spanish Flu hospitalizations in San Francisco in 1918:




To extract the data from this graph, first download DataThief to your laptop from the DataThief website.  Then download the JPEG of the figure from the HTML version of the paper (or, alternatively, take a screenshot of the figure in the paper).

Now start up DataThief, click on File->Open, and select the figure file in the file browser.  You should now see the figure in the DataThief control window, with three colored points (red, green, and blue) that each look like a circle with an X through it (there is another blue point that looks like a circle with a + through it… ignore that one for now).  Drag one of the circles to the intersection of the x and y axes. Drag another one to the upper value on the x axis, and the third one to the upper value on the y axis.  At the top left of the control window you will see Ref 0 in red, Ref 1 in blue, and Ref 2 in green.  This is where you enter the x,y coordinates, in the coordinate system of the figure, of each of the three datum points that you just placed.  Those three points define the coordinate system of the figure, and now if you move the blue point with the + through it around the figure, DataThief will tell you where it is in the coordinate system of the figure.
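To see why three datum points are enough, here is a minimal Python sketch of the idea. It is not DataThief's actual code; it just assumes both axes are linear, so that the map from pixel position to figure coordinates is an affine transformation, which three non-collinear points determine completely. The function name and all of the pixel and axis values below are made up for illustration.

```python
import numpy as np

def make_pixel_to_data(pixel_refs, data_refs):
    """Build a function mapping pixel (px, py) to figure coordinates (x, y).

    pixel_refs, data_refs: three (x, y) pairs each, in matching order.
    Assumes linear axes (a log axis would need a log transform first).
    """
    P = np.asarray(pixel_refs, dtype=float)   # shape (3, 2)
    D = np.asarray(data_refs, dtype=float)    # shape (3, 2)
    # Augment each pixel position with a 1 so the map can include an offset.
    A = np.hstack([P, np.ones((3, 1))])       # shape (3, 3)
    # Solve A @ M = D for the 3x2 affine matrix M.
    # np.linalg.solve raises LinAlgError if the three points are collinear.
    M = np.linalg.solve(A, D)

    def pixel_to_data(px, py):
        x, y = np.array([px, py, 1.0]) @ M
        return float(x), float(y)

    return pixel_to_data

# Hypothetical placement: axis origin, upper x value, upper y value.
# Pixel rows usually count downward from the top of the image.
to_data = make_pixel_to_data(
    pixel_refs=[(50, 400), (450, 400), (50, 100)],
    data_refs=[(0, 0), (60, 0), (0, 150)],
)
print(to_data(250, 250))  # approximately (30.0, 75.0)
```

Because the transformation is solved from your three placements, any small error in where you drop the datum points propagates to every extracted value, which is worth keeping in mind when judging the accuracy of the extracted data.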

We would like to extract the line in the figure.  To do so, first click on the icon at the top middle of the control window that looks like a line graph.  You will notice that a green and red circle appear with a + through them. Drag the green circle to the beginning of your line, and point its little arm in the direction the line is going.  Drag the red circle to the end of the line.  Make sure the blue circle with the + is on the line… from the color under that circle DataThief determines the color of the line.

Now we are ready to trace the line, but first we must make sure that our trace is a different color than the line (the default color of the DataThief trace line is black, which unfortunately is the color of the line we are trying to trace here).  To change the color of the trace line, click on Edit->Preferences->General->Trace and choose a nice contrasting color.

Now click on the icon just to the left of the line graph at the top middle of the control window (the icon with the * * *  in it).  This will make DataThief attempt to trace the line.

Now, I don’t know about your computer, but on my computer, DataThief gives me an error message “Trace did not reach end point, too many doubles”.  This usually means the line had portions that were too pointy or that it was perhaps a bit too blurry in places for the algorithms underlying the DataThief program to follow it.  When I click OK after the error message, I find that DataThief got hung up at the peak of the plot.

When this happens, you have to give DataThief a hand to help it determine which direction the line goes. To do this, you need to use the Dump option.  Click on Settings->Show dump, and a grey icon will appear in the control window.

Click on the circle in the icon and drag your mouse. A little circle with an arm will come out of it.  Place that circle on the line in your figure just before (but not on) the spot where the trace failed, point its arm in the direction the line goes, and drag its tip to where the trace failed.  Drag out another little circle, place it where the trace failed, and point its arm in the direction of the line, out to about where the line makes its next zig (or zag).  Hit the Trace (the * * *) icon again.  This time, for me, the trace got quite a bit further, but didn’t quite make it to the end.  So I added a couple more helping datum points from the Dump.

If you accidentally remove too many points from the Dump, you can always put them back in the Dump by dragging them there.

Finally, after a couple of iterations of giving DataThief hints, the trace was successful!  Now we need to export the data.  To do this, first save your DataThief file by clicking File->Save As (then give it a descriptive name so you can find it later).  Now click on File->Export Data and save the output text file with a descriptive name (I called mine flu_1918_sf.out).

Take a look at that file… there is a bit of a problem in that DataThief doesn’t recognize that the points along the x axis are in integer units.  You can do one of two things: read the file into Matlab or R and take the average of the y values within each day, *or* use DataThief in points mode and extract the data point by point (for figures that consist only of points, and not a line, you have to use DataThief in points mode).  To aggregate the data by day in R, you can use the file aggregate.R, which produces the file flu_1918_sf_aggregated.out.  Note that the aggregated y values by day are not integers (even though they should be, because this is count data). This is because there is slight uncertainty in the actual value due to your inability to perfectly place the axis datum points, and also due to the fact that the line in the figure is not infinitely thin.  To double check how sensitive your results are to these issues, repeat the process (starting with the placement of the axis datum points) and see how much the values in the file change… the amount of the change gives you an idea of the uncertainty on the data points.  Ideally, this uncertainty will be small enough that it does not affect the conclusions of your model analysis.
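If you would rather work in Python, the aggregation and the repeat-extraction check might look something like the sketch below. This is not the aggregate.R script, just the same idea under some assumptions of mine: that the exported file has one x,y pair per line (separated by a comma or whitespace, with any header lines skipped), and that "within each day" means binning each traced point to its nearest integer x. The function names are hypothetical, and the numbers in the example are invented, not values from the paper.

```python
import math
from collections import defaultdict

def read_xy(path):
    """Read (x, y) pairs from a text file, skipping lines that are not two numbers."""
    pairs = []
    with open(path) as fh:
        for line in fh:
            fields = line.replace(",", " ").split()
            try:
                x, y = (float(f) for f in fields[:2])
            except ValueError:
                continue  # header, blank line, or anything else non-numeric
            pairs.append((x, y))
    return pairs

def aggregate_by_day(pairs):
    """Average y over all traced points whose x is nearest to the same integer day."""
    buckets = defaultdict(list)
    for x, y in pairs:
        buckets[math.floor(x + 0.5)].append(y)
    return {day: sum(ys) / len(ys) for day, ys in sorted(buckets.items())}

def max_abs_difference(agg_a, agg_b):
    """Largest absolute difference in daily values, over days present in both runs."""
    common = agg_a.keys() & agg_b.keys()
    return max(abs(agg_a[d] - agg_b[d]) for d in common)

# Invented numbers standing in for two independent extractions of one figure.
run_1 = aggregate_by_day([(0.9, 10.0), (1.1, 12.0), (2.0, 20.0)])
run_2 = aggregate_by_day([(1.0, 11.5), (2.1, 19.0)])
print(run_1)
print(max_abs_difference(run_1, run_2))
```

With real output you would call something like read_xy("flu_1918_sf.out") for each extraction and pass the results to aggregate_by_day. The largest day-by-day difference between the two runs is a rough, not rigorous, gauge of the extraction uncertainty.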

To change DataThief to points mode, click on Settings->Points Mode.  If you were just using DataThief in line mode and did a trace, a whole bunch of points will show up(!).  To get rid of them, click Data->Clear Data.

Now, drag points from the Dump (notice, they no longer have directional arms!) and place them along the line at the integer values of x.  Patiently place your datum points along the line to get a point for each day.  Now, click on File->Export Data and save the data to a file with a descriptive file name.
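Hand-placed points will never land exactly on integer x values, and it is easy to skip a day or place two points on the same day. A small Python sketch of a sanity check is below. It assumes you have already read the exported points into a list of (x, y) pairs, and it rounds y to the nearest whole count, which is my own choice for count data rather than anything DataThief does for you. The function names and the example numbers are made up.

```python
import math

def snap_to_days(pairs):
    """Snap each (x, y) to (nearest integer day, nearest non-negative integer count).

    Returns the snapped values plus a list of days that received more than
    one point (the last point placed on a day wins).
    """
    snapped = {}
    duplicates = []
    for x, y in pairs:
        day = math.floor(x + 0.5)
        if day in snapped:
            duplicates.append(day)
        snapped[day] = max(0, math.floor(y + 0.5))
    return snapped, duplicates

def missing_days(snapped):
    """Days between the first and last extracted day that have no point."""
    days = sorted(snapped)
    return [d for d in range(days[0], days[-1] + 1) if d not in snapped]

# Invented points: day 3 was skipped when placing points by hand.
snapped, duplicates = snap_to_days([(1.02, 3.8), (1.97, 7.1), (4.01, 12.4)])
print(snapped)
print(duplicates)
print(missing_days(snapped))
```

Any day that shows up as missing or duplicated is worth going back to in DataThief and fixing by hand before you use the file.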
