Converting Scanned Graphs to Data
Extracting (x,y) data from scanned graphs can be useful for analyzing data from published graphs, analog instruments, strip chart recordings, or any other hard copy graph or plot.
History
The development of the computer in the 20th century was fueled largely by the scientific community's need to store and analyze large amounts of data. With the later development of the personal computer and graphics printer, scientists were able to easily display and print graphical representations of their data sets. In recent years, hardware and software (image scanners and digitizing software) have been developed that allow scientists to easily extract (and therefore analyze) data from printed materials such as strip chart recordings, old graphs, and graphs published in journals.
Although digital video systems and scanners were developed primarily to input and manipulate pixel images such as pictures, drawings and photographs[1], it soon became clear that information could be extracted from other types of scanned images. Optical Character Recognition (OCR) software was developed to extract alphanumeric characters automatically from the scanned pixel image. If the alphanumeric characters could be extracted from textual images, then (x,y) coordinates of data points could be extracted automatically from graphical images to convert graphs to data.
Basic Concept
For decades, original (x,y) data values have been extracted from printed graphs using digitizing tablets[2]. A digitizing tablet maps the position of the puck (a mouse-like device) on the digitizing board to a calibrated coordinate system associated with the underlying graph. Digital plotters have also been used in a similar manner, by manually moving the plotter pen to various positions over the hard copy graph[3]. The same basic concept can be used to convert the pixels in scanned images to (x,y) data values: the software assigns a coordinate system to the pixels in the image based on the scaling values entered from the original graph.
To use a scanner as a digitizer and accurately convert a scanned graph to (x,y) data, the scanned image must be properly scaled, which requires four points to be defined (the locations and corresponding values of the lowest x, highest x, lowest y, and highest y must be entered). Although the basic concept of converting the image pixel values to scaled values is straightforward, practical considerations such as correcting for tilted graphs, overcoming the limits of screen resolution, and developing line following routines to automate the digitizing process must be addressed.
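The scaling step can be illustrated with a minimal Python sketch. It assumes linear axes and an untilted image; the function name make_axis_scaler and the calibration pixel positions are hypothetical and chosen only for illustration.

```python
def make_axis_scaler(pixel_lo, pixel_hi, value_lo, value_hi):
    """Return a function mapping a pixel coordinate to a linear data value."""
    if pixel_hi == pixel_lo:
        raise ValueError("calibration pixels must differ")
    slope = (value_hi - value_lo) / (pixel_hi - pixel_lo)

    def to_value(pixel):
        return value_lo + (pixel - pixel_lo) * slope

    return to_value


# Hypothetical calibration: the x axis runs from pixel column 100 (x = 0.0)
# to column 1100 (x = 10.0); the y axis runs from pixel row 900 (y = -1.0)
# to row 100 (y = 1.0). Image rows count downward, so the y slope is negative.
x_of = make_axis_scaler(100, 1100, 0.0, 10.0)
y_of = make_axis_scaler(900, 100, -1.0, 1.0)

# Convert the pixel at column 600, row 500 to data coordinates.
point = (x_of(600), y_of(500))
```

Each axis needs two calibrated locations, which is where the four points come from.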
Practical Considerations
Tilted Graphs
Although desktop scanners can provide very high resolution and accuracy in the scanning of paper images, it is generally difficult to load the paper into the scanner perfectly square. The scanned images are therefore often slightly tilted, and a perfectly orthogonal alignment is rarely achieved. Even a tilt of less than one degree can result in unacceptable levels of error in the digitized (x,y) values if no correction is made. The tilt of the graph can be determined by measuring the delta y and delta x pixel offsets between the points used to define the axis lines, and the corresponding correction applied.
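One possible form of the tilt correction is sketched below in Python. It assumes the tilt is a pure rotation and that the user has marked two points on the x axis; the function names and pixel positions are hypothetical. In this example the axis drifts 10 pixels over a 1000 pixel run, a tilt of about 0.57 degrees.

```python
import math


def tilt_angle(axis_start, axis_end):
    """Angle (radians) of the marked axis line relative to the pixel rows."""
    dx = axis_end[0] - axis_start[0]
    dy = axis_end[1] - axis_start[1]
    return math.atan2(dy, dx)


def untilt(point, origin, angle):
    """Rotate a pixel point about origin by -angle so the axis becomes level."""
    px = point[0] - origin[0]
    py = point[1] - origin[1]
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    return (origin[0] + px * cos_a + py * sin_a,
            origin[1] - px * sin_a + py * cos_a)


# Hypothetical axis end points marked by the user, as (column, row).
axis_start = (100, 900)
axis_end = (1100, 910)

angle = tilt_angle(axis_start, axis_end)
tilt_degrees = math.degrees(angle)

# After correction the far end of the axis lies on the same row as the origin.
leveled_end = untilt(axis_end, axis_start, angle)
```

Every digitized pixel would be passed through untilt before the axis scaling is applied.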
Overcoming the Limits of Screen Resolution
The early versions of digitizing software simply digitized the screen image, rather than the full scanner image. With today’s high resolution scanners and high end computers, typical scanned image dimensions can be several thousand pixels by several thousand pixels. These large images cannot be completely and accurately represented on a 640x480 or 1024x768 computer monitor. Therefore, in order to digitize the image at full scanner resolution, the entire image must be read into memory and only portions of the image displayed and digitized as the digitizing process occurs. This full scanner resolution digitizing yields much more accurate results than screen digitizing, and virtually no data are lost from the original scanned image.
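Digitizing at full scanner resolution while displaying only part of the image amounts to a coordinate translation between the displayed window and the image held in memory. The Python sketch below shows one way to express it; the function name, the zoom parameter, and the window position are assumptions made for illustration.

```python
def screen_to_image(screen_xy, view_origin, zoom=1.0):
    """Map a click in the displayed window to full-resolution image pixels.

    view_origin is the image pixel shown at the window's top-left corner;
    zoom is the number of screen pixels per image pixel.
    """
    sx, sy = screen_xy
    ox, oy = view_origin
    return (ox + sx / zoom, oy + sy / zoom)


# Hypothetical case: a 640x480 window shows the part of a large scan that
# starts at image pixel (1200, 800). A click at the window center maps to:
full_res = screen_to_image((320, 240), (1200, 800))

# The same click with the view magnified 2x covers half as many image pixels.
magnified = screen_to_image((320, 240), (1200, 800), zoom=2.0)
```

Because the returned coordinates refer to the full image, accuracy is limited by the scanner rather than by the monitor.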
Developing Line Following Routines
Although using a scanner to digitize hard copy graphs works much like a digitizing tablet, the scanner and digitizing software have the potential advantage of being fully automatic. Rather than sitting in front of a digitizing tablet for hours to digitize manually, graphs can be digitized automatically in seconds.
Raster scanning is a very simple way to convert the scanned image to (x,y) data values. However, scientists generally need single-valued vectorized data (one y for each x, in sequential order). This requirement means that the digitizing software must include a function that automatically follows the data line and assigns one (and only one) y value for each x value along a given curve.
If only one y value is to be assigned for each x value along the curve, then the middle of the data line is generally assumed to represent the actual (x,y) value. The middle of the data line is the point halfway between the top and bottom edges of the line. Although the mid-line assignment method works for many types of curves, it is generally not accurate for curves with sharp peaks. The error arises because the finite width of the data line on the rising side of the peak overlaps the data line on the falling side, creating an artificially low bottom edge at the peak. Because of this potential problem with sharp peaks, the more generally applicable point assignment method is to measure the line thickness once in a flat area, then subtract half of that line thickness from the top edge of the line.
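The two point assignment methods can be compared on a single pixel column with the Python sketch below. It assumes grayscale values where low numbers are ink and row indices that increase downward, so "subtracting" half the thickness from the top edge means moving down by that many rows. The function names, the darkness threshold, and the sample columns are hypothetical.

```python
def ink_run(column, dark=128):
    """Return (top_row, bottom_row) of the first run of dark pixels."""
    top = None
    bottom = None
    for row, value in enumerate(column):
        if value < dark:
            if top is None:
                top = row
            bottom = row
        elif top is not None:
            break
    if top is None:
        raise ValueError("no ink in this column")
    return top, bottom


def midline_row(column):
    """Mid-line method: halfway between the top and bottom edges."""
    top, bottom = ink_run(column)
    return (top + bottom) / 2


def top_edge_row(column, line_thickness):
    """Top-edge method: center of a line of known thickness below the top edge."""
    top, _ = ink_run(column)
    return top + (line_thickness - 1) / 2


# Flat region: a 3-pixel-thick line occupying rows 10 to 12.
flat = [255] * 10 + [0] * 3 + [255] * 7
# Sharp peak: both sides of the peak merge into ink from row 10 to row 19.
peak = [255] * 10 + [0] * 10

flat_mid, flat_top = midline_row(flat), top_edge_row(flat, 3)
peak_mid, peak_top = midline_row(peak), top_edge_row(peak, 3)
```

In the flat region both methods return row 11. At the peak the mid-line method returns row 14.5, well below the true line, while the top-edge method still returns row 11.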
Once x and y values have been assigned to a point on the line, the line following routine moves one pixel unit in the x direction, begins an up and down search, and repeats the point assignment process. Although the line following process is straightforward for a simple curve, more complex graphs often need to be digitized. Useful additional features include the ability to select the line follow direction, the line follow side (top or bottom), the scale (linear or logarithmic), and the resolution (distance between x values), as well as the ability to pause the digitizing process for manual interaction or adjustment.
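A minimal line following loop might look like the following Python sketch. It is not the routine used by any particular product; it assumes a grayscale image stored as a list of rows, follows the line left to right only, uses the top-edge assignment, and stops when the line is lost. The names follow_line, step, search, and dark are hypothetical.

```python
def follow_line(image, start_col, start_row, line_thickness,
                step=1, search=20, dark=128):
    """Trace a curve left to right; return one (col, row) per visited column."""
    height, width = len(image), len(image[0])
    points = []
    row = start_row
    for col in range(start_col, width, step):
        hit = None
        # Search outward (up and down) from the previous row for ink.
        for offset in range(search + 1):
            for candidate in (row - offset, row + offset):
                if 0 <= candidate < height and image[candidate][col] < dark:
                    hit = candidate
                    break
            if hit is not None:
                break
        if hit is None:
            break  # line lost: stop so the user can intervene
        # Walk up to the top edge of the ink run, then assign the center.
        top = hit
        while top > 0 and image[top - 1][col] < dark:
            top -= 1
        center = top + (line_thickness - 1) / 2
        points.append((col, center))
        row = int(round(center))
    return points


# Synthetic test image: 20 rows by 10 columns, white background, with a
# 3-pixel-thick diagonal line whose top edge is at row 5 + col.
image = [[255] * 10 for _ in range(20)]
for col in range(10):
    for row in range(5 + col, 8 + col):
        image[row][col] = 0

points = follow_line(image, start_col=0, start_row=6, line_thickness=3)
```

The step argument plays the role of the resolution setting: a larger step samples fewer x values. The returned pixel rows would still need tilt correction and axis scaling to become data values.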
Results
In order to estimate the accuracy of graph digitizing software, the standard geometric function y = sin(x) was generated and printed using a spreadsheet program and laser printer. The hard copy graph was then scanned at 300 dpi using a full page scanner. The software was then used to automatically follow the data line on the scanned image to extract the digitized (x,y) data values from the image.
The software extracted 2,000 (x,y) data points in less than 10 seconds, once the axis limits were set. The average y deviation of the digitized values from the actual values was 0.002 inches, with a maximum deviation of 0.012 inches. The results indicate that the values obtained from the scanned image are precise and accurate, with typical deviations of approximately one scanner unit (the deviations that do occur result largely from imperfections in the printing and scanning processes). Using a scanner as a digitizer to convert graphs to (x,y) data can save countless hours and improve scientific results over manual digitizing methods.
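The deviation figures can be related to scanner units with a short Python sketch. The conversion uses the 300 dpi resolution and the deviations reported above; the three sample points passed to deviation_stats are made up purely to show the calculation and are not measured data.

```python
import math


def deviation_stats(digitized, reference):
    """Mean and maximum absolute y deviation from a reference function."""
    deviations = [abs(y - reference(x)) for x, y in digitized]
    return sum(deviations) / len(deviations), max(deviations)


def inches_to_pixels(inches, dpi=300):
    """Express a length on paper in scanner pixels at the given resolution."""
    return inches * dpi


# Reported deviations expressed in scanner units at 300 dpi.
average_px = inches_to_pixels(0.002)   # about 0.6 pixel
maximum_px = inches_to_pixels(0.012)   # about 3.6 pixels

# Made-up digitized points near y = sin(x), for illustration only.
sample = [(0.0, 0.001), (math.pi / 2, 0.998), (math.pi, -0.003)]
mean_dev, max_dev = deviation_stats(sample, math.sin)
```

At 300 dpi one scanner unit is 1/300 inch, about 0.0033 inches, so the reported average deviation is below one pixel.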
External links
- Engauge Digitizer is open source digitizing software available for Linux and Windows [1]
- Plot Digitizer is a free Java program for digitizing scanned plots. It is available for a number of platforms [2]
- Graph digitizing software is available from Silk Scientific, Inc. www.silkscientific.com
- DigitizeIt [3]
- [4] Graphics software FindGraph from UNIPHYZ Lab contains digitizing capability.
- General graphics package ORIGIN [5] includes tools for digitization.
- GetData Graph Digitizer [6] is a standalone program for digitizing plots.
- Dagra, a program for Microsoft Windows, uses Bezier curves to digitize graphs.
- g3data
- OmniGraphSketcher [7], a Macintosh application, allows interactive point selection from graphics displayed on screen (via a transparent window).
Wikimedia Foundation. 2010.