Reclaim Your Data from Scanned Formats

At a Safe Software conference, I once heard someone exclaim, “PDF is the format where data goes to die!” Sure enough, business reports, charts and data are frequently destined for this dead-end, one-way, read-only format. Don’t you wish you could reach inside a PDF, retrieve the data and perform some useful analysis?

When most people think of converting PDFs (or other scanned formats) back into text, they imagine traditional OCR. Optical Character Recognition software scans the PDF document and tries to turn it back into text. However, the results are often far from satisfactory. Instead of the spreadsheet or table that you see embedded in the PDF, you get an unstructured jumble of text that requires considerable manipulation to produce any useful result. Worse, for mappers, coordinates in degree-minute-second format are often scrambled because of the special characters they use.
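To illustrate the coordinate problem, here is a minimal Python sketch of one way to recover a decimal-degree value from a degree-minute-second string whose symbols have been garbled by OCR. It is an assumption-laden illustration, not the method used by BIS, Grouper or the WhiteStar Legal Mapper®: the function name `dms_to_decimal` and the matching rules are hypothetical, and the sketch assumes the digits themselves survived OCR intact.

```python
import re

# Hypothetical tolerant pattern: three numeric groups separated by any
# non-digit characters (whatever OCR made of the degree, minute and second
# marks), followed by an optional hemisphere letter.
_DMS = re.compile(
    r"""(?P<deg>\d{1,3})\D+?          # degrees, then a garbled degree sign
        (?P<min>\d{1,2})\D+?          # minutes, then a garbled minute mark
        (?P<sec>\d{1,2}(?:\.\d+)?)    # seconds, with optional decimals
        [^NSEWnsew]*                  # garbled second mark and spacing
        (?P<hem>[NSEWnsew])?          # optional hemisphere letter
    """,
    re.VERBOSE,
)


def dms_to_decimal(text):
    """Convert a (possibly OCR-damaged) DMS string to decimal degrees."""
    match = _DMS.search(text)
    if match is None:
        raise ValueError(f"no degree-minute-second value found in {text!r}")

    degrees = int(match["deg"])
    minutes = int(match["min"])
    seconds = float(match["sec"])
    if minutes >= 60 or seconds >= 60:
        raise ValueError(f"minutes or seconds out of range in {text!r}")

    value = degrees + minutes / 60 + seconds / 3600
    hemisphere = (match["hem"] or "").upper()
    # South and west are negative by convention.
    return -value if hemisphere in ("S", "W") else value


clean = dms_to_decimal("32°46'30\"N")
garbled = dms_to_decimal("96? 48' 00\" W")  # degree sign misread as "?"
```

Because the pattern keys on the numbers rather than on the symbols, a misread degree sign does not break the conversion. It cannot repair misread digits, so values extracted this way would still need validation against the expected map extent.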

Enter BIS and Grouper. This technology can be configured with a variety of OCR engines and can also be trained further to meet the needs of different industries. It can pull out the recognized data and save it in a familiar form such as a spreadsheet or a database table. For WhiteStar utility customers, we apply BIS technology to produce intermediary output that can then be fed into the WhiteStar Legal Mapper® to produce a correctly placed GIS polygon in your enterprise mapping system. Many other industries can also benefit from this approach, including forestry, energy, midstream pipeline and government.

Once converted, these data are fully available inside GIS to serve your business needs, for example to improve compliance, mitigate risk and support better business decisions. One WhiteStar midstream pipeline customer is currently exploring converting their PDF engineering report data into GIS format to assist with PHMSA “Mega Rule” compliance. Once the trapped PDF data have been converted, they can be combined with other data streams such as WhiteStar Culture®, which contains wetlands, contours and building footprints, to assess and mitigate engineering risk and encroachment along rights-of-way.

WhiteStar Culture® wetlands and pipeline data overlaid inside ArcGIS Pro in the Dallas, TX area.

In summary, customers commonly have tens of thousands of PDF files full of critical engineering data on their networks that cannot yet be fully used for analysis. Traditional OCR reduces typewritten data to an unstructured jumble. Supercharging the OCR process with WhiteStar’s domain expertise can turn your data back into a useful format to support better decision making.

Robert C. White, Jr.
President and CEO
WhiteStar Corporation
