CyberPDF: Smart and Secure Coordinate-based Automated Health PDF Data Batch Extraction

Abstract:

Extracting data from files is a common and often laborious activity in today’s electronic health record systems. When document analysis is repetitive (e.g., processing a series of files with the same layout and extraction requirements), relying on data-entry staff to perform the task manually is costly and insecure. In particular, manually analyzing a large set of PDF files (a widely used format) to extract specific data and migrate it to other destinations for later use is tedious and frustrating. This paper addresses the practical requirement of batch-extracting data from PDF files in health data document analysis. Specifically, we propose a Coordinate Based Information Extraction System (CBIES) as the basis for a smart, automatic PDF batch data extraction tool that relieves health organizations of repetitive effort and reduces labor costs. The proposed technique lets users query a representative PDF document and then quickly extract the same data from a series of files in batch mode. Furthermore, since security and privacy considerations are an essential part of any health record system, they are included in our approach. Based on CBIES, we implement a prototype PDF batch data extraction tool named CyberPDF. An empirical evaluation indicates that the tool is efficient, secure, and accurate in multi-file data processing.
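
To make the coordinate-based idea concrete, the following is a minimal Python sketch, not the CyberPDF implementation. It assumes that a PDF parser has already produced, for each file, a list of words with bounding boxes in a top-left-origin coordinate system. The names `Word`, `Region`, `extract_fields`, and `batch_extract` are hypothetical and chosen only for illustration. A region is defined once on a representative document and then reused unchanged for every file that shares the layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """A word and its bounding box (hypothetical parser output)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass(frozen=True)
class Region:
    """A named rectangle selected once on a representative document."""
    name: str
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, word: Word) -> bool:
        # A word belongs to the region if its centre point lies inside it.
        cx = (word.x0 + word.x1) / 2
        cy = (word.y0 + word.y1) / 2
        return self.x0 <= cx <= self.x1 and self.y0 <= cy <= self.y1


def extract_fields(words, regions):
    """Return {region name: text found inside that region} for one document."""
    fields = {}
    for region in regions:
        hits = [w for w in words if region.contains(w)]
        # Assumption: y grows downward, so sorting by (y0, x0) gives reading order.
        hits.sort(key=lambda w: (w.y0, w.x0))
        fields[region.name] = " ".join(w.text for w in hits)
    return fields


def batch_extract(documents, regions):
    """Apply the same regions to every document in {file name: words}."""
    return {name: extract_fields(words, regions) for name, words in documents.items()}
```

For example, a single `Region("patient_name", 100, 50, 300, 70)` applied to two same-layout files would return the text found in that rectangle for each file, keyed by file name. Matching on the word centre rather than on the full box is one simple design choice; it tolerates small shifts in word width between documents.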