Revitalizing Data in Historical Documents


Dusty old documents: can we revitalize the data held captive within them? Can we make the data computer readable? Can we organize the data, index it, and make it easily searchable? If so, we can provide valuable information found in old books, microfilm records, and government archival documents.

Having valuable information about parental lineage, sicknesses, and causes of death would help medical researchers track down data pertinent to inherited diseases. Having indexable historical information from emigration certificates, land purchases, and employment records would help historians piece together the past. And having tables and forms of scientific data could provide scientists with yet another vast store of information to mine.


The challenges are: turning images of data into computer-readable data, organizing raw data into meaningful information packets, indexing the information for search and query, and integrating information from multiple sources.


We address these challenges by applying research and technology ideas in conceptual-model knowledge bases. The knowledge represented can describe both the application of interest and how tables and forms present data for human understanding. Such knowledge can be leveraged to find, combine, and interpret raw units of data that are laid out in old documents in geometric patterns. Once the data is found and interpreted, the conceptual model of the knowledge base also provides the basis for storing and organizing raw information for high-quality search, retrieval, and indexing.

Research Problems

We assume that raw data units and their positions in documents, along with the position and orientation of separating lines, are given to us. Raw, individual words and their locations within a document may be provided as output from optical character recognizers or, in the case of handwritten documents, from human extractors. Raw horizontal and vertical lines are extractable by pattern-matching and image-processing techniques. Our objective is to use the knowledge about the layout, the patterns, the lines, and the application objects and their relationships to recognize, piece together, organize, and efficiently store information.
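
To make the assumed input concrete, here is a minimal sketch in Python of how positioned words and separating lines might be represented, and how the lines could be used to gather words into table cells. It is an illustration only, not the project's method: the names Word, Rule, and group_into_cells are hypothetical, as are the coordinates and the simplifying assumptions that every separating line spans the whole page and that a word belongs to the cell containing its center.

```python
import bisect
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """A raw data unit, e.g. from OCR or a human extractor (hypothetical shape)."""
    text: str
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height


@dataclass(frozen=True)
class Rule:
    """A separating line, assumed here to span the full page."""
    orientation: str  # "h" for horizontal, "v" for vertical
    pos: float        # y coordinate if horizontal, x coordinate if vertical


def group_into_cells(words, rules):
    """Assign each word to a (row, col) cell bounded by the separating lines."""
    h_cuts = sorted(r.pos for r in rules if r.orientation == "h")
    v_cuts = sorted(r.pos for r in rules if r.orientation == "v")
    cells = {}
    for word in words:
        # Use the word's center point to decide which cell it falls in.
        cx = word.x + word.w / 2
        cy = word.y + word.h / 2
        row = bisect.bisect_right(h_cuts, cy)
        col = bisect.bisect_right(v_cuts, cx)
        cells.setdefault((row, col), []).append(word)
    # Read words top to bottom, then left to right (assumes words on the
    # same text line share the same y value).
    return {
        key: " ".join(w.text for w in sorted(ws, key=lambda w: (w.y, w.x)))
        for key, ws in cells.items()
    }


# Invented sample: a two-column table with one header row and one data row.
sample_words = [
    Word("Name", 10, 5, 40, 10),
    Word("Cause", 110, 5, 40, 10),
    Word("of", 155, 5, 15, 10),
    Word("Death", 175, 5, 40, 10),
    Word("John", 10, 25, 35, 10),
    Word("Smith", 50, 25, 40, 10),
    Word("Fever", 110, 25, 40, 10),
]
sample_rules = [Rule("h", 20), Rule("v", 100)]

cells = group_into_cells(sample_words, sample_rules)
for key in sorted(cells):
    print(key, cells[key])
```

In this sketch the geometry alone only yields cells. Deciding that "Cause of Death" is an attribute and "Fever" is its value for "John Smith" is where the knowledge about the application and about table layout would come in.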

So, how do we build conceptual-model knowledge bases to achieve this objective? How do we exploit relative positions to infer logical connectivity? How do we recognize attributes factored into table headers? How do we unravel nested structures? How can we establish boundaries between groups of related information? How do we determine whether an object appears more than once in a document and avoid representing it as multiple different objects? How do we integrate information packets into a global picture that represents the information in a single document or a group of related documents? And how do we efficiently index the information so that we can not only quickly retrieve the extracted information but also relate it to its original source for human verification and further research?