Revitalizing Data in Historical Documents
We address these challenges by applying research and technology ideas from conceptual-model knowledge bases. The knowledge represented can describe both the application of interest and how tables and forms present data for human understanding. Such knowledge can be leveraged to find, combine, and interpret raw units of data that are laid out in geometric patterns in old documents. Once the data are found and interpreted, the conceptual model of the knowledge base also provides the basis for storing and organizing the raw information for high-quality search, retrieval, and indexing.
We assume that raw data units and their positions in documents, along with the positions and orientations of separating lines, are given to us. Raw, individual words and their locations within a document may be provided as output from optical character recognizers or, in the case of handwritten documents, from human extractors. Raw horizontal and vertical lines are extractable by pattern-matching and image-processing techniques. Our objective is to use knowledge about the layout, the patterns, the lines, and the application objects and their relationships to recognize, piece together, organize, and efficiently store information.
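To make the assumed input concrete, the following is a minimal sketch, not the project's actual method, of how positioned words and separating lines might be combined into table cells. The `Word` type, the `group_into_cells` function, and the coordinate conventions are all hypothetical; the sketch further assumes that lines span the whole page and that each cell holds a single line of text.

```python
from bisect import bisect_right
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """A raw data unit (hypothetical type): its text and center position on the page."""
    text: str
    x: float  # horizontal center
    y: float  # vertical center (grows downward)


def group_into_cells(words, vertical_xs, horizontal_ys):
    """Group words by the (row, column) cell bounded by the separating lines.

    vertical_xs: x-coordinates of vertical lines (assumed to span the page).
    horizontal_ys: y-coordinates of horizontal lines (assumed to span the page).
    """
    xs = sorted(vertical_xs)
    ys = sorted(horizontal_ys)
    cells = defaultdict(list)
    for w in words:
        # The number of lines above / to the left of a word gives its row / column.
        row = bisect_right(ys, w.y)
        col = bisect_right(xs, w.x)
        cells[(row, col)].append(w)
    # Assumption: single-line cells, so left-to-right order restores the text.
    return {
        key: " ".join(w.text for w in sorted(ws, key=lambda w: w.x))
        for key, ws in cells.items()
    }


# Illustrative data only: a two-by-two table with one vertical and one horizontal line.
words = [
    Word("Name", 10, 5), Word("Age", 60, 5),
    Word("Smith", 30, 25), Word("Mary", 8, 25), Word("42", 60, 25),
]
table = group_into_cells(words, vertical_xs=[50], horizontal_ys=[15])
print(table)
# {(0, 0): 'Name', (0, 1): 'Age', (1, 0): 'Mary Smith', (1, 1): '42'}
```

Cell grouping of this kind only recovers geometric adjacency. Deciding that "Name" labels "Mary Smith" is the interpretive step that would draw on the knowledge base.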
So, how do we build conceptual-model knowledge bases to achieve this objective? How do we exploit relative positions to infer logical connectivity? How do we recognize attributes factored into table headers? How do we unravel nested structures? How can we establish boundaries between groups of related information? How do we determine whether an object appears more than once in a document and avoid representing it as multiple different objects? How do we integrate information packets into a global picture that represents the information in a single document or a group of related documents? And how do we efficiently index the information so that we can not only quickly retrieve the extracted information but also relate it to its original source for human verification and further research?