Methods Inf Med 2007; 46(06): 700-708
DOI: 10.1055/s-0038-1625431
Original Article
Schattauer GmbH

Document Recognition and XML Generation of Tabular Form Discharge Summaries for Analogous Case Search System

H. Kawanaka (1), T. Sumida (2), K. Yamamoto (3), T. Shinogi (1), S. Tsuruoka (1)
1   Graduate School of Engineering, Mie University, Mie, Japan
2   Faculty of Engineering, Mie University, Mie, Japan
3   Dept. of Medical Informatics, Mie University Hospital, Mie, Japan

Publication History

Publication Date:
12 January 2018 (online)

Summary

Objectives : This paper develops a system for document image recognition, keyword extraction and automatic XML generation, so that analogous cases can be searched for in paper-based documents. We propose a document structure recognition method and an automatic XML generation method for tabular form discharge summaries, and build a prototype system based on them. Evaluation experiments on actual documents are carried out to assess the effectiveness of the developed system.

Methods : The developed system works as follows. Paper-based summary documents are first scanned at 300 dpi. Pre-processing reduces noise and corrects the tilt of the image, and the table structures are then identified. Characters in the table are recognized and converted to text data by an OCR engine. XML documents are generated automatically from the obtained results.
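The table structure and XML generation steps can be illustrated with a minimal sketch. It assumes the page has already been binarized and deskewed into a nested list of 0/1 pixels, and it locates ruled lines with a simple projection profile, which is a stand-in technique and not necessarily the method used in the paper. The function names, the `min_fill` threshold and the XML element names (`dischargeSummary`, `item`, `label`) are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET


def find_ruled_lines(binary, axis, min_fill=0.8):
    """Return indices of rows (axis=0) or columns (axis=1) that are mostly ink.

    `binary` is a list of rows, each a list of 0/1 pixels (1 = ink).
    A row or column counts as a ruled line when at least `min_fill` of its
    pixels are ink. Runs of adjacent hits are merged (first index is kept).
    """
    height, width = len(binary), len(binary[0])
    if axis == 0:
        fills = [sum(row) / width for row in binary]
    else:
        fills = [sum(binary[y][x] for y in range(height)) / height
                 for x in range(width)]
    lines, previous = [], None
    for i, fill in enumerate(fills):
        if fill >= min_fill:
            if previous is None or i != previous + 1:
                lines.append(i)
            previous = i
    return lines


def cell_boxes(binary):
    """Return a grid of (top, left, bottom, right) boxes between ruled lines."""
    rows = find_ruled_lines(binary, axis=0)
    cols = find_ruled_lines(binary, axis=1)
    return [[(top, left, bottom, right)
             for left, right in zip(cols, cols[1:])]
            for top, bottom in zip(rows, rows[1:])]


def cells_to_xml(cells, root_tag="dischargeSummary"):
    """Serialize (label, text) pairs, e.g. OCR output per cell, as XML.

    The element and attribute names are assumptions, not the paper's schema.
    """
    root = ET.Element(root_tag)
    for label, text in cells:
        item = ET.SubElement(root, "item", {"label": label})
        item.text = text
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Synthetic 7x9 "scan": ruled lines at rows 0, 3, 6 and columns 0, 4, 8.
    grid = [[1 if (y in (0, 3, 6) or x in (0, 4, 8)) else 0
             for x in range(9)] for y in range(7)]
    print(cell_boxes(grid))
    print(cells_to_xml([("Diagnosis", "Pneumonia")]))
```

In a real pipeline the text of each cell would come from the OCR engine applied to the corresponding box; here the `(label, text)` pairs are supplied by hand to keep the sketch self-contained.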

Results : Patient discharge summary documents archived at Mie University Hospital were used. The results show that XML documents can be generated automatically when standard tabular form documents are input into the developed system. In this experiment, generating an XML document took about 20 seconds on a general-purpose personal computer. The developed system was also compared with a commercial product to assess its effectiveness. The experimental results further show that the accuracy of table structure recognition is high and that the system can be used in practical situations.

Conclusions : The proposed method is effective for recognizing tabular form document images and generating XML documents from them.

 