Presented at TELEMEDICON (National) on 4-Nov-2023
Demystify: AI-Powered Diagnostic Reports Digitisation Engine
INTRODUCTION
Diagnostic reports play a crucial role in monitoring patient health, treatment planning, and diagnosis.
These reports often contain sensitive Personally Identifiable Information (PII) and are created and stored in diverse formats, including images and PDFs with scanned or embedded text, so diligent masking is needed to protect data privacy.
While this improves portability, it adversely affects access and use of the data.
To provide user-friendly ways of accessing this information, the data contained in the reports needs to be digitized.
To address this issue, we have developed a lab report digitization engine that combines computer vision techniques such as Optical Character Recognition (OCR) with Natural Language Processing techniques such as Named Entity Recognition (NER) to extract the standardized value, unit, and reference range for each subtest present in a lab report.
OBJECTIVE
To create a diagnostic report digitization engine that extracts semantically and medically accurate information from scanned diagnostic reports.
METHODOLOGY
Each PII-masked document is pre-processed to determine whether it is scanned or has embedded text.
If embedded text is present, it is extracted using standard libraries, while scanned documents undergo Optical Character Recognition (OCR).
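The routing decision described here can be sketched as follows. This is a minimal illustration under stated assumptions, not the engine's actual code: it assumes the text layer of each page has already been read by some PDF library, and the names `needs_ocr`, `route_pages`, and `MIN_CHARS_PER_PAGE`, as well as the threshold of 25 characters, are hypothetical.

```python
from typing import List

# Hypothetical threshold: a page whose text layer holds fewer alphanumeric
# characters than this is treated as a scanned image.
MIN_CHARS_PER_PAGE = 25


def needs_ocr(page_text: str, min_chars: int = MIN_CHARS_PER_PAGE) -> bool:
    """Return True when a page's embedded text layer is too sparse to use."""
    meaningful = sum(1 for ch in page_text if ch.isalnum())
    return meaningful < min_chars


def route_pages(page_texts: List[str]) -> List[str]:
    """Label each page 'embedded' or 'ocr' based on its text layer."""
    return ["ocr" if needs_ocr(text) else "embedded" for text in page_texts]


# Illustrative input: one page with a usable text layer, one without.
pages = [
    "Haemoglobin 13.5 g/dL 13.0 - 17.0 Method: Colorimetric",
    "",
]
routes = route_pages(pages)
```

Deciding page by page rather than per document lets a PDF that mixes scanned and digital pages send only the scanned ones through OCR.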
Subsequently, a custom Named Entity Recognition (NER) algorithm is applied to parsed text to identify and extract critical information such as components, methods, values, units, and reference ranges.
The NER model is based on a dataset compiled from 87,575 reports gathered from 652 lab partners across India, ensuring adaptability to diverse representations of entities.
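To make the target output of the extraction step concrete, the sketch below parses one simple row layout into a component, value, unit, and reference range. It deliberately uses a regular expression as a stand-in, not the custom NER model described above, and it assumes a single layout (`component value unit low - high`). The pattern, the `UNIT_ALIASES` table, and the function name `parse_row` are all hypothetical.

```python
import re
from typing import Dict, Optional

# Hypothetical pattern for one row layout: <component> <value> <unit> <low> - <high>
ROW_PATTERN = re.compile(
    r"^(?P<component>[A-Za-z][A-Za-z ()]*?)\s+"
    r"(?P<value>\d+(?:\.\d+)?)\s+"
    r"(?P<unit>[^\s\d][^\s]*)\s+"
    r"(?P<low>\d+(?:\.\d+)?)\s*-\s*(?P<high>\d+(?:\.\d+)?)\s*$"
)

# Hypothetical unit spellings mapped to one canonical form.
UNIT_ALIASES = {"gm/dl": "g/dL", "g/dl": "g/dL", "mg/dl": "mg/dL"}


def parse_row(line: str) -> Optional[Dict[str, object]]:
    """Parse one report row, or return None if it does not fit the layout."""
    match = ROW_PATTERN.match(line.strip())
    if match is None:
        return None
    unit = match.group("unit")
    return {
        "component": match.group("component").strip(),
        "value": float(match.group("value")),
        # Standardize the unit spelling when a known alias is found.
        "unit": UNIT_ALIASES.get(unit.lower(), unit),
        "reference_range": (float(match.group("low")), float(match.group("high"))),
    }


row = parse_row("Haemoglobin 13.5 gm/dl 13.0 - 17.0")
```

A fixed pattern like this breaks as soon as a lab changes its column order or wording, which is the variability a model trained on reports from many labs is meant to absorb.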
The engine is improved continuously through a dashboard on which trained annotators correct its outputs based on their expertise. In addition, a daily verification process is run because of the sensitive nature of the operation.



RESULTS
To test the system, we evaluated a random sample of 800 diagnostic reports that had not been used during training.
We measured true positives, false positives, true negatives, and false negatives across 28,992 component rows.
The engine achieved an accuracy of 94.69% and a precision of 0.9967.
We also found the recall to be 0.9461 and the F1 score to be 0.9707.
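For reference, these metrics follow from the four confusion-matrix counts by the standard definitions. The sketch below states them and checks that the reported F1 score is consistent with the reported precision and recall. The function names are illustrative, and the individual counts are not given in the text, so only the F1 check uses reported numbers.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


def f1_from(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Consistency check using the reported precision and recall.
reported_f1 = f1_from(0.9967, 0.9461)
print(round(reported_f1, 4))  # 0.9707
```

The high precision alongside a lower recall indicates that the engine's errors are mostly missed rows and only rarely wrong extractions.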
Performance Metrics

Confusion Matrix



CONCLUSION
Similar studies on extracting information from lab reports in other countries have been conducted previously, and various Indian healthcare companies have also ventured into this area.
However, to the best of our knowledge, this study presents the first attempt to introduce a method for digitizing lab reports at this scale of data in India, as our dataset stands to be the largest of its kind in the country.
The strength of the study lies in its dataset curation, which included diverse reports from multiple sources and locations and from varied patient profiles.
While this approach is useful for digitizing past records, highly accurate and reliable reporting will remain a challenge unless interoperability and reporting standards are widely adopted.
We nevertheless hope that learning-based systems can keep up with these challenges through access to varied reports spanning a spectrum of formats, with our datasets tracking the reporting specifications that change quickly as technology develops.
REFERENCES
Kang YS, Kayaalp M. Extracting laboratory test information from biomedical text. J Pathol Inform. 2013;4:23.
Hao T, Liu H, Weng C. Valx: A system for extracting and structuring numeric lab test comparison statements from text. 2017.