keeping the data in a cloud platform and alerting the person using mobile technology. Such a system can serve a large population by detecting disease early at low cost, and by giving patients the freedom to choose their diagnostic tests. First, we study the design of classification methodologies for electronic medical records, a well-studied problem for many diseases, e.g. heart disease [6–11] and breast cancer [12–14] (see Section 1.1). We find that classification of esophageal cancer has not been studied. Moreover, there is no consensus among the existing studies regarding the best classification technique. Our first result in this study is that kernel methods with SVM and logistic regression perform better than existing popular methods for classification, especially when the number of ‘clinical test’ features is low.
The second crucial aspect is the careful design of a metric for determining the best classifier. This is driven by two peculiarities of the problem: imbalance in the data and differential importance of the classes. Firstly, there are far more “normal” patients than “diseased” patients. Hence, we find the traditional accuracy metric undiscerning and possibly misleading, since marking most patients as normal automatically yields a high accuracy. Secondly, the cost of classifying a diseased patient as non-diseased (a false normal) is much higher than the cost of classifying a normal patient as diseased: in the second case the patient merely has to undergo more tests, while in the first case the patient risks the disease getting worse. Hence, we use sensitivity (the ratio of the number of diseased patients detected to the total number of diseased patients) as our metric, which is more discerning than accuracy.
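The gap between accuracy and sensitivity on imbalanced data can be illustrated with a small sketch. The cohort below (95 normal and 5 diseased patients) is purely hypothetical and is not taken from the data set used in this study; the function names are likewise illustrative.

```python
def accuracy(y_true, y_pred):
    """Fraction of patients whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def sensitivity(y_true, y_pred):
    """Diseased patients detected divided by all diseased patients (label 1)."""
    preds_for_diseased = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(preds_for_diseased) / len(preds_for_diseased)


# Hypothetical cohort: 95 normal (0) and 5 diseased (1) patients.
y_true = [0] * 95 + [1] * 5

# A trivial classifier that marks every patient as normal.
all_normal = [0] * 100

print(accuracy(y_true, all_normal))     # 0.95
print(sensitivity(y_true, all_normal))  # 0.0
```

The trivial classifier scores 95% accuracy while missing every diseased patient, which is exactly the failure that sensitivity exposes.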
We note that a sensitivity of 100% corresponds to the situation where no diseased patient is classified as non-diseased. In many classifiers, the ratio of examples predicted as positive or negative can be controlled using a parameter. For example, this can be achieved in SVMs and logistic regression by changing the prediction threshold. Moreover, in the case of logistic regression, the probability threshold lies within the range [0, 1] and is hence easier to select. In Section 3.3, we describe an algorithm for selecting the threshold, which tries to drive the number of false normals to zero. Under this criterion, the quality of the classifier is determined by the number of normal patients classified as diseased (false abnormals), which is used as a metric in the subsequent results. To the best of our knowledge, this approach to evaluating EMR classification has not been used before.
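One simple way to realise this idea is sketched below. It is not the algorithm of Section 3.3, only a minimal illustration under stated assumptions: the classifier outputs a probability of disease for each patient, a patient is flagged as diseased when that probability is at least the threshold, and the threshold is set to the lowest probability assigned to any known diseased patient. The probabilities, labels and function names are hypothetical.

```python
def pick_threshold(probs, labels):
    """Largest threshold that still flags every diseased patient (label 1).

    Assumption: a patient is predicted diseased when prob >= threshold.
    """
    diseased_probs = [p for p, y in zip(probs, labels) if y == 1]
    if not diseased_probs:
        raise ValueError("at least one diseased example is required")
    return min(diseased_probs)


def false_normal_count(probs, labels, threshold):
    """Diseased patients predicted as normal at the given threshold."""
    return sum(1 for p, y in zip(probs, labels) if y == 1 and p < threshold)


def false_abnormal_count(probs, labels, threshold):
    """Normal patients predicted as diseased at the given threshold."""
    return sum(1 for p, y in zip(probs, labels) if y == 0 and p >= threshold)


# Hypothetical predicted probabilities of disease and true labels.
probs = [0.05, 0.10, 0.20, 0.35, 0.40, 0.30, 0.70, 0.90]
labels = [0, 0, 0, 0, 0, 1, 1, 1]

t = pick_threshold(probs, labels)
print(t)                                       # 0.3
print(false_normal_count(probs, labels, t))    # 0
print(false_abnormal_count(probs, labels, t))  # 2
```

In practice the threshold would be chosen on held-out data rather than on the examples used to fit the classifier, since a threshold tuned on the training set does not guarantee 100% sensitivity on unseen patients.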
Finally, in Section 4.2 we describe two potential applications which demonstrate the ability of patients to express their preferences for selecting clinical tests. This choice may be based on financial cost, medical value or simply the comfort of the patient. We select specific sets of features (clinical tests) which satisfy a total budget constraint based on costs specified by the patient, the doctor, or a service provider such as an insurance company. These sets of features are used to classify patients, keeping the sensitivity at 100% using the algorithm developed above. We use the false abnormal rate to assess the efficiency of the resulting system. We report two case studies, one involving the cost of the tests and the other using a “discomfort” factor of the tests. Our results demonstrate that the best sets of tests differ depending on the criterion (cost or “discomfort”), while maintaining a sensitivity of 100%.
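The budget-constrained selection can be sketched as an exhaustive search over subsets of tests. This is an assumption-laden illustration rather than the procedure used in the case studies: the test names, costs and budget are hypothetical, and `false_abnormal_rate` stands in for whatever routine trains a classifier on a given set of tests and reports its false abnormal rate at 100% sensitivity.

```python
from itertools import combinations


def feasible_test_sets(costs, budget):
    """All non-empty sets of tests whose total cost is within the budget."""
    names = sorted(costs)
    feasible = []
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            if sum(costs[name] for name in subset) <= budget:
                feasible.append(subset)
    return feasible


def best_test_set(costs, budget, false_abnormal_rate):
    """Feasible set with the lowest false abnormal rate, or None if none fits.

    `false_abnormal_rate` is a caller-supplied function (hypothetical here)
    that maps a tuple of test names to a rate measured at 100% sensitivity.
    """
    candidates = feasible_test_sets(costs, budget)
    if not candidates:
        return None
    return min(candidates, key=false_abnormal_rate)


# Hypothetical per-test costs; these could equally be "discomfort" scores.
costs = {"test_a": 10, "test_b": 25, "test_c": 40}

candidates = feasible_test_sets(costs, budget=50)
print(len(candidates))  # 5


# Toy stand-in for a real evaluation: pretend more tests give a lower rate.
def toy_rate(subset):
    return 1.0 / len(subset)


print(best_test_set(costs, 50, toy_rate))  # ('test_a', 'test_b')
```

Exhaustive enumeration grows exponentially with the number of tests, so it is only practical when the pool of candidate tests is small. Swapping the cost dictionary for discomfort scores changes which sets are feasible, and therefore which set is selected.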
The rest of the document is structured as follows. Section 2 describes the preparation of the data set, how features are selected based on the needs of an individual patient, and the criteria used to evaluate the performance of the methods. Section 3 describes the proposed methods for detecting esophageal cancer and for selecting the threshold. Section 4 reports and discusses the results on classification of EMRs and the case studies on personalized selection of tests. We conclude with our remarks in Section 5, together with a summary table.