598 Reproducibility Project

Citation of the original paper:

Oleynik M, Kugic A, Kasáč Z, Kreuzthaler M. Evaluating shallow and deep learning strategies for the 2018 n2c2 shared task on clinical text classification. J Am Med Inform Assoc. 2019 Nov 1;26(11):1247-1254. doi: 10.1093/jamia/ocz149. PMID: 31512729; PMCID: PMC6798565.

Original paper's repo:

https://github.com/bst-mug/n2c2

Code Dependencies

  • JDK8+
  • python3 (to run official evaluation scripts)
  • make (to compile fastText)
  • gcc/clang (to compile fastText)

Steps for using this code repo:

    1. Obtain the original input dataset. One way is to submit a data access request at https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/.
    2. Once you have the input dataset, place the data under the folders /data/train and /data/test separately: 70% of the data is for training and 30% for testing.
    3. Run the program SentenceDumper.java to generate the sentences.txt file.
    4. Run the program VocabularyDumper.java to generate the vocab.txt file.
    5. Install the fastText program on your working machine. You can follow this guide: https://fasttext.cc/docs/en/supervised-tutorial.html
    6. Copy the sentences.txt file into the scripts folder, then run the script train_embeddings.sh, which generates the n2c2-fasttext model.
    7. Copy the vocab.txt file into the scripts folder. Download BioWordVec_PubMed_MIMICIII_d200.bin from https://github.com/ncbi-nlp/BioSentVec. Run the script print_pre_trained_vectors.sh to generate the pre-trained embeddings and print_self_trained_vectors.sh to generate the self-trained embeddings. Then copy both embedding files into the same folder as the class files compiled by Java (see the vector-reader sketch after this list).
    8. To start the main program, navigate to ClassifierRunner.java and run its main method. In my case, I disabled the two classifiers that depend on the pre-trained embeddings, because my laptop has only 8 GB of memory and could not load BioWordVec_PubMed_MIMICIII_d200.bin to generate them.
    9. The program first loads all the classifiers, then parses the training and test datasets into a list of patient objects. For each classifier it applies two validators: the official n2c2 metrics and a basic accuracy/false-positive/false-negative metric. Finally, it writes the results under the stats folder as files with the suffixes -basic.csv and -official.csv (see the runner sketch after this list).
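The embedding files produced in step 7 follow fastText's plain-text vector output: one line per word, the token first, then its vector components, all space-separated. Below is a minimal reader sketch for that format; the class name and file name are hypothetical stand-ins, not the repo's actual code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal reader for fastText's text vector format: one line per word,
// token first, then the vector components, all space-separated.
// The file name below is a placeholder for the embeddings from step 7.
public class VectorReader {

    public static Map<String, float[]> read(String path) throws IOException {
        Map<String, float[]> vectors = new HashMap<>();
        List<String> lines = Files.readAllLines(Paths.get(path));
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length < 2) continue; // skip blank or malformed lines
            float[] vec = new float[parts.length - 1];
            for (int i = 1; i < parts.length; i++) {
                vec[i - 1] = Float.parseFloat(parts[i]);
            }
            vectors.put(parts[0], vec);
        }
        return vectors;
    }

    public static void main(String[] args) throws IOException {
        Map<String, float[]> emb = read("self_trained_vectors.vec");
        System.out.println("loaded " + emb.size() + " word vectors");
    }
}
```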
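For orientation, here is a sketch (Java 8) of the train/evaluate/report loop described in step 9. Classifier, the file layout, and the metric helpers are hypothetical stand-ins; the repo's real ClassifierRunner is the authoritative entry point.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of ClassifierRunner's flow: load classifiers, train,
// validate on the test split, write per-classifier stats. Patient records
// are reduced to plain strings here; the real repo uses richer classes.
public class RunnerSketch {

    interface Classifier {
        String getName();
        void train(List<String> trainRecords);
        String predict(String record); // e.g. "met" / "not met"
    }

    public static void main(String[] args) throws IOException {
        // Parse the train/test splits placed under /data (step 2).
        List<String> train = Files.readAllLines(Paths.get("data/train/records.txt"));
        List<String> test  = Files.readAllLines(Paths.get("data/test/records.txt"));
        List<String> gold  = Files.readAllLines(Paths.get("data/test/labels.txt"));
        Files.createDirectories(Paths.get("stats"));

        for (Classifier clf : loadClassifiers()) {
            clf.train(train);

            // Basic validator: accuracy plus false-positive/false-negative counts.
            int fp = 0, fn = 0, correct = 0;
            for (int i = 0; i < test.size(); i++) {
                String pred = clf.predict(test.get(i));
                if (pred.equals(gold.get(i))) correct++;
                else if (pred.equals("met")) fp++;  // predicted met, truth not met
                else fn++;                          // predicted not met, truth met
            }
            double accuracy = (double) correct / test.size();

            // Write per-classifier results under stats/ with the step-9 suffix.
            try (PrintWriter out = new PrintWriter(Files.newBufferedWriter(
                    Paths.get("stats", clf.getName() + "-basic.csv")))) {
                out.printf("accuracy,fp,fn%n%.4f,%d,%d%n", accuracy, fp, fn);
            }
        }
    }

    static List<Classifier> loadClassifiers() {
        // The real ClassifierRunner registers Baseline, RBC, SVM, LR and LSTM here.
        return Collections.emptyList();
    }
}
```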

Tables of results:

I uploaded all outputs under the original_output folder, including results for the Baseline, RBC, SVM, LR, and LSTM models.

Overall F1 score per criterion on the test set, compared with the baseline, a majority classifier:

| Criterion       | Baseline | RBC    | SVM    | SELF-LR | SELF-LSTM |
|-----------------|----------|--------|--------|---------|-----------|
| Abdominal       | 0.3944   | 0.872  | 0.6028 | 0.5959  | 0.5411    |
| Advanced-cad    | 0.3435   | 0.7902 | 0.7281 | 0.7133  | 0.4538    |
| Alcohol-abuse   | 0.4911   | 0.4881 | 0.4911 | 0.4911  | 0.4911    |
| Asp-for-mi      | 0.4416   | 0.7095 | 0.6063 | 0.5962  | 0.4305    |
| Creatinine      | 0.4189   | 0.8071 | 0.6532 | 0.7073  | 0.5855    |
| Dietsupp-2mos   | 0.3385   | 0.9185 | 0.5814 | 0.6038  | 0.4267    |
| Drug-abuse      | 0.4911   | 0.691  | 0.4911 | 0.4911  | 0.6546    |
| English         | 0.4591   | 0.8644 | 0.4591 | 0.4591  | 0.4557    |
| Hba1c           | 0.3723   | 0.9382 | 0.6267 | 0.5393  | 0.5216    |
| Keto-1yr        | 0.5      | 0.5    | 0.5    | 0.5     | 0.5       |
| Major-diabetes  | 0.3333   | 0.8369 | 0.7555 | 0.7643  | 0.4407    |
| Makes-decisions | 0.4911   | 0.4911 | 0.4911 | 0.4911  | 0.4911    |
| Mi-6mos         | 0.4756   | 0.8752 | 0.6815 | 0.4756  | 0.4691    |
| Overall (micro) | 0.7608   | 0.91   | 0.8035 | 0.8031  | 0.73      |
| Overall (macro) | 0.427    | 0.7525 | 0.5899 | 0.5714  | 0.497     |
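The two Overall rows aggregate the 13 criteria differently. The standard definitions are shown below; the official n2c2 evaluation script remains the authoritative computation.

```latex
% Macro-averaged F1: unweighted mean of the per-criterion scores,
% so every criterion counts equally regardless of frequency.
\mathrm{F1}_{\mathrm{macro}} = \frac{1}{|C|} \sum_{c \in C} \mathrm{F1}_c

% Micro-averaged F1: pool TP/FP/FN counts across criteria first,
% so frequently occurring criteria carry more weight.
\mathrm{F1}_{\mathrm{micro}} =
  \frac{2 \sum_{c \in C} \mathrm{TP}_c}
       {2 \sum_{c \in C} \mathrm{TP}_c + \sum_{c \in C} \mathrm{FP}_c + \sum_{c \in C} \mathrm{FN}_c}
```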

Overall accuracy per criterion on the test set, compared with the baseline, a majority classifier:

| Criterion       | Baseline | RBC       | SVM      | SELF-LR   | SELF-LSTM |
|-----------------|----------|-----------|----------|-----------|-----------|
| Abdominal       | 0.651162 | 0.883720  | 0.651162 | 0.662790  | 0.569767  |
| Advanced-cad    | 0.523255 | 0.790697  | 0.732558 | 0.720930  | 0.616279  |
| Alcohol-abuse   | 0.523255 | 0.953488  | 0.965116 | 0.965116  | 0.965116  |
| Asp-for-mi      | 0.790697 | 0.860465  | 0.755813 | 0.767441  | 0.767441  |
| Creatinine      | 0.720930 | 0.837209  | 0.720930 | 0.755813  | 0.709302  |
| Dietsupp-2mos   | 0.511627 | 0.918604  | 0.581395 | 0.604651  | 0.441860  |
| Drug-abuse      | 0.965116 | 0.965116  | 0.965116 | 0.965116  | 0.9302325 |
| English         | 0.848837 | 0.941860  | 0.848837 | 0.848837  | 0.837209  |
| Hba1c           | 0.593023 | 0.9418604 | 0.651162 | 0.58139   | 0.511627  |
| Keto-1yr        | 1.0      | 1.0       | 1.0      | 1.0       | 1.0       |
| Major-diabetes  | 0.965116 | 0.837209  | 0.755813 | 0.7674418 | 0.523255  |
| Makes-decisions | 0.906976 | 0.965116  | 0.965116 | 0.965116  | 0.965116  |
| Mi-6mos         | 0.906976 | 0.965116  | 0.930232 | 0.767441  | 0.965116  |
| Overall (micro) | 0.764758 | 0.912343  | 0.809481 | 0.808586  | 0.7495527 |
| Overall (macro) | 0.764758 | 0.91234   | 0.809481 | 0.808586  | -         |