598 Reproducibility Project
Citation to the original paper:
Oleynik M, Kugic A, Kasáč Z, Kreuzthaler M. Evaluating shallow and deep learning strategies for the 2018 n2c2 shared task on clinical text classification. J Am Med Inform Assoc. 2019 Nov 1;26(11):1247-1254. doi: 10.1093/jamia/ocz149. PMID: 31512729; PMCID: PMC6798565.
Original paper's repo:
https://github.com/bst-mug/n2c2
Code Dependencies
- JDK8+
- python3 (to run official evaluation scripts)
- make (to compile fastText)
- gcc/clang (to compile fastText)
Steps for using this code repo:
- Obtain the original input dataset. One way is to submit a data access request at https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/.
- Once you have the input dataset, put the data under the folders `/data/train` and `/data/test`, respectively. 70% of the data is for training and 30% is for testing.
- Run the program `SentenceDumper.java` to generate the `sentences.txt` file.
- Run the program `VocabularyDumper.java` to generate the `vocab.txt` file.
- Install fastText on your working machine. You can follow this guide: https://fasttext.cc/docs/en/supervised-tutorial.html
- Copy the `sentences.txt` file to the `scripts` folder, then run the script `train_embeddings.sh`, which generates the `n2c2-fasttext` model.
- Copy the `vocab.txt` file to the `scripts` folder. Download `BioWordVec_PubMed_MIMICIII_d200.bin` from https://github.com/ncbi-nlp/BioSentVec. Run the script `print_pre_trained_vectors.sh` to generate the pre-trained embeddings, and the script `print_self_trained_vectors.sh` to generate the self-trained embeddings. Then copy both embedding files into the same folder as the class files generated by Java.
- To start the main program, open `ClassifierRunner.java` and run its main method. In my case, I disabled two classifiers because the pre-trained embeddings could not be generated on my laptop: it has only 8 GB of memory and failed to load `BioWordVec_PubMed_MIMICIII_d200.bin`.
- The program first loads all the classifiers, then parses the training and test datasets into lists of patient objects. For each classifier, it uses two validators, `n2c2 official metrics` and `accuracy, fp/fn metrics`, to validate the model. Lastly, it writes the results under the `stats` folder, using `-basic.csv` and `-official.csv` as file-name suffixes.
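
For the data-placement step above, the 70/30 split can be sanity-checked before running the Java programs. The following is a minimal sketch in Python (already a dependency for the evaluation scripts), not part of the repository. It assumes one `.xml` file per patient and folders named `data/train` and `data/test` relative to the working directory; `count_records` and `train_fraction` are hypothetical helper names.

```python
from pathlib import Path


def count_records(folder, pattern="*.xml"):
    """Count record files directly inside `folder` (assumes one file per patient)."""
    return sum(1 for p in Path(folder).glob(pattern) if p.is_file())


def train_fraction(train_dir="data/train", test_dir="data/test", pattern="*.xml"):
    """Return the share of all records that sit in the training folder."""
    n_train = count_records(train_dir, pattern)
    n_test = count_records(test_dir, pattern)
    total = n_train + n_test
    if total == 0:
        # Nothing matched: the paths or the file pattern are probably wrong.
        raise ValueError("no record files found; check the folder paths")
    return n_train / total
```

Calling `train_fraction()` from the repository root should return a value close to 0.7 if the data is laid out as described. Adjust the folder names or the pattern if your copy of the dataset is organized differently.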
Table of results:
I uploaded all outputs under the `original_output` folder, including the Baseline, RBC, SVM, LR, and LSTM models.
Overall F1 score per criterion on the test set, compared with the baseline, a majority classifier:
Criterion | Baseline | RBC | SVM | SELF-LR | SELF-LSTM |
---|---|---|---|---|---|
Abdominal | 0.3944 | 0.872 | 0.6028 | 0.5959 | 0.5411 |
Advanced-cad | 0.3435 | 0.7902 | 0.7281 | 0.7133 | 0.4538 |
Alcohol-abuse | 0.4911 | 0.4881 | 0.4911 | 0.4911 | 0.4911 |
Asp-for-mi | 0.4416 | 0.7095 | 0.6063 | 0.5962 | 0.4305 |
Creatinine | 0.4189 | 0.8071 | 0.6532 | 0.7073 | 0.5855 |
Dietsupp-2mos | 0.3385 | 0.9185 | 0.5814 | 0.6038 | 0.4267 |
Drug-abuse | 0.4911 | 0.691 | 0.4911 | 0.4911 | 0.6546 |
English | 0.4591 | 0.8644 | 0.4591 | 0.4591 | 0.4557 |
Hba1c | 0.3723 | 0.9382 | 0.6267 | 0.5393 | 0.5216 |
Keto-1yr | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |
Major-diabetes | 0.3333 | 0.8369 | 0.7555 | 0.7643 | 0.4407 |
Makes-decisions | 0.4911 | 0.4911 | 0.4911 | 0.4911 | 0.4911 |
Mi-6mos | 0.4756 | 0.8752 | 0.6815 | 0.4756 | 0.4691 |
Overall (micro) | 0.7608 | 0.91 | 0.8035 | 0.8031 | 0.73 |
Overall (macro) | 0.427 | 0.7525 | 0.5899 | 0.5714 | 0.497 |
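
The baseline column can be cross-checked by hand. The sketch below assumes that the per-criterion F1 is the mean of the F1 scores for the `met` and `not met` classes, and that the macro row is the plain mean over the 13 criteria. Both assumptions reproduce the baseline numbers above. The counts (86 test patients, 83 of them in the majority class for some criteria) are inferred from the reported scores, and the function names are illustrative, not taken from the repository or the official evaluation script.

```python
def f1(tp, fp, fn):
    """F1 for one class; returns 0.0 when the class has no true positives to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def criterion_f1(gold, pred, labels=("met", "not met")):
    """Mean of the per-class F1 scores for a single criterion."""
    scores = []
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        scores.append(f1(tp, fp, fn))
    return sum(scores) / len(scores)


def macro_average(per_criterion_scores):
    """Unweighted mean over criteria."""
    return sum(per_criterion_scores) / len(per_criterion_scores)


# Majority classifier on a criterion with 83 "not met" and 3 "met" patients
# (inferred counts): one class scores 166/169, the other 0.
gold = ["not met"] * 83 + ["met"] * 3
pred = ["not met"] * 86
print(round(criterion_f1(gold, pred), 4))  # 0.4911

# If all test patients share one label, the other class contributes 0.
gold = ["not met"] * 86
print(round(criterion_f1(gold, gold), 4))  # 0.5

# Baseline column of the table above, in row order.
baseline_f1 = [0.3944, 0.3435, 0.4911, 0.4416, 0.4189, 0.3385, 0.4911,
               0.4591, 0.3723, 0.5, 0.3333, 0.4911, 0.4756]
print(round(macro_average(baseline_f1), 3))  # 0.427
```

Under these assumptions, a majority classifier can never exceed 0.5 on a criterion, because the class it never predicts always scores 0. This is why the Keto-1yr row reports 0.5 for every model.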
Overall accuracy per criterion on the test set, compared with the baseline, a majority classifier:
Criterion | Baseline | RBC | SVM | SELF-LR | SELF-LSTM |
---|---|---|---|---|---|
Abdominal | 0.651162 | 0.883720 | 0.651162 | 0.662790 | 0.569767 |
Advanced-cad | 0.523255 | 0.790697 | 0.732558 | 0.720930 | 0.616279 |
Alcohol-abuse | 0.523255 | 0.953488 | 0.965116 | 0.965116 | 0.965116 |
Asp-for-mi | 0.790697 | 0.860465 | 0.755813 | 0.767441 | 0.767441 |
Creatinine | 0.720930 | 0.837209 | 0.720930 | 0.755813 | 0.709302 |
Dietsupp-2mos | 0.511627 | 0.918604 | 0.581395 | 0.604651 | 0.441860 |
Drug-abuse | 0.965116 | 0.965116 | 0.965116 | 0.965116 | 0.9302325 |
English | 0.848837 | 0.941860 | 0.848837 | 0.848837 | 0.837209 |
Hba1c | 0.593023 | 0.9418604 | 0.651162 | 0.58139 | 0.511627 |
Keto-1yr | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
Major-diabetes | 0.965116 | 0.837209 | 0.755813 | 0.7674418 | 0.523255 |
Makes-decisions | 0.906976 | 0.965116 | 0.965116 | 0.965116 | 0.965116 |
Mi-6mos | 0.906976 | 0.965116 | 0.930232 | 0.767441 | 0.965116 |
Overall (micro) | 0.764758 | 0.912343 | 0.809481 | 0.808586 | 0.7495527 |
Overall (macro) | 0.764758 | 0.91234 | 0.809481 | 0.808586 | - |
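
For accuracy, the macro row matches the micro row, which is what one would expect when every criterion is evaluated on the same number of test patients. The following is a minimal sketch of that relationship, assuming 86 test patients per criterion (inferred from values such as 0.965116, which is 83/86 truncated). The counts and function names are illustrative only.

```python
def micro_accuracy(correct, total):
    """Pool all decisions across criteria, then divide."""
    return sum(correct) / sum(total)


def macro_accuracy(correct, total):
    """Compute accuracy per criterion, then take the unweighted mean."""
    return sum(c / t for c, t in zip(correct, total)) / len(correct)


# Three criteria with the same number of test patients (inferred counts).
correct = [56, 45, 83]
total = [86, 86, 86]
same = abs(micro_accuracy(correct, total) - macro_accuracy(correct, total)) < 1e-12
print(same)  # True

# With unequal group sizes the two averages diverge.
print(micro_accuracy([1, 9], [2, 10]) == macro_accuracy([1, 9], [2, 10]))  # False
```

The F1 table behaves differently because F1 is not a simple ratio of pooled counts per criterion, so its micro and macro rows are expected to differ.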