Med-BERT

The Med-BERT model is a natural language processing model for disease prediction based on structured electronic health record (EHR) data. You can read more about it in the paper:

Laila Rasmy, Yang Xiang, Ziqian Xie, Cui Tao, and Degui Zhi. “Med-BERT: pre-trained contextualized embeddings on large-scale structured electronic health records for disease prediction.” npj Digital Medicine, 2021. https://www.nature.com/articles/s41746-021-00455-y

Due to vendor restrictions, the authors could not share their trained model:

Initially we really hoped to share our models but unfortunately, the pre-trained models are no longer sharable. According to SBMI Data Service Office: “Under the terms of our contracts with data vendors, we are not permitted to share any of the data utilized in our publications, as well as large models derived from those data.”

but they shared the code to reproduce Med-BERT at https://github.com/ZhiGroup/Med-BERT.

If you have access to data that aligns with Med-BERT’s requirements, you can leverage LARCC’s resources to create your own instance of Med-BERT. Here is an example for the pre-training phase:

  1. Set up the code dependencies. In this case, the pretraining code depends on TensorFlow 1.x, which

    • is only compatible with Python 3.5 to 3.7. The cluster comes with Python 3.9 by default and, currently, there is no module for any of these Python versions, so you will need Conda to create an environment with a compatible Python version.

    • is compatible with protobuf versions prior to 4.0.

    • is compatible with CUDA versions up to CUDA 10. LARCC’s GPUs require CUDA 11.8 or newer, so you will need to run the pretraining on CPUs.

    module load miniforge3
    # Pin TensorFlow below 2.0 and use the CPU build: LARCC's GPUs require a
    # CUDA version that TensorFlow 1.x does not support.
    conda create --name my_tf1 python=3.7 'tensorflow<2' 'protobuf<=3.20' pandas numpy matplotlib
    
  2. Download the code and replace all spaces in folder names with _ to avoid path issues on Linux.

    cd $WORK
    git clone https://github.com/ZhiGroup/Med-BERT.git
    find Med-BERT -depth -type d -name '* *' -execdir bash -c 'mv "$0" "${0// /_}"' '{}' \;
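
    As a quick sanity check, the rename can be exercised on a throwaway directory tree first; the paths below are made up for illustration and mimic the repository's folder names:

    ```shell
    # Create a disposable tree with spaces in directory names (illustrative paths).
    mkdir -p '/tmp/rename_demo/Pretraining Code/Data Pre-processing Code'
    cd /tmp/rename_demo

    # -depth renames the deepest directories first, so parents are only
    # renamed after their children; otherwise the child paths would break.
    find . -depth -type d -name '* *' -execdir bash -c 'mv "$0" "${0// /_}"' '{}' \;

    # This should print nothing: no directory with a space remains.
    find . -type d -name '* *'
    ```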
    
  3. Preprocess the data you will use for the pretraining step. In the example below, the option --output_file='ehr_tf_features' writes a TensorFlow-formatted features file named ehr_tf_features, which is required for the pretraining.

    cd $WORK/Med-BERT/Pretraining_Code/Data_Pre-processing_Code
    # NOTE: You can run the following in a batch job instead.
    srun --partition=cpu384g --job-name med-bert --time=01:00:00 --ntasks-per-node=32 --cpu-bind=cores --pty /bin/bash -i
    cd $WORK/Med-BERT/Pretraining_Code/Data_Pre-processing_Code
    module load miniforge3
    conda activate my_tf1
    # NOTE: This assumes your input file is stored in the path below. Adjust
    # the path if your data lives somewhere else.
    INPUT=$WORK/Med-BERT/Pretraining_Code/Data_Pre-processing_Code/data_file.tsv
    OUT_PREFIX=preprocessed
    python3 preprocess_pretrain_data.py "$INPUT" NA "$OUT_PREFIX"
    python3 create_BERTpretrain_EHRfeatures.py \
        --input_file="$OUT_PREFIX.bencs.train" \
        --output_file='ehr_tf_features' \
        --vocab_file="$OUT_PREFIX.types" \
        --max_predictions_per_seq=1 \
        --max_seq_length=64
    exit
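
    The NOTE above mentions running the preprocessing in a batch job instead of an interactive session; a minimal submission script for that could look like the sketch below. The partition, time limit, environment name, and paths are assumptions carried over from the interactive example and should be adapted to your setup:

    ```shell
    #!/bin/bash
    #SBATCH --partition=cpu384g
    #SBATCH --job-name=med-bert-preprocess
    #SBATCH --time=01:00:00
    #SBATCH --ntasks-per-node=32

    module load miniforge3
    conda activate my_tf1

    cd "$WORK/Med-BERT/Pretraining_Code/Data_Pre-processing_Code"
    INPUT="$WORK/Med-BERT/Pretraining_Code/Data_Pre-processing_Code/data_file.tsv"
    OUT_PREFIX=preprocessed

    python3 preprocess_pretrain_data.py "$INPUT" NA "$OUT_PREFIX"
    python3 create_BERTpretrain_EHRfeatures.py \
        --input_file="$OUT_PREFIX.bencs.train" \
        --output_file='ehr_tf_features' \
        --vocab_file="$OUT_PREFIX.types" \
        --max_predictions_per_seq=1 \
        --max_seq_length=64
    ```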
    
  4. Create a submission script for the pretraining phase. Assume the script below is written to $WORK/med-bert.sbatch.

    #!/bin/bash
    #SBATCH --partition=cpu384g
    #SBATCH --time=3-00:00:00
    #SBATCH --ntasks=32
    #SBATCH --ntasks-per-core=1

    PROJECT_NAME="Med-BERT"
    PROJECT_DIR="/work/$USER/$PROJECT_NAME"
    SCRATCH_DIR="/mnt/local/scratch/$USER"
    WDIR="$SCRATCH_DIR/$SLURM_JOB_ID"
    PRETRAIN_DIR="$WDIR/$PROJECT_NAME/Pretraining_Code"
    OUTDIR="$WDIR/out"

    # Copy the project to scratch space to speed up reading and writing of data.
    # Checkpoints will be written to scratch too. You want to avoid writing to
    # your home storage as it's orders of magnitude slower for read/write operations.
    mkdir -p "$WDIR"
    mkdir -p "$OUTDIR"
    cp -r "$PROJECT_DIR" "$WDIR"
    cd "$PRETRAIN_DIR"

    # BACKUP settings:
    #   These variables control how often and where the checkpoints produced
    #   by the pretraining code are copied over to persistent storage.
    BACKUP_DIR="$PROJECT_DIR/Pretraining_Code/checkpoint_$SLURM_JOB_ID"
    BACKUP_FREQUENCY=600 # i.e. back up every 10m
    BACKUP_LOCK="$SCRATCH_DIR/backup_in_progress_$SLURM_JOB_ID"
    BACKUP_CMD="rsync -a $OUTDIR/ $BACKUP_DIR/"

    # Start a background process that copies any produced checkpoint back to
    # persistent storage while this job script is alive.
    bash -c "
      while [ -e /proc/$$ ]; do
        sleep $BACKUP_FREQUENCY
        touch '$BACKUP_LOCK'
        $BACKUP_CMD
        rm '$BACKUP_LOCK'
      done
    " &
    backup_pid=$!

    module load miniforge3
    conda activate my_tf1

    python3 run_EHRpretraining.py \
            --input_file="$PRETRAIN_DIR/Data_Pre-processing_Code/ehr_tf_features" \
            --output_dir="$OUTDIR" \
            --do_train=True \
            --do_eval=True \
            --bert_config_file="$PRETRAIN_DIR/config.json" \
            --train_batch_size=32 \
            --max_seq_length=64 \
            --max_predictions_per_seq=1 \
            --num_train_steps=4500000 \
            --num_warmup_steps=10000 \
            --learning_rate=5e-5

    # Wait for the active backup (if any) to finish
    while [ -e "$BACKUP_LOCK" ]; do sleep 1; done
    # Prevent any more auto-backups from starting and perform a final backup
    kill "$backup_pid"
    $BACKUP_CMD
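
    The backup mechanism in the script is a generic pattern: a background loop copies results to persistent storage while the parent process is alive, and a lock file ensures the final copy does not race an in-flight one. Below is a minimal, self-contained sketch of the pattern with made-up temporary paths, a 1-second interval, and cp standing in for rsync to keep it dependency-free:

    ```shell
    #!/bin/bash
    # Illustrative stand-ins: in the real job these are scratch and /work paths.
    SRC=$(mktemp -d)      # plays the role of $OUTDIR on scratch
    DST=$(mktemp -d)      # plays the role of $BACKUP_DIR on persistent storage
    LOCK="${SRC}.lock"    # lock file kept outside SRC so it is never copied

    # Background watcher: copy SRC to DST every second while this shell lives.
    # $$ expands here (inside double quotes) to the parent shell's PID, so the
    # loop exits on its own once the parent is gone.
    bash -c "
      while [ -e /proc/$$ ]; do
        sleep 1
        touch '$LOCK'
        cp -a '$SRC/.' '$DST/'
        rm -f '$LOCK'
      done
    " &
    backup_pid=$!

    # Simulate the long-running training step producing a checkpoint.
    echo 'fake checkpoint' > "$SRC/model.ckpt"
    sleep 3

    # Wait for any in-flight backup, stop the watcher, then do a final copy.
    while [ -e "$LOCK" ]; do sleep 1; done
    kill "$backup_pid"
    cp -a "$SRC/." "$DST/"

    cat "$DST/model.ckpt"   # → fake checkpoint
    ```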
    
  5. Submit the script to Slurm with sbatch $WORK/med-bert.sbatch.