process_data final version of Inst2Vec 2021-06-30 19:20:12 +08:00
.gitignore first commit 2021-06-06 21:18:17 +08:00
data_collator_for_language_model.py complete data processing and first vision of training script 2021-06-06 20:50:36 +08:00
LICENSE first commit 2021-06-06 21:18:17 +08:00
my_data_collator.py complete interface for downstream task 2021-06-08 15:43:57 +08:00
my_run_mlm_no_trainer.py final version of Inst2Vec 2021-06-30 19:20:12 +08:00
obtain_inst_vec.py complete interface for downstream task 2021-06-08 15:43:57 +08:00
README.md update README 2021-10-11 21:24:37 +08:00
run_mlm_no_trainer.py complete data processing and first vision of training script 2021-06-06 20:50:36 +08:00
train_my_tokenizer.py final version of Inst2Vec 2021-06-30 19:20:12 +08:00

Inst2Vec

This repository uses HuggingFace Transformers to train a BERT model for assembly language from scratch. We call it Inst2Vec because it is designed to generate vectors for assembly instructions.

It is a part of the model we proposed in the paper A Hierarchical Graph-based Neural Network for Malware Classification.

The preprocessing procedure can be found in process_data.

Run python train_my_tokenizer.py to obtain a tokenizer for assembly language.
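For readers unfamiliar with what a tokenizer over assembly instructions does, here is a minimal word-level sketch using only the Python standard library. It does not reproduce train_my_tokenizer.py: the splitting rule, the list of special tokens, and the helper names split_instruction, build_vocab, and encode are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed BERT-style special tokens; the repository's actual set may differ.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]

# Hypothetical splitting rule: identifiers, hex literals, decimal literals,
# and every remaining non-space character as its own token.
TOKEN_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|0x[0-9A-Fa-f]+|\d+|[^\sA-Za-z0-9_]")


def split_instruction(inst):
    """Split one assembly instruction into lowercase tokens."""
    return TOKEN_PATTERN.findall(inst.lower())


def build_vocab(instructions, min_freq=1):
    """Map each token to an integer id; special tokens take the first ids."""
    counts = Counter(tok for inst in instructions for tok in split_instruction(inst))
    vocab = {tok: idx for idx, tok in enumerate(SPECIAL_TOKENS)}
    # Most frequent tokens first; ties are broken alphabetically for determinism.
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq and tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def encode(inst, vocab):
    """Convert an instruction to ids, mapping unseen tokens to [UNK]."""
    unk = vocab["[UNK]"]
    return [vocab.get(tok, unk) for tok in split_instruction(inst)]


if __name__ == "__main__":
    corpus = ["mov eax, [ebp+8]", "add eax, 0x10", "ret"]
    vocab = build_vocab(corpus)
    print(split_instruction(corpus[0]))
    print(encode("push eax", vocab))
```

A trained subword tokenizer would handle rare operands more gracefully than this fixed vocabulary; the sketch is only meant to show the instruction-to-ids step that precedes BERT training.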

The command I use to train the Inst2Vec model is as follows:

python my_run_mlm_no_trainer.py \
    --per_device_train_batch_size 8192 \
    --per_device_eval_batch_size 16384 \
    --num_warmup_steps 4000 --output_dir ./ \
    --seed 1234 --preprocessing_num_workers 32 \
    --max_train_steps 150000 \
    --eval_every_steps 1000
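
To get a feel for the training budget these flags imply, the arithmetic can be sketched as below. This is an illustration only. It assumes a single device with no gradient accumulation, that --max_train_steps counts optimizer updates, that --eval_every_steps triggers one evaluation every that many steps, and that the learning rate follows a linear warmup and then a linear decay to zero. None of these assumptions is confirmed by this README, and the helper names are hypothetical.

```python
# Values copied from the training command above.
PER_DEVICE_TRAIN_BATCH_SIZE = 8192
NUM_WARMUP_STEPS = 4000
MAX_TRAIN_STEPS = 150000
EVAL_EVERY_STEPS = 1000


def lr_multiplier(step, warmup=NUM_WARMUP_STEPS, total=MAX_TRAIN_STEPS):
    """Assumed schedule: scale factor applied to the peak learning rate.

    Rises linearly from 0 to 1 during warmup, then falls linearly to 0.
    """
    if step < warmup:
        return step / max(1, warmup)
    return max(0.0, (total - step) / max(1, total - warmup))


def sequences_seen(num_devices=1, grad_accum_steps=1):
    """Training sequences processed over the whole run (with repetition)."""
    return PER_DEVICE_TRAIN_BATCH_SIZE * num_devices * grad_accum_steps * MAX_TRAIN_STEPS


def num_evaluations():
    """Evaluation passes, assuming one every EVAL_EVERY_STEPS updates."""
    return MAX_TRAIN_STEPS // EVAL_EVERY_STEPS


if __name__ == "__main__":
    print("sequences seen:", sequences_seen())
    print("evaluations:", num_evaluations())
    for step in (0, 2000, 4000, 77000, 150000):
        print(step, lr_multiplier(step))
```

Under these assumptions the warmup covers roughly the first 2.7% of training (4000 of 150000 steps), so almost the entire run is spent in the decay phase.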