Inst2Vec Model
This repository uses HuggingFace Transformers to train a BERT model for assembly language from scratch. We name it Inst2Vec
because it is designed to generate vectors for assembly instructions.
It is part of the model we proposed in the paper "A Hierarchical Graph-based Neural Network for Malware Classification".
The preprocessing procedure can be found in process_data.
You can simply run python train_my_tokenizer.py
to obtain an assembly tokenizer.
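To give a rough idea of what tokenizing assembly means, here is a minimal sketch that splits one instruction into opcode, operand, and punctuation tokens. It is not the logic in train_my_tokenizer.py: the function name split_instruction and the regular expression are hypothetical, chosen only for illustration, and the real tokenizer may use a different vocabulary and different splitting rules.

```python
import re

# Hypothetical splitting rule (an assumption, not taken from this repository):
# hex literals, identifiers (opcodes, registers, labels), decimal numbers,
# and single punctuation characters each become one token.
TOKEN_PATTERN = re.compile(r"0x[0-9a-f]+|[a-z_.$][\w.$]*|\d+|[\[\],+\-*:]")


def split_instruction(instruction: str) -> list[str]:
    """Split one assembly instruction into lowercase tokens."""
    return TOKEN_PATTERN.findall(instruction.lower())


tokens = split_instruction("mov eax, [ebp+8]")
print(tokens)
```

A trained tokenizer would then map each such token to an integer id before the instruction is fed to the BERT model.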
The command I use to train the Inst2Vec
model is as follows:
python my_run_mlm_no_trainer.py \
--per_device_train_batch_size 8192 \
--per_device_eval_batch_size 16384 \
--num_warmup_steps 4000 --output_dir ./ \
--seed 1234 --preprocessing_num_workers 32 \
--max_train_steps 150000 \
--eval_every_steps 1000
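As a quick sanity check on what these settings imply, the arithmetic below estimates the training budget. It is only a sketch under assumptions that the command above does not state: a single device, no gradient accumulation, and --eval_every_steps meaning one evaluation pass every that many training steps.

```python
# Values copied from the command above.
per_device_train_batch_size = 8192
max_train_steps = 150_000
eval_every_steps = 1_000
num_warmup_steps = 4_000

# Assumptions, not stated in the README.
num_devices = 1
gradient_accumulation_steps = 1

# Sequences consumed per optimizer step and over the whole run.
sequences_per_step = (
    per_device_train_batch_size * num_devices * gradient_accumulation_steps
)
total_sequences = sequences_per_step * max_train_steps

# Number of evaluation passes and share of the run spent in warmup.
num_evaluations = max_train_steps // eval_every_steps
warmup_fraction = num_warmup_steps / max_train_steps

print(f"sequences per step: {sequences_per_step}")
print(f"total sequences:    {total_sequences}")
print(f"evaluations:        {num_evaluations}")
print(f"warmup fraction:    {warmup_fraction:.2%}")
```

With more devices or gradient accumulation, the total scales by the corresponding factor.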