Inst2Vec Model

We use HuggingFace Transformers to train a BERT model with dynamic masking for assembly language from scratch. We name it Inst2Vec because it is designed to generate vectors for assembly instructions.
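
Inst2Vec is trained with a masked-language-model objective using dynamic masking, i.e. the masked positions are re-sampled every time a batch is collated rather than fixed once during preprocessing. Below is a minimal sketch of how this is typically wired together with HuggingFace Transformers; the tokenizer path, vocabulary size, and masking probability are placeholders, not the repository's actual settings:

from transformers import (
    BertConfig,
    BertForMaskedLM,
    PreTrainedTokenizerFast,
    DataCollatorForLanguageModeling,
)

# Load the assembly tokenizer produced by train_my_tokenizer.py (path is a placeholder).
tokenizer = PreTrainedTokenizerFast.from_pretrained("./tokenizer")

# Dynamic masking: a fresh random mask is drawn each time a batch is built,
# so the same instruction sequence sees different masks across epochs.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

# A BERT model initialized from scratch (no pretrained weights).
config = BertConfig(vocab_size=tokenizer.vocab_size)
model = BertForMaskedLM(config)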

It is part of the model introduced in the ICONIP 2021 paper "A Hierarchical Graph-based Neural Network for Malware Classification".

The preprocessing procedure can be found in process_data.

You can simply run python train_my_tokenizer.py to obtain an assembly tokenizer.
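
train_my_tokenizer.py builds the vocabulary from the preprocessed instruction corpus. As a rough illustration, training a tokenizer with the HuggingFace tokenizers library looks roughly like the following; the tokenizer model, corpus file, and special tokens here are assumptions, so consult the script for the actual configuration:

from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

# Tokenizer model and special tokens are assumptions for illustration.
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = WordPieceTrainer(
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
)
# "corpus.txt" is a placeholder for the preprocessed instruction corpus.
tokenizer.train(files=["corpus.txt"], trainer=trainer)
tokenizer.save("tokenizer.json")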

The script I use to train the Inst2Vec model is as follows:

python my_run_mlm_no_trainer.py \
    --per_device_train_batch_size 8192 \
    --per_device_eval_batch_size 16384 \
    --num_warmup_steps 4000 --output_dir ./ \
    --seed 1234 --preprocessing_num_workers 32 \
    --max_train_steps 150000 \
    --eval_every_steps 1000
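
After training, the saved checkpoint can be used to embed individual instructions (see obtain_inst_vec.py for the repository's own procedure). The snippet below is only an illustrative sketch, assuming the checkpoint lives in the output directory above and using simple mean pooling over the last hidden states:

import torch
from transformers import BertModel, PreTrainedTokenizerFast

# Paths are placeholders; load whatever directory the training run wrote to.
tokenizer = PreTrainedTokenizerFast.from_pretrained("./")
model = BertModel.from_pretrained("./")
model.eval()

inst = "mov eax , dword ptr [ ebp + 8 ]"  # an example assembly instruction
inputs = tokenizer(inst, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # shape: (1, seq_len, hidden_size)
inst_vec = hidden.mean(dim=1).squeeze(0)        # one vector per instruction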