# Inst2Vec Model
We use [HuggingFace Transformers](https://github.com/huggingface/transformers) to train a BERT model with dynamic masking on assembly language from scratch. We name it `Inst2Vec` because it is designed to generate vectors for assembly instructions.
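
The dynamic masking comes from the data collator, which re-samples the masked positions every time a batch is built. Below is a minimal sketch of that setup; the tokenizer file name and config values are illustrative assumptions, not taken from this repo:

```python
# Minimal sketch of dynamic masking with HuggingFace Transformers.
# "tokenizer.json" and the config values are assumptions for illustration.
from transformers import (
    BertConfig,
    BertForMaskedLM,
    DataCollatorForLanguageModeling,
    PreTrainedTokenizerFast,
)

tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="tokenizer.json",  # e.g. produced by train_my_tokenizer.py
    unk_token="[UNK]", pad_token="[PAD]", mask_token="[MASK]",
    cls_token="[CLS]", sep_token="[SEP]",
)

model = BertForMaskedLM(BertConfig(vocab_size=tokenizer.vocab_size))

# mlm=True masks a fresh 15% of tokens each time a batch is assembled,
# so every epoch sees different masked positions: the "dynamic mask".
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
```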
It is part of the model introduced in the ICONIP 2021 paper [A Hierarchical Graph-based Neural Network for Malware Classification](https://link.springer.com/chapter/10.1007%2F978-3-030-92273-3_51).
The preprocessing procedure can be found in [process_data](./process_data/readme.md).
Run `python train_my_tokenizer.py` to obtain an assembly tokenizer; a sketch of what such a script might do is shown below.
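
The script's exact settings are not shown here. As a hedged sketch, a word-level tokenizer trained on whitespace-split instructions with the standard BERT special tokens could look like this; `corpus.txt` is a placeholder for the preprocessed instruction corpus:

```python
# Hypothetical assembly tokenizer training; train_my_tokenizer.py may
# differ in vocabulary model, pre-tokenization, and special tokens.
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordLevelTrainer

tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()  # split instructions on whitespace

trainer = WordLevelTrainer(
    special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"])
tokenizer.train(["corpus.txt"], trainer)  # placeholder corpus path
tokenizer.save("tokenizer.json")
```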
The command I use to train the `Inst2Vec` model is as follows:
```bash
python my_run_mlm_no_trainer.py \
    --per_device_train_batch_size 8192 \
    --per_device_eval_batch_size 16384 \
    --num_warmup_steps 4000 \
    --output_dir ./ \
    --seed 1234 \
    --preprocessing_num_workers 32 \
    --max_train_steps 150000 \
    --eval_every_steps 1000
```
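
Once training finishes, the checkpoint written to `--output_dir` can be loaded to embed instructions. A hedged sketch follows; the mean-pooling strategy and file paths are my assumptions, not this repo's documented API:

```python
# Hypothetical use of a trained Inst2Vec checkpoint to embed an instruction.
import torch
from transformers import AutoModel, PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="tokenizer.json", pad_token="[PAD]")
model = AutoModel.from_pretrained("./")  # the --output_dir used above
model.eval()

inputs = tokenizer("mov eax , ebx", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, seq_len, hidden_size)
vector = hidden.mean(dim=1)  # mean-pool tokens into one instruction vector
```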