
# Pre-processing steps

1. Run `convert_space_format.py`

Converts the string `<space>` to `SPACE`.

`linux32_0ixxxx.all` -> `inst.i.pos.txt`, located at `/home/ming/malware/data/elfasm_inst_pairs`
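A minimal sketch of what this conversion might look like. The input glob pattern, the per-index output naming, and the line-by-line replacement are assumptions; only the `<space>` -> `SPACE` mapping and the directory come from the description above.

```python
# Hypothetical sketch of convert_space_format.py. The glob pattern and
# the index scheme in the output names are assumptions; the
# <space> -> SPACE mapping and the directory come from the text above.
import glob
import os

DATA_DIR = "/home/ming/malware/data/elfasm_inst_pairs"

for i, path in enumerate(sorted(glob.glob(os.path.join(DATA_DIR, "linux32_*.all")))):
    out_path = os.path.join(DATA_DIR, f"inst.{i}.pos.txt")
    with open(path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(line.replace("<space>", "SPACE"))
```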

2. Remove the duplicate lines in `inst.i.pos.txt`

A Python script is too slow for this, so we use the shell instead. Note that `uniq` only collapses adjacent duplicates, which is why the input is sorted first:

```sh
sort -n inst.i.pos.txt | uniq > inst.i.pos.txt.clean
```

3. Run `create_negative_examples.py`

We draw the negative examples for a given file from the file that follows it, which is a reasonable choice.

Specifically, for each instruction in the current positive file, we randomly choose a line from the next file and select one of the two instructions on that line as the negative example.

Generates `inst.i.neg.txt`, located at `/home/ming/malware/data/elfasm_inst_pairs`
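A minimal sketch of this sampling step. The tab delimiter between the two instructions on a line and the `.clean` input names are assumptions; the "random line from the next file, one of its two instructions" logic comes from the description above.

```python
# Hypothetical sketch of create_negative_examples.py: negatives for
# file i are drawn at random from file i+1. The tab delimiter between
# the two instructions on a line is an assumption.
import random

def make_negatives(pos_path: str, next_path: str, neg_path: str) -> None:
    with open(next_path) as f:
        next_lines = [line.strip() for line in f if line.strip()]
    with open(pos_path) as pos, open(neg_path, "w") as neg:
        for _ in pos:
            # For each positive instruction, pick a random line from the
            # next file, then one of the two instructions on that line.
            line = random.choice(next_lines)
            neg.write(random.choice(line.split("\t")) + "\n")

make_negatives("inst.0.pos.txt.clean", "inst.1.pos.txt.clean", "inst.0.neg.txt")
```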

4. Run `merge_examples_to_json.py`

We dump the positive and negative examples, together with their labels, into several JSON files. Each JSON file contains 20M lines of examples.

Generates `inst.i.{0-127}.json`, located at `/home/ming/malware/inst2vec_bert/data/asm_bert`
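A sketch of the merge, assuming a JSON-lines layout with hypothetical `text`/`label` field names; only the 20M-lines-per-file shard size and the output naming come from the text above.

```python
# Hypothetical sketch of merge_examples_to_json.py. The JSON-lines
# layout and the "text"/"label" field names are assumptions; the
# 20M-lines-per-file shard size comes from the text above.
import json

LINES_PER_SHARD = 20_000_000

def merge(pos_path: str, neg_path: str, out_prefix: str) -> None:
    def examples():
        # Positive examples get label 1, negatives label 0.
        with open(pos_path) as f:
            for line in f:
                yield {"text": line.rstrip("\n"), "label": 1}
        with open(neg_path) as f:
            for line in f:
                yield {"text": line.rstrip("\n"), "label": 0}

    out, shard = None, -1
    for count, ex in enumerate(examples()):
        # Start a new shard every LINES_PER_SHARD examples.
        if count % LINES_PER_SHARD == 0:
            if out:
                out.close()
            shard += 1
            out = open(f"{out_prefix}.{shard}.json", "w")
        out.write(json.dumps(ex) + "\n")
    if out:
        out.close()
```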

5. Run `check_length.py`

When we use the tokenizer we must specify the padding length via `tokenizer.enable_padding(..., length=...)`, so we need to know the length of the longest sentence in the dataset.
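A sketch of the length check using the Hugging Face `tokenizers` API. The `enable_padding(length=...)` call is the real API mentioned above; the tokenizer file and the input glob are placeholders.

```python
# Sketch of check_length.py: find the longest tokenized sentence so a
# fixed padding length can be chosen. The tokenizer file and the input
# glob are placeholders; enable_padding(length=...) is the Hugging Face
# tokenizers API referenced above.
import glob
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")  # placeholder path

max_len = 0
for path in glob.glob("inst.*.pos.txt.clean"):
    with open(path) as f:
        for line in f:
            max_len = max(max_len, len(tokenizer.encode(line.strip()).ids))

print("longest sentence (tokens):", max_len)

# Later, when preparing training batches:
tokenizer.enable_padding(length=max_len)
```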