Pre-processing steps
1. run convert_space_format.py
Convert the string <space> to SPACE.
linux32_0ixxxx.all -> inst.i.pos.txt
located at /home/ming/malware/data/elfasm_inst_pairs
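For reference, a minimal sketch of what this conversion does (the real convert_space_format.py may differ; the command-line handling here is an assumption):
import sys

# sketch only: replace the literal token "<space>" with "SPACE", line by line
src, dst = sys.argv[1], sys.argv[2]   # e.g. linux32_0ixxxx.all  inst.i.pos.txt
with open(src) as fin, open(dst, "w") as fout:
    for line in fin:
        fout.write(line.replace("<space>", "SPACE"))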
2. remove the repeated lines in inst.i.pos.txt
Using a Python script is too slow, so we use the shell instead:
cat inst.i.pos.txt | sort -n | uniq > inst.i.pos.txt.clean
3. create_negtive_examples
We use the next file of the current file as the source of its negative examples, which is a reasonable choice: instructions drawn from a different file are unlikely to form a genuine positive pair.
Specifically, for each instruction in the current positive file, we randomly choose a line in the next file and select one of the two instructions in that line as the negative example.
python create_negtive_examples.py
This generates inst.i.neg.txt, located at /home/ming/malware/data/elfasm_inst_pairs.
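A rough sketch of this sampling step, assuming the two instructions in a line are tab-separated (the real create_negtive_examples.py may differ):
import random, sys

# sketch only: positive file of the current index, positive file of the next index, output
pos_path, next_path, neg_path = sys.argv[1], sys.argv[2], sys.argv[3]

with open(next_path) as f:
    # each line of the next file holds a pair of instructions, assumed tab-separated
    next_pairs = [l.rstrip("\n").split("\t") for l in f if l.strip()]

with open(pos_path) as fin, open(neg_path, "w") as fout:
    for line in fin:
        inst = line.rstrip("\n").split("\t")[0]
        # pick a random line from the next file, then one of its two instructions
        neg = random.choice(random.choice(next_pairs))
        fout.write(inst + "\t" + neg + "\n")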
4. merge all of the files
We concatenate all of the inst.i.pos.txt.clean files and remove any lines repeated across different files:
cat inst.*.pos.txt.clean | sort -n | uniq > inst.all.pos.txt.clean
We process the files containing negative examples similarly.
cat inst.*.neg.txt.clean | sort -n | uniq > inst.all.neg.txt.clean
Finally, we remove from inst.all.neg.txt.clean any line that also occurs in inst.all.pos.txt.clean. This is done by python clean.py.
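A minimal sketch of what clean.py has to do (the output file name here is only illustrative):
# sketch only: drop negative lines that also appear among the positive lines
with open("inst.all.pos.txt.clean") as f:
    pos_lines = set(f)

with open("inst.all.neg.txt.clean") as fin, \
     open("inst.all.neg.txt.clean.filtered", "w") as fout:
    for line in fin:
        if line not in pos_lines:
            fout.write(line)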
5. convert to json format
We first prepend a label to each example (0 for negative, 1 for positive):
cat inst.all.neg.txt.clean | sed 's/^/0\t&/g' > inst.all.neg.txt.clean.label
cat inst.all.pos.txt.clean | sed 's/^/1\t&/g' > inst.all.pos.txt.clean.label
We dump the positive and negative examples with their corresponding labels into json files using python merge_examples_to_json.py
This generates inst.all.{0,1}.json, located at /home/ming/malware/inst2vec_bert/data/asm_bert.
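A sketch of the dump step; the record fields and the mapping of 0/1 to the negative/positive files are assumptions, not necessarily what merge_examples_to_json.py does:
import json

# sketch only: one json file per label
def dump(labeled_path, out_path):
    records = []
    with open(labeled_path) as fin:
        for line in fin:
            label, text = line.rstrip("\n").split("\t", 1)
            records.append({"label": int(label), "text": text})
    with open(out_path, "w") as fout:
        json.dump(records, fout)

dump("inst.all.neg.txt.clean.label", "inst.all.0.json")
dump("inst.all.pos.txt.clean.label", "inst.all.1.json")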
6. get the maximum length of the examples
We will specify the padding length when we configure the tokenizer, via tokenizer.enable_padding(..., length=...).
So we need to know the length of the longest sentence in the dataset.
The result is 28, so I set length=32.
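One way to get that number, assuming whitespace-separated tokens and the json record layout sketched in step 5:
import json

# sketch only: longest example measured in whitespace-separated tokens
max_len = 0
for path in ("inst.all.0.json", "inst.all.1.json"):
    with open(path) as f:
        for rec in json.load(f):
            max_len = max(max_len, len(rec["text"].split()))
print(max_len)   # 28 on this dataset, rounded up to length=32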
7. get the vocabulary size of the examples
Similarly, we need to specify the vocabulary size when we train the tokenizer, via WordLevelTrainer(vocab_size=..., ...).
So we need to know how many distinct tokens occur in the dataset.
The result is 1016, so I set vocab_size=2000.
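And a similar count for the vocabulary, again assuming whitespace tokenization and the json layout from step 5; the special tokens passed to WordLevelTrainer below are an assumption as well:
import json
from tokenizers.trainers import WordLevelTrainer

# sketch only: count distinct whitespace-separated tokens
vocab = set()
for path in ("inst.all.0.json", "inst.all.1.json"):
    with open(path) as f:
        for rec in json.load(f):
            vocab.update(rec["text"].split())
print(len(vocab))   # 1016 on this dataset, so vocab_size=2000 leaves headroom

# the trainer is then given a round number above that count
trainer = WordLevelTrainer(vocab_size=2000, special_tokens=["[PAD]", "[UNK]"])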