modify the procedure of processing data

zyr 2021-06-08 16:25:10 +08:00
parent fb61bb2a7b
commit fe2de236b5


cat inst.i.pos.txt | sort -n | uniq > inst.i.pos.txt.clean
```
### 3. create negative examples
We use the next file of the current file as the source of its negative examples, which is a simple and reasonable choice.
Specifically, for each instruction in the current positive file, we randomly choose a line in its next file and select one of the two instructions in that line as its negative example.
Run `python create_negtive_examples.py` to generate `inst.i.neg.txt`, located at `/home/ming/malware/data/elfasm_inst_pairs`.
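The sampling described above can be sketched as follows. This is a minimal illustration, not the actual `create_negtive_examples.py`; the function name and the assumption that each line holds two tab-separated instructions are mine:

```python
import random

def sample_negatives(pos_lines, next_file_lines):
    """For each line of the current positive file, pick a random line
    from the next file and keep one of its two instructions as the
    negative example. Assumes two tab-separated instructions per line."""
    negatives = []
    for _ in pos_lines:
        line = random.choice(next_file_lines)
        first, second = line.rstrip("\n").split("\t")
        negatives.append(random.choice((first, second)))
    return negatives
```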
### 4. merge all of the files
We concatenate all of the `inst.i.pos.txt.clean` files and remove lines repeated across different files:
``` shell
cat inst.*.pos.txt.clean | sort -n | uniq > inst.all.pos.txt.clean
```
We process the files containing negative examples similarly.
``` shell
cat inst.*.neg.txt.clean | sort -n | uniq > inst.all.neg.txt.clean
```
We then remove from `inst.all.neg.txt.clean` any lines that also occur in `inst.all.pos.txt.clean`. This is done by `python clean.py`.
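The filtering step can be sketched like this; a minimal version for illustration, and the real `clean.py` may differ:

```python
def remove_overlap(pos_path, neg_path, out_path):
    """Drop every line of the negative file that also appears in the
    positive file, so negatives stay disjoint from positives."""
    with open(pos_path) as f:
        positives = set(f)
    with open(neg_path) as f, open(out_path, "w") as out:
        for line in f:
            if line not in positives:
                out.write(line)
```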
### 5. convert to json format
We first add labels to the positive and negative examples:
```shell
cat inst.all.neg.txt.clean | sed 's/^/0\t&/g' > inst.all.neg.txt.clean.label
cat inst.all.pos.txt.clean | sed 's/^/1\t&/g' > inst.all.pos.txt.clean.label
```
We dump the positive and negative examples with their corresponding labels into several json files, using `python merge_examples_to_json.py`.
This generates `inst.all.{0,1}.json`, located at `/home/ming/malware/inst2vec_bert/data/asm_bert`.
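The conversion can be sketched as below. The `label`/`text` field names are an assumption about the json schema, not necessarily what `merge_examples_to_json.py` actually writes:

```python
import json

def labeled_lines_to_json(lines):
    """Turn 'label<TAB>example' lines into one JSON record per line.
    The 'label'/'text' keys are a hypothetical schema."""
    records = []
    for line in lines:
        label, text = line.rstrip("\n").split("\t", 1)
        records.append(json.dumps({"label": int(label), "text": text}))
    return records
```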
### 6. get the maximum length of examples
We will specify the length to pad to when we use the tokenizer, via `tokenizer.enable_padding(..., length=)`.
So we need to know the length of the longest sentence in the dataset.
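A minimal sketch of this check, assuming examples are tokenized on whitespace (consistent with a word-level tokenizer); rounding up to a multiple of 8 is a common padding convention, not a requirement:

```python
def max_token_length(lines):
    """Length, in whitespace-separated tokens, of the longest example."""
    return max(len(line.split()) for line in lines)

def padded_length(max_len, multiple=8):
    """Round up to the next multiple, a common padding choice."""
    return ((max_len + multiple - 1) // multiple) * multiple
```

With a measured maximum of 28 tokens, `padded_length(28)` returns 32, consistent with the `length=32` setting used here.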
The result is `28`, so I set `length=32`.
### 7. get the vocabulary size of examples
Similarly, we also need to specify the vocabulary size when we train the tokenizer, via `WordLevelTrainer(vocab_size=, ...)`.
So we need to know how many distinct tokens occur in the dataset.
The result is `1016`, so I set `vocab_size=2000`.
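The vocabulary count can be sketched the same way, again assuming whitespace tokenization. Setting the trainer's `vocab_size` with headroom above the measured count (here 1016 vs. 2000) leaves room for special tokens and unseen data:

```python
def count_vocab(lines):
    """Number of distinct whitespace-separated tokens in the corpus."""
    return len({tok for line in lines for tok in line.split()})
```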