modify the procedure of processing data

``` shell
cat inst.i.pos.txt | sort -n | uniq > inst.i.pos.txt.clean
```

### 3. create negative examples

For each positive file, we use the file that follows it as the source of negative examples, which is a reasonable choice.

Specifically, for each instruction in the current positive file, we randomly choose a line in its next file and select one of the two instructions in that line as its negative example.

Run `python create_negtive_examples.py` to generate `inst.i.neg.txt` under `/home/ming/malware/data/elfasm_inst_pairs`, as sketched below.

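The script is not reproduced here; the following is a minimal sketch of the sampling step described above, assuming each line of a positive file holds two tab-separated instructions (the tab separator and the file names in the call are assumptions):

```python
import random

def sample_negatives(pos_path, next_path, out_path):
    """Pair each instruction in pos_path with a random instruction from next_path."""
    with open(next_path) as f:
        next_lines = [ln.rstrip("\n") for ln in f if ln.strip()]
    with open(pos_path) as f, open(out_path, "w") as out:
        for ln in f:
            inst = ln.rstrip("\n").split("\t")[0]
            # randomly choose a line in the next file ...
            pair = random.choice(next_lines).split("\t")
            # ... and select one of its two instructions as the negative partner
            out.write(f"{inst}\t{random.choice(pair)}\n")

sample_negatives("inst.0.pos.txt", "inst.1.pos.txt", "inst.0.neg.txt")
```
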
### 4. merge all of the files

We concatenate all of the `inst.i.pos.txt.clean` files and remove lines that repeat across different files:

``` shell
cat inst.*.pos.txt.clean | sort -n | uniq > inst.all.pos.txt.clean
```

We process the files containing negative examples similarly:

``` shell
cat inst.*.neg.txt.clean | sort -n | uniq > inst.all.neg.txt.clean
```

We then remove from `inst.all.neg.txt.clean` every line that also occurs in `inst.all.pos.txt.clean`. This can be completed by `python clean.py`; a minimal sketch of the filtering follows.

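The output filename below is illustrative; the real `clean.py` may overwrite the input in place:

```python
def remove_overlap(neg_path, pos_path, out_path):
    """Drop every negative line that also occurs in the positive file."""
    with open(pos_path) as f:
        positives = set(f)  # keep trailing newlines so comparisons stay exact
    with open(neg_path) as f, open(out_path, "w") as out:
        out.writelines(ln for ln in f if ln not in positives)

remove_overlap("inst.all.neg.txt.clean", "inst.all.pos.txt.clean",
               "inst.all.neg.txt.clean.filtered")
```
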
### 5. convert to json format

We first add labels to the positive and negative examples:

```shell
cat inst.all.neg.txt.clean | sed 's/^/0\t&/g' > inst.all.neg.txt.clean.label
cat inst.all.pos.txt.clean | sed 's/^/1\t&/g' > inst.all.pos.txt.clean.label
```

We then dump the positive and negative examples with their labels into several json files, using `python merge_examples_to_json.py`.

This generates `inst.all.{0,1}.json` under `/home/ming/malware/inst2vec_bert/data/asm_bert`.

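A minimal sketch of the dump step, assuming a `{"label": ..., "text": ...}` JSON-lines layout with one output file per label; the actual schema used by `merge_examples_to_json.py` may differ:

```python
import json

def dump_to_json(label_paths, out_prefix):
    """Write each labeled file as JSON lines: one object per example."""
    for idx, path in enumerate(label_paths):
        with open(path) as f, open(f"{out_prefix}.{idx}.json", "w") as out:
            for ln in f:
                label, text = ln.rstrip("\n").split("\t", 1)
                out.write(json.dumps({"label": int(label), "text": text}) + "\n")

dump_to_json(["inst.all.neg.txt.clean.label", "inst.all.pos.txt.clean.label"],
             "inst.all")
```

This yields `inst.all.0.json` (label 0) and `inst.all.1.json` (label 1), matching the `inst.all.{0,1}.json` naming above.
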
### 6. get the maximum length of the examples

We have to specify the padding length when configuring the tokenizer, via `tokenizer.enable_padding(..., length=)`.

So we need to know the length of the longest sentence in the dataset.

The result is `28`, so I set `length=32`.

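A minimal sketch of how the maximum can be measured, assuming whitespace tokenization (the trained tokenizer may split differently):

```python
def max_token_length(paths):
    """Return the token count of the longest example across the given files."""
    longest = 0
    for path in paths:
        with open(path) as f:
            for ln in f:
                longest = max(longest, len(ln.split()))
    return longest

print(max_token_length(["inst.all.pos.txt.clean", "inst.all.neg.txt.clean"]))
```
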
### 7. get the vocab size of the examples

Similarly, we need to specify the vocabulary size when training the tokenizer, via `WordLevelTrainer(vocab_size=, ...)`.

So we need to know how many distinct tokens the dataset contains.

The result is `1016`, so I set `vocab_size=2000`.
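
Under the same whitespace-tokenization assumption, a minimal sketch of the vocabulary count:

```python
def vocab_size(paths):
    """Count distinct whitespace-separated tokens across the given files."""
    vocab = set()
    for path in paths:
        with open(path) as f:
            for ln in f:
                vocab.update(ln.split())
    return len(vocab)

print(vocab_size(["inst.all.pos.txt.clean", "inst.all.neg.txt.clean"]))
```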