update README

This commit is contained in:
zyr 2021-06-07 18:43:10 +08:00
parent 1d1acb29b3
commit b4810738b8


### 1. run `convert_space_format.py`
Convert the string `<space>` to `SPACE`.
`linux32_0ixxxx.all -> inst.i.pos.txt`, located at `/home/ming/malware/data/elfasm_inst_pairs`.
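A minimal sketch of what this step might look like (the per-file loop and the `inst.{i}.pos.txt` output naming are assumptions, not the actual script):
``` python
import glob
import os

SRC = "/home/ming/malware/data/elfasm_inst_pairs"  # path from this README

# Replace the literal token "<space>" with "SPACE" in each input file
# and write the result as the corresponding positive-pair file.
for i, path in enumerate(sorted(glob.glob(os.path.join(SRC, "linux32_0i*.all")))):
    with open(path) as f:
        text = f.read().replace("<space>", "SPACE")
    with open(os.path.join(SRC, f"inst.{i}.pos.txt"), "w") as f:  # naming is an assumption
        f.write(text)
```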
### 2. remove the repeated lines in `inst.i.pos.txt`
A Python script is too slow for this, so we use the shell instead:
``` shell
sort -n inst.i.pos.txt | uniq > inst.i.pos.txt.clean
```
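If the numeric ordering does not matter, `sort -u inst.i.pos.txt > inst.i.pos.txt.clean` removes the duplicates in one step.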
### 3. run `create_negtive_examples.py`
We use the next file of the current file as the source of its negative examples, which is a reasonable choice.
Specifically, for each instruction in the current positive file, we randomly choose a line in its next file and select one of the two instructions in that line as its negative example.
Generates `inst.i.neg.txt`, located at `/home/ming/malware/data/elfasm_inst_pairs`.
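A hedged sketch of that sampling logic (the function name and the tab separator are assumptions, not the actual script):
``` python
import random

def create_negatives(pos_path, next_path, neg_path):
    # For every line of the current positive file, sample one line from
    # the *next* file and keep one of its two instructions as the negative.
    with open(next_path) as f:
        next_lines = [line.strip() for line in f if line.strip()]
    with open(pos_path) as fin, open(neg_path, "w") as fout:
        for _ in fin:
            pair = random.choice(next_lines).split("\t")  # assumes tab-separated instruction pairs
            fout.write(random.choice(pair) + "\n")
```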
### 4. run `merge_examples_to_json.py`
We dump the positive and negative examples, with their corresponding labels, into several JSON files.
Each JSON file contains 20 million lines of examples.
Generates `inst.i.{0-127}.json`, located at `/home/ming/malware/inst2vec_bert/data/asm_bert`.
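A sketch of the merge, under the assumption that each output line is one JSON record with a binary label (the field names are illustrative):
``` python
import json

EXAMPLES_PER_FILE = 20_000_000  # "20 million lines of examples" per file, as stated above

def merge_to_json(pos_path, neg_path, out_prefix):
    # Loads everything into memory for brevity; the real script may stream instead.
    records = []
    for path, label in ((pos_path, 1), (neg_path, 0)):
        with open(path) as f:
            records += [{"text": line.strip(), "label": label} for line in f]
    for i in range(0, len(records), EXAMPLES_PER_FILE):
        with open(f"{out_prefix}.{i // EXAMPLES_PER_FILE}.json", "w") as out:
            for rec in records[i : i + EXAMPLES_PER_FILE]:
                out.write(json.dumps(rec) + "\n")
```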
### 5. run `check_length.py`
We will specify the padding length when we use the tokenizer, via `tokenizer.enable_padding(..., length=...)`.
So we need to know the length of the longest sentence in the dataset.
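A minimal way to find that maximum, assuming one whitespace-tokenized example per JSON line (`enable_padding` is the real `tokenizers` API; the `"text"` field name is an assumption):
``` python
import glob
import json

max_len = 0
for path in glob.glob("inst.*.json"):  # file naming from this README
    with open(path) as f:
        for line in f:
            text = json.loads(line)["text"]  # assumed field name
            max_len = max(max_len, len(text.split()))
print(max_len)
# later, when tokenizing:
# tokenizer.enable_padding(pad_id=0, pad_token="[PAD]", length=max_len)
```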
<!-- ### 5. run `count_word_for_vocab.py`
Similarly, we also need to specify the size of the vocabulary when we train the tokenizer, `WordLevelTrainer(vocab_size=, ...)`.
So we need to know how many distinct words are in the dataset.
Something is wrong with `p.join()`, so I just set `vocab_size=2`. -->
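For reference, a single-process word count that sidesteps the multiprocessing `p.join()` issue mentioned above (illustrative only, since this step is currently commented out):
``` python
import glob
import json

vocab = set()
for path in glob.glob("inst.*.json"):  # file naming from this README
    with open(path) as f:
        for line in f:
            vocab.update(json.loads(line)["text"].split())  # assumed field name
print(len(vocab))  # candidate value for WordLevelTrainer(vocab_size=..., ...)
```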