# malware_ngram_classif

This project demonstrates malware analysis by generating n-grams from
executable file opcodes and using machine learning to train a model that
identifies whether a given file is malware or benign.
The Python scripts in the src directory are written to extract opcodes from
32-bit Windows PE files; however, they can be modified to work with other
executable formats such as Linux ELF or Android APK files.

Once the opcode sequences have been extracted from a dataset, n-gram sequences
can be generated for any value of n, along with their frequencies. An appropriate
feature selection technique can then be used to select a list of n-grams as
features for training a model that classifies a file as malware or benign.
The sample directory has an example result for a dataset consisting of 1700 benign
and malware 32-bit PE files.

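To make the n-gram idea concrete, here is a minimal sketch of counting n-grams over a single opcode sequence with a sliding window. It is an illustration only, not the code in ngram.py: the function name `count_ngrams` and the sample opcodes are hypothetical, and the sequence is assumed to already be a list of opcode mnemonics.

```python
from collections import Counter


def count_ngrams(opcodes, n):
    """Count every run of n consecutive opcodes in one opcode sequence."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # Slide a window of width n over the sequence; each window is one n-gram.
    # A sequence shorter than n yields no windows and an empty Counter.
    windows = (tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1))
    return Counter(windows)


# Hypothetical opcode sequence, for illustration only.
sample_opcodes = ["push", "mov", "push", "mov", "call"]
bigrams = count_ngrams(sample_opcodes, 2)
```

Here `bigrams` maps each pair of adjacent opcodes to the number of times it occurs. Tuples are used as keys so that n-grams stay hashable and opcodes are never accidentally merged.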
The broad steps to run the scripts are as follows.

1. Build the corpus. Given a dataset of executable files, create the corpus.
   Create two separate folders for malware and benign files and run the Python
   script to create the corpus CSV file in the current directory.

   Usage: `python3 buildCorpus.py <path-to-dataset> <class-of-dataset>`

   Example:
   `python3 ./buildCorpus.py ./RawData/benign benign`

   Note: Ensure that the corpus CSV file does not contain any zero-size entries.
   Manually remove them if there are any.

2. Create the n-grams from the corpus files using ngram.py. This script generates
   the n-grams from the corpus for the given value of n and also creates a scatter
   plot of all n-grams generated for the given corpus files with their frequencies.
   Repeat this for every value of n for which you want to generate n-grams. Note that
   higher values of n are CPU-intensive and can take hours if your dataset is large.

   Usage: `python3 ngram.py <malware-corpus-csv> <benign-corpus-csv> <value-n>`

   Example: `python3 ngram.py malware_malware.corpus.csv benign_benign.corpus.csv 3`
   creates two files: 3gram.csv and 3gram.html

3. Extract the feature list based on a threshold filter using the command in selFeatures.bash.
   Depending on your experiment, join all n-grams or try different combinations to create a
   feature list.

   Example: `sort -t"," -k2 -n 3gram.csv | awk -F, '$2>1000 {print $1}' > featureLst`

   Note: remove the first line from featureLst, which is the CSV header.

4. Create the ML input by extracting the frequencies of the features from the corpus files using
   extractFeatures.py. Create a tmp folder and move the benign and malware corpus files created
   in Step 1 to this tmp folder. Then use the script to create the ML input CSV file for the
   featureLst file created in Step 3.

   Usage: `python3 extractFeatures.py <path-containing-both-malware-benign-corpus-csv> <feature-list-file>`

   Example: `python3 extractFeatures.py tmp/ featureLst`

5. Run classifiers on the ml_input.csv file. You can use Weka or any other ML tool to run
   your favorite classification algorithms. Additionally, you can use classifiers.py in the
   src directory to run the Decision Tree, KNN, XGBoost and Random Forest classifiers.

   Usage: `python3 classifiers.py <ml-input.csv> <classifier-type>`

   Example: `python3 classifiers.py ml_input.csv 5`

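For the note in Step 1 about zero-size entries, the manual check could be automated along these lines. The layout of the corpus CSV is not documented here, so this sketch assumes (hypothetically) that each row holds a file name followed by that file's opcodes in the last column; adjust the column index to match what buildCorpus.py actually writes.

```python
import csv
import io


def drop_empty_entries(reader):
    """Yield only rows whose last column (assumed to hold the opcodes) is non-empty."""
    for row in reader:
        if row and row[-1].strip():
            yield row


# Hypothetical corpus content: file name, then space-separated opcodes.
sample = io.StringIO("a.exe,push mov call\nb.exe,\nc.exe,mov ret\n")
kept = list(drop_empty_entries(csv.reader(sample)))
```

In this sample, the row for `b.exe` has no opcodes and is dropped, while the other two rows are kept unchanged.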
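The threshold filter in Step 3 can also be expressed in Python, which avoids having to delete the header line by hand. This is a sketch under assumptions, not a replacement for selFeatures.bash: it assumes the n-gram is in the first column and an integer frequency in the second, as the `awk` command implies, and the header names in the sample are hypothetical.

```python
import csv
import io


def select_features(rows, threshold):
    """Return the n-grams whose frequency (second column) exceeds the threshold."""
    selected = []
    for row in rows:
        try:
            freq = int(row[1])
        except (IndexError, ValueError):
            # Skips the header line and any malformed row.
            continue
        if freq > threshold:
            selected.append(row[0])
    return selected


# Hypothetical 3gram.csv content, for illustration only.
sample = io.StringIO("ngram,frequency\npush mov call,1500\nmov mov ret,40\n")
features = select_features(csv.reader(sample), 1000)
```

Unlike the shell pipeline, this version does not sort its output; the selected n-grams keep the order in which they appear in the input.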
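Step 4 turns each file into a row of feature frequencies. A minimal sketch of that idea follows; it is not the code in extractFeatures.py, and the function name `feature_vector`, the sample features and the trailing class label are all assumptions made for illustration.

```python
from collections import Counter


def feature_vector(opcodes, features):
    """Return the frequency of each selected n-gram in one file's opcode sequence."""
    counts = Counter()
    # Only count the n-gram lengths that actually occur in the feature list.
    for n in {len(feature) for feature in features}:
        for i in range(len(opcodes) - n + 1):
            counts[tuple(opcodes[i:i + n])] += 1
    # A Counter returns 0 for n-grams that never appear in this file.
    return [counts[feature] for feature in features]


# Hypothetical feature list mixing 2-grams and 3-grams.
features = [("push", "mov"), ("mov", "call", "ret")]
opcodes = ["push", "mov", "call", "ret", "push", "mov"]
row = feature_vector(opcodes, features) + ["malware"]
```

Each file yields one such row, with the columns in the same order as the feature list, so the rows can be stacked into a single table for the classifiers.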
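To show what one of the classifiers named in Step 5 does with such rows, here is a standard-library-only sketch of k-nearest neighbours (KNN). It is not how classifiers.py is implemented, which is not shown here; the training rows and labels are invented, and `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter


def knn_predict(train_rows, train_labels, query, k=3):
    """Label the query by majority vote among its k nearest training rows."""
    # Rank training rows by Euclidean distance to the query.
    ranked = sorted(
        zip(train_rows, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical feature-frequency rows and their class labels.
train_rows = [[9, 1], [8, 0], [7, 2], [1, 8], [0, 9], [2, 7]]
train_labels = ["malware", "malware", "malware", "benign", "benign", "benign"]
prediction = knn_predict(train_rows, train_labels, [8, 1], k=3)
```

For real experiments, a library implementation with a proper train/test split is preferable; this version only illustrates the voting rule on a toy dataset.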