2024-02-27 17:30:16 +08:00

malware_ngram_classif

This project demonstrates malware analysis by generating n-grams from the opcodes of executable files and using machine learning to train a model that identifies whether a given file is malware or benign. The Python scripts in the src directory are written to extract opcodes from 32-bit Windows PE files; however, they can be modified to handle other executable formats such as Linux ELF or Android APK files.

Once the opcode sequence has been extracted from a dataset, n-gram sequences and their frequencies can be generated for any value of n. An appropriate feature selection technique can then be used to pick a list of n-grams as features for training a model that classifies a file as malware or benign. The sample directory contains example results for a dataset of 1700 benign and malware 32-bit PE files.
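
As a minimal sketch of the n-gram step, assuming the opcode sequence of a file is already available as a list of mnemonic strings: the `count_ngrams` helper, the space-joined key format, and the sample opcodes are illustrative assumptions and are not taken from ngram.py.

```python
from collections import Counter


def count_ngrams(opcodes, n):
    """Count every contiguous run of n opcodes in the sequence."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Slide a window of width n over the sequence and join each window
    # into a single space-separated key (an assumed key format).
    windows = (" ".join(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1))
    return Counter(windows)


# Hypothetical opcode sequence extracted from one file.
sample = ["push", "mov", "call", "push", "mov", "call", "ret"]
bigrams = count_ngrams(sample, 2)
print(bigrams.most_common(3))
```

A sequence of length L yields L - n + 1 windows, so the number of distinct n-grams (and the memory needed to count them) tends to grow quickly as n increases.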

The broad steps to run the scripts are as follows.

  1. Build the corpus. Given a dataset of executable files, create two separate folders for the malware and benign files, then run the Python script to create the corpus CSV file in the current directory.

    Usage: python3 buildCorpus.py

    Example: python3 ./buildCorpus.py ./RawData/benign benign

    Note: Ensure that corpus.csv does not contain any zero-size entries. Remove them manually if there are any.

  2. Create the n-grams from the corpus files using ngram.py. This script generates the n-grams from the corpus for the given value of n and also creates a scatter plot of all n-grams generated for the given corpus files with their frequencies. Repeat this for every value of n you want n-grams for. Note that higher values of n are CPU-intensive and can take hours if your dataset is large.

    Usage: python3 ngram.py

    Example: python3 ngram.py malware_malware.corpus.csv benign_benign.corpus.csv 3

    This creates two files: 3gram.csv and 3gram.html.

  3. Extract a feature list based on a threshold filter using the command in selFeatures.bash. Depending on your experiment, join all n-grams or try different combinations to create a feature list.

    Example: sort -t"," -k2 -n 3gram.csv | awk -F, '$2>1000 {print $1}' > featureLst

    Note: Remove the first line from featureLst, which is the CSV header.

  4. Create the ML input by extracting the frequencies of the features from the corpus files using extractFeatures.py. Create a tmp folder and move the benign and malware corpus files created in Step 1 into it. Then use the script to create the ML input CSV file for the featureLst file created in Step 3.

    Usage: python3 extractFeatures.py

    Example: python3 extractFeatures.py tmp/ featureLst

  5. Run classifiers on the ml_input.csv file. You can use Weka or any other ML tool to run your favorite classification algorithms. You can also use classifiers.py in the src directory to run Decision Tree, KNN, XGBoost, and Random Forest classifiers.

    Usage: python3 classifiers.py <ml-input.csv>

    Example: python3 classifiers.py ml_input.csv 5
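
To make Steps 3 and 4 more concrete, here is a rough Python sketch of threshold-based feature selection followed by building one ML input row. It is not the code in selFeatures.bash or extractFeatures.py. The two-column CSV layout (n-gram, frequency) with a header row, the space-joined n-gram keys, and the helper names `select_features` and `feature_vector` are all assumptions made for illustration.

```python
import csv
import io
from collections import Counter


def select_features(ngram_csv, threshold):
    """Return the n-grams whose frequency exceeds the threshold."""
    reader = csv.reader(ngram_csv)
    next(reader, None)  # skip the CSV header so it never reaches the feature list
    return [row[0] for row in reader if len(row) >= 2 and int(row[1]) > threshold]


def feature_vector(opcodes, features, n):
    """Frequency of each selected n-gram in one file's opcode sequence."""
    counts = Counter(
        " ".join(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)
    )
    # Counter returns 0 for n-grams that never occur in this file.
    return [counts[f] for f in features]


# Hypothetical stand-in for 3gram.csv (assumed layout: ngram,frequency).
ngram_csv = io.StringIO(
    "ngram,frequency\n"
    "push mov call,1500\n"
    "mov call ret,12\n"
    "mov call push,2300\n"
)
features = select_features(ngram_csv, 1000)

# Hypothetical opcode sequence for one benign file.
opcodes = ["push", "mov", "call", "push", "mov", "call", "ret"]
row = feature_vector(opcodes, features, 3) + ["benign"]
print(features)
print(row)
```

Each file contributes one such row (feature frequencies followed by its label), which is the general shape most classification tools expect as input. Skipping the header inside the reader is one way to avoid stripping it by hand.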