2024-02-27 17:30:16 +08:00
# malware_ngram_classif
This project demonstrates malware analysis by generating n-grams from
executable file opcodes and using ML to train a model that identifies
whether a given file is malware or benign.
The Python scripts in the src directory are written to extract opcodes from
32-bit Windows PE files; however, they can be modified to work with other
executable formats such as Linux ELF or Android APK files.
Once the opcode sequence has been extracted from a dataset, n-gram sequences can be
generated for any value of n, along with their frequencies. An appropriate
feature selection technique can then be used to select a list of n-grams as the
features on which a model is trained to classify a file as malware or benign.
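
As a minimal sketch of the n-gram step, assuming each file's opcodes are already available as a list of mnemonic strings, the counting could look like this. The function name `count_ngrams` and the sample opcodes are illustrative only and are not taken from ngram.py.

```python
from collections import Counter


def count_ngrams(opcodes, n):
    """Count every contiguous run of n opcodes in one opcode sequence."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # Slide a window of width n over the sequence; each window is one n-gram.
    windows = (tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1))
    return Counter(windows)


# Hypothetical opcode sequence for a single file (not real sample data).
sample = ["push", "mov", "call", "push", "mov", "call", "ret"]
trigram_counts = count_ngrams(sample, 3)
```

Adding together the counters of all files in a class would give the per-class n-gram frequencies. A sequence shorter than n simply yields an empty counter.
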
The sample directory has example results for a dataset consisting of 1700 benign
and malware 32-bit PE files.
The broad steps to run the scripts are as follows.
1. Build the corpus. Given a dataset of executable files, create the corpus.
Create two separate folders for malware and benign files and run the Python
script on each folder to create the corpus csv file in the current directory.
Usage: python3 buildCorpus.py <path-to-dataset> <class-of-dataset>
Example:
python3 ./buildCorpus.py ./RawData/benign benign
Note: Ensure that corpus.csv does not have any 0-sized entries.
Manually remove them if there are any.
2. Create the n-grams from the corpus files using ngram.py. This script generates
the n-grams from the corpus for the given value of n and also creates a scatter plot of
all n-grams generated for the given corpus files with their frequencies.
Repeat this for every value of n for which you want to generate n-grams. Note that
higher values of n are CPU intensive and can take hours if your dataset is large.
Usage: python3 ngram.py <malware-corpus-csv> <benign-corpus-csv> <value-n>
Example: python3 ngram.py malware_malware.corpus.csv benign_benign.corpus.csv 3
This creates 2 files: 3gram.csv and 3gram.html
3. Extract the feature list based on a threshold filter using the command in selFeatures.bash.
Depending on your experiment, join all n-grams or try different permutations to create a
feature list.
Example: sort -t"," -k2 -n 3gram.csv | awk -F, '$2>1000 {print $1}' > featureLst
Note: Remove the first line from featureLst, which is the csv header.
4. Create the ML input by extracting the frequencies of the features from the corpus files using
extractFeatures.py. Create a tmp folder and move the benign and malware corpus files created
in Step 1 to it. Then use the script to create the ML input csv file for the
featureLst file created in Step 3.
Usage: python3 extractFeatures.py <path-containing-both-malware-benign-corpus-csv> <feature-list-file>
Example: python3 extractFeatures.py tmp/ featureLst
5. Run classifiers on the ml_input.csv file. You can use Weka or any other ML tool to run
your favorite classification algorithms. Alternatively, use classifiers.py in the src directory
to run Decision Tree, KNN, XGBoost and Random Forest classifiers.
Usage: python3 classifiers.py <ml-input.csv> <classifier-type>
Example: python3 classifiers.py ml_input.csv 5
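
The threshold filter of Step 3 and the frequency lookup of Step 4 can also be sketched in plain Python. This is an illustration under assumptions, not the logic of selFeatures.bash or extractFeatures.py: it assumes the n-gram csv has a header row followed by rows with the n-gram in column 1 and its frequency in column 2, and that n-grams contain no commas. The function names, column names and demo data are hypothetical.

```python
import csv
import io


def select_features(ngram_csv, threshold):
    """Return n-grams whose frequency (column 2) exceeds the threshold."""
    reader = csv.reader(ngram_csv)
    next(reader, None)  # skip the csv header row
    return [row[0] for row in reader if row and int(row[1]) > threshold]


def feature_vector(ngram_counts, features):
    """Look up each selected n-gram's count for one file (0 if absent)."""
    return [ngram_counts.get(f, 0) for f in features]


# Hypothetical stand-in for 3gram.csv; the column names and n-gram
# formatting are assumptions, not the actual output of ngram.py.
demo_csv = io.StringIO(
    "ngram,frequency\n"
    "push mov call,1520\n"
    "mov call ret,40\n"
    "xor xor jmp,2300\n"
)
features = select_features(demo_csv, threshold=1000)
row = feature_vector({"push mov call": 7}, features)
```

Skipping the header inside the reader avoids the manual header removal needed with the sort/awk pipeline. One `feature_vector` row per file, plus a class label, would form the classifier input.
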