
# malware_ngram_classif
This project demonstrates malware analysis by generating n-grams from
executable file opcodes and using ML to train a model that identifies
whether a given file is malware or benign.
The Python scripts in the src directory are written to extract opcodes from
32-bit Windows PE files; however, they can be modified to work with other
executable formats such as Linux ELF or Android APK files.

Once the opcode sequence has been extracted from a dataset, n-gram sequences can be
generated for any value of n, along with their frequencies. An appropriate
feature selection technique can then be used to select a list of n-grams as the features
on which a model is trained to classify a file as malware or benign.
The sample directory has example results for a dataset consisting of 1700 benign
and malware 32-bit PE files.
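
The n-gram generation described above can be illustrated with a minimal sketch. It is not the code in ngram.py; the function name `count_ngrams` and the sample opcode list are hypothetical and only show the sliding-window idea, assuming one file's opcodes are available as a list of strings.

```python
from collections import Counter


def count_ngrams(opcodes, n):
    """Count every contiguous run of n opcodes in one opcode sequence."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Slide a window of width n over the sequence; each window is one n-gram.
    windows = (tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1))
    return Counter(windows)


# Hypothetical opcode sequence, for illustration only.
opcodes = ["push", "mov", "push", "mov", "call", "ret"]
bigrams = count_ngrams(opcodes, 2)
print(bigrams.most_common(1))  # [(('push', 'mov'), 2)]
```

A sequence of length m yields m - n + 1 n-grams, so a sequence shorter than n simply produces an empty counter.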

The broad steps to run the scripts are as follows.

1. Build the corpus. Given a dataset of executable files, create the corpus.
   Create two separate folders for malware and benign files and run the Python
   script to create the corpus CSV file in the current directory.
   Usage: python3 buildCorpus.py <path-to-dataset> <class-of-dataset>

   Example:
   python3 ./buildCorpus.py ./RawData/benign benign

   Note: Ensure the corpus.csv file does not have any 0-sized entries.
   Remove them manually if there are any.

2. Create the n-grams from the corpus files using ngram.py. This script generates
   the n-grams from the corpus for the given value of n and also creates a scatter plot of
   all n-grams generated for the given corpus files, with their frequencies.
   Repeat this for every value of n for which you want n-grams. Note that higher
   values of n are CPU intensive and can take hours if your dataset is large.
   
   Usage: python3 ngram.py <malware-corpus-csv> <benign-corpus-csv> <value-n>

   Example: python3 ngram.py malware_malware.corpus.csv benign_benign.corpus.csv 3
   This creates two files: 3gram.csv and 3gram.html

3. Extract a feature list based on a threshold filter using the command in selFeatures.bash.
   Depending on your experiment, combine all n-grams or try different combinations to create a
   feature list.
   
   Example: sort -t"," -k2 -n 3gram.csv | awk -F,  '$2>1000 {print $1}' > featureLst
   Note: Remove the first line from featureLst, which is the CSV header.

4. Create the ML input by extracting the frequencies of the features from the corpus files using
   extractFeatures.py. Create a tmp folder and move the benign and malware corpus files created
   in Step 1 to this tmp folder. Then use the script to create the ML input CSV file for the
   featureLst file created in Step 3.
   
   Usage: python3 extractFeatures.py <path-containing-both-malware-benign-corpus-csv> <feature-list-file>

   Example: python3 extractFeatures.py tmp/ featureLst

5. Run classifiers on the ml_input.csv file. You can use Weka or any other ML tool to run
   your favorite classification algorithms. Additionally, you can use classifiers.py in the src directory
   to run Decision Tree, KNN, XGBoost and Random Forest classifiers.
   
   Usage: python3 classifiers.py <ml-input.csv> <classifier-type>

   Example: python3 classifiers.py ml_input.csv 5
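
The idea behind Steps 3 and 4 can be sketched in plain Python. This is only an illustration and does not reproduce selFeatures.bash or extractFeatures.py: the function names, the in-memory data layout and the sample counts are all hypothetical, and the layout of the corpus CSV files is not assumed here. The threshold test mirrors the `$2>1000` filter in the awk example.

```python
from collections import Counter


def select_features(ngram_counts, threshold):
    """Keep the n-grams whose overall frequency exceeds the threshold."""
    return sorted(g for g, count in ngram_counts.items() if count > threshold)


def feature_vector(opcodes, features):
    """Count how often each selected n-gram occurs in one file's opcodes."""
    counts = Counter()
    # Features may mix several values of n, so count once per distinct length.
    for n in {len(f) for f in features}:
        counts.update(
            tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)
        )
    # One column per feature, in the same order as the feature list.
    return [counts[f] for f in features]


# Hypothetical n-gram frequencies across a dataset, for illustration only.
corpus_counts = {
    ("push", "mov"): 1500,
    ("mov", "call"): 900,
    ("call", "ret"): 1200,
}
features = select_features(corpus_counts, 1000)

# Hypothetical opcode sequence of a single file.
file_opcodes = ["push", "mov", "call", "ret", "push", "mov"]
print(features)                                # [('call', 'ret'), ('push', 'mov')]
print(feature_vector(file_opcodes, features))  # [1, 2]
```

Each file then contributes one such row of frequencies, plus its class label, to the ML input.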
First commit: 2024-02-27 17:30:16 +08:00