<<<<<<< HEAD
Make sure you have ruby and python installed on your computer. Also make sure MUMmer is installed.
There are two parts to run the scraper. The first part is written in javascript. Navigate to http://www.ncbi.nlm.nih.gov/genome/browse/?report=5 and follow the instructions in scraper_and_data/scraper.js to download the initial .csv. Information was added to this scraped.csv by hand. This will be used for the next part of the scraper. The second part of the scraper uses ruby. Open ruby top level in the scraper_and_data/ folder. Load in scraper.rb and run
get_and_write_url()
After changing new_scraped.csv to scraped.csv run
change_urls()
After changing new_scraped.csv to scraped.csv run
get_genomes()
To find more information about changing the levels used and how many downloads to do at once look at scraper.rb for more options. NOTE: Do not download too many at once. Ther server will kick you out!
Before calculating the mums change the line 47 in cluster/calculate_mums.csv to the correct path to MUMmer on your computer. Open ruby top level in the cluster/ folder. Load in calculate\mums.csv and run
calculate_mums(["0","1"],"level_0_1")
This calculates all pairwise MUMs for all genomes in levels 0 and 1. You can now create distance matrices using the MUMs. Run
create_distance_matrix("Num_Mums","level_0_1")
You can also replace "Num_Mums" with "Total_mums_length". To run kmeans and get a plot for the clutsers run the following via the command line (while in the clitser/ folder)
python kmeans.py
To run the Feed Forward NN classifier on the training examples, simply untar the DNA_Neural_Classifier.tar file, cd into it, and run:
python NN-DNA-4Layer-MULTI.py
This will both train a neural net for each region as well as predict output for the testing input in the file. The main neural net code is contained in NN-DNA-4Layer-MULTI.py and a helper function is in GC_content.py (unused).
The function trainNN(NUM_TRAINING_EXAMPLES, dictname, filename) is the heart of the code, which runs the standard Feed-Forward Backpropagation algorithm with 2 hidden layers.
The function predict(interval_size, capped_size, dictname, filename) pushes a training example through the trained network.
The helper functions sample_nth() and convert_to_GC_placement() work to sample nucleotides at intervals, and convert them to 0s and 1s, as explained in the writeup.
The basic structure of the NN was taken from: http://iamtrask.github.io/2015/07/12/basic-python-network/
=======
Neural DNA Learner
This work was done for a graduate project in Comp Genomics, aiming to train a simple neural network to recognize location based on GC content.