SEQ03 redux

testing the fall 2018 sequence processing method on old samples

Because we are now using a targeted set of ~1020 contigs, we can trim, map and call SNPs on individual sequencing runs and delete the intermediate files when done.

Prepare the directories for sequence storage

In [ ]:
# !mkdir /local/shared/pinsky_lab/sequencing/hiseq_2014_08_07_SEQ03/

Go to the Princeton webpage to retrieve the seq data - took about 5 minutes for SEQ18 - 576 samples

https://htseq.princeton.edu/cgi-bin/login.pl?redirect_url=https://htseq.princeton.edu/cgi-bin/dashboard.pl done

  • Click on the sequencing run of interest in the box on the left that says “Recently Entered Samples"
  • In the box titled Sample Provenance, click on the link following "This library was utilized within the following output(s):” - repeat for each lane
  • In the “Data and Statistics” box, in the bottom right corner is a green button that says “Download"
  • click wget and then the copy button to get all files. This will copy the wget commands for all files.
  • paste in amphiprion in the directory you made in the previous step

Make directories in personal workspace - 4 pools

In [3]:
%%bash
cd ~/02-apcl-ddocent
# mkdir 03seq
cd ./03seq/ 
# mkdir logs samples scripts
mkdir bcsplit 01Pool 02Pool 03Pool 04Pool 
cd bcsplit/
mkdir lane1 lane2

Move index files and names files from laptop to amphiprion with Fetch

see seq_proc.Rmd notebook in RStudio to generate index and names files for large batches of pools/samples

Copy barcodes files into each of the logs folders

In [4]:
%%bash
cd ~/02-apcl-ddocent
# cp 16seq/logs/barcodes 03seq/logs/

Run barcode splitter in lane1 folder and lane 2 folder - takes about 8 hours for 192 samples,

started 4:12pm on Monday, still running at 11pm, not sure if it finished. Retry through terminal at 8:42am Tuesday - done by 10am on Wednesday (didn't check sooner)

In [ ]:
cd ~/02-apcl-ddocent/03seq/bcsplit/lane1
nohup /local/home/michelles/14_programs/paired_sequence_utils/barcode_splitter.py --bcfile ../../logs/index-seq03 --idxread 2 --suffix .fastq.gz /local/shared/pinsky_lab/sequencing/hiseq_2014_08_07_SEQ03/clownfish-ddradseq-seq03-for-222-cycles-ha1wgadxx_1_read_1_passed_filter.fastq.gz /local/shared/pinsky_lab/sequencing/hiseq_2014_08_07_SEQ03/clownfish-ddradseq-seq03-for-222-cycles-ha1wgadxx_1_read_2_index_read_passed_filter.fastq.gz &
cd ~/02-apcl-ddocent/03seq/bcsplit/lane2
nohup /local/home/michelles/14_programs/paired_sequence_utils/barcode_splitter.py --bcfile ../../logs/index-seq03 --idxread 2 --suffix .fastq.gz /local/shared/pinsky_lab/sequencing/hiseq_2014_08_07_SEQ03/clownfish-ddradseq-seq03-for-222-cycles-ha1wgadxx_2_read_1_passed_filter.fastq.gz /local/shared/pinsky_lab/sequencing/hiseq_2014_08_07_SEQ03/clownfish-ddradseq-seq03-for-222-cycles-ha1wgadxx_2_read_2_index_read_passed_filter.fastq.gz &

Check the output statistics in the nohup and move to logs with a rename

In [1]:
!head ~/02-apcl-ddocent/03seq/bcsplit/lane1/nohup.out
line: P012      ATCACG
line: P013      TGACCA
line: P014      CAGATC
line: P015      TAGCTT
P012    ATCACG  28126167        21.52%
P013    TGACCA  31264992        23.92%
P014    CAGATC  32478502        24.85%
P015    TAGCTT  36411697        27.86%
unmatched       None    2426479 1.86%
(END) helles/02-apcl-ddocent/03seq/bcsplit/lane1/nohup.out (END) 
In [2]:
!mv ~/02-apcl-ddocent/03seq/bcsplit/lane1/nohup.out ~/02-apcl-ddocent/03seq/logs/lane1.out
In [3]:
!head ~/02-apcl-ddocent/03seq/bcsplit/lane2/nohup.out
line: P012      ATCACG
line: P013      TGACCA
line: P014      CAGATC
line: P015      TAGCTT
P012    ATCACG  27611210        21.50%
P013    TGACCA  30622380        23.85%
P014    CAGATC  31834877        24.79%
P015    TAGCTT  35645407        27.76%
unmatched       None    2689386 2.09%
(END) helles/02-apcl-ddocent/03seq/bcsplit/lane2/nohup.out (END) 
In [4]:
!mv ~/02-apcl-ddocent/03seq/bcsplit/lane2/nohup.out ~/02-apcl-ddocent/03seq/logs/lane2.out

Concatenate the results in the bcsplit folder - took about 10 minutes

For large batches, generated the command lines in a script in the seq_proc.Rmd notebook and output the lines into files called SEQXX_cat_all.sh to be copy and pasted here

In [6]:
%%bash
cd ~/02-apcl-ddocent/03seq/bcsplit/
cat ./lane1/P012-read-1.fastq.gz ./lane2/P012-read-1.fastq.gz > ../01Pool/P012.fastq.gz
cat ./lane1/P013-read-1.fastq.gz ./lane2/P013-read-1.fastq.gz > ../02Pool/P013.fastq.gz
cat ./lane1/P014-read-1.fastq.gz ./lane2/P014-read-1.fastq.gz > ../03Pool/P014.fastq.gz
cat ./lane1/P015-read-1.fastq.gz ./lane2/P015-read-1.fastq.gz > ../04Pool/P015.fastq.gz

Run process radtags with script - takes ~2 hours for 192 samples

  • Had to nano the script to point to the correct place and changed my email to gmail so I will get it on my phone (if it sends)
  • If this notebook closes, these processes stop - ended up pasting this into a terminal window because I was afraid the network would drop and the process would stop

  • Started at 1:24pm on Wednesday

  • Finished between 2:55-3:15 (still waiting for pool 15)
In [7]:
%%bash
cd ~/02-apcl-ddocent/03seq/01Pool/
nohup ../scripts/12process.sh &
cd ~/02-apcl-ddocent/03seq/02Pool/
nohup ../scripts/13process.sh &
cd ~/02-apcl-ddocent/03seq/03Pool/
nohup ../scripts/14process.sh &
cd ~/02-apcl-ddocent/03seq/04Pool/
nohup ../scripts/15process.sh &
Null message body; hope that's ok
Null message body; hope that's ok
Null message body; hope that's ok
Null message body; hope that's ok
Using Phred+Using Phred+33 encoding for quality scores.
33Filtering reads for adapter sequence:
 encoding for quality scores.
  Filtering reads for adapter sequence:
ACACTCTTTCCCTACACGACGCTCTTCCGATCT  
ACACTCTTTCCCTACACGACGCTCTTCCGATCT
        11 mismatches allowed to adapter sequence.
 mismatches allowed to adapter sequence.
Found Found 11 input file(s).
 input file(s).
Searching for single-end, inlined barcodes.
Searching for single-end, inlined barcodes.
Error opening barcode file 'Error opening barcode file '../logs/barcodes../logs/barcodes'
'
Using Phred+Using Phred+3333 encoding for quality scores.
 encoding for quality scores.
Filtering reads for adapter sequence:
Filtering reads for adapter sequence:
    ACACTCTTTCCCTACACGACGCTCTTCCGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT

        11 mismatches allowed to adapter sequence.
 mismatches allowed to adapter sequence.
Found Found 11 input file(s).
 input file(s).
Searching for single-end, inlined barcodes.
Searching for single-end, inlined barcodes.
Error opening barcode file 'Error opening barcode file '../logs/barcodes../logs/barcodes'
'
mv: mv: cannot stat `process_radtags.log'cannot stat `process_radtags.log'mv: mv: cannot stat `process_radtags.log'cannot stat `process_radtags.log': No such file or directory
: No such file or directory
: No such file or directory: No such file or directory

cat: ./logs/13process.outcat: ./logs/12process.out: No such file or directory
cat: : No such file or directory
./logs/14process.outcat: ./logs/15process.out: No such file or directory
: No such file or directory

Move the process_radtags logs to the logs folder - wrote command lines for large batches in seq_proc.Rmd notebook and saved the lines in files called SEQXX-move_radlogs.sh

The script did this for us so we don't have to

In [ ]:
#%%bash
#mv 01Pool/process_radtags.log ./logs/processP012.log
#mv 02Pool/process_radtags.log ./logs/processP013.log
#mv 03Pool/process_radtags.log ./logs/processP014.log
#mv 04Pool/process_radtags.log ./logs/processP015.log

Run the readprocesslog.py script to convert the process log into a tsv that can be imported into the database through R

In [3]:
%%bash
cd ~/02-apcl-ddocent/03seq/01Pool/
readprocesslog.py ../logs/12process.out
cd ~/02-apcl-ddocent/03seq/02Pool/
readprocesslog.py ../logs/13process.out
cd ~/02-apcl-ddocent/03seq/03Pool/
readprocesslog.py ../logs/14process.out
cd ~/02-apcl-ddocent/03seq/04Pool/
readprocesslog.py ../logs/15process.out
AAACAC	645918	22882	95845	524382
AAACGA	989633	31015	84631	869313
AAAGTC	671161	19191	78400	570435
AACGGT	1314761	38271	110639	1159773
AACTTC	1526224	37867	140843	1340611
TGCTCA	1667170	40528	137172	1481598
AAGAAC	1214826	31020	115257	1062716
AATGTG	919153	28130	74837	811917
ACATGT	2096491	56096	185558	1845189
ACCAAA	742841	25082	66677	647685
ACGATA	1148387	50136	94122	998916
ACGTTT	681775	27802	65212	585713
ACTAGG	907035	47081	77454	778419
ACTCCA	1008071	33461	90156	879684
AGACTC	1199296	31286	111345	1051272
AGCATT	9424	7066	568	1782
AGGAGA	770977	31419	66245	669647
AGTAAG	994715	45627	80600	864106
AGTCCT	1416998	46975	121301	1242049
AGTTAC	1400756	33845	126077	1234531
ATAACC	1208097	31824	118030	1052613
ATCGCA	1772897	44046	160360	1560410
ATCTCG	1118391	39992	106258	966979
ATGGAG	958936	44893	89551	820163
ATGTCC	1187240	30292	115583	1035897
CAATTG	1831634	472974	126981	1225191
CAGAGT	2081532	73189	171990	1826597
CATCAG	1409020	68794	116112	1217667
CATCTC	1882276	48339	177101	1648262
CCACTT	1263787	48566	112161	1097240
CCCATA	1391400	63546	123319	1197842
CCTGAA	1817905	76019	186255	1548212
CGAAAC	2073061	51603	183852	1828484
CGAATG	732236	46826	61109	621001
GACGTT	1093496	41170	86826	960204
GATACA	927039	54902	75394	792640
GCAGAA	1066763	33865	90277	937868
GGGATA	477523	11520	39830	423903
GGTGAA	846487	37338	69993	735443
GTAGCT	813862	24766	68669	716573
GTCTAT	885181	146366	65804	669330
GTTCAG	754735	33546	64601	653005
TAAGAC	996074	30372	98341	862904
TACCAG	1182851	69563	107273	1000788
TCAATC	1094934	30754	97674	961299
TCCAAA	305914	9160	25942	269326
TCTGCT	922816	26233	81042	811297
TCTTAG	1024449	36699	85472	897308
AAACAC	1450003	35461	124665	1282566
AAACGA	1561	183	268	1103
AAAGTC	859	129	445	284
AACGGT	868	207	226	434
AACTTC	1716531	44387	163839	1499742
TGCTCA	1278347	63613	112609	1096165
AAGAAC	953	261	393	299
AATGTG	2231	1328	489	411
ACATGT	199565	6169	16437	175942
ACCAAA	2102748	51292	175220	1865987
ACGATA	1964946	74020	149167	1732123
ACGTTT	43798	5335	3570	34727
ACTAGG	1033137	46258	75808	906139
ACTCCA	1410050	42563	113306	1247115
AGACTC	973958	41787	90371	836969
AGCATT	887497	100338	63131	719961
AGGAGA	946601	49403	70861	821819
AGTAAG	1006992	82078	73698	846485
AGTCCT	832588	39542	65988	723004
AGTTAC	1558113	40526	130410	1379444
ATAACC	1934555	48521	184532	1691954
ATCGCA	2063361	51926	170875	1830485
ATCTCG	902185	132498	64191	701517
ATGGAG	1027033	58861	81383	881990
ATGTCC	653118	30534	65884	553694
CAATTG	1649428	559956	93033	991043
CAGAGT	584206	24377	47117	509887
CATCAG	2049473	87600	165703	1786499
CATCTC	1615522	39224	132521	1435762
CCACTT	1541717	251764	103138	1180153
CCCATA	89002	4013	7439	77122
CCTGAA	1098067	105548	87070	900822
CGAAAC	1251808	61624	104742	1079793
CGAATG	1607257	75124	128349	1395996
GACGTT	1796245	53712	143865	1589750
GATACA	1544439	61442	120209	1354920
GCAGAA	2051587	51493	171752	1817850
GGGATA	2093871	162022	152288	1769620
GGTGAA	2573311	132424	199961	2228280
GTAGCT	2119111	209576	153540	1746487
GTCTAT	2318104	136456	190693	1979438
GTTCAG	1515282	118494	114277	1275427
TAAGAC	1355649	40310	123521	1185237
TACCAG	2280981	106734	187555	1975863
TCAATC	1618097	43176	147335	1419663
TCCAAA	1603312	132200	125933	1337669
TCTGCT	1024742	61368	80880	877546
TCTTAG	1127657	149053	84279	889532
AAACAC	131247	55232	15465	60152
AAACGA	973077	116711	71230	780800
AAAGTC	1075395	83237	98501	888998
AACGGT	1352994	48851	136392	1161351
AACTTC	1118157	39503	138316	935233
TGCTCA	5000623	353725	391127	4233059
AAGAAC	1508365	59344	188744	1253722
AATGTG	1158206	95803	94470	963064
ACATGT	307597	42576	24170	239649
ACCAAA	1252390	50844	121196	1074424
ACGATA	1567578	64955	144160	1351297
ACGTTT	48298	9263	3907	34939
ACTAGG	1076057	99474	85022	886601
ACTCCA	1148966	87246	94563	962132
AGACTC	910604	47287	92944	766427
AGCATT	1262277	95292	117398	1043900
AGGAGA	1187269	38518	115393	1027899
AGTAAG	91835	7295	7351	76708
AGTCCT	1132035	52244	91982	982690
AGTTAC	1120770	56050	106582	953151
ATAACC	1165633	67724	116614	976125
ATCGCA	1221018	47890	121140	1046410
ATCTCG	1383863	59213	133485	1184914
ATGGAG	9460545	483285	721605	8213036
ATGTCC	1035847	54768	100950	875433
CAATTG	1698531	179794	157344	1353985
CAGAGT	1353200	97430	129269	1120255
CATCAG	1891222	97537	184071	1601122
CATCTC	160248	15473	17917	126176
CCACTT	1192013	125628	110573	950601
CCCATA	1306591	113552	108383	1078903
CCTGAA	224118	11414	23821	187898
CGAAAC	768860	51737	75197	638726
CGAATG	258224	6415	25121	225471
GACGTT	1404213	88392	132515	1176941
GATACA	1223428	100341	99988	1017642
GCAGAA	1097559	112341	88943	891388
GGGATA	921522	66833	76704	773663
GGTGAA	1066019	80526	93195	887373
GTAGCT	1407149	119813	125852	1155303
GTCTAT	873769	73123	77020	719683
GTTCAG	1518853	97780	142270	1272067
TAAGAC	1554477	145463	126073	1276020
TACCAG	1270215	58277	128542	1077665
TCAATC	1163843	119846	97641	941212
TCCAAA	1192065	110911	111577	964419
TCTGCT	37620	10089	3370	24036
TCTTAG	1410524	114473	135399	1154375
AAACAC	35999	2741	4849	28231
AAACGA	857084	67199	62394	723516
AAAGTC	706033	49821	63520	589450
AACGGT	707034	52735	54317	596740
AACTTC	714973	54933	69548	587365
TGCTCA	5020	2272	1214	1526
AAGAAC	94430	8602	17706	67727
AATGTG	758750	52749	67887	634747
ACATGT	820390	61546	64276	690835
ACCAAA	1154896	30687	96461	1022296
ACGATA	1067291	57043	80632	924486
ACGTTT	1169117	65346	101214	997164
ACTAGG	1664680	56661	131397	1468684
ACTCCA	1374248	33332	109606	1224544
AGACTC	1502420	32028	129419	1333776
AGCATT	1353315	45231	107299	1193866
AGGAGA	1342116	36512	105661	1193183
AGTAAG	1445186	34462	119899	1283895
AGTCCT	1854755	49259	157860	1638661
AGTTAC	1905430	102598	146232	1647881
ATAACC	1693231	41677	156118	1487414
ATCGCA	1421295	33378	116484	1264626
ATCTCG	1232405	25678	106983	1093830
ATGGAG	1526688	49280	124749	1345237
ATGTCC	1378470	35800	114856	1221169
CAATTG	1125738	62494	90723	967372
CAGAGT	2890741	91403	235797	2549560
CATCAG	103641	12912	16839	73540
CATCTC	1312405	30134	114533	1161416
CCACTT	1634366	46834	130619	1449134
CCCATA	1993865	52529	160707	1770905
CCTGAA	2873871	75170	262052	2523937
CGAAAC	2095825	50589	193909	1842115
CGAATG	1537128	40141	131574	1358081
GACGTT	2853225	75485	218659	2544929
GATACA	2007506	62351	174413	1761079
GCAGAA	2036833	48120	159597	1818958
GGGATA	3444220	81923	254957	3090767
GGTGAA	892527	19950	66549	801643
GTAGCT	3128499	67583	254478	2790756
GTCTAT	2887651	68672	231054	2573629
GTTCAG	1897828	48047	156225	1684291
TAAGAC	1471347	36814	117808	1309411
TACCAG	1223042	41508	104823	1070762
TCAATC	4567	887	2533	1143
TCCAAA	1568254	44073	132039	1384557
TCTGCT	1928673	52905	160235	1706341
TCTTAG	1860937	56823	163079	1632255

Move the newly created tsvs to the laptop with fetch and import them into the database

Rename the sample files - used script in seq_proc.Rmd to generate mass command line text stored in SEQXX_all_rename.sh - this caused some problems because the copy and paste was off and it ended up deleting the APCL-analysis folder. Because of this I am changing the first cd to a full path

In [4]:
%%bash
cd ~/02-apcl-ddocent/03seq/
cd 01Pool/
sh rename.for.dDocent_se_gz ../logs/names_012.tsv
mv APCL* ../samples/
cd ../02Pool/
sh rename.for.dDocent_se_gz ../logs/names_013.tsv
mv APCL* ../samples/
cd ../03Pool/
sh rename.for.dDocent_se_gz ../logs/names_014.tsv
mv APCL* ../samples/
cd ../04Pool/
sh rename.for.dDocent_se_gz ../logs/names_015.tsv
mv APCL* ../samples/
pwd
APCL_L0302.F
TGCTCA
APCL_L0350.F
TGCTCA
APCL_L0397.F
TCTTAG
APCL_L0426.F
CATCTC
/local/home/michelles/02-apcl-ddocent/03seq/04Pool
In [5]:
!head /local/home/michelles/02-apcl-ddocent/03seq/logs/names_012.tsv
APCL_L0302.F	TGCTCA
APCL_L0301.F	TCTTAG
APCL_L0300.F	TCTGCT
APCL_L0298.F	TCAATC
APCL_L0297.F	TACCAG
APCL_L0296.F	TAAGAC
APCL_L0295.F	GTTCAG
APCL_L0294.F	GTCTAT
APCL_L0293.F	GTAGCT
APCL_L0292.F	GGTGAA

Once all of the samples have been moved, move the gz files to the logs and then you can delete the pool directories

In [7]:
!rm -r ~/02-apcl-ddocent/03seq/*Pool

Identify regenotyped samples, cat and move them.

All samples from SEQ03 were catted when we catted SEQ28.

Copy reference.fasta over from jonsfiles

In [8]:
% cp /local/home/michelles/02-apcl-ddocent/jonsfiles/reference.fasta /local/home/michelles/02-apcl-ddocent/03seq/samples

Trim, map , and call SNPs using dDocent

for this phase, dDocent will ask questions.

  • Are the number of samples correct?

yes

  • Maximum number of processors to use for this analysis

This depends on how many people are trying to use amphiprion right now. The trim and map section isn't too intensive so it should be ok to use alot. I used 30 for SEQ18 - 576 samples

30

  • Maximum memory

Again, consider who else is using the machine, for SEQ18 - 576 samples I used 150

150

  • Quality trim?

yes

  • Perform assembly?

no - this is for creating the reference originally

  • Map reads?

yes

  • Adjust default parameters

default match score is 1, default mismatch is 4, gap penalty is 6 - I used defaults for all

  • call SNPs

yes

  • enter email address to be notified when it is done running
In [ ]:
!dDocent
dDocent 2.5.5 

Contact jpuritz@uri.edu with any problems 

 
Checking for required software
/local/home/michelles/14_programs/dDocent/dDocent: line 84: [: 1.0-r82: integer expression expected

All required software is installed!

dDocent run started Wed Oct 3 17:11:27 EDT 2018 

0 individuals are detected. Is this correct? Enter yes or no and press [ENTER]

Combine TotalRawSNPs into master SNP doc for further analysis.