A Jupyter notebook for processing the sequence files from Fall 2018.

Prepare the directories for sequence storage

In [1]:
%lsmagic
Out[1]:
Available line magics:
%alias  %alias_magic  %autocall  %automagic  %autosave  %bookmark  %cat  %cd  %clear  %colors  %config  %connect_info  %cp  %debug  %dhist  %dirs  %doctest_mode  %ed  %edit  %env  %gui  %hist  %history  %killbgscripts  %ldir  %less  %lf  %lk  %ll  %load  %load_ext  %loadpy  %logoff  %logon  %logstart  %logstate  %logstop  %ls  %lsmagic  %lx  %macro  %magic  %man  %matplotlib  %mkdir  %more  %mv  %notebook  %page  %pastebin  %pdb  %pdef  %pdoc  %pfile  %pinfo  %pinfo2  %popd  %pprint  %precision  %profile  %prun  %psearch  %psource  %pushd  %pwd  %pycat  %pylab  %qtconsole  %quickref  %recall  %rehashx  %reload_ext  %rep  %rerun  %reset  %reset_selective  %rm  %rmdir  %run  %save  %sc  %set_env  %store  %sx  %system  %tb  %time  %timeit  %unalias  %unload_ext  %who  %who_ls  %whos  %xdel  %xmode

Available cell magics:
%%!  %%HTML  %%SVG  %%bash  %%capture  %%debug  %%file  %%html  %%javascript  %%js  %%latex  %%markdown  %%perl  %%prun  %%pypy  %%python  %%python2  %%python3  %%ruby  %%script  %%sh  %%svg  %%sx  %%system  %%time  %%timeit  %%writefile

Automagic is ON, % prefix IS NOT needed for line magics.
In [ ]:
%%bash
mkdir /local/shared/pinsky_lab/sequencing/hiseq_2018_09_24_SEQ18
mkdir /local/shared/pinsky_lab/sequencing/hiseq_2018_09_24_SEQ19
mkdir /local/shared/pinsky_lab/sequencing/hiseq_2018_09_24_SEQ20
mkdir /local/shared/pinsky_lab/sequencing/hiseq_2018_09_24_SEQ21
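
A possible Python version of the mkdir lines above, as a minimal sketch. The function name make_run_dirs and its arguments are made up for illustration; only the base path and the directory names come from this notebook. Unlike plain mkdir, this does not fail if a directory already exists, so the cell can be rerun.

```python
from pathlib import Path


def make_run_dirs(base, runs, prefix="hiseq_2018_09_24_SEQ"):
    """Create one storage directory per sequencing run under base.

    make_run_dirs, base, runs and prefix are illustrative names (assumptions).
    """
    created = []
    for run in runs:
        run_dir = Path(base) / f"{prefix}{run}"
        # parents=True also builds missing parent folders (plain mkdir would not);
        # exist_ok=True makes a rerun harmless.
        run_dir.mkdir(parents=True, exist_ok=True)
        created.append(run_dir)
    return created


# Example call for the four runs above (commented out so nothing is created by accident):
# make_run_dirs("/local/shared/pinsky_lab/sequencing", [18, 19, 20, 21])
```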

Go to the Princeton webpage to retrieve the sequencing data. The download took about 5 minutes for SEQ18 (576 samples).

https://htseq.princeton.edu/cgi-bin/login.pl?redirect_url=https://htseq.princeton.edu/cgi-bin/dashboard.pl done

  • Click on the sequencing run of interest in the box on the left that says "Recently Entered Samples".
  • In the box titled "Sample Provenance", click on the link following "This library was utilized within the following output(s):" and repeat for each lane.
  • In the bottom right corner of the "Data and Statistics" box is a green button that says "Download".
  • Click wget and then the copy button. This copies the wget commands for all files.
  • Paste the commands in a terminal on amphiprion, in the directory you made in the previous step.

Make directories in personal workspace - 12 pools in each seq

In [ ]:
%%bash
cd ~/02-apcl-ddocent
mkdir 18seq 19seq 20seq 21seq
cd 18seq/
mkdir bcsplit 01Pool 02Pool 03Pool 04Pool 05Pool 06Pool 07Pool 08Pool 09Pool 10Pool 11Pool 12Pool logs samples scripts
cd bcsplit/
mkdir lane1 lane2
cd ../../19seq/
mkdir bcsplit 01Pool 02Pool 03Pool 04Pool 05Pool 06Pool 07Pool 08Pool 09Pool 10Pool 11Pool 12Pool logs samples scripts
cd bcsplit/
mkdir lane1 lane2
cd ../../20seq/
mkdir bcsplit 01Pool 02Pool 03Pool 04Pool 05Pool 06Pool 07Pool 08Pool 09Pool 10Pool 11Pool 12Pool logs samples scripts
cd bcsplit/
mkdir lane1 lane2
cd ../../21seq/
mkdir bcsplit 01Pool 02Pool 03Pool 04Pool 05Pool 06Pool 07Pool 08Pool 09Pool 10Pool 11Pool 12Pool logs samples scripts
cd bcsplit/
mkdir lane1 lane2
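
The same folder layout can be built with one loop instead of four copies of the same lines. This is only a sketch: make_workspace and the constant names are invented here, while the folder names (01Pool to 12Pool, logs, samples, scripts, bcsplit/lane1, bcsplit/lane2) are the ones used above.

```python
from pathlib import Path

# Folder names taken from the mkdir lines above; the variable names are illustrative.
POOLS = [f"{n:02d}Pool" for n in range(1, 13)]  # 01Pool ... 12Pool
EXTRA = ["logs", "samples", "scripts"]
LANES = ["lane1", "lane2"]


def make_workspace(root, runs):
    """Build <root>/<run>seq with pool, log, sample, script and bcsplit/lane folders."""
    root = Path(root).expanduser()  # lets "~/02-apcl-ddocent" be passed directly
    for run in runs:
        seq_dir = root / f"{run}seq"
        for name in POOLS + EXTRA:
            (seq_dir / name).mkdir(parents=True, exist_ok=True)
        for lane in LANES:
            (seq_dir / "bcsplit" / lane).mkdir(parents=True, exist_ok=True)
    return root


# Example call (commented out so nothing is created by accident):
# make_workspace("~/02-apcl-ddocent", [18, 19, 20, 21])
```

Because every path is built from root, there are no relative cd steps to get wrong.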

Move index files and names files from laptop to amphiprion with Fetch

See the seq_proc.Rmd notebook in RStudio to generate index and names files for large batches of pools/samples.

Copy barcodes files into each of the logs folders

In [ ]:
%%bash
cd ~/02-apcl-ddocent
cp 16seq/logs/barcodes 18seq/logs/ 
cp 16seq/logs/barcodes 19seq/logs 
cp 16seq/logs/barcodes 20seq/logs 
cp 16seq/logs/barcodes 21seq/logs
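
A small Python sketch of the same copy step, assuming the logs folders already exist. The function name copy_barcodes and its arguments are hypothetical; the file name barcodes and the NNseq/logs layout are from the cell above.

```python
import shutil
from pathlib import Path


def copy_barcodes(root, source_run, target_runs, filename="barcodes"):
    """Copy <root>/<source_run>seq/logs/<filename> into each target run's logs folder."""
    root = Path(root).expanduser()
    src = root / f"{source_run}seq" / "logs" / filename
    copied = []
    for run in target_runs:
        dest_dir = root / f"{run}seq" / "logs"
        # Stop early if a logs folder is missing, so a typo cannot create a stray file.
        if not dest_dir.is_dir():
            raise FileNotFoundError(f"missing logs folder: {dest_dir}")
        copied.append(Path(shutil.copy(src, dest_dir)))
    return copied


# Example call (commented out so nothing is copied by accident):
# copy_barcodes("~/02-apcl-ddocent", 16, [18, 19, 20, 21])
```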

SEQ18_APCL_MRS

Run the barcode splitter in the lane1 folder and the lane2 folder. It takes about 8 hours for 192 samples.

For SEQ18 (576 samples) it started at 10am on Monday and finished at about 7:30pm on Monday, so about 10 hours.

In [ ]:
%%bash
cd ~/02-apcl-ddocent/18seq/bcsplit/lane1
nohup barcode_splitter.py --bcfile ../../logs/index-seq18.tsv --idxread 2 --suffix .fastq.gz /local/shared/pinsky_lab/sequencing/hiseq_2018_09_24_SEQ18/18_APCL_SEQ-for-196-cycles-HLMHLBCX2_1_Read_1_passed_filter.fastq.gz /local/shared/pinsky_lab/sequencing/hiseq_2018_09_24_SEQ18/18_APCL_SEQ-for-196-cycles-HLMHLBCX2_1_Read_2_Index_Read_passed_filter.fastq.gz
cd ~/02-apcl-ddocent/18seq/bcsplit/lane2
nohup barcode_splitter.py --bcfile ../../logs/index-seq18.tsv --idxread 2 --suffix .fastq.gz /local/shared/pinsky_lab/sequencing/hiseq_2018_09_24_SEQ18/18_APCL_SEQ-for-196-cycles-HLMHLBCX2_2_Read_1_passed_filter.fastq.gz /local/shared/pinsky_lab/sequencing/hiseq_2018_09_24_SEQ18/18_APCL_SEQ-for-196-cycles-HLMHLBCX2_2_Read_2_Index_Read_passed_filter.fastq.gz

Took about 10 hours and there was no output text in nohup.out and no stats.
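
One way to avoid retyping the long file names for each lane is to build the command from the lane number. The sketch below uses only the flags and file-name pattern that appear in the commands above; the names splitter_command, launch, SEQ_DIR and PREFIX and the log file name are assumptions. Sending both stdout and stderr to a named log file might also help with the empty nohup.out, though I have not checked where barcode_splitter.py writes its stats.

```python
import subprocess
from pathlib import Path

# Values copied from the commands above; the variable names are illustrative.
SEQ_DIR = "/local/shared/pinsky_lab/sequencing/hiseq_2018_09_24_SEQ18"
PREFIX = "18_APCL_SEQ-for-196-cycles-HLMHLBCX2"


def splitter_command(lane, seq_dir=SEQ_DIR, prefix=PREFIX,
                     bcfile="../../logs/index-seq18.tsv"):
    """Return the barcode_splitter.py argument list for one lane (1 or 2)."""
    read1 = f"{seq_dir}/{prefix}_{lane}_Read_1_passed_filter.fastq.gz"
    index = f"{seq_dir}/{prefix}_{lane}_Read_2_Index_Read_passed_filter.fastq.gz"
    return ["barcode_splitter.py", "--bcfile", bcfile, "--idxread", "2",
            "--suffix", ".fastq.gz", read1, index]


def launch(lane, workdir, log_name="barcode_splitter.log"):
    """Start the splitter for one lane in workdir and return the running process."""
    with open(Path(workdir) / log_name, "w") as log:
        # start_new_session=True puts the job in its own session (POSIX only),
        # which is the closest simple stand-in for nohup used here.
        return subprocess.Popen(splitter_command(lane), cwd=workdir, stdout=log,
                                stderr=subprocess.STDOUT, start_new_session=True)


# Example calls (commented out; each job runs for hours):
# launch(1, Path("~/02-apcl-ddocent/18seq/bcsplit/lane1").expanduser())
# launch(2, Path("~/02-apcl-ddocent/18seq/bcsplit/lane2").expanduser())
```

Since Popen returns immediately, both lanes would run at the same time rather than one after the other.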

Concatenate the results in the bcsplit folder - takes about 1 minute

For large batches, I generated the command lines with a script in the seq_proc.Rmd notebook and wrote them to files called SEQXX_cat_all.sh, to be copied and pasted here.

For SEQ18 the pool order kept getting rearranged to P075, P076, P073, P074, so these samples might have been rearranged. I manually changed the output before pasting the lines as you see them below. I think everything is correct, because each index code was assigned to its correct pool and so anything with that index would have been put in the correct pool folder. If there are future problems, though, this is the most likely source.

In [ ]:
%%bash
cat ./lane1/P075-read-1.fastq.gz ./lane2/P075-read-1.fastq.gz > ../03Pool/P075.fastq.gz
cat ./lane1/P076-read-1.fastq.gz ./lane2/P076-read-1.fastq.gz > ../04Pool/P076.fastq.gz
cat ./lane1/P073-read-1.fastq.gz ./lane2/P073-read-1.fastq.gz > ../01Pool/P073.fastq.gz
cat ./lane1/P074-read-1.fastq.gz ./lane2/P074-read-1.fastq.gz > ../02Pool/P074.fastq.gz
cat ./lane1/P077-read-1.fastq.gz ./lane2/P077-read-1.fastq.gz > ../05Pool/P077.fastq.gz
cat ./lane1/P078-read-1.fastq.gz ./lane2/P078-read-1.fastq.gz > ../06Pool/P078.fastq.gz
cat ./lane1/P083-read-1.fastq.gz ./lane2/P083-read-1.fastq.gz > ../07Pool/P083.fastq.gz
cat ./lane1/P084-read-1.fastq.gz ./lane2/P084-read-1.fastq.gz > ../08Pool/P084.fastq.gz
cat ./lane1/P085-read-1.fastq.gz ./lane2/P085-read-1.fastq.gz > ../09Pool/P085.fastq.gz
cat ./lane1/P086-read-1.fastq.gz ./lane2/P086-read-1.fastq.gz > ../10Pool/P086.fastq.gz
cat ./lane1/P087-read-1.fastq.gz ./lane2/P087-read-1.fastq.gz > ../11Pool/P087.fastq.gz
cat ./lane1/P088-read-1.fastq.gz ./lane2/P088-read-1.fastq.gz > ../12Pool/P088.fastq.gz
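
The pool order problem described above comes from generating the lines in whatever order the pools happen to be sorted. A sketch of an alternative is to write the pool-to-folder assignment down once as a dictionary and derive every path from it. The mapping below is read off the cat lines in this cell; the names POOL_FOLDERS and concat_lanes are made up. Gzip files can be joined byte for byte, which is what cat does as well.

```python
import shutil
from pathlib import Path

# Pool ID -> pool folder, as used in the cat lines above.
POOL_FOLDERS = {
    "P073": "01Pool", "P074": "02Pool", "P075": "03Pool", "P076": "04Pool",
    "P077": "05Pool", "P078": "06Pool", "P083": "07Pool", "P084": "08Pool",
    "P085": "09Pool", "P086": "10Pool", "P087": "11Pool", "P088": "12Pool",
}


def concat_lanes(bcsplit_dir, pool_folders=POOL_FOLDERS, lanes=("lane1", "lane2")):
    """Join each pool's per-lane read-1 files into ../<folder>/<pool>.fastq.gz."""
    bcsplit_dir = Path(bcsplit_dir)
    written = []
    for pool, folder in sorted(pool_folders.items()):
        out_path = bcsplit_dir.parent / folder / f"{pool}.fastq.gz"
        with open(out_path, "wb") as out:
            for lane in lanes:
                part_path = bcsplit_dir / lane / f"{pool}-read-1.fastq.gz"
                with open(part_path, "rb") as part:
                    # Raw byte copy; concatenated gzip members stay a valid gzip file.
                    shutil.copyfileobj(part, out)
        written.append(out_path)
    return written


# Example call (commented out so nothing is overwritten by accident):
# concat_lanes(Path("~/02-apcl-ddocent/18seq/bcsplit").expanduser())
```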

Created the process_radtags scripts with the seq_proc.Rmd notebook and then moved the files with Fetch to the 18seq/scripts folder on amphiprion. In Fetch, I also highlighted the scripts, clicked "Get Info", and checked "execute" instead of running chmod u+x on all of them.

Also wrote a script to generate the command lines that run process_radtags. These need to be run in separate windows (not sure how that works through a Jupyter notebook). The command lines are in the files SEQXX_process_all.sh.

Started at 9:45pm on Monday night for SEQ18, finished before 6am on Tuesday.

In [ ]:
%%bash
nohup ./scripts/P073_process.sh &
nohup ./scripts/P074_process.sh &
nohup ./scripts/P075_process.sh &
nohup ./scripts/P076_process.sh &
nohup ./scripts/P077_process.sh &
nohup ./scripts/P078_process.sh &
nohup ./scripts/P083_process.sh &
nohup ./scripts/P084_process.sh &
nohup ./scripts/P085_process.sh &
nohup ./scripts/P086_process.sh &
nohup ./scripts/P087_process.sh &
nohup ./scripts/P088_process.sh &
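
On the question of running these from a notebook: one option, sketched here and not tested on amphiprion, is to start each script with subprocess.Popen, which returns right away so that all scripts run side by side. The function name launch_scripts and the log file naming are assumptions; the scripts themselves are the ones listed above.

```python
import subprocess
from pathlib import Path


def launch_scripts(scripts, workdir, log_dir):
    """Start every script in the background and return {script: process}."""
    procs = {}
    for script in scripts:
        # One log per script, e.g. P073_process.nohup.log (name is illustrative).
        log_path = Path(log_dir) / (Path(script).stem + ".nohup.log")
        with open(log_path, "w") as log:
            # Popen does not wait for the script to finish; start_new_session=True
            # (POSIX only) puts the job in its own session.
            procs[script] = subprocess.Popen(
                [str(script)], cwd=workdir, stdout=log,
                stderr=subprocess.STDOUT, start_new_session=True)
    return procs


# Example call (commented out; each job runs for hours):
# seq_dir = Path("~/02-apcl-ddocent/18seq").expanduser()
# pools = ["P073", "P074", "P075", "P076", "P077", "P078",
#          "P083", "P084", "P085", "P086", "P087", "P088"]
# jobs = launch_scripts([seq_dir / "scripts" / f"{p}_process.sh" for p in pools],
#                       seq_dir, seq_dir / "logs")
```

Calling poll() on a returned process gives None while it is still running and the exit code once it has finished.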

Move the process_radtags logs to the logs folder - wrote command lines for large batches in seq_proc.Rmd notebook and saved the lines in files called SEQXX-move_radlogs.sh

In [ ]:
%%bash
mv 01Pool/process_radtags.log ./logs/processP075.log
mv 02Pool/process_radtags.log ./logs/processP076.log
mv 03Pool/process_radtags.log ./logs/processP073.log
mv 04Pool/process_radtags.log ./logs/processP074.log
mv 05Pool/process_radtags.log ./logs/processP077.log
mv 06Pool/process_radtags.log ./logs/processP078.log
mv 07Pool/process_radtags.log ./logs/processP083.log
mv 08Pool/process_radtags.log ./logs/processP084.log
mv 09Pool/process_radtags.log ./logs/processP085.log
mv 10Pool/process_radtags.log ./logs/processP086.log
mv 11Pool/process_radtags.log ./logs/processP087.log
mv 12Pool/process_radtags.log ./logs/processP088.log

It looks like I gave the P073-P076 logs the wrong names above (P073 and P075 were swapped, and P074 and P076 were swapped). Fixing it here:

In [14]:
%%bash
cd ~/02-apcl-ddocent/18seq/logs/
pwd
mv processP075.log processP073.log.b
mv processP073.log processP075.log
mv processP073.log.b processP073.log
mv processP076.log processP074.log.b
mv processP074.log processP076.log
mv processP074.log.b processP074.log
/local/home/michelles/02-apcl-ddocent/18seq/logs
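
To avoid this kind of mix-up, the log names could be derived from a single folder-to-pool table instead of being typed by hand. A minimal sketch, assuming the assignment used in the concatenation step (P073 in 01Pool through P088 in 12Pool); FOLDER_TO_POOL and collect_radtags_logs are invented names.

```python
from pathlib import Path

# Pool folder -> pool ID, matching where each pool's fastq.gz was written.
FOLDER_TO_POOL = {
    "01Pool": "P073", "02Pool": "P074", "03Pool": "P075", "04Pool": "P076",
    "05Pool": "P077", "06Pool": "P078", "07Pool": "P083", "08Pool": "P084",
    "09Pool": "P085", "10Pool": "P086", "11Pool": "P087", "12Pool": "P088",
}


def collect_radtags_logs(seq_dir, folder_to_pool=FOLDER_TO_POOL):
    """Move <folder>/process_radtags.log to logs/process<pool>.log for every pool."""
    seq_dir = Path(seq_dir)
    moved = []
    for folder, pool in folder_to_pool.items():
        src = seq_dir / folder / "process_radtags.log"
        dest = seq_dir / "logs" / f"process{pool}.log"
        # Refuse to overwrite an existing log rather than silently replacing it.
        if dest.exists():
            raise FileExistsError(f"already present: {dest}")
        src.rename(dest)
        moved.append(dest)
    return moved


# Example call (commented out so nothing is moved by accident):
# collect_radtags_logs(Path("~/02-apcl-ddocent/18seq").expanduser())
```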

Run the readprocesslog.py script to convert the process log into a tsv that can be imported into the database through R

In [28]:
# latlon_3.py - for use in Chapter 10 PCfB
# Read in each line of the example file, split it into 
# separate components, and write certain output to a separate file

# Set the input file name
# (The program must be run from within the directory 
#  that contains this data file)
#InFileName = './logs/16process.out'
InFileName = '/local/home/michelles/02-apcl-ddocent/18seq/logs/processP073.log'

# Open the input file for reading
InFile = open(InFileName, 'r')

# Initialize the counter used to keep track of line numbers
LineNumber = 0

# Open the output file for writing
# Do this *before* the loop, not inside it
OutFileName=InFileName + ".tsv"

OutFile=open(OutFileName,'w') # You can append instead with 'a'

# Loop through each line in the file
for Line in InFile:
        # Keep only lines 14 through 61 (counting from 0), which hold the per-barcode rows
        if LineNumber < 62:
                if LineNumber > 13:
                        # Remove the line ending characters
                        Line=Line.strip('\n')
                        
                        # Separate the line into a list of its tab-delimited components
                        ElementList=Line.split('\t')
                        Barcode   = ElementList[0]
                        Total_Reads = ElementList[1]
                        Lack_RadTag = ElementList[2]  
                        Low_Quality = ElementList[3]
                        Retained = ElementList[4]
                        
                        # Use the % operator to generate a string 
                        # We can use this for output both to the screen and to a file
                        OutputString = "%s\t%s\t%s\t%s\t%s" % (ElementList[0], ElementList[1], ElementList[2], ElementList[3], ElementList[4])
                           
                        # Can still print to the screen then write to a file
                        print (OutputString)
                        
                        # Unlike print statements, .write needs a linefeed
                        OutFile.write(OutputString+"\n")  
                
        # Index the counter used to keep track of line numbers
        LineNumber = LineNumber + 1
# After the loop is completed, close the files
InFile.close()
OutFile.close()
AAACAC	184410	544	2287	180651
AAACGA	3049	12	64	2959
AAAGTC	254	3	10	240
AACGGT	424	4	35	383
AACTTC	42899	135	595	41979
TGCTCA	294126	777	3494	288375
AAGAAC	15143	48	196	14828
AATGTG	348810	1739	4301	340906
ACATGT	128123	529	1634	125283
ACCAAA	135301	390	1787	132476
ACGATA	307763	781	3946	301508
ACGTTT	411092	4607	5438	399045
ACTAGG	107275	431	1376	104846
ACTCCA	135777	384	1662	133027
AGACTC	554874	1445	6934	543964
AGCATT	73469	356	961	71750
AGGAGA	180997	583	2261	177258
AGTAAG	295916	1038	3664	289711
AGTCCT	106254	398	1387	103968
AGTTAC	284683	915	3885	278372
ATAACC	495362	2550	6621	483717
ATCGCA	505607	1482	6762	494689
ATCTCG	609247	1763	8079	596235
ATGGAG	180117	649	2292	176142
ATGTCC	381	5	31	343
CAATTG	375559	1880	4581	367250
CAGAGT	1216	15	33	1164
CATCAG	577	16	15	543
CATCTC	339022	1180	4152	331830
CCACTT	622318	3002	8066	608307
CCCATA	611199	1664	7833	598970
CCTGAA	444268	1317	5615	435273
CGAAAC	150051	431	1810	147086
CGAATG	4101	87	61	3931
GACGTT	279543	977	3571	273749
GATACA	47201	120	638	46232
GCAGAA	352914	1012	4271	345785
GGGATA	151676	558	1844	148573
GGTGAA	141935	495	1600	139112
GTAGCT	421092	1292	5228	412733
GTCTAT	126800	381	1576	124240
GTTCAG	263050	911	2994	257900
TAAGAC	164730	486	1923	161403
TACCAG	286207	1014	3555	280324
TCAATC	345607	1161	4121	338703
TCCAAA	269829	732	3331	264393
TCTGCT	257406	741	3099	252326
TCTTAG	51885	1289	641	49741
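
Instead of pasting the cell once per pool and editing InFileName each time, the same steps could be wrapped in a function and called in a loop. This is a sketch, not the readprocesslog.py script itself: log_to_tsv and its parameters are made-up names, and the line window (lines 14 through 61, counting from 0) and the five columns are taken from the cell above. It raises an error on a row with fewer than five columns.

```python
from pathlib import Path


def log_to_tsv(log_path, first_line=14, last_line=61, n_columns=5):
    """Write the per-barcode rows of a process_radtags log to <log_path>.tsv."""
    log_path = Path(log_path)
    tsv_path = Path(str(log_path) + ".tsv")
    rows = []
    with open(log_path) as infile:  # the with block closes the file for us
        for line_number, line in enumerate(infile):
            if first_line <= line_number <= last_line:
                fields = line.rstrip("\n").split("\t")
                if len(fields) < n_columns:
                    raise ValueError(f"line {line_number} has too few columns")
                # Barcode, total reads, lacking RAD tag, low quality, retained
                rows.append("\t".join(fields[:n_columns]))
    tsv_path.write_text("".join(row + "\n" for row in rows))
    return tsv_path


# Example loop over the SEQ18 pools (commented out so nothing is overwritten by accident):
# logs = Path("/local/home/michelles/02-apcl-ddocent/18seq/logs")
# for pool in ["P073", "P074", "P075", "P076", "P077", "P078",
#              "P083", "P084", "P085", "P086", "P087", "P088"]:
#     log_to_tsv(logs / f"process{pool}.log")
```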
In [30]:
%ls 02-apcl-ddocent/18seq/logs
barcodes         names_078.tsv  processP073.log      processP083.log
index-seq18.tsv  names_083.tsv  processP073.log.tsv  processP084.log
names_073.tsv    names_084.tsv  processP074.log      processP085.log
names_074.tsv    names_085.tsv  processP075.log      processP086.log
names_075.tsv    names_086.tsv  processP076.log      processP087.log
names_076.tsv    names_087.tsv  processP077.log      processP088.log
names_077.tsv    names_088.tsv  processP078.log
In [31]:
# latlon_3.py - for use in Chapter 10 PCfB
# Read in each line of the example file, split it into 
# separate components, and write certain output to a separate file

# Set the input file name
# (The program must be run from within the directory 
#  that contains this data file)
#InFileName = './logs/16process.out'
InFileName = '/local/home/michelles/02-apcl-ddocent/18seq/logs/processP074.log'

# Open the input file for reading
InFile = open(InFileName, 'r')

# Initialize the counter used to keep track of line numbers
LineNumber = 0

# Open the output file for writing
# Do this *before* the loop, not inside it
OutFileName=InFileName + ".tsv"

OutFile=open(OutFileName,'w') # You can append instead with 'a'

# Loop through each line in the file
for Line in InFile:
        # Keep only lines 14 through 61 (counting from 0), which hold the per-barcode rows
        if LineNumber < 62:
                if LineNumber > 13:
                        # Remove the line ending characters
                        Line=Line.strip('\n')
                        
                        # Separate the line into a list of its tab-delimited components
                        ElementList=Line.split('\t')
                        Barcode   = ElementList[0]
                        Total_Reads = ElementList[1]
                        Lack_RadTag = ElementList[2]  
                        Low_Quality = ElementList[3]
                        Retained = ElementList[4]
                        
                        # Use the % operator to generate a string 
                        # We can use this for output both to the screen and to a file
                        OutputString = "%s\t%s\t%s\t%s\t%s" % (ElementList[0], ElementList[1], ElementList[2], ElementList[3], ElementList[4])
                           
                        # Can still print to the screen then write to a file
                        print (OutputString)
                        
                        # Unlike print statements, .write needs a linefeed
                        OutFile.write(OutputString+"\n")  
                
        # Index the counter used to keep track of line numbers
        LineNumber = LineNumber + 1
# After the loop is completed, close the files
InFile.close()
OutFile.close()
AAACAC	286581	921	3986	280149
AAACGA	23464	68	319	22969
AAAGTC	308347	1000	3847	301991
AACGGT	479304	2311	6190	468136
AACTTC	402800	1285	5759	393856
TGCTCA	297985	807	3764	291881
AAGAAC	279586	1145	3489	273624
AATGTG	143766	590	1948	140574
ACATGT	366207	1493	4770	357945
ACCAAA	458046	1765	6234	447605
ACGATA	48343	218	629	47232
ACGTTT	405859	2780	5756	395044
ACTAGG	257218	907	3284	251646
ACTCCA	323123	1445	4256	315915
AGACTC	165108	689	2063	161512
AGCATT	606406	2667	7970	592646
AGGAGA	429575	1223	5382	420463
AGTAAG	140100	436	1752	137101
AGTCCT	513189	2545	6601	501268
AGTTAC	350908	1302	4749	342896
ATAACC	278692	950	3867	272383
ATCGCA	127602	323	1774	124838
ATCTCG	508599	1338	6918	498000
ATGGAG	275501	1085	3708	269080
ATGTCC	345918	941	4513	338782
CAATTG	211679	899	2768	206858
CAGAGT	441716	2067	5447	431750
CATCAG	25657	154	369	24959
CATCTC	4876	23	78	4750
CCACTT	392079	1505	5332	383154
CCCATA	610451	1696	7990	598219
CCTGAA	488638	1662	5860	478629
CGAAAC	188222	625	2296	184383
CGAATG	190016	1082	2401	185445
GACGTT	136772	645	1755	133741
GATACA	118510	382	1563	115855
GCAGAA	591034	2100	7099	578419
GGGATA	460005	1615	6137	449902
GGTGAA	575240	2448	7007	562910
GTAGCT	457521	1600	6085	447439
GTCTAT	318993	1050	3871	312431
GTTCAG	121644	529	1439	118990
TAAGAC	595626	1908	7236	583349
TACCAG	247707	1443	3411	241487
TCAATC	34009	120	438	33274
TCCAAA	235048	626	3032	230274
TCTGCT	439300	1475	5491	429869
TCTTAG	256021	1537	3131	250179
In [32]:
# latlon_3.py - for use in Chapter 10 PCfB
# Read in each line of the example file, split it into 
# separate components, and write certain output to a separate file

# Set the input file name
# (The program must be run from within the directory 
#  that contains this data file)
#InFileName = './logs/16process.out'
InFileName = '/local/home/michelles/02-apcl-ddocent/18seq/logs/processP075.log'

# Open the input file for reading
InFile = open(InFileName, 'r')

# Initialize the counter used to keep track of line numbers
LineNumber = 0

# Open the output file for writing
# Do this *before* the loop, not inside it
OutFileName=InFileName + ".tsv"

OutFile=open(OutFileName,'w') # You can append instead with 'a'

# Loop through each line in the file
for Line in InFile:
        # Keep only lines 14 through 61 (counting from 0), which hold the per-barcode rows
        if LineNumber < 62:
                if LineNumber > 13:
                        # Remove the line ending characters
                        Line=Line.strip('\n')
                        
                        # Separate the line into a list of its tab-delimited components
                        ElementList=Line.split('\t')
                        Barcode   = ElementList[0]
                        Total_Reads = ElementList[1]
                        Lack_RadTag = ElementList[2]  
                        Low_Quality = ElementList[3]
                        Retained = ElementList[4]
                        
                        # Use the % operator to generate a string 
                        # We can use this for output both to the screen and to a file
                        OutputString = "%s\t%s\t%s\t%s\t%s" % (ElementList[0], ElementList[1], ElementList[2], ElementList[3], ElementList[4])
                           
                        # Can still print to the screen then write to a file
                        print (OutputString)
                        
                        # Unlike print statements, .write needs a linefeed
                        OutFile.write(OutputString+"\n")  
                
        # Index the counter used to keep track of line numbers
        LineNumber = LineNumber + 1
# After the loop is completed, close the files
InFile.close()
OutFile.close()
AAACAC	611438	4141	10121	593862
AAACGA	685006	3842	11635	665387
AAAGTC	651189	2848	9889	634697
AACGGT	818631	3567	13299	796202
AACTTC	715900	2762	12630	696563
TGCTCA	497017	2607	7865	483722
AAGAAC	672194	2900	9886	655363
AATGTG	805988	3400	12012	785724
ACATGT	280651	1098	4081	273974
ACCAAA	391728	1370	5957	381880
ACGATA	359954	1441	5385	351155
ACGTTT	468370	3467	7615	455034
ACTAGG	251822	1672	3686	245072
ACTCCA	271383	950	3958	265106
AGACTC	286043	817	4167	279549
AGCATT	766704	3560	11513	747352
AGGAGA	627866	2076	8778	613139
AGTAAG	527695	1991	7728	514689
AGTCCT	318912	2799	4576	309999
AGTTAC	337237	911	5149	329195
ATAACC	378399	1238	5963	369043
ATCGCA	566520	1612	8858	552920
ATCTCG	327208	949	5029	319498
ATGGAG	666548	2232	10130	649593
ATGTCC	827845	2202	12229	808739
CAATTG	705520	2614	10147	688504
CAGAGT	832933	4021	12680	811129
CATCAG	580259	2288	8004	566658
CATCTC	454706	1210	6512	443903
CCACTT	962338	3176	13927	939838
CCCATA	622992	1809	9331	608492
CCTGAA	548341	1798	8233	535321
CGAAAC	835921	2532	11880	816704
CGAATG	847102	4283	14240	823436
GACGTT	904852	3435	13166	883126
GATACA	728563	1949	10416	711878
GCAGAA	824197	2419	10663	806151
GGGATA	729056	2369	11009	711963
GGTGAA	444854	1543	6113	434672
GTAGCT	483419	1901	7447	471315
GTCTAT	341803	1427	4861	333750
GTTCAG	423	5	20	396
TAAGAC	593	6	46	540
TACCAG	483320	1526	7484	471652
TCAATC	564	13	34	514
TCCAAA	607422	1589	8574	593552
TCTGCT	508928	1395	7333	497302
TCTTAG	751368	2129	10222	734644
In [33]:
# latlon_3.py - for use in Chapter 10 PCfB
# Read in each line of the example file, split it into 
# separate components, and write certain output to a separate file

# Set the input file name
# (The program must be run from within the directory 
#  that contains this data file)
#InFileName = './logs/16process.out'
InFileName = '/local/home/michelles/02-apcl-ddocent/18seq/logs/processP076.log'

# Open the input file for reading
InFile = open(InFileName, 'r')

# Initialize the counter used to keep track of line numbers
LineNumber = 0

# Open the output file for writing
# Do this *before* the loop, not inside it
OutFileName=InFileName + ".tsv"

OutFile=open(OutFileName,'w') # You can append instead with 'a'

# Loop through each line in the file
for Line in InFile:
        # Keep only lines 14 through 61 (counting from 0), which hold the per-barcode rows
        if LineNumber < 62:
                if LineNumber > 13:
                        # Remove the line ending characters
                        Line=Line.strip('\n')
                        
                        # Separate the line into a list of its tab-delimited components
                        ElementList=Line.split('\t')
                        Barcode   = ElementList[0]
                        Total_Reads = ElementList[1]
                        Lack_RadTag = ElementList[2]  
                        Low_Quality = ElementList[3]
                        Retained = ElementList[4]
                        
                        # Use the % operator to generate a string 
                        # We can use this for output both to the screen and to a file
                        OutputString = "%s\t%s\t%s\t%s\t%s" % (ElementList[0], ElementList[1], ElementList[2], ElementList[3], ElementList[4])
                           
                        # Can still print to the screen then write to a file
                        print (OutputString)
                        
                        # Unlike print statements, .write needs a linefeed
                        OutFile.write(OutputString+"\n")  
                
        # Index the counter used to keep track of line numbers
        LineNumber = LineNumber + 1
# After the loop is completed, close the files
InFile.close()
OutFile.close()
AAACAC	1047823	5454	13943	1022118
AAACGA	1238925	3664	15965	1212020
AAAGTC	771624	2346	9580	755320
AACGGT	1002155	3668	13324	979004
AACTTC	2749	60	213	2471
TGCTCA	2821	25	239	2550
AAGAAC	2774	18	222	2524
AATGTG	35202	187	2790	32075
ACATGT	973979	3921	12593	952103
ACCAAA	1012023	2716	12949	990138
ACGATA	35113	116	2775	32096
ACGTTT	23866	1486	1918	20382
ACTAGG	21404	91	1613	19614
ACTCCA	771	9	35	718
AGACTC	535	7	46	477
AGCATT	554	11	36	504
AGGAGA	1133769	5164	15330	1105932
AGTAAG	845226	4554	11366	824448
AGTCCT	793167	4206	10567	773785
AGTTAC	968935	3776	12999	946571
ATAACC	622834	3332	9104	606666
ATCGCA	855875	2666	11979	836175
ATCTCG	930013	2744	12374	909872
ATGGAG	605620	5025	8756	587930
ATGTCC	521318	1301	6648	510550
CAATTG	569576	2059	6986	556659
CAGAGT	953165	4922	11078	931424
CATCAG	797494	2833	9372	780013
CATCTC	546	12	36	495
CCACTT	749	17	50	676
CCCATA	577529	1663	7473	565652
CCTGAA	701282	2224	7966	686899
CGAAAC	579	21	33	524
CGAATG	819	124	40	653
GACGTT	1067184	4025	13204	1043482
GATACA	406660	1795	5237	397203
GCAGAA	700544	2210	8086	686036
GGGATA	1048007	3598	13657	1024735
GGTGAA	618552	2309	7593	605089
GTAGCT	432274	1579	5573	422363
GTCTAT	580355	2285	7355	567342
GTTCAG	448188	1620	5633	438374
TAAGAC	573131	2356	7185	560229
TACCAG	689061	2110	8968	674511
TCAATC	836841	2368	10194	819587
TCCAAA	720641	2198	9356	704772
TCTGCT	1051618	3394	13143	1028691
TCTTAG	611	9	33	565
In [34]:
# latlon_3.py - for use in Chapter 10 PCfB
# Read in each line of the example file, split it into 
# separate components, and write certain output to a separate file

# Set the input file name
# (The program must be run from within the directory 
#  that contains this data file)
#InFileName = './logs/16process.out'
InFileName = '/local/home/michelles/02-apcl-ddocent/18seq/logs/processP077.log'

# Open the input file for reading
InFile = open(InFileName, 'r')

# Initialize the counter used to keep track of line numbers
LineNumber = 0

# Open the output file for writing
# Do this *before* the loop, not inside it
OutFileName=InFileName + ".tsv"

OutFile=open(OutFileName,'w') # You can append instead with 'a'

# Loop through each line in the file
for Line in InFile:
        # Keep only lines 14 through 61 (counting from 0), which hold the per-barcode rows
        if LineNumber < 62:
                if LineNumber > 13:
                        # Remove the line ending characters
                        Line=Line.strip('\n')
                        
                        # Separate the line into a list of its tab-delimited components
                        ElementList=Line.split('\t')
                        Barcode   = ElementList[0]
                        Total_Reads = ElementList[1]
                        Lack_RadTag = ElementList[2]  
                        Low_Quality = ElementList[3]
                        Retained = ElementList[4]
                        
                        # Use the % operator to generate a string 
                        # We can use this for output both to the screen and to a file
                        OutputString = "%s\t%s\t%s\t%s\t%s" % (ElementList[0], ElementList[1], ElementList[2], ElementList[3], ElementList[4])
                           
                        # Can still print to the screen then write to a file
                        print (OutputString)
                        
                        # Unlike print statements, .write needs a linefeed
                        OutFile.write(OutputString+"\n")  
                
        # Index the counter used to keep track of line numbers
        LineNumber = LineNumber + 1
# After the loop is completed, close the files
InFile.close()
OutFile.close()
AAACAC	636159	4899	8699	618617
AAACGA	427099	2267	6114	416005
AAAGTC	303089	1467	3930	295944
AACGGT	1110827	4831	14924	1084381
AACTTC	516713	1693	7031	505042
TGCTCA	1004575	4143	13129	981022
AAGAAC	579584	2534	7417	565913
AATGTG	1191666	5419	15706	1163147
ACATGT	831459	3139	10900	812851
ACCAAA	421573	2413	5826	410872
ACGATA	640909	4742	9333	623418
ACGTTT	723047	6105	11413	701605
ACTAGG	1008	11	95	901
ACTCCA	1948	41	39	1861
AGACTC	411425	1890	5784	401386
AGCATT	350910	1755	4615	342020
AGGAGA	317664	2382	4661	308949
AGTAAG	398126	2624	5690	387445
AGTCCT	312060	3147	4521	302668
AGTTAC	397331	2242	6038	386808
ATAACC	361194	1979	5607	351608
ATCGCA	625891	2659	9017	610975
ATCTCG	659483	2112	8704	644936
ATGGAG	382261	1577	5060	373059
ATGTCC	627891	2104	8108	613857
CAATTG	507492	2935	6641	494954
CAGAGT	343412	2215	4328	334483
CATCAG	448209	1587	5366	438761
CATCTC	422206	1317	5355	413202
CCACTT	444	24	35	385
CCCATA	555	15	38	501
CCTGAA	542810	1805	7002	530831
CGAAAC	365	19	23	319
CGAATG	222646	969	2980	217303
GACGTT	489227	2088	6396	478007
GATACA	386868	1170	5141	378185
GCAGAA	681266	3730	8891	664392
GGGATA	10900	75	872	9914
GGTGAA	15070	93	1102	13826
GTAGCT	21087	97	1737	19196
GTCTAT	15769	94	1189	14437
GTTCAG	9266	67	668	8514
TAAGAC	17907	81	1406	16372
TACCAG	29745	170	2232	27226
TCAATC	24400	100	1820	22376
TCCAAA	33577	154	2543	30765
TCTGCT	26630	138	2107	24299
TCTTAG	46094	279	3519	42157
In [35]:
# latlon_3.py - for use in Chapter 10 PCfB
# Read in each line of the example file, split it into 
# separate components, and write certain output to a separate file

# Set the input file name
# (The program must be run from within the directory 
#  that contains this data file)
#InFileName = './logs/16process.out'
InFileName = '/local/home/michelles/02-apcl-ddocent/18seq/logs/processP078.log'

# Open the input file for reading
InFile = open(InFileName, 'r')

# Initialize the counter used to keep track of line numbers
LineNumber = 0

# Open the output file for writing
# Do this *before* the loop, not inside it
OutFileName=InFileName + ".tsv"

OutFile=open(OutFileName,'w') # You can append instead with 'a'

# Loop through each line in the file
for Line in InFile:
        # Keep only lines 14 through 61 (counting from 0), which hold the per-barcode rows
        if LineNumber < 62:
                if LineNumber > 13:
                        # Remove the line ending characters
                        Line=Line.strip('\n')
                        
                        # Separate the line into a list of its tab-delimited components
                        ElementList=Line.split('\t')
                        Barcode   = ElementList[0]
                        Total_Reads = ElementList[1]
                        Lack_RadTag = ElementList[2]  
                        Low_Quality = ElementList[3]
                        Retained = ElementList[4]
                        
                        # Use the % operator to generate a string 
                        # We can use this for output both to the screen and to a file
                        OutputString = "%s\t%s\t%s\t%s\t%s" % (ElementList[0], ElementList[1], ElementList[2], ElementList[3], ElementList[4])
                           
                        # Can still print to the screen then write to a file
                        print (OutputString)
                        
                        # Unlike print statements, .write needs a linefeed
                        OutFile.write(OutputString+"\n")  
                
        # Index the counter used to keep track of line numbers
        LineNumber = LineNumber + 1
# After the loop is completed, close the files
InFile.close()
OutFile.close()
AAACAC	308115	1306	4082	300852
AAACGA	602781	2159	8593	588508
AAAGTC	303177	986	3938	296583
AACGGT	305261	1253	4196	298102
AACTTC	140407	363	1958	137227
TGCTCA	668158	1694	8516	653911
AAGAAC	446683	1334	5895	436974
AATGTG	475148	1871	6373	464245
ACATGT	525840	1777	7042	514161
ACCAAA	843217	2241	11719	824078
ACGATA	488436	1399	6625	477383
ACGTTT	494422	2939	7231	481045
ACTAGG	464111	1531	6028	453664
ACTCCA	761947	2129	10591	745546
AGACTC	315290	1197	4366	308132
AGCATT	232587	1476	3377	226490
AGGAGA	378	10	31	336
AGTAAG	286721	1525	4043	279511
AGTCCT	341727	2100	4681	333044
AGTTAC	411067	1040	5543	402164
ATAACC	430747	1929	5976	420578
ATCGCA	215092	674	2978	210345
ATCTCG	426178	983	5763	417256
ATGGAG	204020	653	2663	199682
ATGTCC	248990	741	3393	243597
CAATTG	320745	1162	3995	313534
CAGAGT	480330	2440	6418	468654
CATCAG	344732	1100	4363	337345
CATCTC	307	6	18	281
CCACTT	467132	1486	6285	456754
CCCATA	336442	907	4628	329158
CCTGAA	166321	478	2252	162766
CGAAAC	306655	835	4168	299820
CGAATG	424642	1367	5516	415344
GACGTT	561172	2031	7285	548587
GATACA	433361	1365	5561	424188
GCAGAA	520869	1530	6279	510169
GGGATA	11894	78	921	10866
GGTGAA	19257	70	1528	17591
GTAGCT	707	8	52	644
GTCTAT	16262	70	1296	14849
GTTCAG	19122	69	1431	17567
TAAGAC	5852	30	422	5380
TACCAG	271	5	10	253
TCAATC	401	12	16	368
TCCAAA	13442	49	1020	12330
TCTGCT	163134	480	2095	159592
TCTTAG	257357	796	3184	251963
In [3]:
%ls 02-apcl-ddocent/18seq/logs
barcodes         names_083.tsv        processP074.log      processP078.log
index-seq18.tsv  names_084.tsv        processP074.log.tsv  processP078.log.tsv
names_073.tsv    names_085.tsv        processP075.log      processP083.log
names_074.tsv    names_086.tsv        processP075.log.tsv  processP084.log
names_075.tsv    names_087.tsv        processP076.log      processP085.log
names_076.tsv    names_088.tsv        processP076.log.tsv  processP086.log
names_077.tsv    processP073.log      processP077.log      processP087.log
names_078.tsv    processP073.log.tsv  processP077.log.tsv  processP088.log
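
Rather than reading the listing by eye to see which logs still need converting, a short check can list the process logs that have no matching .tsv yet. A sketch only; logs_without_tsv is a made-up name, and the process*.log and .log.tsv naming is the one visible in the listing above.

```python
from pathlib import Path


def logs_without_tsv(log_dir):
    """Return the names of process*.log files that have no <name>.tsv next to them."""
    log_dir = Path(log_dir)
    return sorted(
        log.name
        for log in log_dir.glob("process*.log")  # does not match the .log.tsv files
        if not Path(str(log) + ".tsv").exists()
    )


# Example call:
# logs_without_tsv("/local/home/michelles/02-apcl-ddocent/18seq/logs")
```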
In [4]:
# Adapted from latlon_3.py (Chapter 10, PCfB)
# Read in each line of the process log, split it into 
# separate components, and write the per-barcode counts to a separate file

# Set the input file name
# (A full path is used, so this cell can be run from any directory)
#InFileName = './logs/16process.out'
InFileName = '/local/home/michelles/02-apcl-ddocent/18seq/logs/processP083.log'

# Open the input file for reading
InFile = open(InFileName, 'r')

# Initialize the counter used to keep track of line numbers
LineNumber = 0

# Open the output file for writing
# Do this *before* the loop, not inside it
OutFileName=InFileName + ".tsv"

OutFile=open(OutFileName,'w') # You can append instead with 'a'

# Loop through each line in the file
for Line in InFile:
        # Keep only lines 14-61 (counting from 0): the 48 per-barcode rows.
        # Everything before and after that range is skipped.
        if LineNumber < 62:
                if LineNumber > 13:
                        # Remove the line ending characters
                        Line=Line.strip('\n')
                        
                        # Separate the line into a list of its tab-delimited components
                        ElementList=Line.split('\t')
                        Barcode   = ElementList[0]
                        Total_Reads = ElementList[1]
                        Lack_RadTag = ElementList[2]  
                        Low_Quality = ElementList[3]
                        Retained = ElementList[4]
                        
                        # Use the % operator to generate a string 
                        # We can use this for output both to the screen and to a file
                        OutputString = "%s\t%s\t%s\t%s\t%s" % (ElementList[0], ElementList[1], ElementList[2], ElementList[3], ElementList[4])
                           
                        # Can still print to the screen then write to a file
                        print (OutputString)
                        
                        # Unlike print statements, .write needs a linefeed
                        OutFile.write(OutputString+"\n")  
                
        # Increment the counter used to keep track of line numbers
        LineNumber = LineNumber + 1
# After the loop is completed, close the files
InFile.close()
OutFile.close()
AAACAC	311328	920	3599	305059
AAACGA	755744	2186	9135	740309
AAAGTC	108927	437	1147	106624
AACGGT	576085	3022	6703	562778
AACTTC	315241	974	3780	308809
TGCTCA	298717	746	3231	292782
AAGAAC	131174	374	1437	128658
AATGTG	455821	1969	5215	446204
ACATGT	434612	1397	4899	425774
ACCAAA	121636	322	1503	119154
ACGATA	189919	531	2239	186140
ACGTTT	152146	808	1869	148551
ACTAGG	235599	843	2607	230934
ACTCCA	254725	822	2984	249448
AGACTC	288005	1053	3232	282263
AGCATT	282	22	14	241
AGGAGA	68544	282	749	67136
AGTAAG	254476	2358	2755	248113
AGTCCT	70027	466	783	68428
AGTTAC	246702	704	2847	241986
ATAACC	22180	149	304	21574
ATCGCA	41478	148	479	40639
ATCTCG	121920	466	1486	119380
ATGGAG	265725	4785	2922	256433
ATGTCC	87270	246	937	85696
CAATTG	542588	1969	6129	531431
CAGAGT	85432	600	896	83436
CATCAG	155337	916	1741	151833
CATCTC	338982	1179	3714	332147
CCACTT	86070	969	1012	83667
CCCATA	787720	2645	8874	771082
CCTGAA	708	7	19	681
CGAAAC	15040	59	149	14756
CGAATG	82167	270	915	80402
GACGTT	39678	185	478	38773
GATACA	306129	933	3410	300085
GCAGAA	8253	36	124	8067
GGGATA	340583	1309	3839	333588
GGTGAA	1043061	4178	11249	1021601
GTAGCT	311462	1653	3505	304543
GTCTAT	249166	850	2682	244355
GTTCAG	329601	1269	3346	323208
TAAGAC	275493	1005	3021	270036
TACCAG	84210	404	976	82387
TCAATC	224252	1374	2426	219162
TCCAAA	1232833	5109	14343	1206595
TCTGCT	109658	572	1098	107388
TCTTAG	44921	524	504	43661
In [5]:
# Adapted from latlon_3.py (Chapter 10, PCfB)
# Read in each line of the process log, split it into 
# separate components, and write the per-barcode counts to a separate file

# Set the input file name
# (A full path is used, so this cell can be run from any directory)
#InFileName = './logs/16process.out'
InFileName = '/local/home/michelles/02-apcl-ddocent/18seq/logs/processP084.log'

# Open the input file for reading
InFile = open(InFileName, 'r')

# Initialize the counter used to keep track of line numbers
LineNumber = 0

# Open the output file for writing
# Do this *before* the loop, not inside it
OutFileName=InFileName + ".tsv"

OutFile=open(OutFileName,'w') # You can append instead with 'a'

# Loop through each line in the file
for Line in InFile:
        # Keep only lines 14-61 (counting from 0): the 48 per-barcode rows.
        # Everything before and after that range is skipped.
        if LineNumber < 62:
                if LineNumber > 13:
                        # Remove the line ending characters
                        Line=Line.strip('\n')
                        
                        # Separate the line into a list of its tab-delimited components
                        ElementList=Line.split('\t')
                        Barcode   = ElementList[0]
                        Total_Reads = ElementList[1]
                        Lack_RadTag = ElementList[2]  
                        Low_Quality = ElementList[3]
                        Retained = ElementList[4]
                        
                        # Use the % operator to generate a string 
                        # We can use this for output both to the screen and to a file
                        OutputString = "%s\t%s\t%s\t%s\t%s" % (ElementList[0], ElementList[1], ElementList[2], ElementList[3], ElementList[4])
                           
                        # Can still print to the screen then write to a file
                        print (OutputString)
                        
                        # Unlike print statements, .write needs a linefeed
                        OutFile.write(OutputString+"\n")  
                
        # Increment the counter used to keep track of line numbers
        LineNumber = LineNumber + 1
# After the loop is completed, close the files
InFile.close()
OutFile.close()
AAACAC	171222	1255	2515	166476
AAACGA	398260	1678	5341	388907
AAAGTC	395168	1847	4987	386349
AACGGT	409633	2208	5723	399318
AACTTC	184912	1240	2358	180156
TGCTCA	343243	1009	4089	336232
AAGAAC	928026	4443	11321	907424
AATGTG	297306	3924	3646	288199
ACATGT	634996	2627	8119	620791
ACCAAA	775362	2243	9444	759176
ACGATA	517074	1646	6794	505722
ACGTTT	549371	3730	7517	535000
ACTAGG	375243	1457	4367	367363
ACTCCA	124858	324	1576	122195
AGACTC	878889	2423	10835	860529
AGCATT	804778	4668	10273	785124
AGGAGA	779610	2575	9248	763275
AGTAAG	412781	1375	5020	404123
AGTCCT	835586	5006	10063	815984
AGTTAC	83326	249	1110	81518
ATAACC	368800	1242	4684	360809
ATCGCA	1188357	3219	15169	1162942
ATCTCG	733162	2353	9407	717290
ATGGAG	932906	3420	11226	912500
ATGTCC	767626	2061	9363	751772
CAATTG	995208	4223	11814	972528
CAGAGT	909543	4402	10219	888318
CATCAG	354015	7559	4244	340245
CATCTC	457972	1630	5757	448020
CCACTT	865666	3149	11018	846508
CCCATA	697079	2504	8591	681476
CCTGAA	685316	2529	8002	670798
CGAAAC	287923	898	3503	281786
CGAATG	1034361	4061	12007	1011806
GACGTT	516561	2186	6574	504907
GATACA	730757	2204	8783	715825
GCAGAA	1366014	4730	15336	1338553
GGGATA	649121	2082	7869	635496
GGTGAA	401201	1406	4713	392591
GTAGCT	1105107	3958	13117	1081663
GTCTAT	833500	2784	9704	816092
GTTCAG	75281	325	902	73635
TAAGAC	817378	2818	9550	800371
TACCAG	351923	1107	4311	344530
TCAATC	604654	2141	7044	592049
TCCAAA	745959	2239	9211	730102
TCTGCT	489962	1507	5970	479335
TCTTAG	303611	2694	3577	295813
In [6]:
# Adapted from latlon_3.py (Chapter 10, PCfB)
# Read in each line of the process log, split it into 
# separate components, and write the per-barcode counts to a separate file

# Set the input file name
# (A full path is used, so this cell can be run from any directory)
#InFileName = './logs/16process.out'
InFileName = '/local/home/michelles/02-apcl-ddocent/18seq/logs/processP085.log'

# Open the input file for reading
InFile = open(InFileName, 'r')

# Initialize the counter used to keep track of line numbers
LineNumber = 0

# Open the output file for writing
# Do this *before* the loop, not inside it
OutFileName=InFileName + ".tsv"

OutFile=open(OutFileName,'w') # You can append instead with 'a'

# Loop through each line in the file
for Line in InFile:
        # Keep only lines 14-61 (counting from 0): the 48 per-barcode rows.
        # Everything before and after that range is skipped.
        if LineNumber < 62:
                if LineNumber > 13:
                        # Remove the line ending characters
                        Line=Line.strip('\n')
                        
                        # Separate the line into a list of its tab-delimited components
                        ElementList=Line.split('\t')
                        Barcode   = ElementList[0]
                        Total_Reads = ElementList[1]
                        Lack_RadTag = ElementList[2]  
                        Low_Quality = ElementList[3]
                        Retained = ElementList[4]
                        
                        # Use the % operator to generate a string 
                        # We can use this for output both to the screen and to a file
                        OutputString = "%s\t%s\t%s\t%s\t%s" % (ElementList[0], ElementList[1], ElementList[2], ElementList[3], ElementList[4])
                           
                        # Can still print to the screen then write to a file
                        print (OutputString)
                        
                        # Unlike print statements, .write needs a linefeed
                        OutFile.write(OutputString+"\n")  
                
        # Increment the counter used to keep track of line numbers
        LineNumber = LineNumber + 1
# After the loop is completed, close the files
InFile.close()
OutFile.close()
AAACAC	169523	892	2425	165343
AAACGA	630419	2271	9686	615165
AAAGTC	510347	2368	7645	497779
AACGGT	656349	2924	10338	639481
AACTTC	560307	2034	9115	546296
TGCTCA	648183	2407	9057	633478
AAGAAC	332799	1219	4685	325148
AATGTG	601870	3774	8714	586468
ACATGT	195918	1182	3079	190684
ACCAAA	141647	640	2371	137807
ACGATA	388658	1336	6026	379233
ACGTTT	238006	2517	3988	230225
ACTAGG	162425	873	2284	158396
ACTCCA	202130	879	3293	196889
AGACTC	168312	662	2675	164048
AGCATT	166451	1947	2646	160912
AGGAGA	252414	1483	3810	245687
AGTAAG	267975	2145	3897	260452
AGTCCT	310973	3049	4651	301925
AGTTAC	361302	1936	5701	352023
ATAACC	187006	1410	3057	181570
ATCGCA	406333	1485	6304	396541
ATCTCG	401283	1414	6536	391385
ATGGAG	294887	1468	5141	286691
ATGTCC	442454	1641	6372	432034
CAATTG	388625	2359	5605	378442
CAGAGT	290174	2607	4213	281729
CATCAG	244122	2107	3714	236932
CATCTC	351344	1140	5012	343193
CCACTT	318469	1063	4604	311355
CCCATA	300352	988	4623	293237
CCTGAA	304803	1258	4531	297370
CGAAAC	511091	2021	7284	498957
CGAATG	301571	2599	4442	292915
GACGTT	339191	1590	4954	330825
GATACA	424956	2086	6427	414444
GCAGAA	365673	1917	5047	356935
GGGATA	290287	1477	4408	283059
GGTGAA	194731	720	2546	190519
GTAGCT	225050	979	3322	219509
GTCTAT	512641	2021	7280	500764
GTTCAG	443673	1881	5969	433457
TAAGAC	292436	1419	4200	285196
TACCAG	336684	1291	5083	328714
TCAATC	415087	1360	5547	406048
TCCAAA	375762	1555	5878	366567
TCTGCT	342143	1586	5071	333805
TCTTAG	197464	1615	2716	192201
In [7]:
# Adapted from latlon_3.py (Chapter 10, PCfB)
# Read in each line of the process log, split it into 
# separate components, and write the per-barcode counts to a separate file

# Set the input file name
# (A full path is used, so this cell can be run from any directory)
#InFileName = './logs/16process.out'
InFileName = '/local/home/michelles/02-apcl-ddocent/18seq/logs/processP086.log'

# Open the input file for reading
InFile = open(InFileName, 'r')

# Initialize the counter used to keep track of line numbers
LineNumber = 0

# Open the output file for writing
# Do this *before* the loop, not inside it
OutFileName=InFileName + ".tsv"

OutFile=open(OutFileName,'w') # You can append instead with 'a'

# Loop through each line in the file
for Line in InFile:
        # Keep only lines 14-61 (counting from 0): the 48 per-barcode rows.
        # Everything before and after that range is skipped.
        if LineNumber < 62:
                if LineNumber > 13:
                        # Remove the line ending characters
                        Line=Line.strip('\n')
                        
                        # Separate the line into a list of its tab-delimited components
                        ElementList=Line.split('\t')
                        Barcode   = ElementList[0]
                        Total_Reads = ElementList[1]
                        Lack_RadTag = ElementList[2]  
                        Low_Quality = ElementList[3]
                        Retained = ElementList[4]
                        
                        # Use the % operator to generate a string 
                        # We can use this for output both to the screen and to a file
                        OutputString = "%s\t%s\t%s\t%s\t%s" % (ElementList[0], ElementList[1], ElementList[2], ElementList[3], ElementList[4])
                           
                        # Can still print to the screen then write to a file
                        print (OutputString)
                        
                        # Unlike print statements, .write needs a linefeed
                        OutFile.write(OutputString+"\n")  
                
        # Increment the counter used to keep track of line numbers
        LineNumber = LineNumber + 1
# After the loop is completed, close the files
InFile.close()
OutFile.close()
AAACAC	67500	480	832	65833
AAACGA	101310	583	1531	98588
AAAGTC	94273	484	1254	92038
AACGGT	295794	1726	4350	287960
AACTTC	338254	1047	4962	330313
TGCTCA	282192	914	3817	275864
AAGAAC	220814	1013	2905	215614
AATGTG	177739	1179	2395	173333
ACATGT	294444	1268	3859	287603
ACCAAA	190400	714	2666	186009
ACGATA	78723	448	1178	76598
ACGTTT	159099	1398	2283	154615
ACTAGG	147953	882	1986	144226
ACTCCA	247040	756	3365	241745
AGACTC	203438	650	2804	198862
AGCATT	108481	880	1538	105469
AGGAGA	409995	1552	5436	400613
AGTAAG	216719	921	2895	211657
AGTCCT	91775	989	1234	89045
AGTTAC	117727	329	1573	115182
ATAACC	104174	348	1418	101852
ATCGCA	302140	810	4231	295553
ATCTCG	271613	1066	3852	265375
ATGGAG	225614	981	2999	220182
ATGTCC	211692	768	2876	207178
CAATTG	299	44	14	239
CAGAGT	167160	2042	2426	161713
CATCAG	175319	1450	2272	170523
CATCTC	231471	855	3177	226411
CCACTT	186168	1216	2707	181297
CCCATA	349739	1353	4752	341689
CCTGAA	134806	689	1852	131515
CGAAAC	147974	654	2122	144286
CGAATG	271604	1410	3535	265210
GACGTT	219107	791	2944	213952
GATACA	117764	508	1569	115006
GCAGAA	173596	658	2148	169844
GGGATA	266573	1193	3564	260379
GGTGAA	205835	919	2726	200895
GTAGCT	103417	694	1414	100767
GTCTAT	351033	1588	4600	342898
GTTCAG	322141	1740	4075	314528
TAAGAC	102404	892	1596	99375
TACCAG	202046	1154	2904	196812
TCAATC	210216	1122	2856	205223
TCCAAA	158105	588	2357	154286
TCTGCT	50781	315	787	49378
TCTTAG	147931	866	2062	144266
In [8]:
# Adapted from latlon_3.py (Chapter 10, PCfB)
# Read in each line of the process log, split it into 
# separate components, and write the per-barcode counts to a separate file

# Set the input file name
# (A full path is used, so this cell can be run from any directory)
#InFileName = './logs/16process.out'
InFileName = '/local/home/michelles/02-apcl-ddocent/18seq/logs/processP087.log'

# Open the input file for reading
InFile = open(InFileName, 'r')

# Initialize the counter used to keep track of line numbers
LineNumber = 0

# Open the output file for writing
# Do this *before* the loop, not inside it
OutFileName=InFileName + ".tsv"

OutFile=open(OutFileName,'w') # You can append instead with 'a'

# Loop through each line in the file
for Line in InFile:
        # Keep only lines 14-61 (counting from 0): the 48 per-barcode rows.
        # Everything before and after that range is skipped.
        if LineNumber < 62:
                if LineNumber > 13:
                        # Remove the line ending characters
                        Line=Line.strip('\n')
                        
                        # Separate the line into a list of its tab-delimited components
                        ElementList=Line.split('\t')
                        Barcode   = ElementList[0]
                        Total_Reads = ElementList[1]
                        Lack_RadTag = ElementList[2]  
                        Low_Quality = ElementList[3]
                        Retained = ElementList[4]
                        
                        # Use the % operator to generate a string 
                        # We can use this for output both to the screen and to a file
                        OutputString = "%s\t%s\t%s\t%s\t%s" % (ElementList[0], ElementList[1], ElementList[2], ElementList[3], ElementList[4])
                           
                        # Can still print to the screen then write to a file
                        print (OutputString)
                        
                        # Unlike print statements, .write needs a linefeed
                        OutFile.write(OutputString+"\n")  
                
        # Increment the counter used to keep track of line numbers
        LineNumber = LineNumber + 1
# After the loop is completed, close the files
InFile.close()
OutFile.close()
AAACAC	190302	4852	3087	181384
AAACGA	162175	779	2637	158008
AAAGTC	152811	557	2153	149325
AACGGT	303513	1388	4942	295437
AACTTC	323025	1495	5229	314543
TGCTCA	579153	1997	8812	564999
AAGAAC	231566	1316	3525	225397
AATGTG	486478	4217	8083	471789
ACATGT	437502	2459	6876	426044
ACCAAA	506837	2133	8208	493714
ACGATA	742900	2938	11915	724097
ACGTTT	313593	3827	6021	302150
ACTAGG	378374	3473	6092	366789
ACTCCA	424307	1837	6839	413085
AGACTC	495279	1923	8148	482568
AGCATT	421086	2758	6709	408887
AGGAGA	392074	1560	5816	382316
AGTAAG	242603	1238	3746	236281
AGTCCT	239804	2258	3668	232625
AGTTAC	264139	1263	4251	257314
ATAACC	405359	1730	6311	394905
ATCGCA	240659	925	3749	234796
ATCTCG	498821	2857	8550	484865
ATGGAG	260688	3463	4523	251282
ATGTCC	337039	1808	5442	327983
CAATTG	156422	3082	2667	149931
CAGAGT	191687	2604	2990	184951
CATCAG	266339	2834	4342	257734
CATCTC	487292	2608	8034	474112
CCACTT	413108	1505	7116	402784
CCCATA	781551	2864	12515	761741
CCTGAA	444845	2741	6876	432787
CGAAAC	687481	4032	11164	668730
CGAATG	198403	2536	3351	191292
GACGTT	197673	1601	3379	191597
GATACA	149977	848	2347	145969
GCAGAA	277477	1181	4304	270382
GGGATA	420971	2176	6752	409570
GGTGAA	313520	2169	5003	304564
GTAGCT	122099	548	1828	119152
GTCTAT	283639	1709	4321	276060
GTTCAG	277394	1484	3967	270260
TAAGAC	601218	2719	8906	586707
TACCAG	265832	1651	4364	258425
TCAATC	546003	1915	7848	533008
TCCAAA	515924	1944	7916	503124
TCTGCT	472544	2034	7256	460467
TCTTAG	321653	1496	4875	313582
In [9]:
# Adapted from latlon_3.py (Chapter 10, PCfB)
# Read in each line of the process log, split it into 
# separate components, and write the per-barcode counts to a separate file

# Set the input file name
# (A full path is used, so this cell can be run from any directory)
#InFileName = './logs/16process.out'
InFileName = '/local/home/michelles/02-apcl-ddocent/18seq/logs/processP088.log'

# Open the input file for reading
InFile = open(InFileName, 'r')

# Initialize the counter used to keep track of line numbers
LineNumber = 0

# Open the output file for writing
# Do this *before* the loop, not inside it
OutFileName=InFileName + ".tsv"

OutFile=open(OutFileName,'w') # You can append instead with 'a'

# Loop through each line in the file
for Line in InFile:
        # Keep only lines 14-61 (counting from 0): the 48 per-barcode rows.
        # Everything before and after that range is skipped.
        if LineNumber < 62:
                if LineNumber > 13:
                        # Remove the line ending characters
                        Line=Line.strip('\n')
                        
                        # Separate the line into a list of its tab-delimited components
                        ElementList=Line.split('\t')
                        Barcode   = ElementList[0]
                        Total_Reads = ElementList[1]
                        Lack_RadTag = ElementList[2]  
                        Low_Quality = ElementList[3]
                        Retained = ElementList[4]
                        
                        # Use the % operator to generate a string 
                        # We can use this for output both to the screen and to a file
                        OutputString = "%s\t%s\t%s\t%s\t%s" % (ElementList[0], ElementList[1], ElementList[2], ElementList[3], ElementList[4])
                           
                        # Can still print to the screen then write to a file
                        print (OutputString)
                        
                        # Unlike print statements, .write needs a linefeed
                        OutFile.write(OutputString+"\n")  
                
        # Increment the counter used to keep track of line numbers
        LineNumber = LineNumber + 1
# After the loop is completed, close the files
InFile.close()
OutFile.close()
AAACAC	101429	3063	1398	96505
AAACGA	105575	465	1529	103032
AAAGTC	442092	2087	6428	431190
AACGGT	228041	1309	3320	222145
AACTTC	342331	1470	5210	334067
TGCTCA	279245	1395	3953	272535
AAGAAC	141989	929	2012	138294
AATGTG	167020	1280	2458	162488
ACATGT	448364	2574	6984	436488
ACCAAA	401860	1276	6099	392454
ACGATA	208146	1162	3217	202821
ACGTTT	311980	2483	4976	302957
ACTAGG	330333	2110	5081	321535
ACTCCA	438927	1409	6405	428998
AGACTC	322891	1089	4624	315416
AGCATT	388728	2192	6004	378534
AGGAGA	430482	1775	6179	420202
AGTAAG	417868	2866	6758	406039
AGTCCT	370447	2513	5600	360585
AGTTAC	313094	1820	4776	304896
ATAACC	165242	730	2515	161155
ATCGCA	277391	972	4175	270977
ATCTCG	86183	545	1262	84017
ATGGAG	65111	374	946	63484
ATGTCC	115700	291	1612	113268
CAATTG	101761	588	1396	99310
CAGAGT	101876	1237	1443	98643
CATCAG	352997	1642	4847	344524
CATCTC	656117	2118	9680	640911
CCACTT	360644	1280	5327	352047
CCCATA	497098	1460	7151	485691
CCTGAA	691	45	35	606
CGAAAC	232062	1131	3257	226082
CGAATG	143154	803	1971	139532
GACGTT	192197	966	2770	187446
GATACA	235463	798	3434	230024
GCAGAA	189929	761	2543	185632
GGGATA	192784	674	2809	188314
GGTGAA	182358	655	2464	178187
GTAGCT	176624	728	2439	172632
GTCTAT	347090	1446	4745	338944
GTTCAG	149682	583	2061	146225
TAAGAC	83104	399	1117	81162
TACCAG	60126	250	897	58690
TCAATC	221142	669	3019	216423
TCCAAA	231718	758	3214	226586
TCTGCT	381881	1269	5473	373330
TCTTAG	174390	628	2170	170667
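
The six cells above differ only in the pool number, so the same extraction could be written once and looped over the pools. The following is a minimal sketch, assuming every log keeps its per-barcode table on lines 14 through 61 (counting from 0) as in the cells above. The names `extract_barcode_table` and `extract_pools` are hypothetical and not part of the original script, and unlike the original this version silently keeps short lines instead of raising an error.

```python
from pathlib import Path


def extract_barcode_table(log_path, first_line=14, last_line=61):
    """Copy lines first_line..last_line (0-indexed, inclusive) to <log>.tsv.

    Only the first five tab-separated fields of each line are kept,
    matching the cells above. Returns the rows that were written.
    """
    log_path = Path(log_path)
    out_path = Path(str(log_path) + ".tsv")
    rows = []
    with open(log_path) as in_file, open(out_path, "w") as out_file:
        for line_number, line in enumerate(in_file):
            if first_line <= line_number <= last_line:
                fields = line.rstrip("\n").split("\t")[:5]
                out_file.write("\t".join(fields) + "\n")
                rows.append(fields)
    return rows


def extract_pools(log_dir, pools):
    """Run extract_barcode_table on processP<pool>.log for each log that exists."""
    results = {}
    for pool in pools:
        log_path = Path(log_dir) / f"processP{pool}.log"
        if log_path.exists():
            results[pool] = extract_barcode_table(log_path)
    return results


# Pool numbers and directory taken from the cells above
LOG_DIR = "/local/home/michelles/02-apcl-ddocent/18seq/logs"
POOLS = ["083", "084", "085", "086", "087", "088"]
tables = extract_pools(LOG_DIR, POOLS)
```

Using `with` closes both files even if a line fails to parse, which the explicit `close()` calls in the original cells do not guarantee.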

Move the newly created tsvs to the laptop with fetch and import them into the database
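
Before importing, it may help to look for barcodes that retained very few reads, such as the rows above with only a few hundred total reads. The following is a sketch, assuming the five columns are barcode, total reads, reads lacking a RAD tag, low-quality reads, and retained reads (the variable names used in the extraction cells). The 1,000-read cutoff and the function name `flag_low_read_barcodes` are illustrative assumptions, not values from the original workflow.

```python
import csv
import io


def flag_low_read_barcodes(tsv_file, min_retained=1000):
    """Return (barcode, retained, retained_fraction) for rows under min_retained.

    tsv_file is any open text file with the five columns written above:
    barcode, total reads, lacking RAD tag, low quality, retained.
    """
    flagged = []
    for row in csv.reader(tsv_file, delimiter="\t"):
        if len(row) < 5:
            continue  # skip blank or malformed lines
        barcode, total, retained = row[0], int(row[1]), int(row[4])
        fraction = retained / total if total else 0.0
        if retained < min_retained:
            flagged.append((barcode, retained, round(fraction, 3)))
    return flagged


# Three rows copied from the processP083 output above
sample = io.StringIO(
    "AGACTC\t288005\t1053\t3232\t282263\n"
    "AGCATT\t282\t22\t14\t241\n"
    "CCTGAA\t708\t7\t19\t681\n"
)
print(flag_low_read_barcodes(sample))
```

To check a real file, pass `open(path)` for one of the `.tsv` files instead of the in-memory sample.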

Rename the sample files. I used the script in seq_proc.Rmd to generate the bulk command-line text stored in SEQXX_all_rename.sh. This caused some problems: the copy and paste was off, and it ended up deleting the APCL-analysis folder. Because of this, I am changing the first cd to a full path.

In [ ]:
%%bash
cd /local/home/michelles/02-apcl-ddocent/18seq/
cd 01Pool/
sh rename.for.dDocent_se_gz ../logs/names_073.tsv
mv APCL* ../samples/
cd ../02Pool/
sh rename.for.dDocent_se_gz ../logs/names_074.tsv
mv APCL* ../samples/
cd ../03Pool/
sh rename.for.dDocent_se_gz ../logs/names_075.tsv
mv APCL* ../samples/
cd ../04Pool/
sh rename.for.dDocent_se_gz ../logs/names_076.tsv
mv APCL* ../samples/
cd ../05Pool/
sh rename.for.dDocent_se_gz ../logs/names_077.tsv
mv APCL* ../samples/
cd ../06Pool/
sh rename.for.dDocent_se_gz ../logs/names_078.tsv
mv APCL* ../samples/
cd ../07Pool/
sh rename.for.dDocent_se_gz ../logs/names_083.tsv
mv APCL* ../samples/
cd ../08Pool/
sh rename.for.dDocent_se_gz ../logs/names_084.tsv
mv APCL* ../samples/
cd ../09Pool/
sh rename.for.dDocent_se_gz ../logs/names_085.tsv
mv APCL* ../samples/
cd ../10Pool/
sh rename.for.dDocent_se_gz ../logs/names_086.tsv
mv APCL* ../samples/
cd ../11Pool/
sh rename.for.dDocent_se_gz ../logs/names_087.tsv
mv APCL* ../samples/
cd ../12Pool/
sh rename.for.dDocent_se_gz ../logs/names_088.tsv
mv APCL* ../samples/
cd ..
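
Because a stray relative `cd` already caused data loss once, one option is to build every path from a single absolute base so that no command depends on the current directory. The following is a minimal sketch, assuming the pool-to-names mapping shown in the cell above and assuming the rename script accepts an absolute path to the names file. `build_rename_plan` and `run_rename` are hypothetical helpers, and nothing is renamed or moved unless `dry_run=False` is passed.

```python
import shutil
import subprocess
from pathlib import Path

BASE = Path("/local/home/michelles/02-apcl-ddocent/18seq")

# Pools 01-06 use names_073-078 and pools 07-12 use names_083-088, as above
NAME_IDS = list(range(73, 79)) + list(range(83, 89))


def build_rename_plan(base=BASE):
    """Return (pool_dir, names_file) pairs built as absolute paths."""
    plan = []
    for pool_number, name_id in enumerate(NAME_IDS, start=1):
        pool_dir = base / f"{pool_number:02d}Pool"
        names_file = base / "logs" / f"names_{name_id:03d}.tsv"
        plan.append((pool_dir, names_file))
    return plan


def run_rename(plan, samples_dir=BASE / "samples", dry_run=True):
    """Print the plan, or run it when dry_run is False."""
    for pool_dir, names_file in plan:
        if dry_run:
            print(f"[{pool_dir}] sh rename.for.dDocent_se_gz {names_file}")
            continue
        # cwd= replaces the chain of relative cd commands
        subprocess.run(
            ["sh", "rename.for.dDocent_se_gz", str(names_file)],
            cwd=pool_dir,
            check=True,
        )
        # Equivalent of: mv APCL* ../samples/
        for renamed in pool_dir.glob("APCL*"):
            shutil.move(str(renamed), str(samples_dir / renamed.name))


run_rename(build_rename_plan())
```

With `check=True`, a failed rename stops the loop instead of carrying on to the next pool.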

Once all of the samples have been moved, move the gz files to the logs directory; then you can delete the pool directories.

In [15]:
%rm -r *Pool
P074.fastq.gz
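
Since `rm -r` cannot be undone, a quick check that no renamed sample files remain in the pool directories may be worthwhile before running it. The following is a sketch, assuming renamed samples start with `APCL` as in the `mv` commands above; `pools_with_leftovers` is a hypothetical helper name.

```python
from pathlib import Path


def pools_with_leftovers(base_dir, pattern="APCL*"):
    """Return {pool directory name: [leftover file names]} for pools with matches."""
    leftovers = {}
    for pool_dir in sorted(Path(base_dir).glob("*Pool")):
        if not pool_dir.is_dir():
            continue
        names = sorted(p.name for p in pool_dir.glob(pattern))
        if names:
            leftovers[pool_dir.name] = names
    return leftovers


remaining = pools_with_leftovers("/local/home/michelles/02-apcl-ddocent/18seq")
if remaining:
    print("Do not delete yet, samples still present:", remaining)
else:
    print("No APCL files found in any *Pool directory")
```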

Make a new directory for the current analysis

In [ ]:
%mkdir ~/02-apcl-ddocent/APCL_analysis/21-03seq

Copy reference.fasta over from jonsfiles

In [17]:
% cp /local/home/michelles/02-apcl-ddocent/jonsfiles/reference.fasta /local/home/michelles/02-apcl-ddocent/APCL_analysis/test_28_03_seq/

If you are running a small batch of samples instead of all samples ever collected, trim and map reads. This run started at 7:25am and finished at 9:45am on the same day.

For this phase, dDocent will ask questions.

  • Is the number of samples correct? yes
  • Maximum number of processors to use for this analysis: 30. This depends on how many people are trying to use amphiprion right now. The trim and map section isn't too intensive, so it should be OK to use a lot. I used 30 for SEQ18 (576 samples).
  • Maximum memory: 150. Again, consider who else is using the machine. I used 150 for SEQ18 (576 samples).
  • Quality trim? yes
  • Perform assembly? no (assembly is only for creating the reference originally)
  • Map reads? yes
  • Adjust default parameters? The default match score is 1, the default mismatch is 4, and the gap penalty is 6. I used the defaults for all.
  • Call SNPs? no
  • Enter an email address to be notified when it is done running
In [ ]:
dDocent