56 Cards in this Set

  • Front
  • Back

When was DNA sequencing discovered?

In 1977, by Frederick Sanger. That same year he sequenced the first complete DNA genome, that of the bacteriophage phi X 174.

Moore's law and implication to genomics

Computing capacity on a single chip doubles roughly every two years, so the amount of genetic data that can be processed keeps growing, which helps lower the cost of sequencing.

Sanger Sequencing vs. Next-Generation Sequencing (Illumina)

Sanger:
- Chain-termination method
- High accuracy
- Step-by-step (thermocycling and labeling reactions are done separately from the sequencing step)
- Strands are separated by size using capillary electrophoresis
- Requires a single-stranded DNA template, DNA polymerase, dNTPs, ddNTPs (these terminate strand elongation because they lack the 3'-OH needed to form the next phosphodiester bond), and a radioactive or fluorescent label

NGS (Illumina):
- Highly parallelized process that can handle large genomic analysis projects; also highly accurate
- Sanger may be used to validate long contiguous DNA sequencing reads (> 500 nucleotides)

Steps:
1. Library prep
2. Cluster generation
3. Sequencing
4. Alignment and analysis

What's the first component in the NGS process? Indicate the steps.

1. Library Preparation
- random fragmentation of the DNA/cDNA sample
- ligation of 5' and 3' adapters to the fragments
- PCR amplification of the adapter-ligated fragments
- gel purification

Alternatively, "tagmentation" combines fragmentation and ligation in a single step, increasing the efficiency of the library preparation process.

What's the second component in the NGS process? Indicate the steps.

2. Cluster Generation


- library (sequencing templates) loaded into a flow cell


- hybridization of fragments to complementary oligos (surface-bound)


- bridge (solid-phase) amplification of each fragment into distinct, clonal clusters of 1000 identical copies







What's the third component in the NGS process? Indicate the steps.

3. Sequencing


- detection of single bases as they are incorporated into DNA template strands


- dNTPs are added to the nucleic acid chain by DNA polymerase during DNA synthesis cycles


(Check HMG textbook for description)






Illumina Sequencing by Synthesis (SBS) technology
- proprietary reversible terminator-based method that detects single bases
- four fluorescently labeled nucleotides are used to sequence tens of millions of clusters on the flow cell surface in parallel

dNTP incorporation
- a single labeled deoxynucleoside triphosphate (dNTP) is added to the nucleic acid chain during each sequencing cycle
- the nucleotide label serves as a terminator for polymerization
- so after each dNTP incorporation, the fluorescent dye is imaged to identify the base and then enzymatically cleaved to allow incorporation of the next nucleotide

Base calling
- base calls are made directly from signal intensity measurements during each cycle
- this greatly reduces raw error rates compared to other technologies (i.e., calls are made after?)
- end result is highly accurate base-by-base sequencing

Incorporation bias
- minimized by natural competition among the four reversible terminator-bound dNTPs, which are all present in each cycle
- it is less likely that the wrong dNTP will bind

High accuracy and robust base calling
- eliminates sequence-context-specific errors
- enables robust base calling across the genome, including repetitive regions and homopolymers

What's the fourth component in the NGS process? Indicate the steps.

4. Data Analysis


- newly identified sequence reads are aligned to a reference genome (COMPUTATIONALLY)


- downstream analysis: SNPs, indel identification, read counting for RNA methods, phylogenetic or metagenomic analysis, and more.

pooling and sequencing large numbers of libraries simultaneously is referred to as?




What's it good for?

multiplexing




powerful for multi-sample sequencing studies

How do you identify different genomes in multiplexing?

DNA barcode sequences specific to each genome attached to each fragment (on both 5' and 3' ends)
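The barcode idea above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular sequencer software does it: it assumes the barcode sits at the 5' end of each read, has a fixed length, and must match exactly. The function name `demultiplex`, the barcode sequences, and the sample names are all hypothetical.

```python
from collections import defaultdict


def demultiplex(reads, barcodes, barcode_len):
    """Group reads by the barcode found at their 5' end.

    reads       -- iterable of read sequences (strings)
    barcodes    -- dict mapping barcode sequence -> sample name (hypothetical)
    barcode_len -- length of the barcode prefix (assumed fixed)
    """
    by_sample = defaultdict(list)
    for read in reads:
        tag, insert = read[:barcode_len], read[barcode_len:]
        # Reads whose prefix matches no known barcode are set aside.
        sample = barcodes.get(tag, "undetermined")
        by_sample[sample].append(insert)
    return dict(by_sample)


reads = ["ACGTTTGACC", "TTAGGGCATA", "ACGTCCCAAA", "GGGGAAAATT"]
barcodes = {"ACGT": "sample_A", "TTAG": "sample_B"}
result = demultiplex(reads, barcodes, barcode_len=4)
```

Real pipelines usually tolerate a mismatch or two in the barcode; exact matching is used here only to keep the sketch short.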

What's the difference between FASTA and FASTQ files?

FORMATS

FASTQ is made up of many sequence records, each represented by four lines: identifier, raw sequence call, separator ("+"), and an ASCII-encoded quality score for each call. Unlike FASTA, it carries per-base quality information.

FASTA begins with a description line, marked by ">", that contains an identifier code, followed by the raw sequence data with no spaces. By convention, .fna is used for nucleotides and .faa for amino acids.

USE

FASTQ is the standard for storing high-throughput sequencing output, so generally NGS short reads.

FASTA is generally used for assembled sequences (contigs), entire genomes, and reference genomes.

Where could you find this (below)? Name what each line (a,b,c,d) represents.




[a] @SEQ_ID
[b] GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
[c] +
[d] !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

In a FASTQ file. Every record in the file follows this format; a file typically contains millions of reads with lengths of 50-150 bp.




a. sequence identifier


b. raw sequence call


c. break


d. ASCII-encoded quality score for each call
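To make the four-line layout concrete, here is a small sketch that reads records like the one on this card and decodes line [d] into numeric quality scores. It assumes the common Phred+33 encoding (score = ASCII code minus 33); other offsets exist, so treat that as an assumption. The function name `parse_fastq` is made up for illustration.

```python
def parse_fastq(lines):
    """Yield (identifier, sequence, quality_scores) for each 4-line record.

    Assumes Phred+33 quality encoding, which is an assumption about the file.
    """
    lines = [ln.rstrip("\n") for ln in lines if ln.strip()]
    for i in range(0, len(lines) - 3, 4):
        header, seq, sep, qual = lines[i:i + 4]
        if not header.startswith("@") or not sep.startswith("+"):
            raise ValueError("not a well-formed FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        # One quality character per base call: ASCII code minus 33.
        scores = [ord(ch) - 33 for ch in qual]
        yield header[1:], seq, scores


record = [
    "@SEQ_ID",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT",
    "+",
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65",
]
ident, seq, quals = next(parse_fastq(record))
```

Under this encoding the first character "!" decodes to a score of 0, the lowest possible quality.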

Any sequence fragment that comes out of a sequencer is called...

read


k-mer

any sequence of length k
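As a quick illustration, every k-mer of a sequence can be listed by sliding a window of length k one base at a time. The helper name `kmers` is just for this sketch.

```python
from collections import Counter


def kmers(seq, k):
    """Return every substring of length k, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


three_mers = kmers("GATTACA", 3)
# Counting k-mers is the basis of compositional statistics.
two_mer_counts = Counter(kmers("ATATAT", 2))
```

A sequence of length n contains n - k + 1 k-mers, so "GATTACA" has five 3-mers.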

contig

gap-less assembled sequence

scaffold

ordering of contigs to approximate larger chromosomal sequence that may contain gaps

% genome coverage

percent of the full genome that is covered by assembled contigs (the human genome has an estimated 95% genome coverage)

average read depth

Also known as 'coverage': the sequencing coverage, calculated by summing the lengths of all sequence reads making up the assembly and dividing by the number of positions in the assembly.
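The two coverage measures on these cards can be written out as a short sketch. The function names and the half-open (start, end) interval convention are assumptions made for the example, not part of the cards.

```python
def average_read_depth(read_lengths, assembly_length):
    """Sum of read lengths divided by the number of assembly positions."""
    return sum(read_lengths) / assembly_length


def percent_genome_covered(intervals, genome_length):
    """Percent of positions covered by at least one interval.

    intervals are half-open (start, end) pairs; overlaps are counted once.
    """
    covered = 0
    current_end = 0
    for start, end in sorted(intervals):
        start = max(start, current_end)  # skip the part already counted
        if end > start:
            covered += end - start
            current_end = end
    return 100.0 * covered / genome_length


depth = average_read_depth([150] * 1000, 5000)
breadth = percent_genome_covered([(0, 100), (50, 150), (300, 400)], 500)
```

Here 1000 reads of 150 bp over a 5000-position assembly give a depth of 30X, while the three intervals cover half of a 500-position genome.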

de novo assembly

starts from the beginning; has no prior information and no reference genome to work with

reference-guided assembly

uses a closely related genome to guide the process, aligning reads and contigs to the reference assembly

De Bruijn Graph - concept behind it

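To find the Eulerian path (the genome sequence or contig) that traverses each edge of the graph exactly once. In the usual construction, edges are the k-mers taken from the reads and nodes are their (k-1)-mer overlaps (conventions vary). A node may be visited more than once, which is what happens at repeats; only the edges are restricted to a single visit.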




- used more for Illumina sequencing
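A toy version of the idea can be sketched as follows. It assumes the convention where each k-mer is an edge between its (k-1)-mer prefix and suffix, and it uses Hierholzer's algorithm to walk an Eulerian path; real assemblers add error correction and graph simplification that are not shown. All function names are hypothetical.

```python
from collections import defaultdict


def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def de_bruijn(kmer_list):
    """Each k-mer becomes an edge: (k-1)-mer prefix -> (k-1)-mer suffix."""
    graph = defaultdict(list)
    for kmer in kmer_list:
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes an Eulerian path exists."""
    if not graph:
        return []
    in_deg = defaultdict(int)
    for targets in graph.values():
        for t in targets:
            in_deg[t] += 1
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next((n for n in graph if len(graph[n]) - in_deg[n] == 1),
                 next(iter(graph)))
    edges = {n: list(v) for n, v in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if edges.get(node):
            stack.append(edges[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())         # dead end: add to the path
    return path[::-1]


def spell(path):
    """Glue the node labels back into one sequence."""
    if not path:
        return ""
    return path[0] + "".join(node[-1] for node in path[1:])


simple = spell(eulerian_path(de_bruijn(kmers("GATTACA", 3))))
repeat_input = kmers("ATGGCGTGCA", 3)
repeat = spell(eulerian_path(de_bruijn(repeat_input)))
```

In the second example the nodes TG and GC are each visited twice while every edge is still used once. More than one Eulerian path exists there, so the spelled sequence uses the same k-mers as the input but is not guaranteed to match it, which is a small-scale picture of why repeats make assembly ambiguous.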

N50

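the contig length L such that contigs of length L or longer together cover at least 50% of the total assembly length (sort contigs from longest to shortest and add them up; N50 is the length of the contig that crosses the halfway mark)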




- dependent on the genome in question


- cannot be compared between genomes (not a measure of quality between genome assemblies)


- can be a measure for quality of genome assemblers using the same reference, though.
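The definition translates directly into a few lines of code. This is a sketch with a made-up function name and made-up contig lengths.

```python
def n50(contig_lengths):
    """Length of the contig at which the running total, taken from the
    longest contig downward, first reaches half of the assembly length."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contigs given")


value = n50([80, 70, 50, 40, 30, 20])
```

For these lengths the total is 290, so the halfway mark is 145; the two longest contigs (80 + 70 = 150) cross it, giving an N50 of 70.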

What requirements must be met to be considered a high quality assembly?

- High % genome coverage (90%+)


- High sequencing depth (average read depth of 10X, 50X, ...)


- High N50 (# is dependent on genome being assembled)


- Low % gaps



What is said about the C-value enigma?

Genome size per haploid nucleus, represented by the C-value, does not correlate with organismal complexity. A larger C-value does not mean that an organism is more complex.

Amoeba dubia has a genome 200X that of the human genome, however the human genome is much more complex. Explain.

The complexity of a genome is not measured by its size; genome size largely reflects the amount of non-coding elements and repetitive regions.

Name examples of how complexity of a genome could be measured.

Examples: Alternative splicing, Multi-domain complexity (the greater the number of proteins [on the genome - binding sites], the greater the number of interactions), Gene regulation

What is a key observation in the prevalence of non-coding sequences?

Coding sequences account for less than 5% of the human genome (HG). That leaves about 95% non-coding DNA, of which repeats make up about half (hence roughly 47.5% of the entire HG is repetitive).

Name the top five non-coding RNAs in the draft human genome sequence.

tRNA, SSU (18S) rRNA, 5.8S rRNA, LSU (28S) rRNA, and 5S rRNA

Enhancers tend to share properties with CpG-poor messenger RNA promoters, but produce bidirectional, exosome-sensitive, relatively short unspliced RNAs, whose generation is strongly related to enhancer activity.

tell me

Which genome is the first repeat-rich genome to be sequenced?

The human genome, with ~45% made up of transposon-derived repeats (including SINEs and LINEs)

GC content tells you?

One of the most informative genome sequence properties. Genomes are often mosaics of high-GC and low-GC regions. High GC content correlates with gene-rich regions, and CpG islands are found near promoters. Genome-wide, CpG is the least frequent of all possible dinucleotides.

In vertebrate genomes, what happens to cytosine when it's followed by guanine?

Cytosine tends to be methylated (by a DNA methyltransferase), which makes the methylated cytosine susceptible to deamination to thymine.

Why is the CpG dinucleotide frequency in the human genome only about 1%, when the individual frequencies of C and G predict about 4%? (0.21 * 0.21 * 100% ≈ 4%)

An evolutionary phenomenon: CG dinucleotides occur less often than expected in vertebrate genomes because the C in a CpG is often methylated, and methylated C is readily deaminated to T, so CpG sites are lost over time.
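The arithmetic on this card can be checked with a short sketch. The expected frequency is simply the product of the two base frequencies; the observed/expected ratio below uses the commonly cited form (number of CpG x sequence length) / (number of C x number of G). The function name and the toy sequences are illustrative only.

```python
def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (#CG * N) / (#C * #G)."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cpg = seq.count("CG")
    if c == 0 or g == 0:
        return 0.0
    return cpg * n / (c * g)


# Expected CpG frequency if C and G each make up 21% of the bases.
expected_percent = 0.21 * 0.21 * 100

ratio_rich = cpg_obs_exp("ACGTCGAT")   # CpG over-represented
ratio_even = cpg_obs_exp("CCGG")       # CpG at the expected level
```

The expected value works out to about 4.4%, so an observed frequency near 1% corresponds to a ratio of roughly 0.2 to 0.25, the signature of CpG depletion.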

CpG correlates with _____ ?

gene density

GC variation is exhibited across ______.

chromosomes; gene density per Mb increases with the number of CpG islands per Mb on a chromosome.

From a study on CpG correlation to gene density, which chromosomes showed the greatest CpG island density together with high gene counts?

Chromosomes 16, 17, 22, and especially 19.

Kinds of transposable repetitive elements and their % proportion in the human genome.

LINEs (21%), SINEs (13%), Retrovirus-like elements (8%), and DNA transposon fossils (3%) = 45% of human genome made up of interspersed repeats.

What does high GC and CpG content correlate with?


Explain the technique of the sliding window (to analyze GC content).

In this example, there are three nested snippets of a 100-Mb region on chromosome 1, analyzed in non-overlapping windows of 20 000 bp (= 20 kb), 2000 bp (= 2 kb), and 200 bp (= 0.2 kb). At the smallest scale, the 1-Mb region with 200-bp windows shows gaps in the GC-content plot, meaning that the GC content drops to 0% in specific regions. In the 100-Mb snippet with 20-kb windows, however, the same area appears to have high GC content. Regions of high GC content are followed by regions of low to no GC content.
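The windowing itself is simple to sketch: cut the sequence into non-overlapping windows and report the percent of G and C in each. The function name and the toy sequence are invented for illustration; a trailing partial window is dropped, which is one possible choice among several.

```python
def gc_windows(seq, window):
    """Percent GC in consecutive non-overlapping windows.

    Returns a list of (window_start, percent_gc); a final partial
    window is ignored.
    """
    seq = seq.upper()
    results = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        results.append((start, 100.0 * gc / window))
    return results


toy = "GGCC" * 2 + "ATAT" * 2 + "GCAT" * 2   # 24 bases
fine = gc_windows(toy, 8)     # small windows
coarse = gc_windows(toy, 24)  # one large window
```

With 8-base windows the toy sequence shows 100%, 0%, and 50% GC, while a single 24-base window averages everything to 50%. Small windows expose local extremes that large windows smooth away.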

A sequence that is under negative (purifying) selection means

it is a functionally conserved region

A neutrally evolving sequence (constant background mutation rate) is said to be

unconserved and non-functional

How may novel genes arise?

As new genes have no homologs, they must arise through horizontal transfer or de novo evolution.

What's a method to compare or predict genes, pathways, and functions?

All-by-all BLAST: BLAST the genomes against each other to determine 1-to-1 top reciprocal matches. It is simple, but although the top BLAST hit is often the closest homolog, this isn't always true, so the method is not that accurate. It can reveal gene losses and gains.
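The reciprocal-match step can be sketched without running BLAST at all, assuming the top hit for every gene has already been extracted into two lookup tables. The gene names and the function name are hypothetical.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Pairs (a, b) where b is a's top hit and a is b's top hit.

    Both arguments are dicts of gene -> top hit in the other genome,
    assumed to have been built from BLAST output beforehand.
    """
    return sorted(
        (a, b) for a, b in best_a_to_b.items() if best_b_to_a.get(b) == a
    )


a_to_b = {"geneA1": "geneB1", "geneA2": "geneB2", "geneA3": "geneB2"}
b_to_a = {"geneB1": "geneA1", "geneB2": "geneA2", "geneB3": "geneA1"}
pairs = reciprocal_best_hits(a_to_b, b_to_a)
```

Genes left without a reciprocal partner (geneA3 and geneB3 here) are the candidates for gene gains, losses, or paralogs that a closer homology analysis would need to sort out.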

What's a way to visualize differences in gene sets between organisms?

Venn diagrams or phylogenetic trees.

In addition to orthologs, the visualization method for comparing gene sets also allows one to compare _____. ______ analysis is more accurate. Use the _______

Paralogs; homology; Ensembl Compara DB tool

Can reveal enhancer elements and implies functional importance.

comparing genome sequences

How do you measure positive selection in noncoding elements?

look at noncoding sequences that are conserved across mammals but show dramatic human-specific mutations

How do you measure positive selection in coding elements?

1) Ka/Ks - the ratio of non-synonymous (Ka) to synonymous (Ks) substitution rates in protein-coding sequences, 2) the nature and frequency of allelic diversity within a population
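How the ratio is usually read can be shown with a tiny sketch. It assumes Ka and Ks have already been estimated by some other tool; the function names and the `tolerance` band around 1 are arbitrary choices for illustration, not a standard threshold.

```python
def ka_ks_ratio(ka, ks):
    """Non-synonymous rate divided by synonymous rate."""
    if ks <= 0:
        raise ValueError("Ks must be positive to form the ratio")
    return ka / ks


def interpret(ratio, tolerance=0.1):
    """Rough reading of a Ka/Ks value (tolerance is an assumed cutoff)."""
    if ratio > 1 + tolerance:
        return "positive selection"
    if ratio < 1 - tolerance:
        return "purifying (negative) selection"
    return "approximately neutral"


conserved = interpret(ka_ks_ratio(0.02, 0.10))
adaptive = interpret(ka_ks_ratio(0.30, 0.10))
neutral = interpret(1.0)
```

This ties back to the earlier cards: a ratio well below 1 points to a functionally conserved sequence, and a ratio near 1 to neutral evolution.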

Tell me how you can find human-accelerated regions in the human genome.

look for sequences that are conserved (functionally constrained) across mammals but have accumulated an unusually high number of changes in the human genome

Method differentiation for human-specific gene-loss and human specific gene-gain

Gene loss: look for conserved mammalian sequences that are deleted in human. Gene gain: compare CNVs in other species with human CNVs. The gene-loss analysis showed loss of an enhancer near a brain-specific tumor suppressor, and the gene-gain analysis showed expansions related to brain development in humans.

absolute conservation, >200 bp, long noncoding sequences in distantly related spp., and more conserved than protein coding

ultraconserved elements

long (100s-1000s), 3'-UTRs (evidence for local RNA structure), > one half are exons, gene deserts (maybe distal enhancers)

highly conserved elements

metagenomics goal

characterize spp composition and functional capabilities, identify novel spp and genes, predict microbial interactions

taxonomic classification in metagenomics (binning) can be done via...

similarity-based binning (nearest BLAST hit, or the LCA of the top k BLAST hits) and compositional binning (look at the raw data for GC content and k-mer content, for example; useful when there is no reference)
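The "LCA of the top k hits" idea can be sketched as finding the deepest rank shared by all hit lineages. This assumes each hit has already been mapped to a root-to-leaf lineage list; the function name is made up and the lineages below are simplified examples.

```python
def lowest_common_ancestor(lineages):
    """Longest shared prefix of root-to-leaf lineage lists."""
    shared = []
    for ranks in zip(*lineages):
        if len(set(ranks)) == 1:
            shared.append(ranks[0])
        else:
            break
    return shared


# Simplified lineages for the top hits of one read (illustrative only).
top_hits = [
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
     "Enterobacterales", "Escherichia"],
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
     "Enterobacterales", "Salmonella"],
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
     "Pseudomonadales", "Pseudomonas"],
]
lca = lowest_common_ancestor(top_hits)
```

The read is assigned to the last shared rank, here the class level, which is more conservative than trusting the single nearest hit.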

functional annotation in metagenomics is different in that

it's not - you can use the same technique as you do for single genomes, i.e. homology-based annotation with BLAST against a protein DB and transfer of the GO terms (to find function); you can also BLAST against the KEGG DB to find pathways. Then scan against a conserved domain DB (hmmscan) to determine conservation.


Comparative metagenomics would make use of this genome assembly method...

multiplexing - raw reads can be de-multiplexed to get separate subsets of reads too.

Two visual clustering methods in metagenomics

Hierarchical clustering and Principal Coordinates Analysis (PCoA)