18 Cards in this Set

Levenshtein distance
a metric for measuring the amount of difference between two sequences (i.e. an edit distance)
Example of Levenshtein distance
For example, the Levenshtein distance between "kitten" and "sitting" is 3, since the following three edits change one into the other, and there is no way to do it with fewer than three edits:
kitten → sitten (substitution of 'k' with 's')
sitten → sittin (substitution of 'e' with 'i')
sittin → sitting (insert 'g' at the end).
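A minimal Python sketch of this edit-distance computation, using the standard dynamic programming recurrence (the function name and the unit cost per edit are illustrative choices, not part of the card):

    def levenshtein(a: str, b: str) -> int:
        """Edit distance between a and b via dynamic programming."""
        m, n = len(a), len(b)
        # dist[i][j] = edit distance between a[:i] and b[:j]
        dist = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            dist[i][0] = i              # delete all of a[:i]
        for j in range(n + 1):
            dist[0][j] = j              # insert all of b[:j]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                                 dist[i][j - 1] + 1,         # insertion
                                 dist[i - 1][j - 1] + cost)  # substitution or match
        return dist[m][n]

    print(levenshtein("kitten", "sitting"))  # 3
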
Limitations of Hamming distance
Hamming distance only allows substitution (and hence, only applies to strings of the same length).
Hamming distance
Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different. Put another way, it measures the minimum number of substitutions required to change one string into the other, or the number of errors that transformed one string into the other.
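A short Python sketch matching this definition; the example strings "karolin" / "kathrin" are a common illustration, not from the card:

    def hamming(a: str, b: str) -> int:
        """Number of positions at which two equal-length strings differ."""
        if len(a) != len(b):
            raise ValueError("Hamming distance requires strings of equal length")
        return sum(x != y for x, y in zip(a, b))

    print(hamming("karolin", "kathrin"))  # 3
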
What are the time and space requirements for the Needleman-Wunsch algorithm to find a global alignment of two strings?
len(str1) = n
len(str2) = m

Both space and time are O(n x m).

To analyze the time complexity of the Needleman-Wunsch algorithm, we can analyze each individual part of the algorithm. To initialize the matrix, we need to fill row 0 and column 0 with the scores -j*d and -i*d respectively (in our example, these are both 0). This has a time complexity of O(M+N).

The next step is filling in the matrix with all the scores, F(i,j). For each cell of the matrix, three neighboring cells must be compared, which is a constant time operation. Thus, to fill the entire matrix, the time complexity is the number of entries, or O(MN).

Finally, the traceback requires a number of steps. The first step is marking the cells according to the rules above. We can move a maximum of N rows and M columns, and thus the complexity of this is O(M+N). The second step is finding the final path, which involves jumping between cells of matching residues. Since this step can include a maximum of N cells (where we assume N > M), this step is O(N).

Thus, the overall time complexity of this algorithm is O(M+N) + O(MN) + O(M+N) + O(N) = O(MN).

Since this algorithm fills a single matrix of size MN and stores at most N positions for the traceback, the total space complexity of this algorithm is O(MN) + O(N) = O(MN).
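A hedged Python sketch of the matrix-filling step described above, with an assumed linear gap penalty d and simple match/mismatch scores (the parameter values and test strings are illustrative); the two nested loops are where the O(n x m) time and space come from:

    def needleman_wunsch_score(x: str, y: str, match=1, mismatch=-1, d=2):
        """Fill the Needleman-Wunsch matrix F; O(n*m) time and space."""
        n, m = len(x), len(y)
        F = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):            # column 0: -i*d
            F[i][0] = -i * d
        for j in range(1, m + 1):            # row 0: -j*d
            F[0][j] = -j * d
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                s = match if x[i - 1] == y[j - 1] else mismatch
                F[i][j] = max(F[i - 1][j - 1] + s,   # align x[i-1] with y[j-1]
                              F[i - 1][j] - d,       # gap in y
                              F[i][j - 1] - d)       # gap in x
        return F[n][m]                       # optimal global alignment score

    print(needleman_wunsch_score("GATTACA", "GCATGCU"))
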
Algorithm
an effective method expressed as a finite list[1] of well-defined instructions[2] for calculating a function[3]. Algorithms are used for calculation, data processing, and automated reasoning.
Starting from an initial state and initial input (perhaps null)[4], the instructions describe a computation that, when executed, will proceed through a finite [5] number of well-defined successive states, eventually producing "output"[6] and terminating at a final ending state. The transition from one state to the next is not necessarily deterministic; some algorithms, known as randomized algorithms, incorporate random input[7].
Sequence Motif
nucleotide or amino-acid sequence pattern that is widespread and has, or is conjectured to have, a biological significance. For proteins, a sequence motif is distinguished from a structural motif, a motif formed by the three-dimensional arrangement of amino acids, which may not be adjacent.

When a sequence motif appears in the exon of a gene, it may encode the "structural motif" of a protein; that is, a stereotypical element of the overall structure of the protein. Nevertheless, motifs need not be associated with a distinctive secondary structure. "Noncoding" sequences are not translated into proteins, and nucleic acids with such motifs need not deviate from the typical shape (e.g. the "B-form" DNA double helix).
Outside of gene exons, there exist regulatory sequence motifs and motifs within the "junk," such as satellite DNA. Some of these are believed to affect the shape of nucleic acids (see for example RNA self-splicing), but this is only sometimes the case. For example, many DNA binding proteins that have affinities for specific motifs only bind DNA in its double-helical form. They are able to recognize motifs through contact with the double helix's major or minor groove.
Short coding motifs, which appear to lack secondary structure, include those that label proteins for delivery to particular parts of a cell, or mark them for phosphorylation.
Within a sequence or database of sequences, researchers search and find motifs using computer-based techniques of sequence analysis, such as BLAST. Such techniques belong to the discipline of bioinformatics.

In molecular biology and bioinformatics, consensus sequence refers to the most common nucleotide or amino acid at a particular position after multiple sequences are aligned. A consensus sequence is a way of representing the results of a multiple sequence alignment, where related sequences are compared to each other, and similar functional sequence motifs are found. The consensus sequence shows which residues are most abundant in the alignment at each position.
Developing software for pattern recognition is a major topic in genetics, molecular biology, and bioinformatics. Specific sequence motifs can function as regulatory sequences controlling biosynthesis, or as signal sequences that direct a molecule to a specific site within the cell or regulate its maturation. Since the regulatory function of these sequences is important, they are thought to be conserved across long periods of evolution. In some cases, evolutionary relatedness can be estimated by the amount of conservation of these sites.
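As a rough illustration of the consensus idea above, this Python sketch picks the most common residue in each column of a toy alignment (the sequences are invented for demonstration):

    from collections import Counter

    def consensus(aligned_seqs):
        """Most common residue per alignment column (ties broken arbitrarily)."""
        columns = zip(*aligned_seqs)     # assumes equal-length, pre-aligned sequences
        return "".join(Counter(col).most_common(1)[0][0] for col in columns)

    print(consensus(["ATGCA", "ATGGA", "ATCCA", "TTGCA"]))  # "ATGCA"
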
Sequence Motif (Lecture)
A microarray experiment showed that when gene X is knocked out, 20 other genes are not expressed

How can one gene have such drastic effects?

Gene X encodes a regulatory protein, a.k.a. a transcription factor (TF)

The 20 unexpressed genes rely on gene X’s TF to induce transcription

A single TF may regulate multiple genes

Every gene contains a regulatory region (RR) typically stretching 100-1000 bp upstream of the transcriptional start site

Located within the RR are the Transcription Factor Binding Sites (TFBS), also known as motifs, specific for a given transcription factor

TFs influence gene expression by binding to a specific location in the respective gene’s regulatory region - TFBS

Genes are turned on or off by regulatory proteins

These proteins bind to upstream regulatory regions of genes to either attract or block an RNA polymerase

Regulatory protein (TF) binds to a short DNA sequence called a motif (TFBS)

So finding the same motif in multiple genes’ regulatory regions suggests a regulatory relationship amongst those genes
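A toy Python sketch of that last point: a brute-force, exact-match search for a k-mer shared by every regulatory region (the sequences and the planted motif TATAAT are illustrative assumptions; real motif finders allow mismatches and use statistical models):

    def shared_motifs(regions, k):
        """Return all k-mers that occur in every regulatory region (exact match)."""
        def kmers(s):
            return {s[i:i + k] for i in range(len(s) - k + 1)}
        common = kmers(regions[0])
        for r in regions[1:]:
            common &= kmers(r)
        return common

    # Toy upstream regions sharing one planted motif
    regions = ["GCGTATAATCCA", "TTATAATGGCAG", "CAGGTATAATTC"]
    print(shared_motifs(regions, 6))  # {'TATAAT'}
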
Dynamic programming
a method for solving complex problems by breaking them down into simpler subproblems. It is applicable to problems exhibiting the properties of overlapping subproblems which are only slightly smaller[1] and optimal substructure (described below). When applicable, the method takes far less time than naïve methods.
The key idea behind dynamic programming is quite simple. In general, to solve a given problem, we need to solve different parts of the problem (subproblems), then combine the solutions of the subproblems to reach an overall solution. Often, many of these subproblems are really the same. The dynamic programming approach seeks to solve each subproblem only once, thus reducing the number of computations. This is especially useful when the number of repeating subproblems is exponentially large.
Top-down dynamic programming simply means storing the results of certain calculations, which are later used again since the completed calculation is a sub-problem of a larger calculation. Bottom-up dynamic programming instead solves the smallest subproblems first and builds up their answers iteratively until the full problem is solved.
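A small Python sketch contrasting the two styles on the classic Fibonacci example (chosen only for illustration; it is not from the card):

    from functools import lru_cache

    # Top-down: recurse and store (memoize) each subproblem's result for reuse.
    @lru_cache(maxsize=None)
    def fib_top_down(n: int) -> int:
        if n < 2:
            return n
        return fib_top_down(n - 1) + fib_top_down(n - 2)

    # Bottom-up: solve the smallest subproblems first and build upward iteratively.
    def fib_bottom_up(n: int) -> int:
        if n < 2:
            return n
        prev, curr = 0, 1
        for _ in range(2, n + 1):
            prev, curr = curr, prev + curr
        return curr

    print(fib_top_down(30), fib_bottom_up(30))  # both print 832040
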
Hidden Markov Model (HMM)
Hidden Markov Models are a popular machine learning approach in bioinformatics. Machine learning algorithms are presented with training data, which are used to derive important insights about the (often hidden) parameters. Once an algorithm has been suitably trained, it can apply these insights to the analysis of a test sample. As the amount of training data increases, the accuracy of the machine learning algorithm typically increases as well. The parameters that are learned during training represent knowledge; application of the algorithm with those parameters to new data (not used in the training phase) represents the algorithm's use of that knowledge. The Hidden Markov Model (HMM) approach learns some unknown probabilistic parameters from training samples and uses these parameters in the framework of dynamic programming (and other algorithmic techniques) to find the best explanation for the experimental data.
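As a concrete but simplified illustration of using learned HMM parameters inside a dynamic program, here is a minimal Viterbi sketch in Python; the two-state fair/biased coin model and all of its probabilities are invented for demonstration:

    import math

    def viterbi(obs, states, start_p, trans_p, emit_p):
        """Most likely hidden-state path for an observation sequence (log-space DP)."""
        V = [{s: (math.log(start_p[s]) + math.log(emit_p[s][obs[0]]), None) for s in states}]
        for t in range(1, len(obs)):
            V.append({})
            for s in states:
                prev = max(states, key=lambda p: V[t - 1][p][0] + math.log(trans_p[p][s]))
                score = (V[t - 1][prev][0] + math.log(trans_p[prev][s])
                         + math.log(emit_p[s][obs[t]]))
                V[t][s] = (score, prev)
        # Trace back the best path from the highest-scoring final state
        state = max(states, key=lambda s: V[-1][s][0])
        path = [state]
        for t in range(len(obs) - 1, 0, -1):
            state = V[t][state][1]
            path.append(state)
        return list(reversed(path))

    states = ("Fair", "Biased")
    start_p = {"Fair": 0.5, "Biased": 0.5}
    trans_p = {"Fair": {"Fair": 0.9, "Biased": 0.1}, "Biased": {"Fair": 0.1, "Biased": 0.9}}
    emit_p = {"Fair": {"H": 0.5, "T": 0.5}, "Biased": {"H": 0.75, "T": 0.25}}
    print(viterbi("HHHHTHHHHH", states, start_p, trans_p, emit_p))
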
Markov chain
a mathematical system that transits from one state to another (out of a finite or countable number of possible states) in a chainlike manner. It is a random process endowed with the Markov property: that the next state depends only on the current state and not on the past.

A simple example can be pictured as a directed graph of state transitions, with states representing whether the economy is in a bull market, a bear market, or a recession during a given week. In this example, a bull week is followed by another bull week 90% of the time, a bear market 7.5% of the time, and a recession the other 2.5% of the time. From these transition probabilities it is possible to calculate, for example, the long-term fraction of time during which the economy is in a recession, or on average how long it will take to go from a recession to a bull market.
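A small Python/NumPy sketch of this example. Only the bull-market row of the transition matrix is stated above, so the bear and recession rows below are illustrative assumptions; repeatedly applying the matrix approximates the long-term fraction of time spent in each state:

    import numpy as np

    # States: 0 = bull, 1 = bear, 2 = recession (rows: current state, columns: next state)
    P = np.array([[0.900, 0.075, 0.025],   # bull row, as given above
                  [0.150, 0.800, 0.050],   # bear row (assumed for illustration)
                  [0.250, 0.250, 0.500]])  # recession row (assumed for illustration)

    pi = np.array([1.0, 0.0, 0.0])         # start in a bull market
    for _ in range(1000):                  # iterate until the distribution settles
        pi = pi @ P
    print(pi)                              # approximate long-run bull/bear/recession fractions
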
Heuristic
refers to experience-based techniques for problem solving, learning, and discovery. Heuristic methods are used to speed up the process of finding a good enough solution, where an exhaustive search is impractical. Examples of this method include using a "rule of thumb", an educated guess, an intuitive judgment, or common sense.
In more precise terms, heuristics are strategies using readily accessible, though loosely applicable, information to control problem solving in human beings and machines.[1]
Translation
After transcription comes splicing, and then:
translation is the third stage of protein biosynthesis (part of the overall process of gene expression). In translation, messenger RNA (mRNA) produced by transcription is decoded by the ribosome to produce a specific amino acid chain, or polypeptide, that will later fold into an active protein.

ACG => amino acid (threonine)
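A toy Python sketch of decoding codons into amino acids; the codon table below is deliberately partial (only a few codons, including ACG => Thr), not the full genetic code:

    # Partial codon table, for illustration only
    CODON_TABLE = {"AUG": "Met", "ACG": "Thr", "GGC": "Gly", "UAA": "Stop"}

    def translate(mrna: str) -> list:
        """Read an mRNA in triplets and map each codon to an amino acid."""
        peptide = []
        for i in range(0, len(mrna) - 2, 3):
            aa = CODON_TABLE.get(mrna[i:i + 3], "?")
            if aa == "Stop":
                break
            peptide.append(aa)
        return peptide

    print(translate("AUGACGGGCUAA"))  # ['Met', 'Thr', 'Gly']
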
Transcription
the process of creating a complementary RNA copy of a sequence of DNA.[1] Both RNA and DNA are nucleic acids, which use base pairs of nucleotides as a complementary language that can be converted back and forth from DNA to RNA by the action of the correct enzymes. During transcription, a DNA sequence is read by RNA polymerase, which produces a complementary, antiparallel RNA strand. As opposed to DNA replication, transcription results in an RNA complement that includes uracil (U) in all instances where thymine (T) would have occurred in a DNA complement.

ATG => AUG
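A one-line Python sketch of the T => U convention shown above; it treats the input as the DNA coding strand, so the mRNA has the same sequence with U in place of T (the example string is illustrative, and its output can feed the translation sketch earlier):

    def transcribe(coding_strand: str) -> str:
        """mRNA = DNA coding strand with U substituted for T."""
        return coding_strand.replace("T", "U")

    print(transcribe("ATGACGGGCTAA"))  # AUGACGGGCUAA
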
Difference between prokaryotes and eukaryotes
There are two types of cells: those that encapsulate their DNA in a nucleus and those that do not. The former are referred to as eukaryotic cells and the latter are prokaryotic cells. All multicellular organisms (like flies or humans) are eukaryotic, while most unicellular organisms (like bacteria) are prokaryotic. For our purposes, the major difference between prokaryotes and eukaryotes is that prokaryotic genes are continuous strings, while they are broken into pieces (called exons) in eukaryotes. Human genes may be broken into as many as 50 exons, separated by seemingly meaningless pieces called introns.
intron / exon
Prokaryotic genes are continuous strings, while they are broken into pieces (called exons) in eukaryotes. Human genes may be broken into as many as 50 exons, separated by seemingly meaningless pieces called introns.
Greedy algorithm
Many algorithms make a sequence of choices
Sorting – which pair should we exchange?
Traveling Salesman – which city should we visit next?
Greedy algorithms make a locally optimal choice in the hope that it will lead to a globally optimal solution.
That is, they look ahead only one move.
Sometimes a greedy algorithm is optimal, e.g., for the minimum spanning tree problem.
It is rarely optimal on complex problems.
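A small Python sketch of the "locally optimal choice" idea using greedy coin change (the coin systems are illustrative); the last call shows a case where the greedy choice is not globally optimal:

    def greedy_change(amount: int, coins=(25, 10, 5, 1)):
        """Repeatedly take the largest coin that fits (the locally optimal choice)."""
        result = []
        for c in coins:                  # coins assumed sorted largest-first
            while amount >= c:
                amount -= c
                result.append(c)
        return result

    print(greedy_change(63))             # [25, 25, 10, 1, 1, 1] -- optimal here
    # With coins (4, 3, 1) and amount 6, greedy gives [4, 1, 1] but [3, 3] is better.
    print(greedy_change(6, (4, 3, 1)))   # locally optimal, globally suboptimal
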
Two criteria for a problem to be solved using dynamic programming
To get an efficient solution requires two things:
Optimal substructure: an optimal solution can be obtained by combining optimal solutions to subproblems.
Overlapping subproblems: the same subproblems recur, so there are not too many distinct subproblems.

Dynamic programming problems have the following key property that allows them to be solved in a single pass:
(*) There is an ordering on the subproblems, and a relation that shows how to solve a subproblem given the answers to "smaller" subproblems, that is, subproblems that appear earlier in the ordering.
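A short Python sketch of property (*): in the longest increasing subsequence problem, the subproblems are ordered by index i, and each answer L[i] depends only on the "smaller" subproblems L[j] with j < i (the example sequence is invented):

    def lis_length(seq):
        """L[i] = length of the longest increasing subsequence ending at seq[i]."""
        n = len(seq)
        L = [1] * n
        for i in range(n):
            for j in range(i):           # all "smaller" subproblems, already solved
                if seq[j] < seq[i]:
                    L[i] = max(L[i], L[j] + 1)
        return max(L, default=0)

    print(lis_length([5, 2, 8, 6, 3, 6, 9, 7]))  # 4 (e.g. 2, 3, 6, 9)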