36 Cards in this Set


Explain data fragmentation and why it is useful in distributed database design? Describe different types of fragments.*

Storing database relations at different sites; fragments need not be complete relations, as a relation can be divided into smaller logical units; this speeds up data retrieval.

Horizontal: subsets of tuples (RESTRICT)

Vertical: subsets of attributes (PROJECT)

Explain data replication and why it is useful in distributed database design?*

Storing replicated data at different sites

1. Lose one machine and you don’t lose data

2. Reads can be directed to different machines to balance load to get higher read rates (improves availability)

3. A drawback: writes require more network transfers.

Discuss some examples of problems that have to be dealt with in a distributed environment that do not occur in a centralised environment.*

1. Expensive in operational cost (more machines to run, more onsite presence at each location)

2. Security risk (backup complexity based on hierarchical structure, and monitoring)

3. Each site must manage own access and backups (keep data relevant, avoid redundancy)

4. Database design is more complex (must determine which sites hold which fragments/replicas)

What are a “training set” and a “test set”, and how can we use them to avoid trusting hallucinations?*

Training Set- used to look for patterns and predictive relationships

Test Set- used to check whether the patterns we found are really there, by assessing the strength and utility of the predictive relationships on unseen data

Cross-validation- define a data set (called the validation set) to test the model during the training phase, to limit problems like overfitting; gives insight into how the model will generalize to an independent data set.
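A minimal sketch of the split described above, in plain Python; the function name and the toy data set are mine, not from the cards:

```python
import random

def train_test_split(data, test_fraction=0.25, seed=0):
    """Shuffle the data and split it into a training set and a test set."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical data set of 100 cases: train on 75, hold back 25 for testing.
cases = list(range(100))
train, test = train_test_split(cases)
```

Patterns are mined on `train` only; `test` is touched once, at the end, to check whether the patterns still hold.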

Explain the purpose of recording each of the two types of time that are represented within a bitemporal relation.*

Valid time- ‘real world’ time during which an object exists or when some event takes place; historical

Transaction time- ‘DBMS System’ time at which an event was entered into temporal database; log of how/when data about the object was entered into the database; rollbacks

Both can be represented as having a starting time and ending time.

Construct a small example that demonstrates the use of a bitemporal relation to make an update to a previously recorded tuple that is discovered to have contained an error. Indicate how evidence of the existence of the previously recorded tuple is still maintained, after the tuple is corrected.*


Three suppliers exist: Smith, Jones and Blake. The following table shows changes to the database from Monday to Wednesday. On Monday we add information about all suppliers. On Tuesday Smith creates a new transaction which triggers the end of the first transaction. On Wednesday we find out Blake is in New York, not Paris, so we update. We believed Blake was in Paris from 9 am Monday to 2:45 Wednesday, but now we believe at 2:46 Wednesday that Blake is in New York. So as we can see, new changes are given a new row.
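The correction step can be sketched in Python; the tuple layout and the `correct` helper are my own illustration, assuming each row carries a valid-time interval and a transaction-time interval, with the string "now" marking an open-ended interval:

```python
# A bitemporal row: (supplier, city, valid_from, valid_to, tx_from, tx_to).
rows = [
    ("Blake", "Paris", "Mon 09:00", "now", "Mon 09:00", "now"),
]

def correct(rows, supplier, new_city, tx_time):
    """Record a correction: close the erroneous row's transaction time
    (keeping it as evidence of what we used to believe) and append a
    new row holding the corrected belief."""
    out = []
    for (s, city, vf, vt, tf, tt) in rows:
        if s == supplier and tt == "now":
            out.append((s, city, vf, vt, tf, tx_time))  # close, don't delete
        else:
            out.append((s, city, vf, vt, tf, tt))
    out.append((supplier, new_city, "Mon 09:00", "now", tx_time, "now"))
    return out

# Wednesday 2:46 pm: we discover Blake is in New York, not Paris.
rows = correct(rows, "Blake", "New York", "Wed 14:46")
```

The Paris row survives with its transaction time ended at Wed 14:46, so the database still shows what was believed, and when.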

What is clustering?*

Dividing a collection of data into groups called clusters in such a way that similar things are in the same cluster

What are two reasons for using clustering?*

To make comparisons between clustered items and find patterns in data

Data-driven: uses unsupervised learning (takes the data given and picks out patterns, because it does not know the ideal solution)

Briefly describe the k-means algorithm.*

1. Find k vectors in feature space, called cluster centers

2. Assign each data pattern to the nearest cluster center in a way that minimises the sum of the squared distances.

3. Move the cluster centers to the geometric center of the data patterns.

4. Repeat until the objective function is smaller than some tolerance, or the cluster centers do not move.
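The four steps above can be sketched in plain Python; this is a teaching sketch on 2-D points (the toy data and function are my own, not from the lecture material):

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Plain k-means on 2-D points, following the four steps above."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)              # 1. pick k cluster centers
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # 2. assign each data pattern to its nearest cluster center
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centers[c][0]) ** 2
                                + (p[1] - centers[c][1]) ** 2)
            clusters[i].append(p)
        # 3. move each center to the geometric center of its patterns
        new_centers = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:               # 4. stop when centers settle
            break
        centers = new_centers
    return centers, clusters

# Two obvious groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(points, 2)
```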

The k-means algorithm is linear in the amount of data it is given, but may take many iterations to converge. How would you apply it to an extremely large set of cases?*

For large sets of cases, we expect a small number of clusters.

We make an approximation: Take a random sample of S things, cluster the S things, and assign everything else to the nearest cluster center.
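A sketch of that approximation, assuming any clustering routine that returns a list of centers (the clustering step below is stubbed out with fixed centers purely for illustration):

```python
import random

def approximate_clusters(points, sample_size, cluster_sample, seed=0):
    """Cluster a random sample of sample_size points, then assign every
    point to the nearest of the sample's cluster centers. The expensive
    clustering step now depends only on sample_size, not on len(points)."""
    rng = random.Random(seed)
    sample = rng.sample(points, sample_size)
    centers = cluster_sample(sample)
    labels = [
        min(range(len(centers)),
            key=lambda i: (p[0] - centers[i][0]) ** 2
                        + (p[1] - centers[i][1]) ** 2)
        for p in points
    ]
    return centers, labels

# Toy stand-in for the real clustering step: two fixed centers.
centers, labels = approximate_clusters(
    [(0, 0), (1, 1), (10, 10), (11, 11)], 2,
    cluster_sample=lambda s: [(0, 0), (10, 10)])
```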

What is data mining? Compare and contrast it with statistics.*

Looking for interesting patterns in large amounts of data with a computer.

Data mining- finds a function that maps inputs into outputs without assumptions of how the world works; objective to find something new; results are researcher dependent

Statistics- tries to model the world using stochastic processes; once a model is found you can extract more samples; results verified by software

Both turn data into information and learn from the data

What are the possible problems of data mining?**

Failing to notice a pattern:

The first graph of American football data plots offensive plays against dates. The only distinctions we could find were that OffensivePlays is a discrete variable with integer values and that offensive plays happen about once a minute; the second graph reveals a rough symmetry that could not be seen in the first graph, as well as a peak of offensive plays.

Seeing a pattern that isn't there:

The curve turns over in about the 1880s in the population graph: this could be due to the introduction of reliable contraception, but that doesn't explain why the number of live births is still high in the 2000s; it may instead have been due to the switch from counting white births to counting everyone's births; the Maori population was substantial and not growing fast.

Seeing a pattern that is misleading:

The number of live births is the product of the birth rate and the number of women. Since about 1885 New Zealand's birth rate has been steadily falling (except in baby boom). So there was a population increase, largely from immigration, so more live births occurred, but there wasn't an increase in the birth rate.

What is a distributed key-value store?*

A key-value store with information stored on many machines.

It is possible that no machine has all of the data, or data is held by more than one machine (replication).

What are the basic operations provided by a key-value store?*

PUT (update), GET (read), DELETE (delete)
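The three operations map directly onto a dictionary; a minimal in-memory sketch (class and method names are mine):

```python
class KeyValueStore:
    """Minimal in-memory key-value store with the three basic operations."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        """PUT: insert a new value, or update an existing one."""
        self._data[key] = value

    def get(self, key):
        """GET: read the value for a key (None if absent)."""
        return self._data.get(key)

    def delete(self, key):
        """DELETE: remove the key, if present."""
        self._data.pop(key, None)

store = KeyValueStore()
store.put("user:1", "Amy")
```

A distributed store offers the same interface, but the dictionary is spread over many machines.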

What is “eventual consistency”?*

Storage system guarantees that if we stop changing a key/object, eventually everyone will see the same value

Domain name system- updates distributed and eventually all clients will see the update

What does the CAP theorem say about consistency, availability, and partitionability?*

Consistency- everyone sees the same data at the same time

Availability- as long as one copy of the information is available we can keep running the database; if some nodes fail the others can keep working (read/write)

Partitionability- the system can survive part of the network being disconnected for a time.

We can pick any two, but it’s not possible to have all three.

Give an example to show why we may want "availability" and "partitionability".*

Twitter vs. PayPal: all is well if a message is lost or sent twice through Twitter, but if money is lost it is more problematic (O'Keefe hotel story).

It is possible to use a conventional relational database system to create relations that include attributes that record temporal properties. How will the resulting system differ from a real temporal database system?*

Create columns to record temporal properties; however, it may still be possible to rewrite or delete those attribute columns/data rows.

Can be difficult to make constraints where:

1. start values cannot change

2. end values can only change once (from now/until change to a proper time value)

3. only end values can exist if there is a start value.

A process (such as a trigger) will need to be created to enforce such constraints automatically when the attributes are updated.
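The three constraints above can be sketched as a validation check; this is my own illustration in Python (a real system would enforce it with triggers), with the string "now" marking an interval that has not yet closed:

```python
def check_temporal_update(old, new):
    """Validate an update of a (start, end) pair under the constraints above."""
    old_start, old_end = old
    new_start, new_end = new
    if new_start != old_start:
        return False          # 1. start values cannot change
    if old_end != "now" and new_end != old_end:
        return False          # 2. end values can only change once
    if new_end is not None and new_start is None:
        return False          # 3. an end value requires a start value
    return True
```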

Explain how an object-relational database differs from a conventional relational database management system, including some examples of their use.

A hybrid concept that retains the relational structure but allows a domain to hold objects as values; cell values are no longer atomic and the relation is in N1NF (non-first normal form).

Example: an Employee table that consists of IDNum, Person, and Salary. Person holds the object type 'person_id_type', which consists of 'first name', 'last name' and 'date of birth'; the insert must supply idnum, all attributes in person as a person_id_type, and salary:

INSERT INTO emp_obj_table VALUES (1, person_id_type('Amy', 'Heinrich', TO_DATE('14/05/1992', 'dd/mm/yyyy')), 55555);

Briefly explain what “semi-structured data” is and why we want to support it. Why might we not?*

Semi-structured data- some of the information is in the structure and some is in the text

Support- allows one to self-describe a data structure

Not support- semi-structured data needs a deep understanding of XML to interpret; tags are meaningless symbols to a person unfamiliar with the domain

Explain text nodes and elements, and what these things correspond to in the abstract data structure.*

Text nodes- represents sequences of plain characters, characters given by number, characters given by name, or quoted strings.

Elements- can have attributes of their own or contain other elements; an element has a name, an unordered collection of attribute=value pairs, and an ordered sequence of children


Example: in <year>2005</year>, year is an element and '2005' is a text node.

Explain attributes and what these things correspond to in the abstract data structure.*

Provide additional information about elements

Explain ID and IDREF, and what these things correspond to in the abstract data structure.*

ID- value is a unique id; e.g. employee_id

IDREF- value is the id of another element; manager_id is idref

Example: employees with Amy being employee with manager Stephen who is also an employee

What kinds of integrity constraints might we want to have in an XML database?*

Document Type Definitions (DTDs): specify a regular expression for the contents of each element, and for each attribute its type (ID/IDREF/CDATA) and whether it is #REQUIRED, #IMPLIED, or has a default value.

XSchemas: XML documents that use XML syntax and can be generated/processed by XML toolkits; use programming languages that can implement data structures like Java; specify more detail ("this string is a date"); specify foreign key constraints between the documents in a collection of documents.

Briefly describe how you could represent an unchanging document in a relational database so that you could express “descendant” and “following in document” in SQL without recursion. Ignore attributes.*

Descendants- give each element two numbers from a pre-order traversal: the smallest number in its subtree (its own number) and the largest number in its subtree. Then x is a descendant of y exactly when y's smallest number < x's smallest number ≤ y's largest number, and x follows y in the document when x's smallest number > y's largest number; both are simple comparisons SQL can do without recursion.
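A sketch of that numbering scheme in Python (the tuple representation of elements is my own; attributes are ignored, as the question allows):

```python
def number_tree(node, counter=None):
    """Give each element (name, children) the pair (low, high):
    low  = its own pre-order number (the smallest in its subtree),
    high = the largest pre-order number in its subtree.
    Returns (name, low, high, numbered_children)."""
    if counter is None:
        counter = [0]
    counter[0] += 1
    low = counter[0]
    numbered_children = [number_tree(c, counter) for c in node[1]]
    return (node[0], low, counter[0], numbered_children)

def is_descendant(a, b):
    """True if element a is a proper descendant of element b.
    In SQL: WHERE b.low < a.low AND a.low <= b.high -- no recursion."""
    return b[1] < a[1] <= b[2]

# A tiny unchanging document: book > (title, chapter > section).
doc = ("book", [("title", []), ("chapter", [("section", [])])])
root = number_tree(doc)
```

Storing (element, low, high) rows in a relation makes descendant and following-in-document queries plain range comparisons.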

What is integrity and why do we want it?*

Integrity is the accuracy and consistency of data stored in a database.

"If a datatype is integer, it should not be referenced as a string"

We want integrity to prevent inappropriate use of the database (adding incorrect data, corrupting data).

What is a chronon? What is meant by 'an event which happens at an instant will exist for one chronon.'*

An event that occurs quickly in real time will be rounded to the smallest unit of time in the application, aka the chronon.

Depending on the granularity, it can be easy/difficult to distinguish the valid time taken between two events that are one chronon apart; two events might land in the same chronon, but not literally at the same time (cooked breakfast and dinner on Tuesday, but there's 10 hours difference between the two events).
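A small sketch of the breakfast/dinner point, assuming instants are measured in seconds from some origin (the helper name is mine):

```python
DAY = 24 * 60 * 60  # a chronon of one day, in seconds

def chronon_of(t_seconds, chronon=DAY):
    """Map an instant to the number of the chronon it falls in."""
    return t_seconds // chronon

# Breakfast at 08:00 and dinner at 18:00 fall in the same day-granularity
# chronon, even though the events are 10 hours apart.
breakfast = chronon_of(8 * 3600)
dinner = chronon_of(18 * 3600)
```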

Explain the term post-relational data models and write a brief comment on reasons why there is now a great interest in them.*

Post-relational data models extend the relational models with non-relational features such as scalable data stores and Web-based data stores.

1. Often de-normalized, schema-free, with a notion of document storage

2. Horizontal scaling for key-value based storage (apply hash function to key to determine which server to connect to); easy to spread data

3. Replication is already enabled

4. Easy to program APIs (create own semantics using key-value stores: request to get data, request to put stuff into the data)

Describe the main features of cloud computing, including some examples of their use.*

Storing data offsite by shifting computing resources from physical to virtual, usually third party or outside locations; allows for scalability of infrastructure as the business grows; easy access from anywhere and cheaper, but gives up security, privacy, and data ownership

SaaS: subscribe to an established software service (Google Docs)

PaaS: provides a framework for building applications (Google App Engine)

IaaS: highly flexible and configurable raw infrastructure (Amazon Elastic Compute Cloud)

After the Christchurch earthquake many companies lost offices and systems and had to rebuild their infrastructure.

Why did we need XQuery for XML databases, if we already had XPath? Hint: in SQL’s (SELECT ... FROM ... WHERE ... ORDER BY ...) which parts can’t be done in XPath?

XQuery is an elaborate extension of XPath inspired by SQL SELECT statements.

XPath can only extract existing fragments of documents, but XQuery can build new XML (point to/extract/reorder/build parts of documents into new documents).

Cannot do ORDER BY in XPath

Define ‘sharding’ and explain its relevance to replication.*

Sharding: partition the key space so that a record goes to a machine determined by its key; reads for different data can go to different machines; writes are simpler.

The choice between them depends on your problem: do you have high write rates or high read rates, and are you more concerned about throughput or availability?
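A minimal sketch of key-based partitioning; the function name is mine, and a stable hash is used because Python's built-in `hash()` varies between runs:

```python
import hashlib

def shard_for(key, n_machines):
    """Hash the key to decide which machine holds the record."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_machines

# Every client computes the same shard for the same key,
# so reads and writes for a key all go to one machine.
machines = [shard_for(k, 4) for k in ("user:1", "user:2", "user:3")]
```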

Explain 'granularity' in a temporal database. Why is the choice of granularity important and why might different applications need different granularities? Give some examples.*

Granularity- the smallest unit of time within the application that cannot be further divided

'What day were you born’=birth dates are recorded to granularity of days.

'What time is your COSC430 lecture’=lecture times recorded to granularity of hours.

Briefly explain the term distributed database. What are the potential advantages of a distributed database?

Distributed database- collection of multiple, logically interrelated databases distributed over a computer network.

1. To the user a distributed system looks exactly like a non-distributed system due to data independence.

2. There is no reliance on a central site, so if a site goes down all of the data is not lost.

3. Improved performance because data is located near the site of greatest demand

4. Database systems are parallelized, allowing load on the databases to be balanced among servers.

Explain how row-oriented and column-oriented database are similar and how they are different. Consider both logical and physical structure. What differences, if any, would you expect between the query languages for row-oriented and column-oriented databases?*

Row: best for many writes; good when you want to see many columns in few rows

Column: best for many reads; good when you want to see many rows in few columns

Tables are stored as separate columns, which may be split by value and stored in separate shards; keeping a column together allows compression (high transfer efficiency)
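The two physical layouts can be contrasted on a toy table (the Orders data below is hypothetical):

```python
# Hypothetical rows of an Orders table.
rows = [
    {"id": 1, "city": "Dunedin",  "total": 40},
    {"id": 2, "city": "Auckland", "total": 55},
    {"id": 3, "city": "Dunedin",  "total": 40},
]

# Row store: each record is kept together -- good for fetching whole orders.
row_store = [tuple(r.values()) for r in rows]

# Column store: each attribute is kept together -- good for scanning one
# attribute across many rows, and runs of repeated values compress well.
col_store = {name: [r[name] for r in rows]
             for name in ("id", "city", "total")}

# An aggregate like SUM(total) touches only one column in the column store.
total = sum(col_store["total"])
```

Logically both layouts hold the same relation, which is why the query language (SQL or close to it) can stay much the same; only the physical access pattern differs.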

Give an example of an application that would be processed more efficiently by a column-oriented database than a row-oriented one and say why.*

Data mining.

Give an example of an application that would be better served by a row-oriented database and say why.

Online transaction processing- needs to see few rows with many columns; finding a specific order retrieves a whole row, and all of its columns are needed.