Distributed filesystems |
Filesystems that store files on multiple machines |
|
HDFS |
Hadoop's main distributed filesystem, designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware |
|
Areas where HDFS is not a good fit |
Low-latency data access, lots of small files, multiple writers & arbitrary file modifications |
|
HDFS default block size |
64 MB |
|
Unit of abstraction for HDFS |
Blocks, not the files themselves |
|
namenode (master node) |
Manages the filesystem namespace: maintains the filesystem tree and the metadata for all files in the tree, persisted on local disk as the namespace image and edit log. Also knows which datanodes each block is stored on, though block locations are not persisted - they are reconstructed from datanode block reports at startup |
|
datanode (worker/slave node) |
store and retrieve blocks of data when they are told by the namenode or client, report back to namenode periodically with the blocks they are storing |
|
Secondary namenode |
runs on a separate physical machine and periodically merges the namespace image with the edit log to keep the edit log from growing too large; it keeps a copy of the merged namespace image, which can be used (with some lag) in the event of namenode failure |
|
HDFS Federation |
A multiple namenode system where each namenode manages: a namespace volume and a block pool |
|
namespace volume |
a subset of the files in a filesystem; namespace volumes are independent of one another |
|
block pool |
contains all the blocks for the files in a namespace. Block pool storage is not partitioned: datanodes register with every namenode in the cluster and may store blocks from multiple block pools |
|
HDFS High Availability |
A pair of namenodes in an active-standby configuration; if the active namenode fails, the standby takes over its duties without a significant interruption of service |
|
Failover controller |
manages the transition from an active namenode to a standby, typically uses Zookeeper |
|
Fencing |
the process of ensuring that a previously active namenode that was judged to have failed (but may in fact still be running) cannot do damage to the system; carried out by the High-Availability system |
|
Methods of fencing |
killing the namenode's process, revoking access, STONITH |
|
hadoop fs |
the command-line prefix for Hadoop's filesystem shell commands (fs stands for filesystem) |
|
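A few common invocations of the filesystem shell (the paths shown are illustrative):

    hadoop fs -mkdir /user/tom/data              # create a directory
    hadoop fs -copyFromLocal input.txt /user/tom/data/input.txt
    hadoop fs -ls /user/tom/data                 # list a directory
    hadoop fs -cat /user/tom/data/input.txt      # print a file
|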
Path object |
represents a file or directory in a Hadoop filesystem |
|
Configuration object |
stores the client or server's configuration, which is set using configuration files read from the classpath |
|
FileStatus class |
allows you to view metadata about files |
|
globbing |
using wildcards to match multiple files with one expression, often used with globStatus function |
|
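A minimal Java sketch tying the last few cards together (the glob pattern is illustrative); it lists the path and length of every file matched by a glob:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class GlobDemo {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads config resources from the classpath
        FileSystem fs = FileSystem.get(conf);       // the configured filesystem (e.g. HDFS)
        FileStatus[] statuses = fs.globStatus(new Path("/logs/2013-*"));  // globbing
        for (FileStatus status : statuses) {
          // FileStatus exposes per-file metadata
          System.out.println(status.getPath() + "\t" + status.getLen());
        }
      }
    }
|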
Steps of a File Read |
1. Client opens the file by calling open() on the FileSystem object, an instance of DistributedFileSystem
2. DistributedFileSystem calls the namenode using RPC to get the locations of the first few blocks in the file
3. DFSInputStream connects to the closest datanode for the first block in the file
4. Data is streamed from the datanode back to the client, which calls read() repeatedly on the stream
5. This continues transparently to the client as the stream moves through the blocks
6. When the client has finished reading, it calls close() on the FSDataInputStream |
|
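A minimal read sketch following these steps (essentially the classic FileSystemCat pattern; the URI is supplied as an argument):

    import java.io.InputStream;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class FileSystemCat {
      public static void main(String[] args) throws Exception {
        String uri = args[0];  // e.g. hdfs://namenode/user/tom/file.txt
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        InputStream in = null;
        try {
          in = fs.open(new Path(uri));                     // step 1: open()
          IOUtils.copyBytes(in, System.out, 4096, false);  // steps 3-5: read() on the stream
        } finally {
          IOUtils.closeStream(in);                         // step 6: close()
        }
      }
    }
|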
Steps of a File Write |
1. Client calls create() on DistributedFileSystem
2. DistributedFileSystem makes an RPC call to the namenode to create a new file in the filesystem's namespace, with no blocks associated with it
3. As the client writes data, DFSOutputStream splits it into packets and writes them to the data queue; the DataStreamer asks the namenode to allocate blocks and streams the packets to the first datanode in a pipeline
4. Each datanode in the pipeline stores each packet and forwards it to the next datanode
5. DFSOutputStream maintains an ack queue of packets waiting to be acknowledged by the datanodes
6. When finished, the client calls close(), which flushes the remaining packets and signals the namenode that the file is complete |
|
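A minimal write sketch, copying a local file into HDFS (the arguments are illustrative):

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class FileCopyToHdfs {
      public static void main(String[] args) throws Exception {
        String localSrc = args[0];  // local file to copy
        String dst = args[1];       // e.g. hdfs://namenode/user/tom/file.txt
        InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(dst), conf);
        OutputStream out = fs.create(new Path(dst));  // step 1: create()
        IOUtils.copyBytes(in, out, 4096, true);       // packets flow through the data queue behind the scenes
      }
    }
|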
data queue |
an internal queue of packets, populated as the client writes data; DFSOutputStream splits the data into packets before placing them on the queue, and the DataStreamer consumes them |
|
ack queue |
an internal queue maintained by DFSOutputStream of packets waiting to be acknowledged by the datanodes in the pipeline |
|
Hadoop's default replica strategy |
Three replicas: 1) on the same node as the client (or a random node, if the client runs outside the cluster), 2) on a node in a different rack, chosen at random, 3) on a different node in the same rack as the second |
|
Coherency model for a filesystem |
describes the data visibility of reads and writes for a file |
|
Distcp |
Copies large amounts of data to and from Hadoop filesystems in parallel. Implemented as a MapReduce job |
|
Hadoop Archive (HAR) files |
pack files into HDFS blocks more efficiently, reducing namenode memory usage while still allowing access to files |
|
How does Hadoop detect errors in data read/write? |
Using checksums; datanodes are responsible for verification. On a write, the last datanode in the pipeline verifies the checksum; on a read, the client verifies the data against the stored checksums |
|
Types of compression compatible with Hadoop |
DEFLATE, gzip, bzip2, LZO, Snappy |
|
Hadoop's codec implementation |
the CompressionCodec interface |
|
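A compression sketch using the interface (it mirrors the common stream-compressor pattern): it gzips standard input to standard output.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionOutputStream;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.util.ReflectionUtils;

    public class StreamCompressor {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // instantiate a codec through the CompressionCodec interface
        CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);
        CompressionOutputStream out = codec.createOutputStream(System.out);
        IOUtils.copyBytes(System.in, out, 4096, false);
        out.finish();  // finish the compressed stream without closing stdout
      }
    }
|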
Serialization |
the process of turning structured objects into a byte stream for transmission over a network or for writing to persistent storage |
|
Deserialization |
byte stream -> series of structured objects |
|
Remote Procedure Calls (RPC) |
* Handles interprocess communication between nodes in the system
* Uses serialization to render message into binary stream to be sent to remote node |
|
Hadoop's serialization format |
Writables, which have two methods: write, and readFields |
|
Hadoop equivalents to Long, Double, Int, String |
LongWritable, DoubleWritable, IntWritable, Text |
|
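A small round-trip sketch showing both Writable methods in action:

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import org.apache.hadoop.io.IntWritable;

    public class WritableRoundTrip {
      public static void main(String[] args) throws Exception {
        IntWritable writable = new IntWritable(163);

        // serialize: write() renders the value to a binary stream
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writable.write(new DataOutputStream(out));

        // deserialize: readFields() repopulates a fresh instance from the bytes
        IntWritable copy = new IntWritable();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(out.toByteArray())));
        System.out.println(copy.get());  // prints 163
      }
    }
|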
Apache Avro |
a language-neutral, IDL-based data serialization framework; its schemas are usually written in JSON |
|
Interface description language (IDL) |
a definition of types in a language-neutral, declarative fashion |
|
Sequence Files |
Provides a persistent data structure for binary key-value pairs. Temporary map outputs are stored as Sequence Files. |
|
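A write sketch (the output path and record contents are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SequenceFileWriteDemo {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("numbers.seq");  // illustrative output path
        SequenceFile.Writer writer = null;
        try {
          writer = SequenceFile.createWriter(fs, conf, path, IntWritable.class, Text.class);
          IntWritable key = new IntWritable();
          Text value = new Text();
          for (int i = 0; i < 100; i++) {
            key.set(i);
            value.set("record-" + i);
            writer.append(key, value);  // append binary key-value pairs
          }
        } finally {
          if (writer != null) writer.close();
        }
      }
    }
|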
Sync Points |
used with Sequence Files, a point in the stream that can be used to resynchronize with a record boundary if the reader is lost |
|
Format of a Sequence File |
1) The first 3 bytes are the bytes SEQ, a magic number
2) 1 byte for the version number
3) The remainder of the header, which contains fields such as the key/value class names, compression details, and the sync marker
4) The records (which may be compressed) |
|
Map File |
a sorted Sequence File with an index to permit lookups by key |
|
the Configuration class |
a collection of configuration properties and their values; Configurations read their properties from resources (XML files) on the classpath |
|
Example of a resource XML file |
<?xml version="1.0"?>
<configuration>
  <property>
    <name>color</name>
    <value>yellow</value>
    <description>Color</description>
  </property>
  <property>
    <name>size-weight</name>
    <value>${size},${weight}</value>
    <description>Size and weight</description>
  </property>
</configuration> |
|
|
How can MapReduce components be unit tested? |
With Mockito's mock() method, passing the class of the type we want to mock (such as a mapper's Context) |
|
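A sketch of the pattern (MaxTemperatureMapper is a hypothetical mapper under test; the input record and expected output are illustrative):

    import static org.mockito.Mockito.mock;
    import static org.mockito.Mockito.verify;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.junit.Test;

    public class MaxTemperatureMapperTest {
      @Test
      public void writesExpectedKeyValuePair() throws Exception {
        Text value = new Text("a raw weather record for 1950");  // illustrative input line
        // mock the mapper's Context so we can verify what the mapper emits
        MaxTemperatureMapper.Context context = mock(MaxTemperatureMapper.Context.class);
        new MaxTemperatureMapper().map(null, value, context);
        verify(context).write(new Text("1950"), new IntWritable(-11));  // expected output
      }
    }
|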
Optimal # of reducers |
Slightly fewer than the number of reduce slots in the cluster |
|
HPROF |
a profiling tool that comes with the JDK that gives information about a program's CPU and heap usage |
|
What are a few things to check when tuning a MapReduce job? |
# mappers, # reducers, combiners, intermediate compression, custom serialization, shuffle tweaks |
|
JobControl class |
represents a graph of MapReduce jobs to be run; you add the job configurations and tell the JobControl instance about the dependencies between jobs |
|
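A linear-dependency sketch, assuming jobA and jobB are already fully configured Job instances:

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
    import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

    public class WorkflowRunner {
      public static void runPair(Job jobA, Job jobB) throws Exception {
        ControlledJob a = new ControlledJob(jobA.getConfiguration());
        ControlledJob b = new ControlledJob(jobB.getConfiguration());
        b.addDependingJob(a);             // b will only run once a succeeds

        JobControl control = new JobControl("two-step-workflow");
        control.addJob(a);
        control.addJob(b);

        new Thread(control).start();      // JobControl implements Runnable
        while (!control.allFinished()) {  // poll until the whole graph completes
          Thread.sleep(1000);
        }
        control.stop();
      }
    }
|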
Apache Oozie |
designed to manage the execution of thousands of dependent workflows, each of which may be composed of thousands of actions at the level of an individual MapReduce job; composed of a workflow engine and a coordinator engine |
|
workflow engine |
part of Oozie, stores and runs workflows composed of Hadoop jobs |
|
coordinator engine |
part of Oozie, runs workflow jobs based on predefined schedules and data availability |
|
workflow |
a group of action nodes and control-flow nodes |
|
action nodes |
perform workflow tasks, such as running a MapReduce job |
|
control-flow nodes |
govern the workflow execution between actions by allowing conditional logic, parallel execution, and other constructs |
|
jobtracker |
part of MapReduce 1, coordinates the job run, its main class is JobTracker |
|
tasktrackers |
part of MapReduce 1, run the tasks that the job has been split into, its main class is TaskTracker |
|
YARN |
Yet Another Resource Negotiator, aka MapReduce 2 |
|
resource manager |
part of YARN, manages the use of resources across the cluster |
|
application master |
manages the lifecycle of applications running on the cluster |
|
node managers |
launch and monitor the compute containers on machines in the cluster |
|
3 failure modes to consider in Classic MapReduce |
failure of the running task, the tasktracker, and the jobtracker |
|
How does a tasktracker get blacklisted by a jobtracker? What happens? |
If more than 4 tasks from the same job fail on a tasktracker, the jobtracker records this as a fault. If the tasktracker exceeds a threshold number of faults, it is blacklisted: it will not be assigned tasks, but will continue to communicate with the jobtracker. Faults expire over time, so a tasktracker can make it off the blacklist |
|
What can be done to mitigate a jobtracker failure? |
There is no mechanism for dealing with this type of failure, the job simply fails and must be restarted |
|
4 failure modes to consider in YARN |
failure of the task, the application master, the node manager, and the resource manager |
|
What occurs during an application master failure? |
The resource manager detects the failure and starts a new instance of the master running in a new container; the new instance can recover the state of the tasks that were already run |
|
What occurs during a node manager failure? |
it will be removed from the resource manager's pool of available nodes, can also be blacklisted |
|
What occurs during a resource manager failure? |
The most serious YARN failure, neither jobs nor task containers can be launched. The resource manager uses checkpoints to save its state. A new resource manager instance must be brought up, and can recover from the saved state |
|
Fair Scheduler |
a type of Hadoop scheduler, it aims to give every user a fair share of the cluster capacity over time |
|
Capacity Scheduler |
a type of Hadoop scheduler, splits the cluster into queues, which may be hierarchical, and each queue has an allocated capacity. Within each queue jobs are scheduled using FIFO |
|
Shuffle |
the process by which the system sorts map output keys and transfers the map outputs to reducers as inputs |
|
copy phase |
the phase in which the reduce task copies map outputs as soon as they become available; it doesn't wait until all map tasks are complete |
|
sort phase |
part of the reduce task; once all map outputs have been copied, they are merged to feed the reduce function. More properly called the merge phase, since the sorting was carried out on the map side |
|
speculative execution |
Hadoop tries to detect when a task is running more slowly than expected and launches another, equivalent task as a backup; a speculative task is launched only after all the tasks for the job have been launched |
|
Output committer |
a commit protocol used by Hadoop MapReduce to ensure jobs and tasks either succeed or fail cleanly, implemented by the Java class OutputCommitter |
|
Hadoop skipping mode |
forces tasks to report the records being processed back to the tasktracker; when a task fails, the tasktracker retries it and skips the record(s) that caused the failure |
|
General form of MapReduce |
Map: (K1, V1) -> list(K2, V2)
Combiner (if present): (K2, list(V2)) -> list(K2, V2)
Reduce: (K2, list(V2)) -> list(K3, V3)
**Note: Often, K2=K3 and V2=V3, but not always |
|
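A minimal word-count sketch showing these signatures concretely (class and field names are illustrative):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // map: (K1=LongWritable, V1=Text) -> list(K2=Text, V2=IntWritable)
    public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
          if (!token.isEmpty()) {
            word.set(token);
            context.write(word, ONE);
          }
        }
      }
    }

    // reduce: (K2=Text, list(V2=IntWritable)) -> list(K3=Text, V3=IntWritable); here K2=K3
    class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      @Override
      protected void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
          sum += v.get();
        }
        context.write(key, new IntWritable(sum));
      }
    }
|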
What is the default input format for MapReduce? |
TextInputFormat, which produces keys of type LongWritable and values of type Text |
|
What is the default output format for MapReduce? |
TextOutputFormat, which writes out records one per line, by converting keys/values to strings and tab-separating them |
|
input split |
chunk of input that is processed by a single map, divided into records, which are key-value pairs |
|
Which Java class is useful for processing small files? |
CombineFileInputFormat, which packs many files into each split so that each mapper has more to process; it takes node/rack locality into consideration when packing files together |
|
Which method can be overridden to prevent file splitting? |
isSplitable() on FileInputFormat |
|
NLineInputFormat |
an input format in which each mapper receives a fixed number of lines of input; the keys and values are the same as for TextInputFormat (byte offset, line contents) |
|
What are the types of counters in Hadoop? |
Task counters, Job counters, User-defined Java counters, Dynamic counters |
|
How can a counter be defined in Java? |
using a Java enum |
|
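A sketch of an enum-defined counter inside a mapper (the enum and class names are illustrative):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class RecordQualityMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
      enum Quality { MALFORMED }  // the enum declares the counter group/name

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        if (value.getLength() == 0) {
          context.getCounter(Quality.MALFORMED).increment(1);  // count empty records
          return;
        }
        context.write(value, NullWritable.get());
      }
    }
|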
Total Sort |
producing a set of output files that are each sorted and that can be concatenated, in order, to form a globally sorted file |
|
secondary sort |
Structuring key-value pairs to allow for sorting by values as well as the keys (keys only are the default) |
|
map-side join |
a join performed on the map side. The join happens before the data reaches the map function, and requires the inputs to each map to be sorted by the join key and partitioned by that key as well; can be run through the CompositeInputFormat class |
|
reduce-side join |
a join performed by the reducer. More general than map-side joins, don't require the input data set to be structured a certain way. Less efficient since both datasets must go through the Shuffle. It DOES require that the data is sorted by source, one source before the other |
|
side data |
extra read-only data needed by a job to process the main dataset |
|
Pig |
a client-side application made up of the Pig Latin language and the execution environment for running Pig Latin programs. It raises the level of abstraction for processing large datasets |
|
Grunt |
Pig's interactive shell, started when no file is specified on the command line for Pig to run, and the -e option is not used |
|
Three ways to run Pig programs |
Scripts (pig script.pig), Grunt, and Embedded in Java using the PigServer class |
|
bag |
in Pig, an unordered collection of tuples, represented in Pig Latin by curly braces |
|
Differences between SQL and Pig |
1) Pig Latin is a data flow programming language, while SQL is a declarative programming language
2) Pig Latin is a step-by-step set of single transformation operations on an input relation, SQL is a set of constraints to find a result set
3) An RDBMS uses predefined schemas; Pig is more relaxed about the data it processes
4) Pig supports complex, nested data structures; SQL tables are flat
5) Pig doesn't support random reads/writes |
|
Pig Latin statement |
an operation or command in Pig Latin, which must end with a semicolon (ex: GROUP) |
|
Pig Latin commands |
not added to the Pig logical plan, they are executed immediately. Useful for moving data around before/after Pig processing |
|
Pig Latin expressions |
Something evaluated to yield a value, can be used as part of a statement containing a relational operator (ex: $n- field in position n) |
|
What are the four numeric types in Pig? |
int, long, float, double |
|
Which Pig types are analogous to Java string? |
bytearray, chararray |
|
What are the three complex Pig types? |
tuple, bag, map |
|
Read the schema from the following relation: {(year: chararray), (amount: int), (name)} |
Year variable has a type of chararray, the amount variable is an int, and the name variable is a bytearray (the default, if no type is specified) |
|
When a value being inserted cannot be cast to the type declared in a schema, what occurs in Pig? |
Pig inserts null in its place, instead of erroring...because of the volume of data usually being processed, it is impractical to expect to be able to correct every incorrect record |
|
Macro in Pig |
Similar to a Java method, allows you to package reusable pieces of Pig Latin, done using the DEFINE keyword |
|
User Defined Function (UDF) in Pig |
a function written in Java (or JavaScript, Python, or another scripting language) that houses logic used in Pig scripts |
|
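A minimal Java eval-function sketch (Trim is an illustrative name); it would be registered with REGISTER and invoked from Pig Latin like a built-in function:

    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    public class Trim extends EvalFunc<String> {
      @Override
      public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) {
          return null;  // Pig convention: null in, null out
        }
        Object field = input.get(0);
        return field == null ? null : field.toString().trim();
      }
    }
|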
dynamic invoker |
allows you to use a function that is provided by a Java library without using a UDF. Is less efficient, since it uses reflection |
|
Pig's STREAM operator |
allows you to transform data in a relation using an external program or script |
|
Pig's COGROUP operator |
Similar to a JOIN, but it creates a nested set of output tuples, nice for exploiting the structure in subsequent statements. JOIN creates a flat structure, just a set of tuples |
|
Hive |
Uses the HiveQL language, similar to SQL, to convert queries into MapReduce jobs for execution on a Hadoop cluster |
|
the Hive metastore |
a database created by Hive to store metadata; it is normally stored in a central location and is shared across the cluster |
|
schema on write |
used in traditional RDBMS, schema is enforced at data load time; if data doesn't conform, it is rejected. This makes query performance much faster, but takes longer to load the data |
|
schema on read |
used in Hive, data isn't verified when it is loaded, rather when a query is issued. Makes for a very fast initial load, allows for multiple schemas in the same database, but query performance is slower |
|
How does Hive manage tables and the data in them? |
When a table is created using CREATE TABLE, and data is loaded using LOAD, the data is physically moved from its location into Hive, so when the table is dropped using DROP TABLE, the data is deleted and no longer exists (unless it was previously backed up) |
|
Hive external table |
specified using the EXTERNAL keyword, (e.g. CREATE EXTERNAL TABLE), tells Hive it is not managing the data, which avoids the data deletion situation |
|
HBase |
a distributed, column-oriented database built on top of HDFS. Used for real-time read/write random-access to very large datasets |
|
Zookeeper |
Hadoop's distributed coordination service, used for building general distributed applications |
|
Sqoop |
an open-source tool that allows users to extract data from a relational database into Hadoop for further processing |
|
Partitions |
In Hive, specified at table creation and subsequently at data load, allows you to store the data physically in different places based on a partitioning column, makes for more efficient queries (similar to SQL partitions) |
|
Buckets |
In Hive, imposes extra structure to a table, specified on table creation. Useful for sampling, which can be done with the TABLESAMPLE clause. Also stored physically within a table/partition directory. |
|
Two formats that govern storage in Hive |
Row format (dictates how rows and their fields are stored, defined by a SerDe, a Serializer-Deserializer) and file format (dictates the container format for fields in a row, such as text file, row-oriented binary, etc.) |
|
What is the default storage format in Hive? |
CTRL-A delimited text, with a row per line |
|
RCFiles |
Record Columnar File; similar to SequenceFiles, except column-oriented: the values for each column of a row split are stored together. Can be specified in a CREATE TABLE clause using STORED AS RCFILE. These work well when you only need access to a few columns in a table |
|
What is the difference between ORDER BY and SORT BY in Hive? |
ORDER BY produces a globally sorted file, but only uses one reducer, which is highly inefficient. SORT BY produces a sorted file per reducer, but is not globally sorted |
|
Types of UDFs in Hive |
Regular UDFs- operate on a single row, produce a single row
UDAFs- operate on multiple rows, and produce a single row
UDTFs- operate on a single row, produce multiple rows (a table) as output |
|
What 2 properties must a regular UDF satisfy? |
Must be a subclass of org.apache.hadoop.hive.ql.exec.UDF, and must implement at least one evaluate() method |
|
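A minimal sketch satisfying both properties (Strip is an illustrative UDF that trims whitespace):

    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.Text;

    public class Strip extends UDF {             // property 1: subclass of UDF
      private final Text result = new Text();

      public Text evaluate(Text str) {           // property 2: at least one evaluate() method
        if (str == null) {
          return null;
        }
        result.set(str.toString().trim());
        return result;
      }
    }
|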
splitting column |
in Sqoop, the column identified as having a close-to-uniform distribution; it is used to divide the data coming from a relational database across the MapReduce job that imports it into Hadoop |
|
LobFile |
Sqoop's way of storing imported large objects |
|
the DBWritable interface |
used in Sqoop; a Java interface providing read and write methods for serialization, which allow a generated record class to interact with JDBC |
|
Sqoop's direct mode |
the quickest, most efficient way to import data using Sqoop. Even when it is used, the metadata is queried through JDBC |
|
--hive-import |
When specified in Sqoop, this option tells Sqoop to create a Hive table with the data being moved. If the table already exists, can use --hive-overwrite |
|
--hive-table |
When specified in Sqoop, names the output Hive table |
|
mysqldump |
a MySQL utility, used by Sqoop for direct access to MySQL databases; faster than the default JDBC connection. Direct mode is enabled using the --direct option |
|
PigStorage |
a Java class that constructs a Pig loader to load delimited text files |
|
Three types of SequenceFiles |
1. Uncompressed key/value records
2. Record-compressed key/value records - only the values are compressed
3. Block-compressed - keys and values are collected in blocks separately and compressed |