134 Cards in this Set

  • Front
  • Back

Distributed filesystems

Filesystems that store files on multiple machines

HDFS

Hadoop's main distributed filesystem, designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware

Areas where HDFS is not a good fit

Low-latency data access, lots of small files, multiple writers & arbitrary file modifications

HDFS default block size

64 MB

Unit of abstraction for HDFS

Blocks, not the files themselves

namenode (master node)

Manages the filesystem namespace: maintains the filesystem tree and the metadata for all files in the tree, and knows which datanodes each block is stored on. The namespace is stored persistently on local disk as the namespace image and the edit log; block locations are not persisted but are rebuilt from datanode block reports.

datanode (worker/slave node)

store and retrieve blocks of data when they are told by the namenode or client, report back to namenode periodically with the blocks they are storing

Secondary namenode

runs on a separate physical machine and periodically merges the namespace image with the edit log; it keeps a copy of the merged namespace image, which can be used in the event of primary namenode failure (it lags the primary, so some data loss is likely)

HDFS Federation

A multiple namenode system where each namenode manages: a namespace volume and a block pool

namespace volume

a subset of the files in the filesystem; namespace volumes are independent of one another

block pool

contains all the blocks for the files in a namespace volume. Block pool storage is not partitioned: datanodes register with every namenode in the cluster and store blocks from multiple block pools

HDFS High Availability

A pair of namenodes in an active-standby configuration; the standby is ready to take over as the active namenode in case of failure

Failover controller

manages the transition from the active namenode to the standby; typically implemented using ZooKeeper

Fencing

the process, carried out by the HA system, of ensuring that a previously active namenode that was thought to have failed (but may still be running) cannot do any damage to the system

Methods of fencing

killing the namenode's process, revoking access, STONITH

hadoop fs

the command-line prefix for Hadoop's filesystem shell commands (fs stands for filesystem)

Path object

represents a file or directory path in a Hadoop filesystem

Configuration object

stores the client or server's configuration, which is set using configuration files read from the classpath

FileStatus class

allows you to view metadata about files

globbing

using wildcards to match multiple files with one expression, often used with globStatus function
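
A minimal listing sketch (the /logs/2023/*/* path and directory layout are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GlobExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Match every file under any month directory of 2023 (hypothetical layout)
    FileStatus[] statuses = fs.globStatus(new Path("/logs/2023/*/*"));
    for (FileStatus status : statuses) {
      System.out.println(status.getPath());
    }
  }
}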

Steps of a File Read

1. The client opens the file by calling open() on the FileSystem object, an instance of DistributedFileSystem
2. DistributedFileSystem calls the namenode using RPC to get the locations of the first few blocks in the file
3. DFSInputStream connects to the closest datanode holding the first block in the file
4. Data is streamed from the datanode back to the client, which calls read() repeatedly on the stream
5. As each block ends, DFSInputStream connects to the best datanode for the next block, transparently to the client
6. When the client has finished reading, it calls close() on the FSDataInputStream
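
A minimal read sketch against this API (the HDFS path is hypothetical; error handling is trimmed):

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);  // typically a DistributedFileSystem
    InputStream in = null;
    try {
      in = fs.open(new Path("/user/hadoop/input.txt"));  // returns an FSDataInputStream
      IOUtils.copyBytes(in, System.out, 4096, false);    // stream the file contents to stdout
    } finally {
      IOUtils.closeStream(in);
    }
  }
}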

Steps of a File Write

1. The client calls create() on DistributedFileSystem
2. The namenode checks that the file doesn't already exist and that the client has permission to create it
3. If these checks pass, the namenode makes a record of the new file
4. DistributedFileSystem returns an FSDataOutputStream for the client to start writing data to
5. FSDataOutputStream wraps a DFSOutputStream, which handles communication with the datanodes and the namenode
6. The data queue is consumed by the DataStreamer, which asks the namenode to allocate new blocks by picking a list of suitable datanodes to store the replicas
7. The list of datanodes forms a pipeline, typically three nodes long (depending on the replication factor)
8. The DataStreamer streams packets to the first datanode in the pipeline, which stores each packet and forwards it to the second datanode, and so on down the pipeline
9. A packet is removed from the ack queue only when it has been acknowledged by all the datanodes in the pipeline
10. If a datanode fails while data is being written to it, the pipeline is closed and any packets in the ack queue are added to the front of the data queue, so that datanodes downstream of the failed node do not miss any packets
11. The current block on the good datanodes is given a new identity, which is communicated to the namenode, so that the partial block on the failed datanode is deleted if that datanode later recovers
12. The failed datanode is removed from the pipeline and the remainder of the block's data is written to the two good datanodes left in the pipeline
13. The namenode notices that the block is under-replicated and arranges for a further replica to be created on another node
14. When the client has finished writing data, it calls close() on the stream
15. The namenode already knows which blocks make up the file, so it only has to wait for the blocks to be minimally replicated before returning successfully
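
A matching write sketch (paths are hypothetical; default replication is assumed):

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    InputStream in = new BufferedInputStream(new FileInputStream("/tmp/local.txt"));
    FSDataOutputStream out = fs.create(new Path("/user/hadoop/copy.txt"));
    IOUtils.copyBytes(in, out, 4096, true);  // copy, then close both streams
  }
}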

data queue

an internal queue of packets waiting to be written to the datanodes; DFSOutputStream splits the data written by the client into packets and adds them to this queue

ack queue

an internal queue, maintained by DFSOutputStream, of packets waiting to be acknowledged by the datanodes in the pipeline

Hadoop's default replica strategy

Three replicas: 1) on the same node as the client, 2) on a different rack, chosen at random, 3) on the same rack as the second replica, but on a different node

Coherency model for a filesystem

describes the data visibility of reads and writes for a file

Distcp

Copies large amounts of data to and from Hadoop filesystems in parallel. Implemented as a MapReduce job

Hadoop Archive (HAR) files

pack files into HDFS blocks more efficiently, reducing namenode memory usage while still allowing access to files

How does Hadoop detect errors in data read/write?

Using checksums. On writes, the last datanode in the pipeline verifies the checksum; on reads, the client verifies the data against the checksums stored by the datanodes

Types of compression compatible with Hadoop

DEFLATE, gzip, bzip2, LZO, Snappy

Hadoop's codec implementation

the CompressionCodec interface
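
A small compression sketch using the interface (the choice of GzipCodec and the stdin-to-stdout framing are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.util.ReflectionUtils;

public class CompressStdin {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Class<?> codecClass = Class.forName("org.apache.hadoop.io.compress.GzipCodec");
    CompressionCodec codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
    // Wrap stdout so that everything written to it is gzip-compressed
    CompressionOutputStream out = codec.createOutputStream(System.out);
    IOUtils.copyBytes(System.in, out, 4096, false);
    out.finish();  // flush the compressed stream without closing stdout
  }
}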

Serialization

the process of turning structured objects into a byte stream for transmission over a network or for writing to persistent storage

Deserialization

byte stream -> series of structured objects

Remote Procedure Calls (RPC)

* Handles interprocess communication between nodes in the system
* Uses serialization to render message into binary stream to be sent to remote node

Hadoop's serialization format

Writables, which have two methods: write, and readFields
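
A minimal custom Writable sketch (the pair-of-ints type is illustrative, not a built-in):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class IntPairWritable implements Writable {
  private int first;
  private int second;

  public void set(int first, int second) {
    this.first = first;
    this.second = second;
  }

  @Override
  public void write(DataOutput out) throws IOException {  // serialize to the byte stream
    out.writeInt(first);
    out.writeInt(second);
  }

  @Override
  public void readFields(DataInput in) throws IOException {  // deserialize from the byte stream
    first = in.readInt();
    second = in.readInt();
  }
}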

Hadoop equivalents to Long, Double, Int, String

LongWritable, DoubleWritable, IntWritable, Text

Apache Avro

a language-neutral, IDL-based serialization framework; its schemas are usually written in JSON

Interface description language (IDL)

a definition of types in a language-neutral, declarative fashion

Sequence Files

Provides a persistent data structure for binary key-value pairs. Temporary map outputs are stored as Sequence Files.
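
A small writer sketch (the classic createWriter signature is assumed; the path and records are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/user/hadoop/numbers.seq");
    IntWritable key = new IntWritable();
    Text value = new Text();
    SequenceFile.Writer writer = null;
    try {
      writer = SequenceFile.createWriter(fs, conf, path, key.getClass(), value.getClass());
      for (int i = 0; i < 5; i++) {
        key.set(i);
        value.set("record-" + i);
        writer.append(key, value);  // each call appends one binary key-value record
      }
    } finally {
      if (writer != null) writer.close();
    }
  }
}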

Sync Points

used with Sequence Files, a point in the stream that can be used to resynchronize with a record boundary if the reader is lost

Format of a Sequence File

1) The first 3 bytes are the bytes SEQ, acting as a magic number
2) 1 byte for the version number
3) The remainder of the header, which contains fields such as the key/value class names, compression details, and the sync marker
4) The records (which may be compressed)

Map File

a sorted Sequence File with an index to permit lookups by key

the Configuration class

a collection of configuration properties and their values; configurations read their properties from resources (XML files on the classpath)

Example of a resource XML file

<?xml version="1.0"?>
<configuration>
  <property>
    <name>color</name>
    <value>yellow</value>
    <description>Color</description>
  </property>
  <property>
    <name>size-weight</name>
    <value>${size},${weight}</value>
    <description>Size and weight</description>
  </property>
</configuration>
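
A usage sketch that reads this file (the configuration-1.xml file name is an assumption):

import org.apache.hadoop.conf.Configuration;

public class ConfigurationExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.addResource("configuration-1.xml");      // loaded from the classpath
    System.out.println(conf.get("color"));        // prints "yellow"
    System.out.println(conf.get("size-weight"));  // ${size},${weight} is expanded if size/weight are set
  }
}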




Unit Testing class

Mockito's mock() method, passing the class of the type we want to mock

Optimal # of reducers

Slightly fewer than the number of reduce slots in the cluster

HPROF

a profiling tool that comes with the JDK that gives information about a program's CPU and heap usage

What are a few things to check when tuning a MapReduce job?

# mappers, # reducers, combiners, intermediate compression, custom serialization, shuffle tweaks

JobControl class

represents a graph of MapReduce jobs to be run; you add the job configurations and tell the JobControl instance about the dependencies between jobs

Apache Oozie

designed to manage the execution of thousands of dependent workflows, each of which may be made up of thousands of actions at the level of an individual MapReduce job; composed of a workflow engine and a coordinator engine

workflow engine

part of Oozie, stores and runs workflows composed of Hadoop jobs

coordinator engine

part of Oozie, runs workflow jobs based on predefined schedules and data availability

workflow

a DAG (directed acyclic graph) of action nodes and control-flow nodes

action nodes

performs a workflow task

control-flow nodes

governs the workflow execution between actions by allowing conditional logic, parallel execution, and other constructs

jobtracker

part of MapReduce 1, coordinates the job run, its main class is JobTracker

tasktrackers

part of MapReduce 1, run the tasks that the job has been split into, its main class is TaskTracker

YARN

Yet Another Resource Negotiator, Hadoop's cluster resource-management layer; MapReduce running on YARN is known as MapReduce 2

resource manager

part of YARN, manages the use of resources across the cluster

application master

part of YARN, negotiates resources with the resource manager and manages the lifecycle of a single application running on the cluster

node managers

part of YARN, launch and monitor the compute containers on machines in the cluster

3 failure modes to consider in Classic MapReduce

failure of the running task, the tasktracker, and the jobtracker

How does a tasktracker get blacklisted by a jobtracker? What happens?

If four or more tasks from the same job fail on a tasktracker, the jobtracker records this as a fault. If the tasktracker exceeds a threshold number of faults, it is blacklisted: it will not be assigned tasks but will continue to communicate with the jobtracker. Faults expire over time, so a tasktracker can eventually come off the blacklist

What can be done to mitigate a jobtracker failure?

There is no mechanism for dealing with this type of failure; the job simply fails and must be restarted

4 failure modes to consider in YARN

failure of the task, the application master, the node manager, and the resource manager

What occurs during an application master failure?

The resource manager will detect failure and start a new instance of the master running in a new container...can recover the state of tasks being run

What occurs during a node manager failure?

it will be removed from the resource manager's pool of available nodes, can also be blacklisted

What occurs during a resource manager failure?

The most serious YARN failure, neither jobs nor task containers can be launched. The resource manager uses checkpoints to save its state. A new resource manager instance must be brought up, and can recover from the saved state

Fair Scheduler

a type of Hadoop scheduler, it aims to give every user a fair share of the cluster capacity over time

Capacity Scheduler

a type of Hadoop scheduler, splits the cluster into queues, which may be hierarchical, and each queue has an allocated capacity. Within each queue jobs are scheduled using FIFO

Shuffle

the process by which the system sorts map output keys and transfers the map outputs to reducers as inputs

copy phase

the phase in which the reduce task copies map outputs as soon as they become available, rather than waiting for all map tasks to complete

sort phase

part of the reduce task: once all map outputs have been copied, they are merged to feed the reduce function. More properly called the merge, since the sorting was done on the map side

speculative execution

Hadoop detects when a task is running slower than expected and launches another, equivalent task as a backup. Speculative tasks are launched only after all the tasks for a job have been launched

Output committer

a commit protocol used by Hadoop MapReduce to ensure jobs and tasks either succeed or fail cleanly, implemented by the Java class OutputCommitter

Hadoop skipping mode

forces tasks to report the records being processed back to the tasktracker; when a task fails, the tasktracker retries the task and then skips the record(s) that caused the failure

General form of MapReduce

Map: (K1, V1) -> list(K2, V2)

Combiner (if present): (K2, list(V2)) -> list(K2, V2)

Reduce: (K2, list(V2)) -> list(K3, V3)

Note: often K2=K3 and V2=V3, but not always
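
A minimal word-count sketch that instantiates these signatures (K1=LongWritable, V1=Text, K2=K3=Text, V2=V3=IntWritable):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  // Map: (LongWritable, Text) -> list(Text, IntWritable)
  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reduce: (Text, list(IntWritable)) -> list(Text, IntWritable)
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }
}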

What is the default input format for MapReduce?

TextInputFormat, which produces keys of type LongWritable and values of type Text

What is the default output format for MapReduce?

TextOutputFormat, which writes out records one per line, by converting keys/values to strings and tab-separating them

input split

chunk of input that is processed by a single map, divided into records, which are key-value pairs

which Java class is useful for processing small files?

CombineFileInputFormat, which packs many files into each split so that each mapper has more to process, taking node and rack locality into account when deciding which blocks to place in the same split

Which method can be overridden to prevent file splitting?

isSplitable() on FileInputFormat
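
A sketch of the override (the new-API TextInputFormat is assumed as the base class):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class NonSplittableTextInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;  // each file is processed whole by a single mapper
  }
}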

NLineInputFormat

the input format in which each mapper receives a fixed number of lines of input; keys and values are the same as with TextInputFormat (byte offset, line contents)

What are the types of built-in counters in Hadoop?

Task counters and job counters are built in; user-defined Java counters (including dynamic counters) can be added on top of these

how can a counter be defined in Java?

by defining a Java enum; each enum field becomes a named counter
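
A sketch of an enum-backed counter incremented inside a mapper (the enum and mapper names are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RecordCheckMapper extends Mapper<LongWritable, Text, Text, Text> {
  enum RecordQuality { MALFORMED }  // each enum field shows up as a named counter

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    if (value.toString().trim().isEmpty()) {
      context.getCounter(RecordQuality.MALFORMED).increment(1);
      return;
    }
    context.write(new Text("ok"), value);
  }
}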

Total Sort

producing a set of output files that are each sorted and cover non-overlapping key ranges, so they can be concatenated to form a globally sorted file

secondary sort

Structuring key-value pairs to allow for sorting by values as well as the keys (keys only are the default)

map-side join

a join performed on the map side: the data is joined before it reaches the map function. It requires the inputs to each map to be sorted and partitioned by the join key, and can be run using the CompositeInputFormat class

reduce-side join

a join performed by the reducer. More general than a map-side join, since it doesn't require the input datasets to be structured in a particular way, but less efficient because both datasets must go through the shuffle. It does require records to be tagged by source so that, at the reducer, the data from one source arrives before the other

side data

extra read-only data needed by a job to process the main dataset

Pig

a client-side application made up of a language (Pig Latin) and an execution environment for running Pig Latin programs. It raises the level of abstraction for processing large datasets

Grunt

Pig's interactive shell, started when no file is specified on the command line for Pig to run, and the -e option is not used

Three ways to run Pig programs

Scripts (pig script.pig), Grunt, and Embedded in Java using the PigServer class

bag

in Pig, an unordered collection of tuples, represented in Pig Latin by curly braces

Differences between SQL and Pig

1) Pig Latin is a data-flow programming language, while SQL is a declarative programming language
2) A Pig Latin program is a step-by-step series of single transformations applied to an input relation; a SQL query is a set of constraints that defines a result set
3) An RDBMS requires a predefined schema; Pig is more relaxed about the data it processes
4) Pig supports complex, nested data structures; SQL operates on flatter data
5) Pig doesn't support random reads or writes

Pig Latin statement

an operation or command in Pig Latin, which must end with a semicolon (ex: GROUP)

Pig Latin commands

not added to the Pig logical plan, they are executed immediately. Useful for moving data around before/after Pig processing

Pig Latin expressions

Something evaluated to yield a value, can be used as part of a statement containing a relational operator (ex: $n- field in position n)

What are the four numeric types in Pig?

int, long, float, double

Which Pig types are analogous to Java string?

bytearray, chararray

What are the three complex Pig types?

tuple, bag, map

Read the schema from the following relation:
{(year: chararray), (amount: int), (name)}

The year field has type chararray, the amount field is an int, and the name field is a bytearray (the default when no type is specified)

When a value being inserted cannot be cast to the type declared in a schema, what occurs in Pig?

Pig inserts null in its place, instead of erroring...because of the volume of data usually being processed, it is impractical to expect to be able to correct every incorrect record

Macro in Pig

Similar to a Java method, allows you to package reusable pieces of Pig Latin, done using the DEFINE keyword

User Defined Function (UDF) in Pig

a function, written in Java (or JavaScript, Python, or another scripting language), that encapsulates custom logic for use in Pig scripts
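
A minimal Java eval UDF sketch (the class name and upper-casing behavior are illustrative):

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class UpperCase extends EvalFunc<String> {
  @Override
  public String exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0 || input.get(0) == null) {
      return null;  // Pig convention: propagate nulls rather than failing
    }
    return ((String) input.get(0)).toUpperCase();
  }
}

In Pig Latin the containing jar would be registered with REGISTER and the function invoked by its (possibly DEFINE-aliased) class name.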

dynamic invoker

allows you to use a function that is provided by a Java library without using a UDF. Is less efficient, since it uses reflection

Pig's STREAM operator

allows you to transform data in a relation using an external program or script

Pig's COGROUP operator

Similar to a JOIN, but it creates a nested set of output tuples, nice for exploiting the structure in subsequent statements. JOIN creates a flat structure, just a set of tuples

Hive

Uses the HiveQL language, similar to SQL, to convert queries into MapReduce jobs for execution on a Hadoop cluster

the Hive metastore

a database created by Hive to store metadata; it is normally stored in a central location and is shared across the cluster

schema on write

used in traditional RDBMS, schema is enforced at data load time; if data doesn't conform, it is rejected. This makes query performance much faster, but takes longer to load the data

schema on read

used in Hive, data isn't verified when it is loaded, rather when a query is issued. Makes for a very fast initial load, allows for multiple schemas in the same database, but query performance is slower

How does Hive manage tables and the data in them?

When a table is created using CREATE TABLE, and data is loaded using LOAD, the data is physically moved from its location into Hive, so when the table is dropped using DROP TABLE, the data is deleted and no longer exists (unless it was previously backed up)

Hive external table

specified using the EXTERNAL keyword, (e.g. CREATE EXTERNAL TABLE), tells Hive it is not managing the data, which avoids the data deletion situation

HBase

a distributed, column-oriented database built on top of HDFS. Used for real-time read/write random-access to very large datasets

Zookeeper

Hadoop's distributed coordination service, used as a building block for general distributed applications

Sqoop

an open-source tool that allows users to extract data from a relational database into Hadoop for further processing

Partitions

In Hive, specified at table creation and subsequently at data load, allows you to store the data physically in different places based on a partitioning column, makes for more efficient queries (similar to SQL partitions)

Buckets

In Hive, imposes extra structure to a table, specified on table creation. Useful for sampling, which can be done with the TABLESAMPLE clause. Also stored physically within a table/partition directory.

Two formats that govern storage in Hive

Row format (dictates how rows and their fields are stored, defined by a SerDe, a Serializer-Deserializer) and file format (dictates the container format for fields in a row, such as text file, row-oriented binary, etc.)

What is the default storage format in Hive?

CTRL-A delimited text, with a row per line

RCFiles

Record Columnar File, similar to SequenceFiles except they store all values in a column as a record, can be specified in a CREATE TABLE clause by using STORED AS RCFILE. These work well when you only need access to a few columns in a table

What is the difference between ORDER BY and SORT BY in Hive?

ORDER BY produces a globally sorted file, but only uses one reducer, which is highly inefficient. SORT BY produces a sorted file per reducer, but is not globally sorted

Types of UDFs in Hive

Regular UDFs: operate on a single row and produce a single row

UDAFs (user-defined aggregate functions): operate on multiple rows and produce a single row

UDTFs (user-defined table-generating functions): operate on a single row and produce multiple rows (a table) as output

What 2 properties must a regular UDF satisfy?

Must be a subclass of org.apache.hadoop.hive.ql.exec.UDF, and must implement at least one evaluate() method
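
A sketch satisfying both properties (the Strip name and trim-whitespace behavior are illustrative):

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class Strip extends UDF {
  private final Text result = new Text();

  public Text evaluate(Text str) {  // found by Hive via reflection
    if (str == null) {
      return null;
    }
    result.set(str.toString().trim());
    return result;
  }
}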

splitting column

in Sqoop, the column identified as having a close-to-uniform distribution, used to divide the data coming from a relational database across the map tasks of the import job

LobFile

Sqoop's way of storing imported large objects

the DBWritable interface

used in Sqoop, a Java interface that provides read and write methods for serialization, allowing a class to interact with JDBC

Sqoop's direct mode

the quickest, most efficient way to import data using Sqoop. Even when it is used, the metadata is queried through JDBC

--hive-import

When specified in Sqoop, this option tells Sqoop to create a Hive table with the data being moved. If the table already exists, can use --hive-overwrite

--hive-table

When specified in Sqoop, names the output Hive table

mysqldump

a MySQL utility that Sqoop uses for direct access to MySQL databases, which is faster than the default JDBC connection; enabled with the --direct option

PigStorage

a Java class that constructs a Pig loader to load delimited text files

Three types of SequenceFiles

1. Uncompressed key/value records
2. Record-compressed key/value records: only the values are compressed
3. Block-compressed: keys and values are collected in blocks separately and compressed