134 Cards in this Set

  • Front
  • Back

Distributed filesystems

Filesystems that store files on multiple machines

HDFS

Hadoop's main distributed filesystem, designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware

Areas where HDFS is not a good fit

Low-latency data access, lots of small files, multiple writers & arbitrary file modifications

HDFS default block size

64 MB

Unit of abstraction for HDFS

Blocks, not the files themselves

namenode (master node)

Manages the filesystem namespace: maintains the filesystem tree and the metadata for all files in the tree, and knows which datanodes each block is stored on. The namespace is stored persistently on local disk as the namespace image and the edit log; block locations are not persisted but are rebuilt from datanode block reports.

datanode (worker/slave node)

store and retrieve blocks of data when they are told by the namenode or client, report back to namenode periodically with the blocks they are storing

Secondary namenode

runs on a separate physical machine and periodically merges the namespace image with the edit log; it keeps a copy of the merged namespace image, which can be used in the event of primary namenode failure (it lags the primary, so some data loss is likely)

HDFS Federation

A multiple namenode system where each namenode manages: a namespace volume and a block pool

namespace volume

a subset of the files in the filesystem; namespace volumes are independent of one another

block pool

contains all the blocks for the files in a namespace volume. Block pool storage is not partitioned: datanodes register with every namenode in the cluster and store blocks from multiple block pools

HDFS High Availability

A pair of namenodes in an active-standby configuration; the standby is ready to take over as the active namenode in case of failure

Failover controller

manages the transition from the active namenode to the standby; typically implemented using ZooKeeper

Fencing

the process, carried out by the HA system, of ensuring that a previously active namenode that was thought to have failed (but may still be running) cannot do any damage to the system

Methods of fencing

killing the namenode's process, revoking access, STONITH

hadoop fs

the command-line prefix for Hadoop's filesystem shell commands (fs stands for filesystem)

Path object

represents a file or directory path in a Hadoop filesystem

Configuration object

stores the client or server's configuration, which is set using configuration files read from the classpath

FileStatus class

allows you to view metadata about files

globbing

using wildcards to match multiple files with one expression, often used with globStatus function
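
A minimal listing sketch (the /logs/2023/*/* path and directory layout are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GlobExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Match every file under any month directory of 2023 (hypothetical layout)
    FileStatus[] statuses = fs.globStatus(new Path("/logs/2023/*/*"));
    for (FileStatus status : statuses) {
      System.out.println(status.getPath());
    }
  }
}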

Steps of a File Read

1. The client opens the file by calling open() on the FileSystem object, an instance of DistributedFileSystem
2. DistributedFileSystem calls the namenode using RPC to get the locations of the first few blocks in the file
3. DFSInputStream connects to the closest datanode holding the first block in the file
4. Data is streamed from the datanode back to the client, which calls read() repeatedly on the stream
5. As each block ends, DFSInputStream connects to the best datanode for the next block, transparently to the client
6. When the client has finished reading, it calls close() on the FSDataInputStream
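
A minimal read sketch against this API (the HDFS path is hypothetical; error handling is trimmed):

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);  // typically a DistributedFileSystem
    InputStream in = null;
    try {
      in = fs.open(new Path("/user/hadoop/input.txt"));  // returns an FSDataInputStream
      IOUtils.copyBytes(in, System.out, 4096, false);    // stream the file contents to stdout
    } finally {
      IOUtils.closeStream(in);
    }
  }
}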

Steps of a File Write

1. The client calls create() on DistributedFileSystem
2. The namenode checks that the file doesn't already exist and that the client has permission to create it
3. If these checks pass, the namenode makes a record of the new file
4. DistributedFileSystem returns an FSDataOutputStream for the client to start writing data to
5. FSDataOutputStream wraps a DFSOutputStream, which handles communication with the datanodes and the namenode
6. The data queue is consumed by the DataStreamer, which asks the namenode to allocate new blocks by picking a list of suitable datanodes to store the replicas
7. The list of datanodes forms a pipeline, typically three nodes long (depending on the replication factor)
8. The DataStreamer streams packets to the first datanode in the pipeline, which stores each packet and forwards it to the second datanode, and so on down the pipeline
9. A packet is removed from the ack queue only when it has been acknowledged by all the datanodes in the pipeline
10. If a datanode fails while data is being written to it, the pipeline is closed and any packets in the ack queue are added to the front of the data queue, so that datanodes downstream of the failed node do not miss any packets
11. The current block on the good datanodes is given a new identity, which is communicated to the namenode, so that the partial block on the failed datanode is deleted if that datanode later recovers
12. The failed datanode is removed from the pipeline and the remainder of the block's data is written to the two good datanodes left in the pipeline
13. The namenode notices that the block is under-replicated and arranges for a further replica to be created on another node
14. When the client has finished writing data, it calls close() on the stream
15. The namenode already knows which blocks make up the file, so it only has to wait for the blocks to be minimally replicated before returning successfully
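
A matching write sketch (paths are hypothetical; default replication is assumed):

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    InputStream in = new BufferedInputStream(new FileInputStream("/tmp/local.txt"));
    FSDataOutputStream out = fs.create(new Path("/user/hadoop/copy.txt"));
    IOUtils.copyBytes(in, out, 4096, true);  // copy, then close both streams
  }
}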

data queue

an internal queue of packets waiting to be written to the datanodes; DFSOutputStream splits the data written by the client into packets and adds them to this queue

ack queue

an internal queue, maintained by DFSOutputStream, of packets waiting to be acknowledged by the datanodes in the pipeline

Hadoop's default replica strategy

Three replicas: 1) on the same node as the client, 2) on a different rack, chosen at random, 3) on the same rack as the second replica, but on a different node

Coherency model for a filesystem

describes the data visibility of reads and writes for a file

Distcp

Copies large amounts of data to and from Hadoop filesystems in parallel. Implemented as a MapReduce job

Hadoop Archive (HAR) files

pack files into HDFS blocks more efficiently, reducing namenode memory usage while still allowing access to files

How does Hadoop detect errors in data read/write?

Using checksums. On writes, the last datanode in the pipeline verifies the checksum; on reads, the client verifies the data against the checksums stored by the datanodes

Types of compression compatible with Hadoop

DEFLATE, gzip, bzip2, LZO, Snappy

Hadoop's codec implementation

the CompressionCodec interface
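
A small compression sketch using the interface (the choice of GzipCodec and the stdin-to-stdout framing are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.util.ReflectionUtils;

public class CompressStdin {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Class<?> codecClass = Class.forName("org.apache.hadoop.io.compress.GzipCodec");
    CompressionCodec codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
    // Wrap stdout so that everything written to it is gzip-compressed
    CompressionOutputStream out = codec.createOutputStream(System.out);
    IOUtils.copyBytes(System.in, out, 4096, false);
    out.finish();  // flush the compressed stream without closing stdout
  }
}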

Serialization

the process of turning structured objects into a byte stream for transmission over a network or for writing to persistent storage

Deserialization

byte stream -> series of structured objects

Remote Procedure Calls (RPC)

* Handles interprocess communication between nodes in the system
* Uses serialization to render message into binary stream to be sent to remote node

Hadoop's serialization format

Writables, which have two methods: write, and readFields
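
A minimal custom Writable sketch (the pair-of-ints type is illustrative, not a built-in):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class IntPairWritable implements Writable {
  private int first;
  private int second;

  public void set(int first, int second) {
    this.first = first;
    this.second = second;
  }

  @Override
  public void write(DataOutput out) throws IOException {  // serialize to the byte stream
    out.writeInt(first);
    out.writeInt(second);
  }

  @Override
  public void readFields(DataInput in) throws IOException {  // deserialize from the byte stream
    first = in.readInt();
    second = in.readInt();
  }
}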

Hadoop equivalents to Long, Double, Int, String

LongWritable, DoubleWritable, IntWritable, Text

Apache Avro

a language-neutral, IDL-based serialization framework; its schemas are usually written in JSON

Interface description language (IDL)

a definition of types in a language-neutral, declarative fashion

Sequence Files

Provides a persistent data structure for binary key-value pairs. Temporary map outputs are stored as Sequence Files.
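
A small writer sketch (the classic createWriter signature is assumed; the path and records are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/user/hadoop/numbers.seq");
    IntWritable key = new IntWritable();
    Text value = new Text();
    SequenceFile.Writer writer = null;
    try {
      writer = SequenceFile.createWriter(fs, conf, path, key.getClass(), value.getClass());
      for (int i = 0; i < 5; i++) {
        key.set(i);
        value.set("record-" + i);
        writer.append(key, value);  // each call appends one binary key-value record
      }
    } finally {
      if (writer != null) writer.close();
    }
  }
}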

Sync Points

used with Sequence Files, a point in the stream that can be used to resynchronize with a record boundary if the reader is lost

Format of a Sequence File

1) The first 3 bytes are the bytes SEQ, acting as a magic number
2) 1 byte for the version number
3) The remainder of the header, which contains fields such as the key/value class names, compression details, and the sync marker
4) The records (which may be compressed)

Map File

a sorted Sequence File with an index to permit lookups by key

the Configuration class

a collection of configuration properties and their values; configurations read their properties from resources (XML files on the classpath)

Example of a resource XML file

<?xml version="1.0"?>
<configuration>
  <property>
    <name>color</name>
    <value>yellow</value>
    <description>Color</description>
  </property>
  <property>
    <name>size-weight</name>
    <value>${size},${weight}</value>
    <description>Size and weight</description>
  </property>
</configuration>
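
A usage sketch that reads this file (the configuration-1.xml file name is an assumption):

import org.apache.hadoop.conf.Configuration;

public class ConfigurationExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.addResource("configuration-1.xml");      // loaded from the classpath
    System.out.println(conf.get("color"));        // prints "yellow"
    System.out.println(conf.get("size-weight"));  // ${size},${weight} is expanded if size/weight are set
  }
}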




Unit Testing class

Mockito's mock() method, passing the class of the type we want to mock

Optimal # of reducers

Slightly fewer than the number of reduce slots in the cluster

HPROF

a profiling tool that comes with the JDK that gives information about a program's CPU and heap usage

What are a few things to check when tuning a MapReduce job?

# mappers, # reducers, combiners, intermediate compression, custom serialization, shuffle tweaks

JobControl class

represents a graph of MapReduce jobs to be run; you add the job configurations and tell the JobControl instance about the dependencies between jobs

Apache Oozie

designed to manage the execution of thousands of dependent workflows, each of which may be made up of thousands of actions at the level of an individual MapReduce job; composed of a workflow engine and a coordinator engine

workflow engine

part of Oozie, stores and runs workflows composed of Hadoop jobs

coordinator engine

part of Oozie, runs workflow jobs based on predefined schedules and data availability

workflow

a DAG (directed acyclic graph) of action nodes and control-flow nodes

action nodes

performs a workflow task

control-flow nodes

governs the workflow execution between actions by allowing conditional logic, parallel execution, and other constructs

jobtracker

part of MapReduce 1, coordinates the job run, its main class is JobTracker

tasktrackers

part of MapReduce 1, run the tasks that the job has been split into, its main class is TaskTracker

YARN

Yet Another Resource Negotiator, Hadoop's cluster resource-management layer; MapReduce running on YARN is known as MapReduce 2

resource manager

part of YARN, manages the use of resources across the cluster

application master

part of YARN, negotiates resources with the resource manager and manages the lifecycle of a single application running on the cluster

node managers

part of YARN, launch and monitor the compute containers on machines in the cluster

3 failure modes to consider in Classic MapReduce

failure of the running task, the tasktracker, and the jobtracker

How does a tasktracker get blacklisted by a jobtracker? What happens?

If four or more tasks from the same job fail on a tasktracker, the jobtracker records this as a fault. If the tasktracker exceeds a threshold number of faults, it is blacklisted: it will not be assigned tasks but will continue to communicate with the jobtracker. Faults expire over time, so a tasktracker can eventually come off the blacklist

What can be done to mitigate a jobtracker failure?

There is no mechanism for dealing with this type of failure; the job simply fails and must be restarted

4 failure modes to consider in YARN

failure of the task, the application master, the node manager, and the resource manager

What occurs during an application master failure?

The resource manager will detect failure and start a new instance of the master running in a new container...can recover the state of tasks being run

What occurs during a node manager failure?

it will be removed from the resource manager's pool of available nodes, can also be blacklisted

What occurs during a resource manager failure?

The most serious YARN failure, neither jobs nor task containers can be launched. The resource manager uses checkpoints to save its state. A new resource manager instance must be brought up, and can recover from the saved state

Fair Scheduler

a type of Hadoop scheduler, it aims to give every user a fair share of the cluster capacity over time

Capacity Scheduler

a type of Hadoop scheduler, splits the cluster into queues, which may be hierarchical, and each queue has an allocated capacity. Within each queue jobs are scheduled using FIFO

Shuffle

the process by which the system sorts map output keys and transfers the map outputs to reducers as inputs

copy phase

the phase in which the reduce task copies map outputs as soon as they become available, rather than waiting for all map tasks to complete

sort phase

part of the reduce task: once all map outputs have been copied, they are merged to feed the reduce function. More properly called the merge, since the sorting was done on the map side

speculative execution

Hadoop detects when a task is running slower than expected and launches another, equivalent task as a backup. Speculative tasks are launched only after all the tasks for a job have been launched

Output committer

a commit protocol used by Hadoop MapReduce to ensure jobs and tasks either succeed or fail cleanly, implemented by the Java class OutputCommitter

Hadoop skipping mode

forces tasks to report the records being processed back to the tasktracker; when a task fails, the tasktracker retries the task and then skips the record(s) that caused the failure

General form of MapReduce

Map: (K1, V1) -> list(K2, V2)

Combiner (if present): (K2, list(V2)) -> list(K2, V2)

Reduce: (K2, list(V2)) -> list(K3, V3)

Note: often K2=K3 and V2=V3, but not always
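
A minimal word-count sketch that instantiates these signatures (K1=LongWritable, V1=Text, K2=K3=Text, V2=V3=IntWritable):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  // Map: (LongWritable, Text) -> list(Text, IntWritable)
  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reduce: (Text, list(IntWritable)) -> list(Text, IntWritable)
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }
}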

What is the default input format for MapReduce?

TextInputFormat, which produces keys of type LongWritable and values of type Text

What is the default output format for MapReduce?

TextOutputFormat, which writes out records one per line, by converting keys/values to strings and tab-separating them

input split

chunk of input that is processed by a single map, divided into records, which are key-value pairs

which Java class is useful for processing small files?

CombineFileInputFormat, which packs many files into each split so that each mapper has more to process, taking node and rack locality into account when deciding which blocks to place in the same split

Which method can be overridden to prevent file splitting?

isSplitable() on FileInputFormat
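
A sketch of the override (the new-API TextInputFormat is assumed as the base class):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class NonSplittableTextInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;  // each file is processed whole by a single mapper
  }
}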

NLineInputFormat

the input format in which each mapper receives a fixed number of lines of input; keys and values are the same as with TextInputFormat (byte offset, line contents)

What are the types of built-in counters in Hadoop?

Task counters and job counters are built in; user-defined Java counters (including dynamic counters) can be added on top of these

how can a counter be defined in Java?

by defining a Java enum; each enum field becomes a named counter
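
A sketch of an enum-backed counter incremented inside a mapper (the enum and mapper names are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RecordCheckMapper extends Mapper<LongWritable, Text, Text, Text> {
  enum RecordQuality { MALFORMED }  // each enum field shows up as a named counter

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    if (value.toString().trim().isEmpty()) {
      context.getCounter(RecordQuality.MALFORMED).increment(1);
      return;
    }
    context.write(new Text("ok"), value);
  }
}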

Total Sort

producing a set of output files that are each sorted and cover non-overlapping key ranges, so they can be concatenated to form a globally sorted file

secondary sort

Structuring key-value pairs to allow for sorting by values as well as the keys (keys only are the default)

map-side join

a join performed on the map side: the data is joined before it reaches the map function. It requires the inputs to each map to be sorted and partitioned by the join key, and can be run using the CompositeInputFormat class

reduce-side join

a join performed by the reducer. More general than a map-side join, since it doesn't require the input datasets to be structured in a particular way, but less efficient because both datasets must go through the shuffle. It does require records to be tagged by source so that, at the reducer, the data from one source arrives before the other

side data

extra read-only data needed by a job to process the main dataset

Pig

a client-side application made up of a language (Pig Latin) and an execution environment for running Pig Latin programs. It raises the level of abstraction for processing large datasets

Grunt

Pig's interactive shell, started when no file is specified on the command line for Pig to run, and the -e option is not used

Three ways to run Pig programs

Scripts (pig script.pig), Grunt, and Embedded in Java using the PigServer class

bag

in Pig, an unordered collection of tuples, represented in Pig Latin by curly braces

Differences between SQL and Pig

1) Pig Latin is a data-flow programming language, while SQL is a declarative programming language
2) A Pig Latin program is a step-by-step series of single transformations applied to an input relation; a SQL query is a set of constraints that defines a result set
3) An RDBMS requires a predefined schema; Pig is more relaxed about the data it processes
4) Pig supports complex, nested data structures; SQL operates on flatter data
5) Pig doesn't support random reads or writes

Pig Latin statement

an operation or command in Pig Latin, which must end with a semicolon (ex: GROUP)

Pig Latin commands

not added to the Pig logical plan, they are executed immediately. Useful for moving data around before/after Pig processing

Pig Latin expressions

Something evaluated to yield a value, can be used as part of a statement containing a relational operator (ex: $n- field in position n)

What are the four numeric types in Pig?

int, long, float, double

Which Pig types are analogous to Java string?

bytearray, chararray

What are the three complex Pig types?

tuple, bag, map

Read the schema from the following relation:
{(year: chararray), (amount: int), (name)}

The year field has type chararray, the amount field is an int, and the name field is a bytearray (the default when no type is specified)

When a value being inserted cannot be cast to the type declared in a schema, what occurs in Pig?

Pig inserts null in its place, instead of erroring...because of the volume of data usually being processed, it is impractical to expect to be able to correct every incorrect record

Macro in Pig

Similar to a Java method, allows you to package reusable pieces of Pig Latin, done using the DEFINE keyword

User Defined Function (UDF) in Pig

a function, written in Java (or JavaScript, Python, or another scripting language), that encapsulates custom logic for use in Pig scripts
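
A minimal Java eval UDF sketch (the class name and upper-casing behavior are illustrative):

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class UpperCase extends EvalFunc<String> {
  @Override
  public String exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0 || input.get(0) == null) {
      return null;  // Pig convention: propagate nulls rather than failing
    }
    return ((String) input.get(0)).toUpperCase();
  }
}

In Pig Latin the containing jar would be registered with REGISTER and the function invoked by its (possibly DEFINE-aliased) class name.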

dynamic invoker

allows you to use a function that is provided by a Java library without using a UDF. Is less efficient, since it uses reflection

Pig's STREAM operator

allows you to transform data in a relation using an external program or script

Pig's COGROUP operator

Similar to a JOIN, but it creates a nested set of output tuples, nice for exploiting the structure in subsequent statements. JOIN creates a flat structure, just a set of tuples

Hive

Uses the HiveQL language, similar to SQL, to convert queries into MapReduce jobs for execution on a Hadoop cluster

the Hive metastore

a database created by Hive to store metadata; it is normally stored in a central location and is shared across the cluster

schema on write

used in traditional RDBMS, schema is enforced at data load time; if data doesn't conform, it is rejected. This makes query performance much faster, but takes longer to load the data

schema on read

used in Hive, data isn't verified when it is loaded, rather when a query is issued. Makes for a very fast initial load, allows for multiple schemas in the same database, but query performance is slower

How does Hive manage tables and the data in them?

When a table is created using CREATE TABLE, and data is loaded using LOAD, the data is physically moved from its location into Hive, so when the table is dropped using DROP TABLE, the data is deleted and no longer exists (unless it was previously backed up)

Hive external table

specified using the EXTERNAL keyword, (e.g. CREATE EXTERNAL TABLE), tells Hive it is not managing the data, which avoids the data deletion situation

HBase

a distributed, column-oriented database built on top of HDFS. Used for real-time read/write random-access to very large datasets

Zookeeper

Hadoop's distributed coordination service, used as a building block for general distributed applications

Sqoop

an open-source tool that allows users to extract data from a relational database into Hadoop for further processing

Partitions

In Hive, specified at table creation and subsequently at data load, allows you to store the data physically in different places based on a partitioning column, makes for more efficient queries (similar to SQL partitions)

Buckets

In Hive, imposes extra structure to a table, specified on table creation. Useful for sampling, which can be done with the TABLESAMPLE clause. Also stored physically within a table/partition directory.

Two formats that govern storage in Hive

Row format (dictates how rows and their fields are stored, defined by a SerDe, a Serializer-Deserializer) and file format (dictates the container format for fields in a row, such as text file, row-oriented binary, etc.)

What is the default storage format in Hive?

CTRL-A delimited text, with a row per line

RCFiles

Record Columnar File, similar to SequenceFiles except they store all values in a column as a record, can be specified in a CREATE TABLE clause by using STORED AS RCFILE. These work well when you only need access to a few columns in a table

What is the difference between ORDER BY and SORT BY in Hive?

ORDER BY produces a globally sorted file, but only uses one reducer, which is highly inefficient. SORT BY produces a sorted file per reducer, but is not globally sorted

Types of UDFs in Hive

Regular UDFs: operate on a single row and produce a single row

UDAFs (user-defined aggregate functions): operate on multiple rows and produce a single row

UDTFs (user-defined table-generating functions): operate on a single row and produce multiple rows (a table) as output

What 2 properties must a regular UDF satisfy?

Must be a subclass of org.apache.hadoop.hive.ql.exec.UDF, and must implement at least one evaluate() method
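
A sketch satisfying both properties (the Strip name and trim-whitespace behavior are illustrative):

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class Strip extends UDF {
  private final Text result = new Text();

  public Text evaluate(Text str) {  // found by Hive via reflection
    if (str == null) {
      return null;
    }
    result.set(str.toString().trim());
    return result;
  }
}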

splitting column

in Sqoop, the column identified as having a close-to-uniform distribution, used to divide the data coming from a relational database across the map tasks of the import job

LobFile

Sqoop's way of storing imported large objects

the DBWritable interface

used in Sqoop, a Java interface that provides read and write methods for serialization, allowing a class to interact with JDBC

Sqoop's direct mode

the quickest, most efficient way to import data using Sqoop. Even when it is used, the metadata is queried through JDBC

--hive-import

When specified in Sqoop, this option tells Sqoop to create a Hive table with the data being moved. If the table already exists, can use --hive-overwrite

--hive-table

When specified in Sqoop, names the output Hive table

mysqldump

a MySQL utility that Sqoop uses for direct access to MySQL databases, which is faster than the default JDBC connection; enabled with the --direct option

PigStorage

a Java class that constructs a Pig loader to load delimited text files

Three types of SequenceFiles

1. Uncompressed key/value records
2. Record-compressed key/value records: only the values are compressed
3. Block-compressed: keys and values are collected in blocks separately and compressed