
1) What is Data and Information?

2) What is Big Data?

3) What are the four characteristics of Big Data?

4) What are the core components of Hadoop?

5) What are the Features of Hadoop?

6) Explain Data Locality in Hadoop?

7) What is Fault Tolerance in HDFS?

8) What is Rack Awareness?

9) What is a Datanode?

What is a rack? 



1. Data can be defined as a representation of facts, concepts, or instructions in a formalized manner, suitable for communication, interpretation, or processing by humans or electronic machines.

Information is organized or classified data, which has some meaningful values for the receiver. Information is the processed data on which decisions and actions are based.

For a decision to be meaningful, the processed data must have the following characteristics:

Timely − Information should be available when required.

Accuracy − Information should be accurate.

Completeness − Information should be complete.

 

2. Big data is a term that describes the large volumes of data - both structured and unstructured - that inundate a business on a day-to-day basis. But it is not the amount of data that is important; it is what organizations do with the data that matters. Big data can be analyzed for insights that lead to better decisions and strategic business moves.

 

3. Volume of Big Data

The volume of data refers to the size of the data sets that need to be analyzed and processed, which are now frequently larger than terabytes and petabytes. The sheer volume of the data requires processing technologies distinct from traditional storage and processing capabilities.

Velocity of Big Data

Velocity refers to the speed with which data is generated. High velocity data is generated with such a pace that it requires distinct (distributed) processing techniques.

Variety of Big Data

Variety makes Big Data really big. Big Data comes from a great variety of sources and generally falls into one of three types: structured, semi-structured, and unstructured data. The variety in data types frequently requires distinct processing capabilities and specialist algorithms.

Veracity of Big Data

Veracity refers to the quality of the data that is being analyzed. High veracity data has many records that are valuable to analyze and that contribute in a meaningful way to the overall results.

 

4. The 3 core components of the Apache Software Foundation's Hadoop framework are:

1. MapReduce - A software programming model for processing large sets of data in parallel
2. HDFS - The Java-based distributed file system that can store all kinds of data without prior organization.
3. YARN - A resource management framework for scheduling and handling resource requests from distributed applications.
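
As a rough illustration of how these components fit together, below is a minimal WordCount sketch (the HDFS input and output paths are hypothetical examples): HDFS stores the input and output files, MapReduce supplies the Mapper/Reducer programming model, and YARN allocates the containers when the job is submitted.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path("/user/demo/input"));    // hypothetical HDFS path
    FileOutputFormat.setOutputPath(job, new Path("/user/demo/output")); // hypothetical HDFS path
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

When packaged into a jar, such a job can be launched with a command like hadoop jar wordcount.jar WordCount, after which YARN's ResourceManager schedules the map and reduce tasks across the cluster.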

 

5. Open Source:

Hadoop is open source, which means it is free to use. Since it is an open-source project, the source code is available online for anyone to understand it or modify it as per their industry requirements.

Highly Scalable Cluster:

Hadoop is a highly scalable model. A large amount of data is divided across multiple inexpensive machines in a cluster and processed in parallel. The number of these machines or nodes can be increased or decreased as per the enterprise's requirements.

Fault Tolerance is Available:

Hadoop uses commodity hardware (inexpensive systems) that can crash at any moment. In Hadoop, data is replicated on various DataNodes in the cluster, which ensures the availability of data even if one of the systems crashes.

High Availability is Provided:

Fault tolerance provides high availability in the Hadoop cluster. High availability means that data remains accessible on the cluster. Due to fault tolerance, if any DataNode goes down, the same data can be retrieved from another node where the data is replicated.

 

6. Data Locality in Hadoop means moving computation close to the data rather than moving data towards the computation. Hadoop stores data in HDFS, which splits files into blocks and distributes them among the DataNodes. When a MapReduce job is submitted, it is divided into map tasks and reduce tasks, and the scheduler tries to run each map task on a node that already holds that task's input block.
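
As a small sketch of the information the scheduler relies on, the Hadoop FileSystem API can list which hosts hold each block of a file; the scheduler then tries to place map tasks on (or near) those hosts. The file path below is a hypothetical example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/user/demo/input/data.txt"); // hypothetical path
    FileStatus status = fs.getFileStatus(file);

    // One BlockLocation per block of the file; each knows which DataNodes hold a replica.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (int i = 0; i < blocks.length; i++) {
      System.out.printf("block %d: offset=%d, hosts=%s%n",
          i, blocks[i].getOffset(), String.join(",", blocks[i].getHosts()));
    }
    fs.close();
  }
}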

 

7. In Hadoop, the failure of one node does not affect access (read/write operations) to data. Multiple copies of the same block are available on other DataNodes, so the failure of one node does not impact our work; the block can simply be read from another DataNode when one of the DataNodes (slaves) fails.

The replication factor controls how many copies of each block are stored across the DataNodes. By default the replication factor is 3 in HDFS, but you can increase or decrease it as per your requirements.
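
As a minimal sketch (with a hypothetical file path), the replication factor can be read and changed per file through the FileSystem API; the cluster-wide default comes from the dfs.replication property in hdfs-site.xml.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Cluster-wide default replication factor (normally set in hdfs-site.xml).
    System.out.println("dfs.replication = " + conf.get("dfs.replication", "3"));

    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/user/demo/important.dat"); // hypothetical path

    FileStatus before = fs.getFileStatus(file);
    System.out.println("current replication: " + before.getReplication());

    // Raise the replication factor of this one file to 5; the NameNode will
    // schedule the extra copies on other DataNodes in the background.
    fs.setReplication(file, (short) 5);

    FileStatus after = fs.getFileStatus(file);
    System.out.println("new replication: " + after.getReplication());
    fs.close();
  }
}

The same change can be made from the command line with hdfs dfs -setrep 5 /user/demo/important.dat.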

 

8. In Rack Awareness, the NameNode chooses a DataNode that is on the same rack or a nearby rack. The NameNode maintains the rack ID of each DataNode and uses this rack information when choosing DataNodes. The NameNode also ensures that all replicas are not stored on the same rack or a single rack. The Rack Awareness policy reduces latency and improves fault tolerance.
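
To see rack awareness in action, the sketch below (again with a hypothetical file path) prints the topology path of each block replica; the rack portion of each path comes from the cluster's configured topology script (the net.topology.script.file.name property).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowRackPlacement {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/user/demo/input/data.txt"); // hypothetical path
    FileStatus status = fs.getFileStatus(file);

    for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
      // Each entry looks like "/rackN/datanode-host:port". With the default
      // replication factor of 3 and rack awareness enabled, the replicas of a
      // block are typically spread across two racks rather than one.
      System.out.println("replicas: " + String.join(", ", block.getTopologyPaths()));
    }
    fs.close();
  }
}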

 

9. DataNode is a daemon (a process that runs in the background) that runs on the slave nodes in a Hadoop cluster.

2. In HDFS, a file is broken into small chunks called blocks (64 MB per block by default in older Hadoop versions, 128 MB in Hadoop 2.x and later).

3. These blocks of data are stored on the slave nodes.

4. It stores the actual data, so a large number of disks is required to store the data (8 disks are commonly recommended).

5. Data read/write operations to the disks are performed by the DataNode. Commodity hardware can be used to host DataNodes.
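
As a small sketch (with a hypothetical file path), the FileSystem API can show how a particular file is divided into blocks on the DataNodes and what the cluster's default block size is.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockSize {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/user/demo/input/data.txt"); // hypothetical path

    FileStatus status = fs.getFileStatus(file);
    // Block size this file was written with, and the cluster default (dfs.blocksize).
    long fileBlockSize = status.getBlockSize();
    long defaultBlockSize = fs.getDefaultBlockSize(file);
    long blocks = (status.getLen() + fileBlockSize - 1) / fileBlockSize;

    System.out.println("file length        : " + status.getLen() + " bytes");
    System.out.println("file block size    : " + fileBlockSize + " bytes");
    System.out.println("default block size : " + defaultBlockSize + " bytes");
    System.out.println("number of blocks   : " + blocks);
    fs.close();
  }
}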

 

10. A rack, in an IT (information technology) context, is a supporting framework that holds hardware modules. In this context, racks typically contain servers, hard disk drives, and other computing equipment. Racks make it possible to contain a lot of equipment in a small physical footprint without requiring shelving. In a Hadoop cluster, a rack is a group of DataNodes, typically connected to the same network switch.

Step-by-step explanation

The term "big data" refers to data that is so large, fast or complex that it's difficult or impossible to process using traditional methods. The act of accessing and storing large amounts of information for analytics has been around a long time.

Data that is high volume, high velocity and high variety must be processed with advanced tools (analytics and algorithms) to reveal meaningful information. Because of these characteristics of the data, the knowledge domain that deals with the storage, processing, and analysis of these data sets has been labeled Big Data.
