1. Introduction

2. Quiz: Input Data
   How to find total sales/store?

   KEY          VALUE
   time         store name
   cost         store name
   store name   cost
   store name   product type

3. Quiz: Defensive Mapper Code

   # Your task is to make sure that this mapper code does not fail on corrupt data lines,
   # but instead just ignores them and continues working
   import sys

   def mapper():
       # read standard input line by line
       for line in sys.
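One way the defensive mapper from the quiz might be completed is sketched below. It assumes each valid input line is tab-separated with exactly six fields (date, time, store name, product type, cost, payment); that layout, the constant `EXPECTED_FIELDS`, and the helper `map_lines` are illustrative assumptions, not something the excerpt above specifies. Any line with a different field count is treated as corrupt and skipped.

```python
import sys

# Assumption: a valid line has six tab-separated fields:
# date, time, store name, product type, cost, payment
EXPECTED_FIELDS = 6


def map_lines(lines, expected_fields=EXPECTED_FIELDS):
    """Yield (store, cost) pairs, silently skipping corrupt lines."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != expected_fields:
            # Wrong number of fields: ignore the line and keep working
            continue
        _date, _time, store, _item, cost, _payment = fields
        yield store, cost


def mapper(stream=sys.stdin, out=sys.stdout):
    """Read lines from stream and write 'store<TAB>cost' for each valid one."""
    for store, cost in map_lines(stream):
        out.write(f"{store}\t{cost}\n")
```

To run it as a script, call `mapper()` at the end of the file so it reads from standard input. Emitting the store name as the key and the cost as the value matches the "store name / cost" pairing from the Input Data quiz, which is the pairing that lets a reducer total sales per store.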
1. Quiz: HDFS
   Which of the following is true?
     HDFS uses a central SAN (storage area network) to hold its data
     HDFS stores a single copy of all data
     HDFS replicates all data for reliability
     To store 100TB of data in a Hadoop cluster you would need 300TB of raw disk space by default

2. Quiz: DataNode
   Which of the following is true if one of the nodes running the DataNode daemon on the cluster fails?
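The 100TB versus 300TB figure in the HDFS quiz follows from the default replication factor of three: every block is stored three times, so raw disk usage is three times the data size. A minimal sketch of that arithmetic, with the function name `raw_disk_needed_tb` chosen here purely for illustration:

```python
def raw_disk_needed_tb(data_tb, replication_factor=3):
    """Raw disk (in TB) needed when every block is stored replication_factor times."""
    return data_tb * replication_factor


print(raw_disk_needed_tb(100))  # 300
```

This ignores any other overhead a real cluster has, such as temporary space for intermediate job output, so it is a lower bound rather than a sizing recommendation.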
1. Quiz: HDFS
   Is there a problem? > https://youtu.be/6F8-cCUbRU8
     Network failure
     Disk failure on DN (DataNode)
     Not all DN used
     Block sizes differ
     Disk failure on NN (NameNode)

2. Quiz: Data Redundancy
   Any problem now? (when the NN fails)
     Data inaccessible > when there is a network failure on the NN
     Data lost forever > when there is a disk failure on the NN
     No problem

3. NameNode Standby
   The active NameNode works as before, but a standby can be configured to take over if the active one fails.
1. Quiz: Dimensions of Big Data
   Which of the following are part of the 3 dimensions of Big Data?
     Volume
     Cost
     Importance
     Velocity
     Source
     Variety
     Security
     Virality

2. Quiz: Volume
   Volume of Big Data refers to:
     Importance of data
     Size of data
     Speed of data generation
     The different data sources

3. Quiz: Hadoop Ecosystem
   Check all that are true:
     Hadoop provides an efficient way of storing data via HDFS
     Hadoop has a visualization framework called ‘Giraffe’
     You can analyze large datasets using a high-level language called ‘Pig’
     ‘Hive’ offers a SQL-like language on top of MapReduce
     The tools in Hadoop’s ecosystem are all proprietary, commercial tools

4.
1. Introduction
   You can read more about Big Data on Wikipedia, which itself generates and processes huge amounts of data. MapReduce and Apache Hadoop are the technologies we will be talking about more in this course.

2. Data Sources
   According to IBM: “Every day, 2.5 billion gigabytes of high-velocity data are created in a variety of forms, such as social media posts, information gathered in sensors and medical devices, videos and transaction records”