[airflow] 0. Quickstart

```bash
# airflow needs a home, ~/airflow is the default,
# but you can lay foundation somewhere else if you prefer
# (optional)
export AIRFLOW_HOME=~/airflow

# install from pypi using pip
pip install airflow

# initialize the database
airflow initdb

# start the web server, default port is 8080
airflow webserver -p 8080
```

You can set the installation path with the `export AIRFLOW_HOME=~/airflow` command; if `AIRFLOW_HOME` is not set, the default path is `~/airflow`. Installation itself is a simple `pip install`.
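Once the webserver is up, a natural next step is a first DAG. Below is a minimal sketch, assuming the legacy Airflow 1.x import paths that match the `initdb`-era CLI above; the DAG id, schedule, and echoed message are hypothetical placeholders:

```python
# a minimal first-DAG sketch, assuming legacy Airflow 1.x import paths;
# dag_id, schedule_interval, and the bash_command are hypothetical examples
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id="hello_airflow",            # hypothetical DAG id
    start_date=datetime(2017, 1, 1),
    schedule_interval="@daily",
)

say_hello = BashOperator(
    task_id="say_hello",
    bash_command='echo "hello from airflow"',
    dag=dag,
)
```

Dropping this file into `$AIRFLOW_HOME/dags/` should make it show up in the web UI started above.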

Continue reading

1. Introduction

2. Quiz: Input Data

How to find total sales per store?

| KEY | VALUE |
| --- | --- |
| time | store name |
| cost | store name |
| store name | cost |
| store name | product type |

3. Quiz: Defensive Mapper Code

```python
# Your task is to make sure that this mapper code does not fail on corrupt data lines,
# but instead just ignores them and continues working
import sys

def mapper():
    # read standard input line by line
    for line in sys.stdin:
```
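A hedged sketch of where that quiz is heading, not the course's official solution: it assumes six tab-separated fields per record, with the store name in the third field and the cost in the fifth, so adjust the indices to your data.

```python
# a defensive mapper sketch: malformed lines are skipped, not fatal
# assumes tab-separated records of exactly six fields, e.g.
# date \t time \t store \t item \t cost \t payment   (assumed layout)
import sys

def mapper():
    # read standard input line by line, splitting each record on tabs
    for line in sys.stdin:
        data = line.strip().split("\t")
        # ignore corrupt lines that do not have exactly six fields
        if len(data) != 6:
            continue
        date, time, store, item, cost, payment = data
        # ignore lines whose cost field is not numeric
        try:
            float(cost)
        except ValueError:
            continue
        # emit key (store name) and value (cost) for the reducer
        print("{0}\t{1}".format(store, cost))

if __name__ == "__main__":
    mapper()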

Continue reading

1. Quiz: HDFS

Which of the following is true?

- HDFS uses a central SAN (storage area network) to hold its data
- HDFS stores a single copy of all data
- HDFS replicates all data for reliability
- To store 100TB of data in a Hadoop cluster you would need 300TB of raw disk space by default

2. Quiz: DataNode

Which of the following is true if one of the nodes running the DataNode daemon on the cluster fails?
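The 300TB option follows from HDFS's default replication factor of 3 (100TB of data, three copies of every block). A trivial sketch of the arithmetic:

```python
# raw disk space = logical data size * replication factor (HDFS default: 3)
def raw_disk_needed(data_tb, replication_factor=3):
    return data_tb * replication_factor

print(raw_disk_needed(100))  # -> 300 (TB), matching the quiz option above
```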

Continue reading

1. Quiz: HDFS

Is there a problem? (video: https://youtu.be/6F8-cCUbRU8)

- Network failure
- Disk failure on a DN (DataNode)
- Not all DNs are used
- Block sizes differ
- Disk failure on the NN (NameNode)

2. Quiz: Data Redundancy

Any problem now? (when the NN fails)

- Data inaccessible: when a network failure cuts off the NN
- Data lost forever: when the NN's disk fails
- No problem

3. NameNode Standby

The active NameNode works as before, but a standby NameNode can be configured to take over if the active one fails.
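As a toy illustration of the standby idea (plain Python, not Hadoop code; the node names are made up):

```python
# a toy model of NameNode HA: clients fall back to the standby
# when the active NameNode fails (node names are hypothetical)
class NameNode:
    def __init__(self, name):
        self.name = name
        self.alive = True

active = NameNode("nn-active")
standby = NameNode("nn-standby")

def current_namenode():
    # the standby takes over only if the active NameNode is down
    return active if active.alive else standby

print(current_namenode().name)  # -> nn-active
active.alive = False            # simulate a failure of the active NN
print(current_namenode().name)  # -> nn-standby
```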

Continue reading

1. Quiz: Dimensions of Big Data

Which of the following are part of the 3 dimensions of Big Data?

- Volume
- Cost
- Importance
- Velocity
- Source
- Variety
- Security
- Virality

2. Quiz: Volume

Volume of Big Data refers to:

- Importance of data
- Size of data
- Speed of data generation
- The different data sources

3. Quiz: Hadoop Ecosystem

Check all that are true:

- Hadoop provides an efficient way of storing data via HDFS
- Hadoop has a visualization framework called ‘Giraffe’
- You can analyze large datasets using a high-level language called ‘Pig’
- ‘Hive’ offers a SQL-like language on top of MapReduce
- The tools in Hadoop’s ecosystem are all proprietary, commercial tools

4.

Continue reading


Sanghun Kang

COOL

Data Engineer & Analyst

South Korea