Hadoop on Mac OSX Yosemite part 2

This is a continuation of Installing Hadoop on Mac, where we installed Hadoop, Yarn, and HDFS and ran our first Hadoop WordCount job. In this part we will write our own WordCount.java program, compile it, and then run it on the Hadoop standalone setup we configured.

Creating Hadoop’s Wordcount Program
– Main Class
– Mapper Class
– Reducer Class
Compiling the Hadoop Project
– using the terminal
– using Maven

Extra:
Managing the filesystem HDFS
Uploading Data Files
Running a Hadoop Project

Working Github Repo configured with Maven

External:

Hadoop and Hive Running a Hadoop Program
UT CS378 Big Data Programming Lecture Slides

Creating Hadoop’s Wordcount Program

Main and General layout

The main class and code layout will generally be identical, with a public WordCount class encapsulating the Mapper, Reducer, and Combiner classes. I wrote the Mapper and Reducer classes in separate sections of the page to make it clearer what is what, but in the end you'll insert the code and replace the bracketed placeholders. Start off by creating a file called WordCount.java.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.StringTokenizer;


public class WordCount extends Configured implements Tool {
   private final static LongWritable ONE = new LongWritable(1L);

[INSERT MAPPER CLASS]
[INSERT REDUCER CLASS]

  static int printUsage() {
    System.out.println("wordcount [-m #mappers] [-r #reducers] input_file output_file");
    ToolRunner.printGenericCommandUsage(System.out);
    return -1;
  }

public int run(String[] args) throws Exception {

    JobConf conf = new JobConf(getConf(), WordCount.class);
    conf.setJobName("wordcount");

    // the keys are words (strings)
    conf.setOutputKeyClass(Text.class);
    // the values are counts (ints)
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(MapClass.class);
    // Here we set the combiner!
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    List<String> other_args = new ArrayList<String>();
    for (int i = 0; i < args.length; ++i) {
      try {
        if ("-m".equals(args[i])) {
          conf.setNumMapTasks(Integer.parseInt(args[++i]));
        } else if ("-r".equals(args[i])) {
          conf.setNumReduceTasks(Integer.parseInt(args[++i]));
        } else {
          other_args.add(args[i]);
        }
      } catch (NumberFormatException except) {
        System.out.println("ERROR: Integer expected instead of " + args[i]);
        return printUsage();
      } catch (ArrayIndexOutOfBoundsException except) {
        System.out.println("ERROR: Required parameter missing from " +
            args[i-1]);
        return printUsage();
      }
    }
// Make sure there are exactly 2 parameters left.
   if (other_args.size() != 2) {
      System.out.println("ERROR: Wrong number of parameters: " +
          other_args.size() + " instead of 2.");
      return printUsage();
    }
    FileInputFormat.setInputPaths(conf, other_args.get(0));
    FileOutputFormat.setOutputPath(conf, new Path(other_args.get(1)));

    JobClient.runJob(conf);
    return 0;
  }


public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new WordCount(), args);
    System.exit(res);
  }
}

Notice that the JobConf object is responsible for most of the configuration: the number of mappers and reducers, the input and output types, the job name, and much more.
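For example, once the program is compiled and packaged (covered below), the optional flags parsed in run() can be supplied on the command line; the jar name here is just a placeholder:

% hadoop jar wordcount.jar WordCount -m 4 -r 2 input.txt output_dir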

Building the Mapper class

The idea behind the mapper class is that it takes in a row of input and emits key-value pairs. The mapper is where the parsing will usually happen. These key-value pairs are then picked up by the reducer and acted upon.

/**
 * Counts the words in each line.
 * For each line of input, break the line into words and emit them as
 * (word, 1).
 */
public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {

    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

Building the Reduce class

The reducer receives a key and a list of values. In Hadoop's shuffle and sort phase, all values belonging to a particular key are grouped together into a list. In the reduce phase we receive that list along with the key it belongs to, and we usually loop through the list and perform some operation on the individual values. Whatever we finally emit from the class is written to one of the part-* output files (e.g. part-00000).


/**
 * A reducer class that just emits the sum of the input values.
 */
public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

Compiling the Hadoop Project

Compiling using the terminal

Compiling a Hadoop Java project is pretty straightforward once you find the magic command, which is hadoop classpath. If you created the WordCount.java file above, open up the terminal, cd into the folder, and execute the Java compiler with the proper classpath.

$ javac WordCount.java -cp $(hadoop classpath)

The hadoop classpath command provides the compiler with all the paths it needs to compile correctly, and you should see a resulting WordCount.class (along with the WordCount$MapClass.class and WordCount$Reduce.class inner classes) appear in the directory.
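To actually run a job compiled this way, you can package the resulting classes into a jar yourself and hand it to hadoop jar. A minimal sketch, where the jar name and the input/output paths are just examples:

$ jar cf wordcount.jar WordCount*.class
$ hadoop jar wordcount.jar WordCount input.txt wordcount_output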

Compiling using Maven

When using Maven there's a very specific directory structure required and a little setup within the pom.xml. I've created a sample Github repository that has a completely working version. The fastest way to get going is to clone the repo and just explore it. The source WordCount file is located under the src/main/java/com/qfa path.

$ git clone https://github.com/marek5050/Hadoop_Examples
$ cd Hadoop_Examples
$ mvn install
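The interesting part of the pom.xml is the Hadoop dependency and, since the jar is run below without naming a main class, a manifest entry pointing at it. A minimal sketch with an illustrative version number and class name; check the repo's pom.xml for the exact values:

<dependencies>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.6.0</version>
  </dependency>
</dependencies>

<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-jar-plugin</artifactId>
      <configuration>
        <archive>
          <manifest>
            <mainClass>com.qfa.WordCount</mainClass>
          </manifest>
        </archive>
      </configuration>
    </plugin>
  </plugins>
</build>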

Running a Hadoop Project

Once we finish creating the Hadoop java file and packaging it using Maven, we can test out the jar file using:

% hadoop jar ./target/bdp-1.3.jar dataSet3.txt  dataOutput1
  • bdp-1.3.jar is the name of the jar file generated by Maven.
  • dataSet3.txt is the data file we uploaded using put (see Uploading Data Files below).
  • dataOutput1 will be the folder where the results are written.
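Once the job finishes you can also inspect the results straight from the terminal rather than the Web GUI, for example (assuming the job wrote to dataOutput1 as above):

% hdfs dfs -ls dataOutput1
% hdfs dfs -cat dataOutput1/part-*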

An easier way of running the project is by creating a script; let's call it run. Create the file in the Maven project directory with the following contents:

#!/bin/bash
hadoop jar ./target/bdp-1.3.jar \
  dataSet3.txt $(date +%s)

Close and save the file.

% chmod +x ./run     # make it executable
% ./run              # execute

Now, after we package the new jar file using Maven, we just run the Hadoop job using ./run, and the job executes and outputs the results into a folder with a name like 1459088800. That number is the number of seconds since the Unix epoch (January 1, 1970), which is what date +%s prints. The great side effect is that the newest folder will always be the last one listed and the names will always be unique, so there's no need to track folder names.

After the job runs we just open up the Web GUI and download the resulting file.

Managing the filesystem HDFS

The “hadoop dfs” command was deprecated, and filesystem management is now done purely with “hdfs dfs”. Some of the basic HDFS commands are:

% hdfs dfs
> Usage: hdfs dfs [generic options] [command] ...
> -put          copy a file from the local filesystem into HDFS
> -cp           copy files from src to dest
> -cat          print the contents of a file
> -ls           list the files in a directory
> -mkdir        create a directory
> -mv           move (rename) files
> -rm           remove a file
> -rmdir        remove a directory

Uploading Data Files

To transfer data files into HDFS use either put or copyFromLocal. If the dst parameter is missing, the default will be the user's home directory, /user/<username>/.

hdfs dfs -put <localsrc> <dst>
hdfs dfs -copyFromLocal <localsrc> <dst>
hdfs dfs -put book.txt

Verify the file was added using

hdfs dfs -ls <dir>
hdfs dfs -ls
