Setting Up Hadoop on a Mac

2011-12-01T07:42:38.000Z
Tags: hadoop setup

I was recently given the task of reporting which requested URLs on our site had the most 404 response codes. As I started to dig into our logs I noticed we had a lot of them, the smallest being 9 GB for one month.

I didn’t want to parse every single log with ad-hoc scripts because of their size, and I remembered we were already doing something similar for another application via Hadoop. I had used Hadoop before and remembered how difficult it was to install on my PC.

This time around I was on a Mac, Hadoop is more mature, and the documentation is better, so I gave it another shot. After some digging I found a few sites with similar instructions, but none had a complete, simple-to-follow plan. So here is the plan I used to get a local pseudo-distributed Hadoop dev environment working on my Mac:

  1. Download Hadoop 0.20.203.0 (the release whose examples jar we use in step 10) from the Apache Hadoop releases page

  2. Untar the Hadoop file. You can simply double-click it in your Downloads folder and it will be extracted to a folder.

  3. Copy that hadoop folder to wherever you want to install Hadoop. I put mine under my home directory at workspace/java/projects/hadoop
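    If you prefer the terminal for steps 2 and 3, the extract-and-move can be done in one shot. A minimal sketch, assuming the 0.20.203.0 tarball name and my install path (adjust both to taste):

      # extract the downloaded tarball straight into the target parent directory
      tar xzf ~/Downloads/hadoop-0.20.203.0.tar.gz -C ~/workspace/java/projects/
      # rename the versioned folder to a plain "hadoop" directory
      mv ~/workspace/java/projects/hadoop-0.20.203.0 ~/workspace/java/projects/hadoop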

  4. Next we need to tell Hadoop where our Java home is. On my Mac it is under /System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home

I set that value for JAVA_HOME in hadoop/conf/hadoop-env.sh:

# The java implementation to use. Required.
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home
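If you're not sure where your Java home lives, macOS ships a small helper that prints the active JDK's home directory; it is worth running as a sanity check before editing hadoop-env.sh:

    # prints the current JDK home, e.g. the 1.6.0 path used above
    /usr/libexec/java_home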
  5. Next we need to create an RSA key to be used by Hadoop when ssh’ing to localhost:

    ssh-keygen -t rsa -P ""
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

    Test that this works by issuing ssh localhost, which should log you in without a password. If it does not, you need to start the SSH server that comes with your Mac: open System Preferences, click Sharing, then check the box next to “Remote Login”.

    WARNING

    Be sure to turn it off when you are not using Hadoop.
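    If you'd rather not click through System Preferences each time, the same switch can be flipped from the terminal. A sketch, assuming an admin account (systemsetup requires sudo):

      # enable Remote Login (the built-in SSH server)
      sudo systemsetup -setremotelogin on
      # turn it back off when you are done with Hadoop
      sudo systemsetup -setremotelogin off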

  6. Next we need to set a few more configuration files under hadoop/conf (see the note on the temp-dir placeholder after the listings):

    • core-site.xml:
    <configuration>
      <property>
        <name>hadoop.tmp.dir</name>
        <value>TEMPORARY-DIR-FOR-HADOOP-DATASTORE</value>
      </property>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:54310</value>
      </property>
    </configuration>
    
    • mapred-site.xml:
    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>localhost:54311</value>
      </property>
    </configuration>
    
    • hdfs-site.xml:
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    </configuration>
    
  7. Next we need to format the Hadoop file system. From the hadoop directory, run the following:

    ./bin/hadoop namenode -format

  8. We are now ready to fire up our Hadoop system. Run Hadoop with the following script:

    ./bin/start-all.sh
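    To confirm everything actually came up, jps (a process lister that ships with the JDK) should show the Hadoop daemons. A quick check; the daemon names below are those of the 0.20 line:

      jps
      # expect NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker
      # (plus Jps itself); if any are missing, check the logs under hadoop/logs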

  9. Next we want to run an example job to make sure our setup works. We will test using the wordcount job that comes with the Hadoop distribution. First create a text file; I created one under workspace/java/projects/ using vi test.txt and populated it with some sample text. Next we need to copy it over to Hadoop’s file system by executing the following command from the hadoop directory:

    ./bin/hadoop fs -put ../test.txt test_wordcount

    You can verify that the file is now in the Hadoop file system with ./bin/hadoop fs -ls; you should see it there.
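    If you'd rather script the sample file than open vi, a few lines from the hadoop directory do the same thing (the text itself is arbitrary; repeated words make the counts more interesting):

      # write some repetitive sample text one level up, then load it into HDFS
      echo "hello hadoop hello world world world" > ../test.txt
      ./bin/hadoop fs -put ../test.txt test_wordcount
      ./bin/hadoop fs -ls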

  10. We now want to run a job to process our test.txt file. We will run the wordcount job with:

    ./bin/hadoop jar hadoop-examples-0.20.203.0.jar wordcount test_wordcount wordcount_out

    This runs the wordcount job and places the results in the wordcount_out directory on HDFS. You can verify the results with ./bin/hadoop fs -lsr; you should see a file at wordcount_out/part-r-00000 which contains the results. View the results with:

    ./bin/hadoop fs -cat wordcount_out/part-r-00000
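    If you want the results as a local file instead of cat'ing them from HDFS, fs -get copies them out. A sketch (the local filename is my own choice):

      # copy the reducer output out of HDFS into the current directory
      ./bin/hadoop fs -get wordcount_out/part-r-00000 ./wordcount_results.txt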

You should now have a working local pseudo-distributed Hadoop environment to play with. Happy processing! 😃
