setting-up-hadoop-on-a-mac
2011-12-01T07:42:38.000Z
I was recently given the task of reporting which requested URLs on our site had the most 404 response codes. As I started to dig into our logs, I noticed we had lots of them, the smallest being 9 gigs for one month.
Because of the size of the logs, I didn’t want to parse every single one with a script, and I remembered we were already doing something similar for another application via Hadoop. I had used Hadoop before and remembered how difficult it was to install on my PC.
This time around I was on a Mac, Hadoop is more mature, and there is better documentation, so I gave it another shot. After some digging I found a few sites with similar instructions, but none had a complete and simple-to-follow plan. So here is the plan I used to get a local pseudo-distributed Hadoop dev environment working on my Mac:
Download Hadoop 0.20.0 from the Apache Hadoop downloads page
Untar the Hadoop file; you can simply double-click it in your Downloads folder and it will be extracted to a folder
Copy that Hadoop folder to wherever you want to install Hadoop. I put mine under my home directory at
workspace/java/projects/hadoop
Next we need to tell Hadoop where our Java home is. On my Mac it is under:
/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home
I set that value for JAVA_HOME in hadoop/conf/hadoop-env.sh:
# The java implementation to use. Required.
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home
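If you are not sure where your Java home lives, running the following should print it on most recent versions of OS X:
/usr/libexec/java_home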
Next we need to create an RSA key to be used by Hadoop when SSHing to localhost:
ssh-keygen -t rsa -P ""
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Next we need to test that this works by issuing the following command:
ssh localhost
This should let you log in with no password. If it does not work, you need to start the SSH server that comes with your Mac: open System Preferences, click Sharing, then check the box next to “Remote Login”.
WARNING:
Be sure to turn it off when you are not using Hadoop.
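If you prefer the command line, the same Remote Login setting should be toggleable with systemsetup (requires sudo, and it may ask for confirmation when turning it off):
sudo systemsetup -setremotelogin on
sudo systemsetup -setremotelogin off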
Next we need to set some more configuration values under
hadoop/conf
- core-site.xml (replace TEMPORARY-DIR-FOR-HADOOP-DATASTORE with a directory of your choice; see the example after these files):
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>TEMPORARY-DIR-FOR-HADOOP-DATASTORE</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
  </property>
</configuration>
- mapred-site.xml:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:54311</value>
  </property>
</configuration>
- hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
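For the hadoop.tmp.dir value in core-site.xml, any directory your user can write to will do. The path below is just an example; swap in your own username and location:
<property>
  <name>hadoop.tmp.dir</name>
  <value>/Users/yourusername/hadoop-datastore</value>
</property>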
Next we need to format the Hadoop file system. From the Hadoop directory, run the following:
./bin/hadoop namenode -format
We are now ready to fire up our Hadoop system. Start Hadoop by running the following script:
./bin/start-all.sh
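To check that everything came up, jps (which ships with the JDK) should list the Hadoop daemons, something like NameNode, DataNode, SecondaryNameNode, JobTracker, and TaskTracker:
jps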
Next we want to run an example job to make sure our setup works. We will test using the wordcount job that comes with the Hadoop distribution. First create a text file; I created one under
workspace/java/projects/
using
vi test.txt
Then I populated it with some sample text. Next we need to copy it over to Hadoop’s file system by executing the following command from the Hadoop directory:
./bin/hadoop fs -put ../test.txt test_wordcount
You can verify that the file is now in the Hadoop file system by using:
./bin/hadoop fs -ls
You should see it there now. We now want to run a job to process our test.txt file; we will run the wordcount job by using:
./bin/hadoop jar hadoop-examples-0.20.203.0.jar wordcount test_wordcount wordcount_out
This will run the wordcount job; the results will be placed in the wordcount_out directory on HDFS. You can verify the results by using:
./bin/hadoop fs -lsr
You should see a file under wordcount_out/part-r-00000 which contains the results. The results can be viewed by using:
./bin/hadoop fs -cat wordcount_out/part-r-00000
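And since the whole reason for this setup was the 404 report, here is a rough sketch of the kind of MapReduce job you could write for it. This is just an illustration, not the exact job I ran: it assumes Apache combined log format, where the request path is the 7th whitespace-separated field and the status code the 9th, so adjust the parsing to match your own logs.
// A rough sketch of a 404-counting job for access logs.
// Assumes combined log format: request path is field 7, status code is field 9.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NotFoundCount {

  // Emits (url, 1) for every log line that returned a 404.
  public static class NotFoundMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text url = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split(" ");
      if (fields.length > 8 && "404".equals(fields[8])) {
        url.set(fields[6]);
        context.write(url, ONE);
      }
    }
  }

  // Sums up the counts for each url.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    // args[0] is the log directory on HDFS, args[1] the output directory.
    Job job = new Job(new Configuration(), "404 count");
    job.setJarByClass(NotFoundCount.class);
    job.setMapperClass(NotFoundMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
Build it into a jar, put your logs on HDFS with fs -put as above, and run it with ./bin/hadoop jar the same way as the wordcount example.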
You should now have a working local pseudo-distributed Hadoop environment to play with. Happy processing! 😃