The official Hadoop homepage says: “The Apache™ Hadoop™ project develops open-source software for reliable, scalable, distributed computing. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures…”
I suggest the following readings:
- http://blogs.ischool.berkeley.edu/i290-abdt-s12/files/2012/08/BillGraham_IntroToHadoop_Aug30.pdf
- http://files.cloudera.com/pdf/whitepaper/Using-Cloudera-to-Improve-Data-Processing_WP_2012-09.pdf