Computing, in its purest form, has changed hands multiple times. In the beginning, mainframes were predicted to be the future of computing. Indeed, mainframes and large-scale machines were built and used, and in some circumstances similar machines are still used today. The trend, however, turned from bigger and more expensive machines to smaller and more affordable commodity PCs and servers.
Today, most of our data is stored on local networks, on servers that may be clustered and share storage. This approach has had time to mature into a stable architecture that provides decent redundancy when deployed correctly. An emerging technology, cloud computing, has arrived demanding attention and is quickly changing the direction of the technology landscape. Whether it is Google’s unique and scalable Google File System or Amazon’s robust S3 cloud storage model, it is clear that cloud computing has arrived with much to be learned from it.
Cloud computing is a style of
computing in which dynamically scalable and often virtualized resources
are provided as a service over the Internet. Users need not have
knowledge of, expertise in, or control over the technology
infrastructure in the “cloud” that supports them.
Need for large-scale data processing
We live in the data age. It is not easy to measure the total volume of data stored electronically, but an IDC estimate put the size of the “digital universe” at 0.18 zettabytes in 2006 and forecast a tenfold growth to 1.8 zettabytes by 2011.
Some areas that need large-scale data processing include:
• The New York Stock Exchange generates about one terabyte of new trade data per day.
• Facebook hosts approximately 10 billion photos, taking up one petabyte of storage.
• Ancestry.com, the genealogy site, stores around 2.5 petabytes of data.
• The Internet Archive stores around 2 petabytes of data, and is growing at a rate of 20 terabytes per month.
• The Large Hadron Collider near Geneva, Switzerland, will produce about 15 petabytes of data per year.
The problem is that while the storage capacity of hard drives has increased massively over the years, access speeds (the rate at which data can be read from drives) have not kept up. One typical drive from 1990 could store 1370 MB of data and had a transfer speed of 4.4 MB/s, so we could read all the data from a full drive in around five minutes. Almost 20 years later, one-terabyte drives are the norm, but the transfer speed is around 100 MB/s, so it takes more than two and a half hours to read all the data off the disk. This is a long time to read everything on a single drive, and writing is even slower. The obvious way to reduce the time is to read from multiple disks at once. Imagine if we had 100 drives, each holding one hundredth of the data. Working in parallel, we could read the data in under two minutes. This shows the significance of distributed computing.
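As a rough sanity check on these figures, the following minimal Python sketch computes the read times using the approximate capacities and transfer speeds quoted above; the helper read_time_seconds is a hypothetical name introduced only for illustration.

# Back-of-the-envelope check of the drive read times quoted above.
# Figures are the approximate ones from the text, not exact specifications.

def read_time_seconds(capacity_mb: float, speed_mb_per_s: float, drives: int = 1) -> float:
    """Time to read the full capacity when the data is split evenly across
    `drives` disks, each read in parallel at `speed_mb_per_s`."""
    return capacity_mb / (speed_mb_per_s * drives)

# A typical 1990 drive: about 1370 MB at 4.4 MB/s.
t_1990 = read_time_seconds(1370, 4.4)
print(f"1990 drive:              {t_1990 / 60:.1f} minutes")   # ~5 minutes

# A modern 1 TB drive: roughly 1,000,000 MB at 100 MB/s.
t_modern = read_time_seconds(1_000_000, 100)
print(f"1 TB drive:              {t_modern / 3600:.1f} hours")  # ~2.8 hours

# The same terabyte spread across 100 drives read in parallel.
t_parallel = read_time_seconds(1_000_000, 100, drives=100)
print(f"100 drives in parallel:  {t_parallel / 60:.1f} minutes")  # ~1.7 minutes

Running the sketch reproduces the numbers in the text: roughly five minutes for the 1990 drive, close to three hours for a full terabyte on one modern drive, and under two minutes when the work is spread across 100 drives.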