Humanities and Big Data. Exploiting Digital Archives in the Age of Abundance
pairs of different key-value pairs. Next, each key from the previous step is
paired with all the values associated with that key. Finally, the Reduce
function is applied to these pairs, producing the final list of data. Each Map
and Reduce call can run independently and is usually executed by an individual
computer in the cluster.
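
The three steps described above can be illustrated with a minimal sketch. The word-count task, the function names (map_fn, shuffle, reduce_fn, map_reduce) and the sample documents are all assumptions made for illustration and do not come from the text. A single Python process stands in for the cluster, so this shows only the data flow and not a distributed implementation.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Map: turn one input pair (doc_id, text) into intermediate (word, 1) pairs."""
    return [(word.lower(), 1) for word in text.split()]


def shuffle(pairs):
    """Grouping step: pair each key with all the values emitted for that key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_fn(word, counts):
    """Reduce: collapse all values for one key into a single output pair."""
    return (word, sum(counts))


def map_reduce(documents):
    # Each map_fn call depends only on its own document, so on a real
    # cluster these calls could run on different machines.
    intermediate = []
    for doc_id, text in documents.items():
        intermediate.extend(map_fn(doc_id, text))

    grouped = shuffle(intermediate)

    # Each reduce_fn call depends only on one key and its values,
    # so these calls are independent of each other as well.
    return sorted(reduce_fn(word, counts) for word, counts in grouped.items())


# Hypothetical input: two tiny "documents" keyed by an identifier.
documents = {"a": "big data big archives", "b": "digital archives"}
result = map_reduce(documents)
print(result)
```

Because no Map or Reduce call shares state with another, the loops in map_reduce could be handed to separate workers without changing the result.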
The above model of execution proved to be very popular. The first
implementations were proprietary and used internally (but extensively) by
Google, but soon an open source version of the framework, called Hadoop, was
developed by the Apache Software Foundation and implemented by many large
organizations worldwide; notable examples include Yahoo, Twitter, and Facebook.
To facilitate processing of the data stored in large clusters by
non-programmers, additional tools quickly appeared that allow so-called “data
scientists” to perform database analysis using simple languages similar to SQL.
Examples include the Apache Pig and Apache Hive projects, which hide the
complexity of the MapReduce framework behind simple languages called Pig Latin
and HiveQL, respectively.
Obviously, creating and maintaining a large computing infrastructure is both
time-consuming and costly. Initially, therefore, only large corporations could
benefit from the new capabilities of the tools described above. However, it
soon turned out that the same properties of the framework that make it
resilient to hardware failure, namely the isolation of individual Map and
Reduce computations, also make it especially well-suited to sharing. Thanks to
this, several operators of large, underutilized compute clusters started to
sell computer time, effectively renting out individual MapReduce nodes together
with the storage space needed for the data being analyzed. This turned out to
be highly popular, as it allowed many organizations to perform large-scale
computations without investing in very expensive hardware. Even large
experiments, involving datasets describing, e.g., the topology of the entire
World Wide Web, can thus be run on such rented infrastructure at a cost of
several hundred dollars, versus a budget several orders of magnitude larger
required to purchase and set up such a compute cluster. Perhaps even more
importantly, such “virtual” infrastructure is easily scalable: if an
organization's data processing needs keep growing, it can simply rent more
compute nodes without buying them. In short, thanks to this, perhaps for the
first time in history, large-scale computing capabilities became relatively
easily available to almost everyone, including scientists without large budgets
for computing infrastructure, such as humanists.