FlumeJava: easy, efficient data-parallel pipelines

DOI: 10.1145/1806596.1806638
Authors: Craig Chambers, Ashish Raniwala, Frances Perry, Stephen Adams, Robert R. Henry, ...

    • ...First, unlike interactive web requests [23], data parallel jobs have complex internal structure with operations (e.g., map, reduce, join, etc.) which feed data from one to the other [6, 7]. Jobs are written in SCOPE [6], a mash-up language with both declarative and imperative elements similar to Pig [17] or HIVE [22]. A compiler translates the job into an execution plan graph wherein nodes represent stages such as map, reduce or join, and edges represent dataflow [7, 9, 12]. For example, with D = 3 minutes, a deadline of 60 minutes is treated as a deadline of 57 minutes, and the policy won't act unless the job is at least 3 minutes delayed. The job's impact on the cluster is measured as the fraction of job allocation requested by the policy that ... Table (stat for jobs A, B, C, D, E, F, G): vertex runtime median [sec]: 16.3, 4.0, 2.6, 6.1, 8.0, 3.6, 3.0; vertex runtime 90th percentile [sec]: 61.5, 54.1, 5.7, 25.1, 130.0, 17.4, 7.7; vertex runtime 90th percentile [sec] (fastest stage): 4.0, 3.3, 1.7, 1.4, 3.9, 3.3, 1.6; vertex runtime 90th ... Notice that jobs using the max allocation policy finish significantly before the deadline - the median such job finishes approximately 70% early - which translates to a large impact on the rest of the cluster; jobs under the other three policies finish much closer to the deadline. Table (statistic for training, job 1, job 2): total work [hours]: 12.7, 23.5, 18.5; queueing median [sec]: 5.8, 6.8, 6.9; queueing 90th perc. ... In the runs where we doubled or tripled the deadline, the policy released 63% or 83% (respectively) of the allocated resources on average. See two example runs in Fig. 7. (a) Deadline changed from 140 to 70 minutes. Figure 7. Examples of two experiments with changing deadlines. The results are summarized in Fig. 11. Running our policy with no hysteresis and no dead zone, results in meeting only 57% of the SLOs, while using the hysteresis with no dead zone, meets 90% of the SLOs. Hence, we believe Jockey is a better match for production DAG-like frameworks such as Hive [22], Pig [17], and Ciel [16]...

    Andrew D. Ferguson et al. Jockey: guaranteed job latency in data parallel clusters

    • ...In addition, FlumeJava [9] is a Java library for programming and managing MapReduce pipelines that proposes new parallel-collection abstractions, does deferred evaluation, and optimizes the data flow graph of an execution plan internally before executing...

    Zhenyu Guo et al. Spotting Code Optimizations in Data-Parallel Pipelines through PeriSCO...
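
The deferred evaluation mentioned in the excerpt above can be illustrated with a minimal Python sketch. This is not FlumeJava's API (FlumeJava is a Java library); the class name LazyCollection and the methods parallel_do, optimized_plan and run are hypothetical names chosen for illustration. The sketch assumes a linear chain of element-wise operations and shows only one optimization, fusing that chain into a single stage before anything executes.

```python
class LazyCollection:
    """Toy deferred collection (hypothetical, not the FlumeJava API).

    Operations are recorded in a plan instead of being executed immediately.
    """

    def __init__(self, source, ops=()):
        self.source = list(source)
        self.ops = tuple(ops)  # recorded plan: a chain of element-wise functions

    def parallel_do(self, fn):
        # Deferred: return a new collection with a longer plan; no data is touched.
        return LazyCollection(self.source, self.ops + (fn,))

    def optimized_plan(self):
        # Fuse the whole chain of element-wise functions into one stage,
        # so the data is traversed once instead of once per operation.
        if not self.ops:
            return []

        def fused(x, fns=self.ops):
            for f in fns:
                x = f(x)
            return x

        return [fused]

    def run(self):
        # Execution happens only here, against the optimized plan.
        data = self.source
        for stage in self.optimized_plan():
            data = [stage(x) for x in data]
        return data


pipeline = (
    LazyCollection([1, 2, 3])
    .parallel_do(lambda x: x + 1)
    .parallel_do(lambda x: x * 10)
)
print(len(pipeline.ops))               # 2 recorded operations
print(len(pipeline.optimized_plan()))  # 1 fused stage
print(pipeline.run())                  # [20, 30, 40]
```

Building the plan first is what gives an optimizer the chance to rewrite it; a real system would operate on a general dataflow graph rather than a linear chain.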

    • ...MadLINQ embeds a set of domain-specific language constructs into a general-purpose programming language (C#), similar to the approach taken by DryadLINQ and FlumeJava [15] for data-parallel programming...

    Zhengping Qian et al. MadLINQ: large-scale distributed matrix computation for the cloud

    • ...Recently, declarative programming has found an important application in data-center programming: systems such as MapReduce [10], DryadLINQ [32] and FlumeJava [7] allow users to compose a declarative specification of an application's logic, and execute it across hundreds or thousands of machines. In particular, DryadLINQ [32, 33], FlumeJava [7] and Pig [26] optimize many operations, including aggregation, joins and sorting, by applying high-level transformations on the query operator graph. DryadLINQ implements this functionality using SelectMany [32], and similar operators exist in FlumeJava [7] and Pig Latin [26], both of which execute on a MapReduce cluster. Chambers et al. described FlumeJava, which (like DryadLINQ) uses lazy evaluation to build a distributed execution plan from a graph of operators [7]...

    Derek Gordon Murray et al. Steno: automatic optimization of declarative queries

    • ...Recently, Google has proposed FlumeJava [16], a Java library that helps users to express parallel operations over distributed collections, which are internally compiled into a MapReduce dataflow plan...

    Alexander Behm et al. ASTERIX: towards a scalable, semistructured data platform for evolving...
