parkour.mapreduce documentation

*context*

The task context.  Only bound during the dynamic scope of a task.

collfn

(collfn v)
Task function adapter for collection-function-like functions.  The adapted
function `v` should accept conf-provided arguments followed by the (unwrapped)
input tuple source, and should return a reducible collection of output tuples.
If `v` has metadata for the `::mr/source-as` or `::mr/sink-as` keys, the
function input and/or output will be re-shaped as specified via the associated
metadata value.
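
A minimal sketch of a collfn-style task function, loosely following Parkour's
word-count example; the metadata reshapes the input to bare values and the
output to key/value pairs:

```clojure
(require '[clojure.core.reducers :as r]
         '[clojure.string :as str]
         '[parkour.mapreduce :as mr])

;; Split each input line into words, emitting each word with a count of 1.
(defn word-count-m
  {::mr/source-as :vals, ::mr/sink-as :keyvals}
  [coll]
  (->> coll
       (r/mapcat #(str/split % #"\s+"))
       (r/map #(vector % 1))))
```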

combiner!

(combiner! conf var & args)
As per `reducer!`, but allocate and configure for the Hadoop combine step,
which may impact e.g. output types.

contextfn

(contextfn v)
Task function adapter for functions accessing the job context.  The adapted
function `v` should accept a configuration followed by any conf-provided
arguments, and should return a function.  The returned function should accept
the job context and an (unwrapped) input tuple source, and should return a
reducible collection of output tuples.
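
As a hedged sketch, a contextfn-style task which increments a Hadoop counter
per input tuple; the counter group and name are illustrative:

```clojure
(require '[clojure.core.reducers :as r]
         '[parkour.mapreduce :as mr])

;; Select the contextfn adapter via metadata; close over the conf and
;; return the (context, input) -> output-collection task function.
(defn counting-m
  {::mr/adapter mr/contextfn}
  [conf]
  (fn [context input]
    (->> (mr/source-as :keyvals input)
         (r/map (fn [[k v]]
                  (.increment (.getCounter context "app" "tuples") 1)
                  [k v])))))
```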

counters-map

(counters-map counters)
Translate job `counters` into a nested Clojure map of strings to counts.
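
A usage sketch, assuming a completed Hadoop `Job` instance:

```clojure
(require '[parkour.mapreduce :as mr])

;; Nested map of counter-group name to {counter-name count}
(mr/counters-map (.getCounters job))
```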

input-format!

(input-format! conf svar sargs rvar rargs)
Allocate and return a new input format class for `conf` which invokes `svar`
to generate input splits and `rvar` to generate record readers.

During local job initialization, the function referenced by `svar` will be
invoked with the job context followed by any provided `sargs` (which must be
EDN-serializable); it should return a sequence of input split data.  Any values
in the returned sequence which are not `InputSplit`s will be wrapped in
`EdnInputSplit`s and must be EDN-serializable; for such values, the
`::mr/length` and `::mr/locations` keys may provide the split byte-size and
node locations respectively.

Prior to use, the function referenced by `rvar` will be transformed by the
function specified as the value of `rvar`'s `::mr/adapter` metadata,
defaulting to `parkour.mapreduce/recseqfn`.  During remote task-setup, the
transformed function will be invoked with the task input split and task context
followed by any provided `rargs`; it should return a `RecordReader` generating
the task input data.

See also: `recseqfn`.
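
An end-to-end sketch: EDN split data dividing a numeric range, plus a
`recseqfn`-compatible reader function.  How the EDN data is recovered from
the runtime split is an assumption here; check the `EdnInputSplit`
implementation:

```clojure
(require '[parkour.mapreduce :as mr])

;; Split generation: invoked with the job context plus `sargs`; returns
;; EDN data maps, with ::mr/length as the split-size hint.
(defn range-splits
  [context n nsplits]
  (let [size (quot n nsplits)]
    (for [i (range nsplits)]
      {:start (* i size)
       :end   (min n (* (inc i) size))
       ::mr/length size})))

;; Record reading, via the default `recseqfn` adapter: invoked with the
;; input split and task context plus `rargs`; returns a seqable,
;; countable collection.  ASSUMPTION: the EDN split data is recovered by
;; deref'ing the split.
(defn range-reader
  [split context]
  (let [{:keys [start end]} @split]
    (vec (range start end))))

(mr/input-format! conf #'range-splits [1000 4] #'range-reader [])
```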

job

(job)(job conf)
Return a new Hadoop `Job` instance, optionally initialized with configuration
`conf`.

keygroups

(keygroups context)
Produce distinct keys from the tuples in `context`.  Deprecated.

keykeygroups

(keykeygroups context)
Produce pairs of distinct grouping keys and associated sequences of specific
keys from the tuples in `context`.  Deprecated.

keykeyvalgroups

(keykeyvalgroups context)
Produce pairs of distinct grouping keys and associated sequences of specific
keys and values from the tuples in `context`.  Deprecated.

keys

(keys context)
Produce keys only from the tuples in `context`.  Deprecated.

keysgroups

(keysgroups context)
Produce sequences of specific keys associated with distinct grouping keys
from the tuples in `context`.  Deprecated.

keyvalgroups

(keyvalgroups context)
Produce pairs of distinct grouping keys and associated sequences of values from
the tuples in `context`.  Deprecated.

keyvals

(keyvals context)
Produce pairs of keys and values from the tuples in `context`.  Deprecated.

local-runner?

(local-runner? conf)
True iff `conf` specifies the local job runner.

mapper!

(mapper! conf var & args)
Allocate and return a new mapper class for `conf` which invokes `var`.

Prior to use, the function referenced by `var` will be transformed by the
function specified as the value of `var`'s `::mr/adapter` metadata, defaulting
to `parkour.mapreduce/collfn`.  During task-setup, the transformed function will
be invoked with the job `Configuration` and any provided `args` (which must be
EDN-serializable); it should return a function of one argument, which will be
invoked with the task context to execute the task.

See also: `collfn`, `contextfn`.
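
A wiring sketch, reusing `word-count-m` from the `collfn` example above and
`word-count-r` from the `reducer!` entry below; passing the `Job` itself
where a `conf` is expected assumes Parkour's configuration functions accept
any configuration-bearing object:

```clojure
(let [job (mr/job conf)]
  (doto job
    (mr/set-mapper   (mr/mapper!   job #'word-count-m))
    (mr/set-combiner (mr/combiner! job #'word-count-r))
    (mr/set-reducer  (mr/reducer!  job #'word-count-r))))
```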

partfn

(partfn v)
Partitioner function adapter for value-based partitioners.  The adapted
function `v` should accept a configuration followed by any conf-provided
arguments, and should return a function.  The returned function should accept
an (unwrapped) tuple key, an (unwrapped) tuple value, and a partition-count;
it should return an integer mod the partition-count, and may optionally be
primitive-hinted as `OOLL`.
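
A minimal sketch of a partfn-adapted var, selected via `::mr/adapter`
metadata; it receives the conf and returns the `OOLL`-hinted partitioning
function:

```clojure
(require '[parkour.mapreduce :as mr])

;; Route tuples to reducers by hash of the (unwrapped) key.
(defn hash-part
  {::mr/adapter mr/partfn}
  [conf]
  (fn ^long [k v ^long nparts]
    (mod (hash k) nparts)))
```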

partitioner!

(partitioner! conf var & args)
Allocate and return a new partitioner class for `conf` which invokes `var`.

Prior to use, the function referenced by `var` will be transformed by the
function specified as the value of `var`'s `::mr/adapter` metadata, defaulting
to `(comp parkour.mapreduce/partfn constantly)`.  During task-setup, the
transformed function will be invoked with the job `Configuration` and any
provided `args` (which must be EDN-serializable); it should return a function of
three arguments: a raw map-output key, a raw map-output value, and an integral
reduce-task count.  That function will be called for each map-output tuple;
it must return an integral value mod the reduce-task count, and may optionally
be primitive-hinted as `OOLL`.

See also: `partfn`.
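
With the default adapter, the var itself is the bare three-argument function;
a sketch, assuming a `job` built via `mr/job`:

```clojure
;; Bare `OOLL` partitioner fn; assumes non-empty string keys.
(defn first-char-part ^long [k v ^long nparts]
  (mod (long (first (str k))) nparts))

(mr/set-partitioner job (mr/partitioner! job #'first-char-part))
```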

recseqfn

(recseqfn v)
Input format record-reader creation function adapter for input formats
implemented in terms of seqs.  The adapted function `v` should accept an input
split and a task context, and should return a value which is `seq`able,
`count`able, and optionally `Closeable`.
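
A minimal sketch of a compatible reader function; the fixed local path is
hypothetical:

```clojure
(require '[clojure.java.io :as io])

;; Eagerly read lines: a vector is both seqable and countable, and no
;; Closeable is needed since the reader does not outlive this call.
(defn slurped-lines
  [split context]
  (with-open [rdr (io/reader "/tmp/example.txt")]
    (vec (line-seq rdr))))
```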

reducer!

(reducer! conf var & args)
Allocate and return a new reducer class for `conf` which invokes `var`.

Prior to use, the function referenced by `var` will be transformed by the
function specified as the value of `var`'s `::mr/adapter` metadata, defaulting
to `parkour.mapreduce/collfn`.  During task-setup, the transformed function will
be invoked with the job `Configuration` and any provided `args` (which must be
EDN-serializable); it should return a function of one argument, which will be
invoked with the task context to execute the task.

See also: `collfn`, `contextfn`.
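
The matching reduce-side sketch for the `collfn` word-count example, usable
for both the combine and reduce steps:

```clojure
(require '[clojure.core.reducers :as r]
         '[parkour.mapreduce :as mr])

;; Sum the per-word counts for each distinct word.
(defn word-count-r
  {::mr/source-as :keyvalgroups, ::mr/sink-as :keyvals}
  [coll]
  (r/map (fn [[word counts]]
           [word (reduce + counts)])
         coll))
```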

set-combiner

(set-combiner job cls)
Set the combiner class for `job` to `cls`.

set-mapper

(set-mapper job cls)
Set the mapper class for `job` to `cls`.

set-partitioner

(set-partitioner job cls)
Set the partitioner class for `job` to `cls`.

set-reducer

(set-reducer job cls)
Set the reducer class for `job` to `cls`.

sink

(sink coll)(sink sink coll)
Emit all tuples from `coll` to `sink`, or to `*context*` if not provided.

sink-as

(sink-as kind coll)
Annotate `coll` as containing values to sink as `kind`.  The `kind` may
either be a sinking function of two arguments (a sink and a collection) or a
keyword indicating a built-in sinking function.  Supported keywords are `:none`,
`:keys`, `:vals`, and `:keyvals`.
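
For example, annotating a collection to be sunk as bare values, then emitting
it to the bound task context:

```clojure
(require '[parkour.mapreduce :as mr])

;; Within a task, where `*context*` is bound
(mr/sink (mr/sink-as :vals ["apple" "pear" "quince"]))
```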

source-as

(source-as kind source)
Shape `source` to the collection shape `kind`.  The `kind` may either be a
source-shaping function of one argument or a keyword indicating a built-in
source-shaping function.  Supported keywords are: `:keys`, `:vals`, `:keyvals`,
`:keygroups`, `:valgroups`, `:keyvalgroups`, `:keykeyvalgroups`,
`:keykeygroups`, and `:keysgroups`.
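
For example, reshaping the bound task context into grouped key/value form
during a reduce task; the result is a reducible collection:

```clojure
(require '[clojure.core.reducers :as r]
         '[parkour.mapreduce :as mr])

(->> (mr/source-as :keyvalgroups mr/*context*)
     (r/map (fn [[k vs]] [k (reduce + vs)])))
```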

tac

(tac conf)(tac conf taid)
Return a new `TaskAttemptContext` instance using the provided configuration
`conf` and task-attempt ID `taid`.

task-ex

Atom holding any exception thrown during local execution.  Intended only for
internal use within Parkour.

valgroups

(valgroups context)
Produce sequences of values associated with distinct grouping keys from the
tuples in `context`.  Deprecated.

vals

(vals context)
Produce values only from the tuples in `context`.  Deprecated.

wrap-sink

(wrap-sink sink)(wrap-sink ckey cval sink)
Return a new tuple sink which wraps keys and values as the types `ckey` and
`cval` respectively, which should be compatible with the key and value types
of `sink`; where they are not compatible, the sink's own types are used
instead.  The returned sink wraps any sunk keys and values not already of the
correct type, then sinks them to `sink`.  With only `sink` provided, the
sink's own key and value types are used.
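
A hedged sketch, wrapping the bound task context so that plain strings and
longs are converted to Hadoop writables before being sunk; assumes the
context's key/value types are compatible with `Text`/`LongWritable`:

```clojure
(require '[parkour.mapreduce :as mr])
(import '[org.apache.hadoop.io Text LongWritable])

(let [sink (mr/wrap-sink Text LongWritable mr/*context*)]
  (mr/sink sink [["apple" 1] ["pear" 2]]))
```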