just in case: hadoop

четверг, 19 февраля 2015 г.

Debug Beeline Client for Hive

Ярлыки: beeline , debug , english , hadoop , hive , hortonworks , java , remote debug

Just a note how to enable debug mode in beeline (or any other) Hive client.

To enable remote debugging, we need to pass "-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=5005" arguments to JVM.

Tricky part is to find the place where JVM is being executed. It's file $HADOOP_HOME/hive-client/bin/ext/beeline.sh - in Hortonworks (HDP) installation it will be /usr/hdp/current/hive-client/bin/ext/beeline.sh

On the line
exec $HADOOP jar ${beelineJarPath} $CLASS $HIVE_OPTS "$@"

but -Xdebug option should be placed to HADOOP_CLIENT_OPTS variable:

export HADOOP_CLIENT_OPTS="$HADOOP_CLIENT_OPTS -Dlog4j.configuration=beeline-log4j.properties -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=5005"

-Xdebug - enables remote debugging
-Xrunjdwp - sets configuration properties, where
server - start server or try to connect to debugger (usually server=y to do remote debug via Idea)
suspend - freeze start of program and wait until debugger connects to it
address - port of server you will connect to

Now if you just start # beeline
you'll see that debug server started on port 5005, and you can connect to it via Idea (or whatever).

понедельник, 28 апреля 2014 г.

UDF for Exponential moving average in Pig Latin

Ярлыки: big data , english , hadoop , pig , programming , statistics

Today I faced with the fact that there's no native way to calculate moving average in Pig.

For example:

A = {(5, 1), (2, 2), (7, 3), (4, 4)}
And we need to calculate EMA of first field, with weight of second field. alpha=0.5.

ema(A) = (5*1 + 2*0.5 + 7*0.25 + 4*0.125) / (1 + 0.5 + 0.25 + 0.125) = 4.4

In Pig and Python UDF it will be like this:

REGISTER 'python_udf.py' USING jython AS myfuncs;

B = GROUP A ALL;
C = FOREACH times {
    GENERATE A as src,
            myfuncs.EMA(A, 1, 3, 0.5) as ema;
}

DUMP C;

UDF:

@outputSchema("value:double")
def EMA(D, weight_field, wmax, alpha):
    """
    Calculates exponential moving average
    note: weights are reversed!
    """
    weights = [x for x in range(1, wmax+1)]
    weights_values = {}
    wv = 1.0
    for w in weights:
        weights_values[w] = wv
        wv *= alpha
    denom = sum(weights_values.values())
    numer = 0.0
    for weight in weights:
        numer += sum(1 for x in D if x[weight_field] == weight)*weights_values[weight]
    return numer/denom

Pretty straightforward, but it works. If you know more elegant way, please share it!

четверг, 19 февраля 2015 г.

Debug Beeline Client for Hive

понедельник, 28 апреля 2014 г.

UDF for Exponential moving average in Pig Latin

четверг, 19 февраля 2015 г.

понедельник, 28 апреля 2014 г.