Just a note how to enable debug mode in beeline (or any other) Hive client.
To enable remote debugging, we need to pass "-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=5005" arguments to JVM.
Tricky part is to find the place where JVM is being executed. It's file $HADOOP_HOME/hive-client/bin/ext/beeline.sh - in Hortonworks (HDP) installation it will be /usr/hdp/current/hive-client/bin/ext/beeline.sh
On the line
exec $HADOOP jar ${beelineJarPath} $CLASS $HIVE_OPTS "$@"
but -Xdebug option should be placed to HADOOP_CLIENT_OPTS variable:
export HADOOP_CLIENT_OPTS="$HADOOP_CLIENT_OPTS -Dlog4j.configuration=beeline-log4j.properties -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=5005"
-Xdebug - enables remote debugging
-Xrunjdwp - sets configuration properties, where
server - start server or try to connect to debugger (usually server=y to do remote debug via Idea)
suspend - freeze start of program and wait until debugger connects to it
address - port of server you will connect to
Now if you just start # beeline
you'll see that debug server started on port 5005, and you can connect to it via Idea (or whatever).
Показаны сообщения с ярлыком hadoop. Показать все сообщения
Показаны сообщения с ярлыком hadoop. Показать все сообщения
четверг, 19 февраля 2015 г.
понедельник, 28 апреля 2014 г.
UDF for Exponential moving average in Pig Latin
Ярлыки:
big data
,
english
,
hadoop
,
pig
,
programming
,
statistics

Today I faced with the fact that there's no native way to calculate moving average in Pig.
For example:
A = {(5, 1), (2, 2), (7, 3), (4, 4)}
And we need to calculate EMA of first field, with weight of second field. alpha=0.5.
ema(A) = (5*1 + 2*0.5 + 7*0.25 + 4*0.125) / (1 + 0.5 + 0.25 + 0.125) = 4.4
In Pig and Python UDF it will be like this:
REGISTER 'python_udf.py' USING jython AS myfuncs;
B = GROUP A ALL;
C = FOREACH times {
GENERATE A as src,
myfuncs.EMA(A, 1, 3, 0.5) as ema;
}
DUMP C;
UDF:
@outputSchema("value:double")
def EMA(D, weight_field, wmax, alpha):
"""
Calculates exponential moving average
note: weights are reversed!
"""
weights = [x for x in range(1, wmax+1)]
weights_values = {}
wv = 1.0
for w in weights:
weights_values[w] = wv
wv *= alpha
denom = sum(weights_values.values())
numer = 0.0
for weight in weights:
numer += sum(1 for x in D if x[weight_field] == weight)*weights_values[weight]
return numer/denom
Pretty straightforward, but it works. If you know more elegant way, please share it!
Подписаться на:
Сообщения
(
Atom
)