Friday, August 9, 2013

Hive interview question, seeking expert help

There is a large table, TRLOG, about 2 TB in size:
TRLOG:
CREATE TABLE TRLOG
(PLATFORM string,
USER_ID int,
CLICK_TIME string,
CLICK_URL string)
row format delimited
fields terminated by '\t';

Sample data:
PLATFORM USER_ID CLICK_TIME CLICK_URL
WEB 12332321 2013-03-21 13:48:31.324 /home/
WEB 12332321 2013-03-21 13:48:32.954 /selectcat/er/
WEB 12332321 2013-03-21 13:48:46.365 /er/viewad/12.html
WEB 12332321 2013-03-21 13:48:53.651 /er/viewad/13.html
WEB 12332321 2013-03-21 13:49:13.435 /er/viewad/24.html
WEB 12332321 2013-03-21 13:49:35.876 /selectcat/che/
WEB 12332321 2013-03-21 13:49:56.398 /che/viewad/93.html
WEB 12332321 2013-03-21 13:50:03.143 /che/viewad/10.html
WEB 12332321 2013-03-21 13:50:34.265 /home/
WAP 32483923 2013-03-21 23:58:41.123 /m/home/
WAP 32483923 2013-03-21 23:59:16.123 /m/selectcat/fang/
WAP 32483923 2013-03-21 23:59:45.123 /m/fang/33.html
WAP 32483923 2013-03-22 00:00:23.984 /m/fang/54.html
WAP 32483923 2013-03-22 00:00:54.043 /m/selectcat/er/
WAP 32483923 2013-03-22 00:01:16.576 /m/er/49.html
........................

The data above needs to be processed into a table ALLOG with the following structure:
CREATE TABLE ALLOG
(PLATFORM string,
USER_ID int,
SEQ int,
FROM_URL string,
TO_URL string)
row format delimited
fields terminated by '\t';

After the transformation, the data looks like this:
PLATFORM USER_ID SEQ FROM_URL TO_URL
WEB 12332321 1 NULL /home/
WEB 12332321 2 /home/ /selectcat/er/
WEB 12332321 3 /selectcat/er/ /er/viewad/12.html
WEB 12332321 4 /er/viewad/12.html /er/viewad/13.html
WEB 12332321 5 /er/viewad/13.html /er/viewad/24.html
WEB 12332321 6 /er/viewad/24.html /selectcat/che/
WEB 12332321 7 /selectcat/che/ /che/viewad/93.html
WEB 12332321 8 /che/viewad/93.html /che/viewad/10.html
WEB 12332321 9 /che/viewad/10.html /home/
WAP 32483923 1 NULL /m/home/
WAP 32483923 2 /m/home/ /m/selectcat/fang/
WAP 32483923 3 /m/selectcat/fang/ /m/fang/33.html
WAP 32483923 4 /m/fang/33.html /m/fang/54.html
WAP 32483923 5 /m/fang/54.html /m/selectcat/er/
WAP 32483923 6 /m/selectcat/er/ /m/er/49.html
........................
PLATFORM and USER_ID represent the platform and the user ID; SEQ is the sequence number of the user's visits in chronological order; FROM_URL and TO_URL record where the user jumped from and where they jumped to. For a user's first record on a given platform, FROM_URL is NULL (a null value).
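For intuition, the required transformation can be sketched in plain Java for a single user's time-ordered clicks (class and method names here are illustrative, not part of the question):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ClickPathDemo {
    // Turn one user's ordered click URLs into (seq, fromUrl, toUrl) rows,
    // mirroring the TRLOG -> ALLOG transformation described above.
    static List<String[]> toTransitions(List<String> clicks) {
        List<String[]> rows = new ArrayList<>();
        String prev = null;                       // first record has FROM_URL = NULL
        int seq = 1;
        for (String url : clicks) {
            rows.add(new String[]{String.valueOf(seq++), prev, url});
            prev = url;                           // this click is the next row's FROM_URL
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String[]> rows = toTransitions(
                Arrays.asList("/home/", "/selectcat/er/", "/er/viewad/12.html"));
        for (String[] r : rows)
            System.out.println(r[0] + "\t" + r[1] + "\t" + r[2]);
        // 1    null    /home/
        // 2    /home/  /selectcat/er/
        // 3    /selectcat/er/  /er/viewad/12.html
    }
}
```

The whole interview problem is doing exactly this, but per (PLATFORM, USER_ID) group and at 2 TB scale inside Hive.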


The interviewer said I need to do it in two ways:
1. Implement a Hive Generic UDF that can speed up the above transformation, and give the Hive SQL that uses this UDF to perform the ETL;
2. A pure Hive SQL ETL process that generates the ALLOG table from the TRLOG table (the result is a set of SQL statements).

My question is: how should the UDF be written?
Also, how do I approach the second question? I have no idea where to start.
Hoping an expert can help.
------ Solution ------------------------------------------
Here is a RowNumber method written in Java for you:


public class RowNumber extends org.apache.hadoop.hive.ql.exec.UDF {

    private static final int MAX_VALUE = 50;
    private static String[] comparedColumn = new String[MAX_VALUE];
    private static int rowNum = 1;

    // Returns an incrementing row number, restarting at 1 whenever the
    // value of any of the passed-in partition columns changes between
    // consecutive rows.
    public int evaluate(Object... args) {
        String[] columnValue = new String[args.length];
        for (int i = 0; i < args.length; i++)
            columnValue[i] = args[i].toString();

        // First row: remember the partition-column values.
        if (rowNum == 1) {
            for (int i = 0; i < columnValue.length; i++)
                comparedColumn[i] = columnValue[i];
        }

        // If any partition column changed, remember the new values
        // and reset the counter.
        for (int i = 0; i < columnValue.length; i++) {
            if (!comparedColumn[i].equals(columnValue[i])) {
                for (int j = 0; j < columnValue.length; j++)
                    comparedColumn[j] = columnValue[j];
                rowNum = 1;
                return rowNum++;
            }
        }
        return rowNum++;
    }
}
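To see how the stateful counter behaves, here is a quick standalone check; the logic is repeated from the class above (minus the Hive UDF superclass) so the snippet compiles outside Hive. Worth noting: because the state is static, the result is only correct when a single task sees each user's rows consecutively and in time order.

```java
public class RowNumberDemo {
    // Same logic as the RowNumber UDF above, without the Hive dependency.
    static String[] comparedColumn = new String[50];
    static int rowNum = 1;

    static int evaluate(Object... args) {
        String[] columnValue = new String[args.length];
        for (int i = 0; i < args.length; i++)
            columnValue[i] = args[i].toString();
        if (rowNum == 1)                           // first row: remember values
            for (int i = 0; i < columnValue.length; i++)
                comparedColumn[i] = columnValue[i];
        for (int i = 0; i < columnValue.length; i++) {
            if (!comparedColumn[i].equals(columnValue[i])) {
                for (int j = 0; j < columnValue.length; j++)
                    comparedColumn[j] = columnValue[j];
                rowNum = 1;                        // new group: restart counter
                return rowNum++;
            }
        }
        return rowNum++;
    }

    public static void main(String[] args) {
        System.out.println(evaluate("12332321"));  // 1
        System.out.println(evaluate("12332321"));  // 2
        System.out.println(evaluate("32483923"));  // 1 -- counter resets on a new user
        System.out.println(evaluate("32483923"));  // 2
    }
}
```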

Package this class and compile it into a JAR, e.g. RowNumber.jar (you surely know how to do that ~~~), then copy it onto the Hive machine.
In the Hive shell, execute the following two statements:
add jar /root/RowNumber.jar;
-- load RowNumber.jar onto Hive's CLASSPATH
create temporary function row_number as 'RowNumber';
-- create a new function in Hive named row_number, referencing the RowNumber class from the Java code


If both succeed, run the following Hive SQL:


-- INSERT OVERWRITE TABLE ALLOG   (uncomment this line if you want to write into the ALLOG table)
SELECT t1.platform, t1.user_id, row_number(t1.user_id) seq, t2.click_url FROM_URL, t1.click_url TO_URL FROM
(SELECT *, row_number(user_id) seq FROM trlog) t1
LEFT OUTER JOIN
(SELECT *, row_number(user_id) seq FROM trlog) t2
ON t1.user_id = t2.user_id AND t1.seq = t2.seq + 1;


LZ [OP], you aren't asking for help online in the middle of the interview, are you?
------ Solution ------------------------------------------

On the first question: the outer ROW_NUMBER is seemingly there because of a bug in how Hive translates this SQL. If you replace the outer ROW_NUMBER with t1.seq, you will see the problem.
I haven't analyzed the exact cause; if you're interested, you can search the bug lists of the relevant communities abroad.

The second question:

INSERT OVERWRITE TABLE ALLOG
SELECT t1.platform, t1.user_id, t1.seq, t2.click_url FROM_URL, t1.click_url TO_URL FROM
(SELECT platform, user_id, click_time, click_url, count(1) seq
 FROM (SELECT a.*, b.click_time click_time1, b.click_url click_url2
       FROM trlog a LEFT OUTER JOIN trlog b ON a.user_id = b.user_id) t
 WHERE click_time >= click_time1
 GROUP BY platform, user_id, click_time, click_url) t1
LEFT OUTER JOIN
(SELECT platform, user_id, click_time, click_url, count(1) seq
 FROM (SELECT a.*, b.click_time click_time1, b.click_url click_url2
       FROM trlog a LEFT OUTER JOIN trlog b ON a.user_id = b.user_id) t
 WHERE click_time >= click_time1
 GROUP BY platform, user_id, click_time, click_url) t2
ON t1.user_id = t2.user_id AND t1.seq = t2.seq + 1;
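The core trick in this SQL is that a row's SEQ equals the number of the same user's clicks at or before its CLICK_TIME, which the self-join plus count(1) computes. A minimal sketch of that idea in plain Java (class and method names are mine; it assumes ISO-format timestamps, so string comparison matches time order):

```java
import java.util.Arrays;
import java.util.List;

public class SeqByCount {
    // seq of a click = how many of the same user's clicks happened at or
    // before its time -- exactly what the self-join + count(1) computes.
    static int seq(List<String> userClickTimes, String clickTime) {
        int count = 0;
        for (String other : userClickTimes)
            if (other.compareTo(clickTime) <= 0)  // string order == time order for ISO stamps
                count++;
        return count;
    }

    public static void main(String[] args) {
        List<String> times = Arrays.asList(
                "2013-03-21 13:48:31.324",
                "2013-03-21 13:48:32.954",
                "2013-03-21 13:48:46.365");
        for (String t : times)
            System.out.println(t + " -> seq " + seq(times, t));  // seq 1, 2, 3
    }
}
```

Because every row is compared against every other row of the same user, the work per user is quadratic, which is part of why this approach is so expensive on a big table.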

This is hopelessly inefficient: it runs at least five MapReduce jobs. It's fine as an experiment, but don't even think about running it in a production environment. You might as well write a Java MapReduce version instead.
Make do with it for now.

------ For reference only ---------------------------------------


This question looks familiar somehow.
The UDF is as simple as a custom ROWNUMBER method that you load into Hive.
The second question is a bit perverse: since a UDF is allowed, why also require it in pure Hive SQL? The performance is poor.
------ For reference only ---------------------------------------

tntzbzc, I'm a Hadoop novice; could you explain in more detail, ideally with example code?

Is there a way to do this? Hoping an expert can teach me.
------ For reference only ---------------------------------------

Also, the interviewer said the first question must be highly efficient, because the table is 2 TB.
The second question doesn't matter, as long as it can be implemented in Hive SQL.
------ For reference only ---------------------------------------
The first question seems solved, but there is one thing I don't understand: since T1 and T2 already have row_number(user_id) seq,
why compute row_number(user_id) seq again in the outermost query?

Also, is there a solution to the second question? Efficiency doesn't matter, as long as it runs.
------ For reference only ---------------------------------------

Awesome, the second one runs too.
Hoping to see other answers appear; thank you.
