Is Hadoop suitable for real-time queries?
Our business system is currently based on SQL Server;
it mostly performs queries, with scheduled batch inserts into SQL Server.
As the SQL Server data keeps growing, the system runs slower and slower.
I want to build a Hadoop cluster, use SQLOOD to import the data into HBase,
and base the business system's queries on HBase.
But I've heard that Hadoop is not suitable for real-time queries. Is my idea still feasible?
------ Solution ---------------------------------------------
HBase can do real-time data queries, and efficiently,
but pay attention to the following points:
1. The ROWKEY design of your HBase tables must be reasonable.
2. You need to build secondary indexes on HBase tables yourself (using coprocessors, or asynchronous indexing with MapReduce).
3. HBase cannot be queried directly with SQL, but open-source SQL-on-HBase projects can solve part of the problem, e.g. Phoenix.
Reference: https://github.com/forcedotcom/phoenix
Phoenix runs on the coprocessor framework introduced in HBase 0.92 and solves the problem of running SQL aggregations on HBase,
such as SUM / AVG / COUNT / MAX / MIN,
and also supports LIMIT / SORT and similar operations.
But Phoenix currently supports neither JOIN nor CREATE INDEX; you have to implement such things yourself.
Personally I recommend custom ENDPOINT coprocessors (giving up SQL); custom hooks can be bound directly to a table created through Phoenix.
4. HBase by itself is not suited to BI; you need MapReduce for custom business analysis.
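To make point 1 concrete: a common ROWKEY recipe is to prefix a salt byte (to spread sequential writes across region servers) and store a reversed timestamp (so the newest record for a key sorts first). A minimal sketch in plain Python; the `user_id`-plus-timestamp layout and the 16 salt buckets are illustrative assumptions, not something from this thread:

```python
import hashlib

LONG_MAX = 2**63 - 1  # Java Long.MAX_VALUE; HBase timestamps are signed 64-bit

def make_rowkey(user_id: str, ts_millis: int, salt_buckets: int = 16) -> bytes:
    """Build a rowkey: 1-byte salt | user_id | reversed timestamp.

    The salt spreads monotonically increasing keys across regions;
    the reversed timestamp makes the newest row for a user sort first,
    so a short prefix Scan returns the most recent records immediately.
    """
    salt = hashlib.md5(user_id.encode()).digest()[0] % salt_buckets
    reversed_ts = LONG_MAX - ts_millis
    return bytes([salt]) + user_id.encode() + reversed_ts.to_bytes(8, "big")

# Newer records sort before older ones for the same user
# (HBase stores rows in lexicographic byte order):
k_new = make_rowkey("user42", 1_700_000_001_000)
k_old = make_rowkey("user42", 1_700_000_000_000)
assert k_new < k_old
```

Since the salt is derived deterministically from `user_id`, a client can recompute the prefix and issue a single short Scan to fetch one user's latest detail records.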
------ For reference only ---------------------------------------
Correction: SQLOOD ==> SQOOP
------ For reference only ---------------------------------------
Do you mean that my idea will work?
My business queries resemble a mobile carrier's call-detail-record lookups; the front end has particularly high real-time requirements.
But after years of accumulation, the SQL Server database server is overburdened,
and query times have stretched to tens of seconds or even minutes.
So I'm considering building a Hadoop cluster, importing the data into HBase with SQOOP,
and serving the real-time queries from the Hadoop cluster.
I don't mind that the business system's data-access code would need to change.
------ For reference only ---------------------------------------
Your business is very similar to one I built last year.
Mine had N TB of LOG data per day that needed real-time queries,
plus an analysis pass every 10 minutes.
The difference between us is that your data is in MSSQL while my LOG data was in text format.
If the real-time requirement is high, I don't recommend importing with SQOOP; after all, SQOOP works via MR.
I ended up writing my own incremental-import program with the HBase JAVA API, importing the increment once a minute.
At present my HADOOP cluster has 84 machines in total, 20 of them HBase region servers; each has 16 cores, a 1.2TB hard drive, and 24G of memory.
Holding one month's data is no problem at all.
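The once-a-minute incremental import described above boils down to a high-water-mark loop: remember the last id imported, pull only rows past it, write them, advance the mark. A sketch of that loop in plain Python, with the source query and the sink write left as injectable functions (the real version would use JDBC plus the HBase JAVA API, as the poster says; the `(id, payload)` row shape is an assumption):

```python
def incremental_sync(fetch, write, watermark, cycles=1):
    """High-water-mark incremental import loop.

    fetch(watermark) -> rows as (id, payload) tuples, ordered by id,
                        containing only rows with id > watermark
                        (e.g. SELECT ... WHERE id > ? ORDER BY id).
    write(rows)      -> pushes the batch to the sink (batched HBase Puts).
    Returns the advanced watermark; schedule the call once a minute.
    """
    for _ in range(cycles):
        rows = fetch(watermark)
        if rows:
            write(rows)
            watermark = rows[-1][0]  # last id seen becomes the new mark
    return watermark

# Simulated source table standing in for the MSSQL side:
source = [(1, "a"), (2, "b"), (3, "c")]
sunk = []
mark = incremental_sync(lambda w: [r for r in source if r[0] > w],
                        sunk.extend, watermark=0, cycles=2)
print(mark)  # 3: everything imported; the second cycle found nothing new
```

Persisting the watermark (e.g. in its own HBase row) lets the importer resume after a crash without re-sending old rows.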
------ For reference only ---------------------------------------
When you advise against SQOOP, do you mainly mean that SQOOP cannot import incrementally, so imports are slow?
You have achieved real-time queries over large data volumes with Hadoop + HBase; that's really good. What reading material would you recommend?
------ For reference only ---------------------------------------
1. If real-time import is not required, you can use SQOOP for bulk imports; but for applications with high real-time demands, SQOOP struggles.
2. Read more HBase material: ROWKEY design, GET / SCAN optimization, and the OBSERVER and ENDPOINT coprocessors.
I recommend reading "HBase: The Definitive Guide"; it can be found online, but only in English.
------ For reference only ---------------------------------------
Thank you.
------ For reference only ---------------------------------------
Here to learn.
------ For reference only ---------------------------------------
Hadoop is a massive-data batch system, not a real-time processing system; it is not suitable for real-time processing.
------ For reference only ---------------------------------------
That is really misleading.
If you don't know, don't talk nonsense.
MapReduce is indeed just batch processing,
but many components are built on top of HDFS, and HBase is an absolute artifact for real-time queries.
------ For reference only ---------------------------------------
How do you define "real-time"? Show me HBase being real-time, with concrete scenarios. Merely stretching Hadoop, a massive-data batch system, into a "real-time" system will never match a system positioned for real-time response from the start, such as Apache S4.
------ For reference only ---------------------------------------
Scenarios? Imagine a mobile carrier's call-detail-record lookups.
Do you think that can be achieved?
------ For reference only ---------------------------------------
Oh, so it's a mobile carrier system. Call-detail lookups require real-time response: the system has to answer within a few seconds, or the customer experience suffers. I don't think HADOOP can respond within a few seconds; from JOB submission, the JOB initialization alone takes some time. You can run a test: submit an empty JOB that handles only one record and computes 1+1 in the MAP, and see how long the JOB takes to complete...
------ For reference only ---------------------------------------
MapReduce certainly cannot serve as a query mechanism,
but HBase can: with a good design, it responds not just within a few seconds but within tens of milliseconds, and it supports random access over massive data.
------ For reference only ---------------------------------------
HBase operations involve no M/R job at all, which is why they can do real-time random access.
------ For reference only ---------------------------------------
Brother, you are talking about MAPREDUCE; everyone who has done HADOOP knows MR cannot do real-time queries.
But HBASE is different: it is a distributed database, and its biggest feature is "real-time random data access".
Real-time queries with HBase are absolutely possible, and plenty of people here must have hands-on experience with it.
"Hadoop: The Definitive Guide", Chapter 13, states HBase's niche: if you need real-time random reads and writes, use HBase.
------ For reference only ---------------------------------------
HBase operations that don't go through an M/R job can indeed do real-time random reads and writes; what I was talking about was M/R jobs, heh.
------ For reference only ---------------------------------------
How did this turn into a debate about HBase?
My question is: I have massive amounts of data and want to run real-time queries on top of it; how do I solve that?
My idea is to put the data into HBase and process it with a distributed Hadoop cluster, enabling real-time queries.
The focus is how to combine massive data with real-time queries. Is Hadoop + HBase a fit?
------ For reference only ---------------------------------------
OP, your question is well worth exploring, which is why the discussion got so heated ~~~
It can be done, rest assured.
If you run into problems, post them here, or send me a private message. I wish you success.
------ For reference only ---------------------------------------
It seems this master has plenty of hands-on HBase experience. Could you share how HBase handles this kind of business scenario in real enterprise applications? What is the configuration of the whole cluster environment, and how much data does it process?
------ For reference only ---------------------------------------
My knowledge in this area is limited to knowing the terms Hadoop and HBase and a popular-science-level grasp of the principles.
The project is still in the investigation stage.
I hope to get your help.
------ For reference only ---------------------------------------
I'd like to hear from that brother too!
------ For reference only ---------------------------------------
My HADOOP deployment has 3 main business segments:
Distributed storage: HDFS
Data warehouse and data mining: Hive + MapReduce + MySQL
Real-time queries of the company website's user data on the internal network, plus LOG data queries: HBase + MySQL
The Hadoop cluster has 84 + 26 machines, distributed across four IDCs; each IDC has its own NN, JT, and HM.
Two of the IDCs each run a 42-machine cluster; the other two IDCs together add up to the remaining 26 machines.
The machines are rather poor:
the master hosts for NN, JT, and HM are Intel Xeon, 16 cores, 24GB RAM, 1.2TB disk;
some worker nodes are Intel 2U machines like the masters, and the rest are blades: AMD 8-core, 16GB RAM, 2TB HDD.
4TB to 8TB of data comes in daily, and roughly 15TB is processed; in the end 12TB of junk data is cut, and 2-4TB is retained.
Persisted data is cleaned once a week and compressed into SEQFILEs.
Across the whole cluster, HDFS holds 20 million small files (under 2MB), 20,000 ordinary files, and more than 800 large files (100GB or more).
The largest HBASE table is close to one hundred billion rows; other tables range from tens of thousands to billions of rows.
When capacity gets tight we just add machines; outdated blades aren't expensive anyway, about 150,000 for a full cage.
------ For reference only ---------------------------------------
I wouldn't presume to advise; we are all just exchanging ideas. After all, I haven't deployed HBASE in a real production environment, only run simple experiments; in China Mobile's real production environment I have only used HADOOP's M/R computing model. Whether HBASE fits your application depends heavily on the business scenario: different scenarios can have very different performance-optimization points. After all, HBASE is positioned as NOSQL; unlike a traditional relational database, handling complex business requires you to do a lot yourself. I recommend installing HBASE and running a survey against your own business scenario to reduce project risk. Also read more online, e.g. this HBASE key-value performance test: http://tech.it168.com/a2011/0711/1216/000001216244_all.shtml
------ For reference only ---------------------------------------
8T of data produced in one day! May I ask what kind of company website that is? We currently process only about 2T per day for a single province of a carrier. What kind of content do the real-time user queries cover? What kind of queries are the intranet LOG queries, and what are their inputs and outputs?
------ For reference only ---------------------------------------
Nice to have resources like that; I only have five machines, the best of them 4-core with 8G of memory.
------ For reference only ---------------------------------------
Then why does this article say HBase is not suitable for BI?
-----------------------------------------------------------------------------------------------
HBase is a Key-Value pair storage server built on top of the Hadoop file system. HBase supports fast insertion, modification, and deletion of Key-Value pairs, and fast lookup of the Value for an individual Key. So is HBase suitable as a data source for BI analysis? Filtering and aggregation are the basic operations of BI, so first we need to know whether HBase supports fast filtering and aggregation.
MapReduce is the Hadoop system's basic computational framework, and HBase users can use MapReduce to perform filtering and aggregation. But we know that MapReduce's response time is usually tens of seconds or more, which is too slow for interactive BI. So we would like to investigate whether the HBase Coprocessor is a better choice.
The HBase Coprocessor is a much simpler computation mechanism than MapReduce. A Coprocessor is equivalent to a stored procedure running on an HBase Region Server. An HBase client can invoke a Coprocessor on the Region Servers (via execCoprocessor) to do filtering and aggregation. The Coprocessor computes locally on each Region Server, each Region Server passes its partial result to the client, and the final result is assembled at the client. [The original article included a schematic diagram of a Count operation via Coprocessor.]
Programmers can write their own Coprocessors that use an HBase Scan object for filtering and implement aggregations such as Sum() and Avg() in Java code.
Since the HBase API itself does not support table joins, we can assume that all of the data-warehouse data is stored in one giant HBase Table.
At the logical level, an HBase Table is equivalent to a three-dimensional Map: given (Row, Column, TimeStamp) we can look up the corresponding value. In the physical implementation, HBase stores the data cell by cell; besides the value, each cell also carries fields such as the RowKey, Column ID, and TimeStamp, so a large part of each cell's space actually stores metadata rather than data. This storage format is very effective for sparse tables, but as table density grows its storage efficiency drops sharply. A typical data-warehouse table's density is often close to 100%, so an HBase Table's storage efficiency is much lower than a simple two-dimensional layout such as a relational table or a CSV file.
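The per-cell metadata overhead described above can be estimated with simple arithmetic: in HBase's classic KeyValue layout, every cell repeats its full row key, column family, qualifier, and timestamp alongside the value. A rough back-of-the-envelope in Python; the field sizes for the "dense warehouse row" are illustrative assumptions, not measured figures:

```python
def cell_overhead(rowkey_len, family_len, qualifier_len):
    """Approximate metadata bytes per cell in the classic KeyValue layout:
    key/value length prefixes (4 + 4), row length (2), family length (1),
    timestamp (8), key type (1), plus the repeated key components."""
    return 4 + 4 + 2 + 1 + 8 + 1 + rowkey_len + family_len + qualifier_len

# Hypothetical dense row: 20 columns of 8-byte values,
# 16-byte rowkey, 1-byte family name, 4-byte qualifiers.
value_bytes = 20 * 8
meta_bytes = 20 * cell_overhead(16, 1, 4)
fraction = meta_bytes / (value_bytes + meta_bytes)
print(round(fraction, 2))  # ~0.84: most stored bytes are metadata, not data
```

Which is exactly why, for dense warehouse tables, the article finds flat layouts like CSV or relational rows far cheaper.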
Our tests showed that when the data set is small (200-300MB), the Coprocessor is slightly slower than MySQL but faster than MapReduce. When the data set becomes large, the Coprocessor becomes much slower than MySQL and even slower than MapReduce. Within MapReduce, on the same data, CSV input is much faster than an HBase Table.
In summary, the blogger thinks HBase's own storage format is not suitable for typical BI applications.
But for some simple reporting applications, such as Facebook Insights, HBase can still serve as the data source. In Facebook Insights, each user has some Count metrics, such as Click # and Impression #; the user ID (as the Key) and those Count metrics live in an HBase Table, and Insights updates each user's metrics in real time from the Web Logs. Each user's metric values can also be read in real time. Because no complex filtering or aggregation is involved, HBase plays the role well.
-----------------------------------------------------------------------------------------------
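The Facebook Insights pattern in the quoted article is just per-user atomic counters, which in HBase maps to Increment operations on counter columns. A plain-Python stand-in for the access pattern (the dict replaces the HBase table; `clicks` and `impressions` are hypothetical metric names):

```python
from collections import defaultdict

class CounterTable:
    """Dict-backed stand-in for an HBase table of per-user count metrics,
    mimicking the increment-and-read-back pattern of Insights-style apps."""
    def __init__(self):
        self._rows = defaultdict(lambda: defaultdict(int))

    def increment(self, rowkey, column, amount=1):
        # In HBase this would be a single atomic Increment RPC.
        self._rows[rowkey][column] += amount
        return self._rows[rowkey][column]

    def get(self, rowkey, column):
        # Real-time point read by key: HBase's sweet spot.
        return self._rows[rowkey][column]

t = CounterTable()
for _ in range(3):
    t.increment("user42", "clicks")
t.increment("user42", "impressions", 10)
print(t.get("user42", "clicks"))  # 3
```

Every increment and every point read touches a single row by key, with no cross-row filtering or aggregation, which is why the article says HBase handles this case well.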
------ For reference only ---------------------------------------
1. HBase itself has no high-level data-analysis language such as SQL.
2. HBase has no ready-made JOIN syntax; you need to develop the implementation yourself.
3. Although a Coprocessor ENDPOINT runs on the server side, it is only ever a lightweight aggregation tool.
4. HBase cannot build secondary indexes as easily as an RDBMS or MONGODB.
If you are doing BI, use Hive or PIG instead. After all, moving from low-level code to a high-level language greatly improves productivity.
HIVE can map HBASE tables and analyze them directly, which is very convenient.
------ For reference only ---------------------------------------
Our company does Internet entertainment: games; I won't reveal the name.
Game LOG volumes are huge: every shot a player fires, every kill, every in-game item obtained is written to the LOG.
If a game has 100K players online daily, the LOG runs to at least 1T+. It's just that many companies don't choose HBASE; most process the data with an RDBMS, only with relatively low efficiency.
Users query through many channels:
1. in-game; 2. PC browsers; 3. mobile phones and other mobile devices.
Because a WEB middleware layer caches the data, HBASE concurrency is not high, roughly 500 accesses per minute; I haven't gathered exact statistics.
The intranet business is similar to the external one and serves the CALL CENTER.
------ For reference only ---------------------------------------
The HBASE tables you designed are wide tables, right? Do you run JOIN operations? How is the response speed? You said at least 1T of data per day goes into those wide HBASE tables?
------ For reference only ---------------------------------------
The tables are not very wide: one column family, with multiple columns per family.
Each table has roughly 10-25 columns, but the set of columns present in each row is not necessarily the same; this is much more flexible than an RDBMS.
There are JOIN operations; we implemented them ourselves.
The 1T of data is distributed across different tables, split by business segment.
------ For reference only ---------------------------------------
Do you create one table per query condition, using its KEY for the lookup? For the JOIN operations, do you store multi-attribute combination KEYs redundantly in a single table, or query each table separately and then merge the associated results in memory?
------ For reference only ---------------------------------------
We query by ROWKEY, not by a "KEY"; the difference is significant.
HBASE tables are designed with redundancy, which is very different from RDBMS design theory and the three normal forms.
Our JOIN means the cross-table union query is done as an in-memory MERGE, very similar to an RDBMS JOIN.
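The in-memory MERGE described here can be sketched as a client-side hash join over the results of two already-filtered scans. Plain Python, with the scan results stood in by lists of dicts (the call-record and subscriber columns are hypothetical):

```python
def hash_join(left_rows, right_rows, key):
    """Join two scan result sets on a shared column, RDBMS-style.

    left_rows / right_rows: iterables of dicts (one dict per row).
    Builds a hash table over the right side, then probes with the left,
    which is how a client-side HBase join is typically done once both
    scans have been filtered down to memory-sized results.
    """
    index = {}
    for row in right_rows:
        index.setdefault(row[key], []).append(row)
    joined = []
    for row in left_rows:
        for match in index.get(row[key], []):
            joined.append({**row, **match})
    return joined

calls = [{"phone": "139000", "duration": 60}]
users = [{"phone": "139000", "name": "Zhang"}]
print(hash_join(calls, users, "phone"))
# [{'phone': '139000', 'duration': 60, 'name': 'Zhang'}]
```

This only works while both filtered sides fit in memory, which is exactly the limitation questioned later in the thread.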
------ For reference only ---------------------------------------
I don't much mind the inconvenient features mentioned above;
as long as it can serve real-time BI queries over massive data,
the complexity doesn't matter; I can overcome the difficulties.
Since your company has already achieved real-time queries over massive data with HBase,
I will give it a try.
------ For reference only ---------------------------------------
Right, the KEY I mentioned is your ROWKEY. When the first-pass filtering of the two tables to be joined still yields large result sets, say exceeding available memory, how do you handle it? Do you build another table that redundantly stores the merged ROWKEYs of the two tables? And what are the response times of your queries on these tables?