========
SqlHBase
========

SqlHBase is an HBase ingestion tool for MySQL-generated dumps.

The aim of this tool is to provide a 1:1 mapping of a MySQL table
into an HBase table, mapped onto Hive (the schema is handled too).

Running this requires a working HBase with Thrift enabled,
and a Hive instance with the metastore properly configured and
Thrift enabled as well. If you need I/O performance, I recommend
looking into Pig or Jython, or directly into a native MapReduce job.

SQOOP was discarded as an option, as it doesn't cope with dump files
and it does not compute the difference between dumps before ingestion.

SqlHBase uses a 2-level ingestion process, described below.

"INSERT INTO `table_name` VALUES (), ()" statements are hashed
and stored (dropping everything to the left of the first opening
round bracket) as a single row into a staging table on HBase (the
md5 hash of the row is the row key on HBase).
When multiple dumps of the same table/database are ingested, this
prevents (or at least reduces) the duplication of data on the HBase side.

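A minimal sketch of that hashing step, in Python (the function name and
the sample statement are illustrative, not the tool's actual code):

  import hashlib

  def staging_key(insert_statement):
      """Drop everything left of the first opening round bracket and
      use the md5 of the remaining payload as the HBase row key."""
      payload = insert_statement[insert_statement.index("("):]
      return hashlib.md5(payload.encode("utf-8")).hexdigest()

  # Re-ingesting a dump that contains the same chunk produces the same
  # row key, so the staging table does not grow with duplicates.
  print(staging_key("INSERT INTO `users` VALUES (1,'a'),(2,'b');"))
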
MySQL by default chunks rows as tuples, up to 16MB, in a single
INSERT statement. Given that, we basically have a list of tuples:

  [(1, "c1", "c2", "c3"), (2, "c1", "c2", "c3"), ... ]

An initial attempt at parsing/splitting such a string with a regexp
failed, of course, since a column value could contain ANYTHING,
even round brackets and quotes. This kind of language is not
recognizable by a finite state automaton, so something else had to
be implemented, to keep track of the nested brackets for example.
A PDA (push-down automaton) would have helped, but... as you can
see above, the syntax is exactly the one of a list of tuples
in Python... an eval() is all we needed in such a case.
(And it is also, I guess, optimized at the C level by the interpreter.)

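As a rough illustration of that shortcut (not the exact code used by the
tool; real dumps may need extra care with NULLs and escape sequences):

  # The VALUES payload of one INSERT chunk, wrapped in square brackets,
  # is valid Python tuple-list syntax, so eval() parses it for us.
  # NULL is mapped to None by passing it in as a name binding.
  payload = "(1, 'c1', NULL), (2, 'with (brackets) and \\'quotes\\'', 'c2')"
  rows = eval("[" + payload + "]", {"NULL": None})
  for row in rows:
      row_id, columns = row[0], row[1:]
      print(row_id, columns)
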
It has to be taken into consideration that the IDs of the rows are
integers while HBase wants a string... plus, we need to do some zero
padding, because HBase sorts its keys lexicographically.

There are tons of threads on forums about how bad it is to use a
monotonically incrementing key on HBase, but... this is what we needed.

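For example (the padding width here is an arbitrary choice):

  def hbase_row_key(row_id, width=10):
      """Turn an integer id into a fixed-width string so that HBase's
      lexicographic ordering matches the numeric ordering."""
      return str(row_id).zfill(width)

  # Without padding, "10" sorts before "9"; with padding the order is numeric.
  assert sorted(["9", "10"]) == ["10", "9"]
  assert sorted([hbase_row_key(10), hbase_row_key(9)]) == \
         [hbase_row_key(9), hbase_row_key(10)]
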
[...]

A 2-level Ingestion Process
===========================

A staging -> (bin/sqlhbase-mysqlimport)
----------------------------------------
No interpretation of the content of the MySQL dump file is done at this
stage, apart from the split between schema data and raw data (INSERTs).
Two tables are created: "namespace"_creates and "namespace"_values.
The first table contains an entry/row for each dump file ingested,
having as a row key the timestamp of the day found at the bottom of the
dump file (or one provided on the command line, in case that information
is missing). Such a row contains the list of hashes for each table
(see below), a create statement for each table, and a create statement
for each view, plus some statistics related to the time spent parsing
the file, the amount of rows it contained, and the overall md5 hash.

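A heavily simplified sketch of what that staging layout could look like
through the HBase Thrift API, using the happybase client (SqlHBase may use
a different binding; the column families and sample values are assumptions
made for illustration):

  import happybase  # Python client for the HBase Thrift server

  connection = happybase.Connection("localhost")  # Thrift host (assumption)

  # For an illustrative namespace "mydb": one meta row per dump file,
  # keyed by the dump's day timestamp, and one row per 16MB INSERT
  # chunk, keyed by the chunk's md5 hash. Both tables (and their column
  # families) are assumed to exist already.
  creates = connection.table("mydb_creates")
  values = connection.table("mydb_values")

  creates.put(b"1325376000", {                 # timestamp from the dump file
      b"dump:hashes": b"md5_of_chunk_1,md5_of_chunk_2",
      b"dump:create_users": b"CREATE TABLE `users` (...)",
      b"dump:row_count": b"12345",
  })
  values.put(b"md5_of_chunk_1", {b"raw:insert": b"(1,'a'),(2,'b')"})
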
A publishing -> (bin/sqlhbase-populate)
----------------------------------------
Given a namespace (as of the initial import) and a timestamp (from a list):
 - the content of the table CREATE statement gets interpreted, the data
   types are mapped from MySQL to HIVE, and the table is created on HIVE
   (see the sketch after this list)
 - if the table does not exist, it gets created fully, reading each 16MB chunk
 - the table gets created with this convention: "namespace"_"table_name"
 - if the table exists, and it contains data, we compute the difference
   between the 2 lists of hashes that were created at ingestion time
   -- then we check what has already been ingested in the range of row ids
      contained in the MySQL chunk (we took the assumption that MySQL
      dumps a table sequentially, hopefully)
   -- if a row id which is in the sequence in the database is not in the
      sequence from the chunk we are ingesting, then we might have a DELETE
      (a DELETE that we do not execute on HBase due to HBASE-5154, HBASE-5241)
   -- if a row id is also in our chunk, we check each column for changes
   -- duplicated columns are removed from the list that is going to be sent
      to the server, to avoid wasting bandwidth
 - at this stage, we get a copy of the data at the next known ingestion
   date (dates are known from the list of dumps in the meta table)
   -- if data is found, each row gets diffed against the data to be ingested
      that is left from the previous cleaning... if there are real changes,
      those are kept and will be sent to the HBase server for writing
      (timestamps are verified at this stage, to avoid resending data
      that has already been written previously)

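A rough sketch of the type mapping and of the hash-list difference used
above (the mapping table and function names are illustrative assumptions,
not the tool's actual code):

  # Illustrative MySQL -> HIVE type mapping applied to the parsed
  # CREATE statement; the real tool's mapping may differ.
  MYSQL_TO_HIVE = {
      "tinyint": "TINYINT", "smallint": "SMALLINT", "int": "INT",
      "bigint": "BIGINT", "float": "FLOAT", "double": "DOUBLE",
      "varchar": "STRING", "char": "STRING", "text": "STRING",
      "datetime": "STRING",  # or TIMESTAMP, depending on the HIVE version
  }

  def hive_type(mysql_type):
      """Map e.g. "varchar(255)" to "STRING", defaulting to STRING."""
      base = mysql_type.split("(")[0].lower()
      return MYSQL_TO_HIVE.get(base, "STRING")

  def chunks_to_publish(previous_hashes, current_hashes):
      """Difference between the 2 hash lists stored in the meta table:
      only chunks that were not in the previous dump get published."""
      return [h for h in current_hashes if h not in set(previous_hashes)]
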
FIXME: ingesting data while skipping a day will need a proper
       recalculation of the difference of the hashes list...
       Ingesting data from a backup that was not previously ingested
       (while we kept ingesting data into the tables) will cause some
       redundant data to be duplicated in HBase, simply because we do
       not dare to delete the duplicates that are "in the future".

       ...anyway, it is pretty easy to delete a table and reconstruct it,
       having all the history in the staging level of HBase.

Last but not least, we do parse VIEWs and apply them on HIVE
... be careful about https://issues.apache.org/jira/browse/HIVE-2055 !!!