This is a simple tool to fetch data from a Kafka topic and back it up to AWS-S3.
Backup files are generated in Parquet format; to do so, kafka-topic-dumper uses the PyArrow package.
Clone this repository:
git clone [email protected]:Cobliteam/alexstrasza-stress-test.git
Go to the correct folder:
cd alexstrasza-stress-test/kafka-topic-dumper
Install it using the setup.py file:
pip install -e .
To be able to upload files to an AWS-S3 bucket, you will need to set up your AWS credentials as described in the AWS documentation. Then just export your profile like this:
export AWS_PROFILE=name
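If you do not have a credentials file yet, a minimal ~/.aws/credentials looks like the sketch below; the profile name and key values are placeholders you must replace with your own:

[name]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY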
$ kafka-topic-dumper -h
usage: kafka-topic-dumper [-h] [-t TOPIC] [-s BOOTSTRAP_SERVERS]
                          [-b BUCKET_NAME] [-p PATH]
                          {dump,reload} ...

Simple tool to dump kafka messages and send them to AWS S3

positional arguments:
  {dump,reload}         sub-command help
    dump                Dump mode will fetch messages from the kafka cluster
                        and send them to AWS-S3.
    reload              Reload mode will download files from AWS-S3 and send
                        them to kafka.

optional arguments:
  -h, --help            show this help message and exit
  -t TOPIC, --topic TOPIC
                        Kafka topic to fetch messages from.
  -s BOOTSTRAP_SERVERS, --bootstrap-servers BOOTSTRAP_SERVERS
                        host[:port] string (or list of host[:port] strings
                        concatenated by ",") that the consumer should contact
                        to bootstrap initial cluster metadata. If no servers
                        are specified, will default to localhost:9092.
  -b BUCKET_NAME, --bucket-name BUCKET_NAME
                        The AWS-S3 bucket name to send dump files to.
  -p PATH, --path PATH  Path to the folder where local files will be stored.
$ kafka-topic-dumper dump -h
usage: kafka-topic-dumper dump [-h] [-n NUM_MESSAGES]
                               [-m MAX_MESSAGES_PER_PACKAGE] [-d]

optional arguments:
  -h, --help            show this help message and exit
  -n NUM_MESSAGES, --num-messages NUM_MESSAGES
                        Number of messages to try to dump.
  -m MAX_MESSAGES_PER_PACKAGE, --max-messages-per-package MAX_MESSAGES_PER_PACKAGE
                        Maximum number of messages per dump file.
  -d, --dry-run         In dry-run mode, kafka-topic-dumper will generate
                        local files but will not send them to the AWS S3
                        bucket.
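For example, to test a small dump locally without uploading anything, you can combine the global options with the dump sub-command and its dry-run flag (my-topic and ./data below are placeholder values):

$ kafka-topic-dumper -t my-topic -s localhost:9092 -p ./data dump -n 1000 -d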
$ kafka-topic-dumper reload -h
usage: kafka-topic-dumper reload [-h] [-g RELOAD_CONSUMER_GROUP]
                                 [-T TRANSFORMER]

optional arguments:
  -h, --help            show this help message and exit
  -g RELOAD_CONSUMER_GROUP, --reload-consumer-group RELOAD_CONSUMER_GROUP
                        When reloading a dump of messages that was already in
                        kafka, kafka-topic-dumper will not load it again; it
                        will only reset the offsets for this consumer group.
  -T TRANSFORMER, --transformer TRANSFORMER
                        package:class that will be used to transform each
                        message before producing.
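For example, to reload a dump from S3 and only reset the offsets of a consumer group whose messages are already in kafka (my-bucket and my-group are placeholders):

$ kafka-topic-dumper -t my-topic -s localhost:9092 -b my-bucket -p ./data reload -g my-group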
The following command will dump 10000 messages from the kafka server hosted at localhost:9092 to the folder named data. It will also send this dump, in packages of 1000 messages, to an AWS-S3 bucket called my-bucket.
$ kafka-topic-dumper -t my-topic -s localhost:9092 -p ./data -b my-bucket \
    dump -n 10000 -m 1000

Note that -n and -m belong to the dump sub-command, so they come after the dump keyword.
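Since the dump files are plain Parquet, you can inspect one of them with PyArrow itself. A minimal sketch, assuming a dump file exists under ./data (the file name is a placeholder):

$ python -c "import pyarrow.parquet as pq; print(pq.read_table('./data/my-dump.parquet'))"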