HIVE SSL mTLS capability #46023

Closed
2 tasks done
alexio215 opened this issue Jan 24, 2025 · 13 comments

@alexio215

Description

Add HIVE provider capability to use SSL and perform mTLS handshake with connecting host

Use case/motivation

Airflow runs in a different network from HIVE due to policy and company support.
We require an mTLS-encrypted connection between the two instances so that HIVE jobs can be run remotely and securely.

Related issues

None that I am aware of

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@alexio215 alexio215 added kind:feature Feature Requests needs-triage label for new issues that we didn't triage yet labels Jan 24, 2025

boring-cyborg bot commented Jan 24, 2025

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

@alexio215
Author

Hello, thank you for having me. It is my first time contributing to any open source project, so please bear with me.

Happy to learn from any wisdom shared

@alexio215
Author

My current inclination in solving this problem is to use JPype to funnel Python requests through a JVM running the JDBC driver adjacent to Airflow.

The goal is to keep writing HIVE DAGs natively in Python but communicate through Java, which is more native to HIVE and supports the JKS key format that HIVE uses by default.

@alexio215
Author

On pause for now and considering ramifications of this in my environment

@nathadfield nathadfield removed the needs-triage label for new issues that we didn't triage yet label Jan 27, 2025
@nevcohen
Contributor

The hive cli or hive server 2 connections don't work for you?

@alexio215
Author

I'm currently looking at two capabilities. The first is to connect to an NGINX proxy that requires SSL certificates and expects mTLS; the proxy serves HIVE commands locally within our cluster, passing them to the HiveServer2 instance running right behind it. The second, which I am hoping for down the line, is to find or create support for connecting PyHive directly to a HiveServer2 instance running with SSL, and to perform mTLS. The problem is that Python does not natively support the .jks format that HiveServer2 expects, hence the NGINX proxy. Also, judging by PyHive's most recent issues, it seems that PyHive does not support SSL connections either:
dropbox/PyHive#257

Forgive me for any misunderstanding as well, this is all a learning process to me at the same time. Thank you for the patience and help @nevcohen

@nevcohen
Contributor

nevcohen commented Feb 2, 2025

So how do you connect to Hive from code today?

@alexio215
Author

Thank you for the patience. This has taken some digging on my end while I got accustomed to what is currently practiced in my org. Currently our PyHive queries are written in a more manual script and sent to an NGINX server that redirects the appropriate traffic to a HiveServer2 proxy. The Thrift communication is wrapped in HTTPS using the THttpClient module from the Thrift library. I have found that this exists within PyHive as well.

This lives in, and is made accessible through, the Connection class of PyHive:

    if scheme in ("https", "http") and thrift_transport is None:
        port = port or 1000
        ssl_context = None
        if scheme == "https":
            ssl_context = create_default_context()
            ssl_context.check_hostname = check_hostname == "true"
            ssl_cert = ssl_cert or "none"
            ssl_context.verify_mode = ssl_cert_parameter_map.get(ssl_cert, CERT_NONE)
        thrift_transport = thrift.transport.THttpClient.THttpClient(
            uri_or_host="{scheme}://{host}:{port}/cliservice/".format(
                scheme=scheme, host=host, port=port
            ),
            ssl_context=ssl_context,
        )

My goal is to add a method that uses the ssl library to create an SSL context from the extras provided and attaches it to the connection being created if a "use_https_proxy" boolean is specified for the proxy. Further, an "enable_mtls" boolean option will be included to allow for cases where someone needs to use mTLS.
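To make the idea concrete, here is a minimal sketch of such a helper, using only Python's standard ssl module. It is not the provider's or PyHive's actual code. The function name build_ssl_context and the extras keys ssl_ca_file, ssl_client_cert, ssl_client_key and ssl_client_key_password are hypothetical names chosen for illustration; only "use_https_proxy" and "enable_mtls" come from the proposal above. It also assumes the client certificate and key are available as PEM files, since the ssl module does not read .jks keystores.

```python
import ssl
from typing import Any, Mapping, Optional


def build_ssl_context(extras: Mapping[str, Any]) -> Optional[ssl.SSLContext]:
    """Build an SSLContext from connection extras, or return None for plain HTTP.

    All extras keys other than "use_https_proxy" and "enable_mtls" are
    hypothetical names used only for this sketch.
    """
    if not extras.get("use_https_proxy", False):
        return None

    # Verify the proxy's certificate against a custom CA bundle if one is
    # given; with cafile=None the system trust store is used instead.
    context = ssl.create_default_context(cafile=extras.get("ssl_ca_file"))

    if extras.get("enable_mtls", False):
        cert_file = extras.get("ssl_client_cert")
        if not cert_file:
            raise ValueError("enable_mtls requires ssl_client_cert")
        # Present a client certificate during the handshake (the "mutual"
        # part of mTLS). Files must be PEM; a JKS keystore would have to be
        # converted beforehand.
        context.load_cert_chain(
            certfile=cert_file,
            keyfile=extras.get("ssl_client_key"),
            password=extras.get("ssl_client_key_password"),
        )

    return context
```

The returned context could then be handed to the ssl_context argument of THttpClient, as in the PyHive snippet above. Keeping the context construction separate from the transport means the same helper works whether or not mTLS is enabled.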

@alexio215
Author

I am looking for a way to do this through the PyHive scheme and default constructor parameters.

@nevcohen
Contributor

nevcohen commented Feb 5, 2025

I think I understand what you want to do, the way I see it there are two options.

  1. Open a PR and if it's something simple and relevant it will also be promoted in the open source (I would love to do a CR for you).
  2. Implement your own wrapper for the operator in your organization.

@alexio215
Author

Hello, thank you for the help. I have opened an issue to make a PR for PyHive first, since it fundamentally lacks this capability. Once I get that merged, I will come back here to make a PR for the Airflow provider.

@eladkal
Contributor

eladkal commented Mar 2, 2025

I'm closing this issue as it's a missing feature in the upstream library: dropbox/PyHive#480
Should upstream add support for it, feel free to open a PR directly (no need for an issue in Airflow).

@eladkal eladkal closed this as completed Mar 2, 2025
@alexio215
Author

Just wanted to add that PyHive has since been adopted by the apache/kyuubi project. The PR for this support has been made upstream and is awaiting release.

Issue can remain closed.
