HBASE-28963 Updating Quota Factors is too expensive #6451
base: master
Conversation
} catch (IOException e) {
  LOG.warn("Failed to get cluster metrics needed for updating quotas", e);
  return;

boolean hasTableQuotas = !tableQuotaCache.entrySet().isEmpty()
Not sure if this is relevant here, but this check won't be atomic. The contents of tableQuotaCache can change while checking userQuotaCache.
I think it's okay — the two conditions are definitely independent of each other, and the implication of that case would be that your tableQuotaCache addition missed the boat for this refresh and will be reflected in subsequent refreshes.
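To make the timing concrete, here's a simplified sketch of that check; the map value types and the second condition are placeholders rather than the real QuotaState API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified stand-in for the check above; the real QuotaCache state is richer.
class QuotaCacheCheckSketch {
  private final Map<String, Object> tableQuotaCache = new ConcurrentHashMap<>();
  private final Map<String, Object> userQuotaCache = new ConcurrentHashMap<>();

  boolean hasTableQuotas() {
    // Each read is individually thread-safe, but the pair is not taken under a
    // single lock, so a table quota added between the two reads can be missed.
    // Worst case, the new quota waits until the next scheduled refresh.
    return !tableQuotaCache.isEmpty() || !userQuotaCache.isEmpty();
  }
}
```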
-      () -> QuotaUtil.buildDefaultUserQuotaState(rsServices.getConfiguration(), 0L),
-      this::triggerCacheRefresh);
+      () -> QuotaUtil.buildDefaultUserQuotaState(rsServices.getConfiguration(), 0L));
The changes here and in getQuotaState are critical. They ensure that a cache miss does not trigger an immediate refresh; particularly given that cache entries are evicted after 5 refresh periods, refreshing on every miss was too heavy-handed.
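For concreteness, a minimal sketch of the miss-handling pattern being described; the class and field names are placeholders, not the PR's actual code:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical sketch: a cache miss materializes a default quota state instead
// of forcing an immediate whole-cache refresh.
class QuotaStateLookupSketch<QuotaState> {
  private final ConcurrentHashMap<String, QuotaState> userQuotaCache = new ConcurrentHashMap<>();
  private final Supplier<QuotaState> defaultState;

  QuotaStateLookupSketch(Supplier<QuotaState> defaultState) {
    this.defaultState = defaultState;
  }

  QuotaState getUserQuotaState(String user) {
    // No triggerCacheRefresh() here: the entry is reconciled by the next
    // scheduled refresh and aged out if it goes unused for several periods.
    return userQuotaCache.computeIfAbsent(user, k -> defaultState.get());
  }
}
```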
  }
}

static class RefreshableExpiringValueCache<T> {
I made this class because I don't think there's a good keyless equivalent to a LoadingCache, and a memoized supplier does not offer all of the functionality that I'd like (on-demand refresh, invalidation).
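For anyone curious what such a class might look like, here's a rough sketch of the idea built on a single-key LoadingCache; it is not necessarily the implementation in this PR:

```java
import java.time.Duration;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;
import org.apache.hbase.thirdparty.com.google.common.cache.CacheBuilder;
import org.apache.hbase.thirdparty.com.google.common.cache.CacheLoader;
import org.apache.hbase.thirdparty.com.google.common.cache.LoadingCache;

// Sketch: a keyless, expiring, refreshable value holder backed by a
// single-entry LoadingCache.
class RefreshableExpiringValueCacheSketch<T> {
  private static final String KEY = "value";
  private final LoadingCache<String, T> cache;

  RefreshableExpiringValueCacheSketch(Duration ttl, Supplier<T> loader) {
    this.cache = CacheBuilder.newBuilder()
      .expireAfterWrite(ttl.toMillis(), TimeUnit.MILLISECONDS)
      .build(new CacheLoader<String, T>() {
        @Override
        public T load(String key) {
          return loader.get();
        }
      });
  }

  T get() {
    return cache.getUnchecked(KEY);   // lazy load, honoring the TTL
  }

  void refresh() {
    cache.refresh(KEY);               // on-demand refresh
  }

  void invalidate() {
    cache.invalidate(KEY);            // explicit invalidation
  }
}
```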
if (hasTableQuotas) {
  updateTableMachineQuotaFactors();
} else {
  updateOnlyMachineQuotaFactors();
This check ensures that we only pull down every region state if we actually need to; without table machine quotas, there is no point.
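To illustrate why skipping this matters, here's a hedged sketch of the difference in ClusterMetrics request size; the Admin-based calls shown are an assumption for illustration, not the code in this PR:

```java
import java.io.IOException;
import java.util.EnumSet;
import org.apache.hadoop.hbase.ClusterMetrics;
import org.apache.hadoop.hbase.ClusterMetrics.Option;
import org.apache.hadoop.hbase.client.Admin;

// Machine factors only need the live server list; table machine factors need
// per-region load, which is what makes the full request expensive.
class ClusterMetricsRequestSketch {
  int countLiveServers(Admin admin) throws IOException {
    // Small payload: just server names.
    ClusterMetrics metrics = admin.getClusterMetrics(EnumSet.of(Option.SERVERS_NAME));
    return metrics.getServersName().size();
  }

  ClusterMetrics metricsForTableFactors(Admin admin) throws IOException {
    // Large payload: per-server load including every region's state.
    return admin.getClusterMetrics(EnumSet.of(Option.LIVE_SERVERS, Option.SERVERS_NAME));
  }
}
```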
-    final String userOverrideWithQuota = User.getCurrent().getShortName() + "123";
+    final String userOverrideWithQuota = User.getCurrent().getShortName();
I made a few small test changes to ensure that refreshes were called appropriately. Because the logic in ThrottleQuotaTestUtil.triggerUserCacheRefresh assumes that we're waiting on throttling for the current user, I also had to change the procedure of this test a bit to ensure that the current user is who we're throttling (rather than the current user plus a suffix, like we were before). But the logic is still sound.
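As a hedged illustration of the setup the adjusted test relies on (the real code in TestQuotaCache differs), throttling the current user by short name might look like:

```java
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.quotas.QuotaSettings;
import org.apache.hadoop.hbase.quotas.QuotaSettingsFactory;
import org.apache.hadoop.hbase.quotas.ThrottleType;
import org.apache.hadoop.hbase.security.User;

// Illustrative only: throttle the current user so that the test utility's
// refresh-waiting logic observes the same principal that issues requests.
class ThrottleCurrentUserSketch {
  static void throttleCurrentUser(Admin admin) throws Exception {
    String user = User.getCurrent().getShortName();
    QuotaSettings throttle =
      QuotaSettingsFactory.throttleUser(user, ThrottleType.REQUEST_NUMBER, 100, TimeUnit.SECONDS);
    admin.setQuota(throttle);
  }
}
```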
@@ -61,6 +67,10 @@ public class QuotaCache implements Stoppable {
  private static final Logger LOG = LoggerFactory.getLogger(QuotaCache.class);

  public static final String REFRESH_CONF_KEY = "hbase.quota.refresh.period";
  public static final String TABLE_REGION_STATES_CACHE_TTL_MS =
    "hbase.quota.cache.ttl.region.states.ms";
Thank you for including a unit indicator in the config key! 🙇
Duration tableRegionStatesCacheTtl =
  Duration.ofMillis(conf.getLong(TABLE_REGION_STATES_CACHE_TTL_MS, period));
this.tableRegionStatesClusterMetrics =
  new RefreshableExpiringValueCache<>("tableRegionStatesClusterMetrics",
As a follow-up JIRA, should these configuration values be hot-reloadable?
That would definitely be nice, but it would probably be a larger refactor, so a separate issue makes sense. The quota refresh period is also static, and should probably be made dynamic in that same push.
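One possible shape for that follow-up, sketched against HBase's ConfigurationObserver hook; this is an assumption about the approach, not part of this PR:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.conf.ConfigurationObserver;

// Sketch: re-read the quota config keys when the live configuration changes.
// Defaults shown here are illustrative; the real defaults live in QuotaCache.
class QuotaCacheConfigObserverSketch implements ConfigurationObserver {
  private volatile long refreshPeriodMs;
  private volatile long regionStatesTtlMs;

  @Override
  public void onConfigurationChange(Configuration conf) {
    refreshPeriodMs = conf.getLong("hbase.quota.refresh.period", 5 * 60 * 1000L);
    regionStatesTtlMs = conf.getLong("hbase.quota.cache.ttl.region.states.ms", refreshPeriodMs);
    // A real implementation would also reschedule the refresh chore and
    // rebuild the TTL'd caches with the new values.
  }
}
```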
hbase-server/src/main/java/org/apache/hadoop/hbase/quotas/QuotaCache.java (outdated, resolved)
hbase-server/src/test/java/org/apache/hadoop/hbase/quotas/TestQuotaCache.java (outdated, resolved)
Looks good to me, just some questions and nits. Thanks for the manual testing.
Force-pushed from 3cc02fc to 8bed3d3
🎊 +1 overall
This message was automatically generated.

🎊 +1 overall
This message was automatically generated.
https://issues.apache.org/jira/browse/HBASE-28963
My company is running Quotas across a few hundred clusters of varied size. One cluster has hundreds of servers, tens of thousands of regions, and tens of thousands of unique users — for all of whom we build default user quotas to manage resource usage OOTB.
We noticed that the HMaster was quite busy for this cluster, and after some investigation we realized that RegionServers were hammering the HMaster's ClusterMetrics endpoint to facilitate the refreshing of table machine quota factors. We were also hotspotting the RegionServer hosting the quotas system table.
After some digging here, we realized there were three meaningful changes that we could make to the quota refresh process to really increase its scalability as RegionServer count, region count, and distinct user count grow.
Change #2 introduced a cache of the ClusterMetrics used for table machine quota factors; its TTL defaults to the defined quota refresh period, but can be customized.
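For illustration, customizing it programmatically might look like this; the values are made up, the keys are the ones added in this PR:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

// Illustrative values: refresh quotas every 5 minutes, but hold the cached
// region-states ClusterMetrics for 15 minutes.
public class QuotaCacheTtlExample {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    conf.setLong("hbase.quota.refresh.period", 5 * 60 * 1000L);
    conf.setLong("hbase.quota.cache.ttl.region.states.ms", 15 * 60 * 1000L);
    System.out.println(conf.getLong("hbase.quota.cache.ttl.region.states.ms", -1L));
  }
}
```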
Testing
I've updated some tests to match the expectation that quotas will only refresh on the normally scheduled refresh period. Otherwise, I think our quotas test suite provides pretty good coverage to ensure that nothing is broken by this changeset.
I've also deployed this on a test cluster, and here is how refresh rates changed in a given 30s time period pre- and post-restart.
Pre-Restart: (screenshot)
Post-Restart: (screenshot)
On said test cluster I've also confirmed that throttling continues to work as intended. I've also validated that I can enact new manually defined throttles in line with refresh periods, and that I can remove a throttle to return to the default behavior.
cc @ndimiduk @hgromer