Skip to content

[smtp_batch] Additional field for grouping #2586

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
Lukas-Heindl opened this issue Mar 20, 2025 · 16 comments
Open

[smtp_batch] Additional field for grouping #2586

Lukas-Heindl opened this issue Mar 20, 2025 · 16 comments
Assignees
Labels
feature Indicates new feature requests or new features help wanted Indicates that a maintainer wants help on an issue or pull request

Comments

@Lukas-Heindl
Copy link
Contributor

Hi,

I have a usecase where I'd like to send batched E-Mails to someone with collected events for which I want to use the smtp_batch output bot. Then in this case it would be nice to break down the events for one contact further into multiple E-Mails. Would it be possible to add some option such that the group-by is not only done by source.abuse_contact but also by a field determined by the bot configuration?
Eventually it would also be nice to use this field also included in the group-by in e.g. the subject, but that's just for the future.


I'm willing to submit this as a PR myself, I just wanted to create a place for input and also most important to check beforehand whether this has any chance of being merged.

So far smtp_batch inserts each event as value into redis with the bot_id + source.abuse_contact as key:

for mail in (field if isinstance(field, list) else [field]):
self.cache.redis.rpush(f"{self.key}{mail}", message.to_json())

It then later uses redis.keys() to retrieve all keys with a certain prefix (bot_id):

for mail_record in self.cache.redis.keys(f"{self.key}*")[slice(self.limit_results)]:

So the most straight forward way would be to simply extend the redis-key with the value of the field that should be included in the logical group-by. But we'd need to be careful, the retrieved redis-key (here mail_record) is still used to obtain the destination E-Mail address:

email_to = str(mail_record[len(self.key):], encoding="utf-8")

In order to solve this I see two options

  1. separate the values used to derive the redis-key with a special character like | -> in this case we'd need to make sure this character doesn't occur in any of the values -> replacing (but probably modifying the values is a bad idea) or escaping (a bit more complex) needed
  2. we could switch to not using the retrieved redis-key anymore. Instead we know all retrieved events must have the same source.abuse_contact so we could simply get the value from an arbitrary event we retrieved from redis (same thing if we later want to gain access to the other field we included in the group-by). Decoupling the redis-key from the rest of the logic also should make it possible to simply hash the redis-key which avoids all trouble with the redis-key getting much longer with more keys involved in the group-by logic. So in this case the redis-key would look like this f"{self.key}{hash(source.abuse_contact, other_field1, other_field2, ...)}"

As I mentioned above I've got two questions

  1. are you open generally for such additions to the smtp_batch bot?
  2. if so, I'm happy about any input you have regarding my sketch/draft of an implementation (I think, I'd probably go for the 2nd approach described above)
@sebix sebix added feature Indicates new feature requests or new features help wanted Indicates that a maintainer wants help on an issue or pull request labels Mar 20, 2025
@sebix
Copy link
Member

sebix commented Mar 20, 2025

whether this has any chance of being merged.

Sure!

@e3rd is the original author of the bot and likely a user of it as well so I assume he can give you some feedback on your questions.

@e3rd e3rd self-assigned this Mar 20, 2025
@e3rd
Copy link
Member

e3rd commented Mar 20, 2025

Thanks, I ll definitely take a look at it, thanks!

@e3rd
Copy link
Member

e3rd commented Apr 7, 2025

The 2nd approach really seems great! You make the group-by field defaulted to source.abuse_contact to keep it backwards compatible, right?

I am really ashamed I slowed down your PR by three weeks! Looking forward to the PR, please tag me there so that I can review.

@e3rd e3rd assigned Lukas-Heindl and unassigned e3rd Apr 7, 2025
@Lukas-Heindl
Copy link
Contributor Author

All good.

Yes, the solution should be backwards compatible (also we need to force source.abuse_contact always being in the list of fields for which the events are grouped-by, since this field is always needed to send the mail so it has to be the same for all events).
Then I'll go and try approach 2.

Probably this will be a while because of time constraints and I need to setup my development environment first (so I can test all this).

@e3rd
Copy link
Member

e3rd commented Apr 9, 2025

Fingers crossed!

@e3rd
Copy link
Member

e3rd commented Apr 9, 2025

If you don't mind, I'd suggest to add a configuration option too. What do you think?

include_into_body: Optional[int]
""" Insert the first N characters of the attachment to the body. """

If bigger than 0, the part of the attachment contents would be put into the body apart from being attached.

@Lukas-Heindl
Copy link
Contributor Author

Sounds interesting. Especially with the idea of introducing jinja2 to this bot (I know there already is the templated mail bot, but it does not allow for batching the events).
As I think about your suggestion the open issue would be how to display the start for the attachment into the body. But this might also be solved with jinja.

Also I'm not decided on whether it should be something like, only include the start of the attachment if it is smaller than N (so it fully fits into the body) or unconditionally the first N chars of the attachment (acting as some sort of preview).

@Lukas-Heindl
Copy link
Contributor Author

Talking more generally about this bot, I am curious. Do you know why the attachment always is zipped? Is is common the attachment is that large and compression really brings a benefit (or is necessary due to the size and limitations in E-Mail servers rejecting large attachments)?

@e3rd
Copy link
Member

e3rd commented Apr 9, 2025

how to display the start

just the plain text

full if small VS preview

I have no opinion, both fit :) The preview seems slightly better as it is consistent, but I don't know.
Also, it might be better to be the number of events, not number of chars.

Do you know why the attachment always is zipped?

Unfortunately, according to our experiences, mail servers are mad. Sometimes, 10 MB caused the e-mail to vanish if I remember well. To be consistent (so that the partners could automate), we've enforced the zip for all.

But if you don't like it, just add a flag for that.

@Lukas-Heindl
Copy link
Contributor Author

Adding onto the ideas regarding the smtp_batch bot:

  • According to my test intelmq.bots.outputs.smtp_batch.output --cli --send <bot-id> does not exit with exit-code != 0 if sending the E-Mail failed (e.g. wrong password) -> in my opinion this should be changed allowing the user to trigger an alert (to the server-admin) something in the operation of intelmq has gone wrong

@e3rd
Copy link
Member

e3rd commented Apr 17, 2025

Great enhancement, of course!

Lukas-Heindl added a commit to Lukas-Heindl/intelmq that referenced this issue Apr 24, 2025
Sending mails with the smtp_batch output might fail e.g. due to wrong
settings regarding the SMTP-Server. So far this is not reflected in the
exit code when running the bot via cli to trigger sending the mails.
Fixing this (at least for the non-interactive call) enables users to
integrate this into their monitoring properly.
Lukas-Heindl added a commit to Lukas-Heindl/intelmq that referenced this issue Apr 24, 2025
Sending mails with the smtp_batch output might fail e.g. due to wrong
settings regarding the SMTP-Server. So far this is not reflected in the
exit code when running the bot via cli to trigger sending the mails.
Fixing this (at least for the non-interactive call) enables users to
integrate this into their monitoring properly.
@Lukas-Heindl
Copy link
Contributor Author

I am a bit confused

count = len(rows_output)
if not count:
path = None
else:

mail = Mail(mail_record, email_to, path, count)
self.build_mail(mail, send=False)
if count:
yield mail

@e3rd do you still remember what the idea here was? Why building the mail if count is None, particularly as the mail is not yielded anyhow? (can it even be None? https://docs.python.org/3/library/functions.html#len is not too specific about it) So the control flow looks a bit odd to me.
Also do we really want to send mail in case there are no events captured?

(And if you already want to have a look, I'm drafting at https://github.com/Lukas-Heindl/intelmq/tree/smtpBatchGroup)

@e3rd
Copy link
Member

e3rd commented Apr 24, 2025

Good point, a design flaw, get rid of it. The draft looks nice! :)

sebix pushed a commit to Lukas-Heindl/intelmq that referenced this issue Apr 28, 2025
Sending mails with the smtp_batch output might fail e.g. due to wrong
settings regarding the SMTP-Server. So far this is not reflected in the
exit code when running the bot via cli to trigger sending the mails.
Fixing this (at least for the non-interactive call) enables users to
integrate this into their monitoring properly.
@Lukas-Heindl
Copy link
Contributor Author

Ok this is getting more like a (minor) refactoring of the bot.

I noticed the parameters allowed_fieldnames and fieldnames_translation are not documented yet. Writing the documentation for these parameters i noticed I need to include something like if you are inconsistent here, the bot will crash

Reason:

rows_output.append(OrderedDict({self.fieldnames_translation[k]: row[k] for k in ordered_keys}))

will crash in case the ordered_keys (intersection of allowed_fieldnames and the keys defined in the event) contains a key which is not defined in fieldnames_translation.

Personally I'd be okay with this (shifting the responsibility to the user/admin to do the configuration right). Still I wanted to raise this for discussion whether the bot should be more robust in this regard. We could use dict.get to set a default (maybe the name of the key in allowed_fieldnames) when doing the translation.

Or we could remove the allowed_fieldnames option and use fieldnames_translation.keys() as a replacement. Though, just modifying the allowed_fieldnames option to remove a column from the csv without needing to toucht the translation might be an ease of use we want to keep.

@Lukas-Heindl
Copy link
Contributor Author

Except for the include_into_body / preview, I think everything discussed here is implemented in #2610

@Lukas-Heindl
Copy link
Contributor Author

Addition to the collectio of ideas:

  1. optionally use paging to retrieve the data from redis (I experienced some OOM with too less RAM+swap -- paging the data should mitigate a bit)
  2. same reason like for 1.: Currently we first collect all data for all mails just for the nice display what mails are in the pipeline. Maybe some way of "streaming" the data from redis directly to mail (so obtain all the data for one mail -> send the mail and loop this)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Indicates new feature requests or new features help wanted Indicates that a maintainer wants help on an issue or pull request
Projects
None yet
Development

No branches or pull requests

3 participants