Added support for JSON containing multiple events #2545

Open
wants to merge 2 commits into
base: develop
6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -19,6 +19,9 @@ Please refer to the [NEWS](NEWS.md) for a list of changes which have an affect o
 - Drop support for Python 3.8 (fixes #2616, PR#2617 by Sebastian Wagner).
 - `intelmq.lib.splitreports`: Handle bot parameter `chunk_size` values empty string, due to missing parameter typing checks (PR#2604 by Sebastian Wagner).
 - `intelmq.lib.mixins.sql` Add Support for MySQL (PR#2625 by Karl-Johan Karlsson).
+- Python 3.8 or newer is required (PR#2541 by Sebastian Wagner).
+- `intelmq.lib.utils.list_all_bots`/`intelmqctl check`: Fix check for bot executable in $PATH by using the bot name instead of the import path (fixes #2559, PR#2564 by Sebastian Wagner).
+- `intelmq.lib.message.MessageFactory.from_dict`: Do not modify the dict parameter by adding the `__type` field and raise an error when type is not determinable (PR#2545 by Sebastian Wagner).

 ### Development

@@ -29,6 +32,9 @@ Please refer to the [NEWS](NEWS.md) for a list of changes which have an affect o

 #### Parsers
 - `intelmq.bots.parsers.cymru.parser_cap_program`: Add mapping for TOR and ipv6-icmp protocol (PR#2621 by Mikk Margus Möll).
+- `intelmq.bots.parsers.json.parser`:
+  - Support data containing lists of JSON Events (PR#2545 by Tim de Boer).
+  - Add default `classification.type` with value `undetermined` if input data has no classification itself (PR#2545 by Sebastian Wagner).

 #### Experts
 - `intelmq.bots.experts.asn_lookup.expert`:
61 changes: 59 additions & 2 deletions docs/user/bots.md
@@ -1923,12 +1923,69 @@ also <https://www.crummy.com/software/BeautifulSoup/bs4/doc/>). Defaults to `htm

 ---

-### JSON (TODO) <div id="intelmq.bots.parsers.json.parser" />
+### JSON <div id="intelmq.bots.parsers.json.parser" />

-TODO
+Parses JSON events that are already in IntelMQ format.
+If the input data does not contain the field `classification.type`, it is set to `undetermined`.
+
+The parser supports three input modes:
+
+#### Input data is one event
+Example:
+```json
+{ INTELMQ data... }
+```
+or:
+```
+{
+INTELMQ data...
+}
+```
+
+Configuration:
+* `splitlines`: False
+* `multiple_events`: False
+
+#### Input data is in JSON stream format
+Example:
+```json
+{ INTELMQ data... }
+{ INTELMQ data... }
+{ INTELMQ data... }
+```
+
+Configuration:
+* `splitlines`: True
+* `multiple_events`: False
+
+#### Input data is a list of events
+Example:
+```json
+[
+{ INTELMQ data... },
+{ INTELMQ data... },
+...
+]
+```
+
+Configuration:
+* `splitlines`: False
+* `multiple_events`: True
+
+#### Configuration
+
+**Module:** `intelmq.bots.parsers.json.parser`
+
+**Parameters:**
+
+**`splitlines`**
+
+(optional, boolean) When the input file contains one JSON dictionary per line, set this to `true`. Defaults to `false`.
+
+**`multiple_events`**
+
+(optional, boolean) When the input file contains a JSON list of dictionaries, set this to `true`. Defaults to `false`.

 ---

 ### Key=Value Parser <div id="intelmq.bots.parsers.key_value.parser" />
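
The three documented modes can be pictured with a small standalone sketch. `split_payload` is a hypothetical helper written for illustration, not part of IntelMQ; it works on already-decoded text and uses only the standard `json` module. It checks `multiple_events` before `splitlines`, mirroring the `if`/`elif` order in the `parser.py` change of this PR, so `multiple_events` takes precedence when both are set.

```python
import json


def split_payload(payload: str, splitlines: bool = False,
                  multiple_events: bool = False) -> list:
    """Hypothetical helper: return one dict per event found in payload."""
    if multiple_events:
        # The whole payload is a single JSON list of dictionaries.
        return json.loads(payload)
    if splitlines:
        # JSON stream format: one dictionary per line.
        return [json.loads(line) for line in payload.splitlines()]
    # Default: the payload is exactly one dictionary.
    return [json.loads(payload)]


single = '{"source.ip": "127.0.0.1"}'
stream = '{"source.ip": "127.0.0.1"}\n{"source.ip": "127.0.0.2"}'
listed = '[{"source.ip": "127.0.0.1"}, {"source.ip": "127.0.0.2"}]'

one_event = split_payload(single)
stream_events = split_payload(stream, splitlines=True)
list_events = split_payload(listed, multiple_events=True)
```

Feeding the list-shaped payload to the `splitlines` mode would try to decode the whole list as one event, which is why the list case needs its own flag.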
34 changes: 20 additions & 14 deletions intelmq/bots/parsers/json/parser.py
@@ -1,38 +1,44 @@
-# SPDX-FileCopyrightText: 2016 by Bundesamt für Sicherheit in der Informationstechnik
+# SPDX-FileCopyrightText: 2016 by Bundesamt für Sicherheit in der Informationstechnik, 2016-2021 nic.at GmbH, 2024 Tim de Boer, 2025 Institute for Common Good Technology
 #
 # SPDX-License-Identifier: AGPL-3.0-or-later
 """
 JSON Parser Bot
 Retrieves a base64 encoded JSON-String from raw and converts it into an
 event.
-
-Copyright (C) 2016 by Bundesamt für Sicherheit in der Informationstechnik
-Software engineering by Intevation GmbH
 """
 from intelmq.lib.bot import ParserBot
 from intelmq.lib.message import MessageFactory
 from intelmq.lib.utils import base64_decode
+from json import loads as json_loads, dumps as json_dumps


 class JSONParserBot(ParserBot):
     """Parse IntelMQ-JSON data"""
-    splitlines = False
+    splitlines: bool = False
+    multiple_events: bool = False

     def process(self):
         report = self.receive_message()
-        if self.splitlines:
-            lines = base64_decode(report['raw']).splitlines()
+        if self.multiple_events:
+            lines = json_loads(base64_decode(report["raw"]))
+        elif self.splitlines:
+            lines = base64_decode(report["raw"]).splitlines()
         else:
-            lines = [base64_decode(report['raw'])]
+            lines = [base64_decode(report["raw"])]

         for line in lines:
-            new_event = MessageFactory.unserialize(line,
-                                                   harmonization=self.harmonization,
-                                                   default_type='Event')
             event = self.new_event(report)
-            event.update(new_event)
-            if 'raw' not in event:
-                event['raw'] = line
+            if self.multiple_events:
+                event.update(MessageFactory.from_dict(line,
+                                                      harmonization=self.harmonization,
+                                                      default_type="Event"))
+                event["raw"] = json_dumps(line, sort_keys=True)
+            else:
+                event.update(MessageFactory.unserialize(line,
+                                                        harmonization=self.harmonization,
+                                                        default_type="Event"))
+                event.add('raw', line, overwrite=False)
+            event.add("classification.type", "undetermined", overwrite=False)  # set to undetermined if input has no classification
             self.send_message(event)
         self.acknowledge_message()
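
The `multiple_events` branch above does two things per list item: it stores a key-sorted JSON dump of the item as `raw`, and it sets `classification.type` only when the input lacks one. A minimal sketch with plain dictionaries follows. `build_event` is a hypothetical stand-in, `dict.setdefault` approximates `event.add(..., overwrite=False)`, and the explicit base64 step is an assumption taken from the expected values in this PR's test, where `raw` is compared in base64 form.

```python
import base64
import json


def build_event(item: dict, feed_name: str) -> dict:
    """Hypothetical plain-dict stand-in for the multiple_events branch."""
    event = {"feed.name": feed_name}  # stands in for self.new_event(report)
    event.update(item)
    # Key-sorted dump: the same item yields the same raw text
    # regardless of the key order in the input file.
    raw_text = json.dumps(item, sort_keys=True)
    event["raw"] = base64.b64encode(raw_text.encode()).decode()
    # Only fills the field if the input did not provide it.
    event.setdefault("classification.type", "undetermined")
    return event


first = build_event({"source.ip": "127.1.2.1", "source.port": 80}, "NCSC.NL")
reordered = build_event({"source.port": 80, "source.ip": "127.1.2.1"}, "NCSC.NL")
classified = build_event({"source.ip": "127.0.0.2",
                          "classification.type": "c2-server"}, "NCSC.NL")
```

Sorting the keys makes `raw` reproducible, which is what lets the test compute the expected value independently of the fixture's key order.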

8 changes: 5 additions & 3 deletions intelmq/lib/message.py
@@ -49,17 +49,19 @@ def from_dict(message: dict, harmonization=None,
             MessageFactory.unserialize
             MessageFactory.serialize
         """
-        if default_type and "__type" not in message:
-            message["__type"] = default_type
+        if not default_type and '__type' not in message:
+            raise ValueError("Message type could not be determined. Input message misses '__type' and parameter 'default_type' not given.")
         try:
-            class_reference = getattr(intelmq.lib.message, message["__type"])
+            class_reference = getattr(intelmq.lib.message, message.get("__type", default_type))
         except AttributeError:
             raise exceptions.InvalidArgument('__type',
                                              got=message["__type"],
                                              expected=VALID_MESSSAGE_TYPES,
                                              docs=HARMONIZATION_CONF_FILE)
         # don't modify the parameter
         message_copy = message.copy()
+        if default_type and "__type" not in message_copy:
+            message_copy["__type"] = default_type
         del message_copy["__type"]
         return class_reference(message_copy, auto=True, harmonization=harmonization)
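
The intended contract of the changed `from_dict` can be sketched without IntelMQ installed. Everything below is a simplified stand-in: the `Event` and `Report` classes, the `MESSAGE_TYPES` registry and the plain `ValueError` for unknown types are assumptions for illustration, whereas the real method resolves classes via `getattr` and raises `exceptions.InvalidArgument`.

```python
from typing import Optional


class Event(dict):
    """Stand-in for intelmq.lib.message.Event (illustration only)."""


class Report(dict):
    """Stand-in for intelmq.lib.message.Report (illustration only)."""


MESSAGE_TYPES = {"Event": Event, "Report": Report}


def from_dict(message: dict, default_type: Optional[str] = None) -> dict:
    if not default_type and "__type" not in message:
        raise ValueError("Message type could not be determined.")
    type_name = message.get("__type", default_type)
    try:
        class_reference = MESSAGE_TYPES[type_name]
    except KeyError:
        # Report the resolved name; this also works when '__type' is absent.
        raise ValueError(f"Invalid message type {type_name!r}") from None
    message_copy = message.copy()  # leave the caller's dict untouched
    message_copy.pop("__type", None)
    return class_reference(message_copy)


data = {"source.ip": "192.0.43.8"}
event = from_dict(data, default_type="Event")
```

One point that may be worth checking in the hunk above: the `except AttributeError` branch still reads `message["__type"]`. If the key is absent and `default_type` names no message class, that lookup would presumably raise `KeyError` instead of `InvalidArgument`. The sketch sidesteps this by reporting the resolved name.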

68 changes: 68 additions & 0 deletions intelmq/tests/bots/parsers/json/ncscnl.json
@@ -0,0 +1,68 @@
+[
+    {
+        "extra.dataset_collections": "0",
+        "extra.dataset_files": "1",
+        "extra.dataset_infected": "false",
+        "extra.dataset_ransom": "null",
+        "extra.dataset_rows": "0",
+        "extra.dataset_size": "301",
+        "protocol.application": "https",
+        "protocol.transport": "tcp",
+        "source.asn": 12345689,
+        "source.fqdn": "fqdn-example-1.tld",
+        "source.geolocation.cc": "NL",
+        "source.geolocation.city": "Enschede",
+        "source.geolocation.latitude": 52.0000000000000,
+        "source.geolocation.longitude": 6.0000000000000,
+        "source.geolocation.region": "Overijssel",
+        "source.ip": "127.1.2.1",
+        "source.network": "127.1.0.0/16",
+        "source.port": 80,
+        "time.source": "2024-12-16T02:08:06+00:00"
+    },
+    {
+        "extra.dataset_collections": "0",
+        "extra.dataset_files": "1",
+        "extra.dataset_infected": "false",
+        "extra.dataset_ransom": "null",
+        "extra.dataset_rows": "0",
+        "extra.dataset_size": "615",
+        "extra.os_name": "Ubuntu",
+        "extra.software": "Apache",
+        "extra.tag": "rescan",
+        "extra.version": "2.4.58",
+        "protocol.application": "https",
+        "protocol.transport": "tcp",
+        "source.asn": 12345689,
+        "source.fqdn": "fqdn-example-2.tld",
+        "source.geolocation.cc": "NL",
+        "source.geolocation.city": "Eindhoven",
+        "source.geolocation.latitude": 51.0000000000000,
+        "source.geolocation.longitude": 5.0000000000000,
+        "source.geolocation.region": "North Brabant",
+        "source.ip": "127.1.2.2",
+        "source.network": "127.1.0.0/16",
+        "source.port": 443,
+        "time.source": "2024-12-16T02:08:12+00:00"
+    },
+    {
+        "extra.dataset_collections": "0",
+        "extra.dataset_files": "1",
+        "extra.dataset_infected": "false",
+        "extra.dataset_ransom": "null",
+        "extra.dataset_rows": "0",
+        "extra.dataset_size": "421",
+        "protocol.application": "http",
+        "protocol.transport": "tcp",
+        "source.asn": 12345689,
+        "source.geolocation.cc": "NL",
+        "source.geolocation.city": "Enschede",
+        "source.geolocation.latitude": 52.0000000000000,
+        "source.geolocation.longitude": 6.0000000000000,
+        "source.geolocation.region": "Overijssel",
+        "source.ip": "127.1.2.3",
+        "source.network": "127.1.0.0/16",
+        "source.port": 9000,
+        "time.source": "2024-12-15T21:09:49+00:00"
+    }
+]
2 changes: 2 additions & 0 deletions intelmq/tests/bots/parsers/json/ncscnl.json.license
@@ -0,0 +1,2 @@
+SPDX-FileCopyrightText: 2024 Tim de Boer
+SPDX-License-Identifier: AGPL-3.0-or-later
27 changes: 25 additions & 2 deletions intelmq/tests/bots/parsers/json/test_parser.py
@@ -6,6 +6,7 @@
 import base64
 import os
 import unittest
+from json import loads as json_loads, dumps as json_dumps

 import intelmq.lib.test as test
 from intelmq.bots.parsers.json.parser import JSONParserBot
@@ -51,6 +52,21 @@
 NO_DEFAULT_EVENT = MULTILINE_EVENTS[1].copy()
 NO_DEFAULT_EVENT['raw'] = base64.b64encode(b'{"source.ip": "127.0.0.2", "classification.type": "c2-server"}\n').decode()

+with open(os.path.join(os.path.dirname(__file__), 'ncscnl.json'), 'rb') as fh:
+    NCSCNL_FILE = fh.read()
+NCSCNL_RAW = base64.b64encode(NCSCNL_FILE).decode()
+NCSC_EVENTS = json_loads(NCSCNL_FILE)
+for i, event in enumerate(NCSC_EVENTS):
+    NCSC_EVENTS[i]['raw'] = base64.b64encode(json_dumps(event, sort_keys=True).encode()).decode()
+    NCSC_EVENTS[i]['classification.type'] = 'undetermined'
+    NCSC_EVENTS[i]['feed.name'] = 'NCSC.NL'
+    NCSC_EVENTS[i]['__type'] = 'Event'
+
+NCSCNL_REPORT = {"feed.name": "NCSC.NL",
+                 "raw": NCSCNL_RAW,
+                 "__type": "Report",
+                 }
+

 class TestJSONParserBot(test.BotTestCase, unittest.TestCase):
     """
@@ -70,8 +86,7 @@ def test_oneline_report(self):
     def test_multiline_report(self):
         """ Test if correct Event has been produced. """
         self.input_message = MULTILINE_REPORT
-        self.sysconfig = {"splitlines": True}
-        self.run_bot()
+        self.run_bot(parameters={"splitlines": True})
         self.assertMessageEqual(0, MULTILINE_EVENTS[0])
         self.assertMessageEqual(1, MULTILINE_EVENTS[1])

@@ -81,6 +96,14 @@ def test_default_event(self):
         self.run_bot()
         self.assertMessageEqual(0, NO_DEFAULT_EVENT)

+    def test_ncscnl(self):
+        """ A file containing a list of events (not per line) """
+        self.input_message = NCSCNL_REPORT
+        self.run_bot(parameters={'multiple_events': True})
+        self.assertMessageEqual(0, NCSC_EVENTS[0])
+        self.assertMessageEqual(1, NCSC_EVENTS[1])
+        self.assertMessageEqual(2, NCSC_EVENTS[2])
+

 if __name__ == '__main__':  # pragma: no cover
     unittest.main()
5 changes: 3 additions & 2 deletions intelmq/tests/lib/test_bot_library_mode.py
@@ -32,6 +32,7 @@
     "destination.ip": "192.0.43.8",  # iana.org.
     "time.observation": "2015-01-01T00:00:00+00:00",
 }
+EXAMPLE_IP_OUTPUT = MessageFactory.from_dict(EXAMPLE_IP_INPUT, default_type='Event')  # adds __type = Event


 class BrokenInitExpertBot(ExpertBot):
@@ -130,15 +131,15 @@ def test_bot_multi_message():

 def test_bot_raises_and_second_message():
     """
-    The first message raises an error and the second message
+    The first message raises an error and the second message is processed correctly
     This test is based on an issue where the exception-raising message was not cleared from the internal message store of the Bot/Pipeline instance and thus re-used on the second run
     """
     raises_on_first_run = RaisesOnFirstRunExpertBot('raises', settings=BotLibSettings)
     with raises(ValueError):
         raises_on_first_run.process_message(EXAMPLE_DATA_URL)
     queues = raises_on_first_run.process_message(EXAMPLE_IP_INPUT)
     assert len(queues['output']) == 1
-    assertMessageEqual(queues['output'][0], EXAMPLE_IP_INPUT)
+    assertMessageEqual(queues['output'][0], EXAMPLE_IP_OUTPUT)


 if __name__ == '__main__':  # pragma: no cover
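
A likely reason the assertion now compares against `EXAMPLE_IP_OUTPUT` is that the old `from_dict` wrote `__type` into the caller's dictionary, so the input dictionary silently doubled as the expected output. The sketch below contrasts the two behaviours with simplified stand-ins. Both functions are hypothetical and return plain dicts, whereas the real method returns a message object.

```python
import copy


def from_dict_mutating(message: dict, default_type=None) -> dict:
    """Sketch of the removed behaviour: writes '__type' into the argument."""
    if default_type and "__type" not in message:
        message["__type"] = default_type
    return dict(message)


def from_dict_copying(message: dict, default_type=None) -> dict:
    """Sketch of the new behaviour: only the returned copy carries the type."""
    result = dict(message)
    if default_type:
        result.setdefault("__type", default_type)
    return result


old_input = {"destination.ip": "192.0.43.8"}
new_input = copy.deepcopy(old_input)

old_output = from_dict_mutating(old_input, default_type="Event")
new_output = from_dict_copying(new_input, default_type="Event")
```

With the mutating variant, input and output compare equal only because of the side effect. With the copying variant, the expected value has to be built explicitly, which is what `EXAMPLE_IP_OUTPUT` does in the test.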