Replies: 1 comment
Integrated into documentation.
This is just a rough brain dump of the things I could think of that are important when writing a crawler from scratch. It should be written out in a nicer way and combined with an example crawler. Note that it also covers things that are not yet in the code base.
New crawler requirements
If you add a new data source:
If you add a new Python dependency:
Always required:
Points to note
- Use `all=False` when creating/fetching nodes.
- Do not use `sys.exit`, but raise an exception to kill a crawler. Also print to log before raising the exception.
- Use `logging.{info,warning,error}` as appropriate. You can log some steps to `info`, but do not be too verbose. `warning` should be for unexpected cases which are not critical enough to justify killing the crawler. `error` should be followed by an exception. Batch functions automatically log node/relationship creations, so you do not have to do this manually.
- Make `reference_url_data` as precise as possible, especially if it changes for parts of the data within the same crawler. Also try to use URLs that point to the correct data even when accessed at a later point in time. Note: `URL` is used as the default value for `reference_url_data`. Always specify a `URL`, even if it might not be precise and is updated in the code. It makes it easier to know where this crawler gets its data from just by looking at the header.
- Try to specify a `reference_url_info` that gives an explanation / reference to the data.
- Try to specify `reference_time_modification`, but do not add this if you are unsure. For this field it is better to give no info than wrong info.
- […] `Crawler(ORG, URL, NAME)` call. The `main` function in the crawler file is only for testing or individual runs of the crawler and should not be modified.
- `NAME` should always be `directory.file`.
- […] `tmp` directory (advanced usage; not required for most crawlers)
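The logging and exception rules above could look roughly like this in practice. This is only a sketch with made-up names: `CrawlerError`, `filter_rows`, and the `asn` field are hypothetical and not part of the code base, and any suitable exception type would do.

```python
import logging


class CrawlerError(Exception):
    """Hypothetical exception type; any suitable exception works."""


def filter_rows(rows):
    """Return the rows that contain an 'asn' key; raise if there is no data."""
    if not rows:
        # error: log first, then raise an exception instead of calling sys.exit.
        logging.error('No data received, aborting.')
        raise CrawlerError('No data received')

    # info: log a step, but only once, not per row.
    logging.info(f'Processing {len(rows)} rows')

    valid = []
    for row in rows:
        if 'asn' not in row:
            # warning: unexpected, but not critical enough to kill the crawler.
            logging.warning(f'Skipping row without "asn" field: {row}')
            continue
        valid.append(row)
    return valid
```

Malformed rows only produce a warning and are skipped, while an empty input is logged as an error and then stops the crawler with an exception.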
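To illustrate the notes on `NAME`, `URL`, and the `reference_*` fields, here is a small standalone sketch. The helpers `name_from_path` and `build_reference`, the `example_org` paths, and the example.org URLs are all assumptions made up for illustration; they are not the code base's API and only mirror the rules described above.

```python
from pathlib import PurePosixPath

# Header constants as they might appear at the top of a crawler file.
ORG = 'Example Org'
URL = 'https://example.org/data/'
NAME = 'example_org.prefixes'  # always directory.file


def name_from_path(path):
    """Hypothetical helper: derive 'directory.file' from a crawler file path."""
    p = PurePosixPath(path)
    return f'{p.parent.name}.{p.stem}'


def build_reference(url, url_data=None, url_info=None, time_modification=None):
    """Hypothetical helper: assemble the reference fields of a crawler."""
    reference = {
        # URL is the default; override it when a more precise URL is known.
        'reference_url_data': url_data if url_data is not None else url,
    }
    if url_info is not None:
        reference['reference_url_info'] = url_info
    # Better no info than wrong info: only set the time when it is known.
    if time_modification is not None:
        reference['reference_time_modification'] = time_modification
    return reference
```

For example, `build_reference(URL, url_data=URL + '2024-01-01.csv')` points at the exact file that was fetched and leaves `reference_time_modification` out because it was not given.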