a) The original dataset on Hugging Face (which I requested access to) has no date column; instead it has a quarter column with values like 'Q1_2020'. However, the GitHub README says that to run term_extracter.py we need to specify date and year as command-line arguments. Could you please clarify?
b) Should we pass the dataset hosted on Hugging Face as a command-line input when running term_extracter.py?
c) Also, I understand that we have to update the prompt in gpt-4.ipynb so that it includes the new_dterms, but does gpt-4.ipynb also lack the logic to extract new_dterms from the output of term_extracter.py?
d) To test in a real-world scenario, we could:
Gather a dataset (e.g., 100 tweets)
Randomly select a subset (10-20) as seeds
Extract derogatory terms and identities from these seeds
Incorporate these terms into the prompt and test on the remaining 80 tweets, or on all 100 tweets? Please clarify. (A sketch of this flow follows the questions below.)
However, some questions arise:
How do we keep the list of derogatory terms and identities up-to-date?
What's the process for automatically updating prompts with new terms before a tweet is sent?
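To make question d) concrete, here is a minimal sketch of the flow I have in mind; `extract_terms` and `classify` are hypothetical stand-ins for term_extracter.py and the GPT-4 call in gpt-4.ipynb, not the repo's actual API:

```python
import random

def run_real_world_test(tweets, extract_terms, classify, n_seeds=20):
    """extract_terms: seed tweets -> derogatory terms (term_extracter.py);
    classify: prompt -> label (the GPT-4 call). Both are placeholders."""
    random.shuffle(tweets)
    seeds, test_set = tweets[:n_seeds], tweets[n_seeds:]

    # Pull derogatory terms and identities from the seed tweets.
    new_dterms = extract_terms(seeds)

    # Refresh the prompt with the current term list before every request;
    # this is the "automatic update" step I am asking about above.
    template = ("Known derogatory terms: {terms}\n"
                "Is the following tweet hateful? Tweet: {tweet}")
    return [classify(template.format(terms=", ".join(new_dterms), tweet=t))
            for t in test_set]
```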
e) Regarding the missing code mentioned in another GitHub issue, I'm seeking confirmation on this approach:
Extract terms from the randomly sampled 10-20 tweets
Send a GPT request to assess if the content is hateful
If hateful, use KeyBERT to identify derogatory terms and identities
Append these newly identified terms to the existing list
Is this understanding correct?
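For steps 3-4, I imagine the KeyBERT call looking roughly like this (a sketch using KeyBERT's documented API with its default model; the variable names are my own):

```python
from keybert import KeyBERT

existing_terms = []                  # the running derogatory-term list
hateful_tweet = "text of a tweet that GPT judged hateful"

kw_model = KeyBERT()                 # default sentence-transformer backend
candidates = kw_model.extract_keywords(
    hateful_tweet,
    keyphrase_ngram_range=(1, 2),    # unigrams and bigrams
    stop_words="english",
    top_n=5,
)                                    # list of (phrase, score) pairs

# Step 4: append only phrases that are not already on the list.
for phrase, _score in candidates:
    if phrase not in existing_terms:
        existing_terms.append(phrase)
```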
f) Additionally, for the testing mentioned in your paper:
Gather all tweets from a specific time period (quarter or month) that were manually labeled as hateful.
Randomly sample 10-20 tweets from this set.
Extract derogatory terms from these samples.
Test these derogatory terms inside a prompt (with automatic updates) on all the tweets from that time period.
Could you clarify if this process was done quarterly or monthly?
Looking forward to your reply.
Thanks,
Divye
The dataset on Hugging Face is the annotated hate speech dataset, not the raw data, so you cannot use it directly for the term extraction step. The term extraction script is for getting new terms from your own newly collected tweets (including their date information).
If you want to reproduce the original process, please email Nishant ([email protected]) and me for the original data and scripts we used.
d) Your idea for the real-world scenario is good to go. For your questions:
We deployed NLTK directly to verify the terms. You can also use a larger or newer dictionary.
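For example, assuming "verify" means checking candidates against a standard word list to filter them, a minimal version with NLTK's words corpus looks like this (a sketch, not our exact script):

```python
import nltk
nltk.download("words", quiet=True)   # one-time download of the word list
from nltk.corpus import words

english_vocab = {w.lower() for w in words.words()}

def is_novel(term):
    """Terms absent from the dictionary are candidate new derogatory terms."""
    return term.lower() not in english_vocab
```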
In our framework, the seed tweets provide the initial "new terms." If you want a fully automatic system, the hate speech identified by the detection process needs to be fed back to extend the seed dataset, as sketched below.
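Schematically, the fully automatic loop could look like this (a sketch; `extract_terms` and `detect` stand in for our scripts):

```python
def closed_loop(stream, seeds, extract_terms, detect):
    """Tweets the detector flags as hateful join the seed set, so later
    extractions can surface the new terms those tweets contain."""
    terms = extract_terms(seeds)
    for tweet in stream:
        if detect(tweet, terms) == "hateful":
            seeds.append(tweet)
            terms = extract_terms(seeds)   # refresh with the grown seed set
    return terms
```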
e) We did not ask GPT whether the tweet was hateful. Instead, since this is seed data, the annotators labeled it manually using a codebook.
f) Almost there. However, to avoid data leakage, we used the terms from previous quarters to test the current one; for the first quarter, we did not include any new terms. We could absolutely do it monthly if you have a larger dataset, which may make it more accurate.
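In pseudocode, the quarterly protocol is roughly the following (a sketch only; `extract_terms` and `detect` are placeholders for our scripts):

```python
import random

def quarterly_eval(tweets_by_quarter, extract_terms, detect):
    """tweets_by_quarter: dict in chronological order, e.g.
    {'Q1_2020': [...], 'Q2_2020': [...]}. Each quarter is tested with
    terms from earlier quarters only, so the quarter under test never
    feeds its own prompt; Q1 runs with an empty term list."""
    known_terms, results = [], {}
    for quarter, tweets in tweets_by_quarter.items():
        results[quarter] = [detect(t, known_terms) for t in tweets]
        seeds = random.sample(tweets, min(20, len(tweets)))  # 10-20 seeds
        known_terms = list(dict.fromkeys(known_terms + extract_terms(seeds)))
    return results
```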
Please email us directly if you have any questions; I do not check the GitHub repo frequently.