term_extracter.py inconsistent command line inputs #3

Closed
DivyeMaggo-Deakin opened this issue Mar 18, 2025 · 1 comment

DivyeMaggo-Deakin commented Mar 18, 2025

a) The original dataset on Hugging Face (which I requested access to) has no date column; instead it has a quarter column with values like 'Q1_2020'. However, the GitHub README says that to run term_extracter.py we need to specify the date and year as command-line arguments. Could you please clarify?
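
For reference, a minimal sketch of how a quarter label such as 'Q1_2020' could be split into a year and a quarter number. `parse_quarter` and `quarter_months` are hypothetical helpers, not part of the repository, and the sketch says nothing about which arguments term_extracter.py actually expects.

```python
import re
from typing import Tuple

# Assumed label shape, taken from the example value 'Q1_2020'.
_QUARTER_RE = re.compile(r"^Q([1-4])_(\d{4})$")


def parse_quarter(label: str) -> Tuple[int, int]:
    """Split a label such as 'Q1_2020' into (year, quarter)."""
    match = _QUARTER_RE.match(label.strip())
    if match is None:
        raise ValueError(f"unrecognised quarter label: {label!r}")
    quarter, year = int(match.group(1)), int(match.group(2))
    return year, quarter


def quarter_months(quarter: int) -> range:
    """Calendar months covered by a quarter, e.g. quarter 1 covers months 1 to 3."""
    first = 3 * (quarter - 1) + 1
    return range(first, first + 3)
```

Returning `(year, quarter)` in that order means the tuples sort chronologically, which is convenient when quarters are processed in sequence.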

b) Should we pass the dataset hosted on Hugging Face as a command-line input when running term_extracter.py?

c) Also, I understand that we have to update the prompt in gpt-4.ipynb so that it includes the new_dterms, but does gpt-4.ipynb also lack the logic to extract new_dterms from the output of term_extracter.py?

d) To test in a real-world scenario, we could:

  1. Gather a dataset (e.g., 100 tweets)
  2. Randomly select a subset (10-20) as seeds
  3. Extract derogatory terms and identities from these seeds
  4. Incorporate these terms into the prompt and test on either the remaining 80 tweets or all 100 tweets. Please clarify which.
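
The split in steps 1 and 2 could be sketched as follows. This is only an illustration: `split_seed_and_test` is a hypothetical helper, and it shows the variant that tests on the tweets left over after sampling; whether the full set should be used instead is the open question above.

```python
import random
from typing import List, Sequence, Tuple


def split_seed_and_test(
    tweets: Sequence[str], n_seed: int, rng_seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Randomly pick n_seed tweets as seeds and return (seeds, remaining)."""
    if not 0 <= n_seed <= len(tweets):
        raise ValueError("n_seed must be between 0 and len(tweets)")
    rng = random.Random(rng_seed)  # fixed seed so the split is reproducible
    seed_idx = set(rng.sample(range(len(tweets)), n_seed))
    seeds = [t for i, t in enumerate(tweets) if i in seed_idx]
    remaining = [t for i, t in enumerate(tweets) if i not in seed_idx]
    return seeds, remaining
```

Sampling by index rather than by text keeps duplicate tweets from being dropped or double-counted.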

However, some questions arise:

  • How do we keep the list of derogatory terms and identities up-to-date?
  • What's the process for automatically updating prompts with new terms before a tweet is sent?

e) Regarding the missing code mentioned in another GitHub issue, I'm seeking confirmation on this approach:

  1. Extract terms from the randomly sampled 10-20 tweets
  2. Send a GPT request to assess if the content is hateful
  3. If hateful, use KeyBERT to identify derogatory terms and identities
  4. Append these newly identified terms to the existing list

Is this understanding correct?

f) Additionally, for the testing mentioned in your paper:

  1. You gather all tweets from a specific time period (quarter or month) that were manually labeled as hateful.
  2. You randomly sample 10-20 tweets from this set.
  3. You extract derogatory terms from these samples.
  4. You test these derogatory terms inside a prompt (with automatic updates) on all the tweets from that time period.

Could you clarify if this process was done quarterly or monthly?

Looking forward to your reply.

Thanks,
Divye

Member

keyanUB commented Mar 24, 2025

Hi Divye,

The dataset on Hugging Face is the annotated hate speech dataset, not the raw data, so you cannot use it directly for the term extraction step. The term extraction script is used to get the new terms from your newly collected tweets (including the data information).
If you want to reproduce the original process, please email Nishant ([email protected]) and me for the original data and scripts we used.

d) Your idea for the real-world scenario is absolutely good to go. For your questions:

  • We directly deployed NLTK to verify the terms. You can also use a larger or newer dictionary.
  • In our framework, the seed tweets offer the initial "new terms." If you want to achieve a fully automatic system, the identified hate speech from the detection process needs to be given back to extend the seed dataset.
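
A rough sketch of that feedback loop, under stated assumptions: `extract_terms` and `is_hateful` are hypothetical callables standing in for the term extraction step and the prompt-based detector, and `run_feedback_loop` is not a function from the repository.

```python
from typing import Callable, Iterable, List, Sequence, Set, Tuple


def run_feedback_loop(
    seed_tweets: Sequence[str],
    incoming_batches: Iterable[Sequence[str]],
    extract_terms: Callable[[str], Set[str]],
    is_hateful: Callable[[str, Set[str]], bool],
) -> Tuple[List[str], Set[str]]:
    """Grow the seed set with detected tweets and re-extract terms from them."""
    seeds = list(seed_tweets)
    terms: Set[str] = set()
    for tweet in seeds:
        terms |= extract_terms(tweet)

    for batch in incoming_batches:
        # Classify the whole batch with the terms known before the batch,
        # so a tweet never contributes terms to its own classification.
        flagged = [t for t in batch if is_hateful(t, terms)]
        for tweet in flagged:
            seeds.append(tweet)
            terms |= extract_terms(tweet)
    return seeds, terms
```

Without a manual check on the flagged tweets, false positives would also feed new terms back into the prompt, so some review step between detection and re-extraction is probably worth considering.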

e) We did not ask GPT whether the tweet was hateful. Instead, because it is seed data, the annotators labeled the tweets manually with a codebook.

f) Almost there. However, to avoid data leakage, we used the terms from the previous quarters to test the current one. For the first quarter, we would not include any new terms. It can certainly be done monthly if you have a larger dataset, which may make it more accurate.
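
A minimal sketch of that leakage-free ordering, assuming the terms accumulate over all earlier quarters (the reply does not say whether only the immediately preceding quarter is used). `terms_available_per_quarter` and the `(year, quarter)` keys are hypothetical names for illustration.

```python
from typing import Dict, Set, Tuple

Quarter = Tuple[int, int]  # (year, quarter), which sorts chronologically


def terms_available_per_quarter(
    terms_by_quarter: Dict[Quarter, Set[str]]
) -> Dict[Quarter, Set[str]]:
    """For each quarter, return only the terms extracted in strictly earlier quarters."""
    available: Dict[Quarter, Set[str]] = {}
    seen: Set[str] = set()
    for key in sorted(terms_by_quarter):
        available[key] = set(seen)      # snapshot before adding this quarter's terms
        seen |= terms_by_quarter[key]   # these become usable from the next quarter on
    return available
```

The first quarter maps to an empty set, matching the point that no new terms are included there. The same function works for monthly data if the keys are `(year, month)` tuples.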

Please directly email us if you have any questions. I do not frequently check the GitHub repo.

Best,
Keyan

@keyanUB keyanUB closed this as completed Mar 24, 2025