Commit cf2a204

Merge pull request #408 from eternal-dreamer/nlp
Text_preprocessing
2 parents a12a92c + 90eee41 commit cf2a204

2 files changed, with 436 additions and 0 deletions.

@@ -0,0 +1,33 @@
## Script to preprocess text using NLP
The bag-of-words technique is used.

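As a rough illustration of what a bag-of-words representation looks like, here is a minimal sketch in plain Python. The helper name `bag_of_words` and the two sample sentences are made up for this example and are not taken from the notebook.

```python
from collections import Counter


def bag_of_words(documents):
    """Return (vocabulary, count vectors) for a list of token lists."""
    # The vocabulary is every distinct token, sorted for a stable column order.
    vocabulary = sorted({token for doc in documents for token in doc})
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        # One count per vocabulary entry; tokens absent from the document count as 0.
        vectors.append([counts[token] for token in vocabulary])
    return vocabulary, vectors


# Hypothetical sample data, already split into tokens.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat".split(),
]
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)
print(vectors)
```

Each document becomes a vector of word counts over a shared vocabulary, so word order is discarded and only frequencies remain.
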
## Libraries imported
# nltk

# from nltk.corpus import brown
The Brown corpus is used.
The humor category of the Brown corpus is taken as sample data.

# contractions
Install contractions using the following command:

`pip install contractions`

This library is used for expanding contractions (for example, "can't" becomes "cannot").

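A short sketch of contraction expansion, assuming the library's `contractions.fix` function. The sample sentence is invented, and the `except` branch is only a tiny hand-written stand-in so the snippet still runs when the package is not installed. It is not how the library works internally.

```python
try:
    import contractions  # third-party: pip install contractions

    def expand(text):
        # contractions.fix expands the contractions it recognises in a string.
        return contractions.fix(text)

except ImportError:
    # Illustration-only fallback: a hypothetical three-entry mapping.
    FALLBACK = {"can't": "cannot", "it's": "it is", "don't": "do not"}

    def expand(text):
        return " ".join(FALLBACK.get(word.lower(), word) for word in text.split())


sample = "I can't believe it's raining"
expanded = expand(sample)
print(expanded)
```

Expanding contractions before tokenizing keeps forms such as "can't" from being split into fragments that later steps would treat as separate words.
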
# nltk.download("punkt")
Downloads the Punkt sentence tokenizer. This tokenizer divides a text into a list of sentences, using an unsupervised algorithm to build a model of abbreviations, collocations, and words that start sentences.

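To give a feel for sentence splitting, the sketch below instantiates `PunktSentenceTokenizer` directly with its default, untrained parameters, so it needs no download. This is an assumption made to keep the example self-contained. The notebook presumably relies on the pretrained model fetched by `nltk.download("punkt")`, which also knows common abbreviations.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

# Untrained tokenizer: no abbreviation or collocation statistics have been learned,
# so it falls back on punctuation and capitalisation cues alone.
tokenizer = PunktSentenceTokenizer()

# Hypothetical sample text.
text = "The movie was great. I would watch it again."
sentences = tokenizer.tokenize(text)
print(sentences)
```

With real-world text containing abbreviations such as "Dr." or "e.g.", the pretrained model is the better choice, since an untrained tokenizer may split after them.
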
# from nltk.corpus import stopwords
The stopwords corpus provides stop-word lists for several languages. English stop words are removed from the text.

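Stop-word removal can be sketched as a simple filter. To stay runnable without downloading any corpus, the example uses a small hand-picked set as a stand-in. With NLTK, the set would instead come from `stopwords.words("english")` after `nltk.download("stopwords")`. The names `STOP_WORDS` and `remove_stop_words` are hypothetical.

```python
# Hypothetical, hand-picked subset standing in for NLTK's English stop-word list.
STOP_WORDS = {"the", "is", "a", "an", "of", "and", "to", "in", "on"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only tokens whose lowercase form is not a stop word."""
    return [token for token in tokens if token.lower() not in stop_words]


tokens = "The cat is on the mat".split()
filtered = remove_stop_words(tokens)
print(filtered)
```

Lowercasing before the lookup matters because stop-word lists are normally stored in lowercase, while the input text may contain capitalised words.
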
# from nltk.stem import PorterStemmer
PorterStemmer is used for stemming, that is, reducing each word to its stem.

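A minimal sketch of stemming with `PorterStemmer`. The word list is made up for illustration. The stemmer itself needs no extra NLTK data download.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical sample words; each is reduced to its stem by suffix stripping.
words = ["running", "jumps", "cats"]
stems = [stemmer.stem(word) for word in words]
print(stems)
```

Stems are not always dictionary words, because the Porter algorithm strips suffixes by rule rather than looking words up.
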
# from sklearn.feature_extraction.text import CountVectorizer
CountVectorizer is used to build the bag-of-words matrix.

## Run script
text_preprocessing.ipynb is a Jupyter notebook.
You can run each cell by clicking Run,
or
you can open the notebook in Google Colab.
