sanscript-tech
diff --git a/Python/text_preprocessing_nlp/Readme.md b/Python/text_preprocessing_nlp/Readme.md
@@ -0,0 +1,33 @@
+## Script to preprocess text using NLP
+The script uses the bag-of-words technique.
+
+## Libraries imported
+### nltk
+
+### `from nltk.corpus import brown`
+The Brown corpus is used.
+The humor category of the Brown corpus is taken as sample data.
+
+### contractions
+Install contractions using the following command:
+`pip install contractions`
+This library expands contractions (for example, "don't" becomes "do not").
+
+### `nltk.download("punkt")`
+Downloads the Punkt sentence tokenizer. This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences.
+
+### `from nltk.corpus import stopwords`
+The stopwords corpus provides stopword lists for several languages. English stopwords are removed from the text.
+
+### `from nltk.stem import PorterStemmer`
+PorterStemmer is used for stemming, i.e. reducing each word to its stem (for example, "running" becomes "run").
+
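+### `from sklearn.feature_extraction.text import CountVectorizer`
+CountVectorizer converts text into a matrix of token counts, i.e. the bag-of-words representation.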
+## Run script
+`text_preprocessing.ipynb` is a Jupyter notebook.
+You can run each cell by clicking Run.
+Alternatively,
+you can open the notebook in Google Colab.
+
+
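
The bag-of-words idea described in the README can be illustrated with a minimal sketch. This is not the notebook's code: it uses only the Python standard library as a stand-in for NLTK and `CountVectorizer`, the `STOPWORDS` set and the sample sentences are hypothetical, and contraction expansion and stemming are left out to keep it short.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (assumption for this sketch);
# the notebook relies on NLTK's English stopword list instead.
STOPWORDS = {"the", "a", "an", "is", "on", "and", "of", "to"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(documents):
    """Return (vocabulary, rows); rows[i][j] counts vocabulary[j] in documents[i]."""
    # Tokenize each document and drop stopwords.
    token_lists = [
        [token for token in tokenize(doc) if token not in STOPWORDS]
        for doc in documents
    ]
    # The vocabulary is the sorted set of all remaining tokens.
    vocabulary = sorted({token for tokens in token_lists for token in tokens})
    rows = []
    for tokens in token_lists:
        counts = Counter(tokens)
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows


# Hypothetical sample data, not taken from the Brown corpus.
docs = ["The cat sat on the mat.", "The dog chased the cat."]
vocabulary, rows = bag_of_words(docs)
print(vocabulary)  # ['cat', 'chased', 'dog', 'mat', 'sat']
print(rows)        # [[1, 0, 0, 1, 1], [1, 1, 1, 0, 0]]
```

Each row is one document and each column is one vocabulary word, which is the same shape of output that a count-based vectorizer produces, here as plain lists rather than a sparse matrix.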