import streamlit as st
import pandas as pd
import seaborn as sns
# GLOBAL STUFF
pages = ["Home Page",
         "Machine Learning",
         "About Us"]
page = st.sidebar.selectbox("What do you wish to see?", pages)
# HOME PAGE
if page == pages[0]:
    st.title("ML On Click")
    st.header("Welcome to the **ML On Click** App!\n")
    st.subheader("\nEnjoy the beauty of Machine Learning algorithms, without having to write code!\n")
    st.write(
        "\n----\n",
        "\n#### Here for the first time?\n",
        "Fret not, this is the right place to get started with the app!\n",
        "\n#### Been here before?\n",
        "\nWe are extremely happy to see you again! Go ahead and do what you have to do!\n",
        "\n----\n",
    )
    st.subheader("About the App")
    with st.expander("Expand"):
        st.write(
            "\nBrilliant! Now that the fast and furious ones are busy with their work, let's have a look at how exactly you can use this app for ML!\n",
            "\n- Machine Learning, as computer scientists call it, is the process of *training* a computer to perform certain tasks based on statistical calculations and probabilities, producing the impression of *learning* as humans do.\n",
            "- There are algorithms specifically designed for each of these calculations. These are collected together in libraries like the ones mentioned in the **About Us** section.\n",
            "- Now, these libraries all come packed to be used with a programming language - like *Python*. One must write some code to get results for their ML problems.\n",
            "- We have taken care of that part for you. Just choose your data, select your parameters and the type of outcome you need, and we will try to produce the results you want to see in a few clicks!\n",
        )
    st.subheader("Guidelines")
    with st.expander("Expand"):
        st.write(
            "- All ML Models need data to work on. Choose from among our sets of data, or go ahead and upload your own data in the *Machine Learning* tab, via the sidebar menu.\n",
            "- Once the data is selected, we will show you an image of how the dataset appears. Take a good look at it, then go ahead and select the data that you feel will be most adequate for your analyses.\n",
            "- We will give you a set of options to choose from. At least 2 columns in the dataset must contain numerical data.\n",
            "- If, while we display your data, you feel a column has the wrong data type assigned to it, you can cast the column to the desired data type from our conversion interface.\n",
            "- You can now choose the type of Machine Learning algorithm you wish to apply to the dataset.\n",
            "- Once the data seems clean and you have selected the columns you want us to act upon, click the 'Learn' button to go ahead with the training process.\n",
            "\n##### That's it! Go ahead and download the output chart as a PNG image!\n"
        )
# END OF HOME PAGE
# MACHINE LEARNING
if page == pages[1]:
    st.title("ML On Click - Machine Learning", anchor="ML")
    # DATA SELECTION
    st.write(
        "\nUpload your data and download your desired results\n",
        "\n**OR**\n",
        "\nUse one of our sample datasets to explore the application and see all that we offer!\n",
        "\n----\n",
        "\nTo get started with using the app:\n",
        "\n##### Choose Data\n"
    )
    proceed_column_choice = False
    algos = ["Simple Linear Regression", "Logistic Regression", "Polynomial Regression"]
    ds_list = {"Heart Diseases": "Heart Predictions 2.csv", "Titanic Survival": "submission.csv"}
    ds = st.selectbox("From our datasets:", ds_list)
    upload = st.sidebar.file_uploader("Upload Here:", type=['csv', 'txt'], help="You may upload a 'Comma Separated Values' file (.csv) containing your data, or a 'Text' file (.txt) in which the data is comma-delimited")
    df = pd.DataFrame()
    try:
        st.write("\nUsing the data from '{}'!\n".format(upload.name))
        df = pd.read_csv(upload)
    except Exception:
        if upload is None:
            st.write("\nCurrently using our data\n")
            df = pd.read_csv(ds_list[ds])
    try:
        st.write(df.head(10))
    except Exception:
        st.write("\nWe are unable to display your file.\n")
    if not df.empty:
        cols = list(df.columns)
        not_rec_cols = []  # Columns which we recommend avoiding
        nice_cols = []
        for column in cols:
            if df[column].isnull().sum() / len(df[column]) > 0.3:
                not_rec_cols.append(column)
            else:
                nice_cols.append(column)
        if len(not_rec_cols) > 0:
            st.write("\nWe **recommend avoiding** the use of the following columns, as they contain too many impurities:\n")
            st.write(not_rec_cols)
        if len(nice_cols) > 1:
            st.write("\nWe do **recommend using** any of these columns:\n")
            st.write(nice_cols)
            proceed_column_choice = True
        else:
            st.write("\nWe need 1 dependent and at least 1 independent variable to perform regressions. The current count of 'ideal' columns does not meet this requirement; however, you may go ahead with the analysis if you feel there are other columns you wish to use.")
    else:
        st.write("\nIt seems your dataset is empty! Please check the uploaded file for errors.\n")
    st.write("\nYou may choose the columns you wish to analyse from the sidebar!\n")
    if proceed_column_choice:
        tempcols = cols
        og_chosen_target = st.sidebar.selectbox("Target Variable", tempcols)
        st.write("\nCurrent Target:", og_chosen_target)
        if og_chosen_target:
            othercols = [i for i in tempcols if i != og_chosen_target]
            og_chosen_cols = st.sidebar.multiselect("Independent Variable(s)", othercols, default=[i for i in nice_cols if i != og_chosen_target])
            show_pp = st.button("Brute Force Visualisation")
            if show_pp:
                fig = sns.pairplot(df[og_chosen_cols + [og_chosen_target]])
                st.pyplot(fig)
            ml_type = st.selectbox("Machine Learning Algorithm", algos, help="What kind of algorithm do you feel would best fit your analysis?")
            bf_elim = st.sidebar.checkbox("No Backward Feature Elimination")
            # END OF DATA SELECTION
            if ml_type == algos[0]:
                st.write(algos[0])
            elif ml_type == algos[1]:
                import numpy as np
                from sklearn.metrics import confusion_matrix
                from sklearn.model_selection import train_test_split
                import matplotlib.pyplot as plt
                import pickle
                # Cleaning data, automatically if possible
                # Obtaining null values
                nullinfo = dict(df.isnull().sum())
                nullkeys = list(nullinfo.keys())
                nullvals = list(nullinfo.values())
                nullcount = 0
                null_rows = df.isnull().sum(axis=1)
                max_null = nullkeys[nullvals.index(max(nullvals))]
                if max(nullvals) > 0:
                    # Count the rows that contain at least one null value
                    for i in null_rows:
                        if i > 0:
                            nullcount += 1
                    st.write("\nTotal Null Values: {}\n".format(nullcount))
                    nullpct = round((nullcount / len(df.index)) * 100)
                    st.write("\nThis accounts for {}% of the total data.\n".format(nullpct))
                    if nullpct < 20:
                        st.write("\nWe can do away with this data, since the volume of null values is manageable.\n")
                        df.dropna(axis=0, inplace=True)
                    else:
                        st.write("\nThe null values constitute a major portion of the data; consider choosing a different set of columns, or try cleaning the data locally.\n")
                from statsmodels.tools import add_constant
                df_constant = add_constant(df)
                chosen_cols = og_chosen_cols
                chosen_target = og_chosen_target
                import scipy.stats as sct
                # Compatibility shim: older statsmodels summaries expect scipy.stats.chisqprob
                sct.chisqprob = lambda chisq, df: sct.chi2.sf(chisq, df)
                try:
                    import statsmodels.api as sm
                    model = sm.Logit(df[chosen_target], df_constant[chosen_cols])
                    result = model.fit()
                    st.write(result.summary())
                    pvals = list(result.pvalues)
                    if not bf_elim:
                        if max(pvals) > 0.05:
                            st.write("\nSome p-values are above the optimum of 0.05. We will try Backward Feature Elimination.\n")
                            def back_feature_elem(dataframe, target, cols):
                                # Repeatedly drop the column with the largest p-value
                                # until every remaining p-value is below 0.05
                                while len(cols) > 0:
                                    model = sm.Logit(target, dataframe[cols])
                                    result = model.fit(disp=0)
                                    largest_p = round(result.pvalues, 3).nlargest(1)
                                    if largest_p.iloc[0] < 0.05:
                                        return result
                                    else:
                                        cols.remove(largest_p.index[0])
                            result = back_feature_elem(df_constant, df[chosen_target], chosen_cols)
                            st.write(result.summary())
                            st.write(
                                "\nAfter testing the impact on the results, we have narrowed down the list of columns to: {}\n".format(chosen_cols),
                                "\nIf you do not wish for this to happen to your data, please choose 'No Backward Feature Elimination' on the sidebar.\n")
                except Exception:
                    st.write("\nBackward Feature Elimination was not possible. Proceeding without it.\n")
                # MAKING ACTUAL MODEL NOW
                Y_data = df[chosen_target]
                X_data = df[chosen_cols]
                from sklearn.preprocessing import MinMaxScaler
                scaler = MinMaxScaler()
                X_data = scaler.fit_transform(X_data)
                x_train, x_test, y_train, y_test = train_test_split(X_data, Y_data, test_size=0.30, random_state=42)
                from sklearn.linear_model import LogisticRegression
                logreg = LogisticRegression()
                logreg.fit(x_train, y_train)
                y_preds = logreg.predict(x_test)
                fig, ax = plt.subplots()
                sns.heatmap(confusion_matrix(y_test, y_preds), annot=True, ax=ax)
                st.write(fig)
                st.write("\nAccuracy of Model thus far: {}%\n".format(round(logreg.score(x_test, y_test) * 100, 3)))
                # ADD AS FEATURE AT END OF MODEL TRAINING
                # test_data = pd.read_csv("filename")
                # test_features = test_data[chosen_cols]
                # ### test_target = test_data[chosen_target] ## PROBABLY WON'T EXIST IN DATASET
                # test_preds = logreg.predict(test_features)
                try:
                    if st.checkbox("Make file for download"):
                        pickle.dump(logreg, open('model.pkl', 'wb'))
                        with open('model.pkl', 'rb') as f:
                            st.download_button("Download LogReg Model", f, file_name="LogReg.pkl")
                except Exception:
                    st.write("\nCould not store the model in a file. ;(\n")
            elif ml_type == algos[2]:
                st.write(algos[2])
                import numpy as np
                import matplotlib.pyplot as plt
                from sklearn.linear_model import LinearRegression, Ridge, ElasticNet
                from sklearn.model_selection import train_test_split, GridSearchCV
                from sklearn.preprocessing import PolynomialFeatures
                from sklearn.metrics import r2_score, mean_squared_error
                nullinfo = dict(df.isnull().sum())
                nullkeys = list(nullinfo.keys())
                nullvals = list(nullinfo.values())
                nullcount = 0
                null_rows = df.isnull().sum(axis=1)
                max_null = nullkeys[nullvals.index(max(nullvals))]
                if max(nullvals) > 0:
                    # Count the rows that contain at least one null value
                    for i in null_rows:
                        if i > 0:
                            nullcount += 1
                    st.write("\nTotal Null Values: {}\n".format(nullcount))
                    nullpct = round((nullcount / len(df.index)) * 100)
                    st.write("\nThis accounts for {}% of the total data.\n".format(nullpct))
                    if nullpct < 20:
                        st.write("\nWe can do away with this data, since the volume of null values is manageable.\n")
                        df.dropna(axis=0, inplace=True)
                    else:
                        st.write("\nThe null values constitute a major portion of the data; consider choosing a different set of columns, or try cleaning the data locally.\n")
                if len(og_chosen_cols) > 1:
                    st.write("\nWe select the first column from this list as the independent variable. If you feel that `{}` is not the column you wish to use as the `Independent variable`, please choose the column you want from the dropdown in the sidebar.\n".format(og_chosen_cols[0]))
                chosen_cols = og_chosen_cols[0]
                chosen_target = og_chosen_target
                X_data = df[chosen_cols].values.reshape(-1, 1)
                Y_data = df[chosen_target].values.reshape(-1, 1)
                fig = sns.displot(Y_data, kind='kde')
                plt.xlabel(chosen_target)
                plt.title('Original Distribution')
                st.pyplot(fig)
                x_train, x_test, y_train, y_test = train_test_split(X_data, Y_data, test_size=0.3, random_state=42)
                linreg = LinearRegression()
                linreg.fit(x_train, y_train)
                st.write(
                    "\nFrom Linear Regression:\n",
                    "\n- Coefficient: {}".format(round(linreg.coef_[0][0], 3)),
                    "\n- Intercept: {}".format(round(linreg.intercept_[0], 3))
                )
                fig2, ax1 = plt.subplots()
                sns.regplot(x=X_data, y=Y_data, ax=ax1)
                st.write(fig2)
                y_hat = linreg.predict(x_test)
                st.write(
                    "\nAccuracy of Linear Model: {}%\n".format(round(linreg.score(x_test, y_test) * 100, 3)),
                    "\nMean Square Error: {}\n".format(round(mean_squared_error(y_test, y_hat), 2))
                )
                cv_params = [{'alpha': [0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 6.4]}]
                cv_best_est, cv_best_r2, cv_best_mse, cv_best_param = [], [], [], []
                degrees = st.sidebar.slider("Max Polynomial Degree", 2, 10)
                for d in range(degrees):
                    pf = PolynomialFeatures(degree=d + 1)
                    Xd = pf.fit_transform(x_train)
                    Grid = GridSearchCV(estimator=Ridge(random_state=42), param_grid=cv_params, scoring=['r2', 'neg_mean_squared_error'], refit='r2')
                    Grid.fit(Xd, y_train)
                    cv_best_est.append(Grid.best_estimator_)
                    cv_best_r2.append(Grid.best_score_)
                    cv_best_param.append(Grid.best_params_)
                    cv_best_mse.append((-1) * Grid.cv_results_['mean_test_neg_mean_squared_error'][Grid.best_index_])
                best_d_r2 = cv_best_r2.index(max(cv_best_r2)) + 1
                best_d_mse = cv_best_mse.index(min(cv_best_mse)) + 1  # lowest MSE is best
                st.subheader("Polynomial Regression Results")
                st.write(
                    "\nDegree with best R2: {}\n".format(best_d_r2),
                    "\nDegree with best MSE: {}\n".format(best_d_mse)
                )
                if best_d_mse != best_d_r2:
                    st.write("\nIt appears this set of columns is not suitable for prediction by Polynomial Regression.\n")
                else:
                    st.write(
                        "\nBest hyperparameters:\n- Degree: {}\n- Alpha: {}\n".format(best_d_r2, round(cv_best_param[best_d_r2 - 1]['alpha'], 3))
                    )
                    BestModel = cv_best_est[best_d_r2 - 1]
                    pf = PolynomialFeatures(degree=best_d_r2)
                    X_dtest = pf.fit_transform(x_test)
                    y_dhat = BestModel.predict(X_dtest)
                    st.write(
                        "\nAccuracy: {}%\n".format(round(BestModel.score(X_dtest, y_test) * 100, 3)),
                        "\nMean Squared Error: {}\n".format(round(mean_squared_error(y_test, y_dhat), 3))
                    )
                    try:
                        import pickle
                        if st.checkbox("Make file for download"):
                            pickle.dump(BestModel, open('model.pkl', 'wb'))
                            with open('model.pkl', 'rb') as f:
                                st.download_button("Download PolyReg Model", f, file_name="PolyReg.pkl")
                    except Exception:
                        st.write("\nCould not store the model in a file. ;(\n")
# END OF MACHINE LEARNING
# ABOUT US
if page == pages[-1]:
    st.title("ML On Click - About Us", anchor="AboutUs")
    st.write("\n\n#### Meet the App-Makers!\n")
    st.markdown("\n|Name|GitHub Profile|LinkedIn|\n|----|----|----|\n|Aishwarya Funaguskar|[Aish1214](https://github.com/Aish1214)|[Aishwarya Funaguskar](https://www.linkedin.com/in/aishwarya-funaguskar-b05812213/)|\n|Ishaan Sunita Pandita|[EmperorArthurIX](https://github.com/EmperorArthurIX)|[Ishaan Sunita Pandita](https://linkedin.com/in/ishaan-sunita-pandita)|\n|Rahul Pathak|[2911rahulpathak](https://github.com/2911rahulpathak)|[Rahul Pathak](https://www.linkedin.com/in/rahulgovindpathak/)|\n|Yash Shinde|[yashshinde03](https://github.com/yashshinde03)|[Yash Shinde](https://www.linkedin.com/in/yash-shinde-134560202/)|\n")
    st.write(
        "\n#### Here is the Tech Stack we used:\n",
    )
    st.markdown(
        """
        <img src="https://upload.wikimedia.org/wikipedia/commons/c/c3/Python-logo-notext.svg" height="12.5%" width="12.5%">
        <img src="https://upload.wikimedia.org/wikipedia/commons/4/48/Markdown-mark.svg" height="12.5%" width="20%" style="margin: 0 5%;">
        <img src="https://streamlit.io/images/brand/streamlit-logo-secondary-colormark-darktext.svg" height="15%" width="50%">
        <br><br>
        """,
        unsafe_allow_html=True
    )
    st.markdown(
        "\n##### Python Libraries:\n\n|Library|Usage|\n|---|---|\n|[Streamlit](https://streamlit.io/)|It would not be an exaggeration to say that none of this would exist without Streamlit. The entire structure of the app can be attributed to Streamlit.|\n|[NumPy](https://numpy.org/)|We have many calculations in data analysis, and NumPy is the library that helps us do them quickly!|\n|[Pandas](https://pandas.pydata.org/)|This is the big brother of NumPy. We use Pandas to organise data into beautiful tables, and to read and write these tables from and to files!|\n|[Matplotlib](https://matplotlib.org/)|This is the library that gives you quick and simple graphs!|\n|[Seaborn](https://seaborn.pydata.org/)|Big brother of Matplotlib! Gives you graphs with advanced features such as regression lines and comparison graphs!|\n|[Scikit Learn](https://scikit-learn.org/)|The brains of this app - it does a lot of the Machine Learning hard work!|\n"
    )
# END OF ABOUT US