Merge branch 'master' into develop
# Conflicts:
#	docker/Dockerfile.fe
#	src/main/java/usi/si/seart/controller/RootController.java
dabico committed Jun 16, 2023
2 parents 9cfcce5 + 64ce546 commit 9543fe5
Showing 9 changed files with 186 additions and 290 deletions.
2 changes: 1 addition & 1 deletion .github/dependabot.yml
@@ -10,7 +10,7 @@ updates:
labels:
- dependencies
- package-ecosystem: "docker"
directory: "/"
directory: "/docker"
schedule:
interval: "weekly"
target-branch: "develop"
183 changes: 160 additions & 23 deletions README.md
@@ -1,39 +1,176 @@

# GHSearch Platform
# GitHub Search · [![Status](https://badgen.net/https/dabico.npkn.net/ghs-status)](http://seart-ghs.si.usi.ch) [![MIT license](https://img.shields.io/badge/License-MIT-blue.svg)](https://github.com/seart-group/ghs/blob/master/LICENSE) [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.4588464.svg)](https://doi.org/10.5281/zenodo.4588464)

This project is made of two subprojects:
1. `application`: The main application has two main responsibilities:
1. Crawling GitHub and retrieving repository information. This can be disabled with `app.crawl.enabled` argument.
2. Serving as the backend server for website/frontend
2. `front-end`: A frontend for searching the database, which is available at http://seart-ghs.si.usi.ch
This project is made of two components:
1. A Spring Boot powered back-end, responsible for:
1. Continuously crawling GitHub API endpoints for repository information, and storing it in a central database;
2. Acting as an API for providing access to the stored data.
2. A Bootstrap-styled and jQuery powered web user interface, serving an accessible front for the API, available at http://seart-ghs.si.usi.ch

## Setup & Run Project Locally (for development)
## Running Locally

The detailed instructions can be found [here](./README_SETUP.md).
### Prerequisites

- Java 11
- Maven (3.8+)
- MySQL (8.0.32+)
- Git

### Database

Before choosing whether to start with a clean slate or pre-populated database, make sure the following requirements are met:

1. The database timezone is set to UTC (+00:00). You can verify this via:

```sql
SELECT @@global.time_zone, @@session.time_zone;
```

2. The `gse` database exists. To create it:

```sql
CREATE DATABASE gse CHARACTER SET utf8 COLLATE utf8_bin;
```

3. The `gseadmin` user exists. To create one, run:

```sql
CREATE USER IF NOT EXISTS 'gseadmin'@'%' IDENTIFIED BY 'Lugano2020';
GRANT ALL ON gse.* TO 'gseadmin'@'%';
```

If you want to start with a completely blank database, then no further action is required.
The necessary tables will be created by virtue of Flyway migrations, which will run on initial server startup.
However, if you want your local database to be pre-initialized with the data we have mined, then you can use the compressed SQL dump we provide.
Said dump can be found in [docker-compose/initdb](docker-compose/initdb), and to import it you would run:

```shell
gzcat < docker-compose/initdb/gse.sql.gz | mysql -u gseadmin -pLugano2020 gse
```

### Server

Before attempting to run the server, we advise you to generate your own GitHub personal access token (PAT).
The token should include the `repo` scope so that the crawler can use the GitHub API effectively.
While the token is not mandatory, its impact on mining speed cannot be overstated.

Once that is done, you can run the server locally using Maven:

```shell
mvn spring-boot:run
```

If you want to make use of the token when crawling, specify it in the run arguments:

```shell
mvn spring-boot:run -Dspring-boot.run.arguments=--app.crawl.tokens=<your_access_token>
```

Alternatively, you can compile and run the JAR directly:

```shell
mvn clean package
ln target/ghs-application-*.jar target/ghs-application.jar
java -Dapp.crawl.tokens=<your_access_token> -jar target/ghs-application.jar
```

Here's a list of project-specific arguments supported by the application that you can find in the `application.properties`:
| variable name | type | default value | description |
|------------------------------|--------------------|-------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `app.crawl.enabled` | boolean | true | Specifies if the crawling jobs are enabled on startup |
| `app.crawl.languages` | List&lt;String&gt; | See [application.properties](src/main/resources/application.properties) | A comma-separated list of language names that will be targeted during crawling |
| `app.crawl.tokens` | List&lt;String&gt; | | A comma-separated list of GitHub personal access tokens (PATs) that will be used for mining the GitHub API |
| `app.crawl.scheduling` | String | 21600000 (6h, in ms) | Crawler scheduling rate, expressed as a numeric string |
| `app.crawl.startdate` | String | 2008-01-01T00:00:00 | "Beginning of time". Basically the earliest supported date for crawling repos, if no crawl jobs were previously performed. Formatted as a yyyy-MM-ddTHH:mm:ss string. |
| `app.cleanup.enabled` | boolean | true | Specifies if the job responsible for removing unavailable repositories is enabled on startup |
| `app.cleanup.scheduling` | String | 21600000 (6h, in ms) | Cleanup scheduling rate, expressed as a numeric string |
| `app.cache-evict.scheduling` | String | 21600000 (6h, in ms) | Query cache eviction scheduling rate, expressed as a numeric string |
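
The scheduling properties in the table above take a rate in milliseconds, which is easy to get wrong by a factor of a thousand. As a minimal sketch, a small helper can do the conversion from hours. The function name `hours_to_ms` is hypothetical and not part of the project. Its output could then be passed along, for example as `--app.crawl.scheduling=$(hours_to_ms 12)` in the run arguments.

```shell
# Hypothetical helper (not part of the project): convert whole hours
# into the millisecond value expected by the *.scheduling properties.
hours_to_ms() {
  echo $(( $1 * 60 * 60 * 1000 ))
}

# The documented default of 6 hours
hours_to_ms 6   # prints 21600000
```
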
### Web UI
The easiest way to start the front-end is through IntelliJ's built-in web server.
After starting the application back-end, right-click on `index.html` in the [html](html) directory,
and select one of the provided launch options from `Open In > Browser`.
Alternatively, you can self-host the web UI by virtue of tools such as `http-server`:

```shell
# install by running: npm install -g http-server
http-server html -p 3030
```

Regardless of which method you choose for hosting, the back-end CORS restricts you to using either port `3030` or `7030`.

## Dockerisation :whale:
The instruction to deploy the project via Docker is available [here](./README_DEPLOY.md).

The deployment stack consists of the following containers:

| Service/Container name | Image | Purpose | Enabled |
|------------------------|:------------------------------------------------:|-----------------------------------|:-----------------------------:|
| `gse-app` | [gse/backend](docker/Dockerfile.be) | for the spring application itself | :white_check_mark: |
| `gse-fe` | [gse/frontend](docker/Dockerfile.fe) | for supplying the front end files | :white_check_mark: |
| `gse-db` | [mysql](https://registry.hub.docker.com/_/mysql) | for the database | :white_check_mark: |
| `gse-bkp` | [gse/backup](docker/Dockerfile.bkp) | for the automatic backups | :negative_squared_cross_mark: |

To deploy, run the following in the [docker-compose](docker-compose) directory:

```shell
docker-compose -f docker-compose.yml up -d
```

It's worth mentioning that the database setup steps outlined in the previous section are not needed when running with Docker,
as the environment properties passed to the service will create the user and pre-populate the DB on the very first startup.
The database data itself is kept in the `gse-data` volume,
while detailed back-end logs are kept in a local mount called "logs" in [docker-compose](docker-compose).
The database backup service is disabled by default, as we use it primarily in production.
Should you choose to enable it, you would have to define your own personal override file.
Here's an example of a `docker-compose.override.yml` that re-enables backups:

## More Info on Flyway and Database Migration
To learn more about Flyway you can read on [here](./README_flyway.md).
```yaml
version: '3.9'
name: 'gse'
services:
  gse-bkp:
    restart: always
    entrypoint: "/init"
```

You can also use this override file to change the service configurations of other services,
for instance specifying your own PAT for the crawler:

```yaml
version: '3.9'
name: 'gse'
services:
  gse-app:
    environment:
      APP_CRAWL_ENABLED: 'true'
      APP_CRAWL_TOKENS: '<your_access_token>'
```

Any of the Spring Boot properties or aforementioned application-specific properties can be overridden.
Just keep in mind that `app.x.y` corresponds to the `APP_X_Y` service environment setting.
Don't forget to specify the override file when running the command:
```shell
docker-compose -f docker-compose.yml -f docker-compose.override.yml up -d
```
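
The mapping from `app.x.y` to `APP_X_Y` can be sketched as a small shell function. The function name `to_env_name` is hypothetical and not part of the project. The dash handling is an assumption based on Spring Boot's relaxed binding rules (periods become underscores, dashes are dropped, everything is uppercased), so a property such as `app.cache-evict.scheduling` would map to `APP_CACHEEVICT_SCHEDULING`.

```shell
# Hypothetical helper (not part of the project): derive the environment
# variable name for a Spring Boot property, assuming relaxed binding:
# drop dashes, turn periods into underscores, then uppercase.
to_env_name() {
  printf '%s\n' "$1" | sed -e 's/-//g' -e 's/\./_/g' | tr '[:lower:]' '[:upper:]'
}

to_env_name app.crawl.tokens   # prints APP_CRAWL_TOKENS
```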
---
## FAQ
### How can I report a bug or request a feature or ask a question?**
Please add a [new issue](https://github.com/seart-group/ghs/issues/) and we will get back to you very soon.
### How can I report a bug or request a feature or ask a question?
### How do I add a new programming language to the platform?
Add the new **language name** to `supported_languages` table via:
Please add a [new issue](https://github.com/seart-group/ghs/issues/), and we will get back to you very soon.
1. Flyway migration file (recommended): Create a new file `src/main/resources/db/migration/Vx__NewLangs.sql` containing:
`INSERT INTO supported_language (name,added) VALUES ('C++',current_timestamp);`
2. Or, manually editing the table.
### How do I extend/modify the existing database schema?
- **Note**:
- A comprehensive list of valid languages (and their aliases) is available [here](https://github.com/github/linguist/blob/master/lib/linguist/languages.yml).
- You can also see a similar list on the [GitHub Advanced Search Page](https://github.com/search/advanced).
- You can use the following link to verify that a language is valid; it also gives an upper bound on the number of new repositories to be mined:
- Example: `C++` -> `C%2B%2B`:`https://api.github.com/search/repositories?q=is:public+stars:%3E10+language:C%2B%2B`.
In order to do that, you should be familiar with database migration tools and practices.
This project in particular uses [Flyway](https://flywaydb.org/) by Redgate.
However, the general rule for schema manipulation is: create new migrations, and _do not_ edit existing ones.
45 changes: 0 additions & 45 deletions README_DEPLOY.md

This file was deleted.
