Explore the Stack Overflow data set with the Elastic Stack using this gentle introduction. The data is indexed by a .NET Core application that uses NEST, the official Elasticsearch client for .NET. .NET Core is a cross-platform, open source platform for building applications.
- Download Elasticsearch 7.4.2 or later
- Download Kibana 7.4.2 or later (the version must match the Elasticsearch version)
- Install .NET Core 3.0
- Download the latest Stack Overflow data set
  - Under 7Z files, choose stackoverflow.com-Posts.7z, stackoverflow.com-Users.7z and stackoverflow.com-Badges.7z
- Unzip Stack Overflow data set to a directory. You'll need around 90GB of available space!
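Because the extracted files need around 90GB, it can be worth checking free space before unzipping. The following is a minimal Python sketch, not part of this repository; the function name `has_enough_space` and its default threshold are assumptions chosen for illustration.

```python
import shutil


def has_enough_space(directory, required_gb=90):
    """Return True if the filesystem holding `directory` has at least
    `required_gb` gigabytes free.

    The 90GB default mirrors the rough estimate given above; adjust it
    for the data dump you actually download.
    """
    free_bytes = shutil.disk_usage(directory).free
    return free_bytes >= required_gb * 1024 ** 3
```

Calling `has_enough_space("/path/to/extract/dir")` before extraction gives a quick yes/no answer for the target drive.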
- Restore the project's NuGet package dependencies. In the solution root directory

      dotnet restore

- Build the solution in Release configuration. In the solution root directory

      dotnet build -c Release

- Set the JVM heap size to at least 8GB by adding the following to the jvm.options file in the config directory within the Elasticsearch home directory, and saving the file

      -Xms8g
      -Xmx8g

- Start Elasticsearch using the elasticsearch script (Linux and macOS) or the elasticsearch.bat file (Windows) in the bin directory within the Elasticsearch home directory

      ./elasticsearch.bat

- Navigate to the StackOverflow.Indexer/bin/Release/netcoreapp3.0 directory from the root of the solution. There should be a compiled StackOverflow.Indexer.dll file in the directory from building the solution in the previous step.

- Check the available options for each indexing command using the --help argument

      dotnet .\StackOverflow.Indexer.dll --help
      dotnet .\StackOverflow.Indexer.dll posts --help
      dotnet .\StackOverflow.Indexer.dll users --help
      dotnet .\StackOverflow.Indexer.dll tags --help

- Index posts data

      dotnet .\StackOverflow.Indexer.dll posts -e "http://localhost:9200" -f "/path/to/Posts.xml"

  Indexing all questions and answers takes around 90 minutes on a local single node Elasticsearch cluster.

- Index users data

      dotnet .\StackOverflow.Indexer.dll users -e "http://localhost:9200" -f "/path/to/Users.xml" -b "/path/to/Badges.xml"

  Indexing all users and their badges takes around 15 minutes on a local single node Elasticsearch cluster.

- (Optional) Update answers with tags

  If you'd like to filter both questions and answers by tag, it can be useful to denormalize question tags onto answers. This can be done by transforming the source data before ingesting it, but it can also be achieved with the update by query API, which is what this command does.

      dotnet .\StackOverflow.Indexer.dll tags -e "http://localhost:9200" -f "/path/to/Posts.xml"

  This can take a few hours. The -s argument changes the number of concurrent updates, so depending on the performance of the cluster you're indexing into, you may be able to increase it to speed up the process.

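To illustrate the optional tag update step, here is a minimal Python sketch of how an update by query request could copy a question's tags onto its answers. It is not the indexer's actual implementation: the field names `parentId` and `tags`, and the helper names `build_tag_update` and `update_answers`, are assumptions made for illustration, so check the index mapping for the real field names.

```python
import json
import urllib.request


def build_tag_update(question_id, tags):
    """Build an update by query body that sets `tags` on every answer
    belonging to the given question."""
    return {
        # Match answers whose parent is this question.
        # "parentId" is an assumed field name.
        "query": {"term": {"parentId": question_id}},
        # Painless script that overwrites the answer's tags
        # ("tags" is also an assumed field name).
        "script": {
            "lang": "painless",
            "source": "ctx._source.tags = params.tags",
            "params": {"tags": list(tags)},
        },
    }


def update_answers(base_url, index, question_id, tags):
    """POST the body to the index's _update_by_query endpoint and
    return the parsed JSON response."""
    body = json.dumps(build_tag_update(question_id, tags)).encode("utf-8")
    request = urllib.request.Request(
        # conflicts=proceed keeps going if a document changed mid-update.
        f"{base_url}/{index}/_update_by_query?conflicts=proceed",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Against a running cluster, a call such as `update_answers("http://localhost:9200", "posts", 4, ["c#"])` would issue one such request, where the index name `posts` is again hypothetical. Issuing one request per question is why the process takes a while and why running several updates concurrently can speed it up.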
The kibana_saved_objects_742.ndjson file can be imported into Kibana to apply some preconfigured saved queries, visualizations and a dashboard:
- Navigate to the Management menu item within Kibana
- Under Kibana, select Saved Objects
- Select Import and choose the kibana_saved_objects_742.ndjson file.
There should now be
- a Dashboard under the Dashboard menu item
- a collection of Visualizations under the Visualize menu item
- a collection of Saved Queries under the Discover menu item
- Content of this repository is made available under the Apache 2.0 license.
- Stack Overflow data is made available under the Creative Commons Attribution-ShareAlike 4.0 International license.