The matrix-datasets module provides a collection of common, public domain, and open-source datasets in Matrix format for Groovy applications. These datasets are similar to those commonly used in R and Python for data analysis and machine learning tasks.
To use the matrix-datasets module, add it as a dependency to your project. For Gradle:
implementation platform('se.alipsa.matrix:matrix-bom:2.2.0')
implementation 'se.alipsa.matrix:matrix-datasets'
For Maven:
<project>
<!-- Other project configurations -->
<dependencyManagement>
<dependencies>
<dependency>
<groupId>se.alipsa.matrix</groupId>
<artifactId>matrix-bom</artifactId>
<version>2.2.0</version>
<type>pom</type>
<scope>import</scope>
</dependency>
</dependencies>
</dependencyManagement>
<dependencies>
<dependency>
<groupId>se.alipsa.matrix</groupId>
<artifactId>matrix-core</artifactId>
</dependency>
<dependency>
<groupId>se.alipsa.matrix</groupId>
<artifactId>matrix-datasets</artifactId>
</dependency>
</dependencies>
</project>
The matrix-datasets module includes several popular datasets that are commonly used in data science and statistics:
- iris: A famous dataset containing measurements of iris flowers
- mtcars: Motor Trend Car Road Tests
- PlantGrowth: Results from an experiment on plant growth
- ToothGrowth: The effect of vitamin C on tooth growth in guinea pigs
- USArrests: Violent crime rates by US state
- diamonds: A dataset containing prices and attributes of diamonds
- mpg: Fuel economy data from the EPA
- Map data: Various geographical data
To use a dataset, import the necessary classes and call the corresponding method on the Dataset class:
import se.alipsa.matrix.datasets.*
import se.alipsa.matrix.core.*
// Load the iris dataset
Matrix iris = Dataset.iris()
// Print the first few rows
println(iris.head(5))
// Get basic information about the dataset
println("Dimensions: ${iris.rowCount()} rows x ${iris.columnCount()} columns")
println("Column names: ${iris.columnNames()}")
Output:
Sepal Length  Sepal Width  Petal Length  Petal Width  Species
         5.1          3.5           1.4          0.2  setosa
         4.9            3           1.4          0.2  setosa
         4.7          3.2           1.3          0.2  setosa
         4.6          3.1           1.5          0.2  setosa
           5          3.6           1.4          0.2  setosa
Dimensions: 150 rows x 5 columns
Column names: [Sepal Length, Sepal Width, Petal Length, Petal Width, Species]
Let's explore the iris dataset in more detail:
import se.alipsa.matrix.datasets.*
import se.alipsa.matrix.core.*
// Load the iris dataset
Matrix iris = Dataset.iris()
// Calculate mean sepal length by species
Matrix speciesMeans = Stat.meanBy(iris, 'Sepal Length', 'Species')
println(speciesMeans.content())
Output:
Iris-means by Species: 3 obs * 2 variables
Species     Sepal Length
virginica    6.588000000
setosa       5.006000000
versicolor   5.936000000
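As a rough illustration of what a grouped mean like the one above computes, the following plain-Groovy sketch groups rows by a key and averages one value within each group. It does not use the Matrix API: the `rows` data and the `meanBy` closure are hypothetical names made up for this example, and the numbers are toy values rather than iris measurements.

```groovy
// Hypothetical toy rows (not taken from the iris dataset)
def rows = [
    [species: 'a', length: 1.0],
    [species: 'a', length: 3.0],
    [species: 'b', length: 10.0],
    [species: 'b', length: 20.0],
    [species: 'b', length: 30.0]
]

// Hypothetical helper: group rows by groupKey, then average valueKey within each group
def meanBy = { List<Map> data, String valueKey, String groupKey ->
    data.groupBy { it[groupKey] }.collectEntries { group, members ->
        [(group): members.sum { it[valueKey] } / members.size()]
    }
}

def means = meanBy(rows, 'length', 'species')
means.each { group, mean -> println "${group}: ${mean}" }
```

The result is a map from group label to mean, which corresponds to the two-column shape (group, mean) of the output shown above.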
// Calculate summary statistics for each numeric column
def summary = Stat.summary(iris)
println(summary)
Output:
Sepal Length
------------
Type: BigDecimal
Min: 4.3
1st Q: 5.1
Median: 5.8
Mean: 5.843333333
3rd Q: 6.4
Max: 7.9

Sepal Width
-----------
Type: BigDecimal
Min: 2
1st Q: 2.8
Median: 3
Mean: 3.057333333
3rd Q: 3.3
Max: 4.4

Petal Length
------------
Type: BigDecimal
Min: 1
1st Q: 1.6
Median: 4.35
Mean: 3.758000000
3rd Q: 5.1
Max: 6.9

Petal Width
-----------
Type: BigDecimal
Min: 0.1
1st Q: 0.3
Median: 1.3
Mean: 1.199333333
3rd Q: 1.8
Max: 2.5

Species
-------
Type: String
Number of unique values: 3
Most frequent: setosa occurs 50 times (33.33%)
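To make the numeric part of such a summary more concrete, here is a minimal plain-Groovy sketch that computes the minimum, median, mean, and maximum of a list of numbers. It is not how the library implements its summary; the `summarize` closure is a hypothetical helper, it assumes a non-empty list, and it leaves out the quartiles because their values depend on the interpolation method chosen. The input is the five sepal lengths from the `head(5)` output earlier in this section.

```groovy
// Hypothetical helper: basic summary statistics for a non-empty list of numbers
def summarize = { List<BigDecimal> values ->
    def sorted = values.sort(false)   // sorted copy; the input list is not modified
    int n = sorted.size()
    int mid = n.intdiv(2)
    // Median: middle element for odd n, average of the two middle elements for even n
    def median = (n % 2 == 1) ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2
    return [min: sorted.first(), median: median, mean: values.sum() / n, max: sorted.last()]
}

// The five sepal lengths from the first rows of the iris dataset shown earlier
def stats = summarize([5.1, 4.9, 4.7, 4.6, 5.0])
println stats
```

Because Groovy decimal literals are `BigDecimal`, the mean is computed without floating-point rounding error, which matches the `BigDecimal` type reported in the summary output.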
// Filter the dataset to get only one species
def setosa = iris.subset {
  it["Species"] == 'setosa'
}
// Calculate the mean of each measurement for setosa
println("Setosa means:")
println("Sepal Length: ${setosa['Sepal Length'].mean()}")
println("Sepal Width: ${setosa['Sepal Width'].mean()}")
println("Petal Length: ${setosa['Petal Length'].mean()}")
println("Petal Width: ${setosa['Petal Width'].mean()}")
Output:
Setosa means:
Sepal Length: 5.006000000
Sepal Width: 3.428000000
Petal Length: 1.462000000
Petal Width: 0.246000000
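The filter-then-average pattern used for setosa can also be sketched in plain Groovy, which may help when reasoning about what the subset and the per-column means do. This sketch does not use the Matrix API; the `rows` data and the column names `kind`, `width`, and `height` are hypothetical toy values chosen for illustration.

```groovy
// Hypothetical toy rows (made up for this sketch, not iris measurements)
def rows = [
    [kind: 'x', width: 2.0, height: 4.0],
    [kind: 'y', width: 9.0, height: 9.0],
    [kind: 'x', width: 4.0, height: 8.0]
]

// Keep only the rows whose 'kind' matches, analogous to filtering on Species
def onlyX = rows.findAll { it.kind == 'x' }

// Average every measurement column of the filtered rows
def columnMeans = ['width', 'height'].collectEntries { col ->
    [(col): onlyX.collect { it[col] }.sum() / onlyX.size()]
}
println columnMeans
```

Looping over the column names avoids repeating one statement per measurement, at the cost of returning a map instead of printing each value with its own label.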
The mtcars dataset contains information about various car models:
import se.alipsa.matrix.datasets.*
import se.alipsa.matrix.core.*
import se.alipsa.matrix.stats.*
// Load the mtcars dataset
Matrix mtcars = Dataset.mtcars()
// Print the first few rows
println(mtcars.head(3))
// Calculate the average mpg (miles per gallon) by number of cylinders
Matrix mpgByCyl = Stat.meanBy(mtcars, 'mpg', 'cyl')
println(mpgByCyl.content())
// Find cars with high horsepower (> 200)
def highPowerCars = mtcars.subset('hp', { it > 200 })
println("Cars with high horsepower:")
println(highPowerCars.content())
// Calculate correlation between mpg and weight
def correlation = Correlation.cor(mtcars['mpg'], mtcars['wt'])
println("Correlation between mpg and weight: ${correlation}")
The PlantGrowth dataset contains results from an experiment on plant growth:
import se.alipsa.matrix.datasets.*
import se.alipsa.matrix.core.*
import se.alipsa.matrix.stats.Student
// Load the PlantGrowth dataset
Matrix plantGrowth = Dataset.plantGrowth()
// Print the dataset structure
println(plantGrowth.content())
// Calculate mean weight by group
Matrix weightByGroup = Stat.meanBy(plantGrowth, 'weight', 'group')
println(weightByGroup.content())
// Extract control and treatment groups
def ctrl = plantGrowth.subset('group', { it == 'ctrl' })
def trt1 = plantGrowth.subset('group', { it == 'trt1' })
// Perform t-test to compare control vs treatment 1
def tTestResult = Student.tTest(ctrl['weight'], trt1['weight'], false)
println("T-test result (ctrl vs trt1):")
println(tTestResult)
The diamonds dataset contains information about diamond prices and attributes:
import se.alipsa.matrix.datasets.*
import se.alipsa.matrix.core.*
// Load the diamonds dataset
Matrix diamonds = Dataset.diamonds()
// Print the first few rows
println(diamonds.head())
// Calculate average price by diamond cut
Matrix priceByQuality = Stat.meanBy(diamonds, 'price', 'cut')
println("Average price by cut:")
println(priceByQuality.content())
// Calculate average price by diamond color
Matrix priceByColor = Stat.meanBy(diamonds, 'price', 'color')
println("Average price by color:")
println(priceByColor.content())
// Find the most expensive diamonds (top 5)
def sortedByPrice = diamonds.orderBy('price', false)
println("Top 5 most expensive diamonds:")
println(sortedByPrice.subset(0..4).content())
While the matrix-datasets module provides several common datasets, you might want to create your own datasets for specific use cases. You can do this by creating a Matrix object from your data:
import se.alipsa.matrix.core.*
import se.alipsa.matrix.csv.*
// Create a custom dataset
def customData = Matrix.builder().data(
name: ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
age: [25, 30, 35, 40, 45],
score: [85, 90, 78, 92, 88]
).build()
// Save the dataset to a CSV file for future use
CsvExporter.exportToCsv(customData, new File('/path/to/custom_dataset.csv'))
// Later, you can load it back
def loadedData = Matrix.builder().data(new File('/path/to/custom_dataset.csv')).build()
The matrix-datasets module provides a convenient way to access common datasets for data analysis and machine learning tasks in Groovy. These datasets can be used for learning, testing algorithms, or as reference data for your applications.
In the next section, we'll explore the matrix-spreadsheet module, which provides functionality for importing and exporting data between Matrix objects and Excel or OpenOffice Calc spreadsheets.