Skip to content

Latest commit

 

History

History
575 lines (425 loc) · 15.7 KB

File metadata and controls

575 lines (425 loc) · 15.7 KB

Matrix Core Module

The matrix-core module is the foundation of the Alipsa Matrix library. It provides the essential classes and functionality for working with tabular data. In this section, we'll explore the key components of the matrix-core module and how to use them effectively.

Understanding the Matrix Class

What is a Matrix?

A Matrix in the Alipsa library is a grid with a header and defined column types. You can think of it as an in-memory ResultSet or a data frame similar to those in R or pandas in Python. It provides a structured way to work with tabular data, making operations like filtering, transformation, and analysis more intuitive.

Matrix vs. Grid

The library provides two main classes for working with tabular data:

  1. Matrix: A typed table with headers and defined column types
  2. Grid: A uniformly typed table without headers

The Matrix class is more feature-rich and is the primary focus of this tutorial, while Grid provides a simpler structure for working with uniform data types.

Creating a Matrix

There are several ways to create a Matrix object, depending on your data source and requirements.

Creating from Groovy Code

The most common way to create a Matrix is using the builder pattern:

import se.alipsa.matrix.core.*

// Create a Matrix with data defined inline
def employees = [
    "employee": ['John Doe', 'Peter Smith', 'Jane Doe'],
    "salary": [21000, 23400, 26800],
    "startDate": ListConverter.toLocalDates(['2013-11-01', '2018-03-25', '2017-03-14']),
    "reviewPeriod": ListConverter.toYearMonths(['2020-01', '2019-04', '2018-10'])
]

def table = Matrix.builder().data(employees).build()

// Verify the data
assert table.salary[0] == 21000 // get index 0 from the column "salary"
assert table[0, 1] == 21000 // get row 0, column 1
assert table[0, 'salary'] == 21000 // get row index 0 from the column name salary
assert table['salary'][0] == 21000 // get the column and the first index
assert table.row(0).salary == 21000 // get the first row and then the salary column
assert table.row(0)[1] == 21000 // get the first row and then the second column

You can also create a Matrix by specifying columns and their types:

import se.alipsa.matrix.core.*
import java.time.LocalDate

// Create a Matrix with column definitions
def table = Matrix.builder().columns([
    'place': [1, 2, 3],
    'firstname': ['Lorena', 'Marianne', 'Lotte'],
    'start': ['2021-12-01', '2022-07-10', '2023-05-27']
])
    .types([String]*3)
    .build()

// Convert column types after creation
def table2 = table.convert([place: Integer, start: LocalDate])
println(table2.typeNames())  // Prints: [Integer, String, LocalDate]

Creating from a Result Set

When working with databases, you can create a Matrix directly from a JDBC ResultSet:

import se.alipsa.matrix.core.Matrix
import se.alipsa.groovy.datautil.SqlUtil

// Connect to a database and execute a query
SqlUtil.withInstance(dbUrl, dbUser, dbPasswd, dbDriver, this) { sql ->
    sql.query('SELECT * FROM PROJECT') { rs ->
        project = Matrix.builder().data(rs).build()
    }
}

// Now you can work with the project Matrix
println(project.content())

Creating from a CSV File

You can easily create a Matrix from a CSV file:

import se.alipsa.matrix.core.Matrix

// Create a Matrix from a CSV file
def table = Matrix.builder().data(new File('/some/path/foo.csv'), ';').build()

Accessing Data in a Matrix

The Matrix class provides multiple ways to access and manipulate data.

Using [] Notation

Data can be referenced using [] notation. For example, to get the content of the 3rd row and 2nd column:

def value = table[3, 2]  // Row 3, Column 2

You can also access data by column name:

def priceColumn = table["price"]  // Get the entire price column
def price = table[3, "price"]     // Get the price from row 3

If you pass only one argument, you get the column:

def priceColumn = table["price"]
// or alternatively
def priceColumn = table.price

Using Column Names

Matrix provides property-like access to columns:

// These are equivalent
def priceColumn = table.price
def priceColumn2 = table["price"]
def priceColumn3 = table.getAt("price")
def priceColumn4 = table.column("price")
def priceColumn5 = table.column(table.columnNames().indexOf("price"))
def priceColumn6 = table.columns().get(table.columnNames().indexOf("price"))

Using Row Indices

You can access rows by index:

def row = table.row(3)  // Get the 4th row (0-based indexing)
def value = row.price   // Get the price value from that row

Data Manipulation

Converting Data Types

The Matrix class provides several methods to convert column types:

  1. Using a map of column names and types:
def convertedTable = table.convert([
    place: Integer,
    start: LocalDate
])
  1. Using a list of column types:
def convertedTable = table.convert([Integer, String, LocalDate])
  1. Using a closure for complex conversions:
def convertedTable = table.convert('place', Integer, { 
    String val = String.valueOf(it).trim()
    if (val == 'null' || val == ',' || val.isBlank()) return null
    return Integer.valueOf(val)
})

Transforming Data

You can transform data using the apply method:

import se.alipsa.matrix.core.*
import java.time.*

def data = [
    'place': ['1', '2', '3', ','],
    'firstname': ['Lorena', 'Marianne', 'Lotte', 'Chris'],
    'start': ['2021-12-01', '2022-07-10', '2023-05-27', '2023-01-10']
]

def table = Matrix.builder('emp')
    .columns(data)
    .build()
    .convert(place: Integer, start: LocalDate)

// Add 10 days to the start dates
def table2 = table.apply("start", { startDate ->
    startDate.plusDays(10)
})

println(table2.content())

Adding a column

You can add new columns to an existing Matrix using addColumn. A common use case is to derive a new column from an existing one. Here is an example:

import se.alipsa.matrix.core.Matrix
import java.time.LocalDate

import static se.alipsa.matrix.core.ListConverter.*

import java.time.YearMonth

def empData = Matrix.builder().columns(
    emp_id: 1..5,
    emp_name: ["Rick", "Dan", "Michelle", "Ryan", "Gary"],
    salary: [623.3, 515.2, 611.0, 729.0, 843.25],
    start_date: toLocalDates("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11", "2015-03-27"))
    .types([int, String, Number, LocalDate]).build()

// Add a yearMonth column derived from the start_date column
def table = empData
    .clone() // clone to avoid modifying the original (optional)
    .addColumn("yearMonth", YearMonth, toYearMonths(empData.start_date))

println table.toMarkdown()

Which will output

emp_id emp_name salary start_date yearMonth
1 Rick 623.3 2012-01-01 2012-01
2 Dan 515.2 2013-09-23 2013-09
3 Michelle 611.0 2014-11-15 2014-11
4 Ryan 729.0 2014-05-11 2014-05
5 Gary 843.25 2015-03-27 2015-03

Subsetting Data

You can create subsets of your data using various methods:

// Using findIndexValues to select rows
def rows = table.rows(table['place'].findIndexValues { it > 1 })
assertEquals(2, rows.size())

// Using subset method
def subSet = table.subset('place', { it > 1 })
assert table.rows(1..2) == subSet.grid().rowList

There is also a filter {} alias with the same behavior, which can read more naturally for dataframe-style code:

def highPlaces = table.filter { it.place > 1 }
assert table.rows(1..2) == highPlaces.grid().rowList

Performing Calculations

Using the Apply Method

The apply method allows you to perform calculations on columns:

import se.alipsa.matrix.core.*

def account = Matrix.builder('account')
    .columns([
        'id': ['1', '2', '3', '4'],
        'balance': [12323, 23400, 45932, 77200]
    ])
    .build()

def ir = Matrix.builder('interest rates')
    .columns([
        id: [1, 2, 3, 4],
        interestRate: [0.034, 0.022, 0.019, 0.028]
    ])
    .build()

def accountAndInterest = account.clone().withMatrixName('accountAndInterest')
accountAndInterest['interestAmount', Double] = Stat.apply(account['balance'], ir['interestRate']) { b,r -> b * r }

// Result will have interest amounts calculated for each account
println(accountAndInterest.content())

Column Arithmetics

The Matrix class overrides basic mathematical operations to work on columns:

import se.alipsa.matrix.core.*

def m = Matrix.builder()
    .columns([
        id: [1, 2, 3, 4],
        balance: [10000, 1000.23, 20122, 12.1]
    ])
    .types(int, double)
    .build()

def r = Matrix.builder()
    .columns([
        id: [1, 2, 3, 4],
        ir: [0.041, 0.020, 0.035, 0.5]
    ])
    .types(int, double)
    .build()

def interest = [10000*0.041, 1000.23*0.020, 20122*0.035, 12.1*0.5]
assert interest == m.balance * r.ir

// You can also multiply all items in a column
assert [10000*0.05, 1000.23*0.05, 20122*0.05, 12.1*0.05] == m.balance * 0.05

Statistical Operations with Stat Class

The Stat class provides various statistical functions:

import se.alipsa.matrix.core.*
import static se.alipsa.matrix.core.Stat.*

def table = Matrix.builder().data([
    v0: [0.3, 2, 3],
    v1: [1.1, 1, 0.9],
    v2: [null, 'Foo', 'Foo']
])
.types(Number, Double, String)
.build()

def summary = summary(table)
println(summary)

// Calculate mean, median, standard deviation, etc.
def mean = mean(table.v0)
def median = median(table.v0)
def sd = sd(table.v0)
def variance = variance(table.v0)

Rolling, Cumulative, and Shift Operations

Matrix 3.7.0 adds time-series style helpers directly on Column and Matrix.

import se.alipsa.matrix.core.*

def prices = Matrix.builder('prices').data(
    day: [1, 2, 3, 4, 5],
    close: [100, 105, 102, 108, 110],
    volume: [10, 12, 11, 15, 18]
).types(Integer, Integer, Integer).build()

// Rolling windows on a single column
def movingAverage = prices.close.rolling(3).mean()  // [null, null, 102.33..., 105, 106.66...]

// Rolling windows across a matrix
def rollingMean = prices.rolling(window: 2).mean()

// Cumulative operations
assert prices.volume.cumsum() == [10, 22, 33, 48, 66]
assert prices.cummax().close == [100, 105, 105, 108, 110]

// Shift-style helpers
assert prices.close.lag() == [null, 100, 105, 102, 108]
assert prices.close.lead() == [105, 102, 108, 110, null]
assert prices.close.diff() == [null, 5, -3, 6, 2]

Rolling windows can also be centered, require fewer observations via minPeriods, and for matrices can be evaluated in a sorted order via by: while preserving original row order.

Indexing and Grouped Access

Matrix now supports database-style indexes on ordinary data columns:

def sales = Matrix.builder().data(
    country: ['USA', 'USA', 'UK', 'UK', 'USA'],
    quarter: ['Q1', 'Q2', 'Q1', 'Q2', 'Q1'],
    revenue: [100, 200, 150, 250, 300]
).types(String, String, Integer).build()

sales.createIndex('country', 'quarter')

assert sales.hasIndex()
assert sales.indexedColumns() == ['country', 'quarter']
assert sales.lookup('USA', 'Q1').revenue == [100, 300]
assert sales.lookup('USA').rowCount() == 3

Grouping now returns a GroupedMatrix instead of a plain map:

def grouped = sales.groupBy('country')

assert grouped.groupCount() == 2
assert grouped.get('USA').rowCount() == 3
assert grouped.level('country') == ['USA', 'UK'] as Set

def totals = grouped.agg(revenue: { Stat.sum(it) })
println totals.content()

If you need the pre-3.7.0 map style for older code, use grouped.toStringKeyMap().

Pipe-Style Chaining

For left-to-right transformation pipelines, use pipe {} or the Groovy | operator:

def result = prices
    .pipe { it.filter { row -> row.close >= 102 } }
    .pipe { it.orderBy('close', true) }

def result2 = prices | { it.selectColumns('close', 'volume') } | { it.cumsum() }

Using Groovy Integrated Queries (Ginq)

Groovy Integrated queries can be used on Matrix rows for powerful data manipulation:

import se.alipsa.matrix.core.Matrix

def stock = Matrix.builder('stock').data(
    name: ['Orange', 'Apple', 'Banana', 'Mango', 'Durian'],
    price: [11, 6, 4, 29, 32],
    stock: [2, 3, 1, 10, 9]
)
.types(String, int, int)
.build()

def result = GQ {
    from f in stock
    where f.price < 32
    orderby f.price in desc
    select f.name, f.price, f.stock
}

// You can also use Matrix's built-in query capabilities
def exp = stock.subset{it.price < 32}.orderBy('price', true)

// Create a new Matrix from the query result
def matrix2 = Matrix.builder(stock.matrixName).ginqResult(result).build()
println matrix2.content()

Will output

stock: 4 obs * 3 variables 
name  	price	stock
Mango 	   29	   10
Orange	   11	    2
Apple 	    6	    3
Banana	    4	    1

The Grid Class

A Grid is a uniformly typed table. The grid class contains static functions to operate on a 2D list structure.

Creating a Grid

import se.alipsa.matrix.core.*

def foo = [
    [12.0, 3.0, Math.PI],
    ["1.9", 2, 3],
    ["4.3", 2, 3]
] as Grid

// A typed way is usually clearer
def bar = new Grid<Number>([
    [12.0, 3.0, Math.PI],
    [1.9, 2, 3],
    [4.3, 2, 3]
])

// Calculate mean
println Stat.means(bar)

Output

[6.066666667, 2.333333333, 3.047197551]

Grid Operations

The Grid class provides several operations:

  • convert: Converts one column type to another
  • clone: Creates a deep copy of the grid
  • transpose: "Rotates" the grid 90 degrees
  • isValid: Checks if it is a proper grid or not

Elements can be accessed using the simple square bracket notation:

foo[0, 1] = 3.23
assert 3.23 == foo[0, 1]

Using Matrix from Other JVM Languages

If you're using the Matrix library from Java, you'll need to use the underlying methods instead of the shorthand notation:

// Instead of
myMatrix[2, 'id'] = 34;

// You must use
myMatrix.putAt(2, "id", 34);

// Instead of
def myGroovyVar = myMatrix[1, 2];

// You must use
var myJavaVar = myMatrix.getAt(1, 2);

There are some utility classes making Matrix creation less painful in Java:

import se.alipsa.matrix.core.Matrix
import static se.alipsa.matrix.core.ListConverter.*
import java.time.LocalDate
// In Groovy
def gEmpData = Matrix.builder()
    .data(
        emp_id: 1..5,
        emp_name: ["Rick", "Dan", "Michelle", "Ryan", "Gary"],
        salary: [623.3, 515.2, 611.0, 729.0, 843.25],
        start_date: toLocalDates("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11", "2015-03-27")
    )
    .types(int, String, Number, LocalDate)
    .build()
// In Java
import se.alipsa.matrix.core.Matrix;
import static se.alipsa.matrix.core.ListConverter.*;
import java.time.LocalDate;
import se.alipsa.matrix.core.util.Columns;
import static se.alipsa.matrix.core.util.CollectionUtils.*;

var jEmpData = Matrix.builder().data( new Columns()
    .add("emp_id", r(1, 5))
    .add("emp_name", "Rick", "Dan", "Michelle", "Ryan", "Gary")
    .add("salary", 623.3, 515.2, 611.0, 729.0, 843.25)
    .add("start_date", toLocalDates("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11", "2015-03-27"))
)
.types(int.class, String.class, Number.class, LocalDate.class)
.build();

Conclusion

The matrix-core module provides a solid foundation for working with tabular data in Groovy and Java. Its intuitive API, powerful data manipulation capabilities, and statistical functions make it an excellent choice for data analysis and processing tasks.

In the next sections, we'll explore the additional modules that build upon this foundation to provide specialized functionality for various data formats and operations.

Go to previous section | Go to next section | Back to outline