KristinJensen edited this page Apr 4, 2012 · 1 revision

Notes from Juxta Camp Day Two

Introduction

These are notes from Day Two of Juxta Camp held at Performant Software's offices in Charlottesville, Virginia on July 12, 2011. These notes represent an attempt to paraphrase the discussion and should not be taken as verbatim quotations from the participants. Brackets and ellipses mark spots where your note taker missed details.

Notes from Juxta Camp, Day Two (July 12, 2011)

Nick Laiacona: begin with Jim’s presentation

James Smith: presentation: Corpora Space Architecture

Walk through slides from Toolmixer at MITH – Bamboo / Corpora Space

Bamboo: phase one work is ongoing

4 parts

  1. workspace

  2. service platform – enterprise-level platform for services and instrumentation

  3. collection interoperability – how can we make various collections present themselves in such a way that the tool can use them – Hathi Trust, etc.

  4. Corpora Space

Part of the project is designing phase 2: testing out architecture ideas

In September, we will put together the document for a proposal in March

Questions:

To what degree do we want serendipity? Important to humanists—recreate experience in library where you find what you weren’t looking for but what you need

Ownership of material: how important is it? Curation

How do we define a collection?

You want to move computation to the data, not data to the computation

Activity groupings: model, view, transform, auxiliary, other

Common flow for data: modeling > transformation > viewing or viewing > transformation > modeling

Cycle for data: curated data, found data, transformed data, annotation > proposed data

Principles:

Layered such that each layer uses the one below it – enables changing implementation details without changing other layers

Maximize agency for each component – person providing tools, content, service; person running service; want everyone to feel ownership

Maximize flow for users and developers: match challenge to skill

Need a wide variety of interfaces: tools, workspaces, command line

Spectrum of users – spectrum of ways to use stuff

It Just Works: it should just work; a cue to think about how the interface is behaving

Initial idea:

We have clients, agents, and a switchboard between the two

Clients consume

Agents provide stuff

Switchboard mediates

Client can expect certain services without knowing where they live; don’t have to ask where things are

Switchboard figures out which agent provides it

Agent provides model; client provides view
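The client/agent/switchboard relationship described above can be sketched in a few lines of Ruby. This is an invented illustration, not the actual Bamboo code; the class and function names are made up:

```ruby
# Agents register the functions they provide with the switchboard;
# clients ask for a function by name without knowing which agent hosts it.
class Switchboard
  def initialize
    @providers = {}  # function name => agent
  end

  # An agent announces the functions it can serve.
  def register(agent, function_names)
    function_names.each { |name| @providers[name] = agent }
  end

  # A client invokes a function by name; the switchboard finds the agent.
  def call(function_name, *args)
    agent = @providers.fetch(function_name) do
      raise "no agent provides #{function_name}"
    end
    agent.call(function_name, *args)
  end
end

class EchoAgent
  def call(_name, *args)
    args.join(" ")
  end
end

board = Switchboard.new
board.register(EchoAgent.new, ["corpora:echo"])
puts board.call("corpora:echo", "hello", "world")  # prints "hello world"
```

The point of the sketch is the mediation: the client never learns where `corpora:echo` lives, so the agent can move without the client changing.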

Layered nature of it:

Network protocol, agent library

Put a compute engine in the library: taking the computing to the data

Library sits above the client library: does the abstractions specific to Corpora Space

Working set or intermediate result set storage

Might be restricted based on metadata

Able to store stuff back in: essentially a cloud for humanities computing

Intermediate result stuff = agent providing a service

Example: two different researchers using Corpora Space

Diagram: looks like one tool is going to the next tool: not actually how bytes are streaming around, but conceptually how it works

Working sets going between the tools

How the tool sees the world: pulls stuff out of Corpora Space and pushes stuff back in

If we have two tools, first one pulls stuff out, pushes it back in; second pulls it back out

Ways tools can interact with Corpora Space

Basic profile: mirrors how people work with file systems

Tool can use data from Corpora Space, authenticate to Corpora Space, provide data to Corpora Space

Want to add provenance data

Advanced tool: curation profile

Tool provides audit or provenance data

Can run tools, play, then say “how did I get here?”

Workflow profile: most advanced, most integrated with Corpora Space

Tool can let Corpora Space tell it what to do with the data: reconstruct flow from previous tool usage

Want to work with other tools without having a huge investment for integration

Nitty-gritty:

Function libraries, client/agent structure, etc.

Function libraries: set of functions referenced with XML namespace

Provides permanence of function

Agents can move around, come and go; functionality always called by that name

Functions don’t have side effects, actions do

Four types of functions: mappings, reductions, consolidations, ‘plain functions’

Reduction/consolidation: example: nasty nested set of function calls: tree of calls: reduction = leaf node, consolidation = branch node

Allows us to take what humanist wants and split it into parallel chunks
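A toy illustration of the leaf/branch split (an invented example, not the Utukku engine itself): a reduction runs over each parallel chunk, and a consolidation combines the partial results:

```ruby
# Reduction: leaf-node work that can run on each chunk in parallel.
reduction = ->(chunk) { chunk.split.length }        # word count per chunk

# Consolidation: branch-node work that combines partial results.
consolidation = ->(partials) { partials.sum }

chunks   = ["to be or not", "to be that is", "the question"]
partials = chunks.map(&reduction)        # parallelizable leaf work
total    = consolidation.call(partials)  # branch-node combination
puts total  # 10
```

Splitting the humanist's request this way is what lets the engine farm the leaf work out to parallel workers.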

Client with everything in it

Client with remote topic modeling – from client perspective nothing has changed

SOA, ROA

Protocol between clients/agents and the switchboard is WebSocket

Multiple requests can be in flight at once, and responses don't have to come back in the same order

REST+protocol for accessing resources
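One way to picture the out-of-order request/response multiplexing (a made-up message shape, not the actual wire protocol): each request carries an id, and the client matches replies to pending requests by id rather than by arrival order:

```ruby
# Record two outstanding requests by id.
pending = {}
[1, 2].each { |id| pending[id] = "request-#{id}" }

# Responses arrive out of order over the same connection;
# match each to its request by id, not by position.
responses = [{ id: 2, body: "second" }, { id: 1, body: "first" }]
matched = responses.map { |r| [pending.delete(r[:id]), r[:body]] }
p matched  # [["request-2", "second"], ["request-1", "first"]]
```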

Code available on github

Compute engine based on stuff in Jim’s github repository: all open source

Nick Laiacona: any questions?

James Smith: quick demo

First, command line client

Nick Laiacona: questions?

Gregor Middell: we decomposed collation into different functions

One problem we ran into: decomposing of problem into several steps was easy, but you end up handing off a lot of data from function to function

Example: handing off document to tokenizer, then matcher, then aligner, then visualizer possibly in different countries

How do you handle the problem of having to hand off the data?

James Smith: a couple possibilities:

You could have stickiness, for example: if UVA had an installation of the switchboard and wanted to provide local versions of the tokenizer etc., then we'd want a notion of the closest implementation to the data. Doesn't solve all of the transport problems, but helps.

When you’re building something this flexible, you’re never going to have the performance of a monolithic desktop app.

If we can get to a model where people—batch model—know it’s going to be a long-running job, fire it off and come back later.

Thinking back to time when I used to use Medline—fire off request, come back next day and second day for results—people were used to that. Now we have faster and faster computers. Need to get distributed stuff out there—expectations have to be reconfigured.

It’s not going to match what people experience on the desktop.

Nick Laiacona: what really kills you is the last mile on the DSL modem

If switchboard can cache the data and pass a pointer …

James Smith: working on letting client say “if I disconnect, don’t kill the job”

Nick Laiacona: otherwise you’re capped on the amount you can do …

James Smith: if you’re doing a topic map, you may want to see the data before it’s finished; build an idea of what you’re expecting it to look like

Nick Laiacona: so for Juxta: bring back up the diagram that shows the researcher, the tools, and the Corpora Space cloud

Where would Juxta fit in? A tool? But on the web

James Smith: a couple options

One: the tool could be the Javascript client

You’d be able to have a Javascript app that would handle everything

Another: the server can connect

That hasn’t been the primary model we’ve worked with; but I’m starting to think about it more

I expect it ends up being comparable to connecting to a database; do you want to be able to cache the connection?

All in the browser vs. server doing some of the computational stuff

If you wanted to interact with Corpora Space you could treat it like you would treat a database: persistent connection

Retrieve documents, put them back in

We would need to provide a way of redoing the authentication part

Client needs to be able to tell switchboard who is using it

If you’re doing server-side connection then you need to be able to redo authentication piece

Nick Laiacona: we’re not providing any authentication layer—it would be like Solr

James Smith: some things would be available without authentication, some need authentication

It could be that there are agents providing services that don’t need authentication

Gregor Middell: How would you discover documents on Bamboo?

James Smith: Plan now is that the document collection (intermediate results, working collection) is provided by an agent—set of functions provided by the agent to the environment

Gregor Middell: Did big repositories such as Hathi Trust already say which functions they would want?

James Smith: I was thinking of intermediate results; that’s something different

We haven’t provided the detail

Coming up with a standard way of addressing documents in these collections

We would provide some way of providing the semis [?] semantics, do queries, get back URLs

Nick Laiacona: [question I missed]

James Smith: if you want other tools outside Juxta to access the capabilities of Juxta, then it would be an agent

Ronald Dekker: I would think the current Juxta product would be both a client and an agent

Nick Laiacona: It would be on the other side of the switchboard

Alex Gil: when you say you can switch data from one tool to another—Juxta still hasn’t decided how it’s going to add info about collation to a database or XML file—when it sends it back to be used by another tool, how can the interpreter of that other tool guarantee that it’s going to be able to read it? How is that info going to be useful to something like semantic analysis?

James Smith: you’d be able to attach metadata about the content: mime type, etc.

The tool can say “if the metadata matches this pattern, I can use it”

Alex Gil: example: say you run Juxta, it tells you differences between text A and B

Say I want to mine just the differences

Differences will have their own mime tag

They can be extracted by a query, jquery?

James Smith: yes and no

Social aspect of environment

People agreeing on standards

How granular they want to label it

Tool provides information to decide can I use it or not

I’m not trying to “overfit” right now

Alex Gil: that’s typical of workflows

Somebody wants to study just words relating to geographical locations, difference between two texts

Start with Juxta, end up with something like [?]

Dana Wheeles: Hathi Trust, ECCO, large repositories we might be able to act upon: will they live in their own places, or copied and kept in one place?

James Smith: may depend on collection

Hathi Trust interested in installing Corpora Space

Dana Wheeles: oh so there could be multiple installations

James Smith: need to work out details, but UMD could have installation of switchboard, UVA too, they might communicate with each other

You have your home institution and that’s where your result sets are

Alex Gil: that allows some institutions not to share but to have access to this

There would be a central one where you could get access to Hathi?

James Smith: yes

Alex Gil: could I install on my computer?

James Smith: yes, you’d have to have agreement to get access somewhere else

Sustainability considerations: 1. who’s paying? 2. how can we get people to play with it?

Gregor Middell: that could mean that the service would have to be written to be placed close to data

James Smith: not necessary; we can put it far from data

But university might want to put it close to their data

Fetch of data might be remote, results local

Dana Wheeles: what does that mean for updates? Proliferation of different versions?

James Smith: one way to guard against that: when interface changes you have a different namespace (like when XML schema changes)

If expectations of functions change, new namespace

Dana Wheeles: who’s the gatekeeper? Who says, “you’ve got to switch your installation”?

James Smith: no—the person who owns the switchboard will have the ability to say “this agent can or cannot connect”—e.g. through SSL certificates. That’s one way of gatekeeping. The other is that one of the expectations is that the list of functions specified by a namespace is essentially a contract. As long as this namespace is available, these functions will be here. If we fix bugs, the namespace doesn’t change. If we add functionality, we change the namespace. No way to force people to upgrade.
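The namespace-as-contract rule could be pictured like this (the URIs and function names below are invented for illustration):

```ruby
# The set of functions behind a namespace never changes: bug fixes ship
# under the same namespace; new functionality forces a new namespace,
# just as a changed XML schema gets a new namespace URI.
FUNCTIONS = {
  "http://example.org/collate/1.0" => ["tokenize", "align"],
  "http://example.org/collate/1.1" => ["tokenize", "align", "transpose"],
}

def supports?(namespace, fn)
  FUNCTIONS.fetch(namespace, []).include?(fn)
end

puts supports?("http://example.org/collate/1.0", "transpose")  # false
puts supports?("http://example.org/collate/1.1", "transpose")  # true
```

A client pinned to the 1.0 namespace keeps working unchanged; nothing forces it to upgrade.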

Dana Wheeles: what does that mean for vetting? There will be new bugs. Is there a testing time period?

James Smith: don’t see why you couldn’t have a sandbox

Dana Wheeles: working on same corpus?

James Smith: yes

Gregor Middell: you want to have reproducibility of certain results—you would need the old version of Juxta

New versions might produce different results

Nick Laiacona: gets into that provenance thing.

Question about the compute engine—Ruby implementation had native Utukku script. Will there be native APIs for native languages?

James Smith: what I’m planning on: while the engine will be at the core, where the protocol is, we’ll have a wrapper that will give a language-native way of accessing that functionality

The only time you’ll need the script is if you’re doing stuff that’s not Corpora Space

For example to access Juxta since it’s not a native part of Corpora Space

Nick Laiacona: a tool could provide its own Ruby version to make that nicer

James Smith: common and simple things should be simple

As you take advantage of power, it collects complications

Nick Laiacona: for our tool, we would integrate through Javascript/HTML—fire off different function set—grab documents, collate them; it would speak through your API

The web service would have a wrapper that would talk with your API

So there would be a transport wrapper on the web service

And there would be the user interface—what does that look like

Not just “here’s some collation result,” but what happens before that?

James Smith: Ruby library, also Javascript library—still a bit raw

The goal would be to provide libraries in the common languages people are using

If we do the ruby one, it will run as [?]

I wouldn’t mind seeing a native Java installation

Nick Laiacona: so you would need Scala [scalar? To scale it?] and JRuby? It would be the switchboard that was running those things?

James Smith: right now we have a Ruby library that would run under JRuby if you were wanting to build a Java-based client

Nick Laiacona: our system’s in Java so we would need JRuby

[Gregor asks to see some of the details]

James Smith: library doesn’t change if you load it into your client—it would all be local

Nick Laiacona: is this your own language syntax?

James Smith: yes; based on XQuery and XPath

Nick Laiacona: do you have an xpointer implementation?

James Smith: no

Came out of curating digital humanities projects

Instead of having to maintain PHP over time, we have one framework we manage and run various projects on it

It’s like having a PDF reader—instead of having to maintain PDFs, you just have the reader.

[break]

Alex Gil: run two texts—approximate string match—once the string match makes some suggestions, identify a block, then run that block through an index of the whole corpus—you don’t have to do approximate string match on whole corpus and you’re already establishing a relationship

The scholar is always reflecting what is the meaningful block

Nick Laiacona: if we had a Java API that could do that, we could plug it in

Alex Gil: I was thinking of the way Bamboo is going to start working with these chunks

Gregor Middell: have been working on Juxta for a while now

Trying to make Juxta scale to larger text traditions

Current state of affairs: original version of Juxta was working with plain text

So internal data model is geared towards plain text witnesses

Only recently (1.4) was XML support introduced

Juxta—only collation tool that supports XML

We switched to XML and data model had to be extended substantially

Internally we not only have plain text but something that resembles a DOM

Allows us to handle the tagging that was in the original witnesses

One of the main challenges: this data model is kept completely in memory

Problems …

My main job is working for a genetic edition in Germany (Goethe’s Faust)

Provide a lot of different perspectives on Faust

Juxta’s ability to be integrated in different settings is useful

Talk about Faust first, then return to Juxta

When we started 2.5 years ago, the idea was to build a very simple digital edition

Classical XML architecture

Data storage, usually natively XML (XML databases); might also have relational databases

Get connected by some kind of logic that allows you to extract information you want to view and deliver it to the client

All the technologies in this pipeline are XML or based on XML

Tree-based

That was a challenge: genetic edition

Trying to show different perspectives: document and textual oriented

In German editorial theory, the basic idea is that if you are able to give a very faithful representation of the record, communicate to the user what you found in the archive, then you have a lot more freedom in interpreting the results

This matches up with the markup perspectives on Faust

Document markup is very faithful, as close to original as possible

Then go to textual perspective, maybe change something of the findings and deliver something more readable

Textual view very ordered, strict order

On manuscript, not that linear order at all—different orientation of texts, strikethroughs that cross boundaries: must represent that in order to adhere to German editorial theory

Spend a lot of time reproducing these things in SVG

Other views:

Genetic markup,

Metadata,

Text/Image annotation: archive of illustrations: want to link these illustrations to the text

Want automatic linkage to motifs in the text

Challenge from markup perspective: with each perspective you have a different view on the data

Example of three-line text, with middle line inserted, marked up as page zones faithful to original document; textual markup with line B inline; genetic with order of stages

Problem: how do you encode this? In XML you can encode one perspective very well, but must subordinate the other perspectives

We encode every structure separately

Different transcripts

How do we collate different encodings?

We want to use collation to merge these different views

Graph-based model of the text: not singular trees, but interrelated trees in the database

User should see not just one tree, but have links to other views

If you collate texts from different structures, you come up with correlations

We use collation internally to merge different perspectives, which allows us to change the order of things

Collation algorithm is able to handle transpositions

It’s common to see text on the document that does not belong to Faust

Document oriented view contains much more text—for example, letter written on the reverse side of a piece of paper where Goethe scribbled lines of Faust

We are interested in collation from editorial perspective: not just collating different witnesses, but solving the overlapping hierarchies problem

We want to be able to encode our findings in a way that is standards-compliant so that anybody else can see the difference and collate in isolation but also see what the collation came up with

We also try to influence TEI to try to make apparatus TEI-compliant

Critical Apparatus Workgroup (TEI wiki)

Silence of TEI re: things like transposition because transpositions are hard to code in XML

Identified some issues with the current critical apparatus module

Specific phenomena not covered e.g. transposition

Handling of punctuation—what is a collation supposed to do with punctuation?

Representing omissions

How does it scale? What if you do encoding with five witnesses, then discover a sixth? You have to redo encoding

TEI does not differentiate between model and representation of textual variants

There might be multiple representations of textual variants

How do you model textual variants?

Ronald was talking about pipeline process: tokenize, align, etc.

You would expect a tokenizer not just to split up into tokens but also keep markup context

You have tagging that tries to normalize tokens e.g. abbreviations

Collator might recognize that same tokens might have different meaning based on tagging context

Alignment means introducing gaps into original witnesses such that they line up correctly

You can come up with a fairly simple encoding of an alignment table

Then you can detect transpositions
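A minimal example of such an alignment table (invented tokens, not Juxta's internal encoding): gaps are introduced into each witness so that matching tokens line up column by column:

```ruby
witness_a = %w[the quick fox]
witness_b = %w[the fox jumps]

# One plausible alignment; nil marks an introduced gap.
table = [
  ["the",   "the"  ],
  ["quick", nil    ],  # only in witness A
  ["fox",   "fox"  ],
  [nil,     "jumps"],  # only in witness B
]

# Dropping the gaps recovers each witness in order.
p table.map(&:first).compact  # ["the", "quick", "fox"]
p table.map(&:last).compact   # ["the", "fox", "jumps"]
```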

Alex Gil: what if all sections are jumbled in relation to each other and there are no alignments providing sequence?

Ronald Dekker: that is what Collatex does: it doesn’t look at the order; it tries to find identical blocks. When there are multiple possibilities for a block, then it looks at the order.

Gregor Middell: alignment process is normally based on sequence alignments; sequence is assumed to be inherent in the text

Alex Gil: I’m trying to challenge that

Gregor Middell: Apparatus markup works well for some purposes, but not others

You sometimes need reference back to specific witness

So we came up with new model

You want to include tokenization tagging—impose it on top of other tagging

End up with XPointer scheme: ability to point to specific words

If you don’t have a means to point directly into the witnesses, you’re lost

Could use w tag for words, line tag for granularity at line level

If tokens are addressable, then an alignment is sets of tokens that line up.

Alex Gil: so the tokens exist separately from the original spelling?

Gregor Middell: I think that the alignment would point to the original content. There might be some normalization but you want to point back to the original reading.

That is what you want to point the reader to.

Alex Gil: you accomplish that through IDs? XPath?

Gregor Middell: that’s what I want to get at in my presentation

I settled for an offset model

You can encode as sets: querying sets becomes easy

You can ask what other alignments are there

This would be the basic data model of textual variations: tokens that line up

Now you can extract different views of it

You can use these sets to reconstruct critical apparatus by embedding pointers back to original witness

It doesn’t matter whether you solve the reference to the tokens or keep them in

I favor the embedding of pointers
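The set-based model with pointers back into the witnesses might look like this (the token addresses are invented for illustration):

```ruby
require "set"

# Each token is addressed by (witness, start offset, end offset);
# an alignment is simply a set of such addresses.
a1 = Set[[:A, 0, 3], [:B, 0, 3]]    # "the" in both witnesses
a2 = Set[[:A, 10, 13], [:B, 4, 7]]  # "fox" in both witnesses
alignments = [a1, a2]

# Querying becomes ordinary set membership: which alignments
# involve the token at A[0,3]?
involving = alignments.select { |set| set.include?([:A, 0, 3]) }
puts involving.length  # 1
```

Because every set member points at an offset range in a specific witness, any view built from the alignments can lead the reader back to the original reading.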

That’s as far as the workgroup got

Let me get to what we would be getting in Juxta

Lack of XPointer as a language—not able to use it

Lack of DOM—models don’t allow us to express string ranges

We had to do something different

Architecture of Juxta allows pointing

Having a flat text is reasonably fast

Nice if you have a database as a back end

There was a value in not sticking with XPointer

You have a document model based on a tree

If you try to come up with a different model not based on tree, you are in the wilds

Experimental markup languages

We think of markup as something that is offset range based

Simple conceptual model: markup consists of a start offset into the text and a range

Allows arbitrary overlap

You can layer arbitrary ranges

Integrating these over a flat stream of characters is easy

What we implemented in Juxta: we’re handed an XML document with a tree structure and we flatten it: transform start and end into ranges

There are ways to reconstruct trees based on flat model

Currently what we need is flat model where we have witnesses with ranges
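The flattening step could be sketched like this (a toy scanner that ignores attributes and entities, not Juxta's actual parser): the XML tree is walked, the text is emitted as one flat string, and each element becomes an offset range over that string:

```ruby
# Flatten well-formed XML into [flat_text, [[name, start, end], ...]].
# Once markup is ranges over a character stream, arbitrary overlap
# becomes representable.
def flatten(xml)
  text   = ""
  ranges = []
  open   = []  # stack of [element name, start offset]
  xml.scan(%r{<(/?)([^>]+)>|([^<]+)}) do |close, name, chunk|
    if chunk
      text << chunk                       # character data
    elsif close == "/"
      n, s = open.pop                     # element ends: emit its range
      ranges << [n, s, text.length]
    else
      open << [name, text.length]         # element starts: remember offset
    end
  end
  [text, ranges]
end

text, ranges = flatten("<l><w>Habe</w> <w>nun</w></l>")
p text    # "Habe nun"
p ranges  # [["w", 0, 4], ["w", 5, 8], ["l", 0, 8]]
```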

Finish with a short demo

Want to be able to upload an XML document and it gets transformed into range-based model

You want an annotation for every single token

A tokenizer introduces additional annotations; doesn’t have to worry about existing markup; it’s been flattened

Alignment process would query different ranges: just give me all the ranges that constitute tokens

Aligner would align the text, create an alignment table, find alignments and put them back into textual repository
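Tokenization over the flattened text then just adds more ranges on top (a sketch, assuming simple word tokens), without touching the existing markup:

```ruby
text = "Habe nun, ach!"

# The tokenizer emits one annotation range per word token; it never
# needs to inspect the original markup, which has been flattened away.
tokens = []
text.scan(/\w+/) do
  m = Regexp.last_match
  tokens << ["token", m.begin(0), m.end(0)]
end
p tokens  # [["token", 0, 4], ["token", 5, 8], ["token", 10, 13]]
```

The aligner can then query exactly this layer: "give me all the ranges that constitute tokens."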

What we have in Juxta right now is annotation model bound to RDB

Instead of embedded Java database we could use MySQL or other database for scale

This is all currently internal to Juxta

I am working on restful service that would have this functionality

[gives a demo]

Dana Wheeles: how does it handle really long texts?

Gregor Middell: text repository is completely stream based; no predetermined maximum

Parsing happens in a stream-based fashion

Does not hold in memory, no memory constraint

One problem we have: diff algorithm or Juxta’s algorithm: for this to work, it still loads the document into memory because it wants to find all the alignments

I think we can segment larger witnesses beforehand

Juxta filters out the transposition, then drills down into the transposition and does recursive collation

If you generalize this concept and think of these segments as fragments or witnesses, then you have a solution to scalability

While declaring transpositions, you are fragmenting the witness set

We could design something that imposes a maximum witness size and requires you to fragment your witnesses or does approximate matching as preprocessing

Alex Gil: that’s the solution to my problem: reduces computation time, and you get a result that can be used in more ways

Gregor Middell: the downside of this pre-fragmentation: you lose segments

Dana Wheeles: a possible solution would be other visualizations: we need to have global view, find sites with a lot of changes

But we also need to deliver close map of changes in local area

How do we deliver both?

Maybe we can brainstorm how to make those visualizations without taxing the service

Alex Gil: maybe the time-consuming process could run separately

Interesting visualizations can come out of larger view

You can do both

Gregor Middell: conceptually, you would have to do both

We are thinking about how to constrain algorithm to make it predictable in terms of resource requirements

Example of large text that choked Collatex: had to allocate 2 gigabytes of memory; can’t do that in all cases

Have to make a statement about what the user can expect/do

Nick Laiacona: if the cost was not an issue: we can collate documents up to X size immediately or larger documents if you come back later

Alex Gil: access to a grid? So you don’t have to do it on the server machine

Gregor Middell: grid technology would help

Nick Laiacona: agenda: brainstorming sessions and hacking

A number of technical problems on the table that we could spend our time on

Maybe we could spend a little time discussing what to discuss

What I’m hoping to get out of this for Juxta:

Two main pieces:

Web service model: we need to work out what the protocols are going to be

Pragmatic piece: Lou and Gregor have been working on different branches; we need to get together on that

Collatex has a working web service; we have the structures for Bamboo; it would be interesting to see if you could hack some stuff together

Other people’s ideas of things we could achieve?

Dana Wheeles: eager to find out more about how we can think about a web service in terms of Corpora Space

Andrew Stauffer: easy to figure out with quick back-and-forthing

Lou Foster: I’ve done a prototype just like Gregor’s presentation

Alex Gil: want more explanation of the range

Nick Laiacona: range offsets have implications for a lot of stuff beyond Juxta

Alex Gil: it sounds like if you flatten out, you can have all kinds of stuff later

Gregor Middell: all the ranges over the text still exhibit the properties of the XML elements

Alex Gil: I’m worried more about how this is going to look and what it’s going to do for users than what it looks like in the back

When I’m doing an html visualization of a text, at one level I’m showing Juxta stuff, at another level I’m showing semantic meaningful stuff like geographic places; when I switch between two copies they’re still in the same space so it looks like I’m getting the same text but with this you can have the one text and …

Gregor Middell: that’s the aim

Alex Gil: but you can do more than that; you can introduce texts within texts—some of the text remains the same, the annotation stays the same, if you were to introduce an annotation bubble, you don't mess with the rest: you couldn't do that with the two texts

I want to hear more about how this works

Nick Laiacona: one brainstorming session could be talking more about offset range stuff

Dana Wheeles: plan to have lunch, then schedule exact working groups

Do we want full working groups or small groups?

Make agenda before lunch gets here

Offset ranges; web services and pipelines

Nick Laiacona: Juxta and Collatex are both going to be on the Gothenburg model …

Ron + Jim work together

Andrew Stauffer: makes sense to have Lou, Ron and Jim together

Gregor should be in every group, but he’s here longer

Nick Laiacona: maybe we should work on resolving two branches when we have more time after the workshop

Maybe I, Lou, and Gregor could work on implications of ranges

Clone this wiki locally