JuxtaCampNotes2
These are notes from Day Two of Juxta Camp held at Performant Software's offices in Charlottesville, Virginia on July 12, 2011. These notes represent an attempt to paraphrase the discussion and should not be taken as verbatim quotations from the participants. Brackets and ellipses mark spots where your note taker missed details.
Nick Laiacona: begin with Jim’s presentation
James Smith: presentation: Corpora Space Architecture
Walk through slides from Toolmixer at MITH – Bamboo / Corpora Space
Bamboo: phase one work is ongoing
4 parts:
- workspace
- service platform – enterprise-level platform for services and instrumentation
- collection interoperability – how can we make various collections present themselves in such a way that the tool can use them – Hathi Trust, etc.
- Corpora Space
Part of the project is designing phase 2: testing out architecture ideas
In September, we will put together the document for a proposal in March
Questions:
To what degree do we want serendipity? Important to humanists—recreate experience in library where you find what you weren’t looking for but what you need
Ownership of material: how important is it? Curation
How do we define a collection?
You want to move computation to the data, not data to the computation
Activity groupings: model, view, transform, auxiliary, other
Common flow for data: modeling > transformation > viewing or viewing > transformation > modeling
Cycle for data: curated data, found data, transformed data, annotation > proposed data
Principles:
Layered such that each layer uses the one below it – enables changing implementation details without changing other layers
Maximize agency for each component – person providing tools, content, service; person running service; want everyone to feel ownership
Maximize flow for users and developers: match challenge to skill
Need a wide variety of interfaces: tools, workspaces, command line
Spectrum of users – spectrum of ways to use stuff
It Just Works: it should just work—a cue to think about how the interface is working
Initial idea:
We have clients, agents, and a switchboard between the two
Clients consume
Agents provide stuff
Switchboard mediates
Client can expect certain services without knowing where they live; don’t have to ask where things are
Switchboard figures out which agent provides it
Agent provides model; client provides view
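[Illustration: a minimal sketch of the client/agent/switchboard roles in Java; all names are hypothetical, not the actual Corpora Space API.]

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: clients consume, agents provide, the switchboard mediates.
interface Agent {
    boolean provides(String function);            // does this agent offer the named function?
    String invoke(String function, String input); // perform the work
}

class Switchboard {
    private final List<Agent> agents = new ArrayList<>();

    void register(Agent agent) { agents.add(agent); }

    // A client asks for a function by name; the switchboard finds an agent
    // that provides it, so the client never needs to know where services live.
    String call(String function, String input) {
        for (Agent a : agents) {
            if (a.provides(function)) return a.invoke(function, input);
        }
        throw new IllegalStateException("no agent provides " + function);
    }
}
```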
Layered nature of it:
Network protocol, agent library
Put a compute engine in the library: taking the computing to the data
Library sits above the client library: handles the abstractions specific to Corpora Space
Working set or intermediate result set storage
Might be restricted based on metadata
Able to store stuff back in: essentially a cloud for humanities computing
Intermediate result stuff = agent providing a service
Example: two different researchers using Corpora Space
Diagram: looks like one tool is going to the next tool: not actually how bytes are streaming around, but conceptually how it works
Working sets going between the tools
How the tool sees the world: pulls stuff out of Corpora Space and pushes stuff back in
If we have two tools, first one pulls stuff out, pushes it back in; second pulls it back out
Ways tools can interact with Corpora Space
Basic profile: mirrors how people work with file systems
Tool can use data from Corpora Space, authenticate to Corpora Space, provide data to Corpora Space
Want to add provenance data
Advanced tool: curation profile
Tool provides audit or provenance data
Can run tools, play, then say “how did I get here?”
Workflow profile: most advanced, most integrated with Corpora Space
Tool can let you tell me what to do with the data: reconstruct flow from previous tool usage
Want to work with other tools without having a huge investment for integration
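[Illustration: one hypothetical way to read the three profiles, as nested capability contracts; all names invented.]

```java
import java.util.List;

// Hypothetical sketch: each profile extends the previous one, so the most
// integrated tools (workflow) also satisfy the basic and curation contracts.
interface BasicProfile {
    void authenticate(String credentials);       // authenticate to Corpora Space
    String fetchData(String id);                 // use data from Corpora Space
    void provideData(String id, String content); // provide data back to Corpora Space
}

interface CurationProfile extends BasicProfile {
    List<String> provenance(String id);          // audit trail: "how did I get here?"
}

interface WorkflowProfile extends CurationProfile {
    void replay(List<String> previousSteps);     // reconstruct a flow from prior tool usage
}
```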
Nitty-gritty:
Function libraries, client/agent structure, etc.
Function libraries: set of functions referenced with XML namespace
Provides permanence of function
Agents can move around, come and go; functionality always called by that name
Functions don’t have side effects, actions do
Four types of functions: mappings, reductions, consolidations, ‘plain functions’
Reduction/consolidation: example: nasty nested set of function calls: tree of calls: reduction = leaf node, consolidation = branch node
Allows us to take what humanist wants and split it into parallel chunks
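[Illustration: a toy sketch of the reduction/consolidation split, with word counting standing in for a real humanist task; reductions run at the leaves, close to the data, and consolidations combine partial results at the branches, which is what lets a request fan out into parallel chunks.]

```java
import java.util.List;

class CallTree {
    // Leaf node: a reduction computes a partial result over one chunk of data.
    static int reduce(List<String> tokens) {
        return tokens.size();
    }

    // Branch node: a consolidation combines the partial results of its children.
    static int consolidate(List<Integer> partials) {
        return partials.stream().mapToInt(Integer::intValue).sum();
    }
}
```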
Client with everything in it
Client with remote topic modeling – from client perspective nothing has changed
SOA, ROA
Protocol between client agent and switchboard is websocket
Can have multiple requests and responses don’t have to be in same order
REST+protocol for accessing resources
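[Illustration: since responses need not arrive in request order, each request presumably carries an id the client uses to correlate replies; a hypothetical sketch of that multiplexing, not the actual wire protocol.]

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: tag each outgoing request with an id and park a future;
// complete the matching future whenever its response arrives, in any order.
class Multiplexer {
    private final Map<Long, CompletableFuture<String>> pending = new ConcurrentHashMap<>();
    private long nextId = 0;

    synchronized CompletableFuture<String> send(String function, String payload) {
        long id = nextId++;
        CompletableFuture<String> result = new CompletableFuture<>();
        pending.put(id, result);
        // ... write {id, function, payload} to the websocket here ...
        return result;
    }

    void onResponse(long id, String payload) {
        pending.remove(id).complete(payload);
    }
}
```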
Code available on github
Compute engine based on stuff in Jim’s github repository: all open source
Nick Laiacona: any questions?
James Smith: quick demo
First, command line client
Nick Laiacona: questions?
Gregor Middell: we decomposed collation into different functions
One problem we ran into: decomposing of problem into several steps was easy, but you end up handing off a lot of data from function to function
Example: handing off document to tokenizer, then matcher, then aligner, then visualizer possibly in different countries
How do you handle the problem of having to hand off the data?
James Smith: a couple possibilities:
You could have stickiness, for example: if UVA had an installation of the switchboard and wanted to provide local versions of the tokenizer etc., then we'd want a notion of the closest implementation to the data. That doesn't solve all of the transport problems, but it helps.
When you’re building something this flexible, you’re never going to have the performance of a monolithic desktop app.
If we can get to a model where people—batch model—know it’s going to be a long-running job, fire it off and come back later.
Thinking back to when I used to use Medline—fire off a request, come back the next day and the second day for results—people were used to that. Now we have faster and faster computers. We need to get distributed stuff out there—expectations have to be reconfigured.
It’s not going to match what people experience on the desktop.
Nick Laiacona: what really kills you is the last mile on the DSL modem
If switchboard can cache the data and pass a pointer …
James Smith: working on letting client say “if I disconnect, don’t kill the job”
Nick Laiacona: otherwise you’re capped on the amount you can do …
James Smith: if you’re doing a topic map, you may want to see the data before it’s finished; build an idea of what you’re expecting it to look like
Nick Laiacona: so for Juxta: bring back up the diagram that shows the researcher, the tools, and the Corpora Space cloud
Where would Juxta fit in? A tool? But on the web
James Smith: a couple options
One: the tool could be the Javascript client
You’d be able to have a Javascript app that would handle everything
Another: the server can connect
That hasn’t been the primary model we’ve worked with; but I’m starting to think about it more
I expect it ends up being comparable to connecting to a database; do you want to be able to cache the connection?
All in the browser vs. server doing some of the computational stuff
If you wanted to interact with Corpora Space you could treat it like you would treat a database: persistent connection
Retrieve documents, put them back in
We would need to provide a way of redoing the authentication part
Client needs to be able to tell switchboard who is using it
If you’re doing server-side connection then you need to be able to redo authentication piece
Nick Laiacona: we’re not providing any authentication layer—it would be like Solr
James Smith: some things would be available without authentication, some need authentication
It could be that there are agents providing services that don’t need authentication
Gregor Middell: How would you discover documents on Bamboo?
James Smith: Plan now is that the document collection (intermediate results, working collection) is provided by an agent—set of functions provided by the agent to the environment
Gregor Middell: Did big repositories such as Hathi Trust already say which functions they would want?
James Smith: I was thinking of intermediate results; that’s something different
We haven’t provided the detail
Coming up with a standard way of addressing documents in these collections
We would provide some way of providing the semis [?] semantics, do queries, get back URLs
Nick Laiacona: [question I missed]
James Smith: if you want other tools outside Juxta to access the capabilities of Juxta, then it would be an agent
Ronald Dekker: I would think the current Juxta product would be both a client and an agent
Nick Laiacona: It would be on the other side of the switchboard
Alex Gil: when you say you can switch data from one tool to another—Juxta still hasn’t decided how it’s going to add info about collation to a database or XML file—when it sends it back to be used by another tool, how can the interpreter of that other tool guarantee that it’s going to be able to read it? How is that info going to be useful to something like semantic analysis?
James Smith: you’d be able to attach metadata about the content: mime type, etc.
The tool can say “if the metadata matches this pattern, I can use it”
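[Illustration: a hypothetical sketch of that pattern test; the metadata key and MIME value are invented.]

```java
import java.util.Map;

// Hypothetical sketch: a tool declares what it can consume and checks a
// working set's metadata before using it.
class MetadataMatcher {
    static boolean canUse(Map<String, String> metadata) {
        return "application/tei+xml".equals(metadata.get("mime"));
    }
    // e.g. canUse(Map.of("mime", "application/tei+xml")) -> true
}
```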
Alex Gil: example: say you run Juxta, it tells you differences between text A and B
Say I want to mine just the differences
Differences will have their own MIME type
They can be extracted by a query, jquery?
James Smith: yes and no
Social aspect of environment
People agreeing on standards
How granular they want to label it
Tool provides information to decide can I use it or not
I’m not trying to “overfit” right now
Alex Gil: that’s typical of workflows
Somebody wants to study just words relating to geographical locations, difference between two texts
Start with Juxta, end up with something like [?]
Dana Wheeles: Hathi Trust, ECCO, large repositories we might be able to act upon: will they live in their own places, or copied and kept in one place?
James Smith: may depend on collection
Hathi Trust interested in installing Corpora Space
Dana Wheeles: oh so there could be multiple installations
James Smith: need to work out details, but UMD could have installation of switchboard, UVA too, they might communicate with each other
You have your home institution and that’s where your result sets are
Alex Gil: that allows some institutions not to share but to have access to this
There would be a central one where you could get access to Hathi?
James Smith: yes
Alex Gil: could I install on my computer?
James Smith: yes, you’d have to have agreement to get access somewhere else
Sustainability considerations: 1. who’s paying? 2. how can we get people to play with it?
Gregor Middell: that could mean that the service would have to be written to be placed close to data
James Smith: not necessary; we can put it far from data
But university might want to put it close to their data
Fetch of data might be remote, results local
Dana Wheeles: what does that mean for updates? Proliferation of different versions?
James Smith: one way to guard against that: when interface changes you have a different namespace (like when XML schema changes)
If expectations of functions change, new namespace
Dana Wheeles: who’s the gatekeeper? Who says, “you’ve got to switch your installation”?
James Smith: no—the person who owns the switchboard will have the ability to say “this agent can or cannot connect”—e.g. through SSL certificates. That’s one way of gatekeeping. The other is that one of the expectations is that the list of functions specified by a namespace is essentially a contract. As long as this namespace is available, these functions will be here. If we fix bugs, the namespace doesn’t change. If we add functionality, we change the namespace. No way to force people to upgrade.
Dana Wheeles: what does that mean for vetting? There will be new bugs. Is there a testing time period?
James Smith: don’t see why you couldn’t have a sandbox
Dana Wheeles: working on same corpus?
James Smith: yes
Gregor Middell: you want to have reproducibility of certain results—you would need the old version of Juxta
New versions might produce different results
Nick Laiacona: gets into that provenance thing.
Question about the compute engine—the Ruby implementation had a native Utukku script. Will there be language-native APIs for other languages?
James Smith: what I’m planning on: while the engine will be at the core, where the protocol is, we’ll have a wrapper that will give a language-native way of accessing that functionality
The only time you’ll need the script is if you’re doing stuff that’s not Corpora Space
For example to access Juxta since it’s not a native part of Corpora Space
Nick Laiacona: a tool could provide its own Ruby version to make that nicer
James Smith: common and simple things should be simple
As you take advantage of power, it collects complications
Nick Laiacona: for our tool, we would integrate through Javascript/HTML—fire off different function set—grab documents, collate them; it would speak through your API
The web service would have a wrapper that would talk with your API
So there would be a transport wrapper on the web service
And there would be the user interface—what does that look like
Not just “here’s some collation result,” but what happens before that?
James Smith: Ruby library, also Javascript library—still a bit raw
The goal would be to provide libraries in the common languages people are using
If we do the ruby one, it will run as [?]
I wouldn’t mind seeing a native Java installation
Nick Laiacona: so you would need Scala [scalar? To scale it?] and JRuby? It would be the switchboard that was running those things?
James Smith: right now we have a Ruby library that would run under JRuby if you were wanting to build a Java-based client
Nick Laiacona: our system’s in Java so we would need JRuby
[Gregor asks to see some of the details]
James Smith: library doesn’t change if you load it into your client—it would all be local
Nick Laiacona: is this your own language syntax?
James Smith: yes; based on XQuery and XPath
Nick Laiacona: do you have an xpointer implementation?
James Smith: no
Came out of curating digital humanities projects
Instead of having to maintain PHP over time, we have one framework we manage and run various projects on it
It’s like having a PDF reader—instead of having to maintain PDFs, you just have the reader.
[break]
Alex Gil: run two texts—approximate string match—once the string match makes some suggestions, identify a block, then run that block through an index of the whole corpus—you don’t have to do approximate string match on whole corpus and you’re already establishing a relationship
The scholar is always reflecting what is the meaningful block
Nick Laiacona: if we had a Java API that could do that, we could plug it in
Alex Gil: I was thinking of the way Bamboo is going to start working with these chunks
Gregor Middell: have been working on Juxta for a while now
Trying to make Juxta scale to larger text tradition
Current state of affairs: original version of Juxta was working with plain text
So internal data model is geared towards plain text witnesses
Only recently (1.4) was XML support introduced
Juxta—only collation tool that supports XML
We switched to XML and data model had to be extended substantially
Internally we not only have plain text but something that resembles a DOM
Allows us to handle the tagging that was in the original witnesses
One of the main challenges: this data model is kept completely in memory
Problems …
My main job is working for a genetic edition in Germany (Goethe’s Faust)
Provide a lot of different perspectives on Faust
Juxta’s ability to be integrated in different settings is useful
Talk about Faust first, then return to Juxta
When we started 2.5 years ago, the idea was to build a very simple digital edition
Classical XML architecture
Data storage, usually natively XML (XML databases); might also have relational databases
Get connected by some kind of logic that allows you to extract information you want to view and deliver it to the client
All the technologies in this pipeline are XML or based on XML
Tree-based
That was a challenge: genetic edition
Trying to show different perspectives: document and textual oriented
In German editorial theory, the basic idea is that if you are able to give a very faithful representation of the record, communicate to the user what you found in the archive, then you have a lot more freedom in interpreting the results
This matches up with the markup perspectives on Faust
Document markup is very faithful, as close to original as possible
Then go to textual perspective, maybe change something of the findings and deliver something more readable
Textual view very ordered, strict order
On manuscript, not that linear order at all—different orientation of texts, strikethroughs that cross boundaries: must represent that in order to adhere to German editorial theory
Spend a lot of time reproducing these things in SVG
Other views:
Genetic markup
Metadata
Text/image annotation: archive of illustrations: want to link these illustrations to the text
Want automatic linkage to motifs in the text
Challenge from markup perspective: with each perspective you have a different view on the data
Example of three-line text, with middle line inserted, marked up as page zones faithful to original document; textual markup with line B inline; genetic with order of stages
Problem: how do you encode this? In XML you can encode one perspective very well, but must subordinate the other perspectives
We encode every structure separately
Different transcripts
How do we collate different encodings?
We want to use collation to merge these different views
Graph-based model of the text: not singular trees, but interrelated trees in the database
User should see not just one tree, but have links to other views
If you collate texts from different structures, you come up with correlations
We use collation internally to merge different perspectives, which allows us to change the order of things
Collation algorithm is able to handle transpositions
It’s common to see text on the document that does not belong to Faust
Document oriented view contains much more text—for example, letter written on the reverse side of a piece of paper where Goethe scribbled lines of Faust
We are interested in collation from editorial perspective: not just collating different witnesses, but solving the overlapping hierarchies problem
We want to be able to encode our findings in a way that is standards-compliant so that anybody else can see the difference and collate in isolation but also see what the collation came up with
We are also trying to influence the TEI to make the apparatus TEI-compliant
Critical Apparatus Workgroup (TEI wiki)
Silence of TEI re: things like transposition because transpositions are hard to code in XML
Identified some issues with the current critical apparatus module
Specific phenomena not covered e.g. transposition
Handling of punctuation—what is a collation supposed to do with punctuation?
Representing omissions
How does it scale? What if you do encoding with five witnesses, then discover a sixth? You have to redo encoding
TEI does not differentiate between model and representation of textual variants
There might be multiple representations of textual variants
How do you model textual variants?
Ronald was talking about pipeline process: tokenize, align, etc.
You would expect a tokenizer not just to split up into tokens but also keep markup context
You have tagging that tries to normalize tokens e.g. abbreviations
Collator might recognize that same tokens might have different meaning based on tagging context
Alignment means introducing gaps into original witnesses such that they line up correctly
You can come up with a fairly simple encoding of an alignment table
Then you can detect transpositions
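[Illustration: a hypothetical encoding of such an alignment table: one row per aligned token slot, with null marking a gap introduced into a witness.]

```java
class AlignmentTableExample {
    // Hypothetical sketch: rows line up token slots across two witnesses;
    // null marks a gap introduced so the witnesses align.
    static final String[][] TABLE = {
        // witness A   witness B
        { "the",       "the"  },
        { "quick",     null   },  // witness B omits "quick"
        { "fox",       "foxe" },  // variant spellings aligned in one slot
    };
}
```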
Alex Gil: what if all sections are jumbled in relation to each other and there are no alignments providing sequence?
Ronald Dekker: that is what Collatex does: it doesn’t look at the order; it tries to find identical blocks. When there are multiple possibilities for a block, then it looks at the order.
Gregor Middell: alignment process is normally based on sequence alignments; sequence is assumed to be inherent in the text
Alex Gil: I’m trying to challenge that
Gregor Middell: Apparatus markup works well for some purposes, but not others
You sometimes need reference back to specific witness
So we came up with new model
You want to include tokenization tagging—impose it on top of other tagging
End up with XPointer scheme: ability to point to specific words
If you don’t have a means to point directly into the witnesses, you’re lost
Could use w tag for words, line tag for granularity at line level
If tokens are addressable, then an alignment is sets of tokens that line up.
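[Illustration: a hypothetical sketch of that model: a token address is a witness plus an offset range, and an alignment is a set of token addresses that line up.]

```java
import java.util.Set;

// Hypothetical sketch: tokens addressable as (witness, start, end);
// an alignment is a set of such addresses.
record TokenRef(String witness, int start, int end) {}

class AlignmentSets {
    static final Set<TokenRef> EXAMPLE = Set.of(
        new TokenRef("A", 0, 3), // "the" in witness A
        new TokenRef("B", 0, 3)  // "the" in witness B
    );
}
```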
Alex Gil: so the tokens exist separately from the original spelling?
Gregor Middell: I think that the alignment would point to the original content. There might be some normalization but you want to point back to the original reading.
That is what you want to point the reader to.
Alex Gil: you accomplish that through IDs? XPath?
Gregor Middell: that’s what I want to get at in my presentation
I settled for an offset model
You can encode as sets: querying sets becomes easy
You can ask what other alignments are there
This would be the basic data model of textual variations: tokens that line up
Now you can extract different views of it
You can use these sets to reconstruct critical apparatus by embedding pointers back to original witness
It doesn't matter whether you resolve the references to the tokens or keep them in
I favor the embedding of pointers
That’s as far as the workgroup got
Let me get to what we would be getting in Juxta
Lack of XPointer as a language—not able to use it
Lack of DOM—models don’t allow us to express string ranges
We had to do something different
Architecture of Juxta allows pointing
Having a flat text is reasonably fast
Nice if you have a database as a back end
There was a value in not sticking with XPointer
You have a document model based on a tree
If you try to come up with a different model not based on tree, you are in the wilds
Experimental markup languages
We think of markup as something that is offset range based
Simple conceptual model: markup consists of start of text and range
Allows arbitrary overlap
You can layer arbitrary ranges
Integrating these over a flat stream of characters is easy
What we implemented in Juxta: we’re handed an XML document with a tree structure and we flatten it: transform start and end into ranges
There are ways to reconstruct trees based on flat model
Currently what we need is flat model where we have witnesses with ranges
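[Illustration: a hypothetical sketch of what flattening yields: a tree like <l>Habe nun, <hi>ach</hi>!</l> becomes flat text plus offset ranges, and nothing prevents ranges from overlapping.]

```java
import java.util.List;

// Hypothetical sketch: markup as offset ranges over a flat character stream.
record MarkupRange(int start, int end, String element) {}

class FlattenedWitness {
    static final String TEXT = "Habe nun, ach!";
    static final List<MarkupRange> RANGES = List.of(
        new MarkupRange(0, 14, "l"),  // the whole line
        new MarkupRange(10, 13, "hi") // "ach"
    );
    // Arbitrary overlap is representable: any range may cross any other.
}
```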
Finish with a short demo
Want to be able to upload an XML document and it gets transformed into range-based model
You want an annotation for every single token
A tokenizer introduces additional annotations; doesn’t have to worry about existing markup; it’s been flattened
Alignment process would query different ranges: just give me all the ranges that constitute tokens
Aligner would align the text, create an alignment table, find alignments and put them back into textual repository
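[Illustration: a hypothetical tokenizer over the flat model; it simply adds token ranges on top of the character stream and never consults the original tree.]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: tokenization adds ranges over the flat text; the
// existing markup, already flattened to ranges, can be ignored entirely.
class Tokenizer {
    record Token(int start, int end) {}

    static List<Token> tokenize(String flatText) {
        List<Token> tokens = new ArrayList<>();
        Matcher m = Pattern.compile("\\S+").matcher(flatText);
        while (m.find()) {
            tokens.add(new Token(m.start(), m.end()));
        }
        return tokens;
    }
}
```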
What we have in Juxta right now is annotation model bound to RDB
Instead of embedded Java database we could use MySQL or other database for scale
This is all currently internal to Juxta
I am working on a RESTful service that would have this functionality
[gives a demo]
Dana Wheeles: how does it handle really long texts?
Gregor Middell: text repository is completely stream based; no predetermined maximum
Parsing goes in stream-based fashion
Does not hold in memory, no memory constraint
One problem we have: diff algorithm or Juxta’s algorithm: for this to work, it still loads the document into memory because it wants to find all the alignments
I think we can segment larger witnesses beforehand
Juxta filters out the transposition, then drills down into the transposition and does recursive collation
If you generalize this concept and think of these segments as fragments or witnesses, then you have a solution to scalability
While declaring transpositions, you are fragmenting the witness set
We could design something that imposes a maximum witness size and requires you to fragment your witnesses or does approximate matching as preprocessing
Alex Gil: that’s the solution to my problem: reduces computation time, and you get a result that can be used in more ways
Gregor Middell: the downside of this pre-fragmentation: you lose segments
Dana Wheeles: a possible solution would be other visualizations: we need to have global view, find sites with a lot of changes
But we also need to deliver close map of changes in local area
How do we deliver both?
Maybe we can brainstorm how to make those visualizations without taxing the service
Alex Gil: maybe the time-consuming process could run separately
Interesting visualizations can come out of larger view
You can do both
Gregor Middell: conceptually, you would have to do both
We are thinking about how to constrain algorithm to make it predictable in terms of resource requirements
Example of large text that choked Collatex: had to allocate 2 gigabytes of memory; can’t do that in all cases
Have to make a statement about what the user can expect/do
Nick Laiacona: if the cost was not an issue: we can collate documents up to X size immediately or larger documents if you come back later
Alex Gil: access to a grid? So you don’t have to do it on the server machine
Gregor Middell: grid technology would help
…
Nick Laiacona: agenda: brainstorming sessions and hacking
A number of technical problems on the table that we could spend our time on
Maybe we could spend a little time discussing what to discuss
What I’m hoping to get out of this for Juxta:
Two main pieces:
Web service model: we need to work out what the protocols are going to be
Pragmatic piece: Lou and Gregor have been working on different branches; we need to get together on that
Collatex has a working web service; we have the structures for Bamboo; it would be interesting to see if you could hack some stuff together
Other people’s ideas of things we could achieve?
Dana Wheeles: eager to find out more about how we can think about a web service in terms of Corpora Space
Andrew Stauffer: easy to figure out with quick back-and-forthing
Lou Foster: I’ve done a prototype just like Gregor’s presentation
Alex Gil: want more explanation of the range
Nick Laiacona: range offsets have implications for a lot of stuff beyond Juxta
Alex Gil: it sounds like if you flatten out, you can have all kinds of stuff later
Gregor Middell: all the ranges over the text still exhibit the properties of the XML elements
Alex Gil: I’m worried more about how this is going to look and what it’s going to do for users than what it looks like in the back
When I'm doing an HTML visualization of a text, at one level I'm showing Juxta stuff, at another level I'm showing semantically meaningful stuff like geographic places; when I switch between two copies they're still in the same space, so it looks like I'm getting the same text. But with this you can have the one text and …
Gregor Middell: that’s the aim
Alex Gil: but you can do more than that; you can introduce texts within texts—some of the text remains the same, the annotation stays the same; if you were to introduce an annotation bubble, you don't mess with the rest: you couldn't do that with the two texts
I want to hear more about how this works
Nick Laiacona: one brainstorming session could be talking more about offset range stuff
Dana Wheeles: plan to have lunch, then schedule exact working groups
Do we want full working groups or small groups?
Make agenda before lunch gets here
Offset ranges; web services and pipelines
Nick Laiacona: Juxta and Collatex are both going to be on the Gothenburg model …
Ron + Jim work together
Andrew Stauffer: makes sense to have Lou, Ron and Jim together
Gregor should be in every group, but he’s here longer
Nick Laiacona: maybe we should work on resolving two branches when we have more time after the workshop
Maybe Lou, Gregor, and I could work on the implications of ranges