
Identical names for Instances of different types in the same Model. Bug or feature? #781

Open
RS-Credentive opened this issue Oct 26, 2024 · 7 comments
Labels
question Further information is requested

Comments

@RS-Credentive
Contributor

Please see this issue from the OSCAL Repo

The OSCAL metadata metaschema defines both a field named "location-uuid" and a flag named "location-uuid". I get that a flag can reference an externally defined flag, and a field can reference an externally defined field, so the names could be the same, but this introduces a good amount of complexity: namespaces are needed to disambiguate between globals of different types with the same name. Is this kind of name conflict intentional? Is it possible to require unique names for instances in the same model, at least for top-level globals?

@RS-Credentive RS-Credentive added the question Further information is requested label Oct 26, 2024
@iMichaela
Collaborator

@RS-Credentive - Thank you for your question. The code you point to was inserted 3 years 7 months ago by a team member who is no longer with us. I will need to research it with @wendellpiez before addressing it.

@wendellpiez
Collaborator

@RS-Credentive awesome question - I wish I had a better answer. The capability to overload names like this, for better or worse, is a feature of Metaschema. Keep in mind that not only can you have different types going by the same name - sometimes a locally defined type (i.e. a 'type in context') can be different from a global type of the same name. Etc.

Were we coming at the same problem again today I might have argued harder for many more constraints over naming. But the capabilities offered both by local definitions (which were not in the original Metaschema v0.1) and by use-name (also a feature added at others' request) are real ones and arguably useful to modelers and (by extension, or at least when done right) to users.

I'm glad to address this further in a broader Metaschema context, as well.

@wendellpiez
Collaborator

BTW take note that renaming the underlying structures while deploying use-name might ease this problem for OSCAL tool-builders even if it does not change the external representation (tagging) deployed in OSCAL itself - so data would be tagged the same while the tool is able to 'cheat' with respect to the type assignment. Would that help with your problem, @RS-Credentive (as OSCAL tool-maker, not as Metaschema developer)? Or would it just move the problem (from type disambiguation to use-name)?

In any case it's something to think about for an improved OSCAL and could probably even be done in the metaschema sources with backward compatibility wrt data.

@RS-Credentive
Contributor Author

Let me share my perspective on this question since I'm approaching it from a different angle.

I see documents conforming to Metaschema models as the output of a process that incorporates data sourced from outside the model or produced by processes outside the model.

  • An application may produce an OSCAL SSP from an internal system configuration database
  • An OSCAL audit report may be produced by a unit testing framework that executes code tests and generates the AR as an output

In these cases, the elements of the Metaschema documents will be built from the bottom up, not the top down. The data is already there, or is coming from somewhere else, and we need to translate it into a structure that can be represented using one of the Metaschema encodings. The tools generating the structure will know about a "location", but people writing the tools shouldn't really have to care about fields or flags. It's just a piece of data that is related to another piece of data in a hierarchical structure. If the software composes the hierarchy correctly, a Metaschema-aware library can validate it and produce the expected output in the appropriate encoding.

Now consider the case of the "location-uuid". When I construct a "location-uuid" for inclusion in OSCAL metadata, I need to know whether it is a "flag flavored" or "field flavored" uuid. Of course, my application won't care about a flag or field, but presumably, some assemblies will want a "field flavored" uuid, and some fields might want a "flag flavored" uuid.

Here are the definitions that got me all worked up, by the way:

    <define-flag name="location-uuid" as-type="uuid">
        <formal-name>Location Universally Unique Identifier Reference</formal-name>
        <description>Reference to a location by UUID.</description>
        <prop name="value-type" value="identifier-reference"/>
        <prop name="identifier-type" value="machine-oriented"/>
        <prop name="identifier-scope" value="cross-instance"/>
        <constraint>
            <index-has-key name="index-metadata-location-uuid">
                <key-field target="."/>
            </index-has-key>
        </constraint>
    </define-flag>

    <define-field name="location-uuid" as-type="uuid">
        <formal-name>Location Universally Unique Identifier Reference</formal-name>
        <description>Reference to a location by UUID.</description>
        <prop name="value-type" value="identifier-reference"/>
        <prop name="identifier-type" value="machine-oriented"/>
        <prop name="identifier-scope" value="cross-instance"/>
        <constraint>
            <index-has-key name="index-metadata-location-uuid" target=".">
                <key-field target="."/>
            </index-has-key>
        </constraint>
    </define-field>

I need to distinguish between the flag flavor and the field flavor not because there's any difference in the data, but because I have to encode them differently in XML. That is absolutely not the point of Metaschema. A flag should be a flag because you don't need to attach any metadata to it, and a field should be a field because you might need to attach some metadata to it.
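To make the encoding difference concrete, here is a minimal sketch using Python's standard library. It assumes the usual Metaschema XML binding (a flag becomes an attribute, a field becomes a child element); the parent name `holder` and the UUID value are made up for illustration, and the JSON line is a simplification that ignores grouping.

```python
import json
import xml.etree.ElementTree as ET

UUID = "2b4b1ef4-0b7c-4c6a-9d5e-0c1f3a7e9a10"  # made-up value for illustration

# Flag flavor: in XML, a flag is serialized as an attribute of its parent.
as_flag = ET.Element("holder", {"location-uuid": UUID})

# Field flavor: in XML, a field is serialized as a child element.
as_field = ET.Element("holder")
ET.SubElement(as_field, "location-uuid").text = UUID

flag_xml = ET.tostring(as_flag, encoding="unicode")
field_xml = ET.tostring(as_field, encoding="unicode")
print(flag_xml)
print(field_xml)

# Simplified JSON view: either flavor can end up as the same plain property,
# which is why the distinction matters mostly to XML-facing code.
as_json = json.dumps({"holder": {"location-uuid": UUID}})
print(as_json)
```

The same piece of data produces two different XML shapes, so a library has to know which flavor it holds before it can serialize.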

The distinction between field-flavored and flag-flavored locations doesn't convey any useful information, but tracking the flavor of data will add a lot of complexity to the wrong layers. The fact that there are two different top-level, global instances in the same schema with the same name means that libraries, and thus application authors, will have to keep track of which flavor they've got and which flavor they need.

I would propose the following rule, enforced technically if possible but with strong guidance if not:
Within a given scope, all elements must have a unique "effective name". This allows definitions to be redefined and overridden elsewhere in a specification but means that the instances in the same place in a model can be unambiguously identified. You could use "use-name" to distinguish between two elements with the same name, but if you have two instances with the same name as children of the same parent, you probably need to go look at your data model again. If you need metadata about an instance sometimes but not always, use a field with optional attributes. If you never need metadata, use a flag. If you think you need both, go look at the data again because you're probably wrong.
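A rough sketch of how that rule could be checked for the children of one parent follows. The `Instance` type and `effective_name_clashes` function are hypothetical names for illustration only, not part of any Metaschema library.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class Instance(NamedTuple):
    kind: str                       # "flag", "field", or "assembly"
    name: str                       # name of the referenced definition
    use_name: Optional[str] = None  # optional override of the name in data

    @property
    def effective_name(self) -> str:
        # use-name wins when present; otherwise the definition name is used.
        return self.use_name or self.name


def effective_name_clashes(children):
    """Return {effective name: [instances]} for names used more than once."""
    seen = defaultdict(list)
    for inst in children:
        seen[inst.effective_name].append(inst)
    return {name: insts for name, insts in seen.items() if len(insts) > 1}


clashing = [Instance("flag", "location-uuid"), Instance("field", "location-uuid")]
renamed = [
    Instance("flag", "location-uuid"),
    Instance("field", "location-uuid", use_name="location-ref"),
]
print(sorted(effective_name_clashes(clashing)))  # ['location-uuid']
print(effective_name_clashes(renamed))           # {}
```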

I see this being the same as an import cycle. Technically, we can't design a way to forbid import cycles in metaschema itself, but implementations are required to nope out immediately if one is detected. Similarly, even if we can't design a constraint or rule in the XSD that forbids instances of different types with the same names or instances of the same type and name with different definitions, it should be considered bad style.
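For comparison, the import-cycle check mentioned above is a standard depth-first search. This is a generic sketch, not taken from any Metaschema implementation; the module names and the `find_import_cycle` helper are invented for the example.

```python
def find_import_cycle(imports):
    """imports maps a module name to the list of modules it imports.

    Returns one cycle as a list of names (first == last), or None.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state = {}
    stack = []

    def visit(node):
        state[node] = GREY
        stack.append(node)
        for dep in imports.get(node, []):
            dep_state = state.get(dep, WHITE)
            if dep_state == GREY:
                # dep is already on the current path, so the path loops back.
                return stack[stack.index(dep):] + [dep]
            if dep_state == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        stack.pop()
        state[node] = BLACK
        return None

    for node in list(imports):
        if state.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


print(find_import_cycle({"a": ["b"], "b": ["c"], "c": ["a"]}))  # ['a', 'b', 'c', 'a']
print(find_import_cycle({"a": ["b", "c"], "b": ["c"], "c": []}))  # None
```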

@wendellpiez
Collaborator

wendellpiez commented Oct 30, 2024

Thanks for putting this on the record @RS-Credentive, you make a good case.

I might make rules for a next-generation metaschema even tighter than this ... if it were up to me. As @iMichaela hinted above, the current design bears the marks of the process of evolution that produced it: it is not without flaws, both acknowledged and unknown. (And having them exposed for remediation is better than just suffering with them.)

In general I would also offer that the difference between a field and a flag is much more consequential in XML-flavored Metaschema applications than in JSON-flavored applications. Both aesthetically and with respect to affordances (in the model), "element or attribute" is a difference that can make a difference.

Indeed as a data modeler, I would offer that (a) if you only care about JSON never XML, and (b) you never have mixed content (i.e. Markdown-y) data values including insert markers, maybe you never want fields at all. (Just use more assemblies with more flags.) Of course OSCAL has both XML and mixed content within view.

In any case, at a metaschema-redesign table, I would see you and raise you a bunch -- no one has asked but I would actually like to think about doing away with all local definitions and all overloading, with only use-name as an escape hatch. Not in the Metaschema feature set necessarily (I can be convinced they are needed there) but in a public-facing application such as OSCAL.

For the record, a rule such as what you propose could indeed be enforced with the help of a query over a set of metaschema documents (however defined).
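As an illustration of what such a query might look like, the sketch below scans the top-level definitions of a set of metaschema documents and reports names defined under more than one kind. It is an assumption-laden simplification: it ignores imports, scoping and local definitions, matches elements by local name only (so the namespace URI in the sample is not relied on), and `overloaded_global_names` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

DEFINE_KINDS = {"define-flag", "define-field", "define-assembly"}


def local_name(tag):
    # ElementTree spells namespaced tags as "{uri}local"; keep only "local".
    return tag.rsplit("}", 1)[-1]


def overloaded_global_names(documents):
    """Map each name defined under several kinds to the sorted list of kinds."""
    kinds_by_name = defaultdict(set)
    for text in documents:
        root = ET.fromstring(text)
        for child in root:  # direct children only, i.e. global definitions
            kind = local_name(child.tag)
            if kind in DEFINE_KINDS and "name" in child.attrib:
                kinds_by_name[child.attrib["name"]].add(kind)
    return {
        name: sorted(kinds)
        for name, kinds in kinds_by_name.items()
        if len(kinds) > 1
    }


sample = """<METASCHEMA xmlns="http://csrc.nist.gov/ns/oscal/metaschema/1.0">
  <define-flag name="location-uuid" as-type="uuid"/>
  <define-field name="location-uuid" as-type="uuid"/>
  <define-field name="title" as-type="markup-line"/>
</METASCHEMA>"""

print(overloaded_global_names([sample]))
# {'location-uuid': ['define-field', 'define-flag']}
```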

However, I am not sure I would even go that far, or not yet. Is there a concrete recommendation that could be made for this case only? For example to rename the field to say -field, and use use-name there to see to it that data has the old name? (This is what we might have originally done, but we may not have had use-name at that point.) Would that be an OSCAL proposal to be tracked further in usnistgov/OSCAL#2056 ?
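A small sketch of the rename idea, under two assumptions: that `location-uuid-field` is one reading of "rename the field to say -field", and that `use-name` appears as a child element of the definition. The internal definition name changes while the name seen in data does not; `effective_name` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def effective_name(definition):
    """Name used in data: the use-name child if present, else the name attribute."""
    for child in definition:
        if child.tag.rsplit("}", 1)[-1] == "use-name" and child.text:
            return child.text.strip()
    return definition.attrib["name"]


# Hypothetical renamed definition; only the parts relevant here are shown.
renamed = ET.fromstring(
    '<define-field name="location-uuid-field" as-type="uuid">'
    "<use-name>location-uuid</use-name>"
    "</define-field>"
)

print(renamed.attrib["name"])   # location-uuid-field
print(effective_name(renamed))  # location-uuid
```

A tool can then key its types on the definition name, while serialized documents keep the existing `location-uuid` tagging.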

In any case the problem goes a little beyond discerning and enforcing the right kinds of consistency. We need a development model (dare I say a spiral) capable of implementing and demonstrating these ideas through actual testing. Nonetheless the input is valuable and demonstrates the need.

@RS-Credentive
Contributor Author

@wendellpiez, thanks for the thoughtful reply. Regarding my concrete recommendation for the case at hand, it looks like "location-uuid" is only referenced as a field in the particular specification we're talking about, so it could be as simple as eliminating the redundant "flag flavored" definition of location-uuid. I'll make that recommendation over on the OSCAL issue.

It's possible that a metaschema could be as expressive without fields, and maybe that simplification would be good, but I think one of the benefits of Metaschema is its encoding independence, so the importance of XML vs JSON/YAML shouldn't really be a factor. If XML is really, really important to your application, you should use one of the XML schema specifications. Am I mistaken to think that the question of how to model the data should be distinct from the question of encoding the representation of the data? If there's truly no difference between a field and an assembly from a pure data modelling perspective, then I guess Metaschema 2 should just pick one of them and go with it :)

@wendellpiez
Collaborator

wendellpiez commented Oct 31, 2024

@RS-Credentive thanks, perfect.

As to the larger question, I'd probably agree with you except for the perturbing fact that there is a great deal of information in the world (let's not call it 'data') that is not yet encoded, not yet machine-readable, not yet processable using any model. That sounds grand, but it's very mundane. We are not starting with RDBMS but with something closer to PDF (but not even).

(Once upon a time I wrote a paper on this topic, here: https://balisage.net/Proceedings/vol21/html/Piez01/BalisageVol21-Piez01.html - for reading while waiting for a bus?)

It's not JSON's fault that it has no native constructs for what's called 'mixed content' - to the point that in actual systems (such as the one where I'm typing) we end up folding in Markdown (urp!) to achieve some meager machine readability.

In Metaschema, the entire modeling problem there is tucked away into the markup-line and markup-multiline data types. Essentially hidden from view. This sleight-of-hand seems to be Good Enough for at least 95% of the needs we have down there - which, to be sure, are more in the catalog model than elsewhere. But just listen to people scream if you try to take their italics away. Or, for that matter, their bulleted lists or inline cross-references.

Accordingly, it leaves you and devil's-advocate-me to address the question: what about the markup-line and markup-multiline datatypes? They are exceedingly useful for helping to bridge between markup-based and object-notation based representations, and essential in catalogs we see in the wild (which have inline cross-references, insertion points and links among other inline features we need to be 'live'). But they are only allowed on fields - due to the XML serialization hook whose value and necessity you question.

This could undoubtedly be designed differently, the point here being not that we have found the best balance, but rather that a balance must be struck, and that's where we put it. I actually think fields are useful in other ways as well, conceptually. And without a better solution for full-text data capture of catalogs, the markup-line and markup-multiline dedicated datatypes have stood up pretty well. (FWIW to consider them datatypes instead of a special kind of node was one of the many important contributions of @david-waltermire.)

As for the advice that if XML is so important, we should just be using an XML technology ... you sound like me now. :-) From an XML point of view, Metaschema is really just a set of rules for keeping our hands tied behind our backs: since the JSON people don't have hands, we can only play football with them (no hands), not handball (or 'football' with throws, catches and grappling).

The 'hands' in this analogy is a concept of mixed content, whether mixed element content such as (HTML) p, p, ul, p, table, p, or mixed text content such as text, a, text, insert, text.
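To show what "mixed text content" looks like to a parser, here is a minimal sketch using Python's ElementTree. The sample sentence and attribute values are invented for illustration; the point is only that text and elements interleave as siblings, which JSON has no native construct for.

```python
import xml.etree.ElementTree as ET

# Invented example of inline markup: text, insert, text, a, text.
p = ET.fromstring(
    '<p>Rotate keys every <insert type="param" id-ref="p1"/> days, '
    'per <a href="#pol">policy</a>.</p>'
)


def sequence(elem):
    """List the interleaved text runs and child elements in document order."""
    items = []
    if elem.text:
        items.append(("text", elem.text))
    for child in elem:
        items.append(("element", child.tag))
        if child.tail:  # text that follows the child, before the next sibling
            items.append(("text", child.tail))
    return items


for kind, value in sequence(p):
    print(kind, repr(value))
```

An object notation has to flatten this sequence somehow, for example into a single Markdown string, which is roughly the job the markup-line and markup-multiline datatypes do.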

3 participants