Date issued: | 2008-02-07 |
---|---|
Status of document: | Proposed Informational Note. This document is in the midst of open review by the community. |
This version: | http://www.language-archives.org/NOTE/metrics-20080207.html |
Latest version: | http://www.language-archives.org/NOTE/metrics.html |
Previous version: | http://www.language-archives.org/NOTE/metrics-20071117.html |
Abstract: |
Explains the metrics that are implemented on the OLAC web site for summarizing the coverage of the participating archives and for evaluating the quality of their metadata records. |
Editors: |
|
Changes since previous version: |
In section 2, removed the proposal to add a "not_applicable" term to the Linguistic Data Type vocabulary in favor of changing to a Language Resource Type vocabulary in the next round of metadata development. Added section 3 describing all the other metrics beyond the quality score. |
Copyright © 2008 Gary Simons (SIL International and Graduate Institute of Applied Linguistics). This material may be distributed and repurposed subject to the terms and conditions set forth in the Creative Commons Attribution-ShareAlike 2.5 License.
The vision of OLAC is that "any user on the Internet should be able to go to a single gateway to find all the language resources available at all participating institutions" (see vision statement in [OLAC-Process]). The ability of a user to discover any relevant language resource is dependent on the quality of the metadata that describe it. Ensuring quality through peer review is a core value that OLAC employs to achieve its vision. "OLAC also conducts automated review based on peer consensus regarding best practice" (see core value statements in [OLAC-Process]).
Section 2 of this note explains the automated system that is implemented on the OLAC web site for evaluating the quality of metadata records. Section 3 explains the other metrics that are reported in the [OLAC-Metrics] reports to support comparison of size and coverage of collections in addition to aspects of metadata quality and usage.
The peer consensus regarding best practice is expressed in [OLAC-BPR] and further elucidated in [OLAC-Usage]. Many of the best practice recommendations for resource description cannot be automatically checked for conformance; however, there are many that can be. As an aid to creating descriptive metadata that meet the latter set of recommendations, OLAC has implemented an automated metadata quality score. Each metadata record receives a score in the range of 0 to 10 based on the presence or absence of recommended practices. In some contexts the score is reported as a number in this range; in others it is summarized graphically as a rating of 1 to 5 stars: a score of 9 or higher is reported as 5 stars, a score of at least 7 but below 9 is reported as 4 stars, and so on.
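The mapping from score to stars can be sketched in Python as follows. This is an illustration only: the function name `stars` is hypothetical, and the cut-offs below 7 are an assumed continuation of the two-point step, since only the top two bands are stated above.

```python
def stars(score: float) -> int:
    """Map a 0-10 metadata quality score to a 1-5 star rating."""
    if not 0 <= score <= 10:
        raise ValueError("score must be in the range 0 to 10")
    # The thresholds 9 and 7 come from the text; 5 and 3 are ASSUMED
    # by extending the same two-point step ("and so on").
    for threshold, rating in ((9, 5), (7, 4), (5, 3), (3, 2)):
        if score >= threshold:
            return rating
    return 1
```

Under these assumptions, a score of exactly 9 earns 5 stars, while 8.9 earns 4.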
The practices in focus for the evaluation of metadata quality are ones that contribute to resource discovery. The score has two major parts: 50% is based on the metadata elements that are present and 50% is based on the use of encoding schemes. The elements provide the breadth and depth of the description, while the encoding schemes provide precision for interoperable searching.
The element part of the score consists of 4 points, one for each of four basic metadata elements that must be present to give the record minimal breadth of coverage, plus a further point awarded for additional elements that add to the depth of description. In the descriptions below, a non-empty metadata element is one that supplies a value, whether through element content or through the olac:code attribute. The element-based components of the score are awarded as follows:
- Title
One point is awarded for the presence of a non-empty Title element. Absence of a title that is inherent to the resource does not block achieving this point, since in that case it is recommended best practice for the cataloger to supply a descriptive title enclosed in square brackets.
- Date
One point is awarded for the presence of at least one non-empty Date element (or any of its refinements). Absence of a date in the resource itself does not block achieving this point, since in that case it is recommended best practice for the cataloger to supply an estimated date enclosed in square brackets.
- Agent (Contributor, Creator, or Publisher)
One point is awarded for the presence of at least one non-empty element that provides an indication of who is behind the resource, whether as Contributor or Creator or Publisher.
- About (Subject, Description, or Coverage)
One point is awarded for the presence of at least one non-empty element that provides an indication of what the resource is about, whether Subject or Description or Coverage (or any refinement of the latter two).
- Depth
One-sixth point (up to a maximum of one point) is awarded for each element that is present in addition to the 8 that must be present in order to receive the 4 points above for basic elements and the 4 points that follow for basic encoding schemes. If the record has fewer than 8 elements, this part of the score is 0; otherwise, it is (total elements - 8) / 6 or 1, whichever is less. Note that in order to get the full score on this point, a record must contain at least 14 elements.
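As an illustration only, the element part of the score can be sketched in Python as follows. The `Element` record, the function names, and the sets of element and refinement names are assumptions made for this sketch; it is not the implementation used on the OLAC web site. The sketch also assumes that only non-empty elements count toward the total used for Depth.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical in-memory form of one metadata element."""
    name: str          # local name, e.g. "title", "created", "subject"
    content: str = ""  # element content
    code: str = ""     # value of the olac:code attribute, if any


# Element names (with dcterms refinements) that satisfy each basic point.
DATE = {"date", "created", "valid", "available", "issued", "modified",
        "dateAccepted", "dateCopyrighted", "dateSubmitted"}
AGENT = {"contributor", "creator", "publisher"}
ABOUT = {"subject", "description", "abstract", "tableOfContents",
         "coverage", "spatial", "temporal"}


def is_non_empty(element: Element) -> bool:
    """An element is non-empty if it has content or an olac:code value."""
    return bool(element.content.strip() or element.code.strip())


def element_score(record: list[Element]) -> float:
    """Return the element part of the quality score (0 to 5)."""
    non_empty = [e for e in record if is_non_empty(e)]
    present = {e.name for e in non_empty}
    basic = sum([
        "title" in present,     # Title
        bool(present & DATE),   # Date
        bool(present & AGENT),  # Agent
        bool(present & ABOUT),  # About
    ])
    # Depth: one-sixth point per element beyond 8, capped at one point.
    depth = min(max(len(non_empty) - 8, 0) / 6, 1.0)
    return basic + depth
```

With this sketch, a record holding only a title, a date, a creator, and a subject scores 4.0 on the element part; the Depth point starts to accrue only from the ninth element onward.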
The encoding scheme part of the score consists of 4 points, one for each of four basic element-plus-scheme pairs that must be present to support high recall and precision in searches for language resources. A further point is awarded for additional use of encoding schemes that add to the precision of resource description. The scheme-based components of the score are awarded as follows:
- Content Language
One point is awarded for the presence of at least one Language element that uses the olac:language encoding scheme [OLAC-Language] to precisely identify the language of content of the resource. Absence of any natural language content in a resource (such as in a software tool) does not block achieving this point, since in that case the recommended best practice is to use the ISO 639-3 code zxx meaning "No linguistic content."
- Linguistic Type
One point is awarded for the presence of at least one Type element that uses the olac:linguistic-type encoding scheme [OLAC-Type] to precisely identify the type of the resource from a linguistic point of view. Such a metadata element is relevant to the majority of OLAC records, but not to all. The remedy that has been identified is to extend the Linguistic Data Type vocabulary to a generally applicable Language Resource Type vocabulary that will be relevant to all OLAC records. Until the work is done to redefine the vocabulary, records for which Linguistic Data Type is not relevant will not earn this point.
- Subject Language
One point is awarded for appropriate use of the olac:language encoding scheme [OLAC-Language] with the Subject element to precisely identify the language that the resource is about. The notion of subject language is not relevant to every language resource. When the linguistic type of a resource is "primary_text" it is not required to have a subject language, and this point is awarded automatically. (Until the problem mentioned above under Linguistic Type is solved by creating a more general Language Resource Type vocabulary, the point will also be awarded automatically when there is no instance of olac:linguistic-type. This means that a resource other than a primary text for which subject language is truly not applicable will lose the point for Linguistic Type, but not be doubly penalized in the point for Subject Language.) When the linguistic type has any other value, there must be at least one Subject element using the olac:language encoding scheme in order to earn this point.
- DCMI Type
One point is awarded for the presence of at least one Type element that uses the dcterms:DCMIType encoding scheme [DCMI-Type] to identify the generic type of the resource. The vocabulary is designed to be applicable to any resource and this is considered mandatory for OLAC metadata in order to support reliable searching for resources by type (such as audio recordings versus video recordings versus textual data versus software).
- Precision
One-third point (up to a maximum of one point) is awarded for each additional encoding scheme that is used in the metadata record. Thus in order to earn full points, a record must use at least three encoding schemes in addition to olac:language, olac:linguistic-type, and dcterms:DCMIType.
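The scheme-based rules above can be sketched in Python as follows. This is a hedged illustration, not the OLAC implementation: the `Element` tuple and function name are hypothetical. Two interpretations are assumed: the Precision point counts distinct encoding schemes other than the three basic ones, and a record with several linguistic types is treated as a primary text if any one of them is "primary_text".

```python
from typing import NamedTuple


class Element(NamedTuple):
    """Hypothetical in-memory form of one metadata element."""
    name: str         # local name, e.g. "language", "type", "subject"
    scheme: str = ""  # value of xsi:type, e.g. "olac:language"
    code: str = ""    # value of the olac:code attribute, if any


BASIC_SCHEMES = {"olac:language", "olac:linguistic-type", "dcterms:DCMIType"}


def scheme_score(record: list[Element]) -> float:
    """Return the encoding scheme part of the quality score (0 to 5)."""
    pairs = {(e.name, e.scheme) for e in record}
    linguistic_types = {e.code for e in record
                        if (e.name, e.scheme) == ("type", "olac:linguistic-type")}

    content_language = ("language", "olac:language") in pairs
    linguistic_type = bool(linguistic_types)
    # Awarded automatically for primary texts and when no linguistic
    # type is given; otherwise a subject language is required.
    subject_language = (not linguistic_types
                        or "primary_text" in linguistic_types
                        or ("subject", "olac:language") in pairs)
    dcmi_type = ("type", "dcterms:DCMIType") in pairs

    # Precision: one-third point per additional distinct scheme (ASSUMED
    # reading), capped at one point.
    extra = {e.scheme for e in record if e.scheme} - BASIC_SCHEMES
    precision = min(len(extra), 3) / 3

    return (content_language + linguistic_type + subject_language
            + dcmi_type + precision)
```

Note the effect of the automatic award: a record with no olac:linguistic-type at all loses the Linguistic Type point but keeps the Subject Language point.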
The free-standing metadata service [OLAC-Free] can be used to see what quality score will be awarded to a given OLAC metadata record. The XML encoding of a record is pasted into a submission form. The service then validates the record, and if it is valid, a report of its quality score is generated with comments on what must be done to raise the score to 10. The same quality analysis is shown for a sample record from each participating archive by following the "Sample Record" link on the [OLAC-Archives] page.
The average quality score for all the records provided by a given participating archive can be seen by following the "Metrics" link on the [OLAC-Archives] page. The metrics report also shows the breakdown across the collection of all the components that go into the quality score.
The [OLAC-Metrics] page reports a set of metrics that summarize the size and coverage of each participating archive as well as the quality of its metadata records. The "OLAC Archive Metrics" tab reports the metrics for the participating archive that has been selected from the drop-down list. The "Comparative Archive Metrics" tab shows the summary statistics for all participating archives in a single table. When first opened, the rows of the table are in alphabetical order of the archive names. The rows can be reordered to reflect their rank with respect to a particular metric by clicking in the column header for that metric. Clicking again reverses the order.
When "ALL ARCHIVES" is selected, the Summary Statistics table begins with the following three metrics that apply only to the OLAC catalog as a whole; when an individual archive is selected, these metrics are absent.
- Number of Archives
The total number of metadata repositories that are currently being harvested by the OLAC aggregator. A complete enumeration of the participating archives is given on the [OLAC-Archives] page.
- Archives with Fresh Metadata
The number (and percentage) of participating archives that have updated their metadata repositories within the past twelve months.
- Archives with Five-star Metadata
The number (and percentage) of participating archives for which the average metadata quality score is 9 or higher (see The quality score).
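A minimal sketch of how these three catalog-wide figures might be computed is given below. The `Archive` tuple and the function names are hypothetical, and two details are assumptions: "the past twelve months" is taken as the same calendar day one year earlier, and an update on exactly that day counts as fresh.

```python
from datetime import date
from typing import NamedTuple


class Archive(NamedTuple):
    """Hypothetical per-archive summary."""
    name: str
    last_update: date
    average_score: float


def one_year_before(day: date) -> date:
    """Return the same calendar day one year earlier."""
    try:
        return day.replace(year=day.year - 1)
    except ValueError:  # 29 February has no counterpart in the prior year
        return day.replace(year=day.year - 1, day=28)


def catalog_summary(archives: list[Archive], today: date) -> dict:
    """Count archives, fresh archives, and five-star archives."""
    total = len(archives)
    cutoff = one_year_before(today)
    fresh = sum(1 for a in archives if a.last_update >= cutoff)
    five_star = sum(1 for a in archives if a.average_score >= 9)

    def percent(count: int) -> float:
        return 100 * count / total if total else 0.0

    return {
        "archives": total,
        "fresh": (fresh, percent(fresh)),
        "five_star": (five_star, percent(five_star)),
    }
```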
The following metrics summarize the size and coverage of the selected archive (or of all archives when that is selected):
- Number of Resources
The total number of metadata records in the repository of the selected archive.
- Number of Resources Online
The number of records from the selected archive describing resources that are accessible online; that is, they have an Identifier element whose value is a URL beginning with http:, https:, or ftp:.
- Distinct Languages
The number of distinct languages that are covered within the selected archive's collection; that is, the number of distinct code values that are used from the olac:language encoding scheme [OLAC-Language], whether with the Language element or the Subject element.
- Distinct Linguistic Subfields
The number of distinct linguistic subfields that occur as subject classifications within the selected archive's collection; that is, the number of distinct code values that are used from the olac:linguistic-field encoding scheme [OLAC-Field].
- Distinct Linguistic Types
The number of distinct linguistic data types (e.g. primary_text versus lexicon versus language_description) that occur within the selected archive's collection; that is, the number of distinct code values that are used from the olac:linguistic-type encoding scheme [OLAC-Type].
- Distinct DCMI Types
The number of distinct DCMI resource types (e.g. Text, Sound, MovingImage, Software, and so on) that occur within the selected archive's collection; that is, the number of distinct values that are used from the dcterms:DCMIType encoding scheme [DCMI-Type].
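Several of the size and coverage metrics above reduce to a prefix test or a count of distinct codes. The following Python sketch illustrates this under stated assumptions: the `Element` tuple and function names are hypothetical, and the URL prefix test is made case-insensitive, which the text does not specify.

```python
from typing import NamedTuple


class Element(NamedTuple):
    """Hypothetical in-memory form of one metadata element."""
    name: str          # local name, e.g. "identifier", "language"
    content: str = ""  # element content
    scheme: str = ""   # value of xsi:type, if any
    code: str = ""     # value of the olac:code attribute, if any


ONLINE_PREFIXES = ("http:", "https:", "ftp:")


def is_online(record: list[Element]) -> bool:
    """True if any Identifier value is a URL with one of the listed prefixes."""
    return any(
        e.name == "identifier"
        and e.content.strip().lower().startswith(ONLINE_PREFIXES)
        for e in record
    )


def distinct_codes(records, scheme: str, names: set[str]) -> set[str]:
    """Collect the distinct codes used from one scheme on the named elements."""
    return {e.code for record in records for e in record
            if e.scheme == scheme and e.name in names and e.code}


def coverage(records: list[list[Element]]) -> dict:
    """Summarize size and coverage for one archive's records."""
    return {
        "resources": len(records),
        "online": sum(1 for r in records if is_online(r)),
        "languages": len(distinct_codes(
            records, "olac:language", {"language", "subject"})),
        "linguistic_types": len(distinct_codes(
            records, "olac:linguistic-type", {"type"})),
    }
```

A language code is counted once however many records use it, and regardless of whether it appears on a Language or a Subject element.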
The following metrics summarize aspects of metadata quality for the selected archive (or for all archives when that is selected):
- Average Elements Per Record
The average number of elements (including refinements from the dcterms namespace) per metadata record.
- Average Encoding Schemes Per Record
The average number of elements per metadata record that use the xsi:type attribute to specify an encoding scheme for expressing the value.
- Average Metadata Quality Score
The average of the quality score for all the metadata records in the selected archive (see The quality score); the maximum value is 10.
- Date of Latest Update
The date on which the archive last updated its metadata repository. It is computed as the most recent of the <datestamp> values that occur in the headers of the metadata records as returned by the OAI-PMH protocol.
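The four quality metrics above are simple aggregates over the records of an archive. A minimal Python sketch follows; the `RecordStats` tuple and function name are hypothetical, and the sketch assumes that each datestamp begins with a date in YYYY-MM-DD form, optionally followed by a time of day, so that only the first ten characters need to be compared.

```python
from datetime import date
from typing import NamedTuple


class RecordStats(NamedTuple):
    """Hypothetical per-record figures gathered at harvest time."""
    datestamp: str       # <datestamp> from the OAI-PMH record header
    element_count: int   # number of elements in the record
    scheme_count: int    # number of elements that carry xsi:type
    quality_score: float


def quality_summary(stats: list[RecordStats]) -> dict:
    """Average the per-record figures and find the latest datestamp."""
    if not stats:
        raise ValueError("an archive with no records has no averages")
    n = len(stats)
    return {
        "avg_elements": sum(s.element_count for s in stats) / n,
        "avg_schemes": sum(s.scheme_count for s in stats) / n,
        "avg_score": sum(s.quality_score for s in stats) / n,
        # Keep only the date part so both datestamp granularities compare.
        "latest_update": max(date.fromisoformat(s.datestamp[:10])
                             for s in stats),
    }
```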
The OLAC Archive Metrics page continues with a Metadata Usage summary consisting of four histograms:
- Core Components
This histogram reports the use of core metadata components as recommended by [OLAC-BPR]. The eight lines correspond to the eight components of The quality score that are awarded as full points for the presence or absence of a recommended element or encoding scheme. The length of a bar represents the percentage of metadata records that contain that metadata component.
- Element Usage
This histogram lists all of the metadata elements in the Dublin Core scheme. The length of a bar represents the total number of times a given element has been used within the records of the selected archive. It is the count of element uses (not records that use the element); thus the counts exceed the total number of resources in the archive for elements that occur multiple times per record.
- Refinement Usage
This histogram lists all of the defined refinements to metadata elements in the Dublin Core scheme. The length of a bar represents the total number of times a given refinement has been used within the records of the selected archive. It is the count of refinement uses (not records that use the refinement); thus the counts exceed the total number of resources in the archive for refinements that occur multiple times per record.
- Encoding Scheme Usage
This histogram lists all of the encoding schemes that may occur as the value of the xsi:type attribute. The length of a bar represents the total number of times a given encoding scheme has been used within the records of the selected archive. It is the count of encoding scheme uses (not records that use the encoding scheme); thus the counts exceed the total number of resources in the archive for encoding schemes that occur multiple times per record.
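The distinction between counting uses and counting records can be illustrated with a short Python sketch. The function name and the representation of a record as a list of element names are assumptions made for the example.

```python
from collections import Counter


def usage_counts(records: list[list[str]]) -> tuple[Counter, Counter]:
    """Return (total uses, records using) for each element name.

    Each record is given as the list of names of the elements it contains,
    with a name repeated once per occurrence.
    """
    # Every occurrence counts: this is what the usage histograms report.
    uses = Counter(name for record in records for name in record)
    # Each record counts at most once per name: the basis for a percentage
    # of records, as in the Core Components histogram.
    records_using = Counter(
        name for record in records for name in set(record))
    return uses, records_using
```

For three records containing the Subject element 2, 0, and 3 times, the use count is 5, which exceeds the number of records, while only 2 records use the element.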
The guard against empty elements should be built into the harvester, so that elements having neither element content nor an olac:code value are simply ignored and not entered into the aggregated database.