Abstract

LanguageDCAT-AP caters for the description of language data, language models and language technology services.

These are also referred to as "language resources" or "language resources and technologies". As defined in the META-SHARE ontology, a "language resource" is "a resource composed of linguistic material used in the construction, improvement and/or evaluation of language processing applications, but also, in a broader sense, in language and language-mediated research studies and applications; the term is used with a broader meaning, encompassing (a) data sets (textual, multimodal/multimedia and lexical data, grammars, language models, etc.) in machine readable form, and (b) tools/technologies/services used for their processing and management."

Work on LanguageDCAT-AP was initiated in the Common European Language Data Space (LDS), but it is based on a long history of activities carried out in the context of infrastructures intended for the sharing of language resources. LanguageDCAT-AP is, thus, designed with a wider scope covering broader description needs in the Language Data, Language Technology and Language-centric Artificial Intelligence domain.

Introduction

LanguageDCAT-AP is conceived and implemented as an extension of (a) the DCAT-AP application profile for Catalogues containing Datasets and Data Services in Europe [[DCAT-AP]] and (b) the MLDCAT-AP profile for machine learning models and their datasets. It also builds upon the META-SHARE ontology, and the vocabularies associated with it, such as the OMTD-SHARE ontology, especially for domain-specific concepts and properties. To cover language data specificities, it adopts classes and properties from these vocabularies, introduces new ones, where needed, and proposes domain-specific controlled vocabularies.

The current version (v0.9.3) caters for all resource types (i.e., datasets, and more specifically, corpora and lexical/conceptual resources (see Section on Terminology), and language models), as well as for language technology (Natural Language Processing and Language-centric AI) services. All resource types are accompanied by a minimal set of attributes that help consumers discover them, while the full set of attributes is ongoing work and will be added in future versions.

Version v0.9.3 replaces v0.9.2. All changes made aim to ensure compatibility with DCAT-AP. For a quick overview of the differences, see the overview of changes.

Licence

All material in this repository is published under the CC-BY 4.0 licence, unless explicitly stated otherwise.

Context

One of the main objectives of LanguageDCAT-AP is to cover the requirements of exchanging language data and models in the context of the Common European Language Data Space; as such, it takes into account the building principles and specifications of data spaces. The most notable ones are the Data Space Protocol (DSP) and the IDS Reference Architecture Model (RAM, version 4 at the time of writing). Other data space initiatives, especially Gaia-X and its specifications (cf. Gaia-X architecture document), and the Data Spaces Support Centre (DSSC), with its Blueprint (version 3.0 at the time of writing), also bear a strong influence on the design of LanguageDCAT-AP.

For semantic interoperability purposes, data space initiatives recommend the use of DCAT [[vocab-dcat-3]] for the description of catalogues, datasets and data services, and [[ODRL]] for the representation of policies. Specific data spaces have further adopted and extended [[DCAT-AP]] to their domain needs (cf., for instance, mobilityDCAT-AP and HealthDCAT-AP). LDS follows the same path and, to this end, closely collaborates with the SEMIC team.

The language technology community has specific needs and demands that lead to the adaptation and extension of DCAT-AP and, for models, MLDCAT-AP. Compatibility with the aforementioned specifications and guidelines, as well as implementation requirements for data spaces, impose further constraints on the way these two models are deployed in LanguageDCAT-AP. Differences are manifested in the data type used for properties, the cardinality of properties, the recommended controlled vocabularies, as well as the introduction of new classes and properties. These differences are documented in the descriptions of classes and properties in the following sections (see introduction in the Main entities section for the symbols used). A subset of such differences is presented here for illustration purposes:

Terminology

An Application Profile is a data specification aimed to facilitate the data exchange in a well-defined application context. It re-uses concepts from one or more semantic data specifications, while adding more specificity, by identifying mandatory, recommended, and optional elements, addressing particular application needs, and providing recommendations for controlled vocabularies to be used. More information can be found on the SEMIC Style Guide. [Definition from MLDCAT-AP]

A dataset represents a collection of data, published or curated by a single agent or identifiable community. [Definition from [[vocab-dcat-3]]]

A language resource is a resource composed of linguistic material used in the construction, improvement and/or evaluation of language processing applications, but also, in a broader sense, in language and language-mediated research studies and applications; the term is used with a broader meaning, encompassing (a) data sets (textual, multimodal/multimedia and lexical data, grammars, language models, etc.) in machine readable form, and (b) tools/technologies/services used for their processing and management. [Definition from META-SHARE]

Language resources are further distinguished into:

Initially, models were considered under the class of "language descriptions", but given the way they have evolved, especially with the advent of Large Language Models (LLMs), they have been recognised as a distinct category.

Used prefixes

LanguageDCAT-AP is a semantic specification, where each class and property has a URI. The prefixed form of the URI can be found in part in the class diagram and in the quick reference table at the bottom, while, for ease of reading, the full URI can be found in the hyperlink of each class and property. This specification uses the following prefixes to shorten the URIs for readability:

| Prefix | Namespace IRI |
|---|---|
| adms | http://www.w3.org/ns/adms# |
| cc | http://creativecommons.org/ns# |
| dcat | http://www.w3.org/ns/dcat# |
| dcatlds | http://w3id.org/lang-dcat-ap/ |
| dct | http://purl.org/dc/terms/ |
| dpv | https://w3id.org/dpv# |
| foaf | http://xmlns.com/foaf/0.1/ |
| it6 | http://data.europa.eu/it6/ |
| lds | https://language-data-space.eu/entity/ |
| lexmeta | http://w3id.org/meta-share/lexmeta# |
| ms | http://w3id.org/meta-share/meta-share/ |
| odrl | http://www.w3.org/ns/odrl/2/ |
| omtd | http://w3id.org/meta-share/omtd-share/ |
| owl | http://www.w3.org/2002/07/owl# |
| prov | http://www.w3.org/ns/prov# |
| rdf | http://www.w3.org/1999/02/22-rdf-syntax-ns# |
| rdfs | http://www.w3.org/2000/01/rdf-schema# |
| schema | https://schema.org/ |
| skos | http://www.w3.org/2004/02/skos/core# |
| xml | http://www.w3.org/XML/1998/namespace |
| xsd | http://www.w3.org/2001/XMLSchema# |
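
To illustrate how the prefixes shorten URIs, the following minimal Python sketch expands a prefixed name into its full IRI. The `PREFIXES` dictionary copies only a few rows of the prefix list, and the helper name `expand` is hypothetical; it is not defined by this specification.

```python
# A few namespace bindings copied from the prefix list of this specification.
PREFIXES = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dcatlds": "http://w3id.org/lang-dcat-ap/",
    "dct": "http://purl.org/dc/terms/",
    "ms": "http://w3id.org/meta-share/meta-share/",
    "odrl": "http://www.w3.org/ns/odrl/2/",
}


def expand(prefixed_name):
    """Expand a prefixed name such as 'dct:title' into a full IRI."""
    prefix, sep, local_name = prefixed_name.partition(":")
    if not sep or prefix not in PREFIXES:
        raise ValueError(f"unknown or missing prefix in {prefixed_name!r}")
    return PREFIXES[prefix] + local_name


print(expand("dct:title"))     # http://purl.org/dc/terms/title
print(expand("dcat:Dataset"))  # http://www.w3.org/ns/dcat#Dataset
```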

WARNING: LanguageDCAT-AP v0.9.1 used the "http://www.nlpli.gr/dcat-lds#" namespace for its own defined elements, which has now been replaced with: "http://w3id.org/lang-dcat-ap/".
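
Descriptions created with the old namespace could be updated along the lines of the sketch below. It assumes that the local names of the elements are unchanged between the two namespaces, which should be verified against the overview of changes; the helper `migrate_iri` and the local name `exampleTerm` are hypothetical.

```python
OLD_NS = "http://www.nlpli.gr/dcat-lds#"  # namespace used by v0.9.1
NEW_NS = "http://w3id.org/lang-dcat-ap/"  # namespace that replaces it


def migrate_iri(iri):
    """Rewrite an IRI from the old namespace to the new one.

    Assumption: the local name is kept as-is. IRIs from other
    namespaces are returned untouched.
    """
    if iri.startswith(OLD_NS):
        return NEW_NS + iri[len(OLD_NS):]
    return iri


# "exampleTerm" is a made-up local name used only for illustration.
print(migrate_iri(OLD_NS + "exampleTerm"))
```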

Overview

This document describes how the following main entities are to be used in order to apply the Application Profile correctly:
| Agent | Catalogue | Corpus | Dataset | Data Service | Distribution | Lexical/Conceptual Resource | Language | Licence | Model | Organisation | Person | Policy | Processing Resource | Software Distribution | Tool/Service |

The main entities are supported by the following entities:
| Annotation Type | Anonymised | Catalogue Record | Catalogued Resource | Concept | Corpus Subclass | Data Protection Principle | Document | Domain | Identifier | Language Code | Language Resource Type | Lexical/Conceptual Resource Subclass | Linguality Type | Linguistic System | Location | Media Type | Media Type Or Extent | Microstructure Feature | Modality Type | Model Type | Multilinguality Type | Operation | Period of Time | Personal Data Included | Processing Resource Type | Resource | Script | Size | Size Unit | Software Distribution Form | Special Category Data Included | Standard | Technical Organisational Measure | Variant |

And supported by these datatypes:
| boolean | date | float | integer | langString | Literal | nonNegativeInteger | string | Temporal Literal |

The following diagram gives an overview of LanguageDCAT-AP. For simplicity's sake, only classes whose properties are specified by the profile and/or are deemed important for the overview are displayed.

Main entities

The main entities are those that form the core of the Application Profile.

The properties and their associated constraints that apply in the context of this profile are listed in tabular form. Each row corresponds to one property. In addition to the constraints, cross-references to DCAT-AP and MLDCAT-AP are provided. To save space, the following abbreviations are used:

This reuse qualification assessment is with respect to a specific version of the two profiles. Therefore it may vary over time when new versions thereof are created.

It should also be noted that the cross-references are also used in the case of subclasses or equivalent classes of relevant entities; for instance, properties assigned to Corpus and Lexical/Conceptual Resource (subclasses of "Dataset") include links to the DCAT-AP properties assigned to "Dataset", when re-used from it.

Agent

Definition
An agent (person or organisation) carrying out activities related to the Catalogue and Catalogued resources.
Reference
DCAT-AP [E]
Usage Note
Class used as an abstract class. Only the subclasses Organisation and Person should be used in a data exchange.
Properties
This specification does not impose any additional requirements on properties for this entity.

Catalogue

Definition
A catalogue or repository that hosts the Catalogued Resources being described.
Reference
DCAT-AP [A]
Properties
For this entity the following properties are defined: creator, dataset, description, homepage, identifier, keyword, licence, modification date, publisher, record, release date, service, title.
| Property | Range | Cardinality | Definition | Usage | Controlled vocabulary | Reference |
|---|---|---|---|---|---|---|
| creator | Agent | 0..1 | An entity responsible for the creation of the Catalogue. | | | DCAT-AP [A] |
| dataset | Dataset | 0..n | A Dataset that is part of the Catalogue. | | | DCAT-AP [A] |
| description | langString | 1..n | A free-text account of the Catalogue. | This property can be repeated for parallel language versions. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. | | DCAT-AP [E] |
| homepage | Document | 0..1 | A web page that acts as the main page for the Catalogue. | | | DCAT-AP [A] |
| identifier | Literal | 0..1 | A unique identifier of the resource being described or catalogued. | | | MLDCAT-AP [A] |
| keyword | langString | 0..n | A keyword or tag describing the resource. | This property can be repeated for parallel language versions. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. | | MLDCAT-AP [E] |
| licence | Licence Document | 0..1 | A licence under which the Catalogue can be used or reused. | | | DCAT-AP [A] |
| modification date | Temporal Literal | 0..1 | The most recent date on which the Catalogue was modified. | | | DCAT-AP [A] |
| publisher | Agent | 1..1 | An entity (organisation) responsible for making the Catalogue available. | In case multiple organisations are considered the publishers of the catalogue, it is recommended to use foaf:Group to bundle them into one entity. | | DCAT-AP [A] |
| record | Catalogue Record | 0..n | A Catalogue Record that is part of the Catalogue. | | | DCAT-AP [A] |
| release date | Temporal Literal | 0..1 | The date of formal issuance (e.g., publication) of the Catalogue. | | | DCAT-AP [A] |
| service | Data Service | 0..n | A site or end-point (Data Service) that is listed in the Catalogue. | | | DCAT-AP [A] |
| title | langString | 1..n | A name given to the Catalogue. | This property can be repeated for parallel language versions. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. | | DCAT-AP [E] |
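
The cardinalities and the language-tag requirement listed for the Catalogue can be checked mechanically. The sketch below is one possible way to do so and is not a normative validator: it assumes a hypothetical in-memory representation in which a record is a dictionary from property name to a list of values, and each langString value is a `(text, language_tag)` pair.

```python
# Mandatory Catalogue properties and their cardinalities, copied from the
# Catalogue property list as (minimum, maximum); None stands for the unbounded "n".
MANDATORY = {
    "description": (1, None),
    "publisher": (1, 1),
    "title": (1, None),
}
# Properties whose values are langString and therefore need a language tag.
LANG_STRING_PROPERTIES = ("description", "keyword", "title")


def check_catalogue(record):
    """Return a list of human-readable problems found in a Catalogue record."""
    problems = []
    for prop, (minimum, maximum) in MANDATORY.items():
        count = len(record.get(prop, []))
        if count < minimum:
            problems.append(f"{prop}: expected at least {minimum}, found {count}")
        elif maximum is not None and count > maximum:
            problems.append(f"{prop}: expected at most {maximum}, found {count}")
    for prop in LANG_STRING_PROPERTIES:
        for text, lang in record.get(prop, []):
            if not lang:
                problems.append(f"{prop}: {text!r} has no language tag")
    return problems


catalogue = {
    "title": [("Example catalogue", "en")],
    "description": [("A catalogue of corpora", "")],  # language tag missing
    # "publisher" is absent on purpose
}
for problem in check_catalogue(catalogue):
    print(problem)
```

In practice such constraints are usually expressed and validated with SHACL shapes; the dictionary-based check only makes the meaning of the Cardinality column concrete.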

Corpus

Definition
A structured collection of pieces of data (textual, audio, video, multimodal/multimedia, etc.) typically of considerable size and selected according to criteria external to the data (e.g., size, type of language, type of text producers or expected audience, etc.) to represent as comprehensively as possible the object of study.
Reference
No reference.
Subclass of
Dataset
Usage Note
Corpus is a subclass of dcat:Dataset. It is recommended to use this instead of the Dataset class.
Properties
For this entity the following properties are defined: alternative, annotation type, anonymisation details, anonymised, conforms to, corpus subclass, data protection principle applied, description, detailed language, distribution, domain, has original source, has policy, has technical organisational measure, identifier, IPR holder, is documented by, keyword, language, language resource type, licence, linguality type, modality type, multilinguality type, original source description, other identifier, personal data details, personal data included, pivot language, publisher, source language, spatial, special category data details, special category data included, target language, temporal, title, version.
| Property | Range | Cardinality | Definition | Usage | Controlled vocabulary | Reference |
|---|---|---|---|---|---|---|
| alternative | langString | 0..n | An alternative name for the resource. | It is recommended to use "dct:title" for the full name of a dataset and "dct:alternative" for the short name. This property can be repeated for parallel language versions of the property. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. | | P |
| annotation type | Annotation Type | 0..n | Specifies the annotation type of the annotated version(s) of a resource. | | OMTD-SHARE Annotation Type Vocabulary | P |
| anonymisation details | langString | 0..n | If the resource has been anonymised, this field can be used for entering more information, e.g., tool or method used for the anonymisation, by whom it has been performed, whether there was any check of the results, etc. | This property can be repeated for parallel language versions. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. | | P |
| anonymised | Anonymised | 1..1 | Indicates whether the language resource has been anonymised; anonymous data is information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable | | META-SHARE Anonymised Vocabulary | P |
| conforms to | Standard | 1..n | An established standard to which the described resource conforms | | META-SHARE Vocabulary of Standards and Best Practices | DCAT-AP [E] |
| corpus subclass | Corpus Subclass | 0..1 | Introduces a classification of corpora into types (used for descriptive reasons) | | META-SHARE Corpus Subclass Vocabulary | P |
| data protection principle applied | Data Protection Principle | 0..n | Specifies the data protection principles that have been applied in compliance with the General Data Protection Regulation (Regulation (EU) 2016/679) | | META-SHARE Data Protection Principle Vocabulary | P |
| description | langString | 1..n | An account of the resource | This property can be repeated for parallel language versions. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. | | DCAT-AP [E] |
| detailed language | Language | 1..n | Specifies the language used in the resource represented according to the BCP47 standard [[rfc5646]] | This property takes a value in compliance with the BCP47 standard [[rfc5646]] and, thus, allows for the detailed description of the language (e.g., British English, Brazilian Portuguese, Greek written with the Latin script, etc.). It must be conformant with the value of "dct:language". | | P |
| distribution | Distribution | 1..n | An available distribution of the dataset | | | DCAT-AP [E] |
| domain | Domain | 0..n | Identifies the domain according to which an entity is classified | | META-SHARE Vocabulary of Domains | P |
| has original source | Dataset | 0..n | Links a language resource to the original source that has been used for its creation, where it's derived or elicited from | | | P |
| has policy | Policy | 1..1 | Identifies an ODRL Policy for which the identified Asset is the target Asset to all the Rules | This property is used for "policies", i.e. the machine-readable representations of the licensing terms under which a dataset/distribution is made available. The representation MUST be expressed in the ODRL vocabulary. | | P |
| has technical organisational measure | Technical Organisational Measure | 0..n | Indicates use or applicability of Technical or Organisational measure | | META-SHARE Technical and Organisational Measure Vocabulary | P |
| identifier | Literal | 0..1 | An unambiguous reference to the resource within a given context | The main identifier for the resource, e.g. the URI or other unique identifier in the context of the Catalogue. It MUST be automatically assigned by the system when adding the resource to the Catalogue. | | DCAT-AP [A] |
| IPR holder | Agent | 0..n | A person or an organisation who holds the full Intellectual Property Rights (Copyright, trademark, etc.) that subsist in the resource. The IPR holder could be different from the creator that may have assigned the rights to the IPR holder (e.g., an author as a creator assigns her rights to the publisher who is the IPR holder) and the distributor that holds a specific licence (i.e. a permission) to distribute the work via a specific distributor. | The IPR holder may be identical in many cases with the Publisher of the resource (see property dct:publisher). In this case, the contact data MAY be copied from the corresponding Publisher entries. There might also be cases with non-identical entities: e.g., when one or several IPR Holders assign another entity as the Publisher responsible for producing, hosting and publishing a resource; such entities may be, for instance, a data distribution agency, or a specific partner representing a project consortium. The subproperty ms:iprHolder is preferred over dct:rightsHolder in order to differentiate it from other types of rights (e.g. distribution rights). | | P |
| is documented by | Document | 0..n | Links a language resource to a document (e.g., research paper describing its contents or its use in a project, user manual, etc.) or any other form of documentation (e.g., a URL with support information) that is related to the resource | | | P |
| keyword | langString | 1..n | A keyword or tag describing a resource | This property can be repeated for parallel language versions. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. | | DCAT-AP [E] |
| language | Linguistic System | 1..n | A language of the resource. | This property is used for the language of the contents of a dataset and takes a value only for the language proper, without any reference to script, regional variant, etc. It caters for semantic interoperability with other catalogues, yet fails to capture the full language details that are required for language data. For a more detailed description, the "ms:language" property is preferred, but the two values must be aligned. | The EU Authority languages vocabulary | DCAT-AP [A] |
| language resource type | Language Resource Type | 1..1 | Specifies the type of a language resource | This element allows for a more fine-grained categorisation of datasets into 'corpora' (collections of data files in text, audio, video and/or image modality), 'lexical/conceptual resources' (e.g., lexica, vocabularies, gazetteers, terminological lexica, etc.) and 'grammars'. Note: In the META-SHARE ontology, models and grammars are grouped as "language descriptions" and included as subclasses of datasets. Given the evolution of Machine Learning Models, this has been revisited and models are included as a separate subclass distinguished from datasets. | META-SHARE Language Resource Type Vocabulary | P |
| licence | Licence Document | 0..1 | A legal document giving official permission to do something with the resource | This property refers to a "licence", i.e. a human readable text with legal code, with which the resource/distribution is made available. This property SHOULD refer to a concrete standard or proprietary licence, so that the data users can assess the licence conditions in human-readable format before using the data. | | P |
| linguality type | Linguality Type | 1..1 | Indicates whether the resource includes one, two or more languages | | META-SHARE Linguality Type Vocabulary | P |
| modality type | Modality Type | 1..n | Specifies the modality (aka media) type of a language resource (the physical medium of the contents representation) or of the input/output of a language processing tool/service; each modality type is described through a distinctive set of technical features; a language resource may consist of different modality parts | | META-SHARE Modality Type Vocabulary | P |
| multilinguality type | Multilinguality Type | 0..1 | Indicates whether the resource (part) is parallel, comparable or mixed | | META-SHARE Multilinguality Type Vocabulary | P |
| original source description | langString | 0..n | A description in free text of the source material that has been used for the creation of a language data resource | This property can be used to provide further information on the source resource. For instance, provide information such as mode and timespan of collection of a dataset. | | P |
| other identifier | Identifier | 0..n | Links a resource to an adms:Identifier class. | This property MAY be used as an additional identifier for existing identifiers used for the same resource in other Catalogues (e.g. DOI, ISLRN, DataCite, Handle PIDs) | | DCAT-AP [A] |
| personal data details | langString | 0..n | If the resource includes personal data, this field can be used for entering more information, e.g., whether special handling of the resource is required (e.g., anonymisation, further request for use, etc.) | This property can be repeated for parallel language versions. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. | | P |
| personal data included | Personal Data Included | 1..1 | Specifies whether the language resource contains personal data, i.e., any information relating to an identified or identifiable natural person (data subject); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person [Article 4(1) of the General Data Protection Regulation (Regulation (EU) 2016/679)] | This property MUST be filled in for all non-anonymised datasets. If a dataset is anonymised, it is presumed that it contains no personal and sensitive data | META-SHARE Personal Data Included Vocabulary | P |
| pivot language | Language | 0..n | The language acting as an intermediary for translations between many languages | | | P |
| publisher | Agent | 1..1 | An entity responsible for making the resource available | This property refers to the entity that "publishes", i.e. makes available to the specific platform, the corresponding resource. The information may be identical to the property ms:iprHolder of the resource. | | DCAT-AP [E] |
| source language | Language | 0..n | The language from which a translation is made. | | | P |
| spatial | Location | 0..n | Spatial characteristics of the resource | A geographic region that is covered by the dataset. For language data, this refers to the geographic region where the language of the dataset is spoken/written and not where the dataset was implemented; for instance, in the case of a German institution creating a dataset of Cypriot Greek, the geographic region is "Cyprus". | The EU Vocabularies Continents Named Authority List for continents (http://publications.europa.eu/resource/dataset/continent), countries (http://publications.europa.eu/resource/dataset/country), places (http://publications.europa.eu/resource/dataset/place) and, if not covered, GeoNames (https://www.geonames.org/) | DCAT-AP [A] |
| special category data details | langString | 0..n | If the resource includes special category data, this field can be used for entering more information, e.g., whether special handling of the resource is required (e.g., anonymisation, further request for use, etc.) | This property can be repeated for parallel language versions. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. | | P |
| special category data included | Special Category Data Included | 1..1 | Specifies whether the language resource contains special category data, i.e., personal data revealing racial or ethnic origin, political opinions, religious or philosophical beliefs, or trade union membership, and the processing of genetic data, biometric data for the purpose of uniquely identifying a natural person, data concerning health or data concerning a natural person’s sex life or sexual orientation [Article 9(1) of the General Data Protection Regulation (Regulation (EU) 2016/679)] | This property MUST be filled in for all non-anonymised datasets. If a dataset is anonymised, it is presumed that it contains no personal and special category data | META-SHARE Special Category Data Included Vocabulary | P |
| target language | Language | 0..n | The language into which a translation is made | | | P |
| temporal | Period of Time | 0..n | Temporal characteristics of the resource | A temporal period that the contents of a dataset cover. For language data, this can be the time period in which the language of a dataset is spoken. | | DCAT-AP [A] |
| title | langString | 1..n | A name given to the resource | This property can be repeated for parallel language versions. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. | | DCAT-AP [E] |
| version | string | 1..1 | The version indicator (name or identifier) of a resource | This property MUST contain a version number or other version designation of the dataset. It is recommended to follow W3C Data on the Web Best Practices [[DWBP]]. Version identifiers should enable comparison of versions and distinguishing major from minor versions, such as Semantic Versioning. | | DCAT-AP [E] |
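
The requirement that the "detailed language" value be aligned with the "language" value can be illustrated with a small check. The sketch below rests on several assumptions: that "language" values are IRIs from the EU Authority languages vocabulary of the form shown in `EU_LANGUAGE_NS`, that the primary subtag of the BCP47 tag identifies the language proper, and that a mapping between the two code systems is available. The three-entry mapping and the function name `is_aligned` are illustrative only.

```python
# Assumed IRI pattern of the EU Authority languages vocabulary.
EU_LANGUAGE_NS = "http://publications.europa.eu/resource/authority/language/"

# Illustrative mapping from BCP47 primary language subtags to the codes
# assumed to be used by the EU vocabulary; a real check needs the full list.
BCP47_TO_EU_CODE = {"en": "ENG", "el": "ELL", "pt": "POR"}


def is_aligned(detailed_tag, language_iri):
    """Check that a BCP47 tag and a language IRI refer to the same language."""
    primary = detailed_tag.split("-")[0].lower()  # e.g. "pt" from "pt-BR"
    expected = BCP47_TO_EU_CODE.get(primary)
    return expected is not None and language_iri == EU_LANGUAGE_NS + expected


# British English against English, then Brazilian Portuguese against English.
print(is_aligned("en-GB", EU_LANGUAGE_NS + "ENG"))
print(is_aligned("pt-BR", EU_LANGUAGE_NS + "ENG"))
```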

Data Service

Definition
A site or end-point providing operations related to the discovery of, access to, or processing functions on, data or related resources.
Reference
DCAT-AP [A]
Subclass of
Catalogued Resource
Properties
For this entity the following properties are defined: endpoint description, endpoint URL, title.
| Property | Range | Cardinality | Definition | Usage | Controlled vocabulary | Reference |
|---|---|---|---|---|---|---|
| endpoint description | langString | 0..n | A description of the service end-point, including its operations, parameters etc. | This property can be repeated for parallel language versions. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. | | DCAT-AP [E] |
| endpoint URL | Resource | 1..1 | The root location or primary endpoint of the service (a web-resolvable IRI). | | | DCAT-AP [E] |
| title | langString | 0..n | The name of the data service | This property can be repeated for parallel language versions. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. | | DCAT-AP [E] |

Dataset

Definition
A collection of data, published or curated by a single agent, and available for access or download in one or more representations.
Reference
DCAT-AP [A]
Usage Note
Class used as an abstract class. Only the subclasses Corpus and Lexical/Conceptual Resource should be used in a data exchange.
Properties
This specification does not impose any additional requirements on properties for this entity.

Distribution

Definition
A specific representation of a dataset. A dataset might be available in multiple serialisations that may differ in various ways, including natural language, media-type or format, schematic organisation, temporal and spatial resolution, level of detail or profiles (which might specify any or all of the above).
Reference
DCAT-AP [A]
Properties
For this entity the following properties are defined: access service, access URL, byte size, download URL, format, has policy, licence, media type, package format, size, title.
To comply with the Data Space Protocol, some of these properties are treated differently in the LDS implementation; check the LDS Notes section for further information.

| Property | Range | Cardinality | Definition | Usage | Controlled vocabulary | Reference |
|---|---|---|---|---|---|---|
| access service | Data Service | 0..1 | A site or end-point that gives access to the distribution of the dataset | See LDS notes for LDS-specific implementation | | DCAT-AP [E] |
| access URL | Resource | 1..n | A URL of a resource that gives access to a distribution of the dataset, e.g. landing page, feed, SPARQL endpoint. | dcat:accessURL SHOULD be used for the URL of a service or location that can provide access to this distribution, typically through a Web form, query or API call, while dcat:downloadURL is preferred for direct links to downloadable resources. If needed, the dcat:downloadURL can be duplicated as dcat:accessURL for validation purposes. See also LDS notes for LDS-specific implementation | | DCAT-AP [E] |
| byte size | nonNegativeInteger | 0..1 | The size of a distribution in bytes. | | | DCAT-AP [E] |
| download URL | Resource | 1..1 | The URL of the downloadable file in a given format, e.g. CSV file or RDF file. The format is indicated by the distribution's dct:format and/or dcat:mediaType. | dcat:downloadURL SHOULD be used for the URL at which this distribution is available directly, typically through an HTTP GET request. See also LDS notes for LDS-specific implementation | | DCAT-AP [E] |
| format | Media Type Or Extent | 1..n | The file format, physical medium, or dimensions of the resource | dcatlds:mediaType SHOULD be used when the media type of the distribution is defined in IANA [IANA-MEDIA-TYPES], otherwise dcatlds:format MAY be used with different values. | EU Vocabularies File Type Named Authority List | P |
| has policy | Policy | 1..1 | Identifies an ODRL Policy for which the identified Asset is the target Asset to all the Rules | This property is used for "policies", i.e. the machine-readable representation of the licensing terms under which a dataset/distribution is made available. The representation MUST be expressed in the ODRL vocabulary. See also LDS notes for LDS-specific implementation | | DCAT-AP [E] |
| licence | Licence Document | 0..1 | A legal document giving official permission to do something with the resource | This property refers to a "licence", i.e. a human readable text with legal code, with which the resource/distribution is made available. This property SHOULD refer to a concrete standard or proprietary licence, so that the data users can assess the licence conditions in human-readable format before using the data. See also LDS notes for LDS-specific implementation | | DCAT-AP [E] |
| media type | Media Type | 0..n | The media type of the distribution as defined by IANA [IANA-MEDIA-TYPES]. | dcatlds:mediaType SHOULD be used when the media type of the distribution is defined in IANA [IANA-MEDIA-TYPES], otherwise dcatlds:format MAY be used with different values. | IANA-MEDIA-TYPES Vocabulary | P |
| package format | Media Type | 0..1 | The package format of the distribution in which one or more data files are grouped together, e.g. to enable a set of related files to be downloaded together. | | OMTD-SHARE Vocabulary of Package Formats | DCAT-AP [E] |
| size | Size | 1..n | Specifies the size of a countable entity with regard to the SizeUnit measurement in form of a number | | | P |
| title | langString | 0..n | A name given to the resource | This property can be repeated for parallel language versions. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. | | DCAT-AP [E] |
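
The usage note on the access URL allows the download URL to be duplicated as access URL for validation purposes. A minimal sketch of that fallback is given below; it assumes a hypothetical dictionary representation of a Distribution keyed by prefixed property names, and the URL in the example is a placeholder.

```python
def with_access_url(distribution):
    """Return a copy whose dcat:accessURL falls back to dcat:downloadURL.

    dcat:accessURL is treated as a list (cardinality 1..n) and
    dcat:downloadURL as a single value (cardinality 1..1).
    """
    result = dict(distribution)
    if not result.get("dcat:accessURL") and result.get("dcat:downloadURL"):
        result["dcat:accessURL"] = [result["dcat:downloadURL"]]
    return result


# Placeholder URL used only for illustration.
dist = {"dcat:downloadURL": "https://example.org/corpus.zip"}
print(with_access_url(dist)["dcat:accessURL"])
```

An existing access URL is never overwritten, so distributions that already point to a landing page or API keep that value.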

Language

Definition
A linguistic system that follows for its encoding the BCP47 standard [[rfc5646]].
Reference
No reference
Subclass of
Linguistic System
Properties
For this entity the following properties are defined: language code, language tag, language variety name, region, script, variant.
| Property | Range | Cardinality | Definition | Usage | Controlled vocabulary | Reference |
| --- | --- | --- | --- | --- | --- | --- |
| language code | Language Code | 1..1 | Used to specify the first part of a language tag according to the BCP47 standard [[rfc5646]] which indicates the language | | META-SHARE Language Vocabulary | P |
| language tag | string | 1..1 | The identifier of a language, according to the IETF BCP47 standard [[rfc5646]] | This is automatically built from the values of the four language subtags (language, script, region and variant) provided by the user | | P |
| language variety name | langString | 0..n | A textual string used for referring to a language variety | This property can be repeated for parallel language versions. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. | | P |
| region | Location | 0..1 | Specifies the geographical region where a language is used according to the BCP47 standard [[rfc5646]] | | META-SHARE Location Vocabulary | P |
| script | Script | 0..1 | Specifies the script used for writing a language according to the BCP47 standard [[rfc5646]] | | META-SHARE Script Vocabulary | P |
| variant | Variant | 0..n | Specifies a variant for a language according to the BCP47 standard [[rfc5646]] | | META-SHARE Variants Vocabulary | P |
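
The usage note for "language tag" says the tag is built from the language, script, region and variant subtags. As a minimal sketch of that assembly, assuming the subtags arrive as plain strings, the Python function below joins them in the order BCP47 prescribes. The function name `build_language_tag` and its parameters are hypothetical and not part of LanguageDCAT-AP; the sketch applies only the conventional casing and does not validate subtags against the IANA registry.

```python
def build_language_tag(language, script=None, region=None, variants=()):
    """Join subtags in BCP47 order: language, script, region, variant(s).

    Hypothetical helper for illustration; it does not check that the
    subtags are registered values.
    """
    if not language:
        # "language code" has cardinality 1..1 in the table above.
        raise ValueError("the language subtag is mandatory")
    subtags = [language.lower()]
    if script:
        # Script subtags are conventionally title-cased, e.g. "Latn".
        subtags.append(script.title())
    if region:
        # Region subtags are conventionally upper-cased, e.g. "GB".
        subtags.append(region.upper())
    # "variant" has cardinality 0..n, so any number may follow.
    subtags.extend(variant.lower() for variant in variants)
    return "-".join(subtags)


british_english = build_language_tag("en", region="gb")
greek_in_latin_script = build_language_tag("el", script="latn")
```

Under these assumptions, the two calls cover the "British English" and "Greek written with the Latin script" cases mentioned for the "detailed language" property.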

Lexical/Conceptual Resource

Definition
A resource organised on the basis of lexical or conceptual entries (lexical items, terms, concepts, etc.) with their supplementary information (e.g., grammatical, semantic, statistical information, etc.)
Reference
No reference.
Subclass of
Dataset
Usage Note
Lexical/Conceptual resource is a subclass of dcat:Dataset. It is recommended to use this instead of the Dataset class.
Properties
For this entity the following properties are defined: alternative, anonymisation details, anonymised, conforms to, data protection principle applied, description, detailed language, distribution, domain, has original source, has policy, has technical organisational measure, identifier, IPR holder, is documented by, keyword, language, language resource type, LCR subclass, licence, linguality type, linguistic information, metalanguage, modality type, multilinguality type, original source description, other identifier, personal data details, personal data included, pivot language, publisher, source language, spatial, special category data details, special category data included, target language, temporal, title, version.
PropertyRangeCardinalityDefinitionUsageControlled vocabularyReference
alternativelangString0..nAn alternative name for the resource.It is recommended to use "dct:title" for the full name of a dataset and "dct:alternative" for the short name. This property can be repeated for parallel language versions. The language tag is mandatory and is defined by BCP47 [[rfc5646]].P
anonymisation detailslangString0..nIf the resource has been anonymised, this field can be used for entering more information, e.g., tool or method used for the anonymisation, by whom it has been performed, whether there was any check of the results, etc.This property can be repeated for parallel language versions. The language tag is mandatory and is defined by BCP47 [[rfc5646]].P
anonymisedAnonymised1..1Indicates whether the language resource has been anonymised; anonymous data is information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiableMETA-SHARE Anonymised VocabularyP
conforms toStandard1..nAn established standard to which the described resource conformsMETA-SHARE Vocabulary of Standards and Best PracticesDCAT-AP [E]
data protection principle appliedData Protection Principle0..nSpecifies the data protection principles that have been applied in compliance with the General Data Protection Regulation (Regulation (EU) 2016/679)META-SHARE Data Protection Principle VocabularyP
descriptionlangString1..nAn account of the resourceThis property can be repeated for parallel language versions. The language tag is mandatory and is defined by BCP47 [[rfc5646]].DCAT-AP [E]
detailed languageLanguage1..nSpecifies the language that is used in the resource or supported by the tool/service expressed according to the BCP47 standard [[rfc5646]]This property takes a value in compliance with the BCP47 standard [[rfc5646]] and, thus, allows for the detailed description of the language (e.g., British English, Brazilian Portuguese, Greek written with the Latin script, etc.). It must be conformant with the value of "dct:language".P
distributionDistribution1..nAn available distribution of the datasetDCAT-AP [E]
domainDomain0..nIdentifies the domain according to which an entity is classifiedMETA-SHARE Vocabulary of DomainsP
has original sourceDataset0..nLinks a language resource to the original source that has been used for its creation, i.e. the source from which it is derived or elicitedP
has policyPolicy1..1Identifies an ODRL Policy for which the identified Asset is the target Asset to all the RulesThis property is used for "policies", i.e. the machine-readable representations of the licensing terms under which a dataset/distribution is made available. The representation MUST be expressed in the ODRL vocabulary.P
has technical organisational measureTechnical Organisational Measure0..nIndicates use or applicability of Technical or Organisational measureMETA-SHARE Technical and Organisational Measure VocabularyP
identifierLiteral0..1An unambiguous reference to the resource within a given contextThe main identifier for the resource, e.g. the URI or other unique identifier in the context of the Catalogue. It MUST be automatically assigned by the system when adding the resource to the Catalogue.DCAT-AP [A]
IPR holderAgent0..nA person or an organisation who holds the full Intellectual Property Rights (Copyright, trademark, etc.) that subsist in the resource. The IPR holder could be different from the creator that may have assigned the rights to the IPR holder (e.g., an author as a creator assigns her rights to the publisher who is the IPR holder) and the distributor that holds a specific licence (i.e. a permission) to distribute the work via a specific distributor.The IPR holder may be identical in many cases with the Publisher of the resource (see property dct:publisher). In this case, the contact data MAY be copied from the corresponding Publisher entries. There might also be cases with non-identical entities: e.g., when one or several IPR Holders assign another entity as the Publisher responsible for producing, hosting and publishing a resource; such entities may be, for instance, a data distribution agency, or a specific partner representing a project consortium. The subproperty ms:iprHolder is preferred over dct:rightsHolder in order to differentiate it from other types of rights (e.g. distribution rights).P
is documented byDocument0..nLinks a language resource to a document (e.g., research paper describing its contents or its use in a project, user manual, etc.) or any other form of documentation (e.g., a URL with support information) that is related to the resourceP
keywordlangString1..nA keyword or tag describing a resourceThis property can be repeated for parallel language versions. The language tag is mandatory and is defined by BCP47 [[rfc5646]].DCAT-AP [E]
languageLinguistic System1..nA language of the resource.This property is used for the language of the contents of a dataset and takes a value only for the language proper, without any reference to script, regional variant, etc. It caters for semantic interoperability with other catalogues, yet fails to capture the full language details that are required for language data. For a more detailed description, the "ms:language" property is preferred, but the two values must be aligned. The EU Authority languages vocabulary DCAT-AP [A]
language resource typeLanguage Resource Type1..1Specifies the type of a language resourceThis element allows for a more fine-grained categorisation of datasets into 'corpora' (collections of data files in text, audio, video and/or image modality), 'lexical/conceptual resources' (e.g., lexica, vocabularies, gazetteers, terminological lexica, etc.) and 'grammars'. Note: In the META-SHARE ontology, models and grammars are grouped as "language descriptions" and included as subclasses of datasets. Given the evolution of Machine Learning Models, this has been revisited and models are included as a separate subclass distinguished from datasets.META-SHARE Language Resource Type VocabularyP
LCR subclassLCR Subclass0..1Introduces a classification of lexical/conceptual resources into types (used for descriptive reasons)META-SHARE Lexical Subclasses VocabularyP
licenceLicence Document0..1A legal document giving official permission to do something with the resource.This property refers to a "licence", i.e. a human readable text with legal code, with which the resource/distribution is made available. This property SHOULD refer to a concrete standard or proprietary licence, so that the data users can assess the licence conditions in human-readable format before using the data. P
linguality typeLinguality Type1..1Indicates whether the resource includes one, two or more languagesMETA-SHARE Linguality Type VocabularyP
linguistic informationMicrostructure Feature0..nProvides a detailed account of the linguistic information contained in the lexical/conceptual resourceLEXMETA Vocabulary for Microstructure FeaturesP
metalanguageLanguage0..nSpecifies the language that is used as support for the resource (e.g., English for a grammar of French described in English or for a French dictionary with English definitions)P
modality typeModality Type1..nSpecifies the media type of a language resource (the physical medium of the contents representation) or of the input/output of a language processing tool/service; each media type is described through a distinctive set of technical features; a language resource may consist of different media partsMETA-SHARE Modality Type VocabularyP
multilinguality typeMultilinguality Type0..1Indicates whether the resource (part) is parallel, comparable or mixedMETA-SHARE Linguality Type VocabularyP
original source descriptionlangString0..nA description in free text of the source material that has been used for the creation of a language data resourceThis property can be used to provide further information on the source resource. For instance, provide information such as mode and timespan of collection of a dataset.P
other identifierIdentifier0..nLinks a resource to an adms:Identifier class.This property MAY be used as an additional identifier for existing identifiers used for the same resource in other Catalogues (e.g. DOI, ISLRN, DataCite, Handle PIDs)DCAT-AP [A]
personal data detailslangString0..nIf the resource includes personal data, this field can be used for entering more information, e.g., whether special handling of the resource is required (e.g., anonymisation, further request for use, etc.)This property can be repeated for parallel language versions. The language tag is mandatory and is defined by BCP47 [[rfc5646]].P
personal data includedPersonal Data Included1..1Specifies whether the language resource contains personal data, i.e., any information relating to an identified or identifiable natural person (data subject); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person [Article 4(1) of the General Data Protection Regulation (Regulation (EU) 2016/679)]This property MUST be filled in for all non-anonymised datasets. If a dataset is anonymised, it is presumed that it contains no personal and sensitive dataMETA-SHARE Personal Data Included VocabularyP
pivot languageLanguage0..nThe language acting as an intermediary for translations between many languagesP
publisherAgent1..1An entity responsible for making the resource availableThis property refers to the entity that "publishes", i.e. makes available to the specific platform, the corresponding resource. The information may be identical to the property ms:iprHolder of the resource.DCAT-AP [E]
source languageLanguage0..nThe language from which a translation is made.P
spatialLocation0..nSpatial characteristics of the resourceA geographic region that is covered by the dataset. For language data, this refers to the geographic region where the language of the dataset is spoken/written and not where the dataset was implemented; for instance, in the case of a German institution creating a dataset of Cypriot Greek, the geographic region is "Cyprus".The EU Vocabularies Continents Named Authority List for continents (http://publications.europa.eu/resource/dataset/continent), countries (http://publications.europa.eu/resource/dataset/country), places (http://publications.europa.eu/resource/dataset/place) and, if not covered, GeoNames (https://www.geonames.org/)DCAT-AP [A]
special category data detailslangString0..nIf the resource includes special category data, this field can be used for entering more information, e.g., whether special handling of the resource is required (e.g., anonymisation, further request for use, etc.)This property can be repeated for parallel language versions. The language tag is mandatory and is defined by BCP47 [[rfc5646]].P
special category data includedSpecial Category Data Included1..1Specifies whether the language resource contains special category data, i.e., personal data revealing racial or ethnic origin, political opinions, religious or philosophical beliefs, or trade union membership, and the processing of genetic data, biometric data for the purpose of uniquely identifying a natural person, data concerning health or data concerning a natural person’s sex life or sexual orientation [Article 9(1) of the General Data Protection Regulation (Regulation (EU) 2016/679)]This property MUST be filled in for all non-anonymised datasets. If a dataset is anonymised, it is presumed that it contains no personal and sensitive dataMETA-SHARE Special Category Data Included VocabularyP
target languageLanguage0..nThe language into which a translation is madeP
temporalPeriod of Time0..nTemporal characteristics of the resourceA temporal period that the contents of a dataset cover. For language data, this can be the time period in which the language of a dataset is spoken.DCAT-AP [A]
titlelangString1..nA name given to the resourceThis property can be repeated for parallel language versions. The language tag is mandatory and is defined by BCP47 [[rfc5646]].DCAT-AP [E]
versionstring1..1The version indicator (name or identifier) of a resourceThis property MUST contain a version number or other version designation of the dataset. It is recommended to follow W3C Data on the Web Best Practices [DWBP]. Version identifiers should enable comparison of versions and distinguishing major from minor versions, such as Semantic Versioning.DCAT-AP [E]
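
The cardinalities in the table above can be checked mechanically. The following Python sketch is one possible way to do so, not a prescribed validation procedure: it copies a handful of cardinalities from the table (title, description, version, publisher, identifier, licence) and reports violations for a record. The names `CARDINALITIES` and `check_cardinalities`, and the representation of a record as a dictionary of value lists, are assumptions made for illustration only.

```python
# (min, max) occurrences copied from the table; max=None stands for "n".
CARDINALITIES = {
    "title": (1, None),
    "description": (1, None),
    "version": (1, 1),
    "publisher": (1, 1),
    "identifier": (0, 1),
    "licence": (0, 1),
}


def check_cardinalities(record, cardinalities=CARDINALITIES):
    """Return a list of human-readable cardinality violations.

    `record` maps a property label to a list of values; a missing key
    counts as zero values.
    """
    problems = []
    for prop, (low, high) in cardinalities.items():
        count = len(record.get(prop, []))
        if count < low:
            problems.append(f"{prop}: expected at least {low}, found {count}")
        elif high is not None and count > high:
            problems.append(f"{prop}: expected at most {high}, found {count}")
    return problems


# Hypothetical record: two versions and no publisher.
example_record = {
    "title": ["An example lexicon"],
    "description": ["A made-up record used only for this sketch."],
    "version": ["1.0.0", "1.1.0"],
}
violations = check_cardinalities(example_record)
```

A production setup would more likely express such constraints as shapes over the RDF graph; the dictionary form is used here only to keep the sketch self-contained.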

Licence

Definition
A legal document giving official permission to do something with a resource
Reference
DCAT-AP [A]
Properties
For this entity the following properties are defined: alternative, description, identifier, legal code, see also, title.
| Property | Range | Cardinality | Definition | Usage | Controlled vocabulary | Reference |
| --- | --- | --- | --- | --- | --- | --- |
| alternative | langString | 0..n | An alternative name for the resource. | This property can be repeated for parallel language versions. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. | | P |
| description | langString | 1..n | An account of the resource | This property can be repeated for parallel language versions. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. | | P |
| identifier | Identifier | 0..n | Links a resource to an adms:Identifier class. | When a standard licence is used, this MUST be the identifier from the SPDX list of licences. | | P |
| legal code | Resource | 1..1 | The URL of the legal text of a Licence. | | | P |
| see also | Resource | 0..n | Further information about the subject resource. | This property MUST be used for additional URLs that contain the licence text besides the official one. It can be used for identifying duplicates. | | P |
| title | langString | 1..n | A name given to the resource | This property can be repeated for parallel language versions. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. | | P |

Model

Definition
The model artifact that is created through a training process involving an ML algorithm (that is, the learning algorithm) and the training data to learn from.
Reference
No reference.
Usage Note
This class is considered equivalent to Machine Learning Model in MLDCAT-AP.
Properties
For this entity the following properties are defined: alternative, anonymisation details, anonymised, bias, conforms to, context length, creation details, data protection principle applied, description, detailed language, distribution, domain, evaluation dataset, evaluation results, finetune dataset, has policy, has technical organisational measure, identifier, IPR holder, is documented by, keyword, language, language resource type, licence, limitations, linguality type, modality type, model function, model type, other identifier, parameter precision, personal data details, personal data included, publisher, quantisation process, quantised, source language, spatial, special category data details, special category data included, target language, temporal, title, trained on, variant of, version.
PropertyRangeCardinalityDefinitionUsageControlled vocabularyReference
alternativelangString0..nAn alternative name for the resource.It is recommended to use "dct:title" for the full name of a dataset and "dct:alternative" for the short name. This property can be repeated for parallel language versions. The language tag is mandatory and is defined by BCP47 [[rfc5646]].P
anonymisation detailslangString0..nIf the resource has been anonymised, this field can be used for entering more information, e.g., tool or method used for the anonymisation, by whom it has been performed, whether there was any check of the results, etc.This property can be repeated for parallel language versions. The language tag is mandatory and is defined by BCP47 [[rfc5646]].P
anonymisedAnonymised1..1Indicates whether the language resource has been anonymised; anonymous data is information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiableMETA-SHARE Anonymised VocabularyP
biaslangString0..nA description of the possible biases affecting the overall output of the modelThis property can be repeated for parallel language versions. The language tag is mandatory and is defined by BCP47 [[rfc5646]].MLDCAT-AP [A]
conforms toStandard1..nAn established standard to which the described resource conformsMETA-SHARE Vocabulary of Standards and Best PracticesP
context lengthinteger0..1The maximum amount of text that an AI model can process and retain in memory at any given timeP
creation detailslangString0..nProvides additional information on the creation of a language resourceThis property can be repeated for parallel language versions. The language tag is mandatory and is defined by BCP47 [[rfc5646]].P
data protection principle appliedData Protection Principle0..nSpecifies the data protection principles that have been applied in compliance with the General Data Protection Regulation (Regulation (EU) 2016/679)META-SHARE Data Protection Principle VocabularyP
descriptionlangString1..nAn account of the resourceThis property can be repeated for parallel language versions. The language tag is mandatory and is defined by BCP47 [[rfc5646]].MLDCAT-AP [E]
detailed languageLanguage1..nSpecifies the language that is used in the resource or supported by the tool/service expressed according to the BCP47 standard [[rfc5646]]This property takes a value in compliance with the BCP47 standard [[rfc5646]] and, thus, allows for the detailed description of the language (e.g., British English, Brazilian Portuguese, Greek written with the Latin script, etc.). It must be conformant with the value of "dct:language".P
distributionDistribution1..nAn available distribution of the datasetP
domainDomain0..nIdentifies the domain according to which an entity is classifiedMETA-SHARE Vocabulary of DomainsP
evaluation datasetCorpus0..nThe dataset used for the evaluation of the machine learning modelP
evaluation resultslangString0..nA description of the evaluation results against the evaluation datasetThis property can be repeated for parallel language versions. The language tag is mandatory and is defined by BCP47 [[rfc5646]].P
finetune datasetCorpus0..nThe dataset used to fine-tune the machine learning modelP
has policyPolicy1..1Identifies an ODRL Policy for which the identified Asset is the target Asset to all the RulesThis property is used for "policies", i.e. the machine-readable representations of the licensing terms under which a dataset/distribution is made available. The representation MUST be expressed in the ODRL vocabulary.MLDCAT-AP [E]
has technical organisational measureTechnical Organisational Measure0..nIndicates use or applicability of Technical or Organisational measureMETA-SHARE Technical and Organisational Measure VocabularyP
identifierLiteral0..1An unambiguous reference to the resource within a given contextThe main identifier for the resource, e.g. the URI or other unique identifier in the context of the Catalogue. It MUST be automatically assigned by the system when adding the resource to the Catalogue.MLDCAT-AP [A]
IPR holderAgent0..nA person or an organisation who holds the full Intellectual Property Rights (Copyright, trademark, etc.) that subsist in the resource. The IPR holder could be different from the creator that may have assigned the rights to the IPR holder (e.g., an author as a creator assigns her rights to the publisher who is the IPR holder) and the distributor that holds a specific licence (i.e. a permission) to distribute the work via a specific distributor.The IPR holder may be identical in many cases with the Publisher of the resource (see property dct:publisher). In this case, the contact data MAY be copied from the corresponding Publisher entries. There might also be cases with non-identical entities: e.g., when one or several IPR Holders assign another entity as the Publisher responsible for producing, hosting and publishing a resource; such entities may be, for instance, a data distribution agency, or a specific partner representing a project consortium. The subproperty ms:iprHolder is preferred over dct:rightsHolder in order to differentiate it from other types of rights (e.g. distribution rights).P
is documented byDocument0..nLinks a language resource to a document (e.g., research paper describing its contents or its use in a project, user manual, etc.) or any other form of documentation (e.g., a URL with support information) that is related to the resourceP
keywordlangString1..nA keyword or tag describing a resourceThis property can be repeated for parallel language versions. The language tag is mandatory and is defined by BCP47 [[rfc5646]].MLDCAT-AP [E]
languageLinguistic System1..nA language of the resource.This property is used for the input language supported by a model and takes a value only for the language proper, without any reference to script, regional variant, etc. It caters for semantic interoperability with other catalogues, yet fails to capture the full language details that are required for language data. For a more detailed description, the "ms:language" property is preferred, but the two values must be aligned.The EU Authority languages vocabularyP
language resource typeLanguage Resource Type1..1Specifies the type of a language resourceThis element allows for a more fine-grained categorisation of datasets into 'corpora' (collections of data files in text, audio, video and/or image modality), 'lexical/conceptual resources' (e.g., lexica, vocabularies, gazetteers, terminological lexica, etc.) and 'grammars'. Note: In the META-SHARE ontology, models and grammars are grouped as "language descriptions" and included as subclasses of datasets. Given the evolution of Machine Learning Models, this has been revisited and models are included as a separate subclass distinguished from datasets.META-SHARE Language Resource Type VocabularyP
licenceLicence Document0..1A legal document giving official permission to do something with the resourceThis property refers to a "licence", i.e. a human readable text with legal code, with which the resource/distribution is made available. This property SHOULD refer to a concrete standard or proprietary licence, so that the data users can assess the licence conditions in human-readable format before using the data.MLDCAT-AP [E]
limitationslangString0..nThe limited capabilities of the modelThis property can be repeated for parallel language versions. The language tag is mandatory and is defined by BCP47 [[rfc5646]].MLDCAT-AP [A]
linguality typeLinguality Type1..1Indicates whether the resource includes one, two or more languagesMETA-SHARE Linguality Type VocabularyP
modality typeModality Type1..nSpecifies the media type of a language resource (the physical medium of the contents representation) or of the input/output of a language processing tool/service; each media type is described through a distinctive set of technical features; a language resource may consist of different media partsMETA-SHARE Modality Type VocabularyP
model functionOperation1..nThe function/task/operation a model performsOMTD-SHARE Vocabulary of OperationsP
model typeModel Type0..1A classification of models based on their algorithmMETA-SHARE Vocabulary of Model TypesP
other identifierIdentifier0..nLinks a resource to an adms:Identifier class.This property MAY be used as an additional identifier for existing identifiers used for the same resource in other Catalogues (e.g. DOI, ISLRN, DataCite, Handle PIDs)MLDCAT-AP [A]
parameter precisionstring0..1The number of bits used to represent a model's parameters, which directly impacts the model's memory usage and computational efficiencyP
personal data detailslangString0..nIf the resource includes personal data, this field can be used for entering more information, e.g., whether special handling of the resource is required (e.g., anonymisation, further request for use, etc.)This property can be repeated for parallel language versions. The language tag is mandatory and is defined by BCP47 [[rfc5646]].P
personal data includedPersonal Data Included1..1Specifies whether the language resource contains personal data, i.e., any information relating to an identified or identifiable natural person (data subject); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person [Article 4(1) of the General Data Protection Regulation (Regulation (EU) 2016/679)]This property MUST be filled in for all non-anonymised datasets. If a dataset is anonymised, it is presumed that it contains no personal and sensitive dataMETA-SHARE Personal Data Included VocabularyP
publisherAgent1..1An entity responsible for making the resource availableThis property refers to the entity that "publishes", i.e. makes available to the specific platform, the corresponding resource. The information may be identical to the property ms:iprHolder of the resource.P
quantisation processlangString0..nThe process used for the quantisationThis property can be repeated for parallel language versions. The language tag is mandatory and is defined by BCP47 [[rfc5646]].P
quantisedboolean0..1Whether this version is quantisedP
source languageLanguage0..nThe language from which a translation is made.P
spatialLocation0..nSpatial characteristics of the resourceA geographic region that is covered by the dataset. For language data, this refers to the geographic region where the language of the dataset is spoken/written and not where the dataset was implemented; for instance, in the case of a German institution creating a dataset of Cypriot Greek, the geographic region is "Cyprus".The EU Vocabularies Continents Named Authority List for continents (http://publications.europa.eu/resource/dataset/continent), countries (http://publications.europa.eu/resource/dataset/country), places (http://publications.europa.eu/resource/dataset/place) and, if not covered, GeoNames (https://www.geonames.org/)P
special category data detailslangString0..nIf the resource includes special category data, this field can be used for entering more information, e.g., whether special handling of the resource is required (e.g., anonymisation, further request for use, etc.)This property can be repeated for parallel language versions. The language tag is mandatory and is defined by BCP47 [[rfc5646]].P
special category data includedSpecial Category Data Included1..1Specifies whether the language resource contains special category data, i.e., personal data revealing racial or ethnic origin, political opinions, religious or philosophical beliefs, or trade union membership, and the processing of genetic data, biometric data for the purpose of uniquely identifying a natural person, data concerning health or data concerning a natural person’s sex life or sexual orientation [Article 9(1) of the General Data Protection Regulation (Regulation (EU) 2016/679)]This property MUST be filled in for all non-anonymised datasets. If a dataset is anonymised, it is presumed that it contains no personal and sensitive dataMETA-SHARE Special Category Data Included VocabularyP
target languageLanguage0..nThe language into which a translation is madeP
temporalPeriod of Time0..nTemporal characteristics of the resourceA temporal period that the contents of a dataset cover. For language data, this can be the time period in which the language of a dataset is spoken.P
titlelangString1..nA name given to the resourceThis property can be repeated for parallel language versions. The language tag is mandatory and is defined by BCP47 [[rfc5646]].P
trained onCorpus1..nThe training dataset used to train the machine learning modelMLDCAT-AP [E]
variant ofModel0..nThe model that this model is based on, derived from, or a variation ofP
versionstring1..1The version indicator (name or identifier) of a resourceThis property MUST contain a version number or other version designation of the dataset. It is recommended to follow W3C Data on the Web Best Practices [DWBP]. Version identifiers should enable comparison of versions and distinguishing major from minor versions, such as Semantic Versioning.MLDCAT-AP [E]
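
The usage note for "version" asks for identifiers that allow versions to be compared and major changes to be told apart from minor ones. The Python sketch below illustrates how a Semantic Versioning style identifier supports both, under the assumption that the value has the plain MAJOR.MINOR.PATCH form. The helper names `parse_version` and `is_major_change` are hypothetical, and pre-release or build metadata suffixes are deliberately not handled.

```python
import re

# Plain MAJOR.MINOR.PATCH only; suffixes such as "-rc.1" are rejected.
_VERSION_PATTERN = re.compile(r"^(\d+)\.(\d+)\.(\d+)$")


def parse_version(version):
    """Parse 'MAJOR.MINOR.PATCH' into a tuple of three integers."""
    match = _VERSION_PATTERN.match(version)
    if match is None:
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {version!r}")
    return tuple(int(part) for part in match.groups())


def is_major_change(old, new):
    """Return True when `new` has a higher major number than `old`."""
    return parse_version(new)[0] > parse_version(old)[0]


# Tuples compare numerically, so 1.10.0 is newer than 1.9.3,
# which a plain string comparison would get wrong.
newer = parse_version("1.10.0") > parse_version("1.9.3")
major_bump = is_major_change("1.9.3", "2.0.0")
```

Comparing parsed tuples rather than raw strings is the design point here: it is what makes the ordering of versions reliable.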

Organisation

Definition
An organisation.
Reference
No reference
Subclass of
Agent
Properties
For this entity the following properties are defined: identifier, name.
| Property | Range | Cardinality | Definition | Usage | Controlled vocabulary | Reference |
| --- | --- | --- | --- | --- | --- | --- |
| identifier | Identifier | 0..n | Links a resource to an adms:Identifier class. | Recommended to use ROR for research institutions | | P |
| name | langString | 1..n | A name for some thing. | This property can be repeated for parallel language versions. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. | | P |

Person

Definition
A person.
Reference
No reference
Subclass of
Agent
Properties
For this entity the following properties are defined: identifier, name.
| Property | Range | Cardinality | Definition | Usage | Controlled vocabulary | Reference |
| --- | --- | --- | --- | --- | --- | --- |
| identifier | Identifier | 0..n | Links a resource to an adms:Identifier class. | Recommended to use ORCID for researchers | | P |
| name | langString | 1..n | A name for some thing. | This property can be repeated for parallel language versions. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. | | P |

Policy

Definition
A non-empty group of Permissions and/or Prohibitions.
Reference
DCAT-AP [E]
Properties
For this entity the following properties are defined: alternative, description, identifier, legal code, title.

N.B. The properties defined for the class "Policy" here are the additional ones recommended for LanguageDCAT-AP. Further to these, the "Policy" class must include a group of Permissions and/or Prohibitions, as stated in the above definition.

| Property | Range | Cardinality | Definition | Usage | Controlled vocabulary | Reference |
| --- | --- | --- | --- | --- | --- | --- |
| alternative | langString | 0..n | An alternative name for the resource | This property can be repeated for parallel language versions. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. | | P |
| description | langString | 1..n | An account of the resource | This property can be repeated for parallel language versions. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. | | P |
| identifier | Identifier | 0..n | Links a resource to an adms:Identifier class. | | | P |
| legal code | Resource | 0..1 | The URL of the legal text of a Licence. | | | P |
| title | langString | 1..n | A name given to the resource | This property can be repeated for parallel language versions. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. | | P |

Processing Resource

Definition
A set of requirements posed on the resource that is input for processing by a tool/service or that is output after the processing
Reference
No reference
Properties
For this entity the following properties are defined: detailed language, format, language, media type, modality type, processing resource type.
| Property | Range | Cardinality | Definition | Usage | Controlled vocabulary | Reference |
| detailed language | Language | 0..n | Specifies the language that is used in the resource or supported by the tool/service expressed according to the BCP47 standard [[rfc5646]] | This property takes a value in compliance with the BCP47 standard [[rfc5646]] and, thus, allows for the detailed description of the language (e.g., British English, Brazilian Portuguese, Greek written with the Latin script, etc.). It must be conformant with the value of "dct:language". | | P |
| format | Media Type Or Extent | 0..n | | dcatlds:mediaType SHOULD be used when the media type of the distribution is defined in IANA [IANA-MEDIA-TYPES], otherwise dcatlds:format MAY be used with different values. | EU Vocabularies File Type Named Authority List | P |
| language | Linguistic System | 0..n | A language of the resource. | This property is used for the language of the contents of a dataset and takes a value only for the language proper, without any reference to script, regional variant, etc. It caters for semantic interoperability with other catalogues, yet fails to capture the full language details that are required for language data. For a more detailed description, the "ms:language" property is preferred, but the two values must be aligned. | The EU Authority languages vocabulary | P |
| media type | Media Type | 0..n | The media type of the distribution as defined by IANA [IANA-MEDIA-TYPES]. | dcatlds:mediaType SHOULD be used when the media type of the distribution is defined in IANA [IANA-MEDIA-TYPES], otherwise dcatlds:format MAY be used with different values. | IANA-MEDIA-TYPES Vocabulary | P |
| modality type | Modality Type | 0..n | Specifies the media type of a language resource (the physical medium of the contents representation) or of the input/output of a language processing tool/service; each media type is described through a distinctive set of technical features; a language resource may consist of different media parts | | META-SHARE Modality Type Vocabulary | P |
| processing resource type | Processing Resource Type | 1..1 | The type of the resource that a tool/service takes as input or produces as output | | META-SHARE Processing Resource Types Vocabulary | P |
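
The alignment required between detailed language (a BCP47 tag) and language (the language proper) can be sketched as follows. This is a minimal Python illustration, not part of the profile: the function names are hypothetical, and the three-entry mapping to concepts of the EU Authority languages vocabulary is an excerpt hard-coded for the example, where a real implementation would load the full vocabulary.

```python
# Hypothetical excerpt mapping primary language subtags to concepts of the
# EU Authority languages vocabulary; hard-coded here only for illustration.
EU_LANGUAGE = {
    "en": "http://publications.europa.eu/resource/authority/language/ENG",
    "pt": "http://publications.europa.eu/resource/authority/language/POR",
    "el": "http://publications.europa.eu/resource/authority/language/ELL",
}


def primary_subtag(bcp47_tag):
    """Return the lower-cased primary language subtag of a BCP47 tag."""
    return bcp47_tag.split("-")[0].lower()


def languages_aligned(detailed_tags, language_iris):
    """Check that detailed tags and plain languages name the same languages."""
    expected = {EU_LANGUAGE.get(primary_subtag(tag)) for tag in detailed_tags}
    # An unknown primary subtag yields None and is treated as misaligned.
    return None not in expected and expected == set(language_iris)


portuguese = "http://publications.europa.eu/resource/authority/language/POR"
print(languages_aligned(["pt-BR"], [portuguese]))   # Brazilian Portuguese vs Portuguese
print(languages_aligned(["en-GB"], [portuguese]))   # British English vs Portuguese
```

The detailed tag keeps the script or regional information (e.g. "pt-BR", "el-Latn"), while only its primary subtag is compared with the value used for catalogue interoperability.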

Software Distribution

Definition
Any form with which software is distributed (e.g., web services, executable or code files, etc.)
Reference
No reference
Properties
For this entity the following properties are defined: access location, download location, execution location, has policy, licence, software distribution form.
| Property | Range | Cardinality | Definition | Usage | Controlled vocabulary | Reference |
| access location | anyURI | 0..1 | A URL where the resource can be accessed from; it can be used for landing pages or for cases where the resource is accessible via an interface, i.e. cases where the resource itself is not provided with a direct link for downloading | | | P |
| download location | anyURI | 0..1 | A URL where the language resource (mainly data but also downloadable software programmes or forms) can be downloaded from | | | P |
| execution location | anyURI | 0..1 | A URL where the resource (mainly software) can be directly executed | | | P |
| has policy | Policy | 1..1 | Identifies an ODRL Policy for which the identified Asset is the target Asset to all the Rules | This property is used for "policies", i.e. the machine-readable representation of the licensing terms under which a dataset/distribution is made available. The representation MUST be expressed in the ODRL vocabulary. See also LDS notes for LDS-specific implementation. | | P |
| licence | Licence Document | 0..1 | A legal document giving official permission to do something with the resource | This property refers to a "licence", i.e. a human readable text with legal code, with which the resource/distribution is made available. This property SHOULD refer to a concrete standard or proprietary licence, so that the data users can assess the licence conditions in human-readable format before using the data. See also LDS notes for LDS-specific implementation. | | P |
| software distribution form | Software Distribution Form | 1..1 | A form of software | | META-SHARE Vocabulary of Software Distribution Forms | P |
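
To make the cardinalities of this table concrete, the sketch below checks a Software Distribution description against them in plain Python. It is only an illustration: the dictionary-based record layout, the function name and the example.org URL are assumptions, and the normative constraints remain those of the profile and its SHACL shapes.

```python
# Cardinalities copied from the Software Distribution table;
# None would stand for an unbounded maximum ("n").
SOFTWARE_DISTRIBUTION = {
    "access location": (0, 1),
    "download location": (0, 1),
    "execution location": (0, 1),
    "has policy": (1, 1),
    "licence": (0, 1),
    "software distribution form": (1, 1),
}


def check_cardinalities(record, constraints):
    """Return messages for properties whose number of values is out of range."""
    errors = []
    for prop, (low, high) in constraints.items():
        count = len(record.get(prop, []))
        if count < low:
            errors.append(f"{prop}: expected at least {low}, found {count}")
        if high is not None and count > high:
            errors.append(f"{prop}: expected at most {high}, found {count}")
    return errors


# Hypothetical record: each property maps to a list of values.
incomplete = {
    "access location": ["https://example.org/tool"],
    "software distribution form": ["web service"],
}
errors = check_cardinalities(incomplete, SOFTWARE_DISTRIBUTION)
print(errors)
```

In this example the record lacks the mandatory has policy value, so a single message is reported for that property.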

Tool/Service

Definition
A tool/service/any piece of software that performs language processing and/or any Language Technology related operation.
Reference
No reference.
Properties
For this entity the following properties are defined: alternative, conforms to, description, domain, function, has policy, identifier, input content resource, IPR holder, is documented by, keyword, language dependent, language resource type, licence, other identifier, output resource, publisher, software distribution, title, version.
| Property | Range | Cardinality | Definition | Usage | Controlled vocabulary | Reference |
| alternative | langString | 0..n | An alternative name for the resource. | It is recommended to use "dct:title" for the full name of a resource and "dct:alternative" for the short name. This property can be repeated for parallel language versions. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. | | P |
| conforms to | Standard | 1..n | An established standard to which the described resource conforms | | META-SHARE Vocabulary of Standards and Best Practices | P |
| description | langString | 1..n | An account of the resource | This property can be repeated for parallel language versions. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. | | P |
| domain | Domain | 0..n | Identifies the domain according to which an entity is classified | | META-SHARE Vocabulary of Domains | P |
| function | Operation | 1..n | Specifies the operation/function/task that a software object performs | | OMTD-SHARE Vocabulary of Operations | P |
| has policy | Policy | 1..1 | Identifies an ODRL Policy for which the identified Asset is the target Asset to all the Rules | This property is used for "policies", i.e. the machine-readable representations of the licensing terms under which a dataset/distribution is made available. The representation MUST be expressed in the ODRL vocabulary. | | P |
| identifier | Literal | 0..1 | An unambiguous reference to the resource within a given context | The main identifier for the resource, e.g. the URI or other unique identifier in the context of the Catalogue. It MUST be automatically assigned by the system when adding the resource to the Catalogue. | | P |
| input content resource | Processing Resource | 1..n | Specifies the requirements set by a tool/service for the (content) resource that it processes | | | P |
| IPR holder | Agent | 0..n | A person or an organisation who holds the full Intellectual Property Rights (Copyright, trademark, etc.) that subsist in the resource. The IPR holder could be different from the creator that may have assigned the rights to the IPR holder (e.g., an author as a creator assigns her rights to the publisher who is the IPR holder) and the distributor that holds a specific licence (i.e. a permission) to distribute the work via a specific distributor. | The IPR holder may be identical in many cases with the Publisher of the resource (see property dct:publisher). In this case, the contact data MAY be copied from the corresponding Publisher entries. There might also be cases with non-identical entities: e.g., when one or several IPR Holders assign another entity as the Publisher responsible for producing, hosting and publishing a resource; such entities may be, for instance, a data distribution agency, or a specific partner representing a project consortium. The subproperty ms:iprHolder is preferred over dct:rightsHolder in order to differentiate it from other types of rights (e.g. distribution rights). | | P |
| is documented by | Document | 0..n | Links a language resource to a document (e.g., research paper describing its contents or its use in a project, user manual, etc.) or any other form of documentation (e.g., a URL with support information) that is related to the resource | | | P |
| keyword | langString | 1..n | A keyword or tag describing a resource | This property can be repeated for parallel language versions. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. | | P |
| language dependent | boolean | 1..1 | Indicates whether the operation of the tool or service is language dependent or not | | | P |
| language resource type | Language Resource Type | 1..1 | Specifies the type of a language resource | This element allows for a more fine-grained categorisation of datasets into 'corpora' (collections of data files in text, audio, video and/or image modality), 'lexical/conceptual resources' (e.g., lexica, vocabularies, gazetteers, terminological lexica, etc.) and 'grammars'. Note: In the META-SHARE ontology, models and grammars are grouped as "language descriptions" and included as subclasses of datasets. Given the evolution of Machine Learning Models, this has been revisited and models are included as a separate subclass distinguished from datasets. | META-SHARE Language Resource Type Vocabulary | P |
| licence | Licence Document | 0..1 | A legal document giving official permission to do something with the resource | This property refers to a "licence", i.e. a human readable text with legal code, with which the resource/distribution is made available. This property SHOULD refer to a concrete standard or proprietary licence, so that the data users can assess the licence conditions in human-readable format before using the data. | | P |
| other identifier | Identifier | 0..n | Links a resource to an adms:Identifier class. | This property MAY be used as an additional identifier for existing identifiers used for the same resource in other Catalogues (e.g. DOI, ISLRN, DataCite, Handle PIDs) | | P |
| output resource | Processing Resource | 0..n | Specifies the output results of a tool/service, i.e. the features of the processed (content) resource | | | P |
| publisher | Agent | 1..1 | An entity responsible for making the resource available | This property refers to the entity that "publishes", i.e. makes available to the specific platform, the corresponding resource. The information may be identical to the property ms:iprHolder of the resource. | | P |
| software distribution | Software Distribution | 1..n | An available distribution of the tool/service | | | P |
| title | langString | 1..n | A name given to the resource | This property can be repeated for parallel language versions. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. | | P |
| version | string | 1..1 | The version indicator (name or identifier) of a resource | This property MUST contain a version number or other version designation of the resource. It is recommended to follow W3C Data on the Web Best Practices [DWBP]. Version identifiers should enable comparison of versions and distinguishing major from minor versions, such as Semantic Versioning. | | P |
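
The usage note for version asks for identifiers that can be compared and that distinguish major from minor versions. A minimal Python sketch of such a comparison is given below, assuming version strings in the core "MAJOR.MINOR.PATCH" form of Semantic Versioning; the function names are hypothetical, and pre-release or build metadata suffixes are not handled.

```python
import re

# Core "MAJOR.MINOR.PATCH" form only; pre-release and build metadata
# suffixes of Semantic Versioning are deliberately left out of this sketch.
_CORE_VERSION = re.compile(r"^(\d+)\.(\d+)\.(\d+)$")


def parse_version(text):
    """Turn "MAJOR.MINOR.PATCH" into a tuple of integers that sorts correctly."""
    match = _CORE_VERSION.match(text)
    if match is None:
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {text!r}")
    return tuple(int(part) for part in match.groups())


def is_major_change(old, new):
    """True when the major component increases between two versions."""
    return parse_version(new)[0] > parse_version(old)[0]


# Tuples compare numerically, so 1.10.0 is newer than 1.9.3.
print(parse_version("1.10.0") > parse_version("1.9.3"))
print(is_major_change("0.9.3", "1.0.0"))
```

Comparing parsed tuples rather than raw strings avoids the alphabetical ordering in which "1.10.0" would sort before "1.9.3".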

Supportive entities

Supportive entities support the main entities of the Application Profile. They are included because they form the range of properties.

Annotation Type

Definition
Category/class of the annotations (metadata) that are added to the data that is processed
Reference
No reference
Properties
This specification does not impose any additional requirements to properties for this entity.

Anonymised

Definition
Indication of whether the language resource has been anonymised
Reference
No reference
Properties
This specification does not impose any additional requirements to properties for this entity.

Catalogue Record

Definition
A description of a Catalogued Resource's entry in the Catalogue.
Reference
DCAT-AP [A]
Properties
For this entity the following properties are defined: description, description version, modification date, primary topic.
| Property | Range | Cardinality | Definition | Usage | Controlled vocabulary | Reference |
| description | langString | 0..n | A free-text account of the record. | This property can be repeated for parallel language versions. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. | | DCAT-AP [E] |
| description version | Literal | 0..1 | It refers to the version of the description. '1' for original version. | | | MLDCAT-AP [A] |
| modification date | Temporal Literal | 1..1 | The most recent date on which the Catalogue entry was changed or modified. | | | DCAT-AP [A] |
| primary topic | Catalogued Resource | 1..1 | A link to the Dataset, Data service or Catalog described in the record. | A catalogue record will refer to one entity in a catalogue. This can be either a Dataset or a Data Service. To ensure an unambiguous reading of the cardinality, the range is set to Catalogued Resource. However, this range is not intended to require the explicit use of the class Catalogued Resource. As it is an abstract class, a subclass should be used. | | DCAT-AP [A] |

Catalogued Resource

Definition
Resource published or curated by a single agent.
Reference
DCAT-AP [A]
Usage Note
Abstract class for DCAT-AP. Therefore only subclasses should be used in a data exchange.
Properties
This specification does not impose any additional requirements to properties for this entity.

Concept

Definition
An idea or notion; a unit of thought.
Reference
DCAT-AP [A]
Properties
This specification does not impose any additional requirements to properties for this entity.

Corpus Subclass

Definition
A classification of corpora into types (used for descriptive reasons)
Reference
No reference
Properties
This specification does not impose any additional requirements to properties for this entity.

Data Protection Principle

Definition
Foundational rule that dictates how personal data should be handled to ensure its integrity and confidentiality
Reference
No reference
Properties
This specification does not impose any additional requirements to properties for this entity.

Document

Definition
A document.
Reference
No reference
Properties
For this entity the following properties are defined: citation text, identifier.
| Property | Range | Cardinality | Definition | Usage | Controlled vocabulary | Reference |
| citation text | langString | 1..n | The text with which a document or language resource can be cited (typically the full citation, incl. title, authors, publisher, etc.) | This property can be repeated for parallel language versions. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. | | P |
| identifier | Identifier | 0..n | Links a resource to an adms:Identifier class. | It is recommended to add the DOI or a URL where the document can be accessed | | P |

Domain

Definition
A particular field of thought, activity, or interest related to a language resource, organisation or person activities, etc.
Reference
No reference
Properties
This specification does not impose any additional requirements to properties for this entity.

Identifier

Definition
This is based on the UN/CEFACT Identifier class. It consists of: a content string which is the identifier; an optional identifier for the identifier scheme; an optional identifier for the version of the identifier scheme; an optional identifier for the agency that manages the identifier scheme.
Reference
DCAT-AP [A]
Properties
For this entity the following properties are defined: notation, schema agency.
| Property | Range | Cardinality | Definition | Usage | Controlled vocabulary | Reference |
| notation | string | 1..1 | A string that is an identifier in the context of the identifier scheme referenced by its datatype. | It MAY be used to display the identifier as recommended in the original scheme (e.g. the CrossRef and DataCite display guidelines recommend displaying DOIs as a full URL link in the form "https://doi.org/10.xxxx/xxxxx/"). The rdfs:Literal must be typed (e.g. ^^anyURI) | | P |
| schema agency | langString | 0..n | The name of the agency that issued the identifier. | It MAY be used to represent the authority that defines the identifier scheme (e.g. the DOI foundation) when the authority has an IRI associated to it. This property can be repeated for parallel language versions. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. | | P |
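
The display recommendation for DOIs in the notation property can be sketched in Python as follows. The function names, the accepted input prefixes and the placeholder DOI "10.1234/abcd" are assumptions made for this illustration; the dictionary returned by typed_notation merely mimics a JSON-LD typed value.

```python
XSD_ANY_URI = "http://www.w3.org/2001/XMLSchema#anyURI"

# Common ways of writing a DOI that this sketch accepts as input.
_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def doi_as_url(doi):
    """Return a DOI in the full "https://doi.org/..." display form."""
    value = doi.strip()
    lowered = value.lower()
    for prefix in _PREFIXES:
        if lowered.startswith(prefix):
            value = value[len(prefix):]
            break
    if not value.startswith("10."):
        raise ValueError(f"not a DOI: {doi!r}")
    return "https://doi.org/" + value


def typed_notation(doi):
    """Pair the display form with the anyURI datatype, as a typed literal."""
    return {"@value": doi_as_url(doi), "@type": XSD_ANY_URI}


print(doi_as_url("doi:10.1234/abcd"))      # placeholder DOI, not a real one
print(typed_notation("10.1234/abcd"))
```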

Language Code

Definition
Language as defined for use as the first part in the BCP47 recommendation (taken from ISO 639)
Reference
No reference
Properties
This specification does not impose any additional requirements to properties for this entity.

Language Resource Type

Definition
A classification of language resources into types (e.g., corpus, lexical/conceptual resource, grammar)
Reference
No reference
Properties
This specification does not impose any additional requirements to properties for this entity.

Lexical/Conceptual Resource Subclass

Definition
A classification of lexical/conceptual resources into types (used for descriptive reasons)
Reference
No reference
Properties
This specification does not impose any additional requirements to properties for this entity.

Linguality Type

Definition
Classification of a resource (part) based on the number of languages it includes
Reference
No reference
Properties
This specification does not impose any additional requirements to properties for this entity.

Linguistic System

Definition
A system of signs, symbols, sounds, gestures, or rules used in communication, e.g. a language.
Reference
DCAT-AP [A]
Properties
This specification does not impose any additional requirements to properties for this entity.

Location

Definition
A spatial region or named place.
Reference
DCAT-AP [A]
Properties
This specification does not impose any additional requirements to properties for this entity.

Media Type

Definition
A file format or physical medium.
Reference
DCAT-AP [A]
Properties
This specification does not impose any additional requirements to properties for this entity.

Media Type Or Extent

Definition
A media type or extent.
Reference
DCAT-AP [A]
Properties
This specification does not impose any additional requirements to properties for this entity.

Microstructure Feature

Definition
Detailed account of the linguistic information and layout features contained in the dictionary microstructure
Reference
No reference
Properties
This specification does not impose any additional requirements to properties for this entity.

Modality Type

Definition
A classification of language resources or language resource parts based on the physical medium they are available in (e.g., text, video, audio)
Reference
No reference
Properties
This specification does not impose any additional requirements to properties for this entity.

Model Type

Definition
A classification of models into types based on their algorithm
Reference
No reference
Properties
This specification does not impose any additional requirements to properties for this entity.

Multilinguality Type

Definition
A classification of a bi/multilingual resource (part) into parallel, comparable or mixed
Reference
No reference
Properties
This specification does not impose any additional requirements to properties for this entity.

Operation

Definition
The operation/task/function performed by a tool/service or model
Reference
No reference
Properties
This specification does not impose any additional requirements to properties for this entity.

Period of Time

Definition
An interval of time that is named or defined by its start and end dates.
Reference
Properties
For this entity the following properties are defined: end date, start date.
| Property | Range | Cardinality | Definition | Usage | Controlled vocabulary | Reference |
| end date | date | 1..1 | The end of the period | | | DCAT-AP [E] |
| start date | date | 1..1 | The start of the period | Please note that while both properties are recommended, one of the two must be present for each instance of the class dct:PeriodOfTime, if such an instance is present. NOTE: gYear or dateTime for LDS | | DCAT-AP [E] |
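
The usage note above (at least one of the two dates must be present) can be illustrated with a short Python sketch. The function name is hypothetical, dates are assumed to be given as xsd:date strings ("YYYY-MM-DD"), and the check that the start does not follow the end is an additional sanity check assumed for the example, not a rule stated in this profile.

```python
from datetime import date


def check_period(start=None, end=None):
    """Return a list of problems for a Period of Time given as ISO date strings."""
    if start is None and end is None:
        return ["a period needs a start date, an end date, or both"]
    if start is not None and end is not None:
        # Extra sanity check (an assumption of this sketch): start <= end.
        if date.fromisoformat(start) > date.fromisoformat(end):
            return ["start date is later than end date"]
    return []


print(check_period(start="2020-01-01"))
print(check_period(start="2021-01-01", end="2020-01-01"))
print(check_period())
```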

Personal Data Included

Definition
Specification of whether the language resource contains personal data in the sense of article 4(1) of the General Data Protection Regulation (Regulation (EU) 2016/679)
Reference
No reference
Properties
This specification does not impose any additional requirements to properties for this entity.

Processing Resource Type

Definition
The type of the resource that a tool/service takes as input or produces as output
Reference
No reference
Properties
This specification does not impose any additional requirements to properties for this entity.

Resource

Definition
Anything described by RDF.
Reference
DCAT-AP [A]
Properties
This specification does not impose any additional requirements to properties for this entity.

Script

Definition
The alphabet (set of letters or characters in which a certain language is written) in compliance with the BCP47 recommendation
Reference
No reference
Properties
This specification does not impose any additional requirements to properties for this entity.

Size

Definition
The size of the resource, expressed as a number of units of the given Size Unit.
Reference
No reference
Properties
For this entity the following properties are defined: amount, size unit.
| Property | Range | Cardinality | Definition | Usage | Controlled vocabulary | Reference |
| amount | float | 1..1 | Specifies the number of units that constitute anything that can be measured (e.g. size of a data resource or cost, etc.) | | | P |
| size unit | Size Unit | 1..1 | Specifies the unit that is used when providing information on the size of the resource or of resource parts | | META-SHARE Size Unit Vocabulary | P |

Size Unit

Definition
The unit of measurement used for determining and describing the size of a resource (part)
Reference
No reference
Properties
This specification does not impose any additional requirements to properties for this entity.

Software Distribution Form

Definition
The medium, delivery channel or form (e.g., source code, API, web service, etc.) through which a software object is distributed
Reference
No reference
Properties
This specification does not impose any additional requirements to properties for this entity.

Special Category Data Included

Definition
Specification of whether the language resource contains special category data covered in Article 9(1) of the General Data Protection Regulation (Regulation (EU) 2016/679)
Reference
No reference
Properties
This specification does not impose any additional requirements to properties for this entity.

Standard

Definition
A reference point against which other things can be evaluated or compared.
Reference
DCAT-AP [A]
Properties
This specification does not impose any additional requirements to properties for this entity.

Technical Organisational Measure

Definition
Technical and Organisational measures used to safeguard and ensure good practices in connection with data and technologies
Reference
No reference
Properties
This specification does not impose any additional requirements to properties for this entity.

Variant

Definition
The variant for a language according to the BCP47 recommendation
Reference
No reference
Properties
This specification does not impose any additional requirements to properties for this entity.

Datatypes

The following datatypes are used within this specification.

| Class | Definition |
| anyURI | anyURI represents an Internationalized Resource Identifier Reference (IRI). An anyURI value can be absolute or relative, and may have an optional fragment identifier (i.e., it may be an IRI Reference). This type should be used when the value fulfills the role of an IRI, as defined in [[rfc3987]] or its successor(s) in the IETF Standards Track. |
| boolean | Boolean has the value space required to support the mathematical concept of binary-valued logic: {true, false} |
| date | The value space of date consists of top-open intervals of exactly one day in length on the timelines of dateTime, beginning on the beginning moment of each day (in each timezone), i.e. '00:00:00', up to but not including '24:00:00' (which is identical with '00:00:00' of the next day). |
| float | It represents IEEE single-precision 32-bit floating-point numbers. It supports decimal notation, scientific notation, and special values like positive and negative infinity (INF, -INF) and not-a-number (NaN). |
| integer | Integer is derived from decimal by fixing the value of fractionDigits to be 0 and disallowing the trailing decimal point. This results in the standard mathematical concept of the integer numbers. |
| langString | The datatype of language-tagged string values. |
| Literal | A literal value such as a string or integer; Literals may be typed, e.g. as a date according to xsd:date. Literals that contain human-readable text have an optional language tag as defined by BCP47 [[rfc5646]]. |
| nonNegativeInteger | Number derived from integer by setting the value of minInclusive to be 0. |
| string | Any character string in XML. |
| Temporal Literal | rdfs:Literal encoded using the relevant [[ISO8601]] Date and Time compliant string and typed using the appropriate XML Schema datatype (xsd:gYear, xsd:gYearMonth, xsd:date, or xsd:dateTime). |
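
Choosing "the appropriate XML Schema datatype" for a Temporal Literal can be sketched as a lookup on the shape of the string. The Python function below is a simplified illustration with hypothetical names; its regular expressions cover only the common lexical forms of the four datatypes and do not validate calendar values (e.g. month 13 would not be rejected).

```python
import re

# Ordered from the coarsest to the finest granularity.
_PATTERNS = [
    ("xsd:gYear", re.compile(r"^\d{4}$")),
    ("xsd:gYearMonth", re.compile(r"^\d{4}-\d{2}$")),
    ("xsd:date", re.compile(r"^\d{4}-\d{2}-\d{2}$")),
    (
        "xsd:dateTime",
        re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})?$"),
    ),
]


def temporal_datatype(value):
    """Return the XSD datatype name matching the shape of an ISO 8601 string."""
    for name, pattern in _PATTERNS:
        if pattern.match(value):
            return name
    raise ValueError(f"unsupported temporal value: {value!r}")


for sample in ("2024", "2024-05", "2024-05-17", "2024-05-17T10:30:00Z"):
    print(sample, temporal_datatype(sample))
```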

Controlled Vocabularies

Requirements for controlled vocabularies

LanguageDCAT-AP adopts the requirements set in DCAT-AP for the recommendation of controlled vocabularies. According to these, controlled vocabularies SHOULD:

The following two requirements identified for DCAT-AP are considered of less importance for LanguageDCAT-AP:

Controlled vocabularies to be used

LanguageDCAT-AP specifies a number of controlled vocabularies that MUST be used for specific properties in order to increase semantic interoperability. These vocabularies are selected from among those recommended in DCAT-AP and MLDCAT-AP, and those used in the language data/technologies and neighbouring communities. In the former case, they are controlled by an EU institution, while in the latter case, they are published and maintained by the LDS metadata working group. Requests for changes and additions to the LDS-controlled vocabularies can be made by adding a new issue at https://github.com/LanguageDCAT-AP/LanguageDCAT-AP/issues.

The following table lists the properties that take values from controlled vocabularies, as well as the classes for which these properties are used. The range of each property is defined as a class (cf. column "Range"), and each value is also a skos:Concept belonging to the specified controlled vocabulary.

| Property | Used for class | Range | Definition of Class (Range) | Controlled vocabulary name |
| annotation type | Corpus | Annotation Type | Category/class of the annotations (metadata) that are added to the data that is processed | OMTD-SHARE Annotation Type Vocabulary |
| anonymised | Corpus, LCR, Model | Anonymised | Indication of whether the language resource has been anonymised | META-SHARE Anonymised Vocabulary |
| conforms to | Corpus, LCR, Model | Standard | A reference point against which other things can be evaluated or compared. | META-SHARE Vocabulary of Standards and Best Practices |
| corpus subclass | Corpus | Corpus Subclass | A classification of corpora into types (used for descriptive reasons) | META-SHARE Corpus Subclass Vocabulary |
| data protection principle applied | Corpus, LCR, Model | Data Protection Principle | Foundational rule that dictates how personal data should be handled to ensure its integrity and confidentiality | META-SHARE Data Protection Principle Vocabulary |
| domain | Corpus, LCR, Model | Domain | A particular field of thought, activity, or interest related to a language resource, organisation or person activities, etc. | META-SHARE Vocabulary of Domains |
| format | Distribution, Processing Resource | Media Type Or Extent | A media type or extent. | EU Vocabularies File Type Named Authority List |
| has technical organisational measure | Corpus, LCR, Model | Technical Organisational Measure | Technical and Organisational measures used to safeguard and ensure good practices in connection with data and technologies | META-SHARE Technical and Organisational Measure Vocabulary |
| language code | Language | Language Code | Language as defined for use as the first part in the BCP47 recommendation (taken from ISO 639) | META-SHARE Language Vocabulary |
| language | Corpus, LCR, Model | Linguistic System | A system of signs, symbols, sounds, gestures, or rules used in communication, e.g. a language. | The EU Authority languages vocabulary |
| language resource type | Corpus, LCR, Model | Language Resource Type | A classification of language resources into types (e.g., corpus, lexical/conceptual resource, grammar) | META-SHARE Language Resource Type Vocabulary |
| lexical/conceptual resource subclass | LCR | Lexical/Conceptual Resource Subclass | A classification of lexical/conceptual resources into types (used for descriptive reasons) | META-SHARE Lexical Subclasses Vocabulary |
| linguality type | Corpus, LCR, Model | Linguality Type | Classification of a resource (part) based on the number of languages it includes | META-SHARE Linguality Type Vocabulary |
| linguistic information | LCR | Microstructure Feature | Detailed account of the linguistic information and layout features contained in the dictionary microstructure | LEXMETA Vocabulary for Microstructure Features |
| media type | Distribution | Media Type | A file format or physical medium. | IANA-MEDIA-TYPES Vocabulary |
| modality type | Corpus, LCR, Model | Modality Type | A classification of language resources or language resource parts based on the physical medium they are available in (e.g., text, video, audio) | META-SHARE Modality Type Vocabulary |
| model function | Model | Operation | The operation/task/function performed by a tool/service or model | OMTD-SHARE Vocabulary of Operations |
| model type | Model | Model Type | A classification of models into types based on their algorithm | META-SHARE Vocabulary of Model Types |
| multilinguality type | Corpus, LCR | Multilinguality Type | A classification of a bi/multilingual resource (part) into parallel, comparable or mixed | META-SHARE Multilinguality Type Vocabulary |
| package format | Distribution | Media Type | A file format or physical medium. | OMTD-SHARE Vocabulary of Package Formats |
| personal data included | Corpus, LCR, Model | Personal Data Included | Specification of whether the language resource contains personal data in the sense of article 4(1) of the General Data Protection Regulation (Regulation (EU) 2016/679) | META-SHARE Personal Data Included Vocabulary |
| processing resource type | Processing Resource | Processing Resource Type | The type of the resource that a tool/service takes as input or produces as output | META-SHARE Processing Resource Types Vocabulary |
| region | Language | Location | A spatial region or named place. | META-SHARE Location Vocabulary |
| script | Language | Script | The alphabet (set of letters or characters in which a certain language is written) in compliance with the BCP47 recommendation | META-SHARE Script Vocabulary |
| size unit | Size | Size Unit | The unit of measurement used for determining and describing the size of a resource (part) | META-SHARE Size Unit Vocabulary |
| software distribution form | Software Distribution | Software Distribution Form | The medium, delivery channel or form (e.g., source code, API, web service, etc.) through which a software object is distributed | META-SHARE Vocabulary of Software Distribution Forms |
| spatial | Corpus, LCR, Model | Location | A spatial region or named place. | The EU Vocabularies Continents Named Authority List for continents, countries, places and, if not covered, GeoNames |
| special category data included | Corpus, LCR, Model | Special Category Data Included | Specification of whether the language resource contains special category data covered in Article 9(1) of the General Data Protection Regulation (Regulation (EU) 2016/679) | META-SHARE Special Category Data Included Vocabulary |
| variant | Language | Variant | The variant for a language according to the BCP47 recommendation | META-SHARE Variants Vocabulary |

Support for implementation

This section provides support for implementing LanguageDCAT-AP.

JSON-LD context file

One common technical question is the format in which the data is exchanged. For conformance with LanguageDCAT-AP, it is not mandatory that this happens in an RDF serialisation, but the exchanged format SHOULD be unambiguously transformable into RDF. For JSON, a popular format for exchanging data between systems, a JSON-LD context file is provided, following the same approach as DCAT-AP and MLDCAT-AP. JSON-LD is a W3C Recommendation (JSON-LD 1.1) that provides a standard approach to interpreting JSON structures as RDF. The provided JSON-LD context file can be used by implementers. This JSON-LD context is not normative, i.e. other JSON-LD contexts are allowed.

The JSON-LD context file can be downloaded here.
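
To illustrate what a context does, the following minimal Python sketch expands the keys of a plain JSON record into full IRIs using a small context mapping. The context, the record and the helper `expand_keys` are hypothetical and cover only simple term-to-IRI mappings; they are not the provided LanguageDCAT-AP context file, and a full JSON-LD 1.1 processor does considerably more (for example, type coercion and nested contexts).

```python
import json

# Hypothetical, heavily simplified context: each JSON key maps to one IRI.
# The real LanguageDCAT-AP context file is richer and is not reproduced here.
CONTEXT = {
    "title": "http://purl.org/dc/terms/title",
    "language": "http://purl.org/dc/terms/language",
    "distribution": "http://www.w3.org/ns/dcat#distribution",
}


def expand_keys(node, context):
    """Recursively replace known JSON keys with the IRIs from the context.

    Keys that the context does not define are dropped, mirroring the way
    JSON-LD expansion ignores terms that have no mapping.
    """
    if isinstance(node, dict):
        return {
            context[key]: expand_keys(value, context)
            for key, value in node.items()
            if key in context
        }
    if isinstance(node, list):
        return [expand_keys(item, context) for item in node]
    return node


# Illustrative record as it might be exchanged between two systems.
record = json.loads(
    '{"title": "Example corpus", "internalNote": "draft",'
    ' "distribution": [{"title": "TSV export"}]}'
)
expanded = expand_keys(record, CONTEXT)
print(json.dumps(expanded, indent=2))
```

Because the keys are resolved to IRIs, two systems that use different JSON key names but map them to the same IRIs in their contexts end up with the same RDF properties.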

Validation

To verify whether the data is (technically) conformant to LanguageDCAT-AP, the exchanged data can be validated using the provided SHACL shapes. SHACL is a W3C Recommendation for expressing constraints on an RDF knowledge graph.

To support checking whether a catalogue satisfies the constraints expressed in this Application Profile, the constraints in this specification are expressed using SHACL [[shacl]]. Every constraint in this specification that could be converted into a SHACL expression has been included. Note, however, that the SHACL shapes have been implemented and tested for the purposes of the Language Data Space and may include constraints specific to the Data Spaces requirements (cf. Section Context). This collection of SHACL expressions can therefore be used to build a validation check for data, but with caution.

It is up to the implementers to define the validation they expect. Each implementation happens within a context, and that context is beyond the SHACL expressions here.
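
As an illustration of the kind of constraint such shapes express, the sketch below checks a few minimum and maximum cardinalities in plain Python. It is a stand-in for a SHACL engine, not a replacement for one: the record layout and the `check_cardinalities` helper are hypothetical, and the cardinalities are taken from the LanguageDCAT-AP column of the table in the section Notes for implementation in the Language Data Space (language 1..n, licence 1..1, distribution 1..n).

```python
# (min, max) cardinalities per property; None means unbounded ("n").
# Key names are illustrative and do not come from the SHACL files.
CARDINALITIES = {
    "language": (1, None),
    "licence": (1, 1),
    "distribution": (1, None),
}


def check_cardinalities(record, cardinalities):
    """Return a list of human-readable violations for one record."""
    violations = []
    for prop, (min_count, max_count) in cardinalities.items():
        values = record.get(prop, [])
        # Treat a single value as a one-element list.
        if not isinstance(values, list):
            values = [values]
        count = len(values)
        if count < min_count:
            violations.append(f"{prop}: expected at least {min_count}, found {count}")
        if max_count is not None and count > max_count:
            violations.append(f"{prop}: expected at most {max_count}, found {count}")
    return violations


# Hypothetical corpus description with a missing licence.
corpus = {"language": ["el", "en"], "licence": [], "distribution": ["dist-1"]}
report = check_cardinalities(corpus, CARDINALITIES)
for violation in report:
    print(violation)
```

In an actual deployment the same checks would come from running the provided SHACL shapes with a SHACL processor; which shapes to enforce remains a decision of the implementer, as noted above.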

The shapes can be found here and they are structured in three files (in Turtle format) as follows:

N.B. The files also contain SHACL rules that are specific to the Language Data Space implementation (see Section Notes for the LDS implementation); the comment "Specific to LDS" is used to distinguish them.

Profile in Turtle format

All classes, properties and individuals used in the LanguageDCAT-AP profile are available in a file in Turtle format here.

Examples

Acknowledgements

The LanguageDCAT-AP profile builds upon work carried out in the framework of various initiatives, the most notable of which is the Linked Data for Language Technology (LD4LT) W3C Community Group, and infrastructural projects (META-SHARE, CLARIN, CLARIN-EL, OpenMinTeD, European Language Grid), and has been consolidated in the Language Data Space. We would like to acknowledge all persons who have contributed to this work:

Victoria Arranz, Sophie Aubin, Khalid Choukri, Philipp Cimiano, Miltos Deligiannis, Elina Desypri, Victor Rodriguez Doncel, Richard Eckart de Castilho, Gil Francopoulo, Francesca Frontini, Dimitris Galanis, Maria Gavriilidou, Maria Giagkou, Katerina Gkirtzou, Jorge Gracia, the LD4LT Community Group contributors, the META-SHARE metadata working group, Christianne Klaes, Petr Knoth, David Lindemann, Valerie Mapelli, John P. McCrae, Monica Monacchini, Claire Nedellec, Stelios Piperidis, Claudia Soria, Kossay Talmoudi, Marta Villegas, Leon Voukoutis.

We also gratefully acknowledge the guidance and feedback provided by the SEMIC group, and especially Pavlina Fragkou, Anastasia Sofou, Emidio Stani, and Ine Weyts, in our most recent endeavours.

Notes for implementation in the Language Data Space

For reasons of compatibility with the Dataspace Protocol (DSP), the recommended standard for the exchange of datasets in all data spaces, and with the implementation supported by the EDC (Eclipse Dataspace Components) connector, which lies at the heart of the LDS infrastructure, a set of deviations from the profile is deemed indispensable. These concern (a) properties used exclusively in the LDS context, hence not presented in this documentation, and realised through distinct SHACL rules, and (b) properties that appear in the profile with specific features that are bypassed or handled differently in LDS. The discrepancies that these specificities introduce with respect to DCAT-AP and MLDCAT-AP are presented in the following table.

| Property | DCAT-AP & MLDCAT-AP | Dataspace Protocol | LanguageDCAT-AP | LDS implementation | Recommendations for import to LDS |
|---|---|---|---|---|---|
| access service | Property attached to Distribution with cardinality 0..n, intended for data services that give access to the distribution of the dataset | Each Distribution object MUST have at least one DataService which specifies where the distribution is obtained. Specifically, a DataService specifies the endpoint for initiating a Contract Negotiation and Transfer Process. A DataService.endpointURL property contains the URL of the service the Contract Negotiation endpoints extend. | Property attached to Distribution with cardinality 0..n | Following the DSP specifications, the DataService is used for the representation of the Connector endpoint and is automatically created and kept as is. When exporting outside LDS, this is not exported for security purposes. Other types of Data Services are not currently supported. | n/a |
| distribution | Dataset has cardinality 1..n for the property distribution | A Dataset MUST hold at least one Distribution object in the distribution attribute. | Subclasses of Dataset have cardinality 1..n for the property distribution | Due to implementation restrictions of the EDC version currently used for the LDS Connector, only one Distribution is allowed per Dataset. For export outside LDS, Datasets with multiple Policies will be exported as one Dataset with distinct Distributions, each of them corresponding to the distinct Policies. | Different Distributions of a Dataset (and its subclasses) must be represented as different Datasets. If the distinct Distributions correspond to different Policies used for the same Dataset, they can also be represented as a single Dataset with multiple Policies. |
| distribution | Not on Catalogue | Optional property on Catalogue | Not added | At export stage outside LDS, this is not exported. | n/a |
| format | dct:format with cardinality 0..1 for the file format | The format property is a format specified by a Distribution for the Dataset associated with the Agreement. This is generally obtained from the Provider's Catalog. dct:format used for the transfer protocol; examples are HttpData-PULL, HttpData-PUSH, AmazonS3-PULL | A new property dcatlds:format is introduced which takes values from the EU vocabulary of filetypes with cardinality 0..n; dct:format is not used at all | The dct:format is used internally according to the DSP specification. The dcatlds:format is used for the file format | Use the property dcatlds:format instead of dct:format |
| hasPolicy | Property attached to Distribution with cardinality 0..1 | Property attached to Dataset with cardinality 1..n | Property attached to both Dataset subclasses (as well as on Model and Tool/Service) and Distribution (and SoftwareDistribution) | Due to implementation restrictions of the EDC version currently used for the LDS Connector, when connectors exchange catalogues of Datasets (Models and Tool/Services), Datasets are represented with multiple policies. At export stage outside LDS, these will be represented as distinct Datasets or a single Dataset with multiple Distributions, each with a single policy, and the policy will be copied on Distribution and SoftwareDistribution | When importing to LDS, move or copy hasPolicy from Distribution and SoftwareDistribution to Dataset subclasses, Model and Tool/Service respectively. See also recommendation for Datasets with multiple Distributions above for the treatment of cardinality. |
| language | only dct:language property used with controlled vocabulary EU authorities | n/a | Two properties: dct:language and ms:language with cardinality 1..n which must be aligned | Only the ms:language/language is mandatory. The dct:language is automatically computed based on ms:language. If a value is not included in the EU vocabulary, it is mapped to the value "unmapped". | n/a |
| licence | Property attached to Distribution with cardinality 0..1 | n/a | Property attached to both Dataset subclasses (Corpus and LCR), Model and Tool/Service, as well as on Distribution, SoftwareDistribution attached with cardinality 1..1 | At export, LDS aligns licence with hasPolicy and copies both properties on Distribution (and SoftwareDistribution respectively). In case of multiple policies and licences, these are represented as distinct entries (multiple Datasets or single Dataset with multiple Distributions) with a single policy-licence pair. | When importing to LDS, move or copy license-policy pair from Distribution and SoftwareDistribution to Dataset subclasses, Model and Tool/Service respectively. |
| mediaType | dcat:mediaType used for IANA media types with cardinality 0..1 | n/a | A new property dcatlds:mediaType is introduced with cardinality 0..n as a temporary solution for SEMIC interoperability purposes | dcatlds: automatically created for IANA media types with cardinality 0..n | Use dcatlds:mediaType instead of dcat:mediaType |
| service | Property attached to Catalogue with cardinality 0..n, intended for the description of DataServices that are offered by a publisher. | A Catalog MUST have one to many Data Services that reference a Connector where Datasets MAY be obtained. | Property attached to Catalog with cardinality 0..n | Following the DSP specifications, the DataService is used for the representation of the Connector endpoint and is automatically created and kept as is. | n/a |
| softwareDistribution | n/a | n/a | Tools/Services have cardinality 1..n for the property softwareDistribution | Given that models and tools/services are treated as assets and, therefore, Datasets, the same mechanism as for distribution is used. | Different SoftwareDistributions of a Tool/Service are represented as different Tools/Services. |
| type | DCAT-AP caters for datasets only (dcat:Dataset); MLDCAT-AP introduces it6:Model as a subclass of dcat:Resource; services are included in both as dcat:DataService, subclass of dcat:Resource. | Caters only for datasets (dcat:Dataset); dcat:DataService used for the documentation of connectors. | Recognises corpora and lexica/conceptual resources as dcat:Dataset and models as distinct entities subclass of dcat:Resource, and introduces NLP and AI tools/services as distinct entities (subclass of dcat:Resource); dcat:DataService can be used for connectors and other types of data services (e.g., SPARQL endpoints) | In order to use the EDC functionalities, all assets included in the catalogue must be of type dcat:Dataset. Thus, for import into LDS, we require the same for all resource types, including models and tools/services. The distinction is made with the lrtype property. | Encode all resource types as "dcat:Dataset". |
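
The import recommendations in the last column of the table can be sketched as a small transformation. The Python example below is illustrative only: the dictionary layout, the key names and the `prepare_for_lds` helper are assumptions made for this sketch (the profile does not prescribe this JSON structure), and storing the original type as the `lrtype` value is likewise an assumption. It applies four of the recommendations: every resource is typed as dcat:Dataset, each Distribution becomes its own Dataset, hasPolicy and licence are moved from the Distribution to the Dataset, and dct:format and dcat:mediaType are renamed to dcatlds:format and dcatlds:mediaType.

```python
import copy

# Renames recommended in the table for import to LDS.
RENAMES = {"dct:format": "dcatlds:format", "dcat:mediaType": "dcatlds:mediaType"}
# Properties moved from the Distribution up to the Dataset on import.
MOVED_UP = ("hasPolicy", "licence")


def prepare_for_lds(resource):
    """Split one resource into LDS-style Datasets, one per Distribution.

    The input is left unmodified; a new list of dictionaries is returned.
    """
    datasets = []
    for original_dist in resource.get("distribution", []):
        dist = copy.deepcopy(original_dist)
        dataset = {
            key: copy.deepcopy(value)
            for key, value in resource.items()
            if key != "distribution"
        }
        # All assets are encoded as dcat:Dataset; the original type is kept
        # in lrtype (the value format used here is an assumption).
        dataset["lrtype"] = resource["type"]
        dataset["type"] = "dcat:Dataset"
        for prop in MOVED_UP:
            if prop in dist:
                dataset[prop] = dist.pop(prop)
        for old, new in RENAMES.items():
            if old in dist:
                dist[new] = dist.pop(old)
        dataset["distribution"] = [dist]
        datasets.append(dataset)
    return datasets


# Hypothetical corpus with two Distributions, each under its own policy.
corpus = {
    "type": "Corpus",
    "title": "Example corpus",
    "distribution": [
        {"dct:format": "ZIP", "hasPolicy": "policy-a", "licence": "licence-a"},
        {"dcat:mediaType": "application/json", "hasPolicy": "policy-b", "licence": "licence-b"},
    ],
}
for dataset in prepare_for_lds(corpus):
    print(dataset)
```

The sketch always takes the first option from the distribution row (different Distributions become different Datasets). Where the Distributions differ only in their Policies, the table also allows a single Dataset with multiple Policies, which this example does not implement.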

Overview of changes

Changes from the previous version are: