Abstract

LanguageDCAT-AP caters for the description of language data, language models and language technology services.

These are also referred to as "language resources" or "language resources and technologies". As defined in the META-SHARE ontology, a "language resource" is "a resource composed of linguistic material used in the construction, improvement and/or evaluation of language processing applications, but also, in a broader sense, in language and language-mediated research studies and applications; the term is used with a broader meaning, encompassing (a) data sets (textual, multimodal/multimedia and lexical data, grammars, language models, etc.) in machine readable form, and (b) tools/technologies/services used for their processing and management."

Work on LanguageDCAT-AP was initiated in the Common European Language Data Space (LDS), but it is based on a long history of activities carried out in the context of infrastructures intended for the sharing of language resources. LanguageDCAT-AP is, thus, designed with a wider scope covering broader description needs in the Language Data, Language Technology and Language-centric Artificial Intelligence domain.

Introduction

LanguageDCAT-AP is conceived and implemented as an extension of (a) the DCAT-AP application profile for Catalogues containing Datasets and Data Services in Europe [[DCAT-AP]] and (b) the MLDCAT-AP profile for machine learning models and their datasets. It also builds upon the META-SHARE ontology and the vocabularies associated with it, such as the OMTD-SHARE ontology, especially for domain-specific concepts and properties. To cover language data specificities, it adopts classes and properties from these vocabularies, introduces new ones where needed, and proposes domain-specific controlled vocabularies.

The current version (v0.9.2) adds language technology services (Natural Language Processing and Language-centric AI) to the resource types already supported: language models and datasets, the latter more specifically corpora and lexical/conceptual resources (see the Terminology section). All resource types come with a minimal set of attributes that help consumers discover them; the full set of attributes is work in progress and will be added in future versions.

Licence

All material in this repository is published under the CC-BY 4.0 licence, unless explicitly stated otherwise.

Context

One of the main objectives of LanguageDCAT-AP is to cover the requirements of exchanging language data and models in the context of the Common European Language Data Space; as such, it takes into account the building principles and specifications of data spaces. The most notable ones are the Data Space Protocol (DSP) and the IDS Reference Architecture Model (RAM, version 4 at the time of writing). Other data space initiatives, especially Gaia-X and its specifications (cf. the Gaia-X architecture document), and the Data Spaces Support Centre (DSSC), with its Blueprint (version 2.0 at the time of writing), also bear a strong influence on the design of LanguageDCAT-AP.

For semantic interoperability purposes, data space initiatives recommend the use of DCAT [[vocab-dcat-3]] for the description of catalogues, datasets and data services, and [[ODRL]] for the representation of policies. Specific data spaces have further adopted and extended [[DCAT-AP]] to their domain needs (cf., for instance, mobilityDCAT-AP and HealthDCAT-AP). LDS follows the same approach and, to this end, closely collaborates with the SEMIC team.

The language technology community has specific needs and demands that lead to the adaptation and extension of DCAT-AP and, for models, MLDCAT-AP. Compatibility with the aforementioned specifications and guidelines, as well as implementation requirements, imposes further constraints on the way these two models are deployed in LanguageDCAT-AP. Differences are manifested in the data type used for properties, the cardinality of properties, the recommended controlled vocabularies, and the introduction of new classes and properties. These differences are documented in the descriptions of classes and properties in the following sections, while this section presents a subset thereof for illustration purposes.

The "language" property is of utmost importance for the description of language data and is therefore mandatory for all resources. It is also specialised into several (sub)properties, which refer to the language used for the contents of a dataset, the source/target language of bi-/multilingual datasets, the metalanguage used in lexica and grammars to describe a language, the language(s) of the datasets that a service can process, etc.; these properties can be used, together with the mandatory property, to provide a more detailed account of the resource that is described. Moreover, although the use of the vocabulary recommended in DCAT-AP (the EU authority language vocabulary) for the assignment of language values is crucial for interoperability purposes, it does not offer the precision that the language community requires. Thus, in addition to the mandatory "dct:language" property, which takes values from this vocabulary and fulfils the interoperability needs, the "ms:language" property, which implements the BCP 47 standard [[rfc5646]], is also mandatory; it can be combined with the optional "ms:languageVarietyName" property, which encodes language variants, dialects, jargons, etc. in free text.
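
The alignment between the two language properties can be illustrated with a minimal Python sketch. It is not part of the specification: the mapping table is a small hand-written excerpt, the function names are hypothetical, and the check only compares the primary subtag of the BCP 47 tag with the code at the end of the EU authority IRI.

```python
# Illustrative only: a hand-written excerpt that maps BCP 47 primary subtags
# to codes of the EU authority language vocabulary. A real implementation
# would load the complete vocabulary instead.
EU_AUTHORITY_BASE = "http://publications.europa.eu/resource/authority/language/"
PRIMARY_TO_EU = {"en": "ENG", "el": "ELL", "pt": "POR"}


def primary_subtag(bcp47_tag: str) -> str:
    """Return the first subtag (the language proper) of a BCP 47 tag."""
    return bcp47_tag.split("-")[0].lower()


def languages_aligned(dct_language_iri: str, ms_language_tag: str) -> bool:
    """Check that the detailed tag refers to the same language as dct:language."""
    code = PRIMARY_TO_EU.get(primary_subtag(ms_language_tag))
    return code is not None and dct_language_iri == EU_AUTHORITY_BASE + code
```

Under these assumptions, a resource described with the authority value for English and the tag "en-GB" is considered aligned, while the same authority value combined with "el-Latn" is not.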

Language equality is obviously a foundational principle for the language community, hence the requisite to ensure that metadata descriptions of language resources can be offered in all languages with the proper encoding. LanguageDCAT-AP supports multilinguality by dictating the use of "rdf:langString" for all free text properties.
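
As a rough illustration of what language-tagged free text means for a consumer, the following Python sketch models each value as a (text, language tag) pair, mirroring "rdf:langString". The sample descriptions, the function names and the fallback to English are all assumptions made for the example, not requirements of the profile.

```python
from typing import List, Optional, Tuple

LangString = Tuple[str, str]  # (text, BCP 47 language tag)

# Invented sample values: one description offered in two languages.
descriptions: List[LangString] = [
    ("A corpus of parliamentary debates", "en"),
    ("Ein Korpus von Parlamentsdebatten", "de"),
]


def pick_literal(values: List[LangString], preferred: str,
                 fallback: str = "en") -> Optional[str]:
    """Return the text in the preferred language, else in the fallback language."""
    by_lang = {lang.lower(): text for text, lang in values}
    for lang in (preferred.lower(), fallback.lower()):
        if lang in by_lang:
            return by_lang[lang]
    return None


def missing_language_tag(values: List[LangString]) -> List[str]:
    """List the texts that violate the 'language tag is mandatory' rule."""
    return [text for text, lang in values if not lang]
```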

Community-specific properties are combined with community-specific vocabularies, for instance for the units used to measure the size of datasets (e.g., words and sentences for datasets, terms for terminological glossaries, concepts for ontologies) and for the functions of language models (e.g., annotation, sentiment analysis, machine translation).

It is important to stress that LT services are represented with a class distinct from that of Data Services, as their description needs impose a richer and more complex set of properties. The fact that they are not linked to specific datasets but can be used to process different types of data and enrich them with new information, retrieve information from them and/or generate new resources reinforces the argument for their separate treatment.

Terminology

An Application Profile is a data specification aimed to facilitate the data exchange in a well-defined application context. It re-uses concepts from one or more semantic data specifications, while adding more specificity, by identifying mandatory, recommended, and optional elements, addressing particular application needs, and providing recommendations for controlled vocabularies to be used. More information can be found on the SEMIC Style Guide. [Definition from MLDCAT-AP]

A dataset represents a collection of data, published or curated by a single agent or identifiable community. [Definition from [[vocab-dcat-3]]]

A language resource is a resource composed of linguistic material used in the construction, improvement and/or evaluation of language processing applications, but also, in a broader sense, in language and language-mediated research studies and applications; the term is used with a broader meaning, encompassing (a) data sets (textual, multimodal/multimedia and lexical data, grammars, language models, etc.) in machine readable form, and (b) tools/technologies/services used for their processing and management. [Definition from META-SHARE]

A language resource is further distinguished into:

Initially, models were considered under the class of "language descriptions", but given the way they have evolved, especially with the advent of Large Language Models (LLMs), they have been recognised as a distinct category.

Used prefixes

LanguageDCAT-AP is a semantic specification, where each class and property has a URI. The prefixed form of the URI can be found in part in the class diagram and in the quick reference table at the bottom, while, for ease of reading, the full URI can be found in the hyperlink of each class and property. This specification uses the following prefixes to shorten the URIs for readability:

| Prefix | Namespace IRI |
| --- | --- |
| adms | http://www.w3.org/ns/adms# |
| at | http://publications.europa.eu/ontology/authority/ |
| cc | http://creativecommons.org/ns# |
| dc | http://purl.org/dc/elements/1.1/ |
| dcat | http://www.w3.org/ns/dcat# |
| dcatlds | http://w3id.org/lang-dcat-ap/ |
| dct | http://purl.org/dc/terms/ |
| dpv | https://w3id.org/dpv# |
| foaf | http://xmlns.com/foaf/0.1/ |
| it6 | http://data.europa.eu/it6/ |
| lexbib | https://lexbib.elex.is/entity/ |
| lds | https://language-data-space.eu/entity/ |
| lexmeta | http://w3id.org/meta-share/lexmeta# |
| ms | http://w3id.org/meta-share/meta-share/ |
| odrl | http://www.w3.org/ns/odrl/2/ |
| omtd | http://w3id.org/meta-share/omtd-share/ |
| owl | http://www.w3.org/2002/07/owl# |
| prov | http://www.w3.org/ns/prov# |
| rdf | http://www.w3.org/1999/02/22-rdf-syntax-ns# |
| rdfs | http://www.w3.org/2000/01/rdf-schema# |
| schema | http://schema.org/ |
| skos | http://www.w3.org/2004/02/skos/core# |
| vann | http://purl.org/vocab/vann/ |
| wikidata | http://www.wikidata.org/entity/ |
| xml | http://www.w3.org/XML/1998/namespace |
| xsd | http://www.w3.org/2001/XMLSchema# |

**WARNING**: LanguageDCAT-AP v0.9.1 used the "http://www.nlpli.gr/dcat-lds#" namespace for its own defined elements, which has now been replaced with: "http://w3id.org/lang-dcat-ap/".
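
To make the role of the prefixes concrete, here is a minimal Python sketch that expands a prefixed name into a full IRI and rewrites IRIs from the v0.9.1 namespace to the current one. The function names are hypothetical, only a few prefixes are copied into the dictionary, and the migration assumes that local names were kept unchanged between versions, which should be verified for each element.

```python
# A few of the prefixes listed in this section, copied into a lookup table.
PREFIXES = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dcatlds": "http://w3id.org/lang-dcat-ap/",
    "dct": "http://purl.org/dc/terms/",
    "ms": "http://w3id.org/meta-share/meta-share/",
}

OLD_NAMESPACE = "http://www.nlpli.gr/dcat-lds#"  # used by v0.9.1
NEW_NAMESPACE = PREFIXES["dcatlds"]


def expand(prefixed_name: str) -> str:
    """Turn a prefixed name such as 'dct:language' into a full IRI."""
    prefix, _, local_name = prefixed_name.partition(":")
    if prefix not in PREFIXES:
        raise KeyError(f"unknown prefix: {prefix}")
    return PREFIXES[prefix] + local_name


def migrate(iri: str) -> str:
    """Rewrite an IRI from the old namespace; leave every other IRI untouched.

    Assumption: the local name is the same in both namespaces.
    """
    if iri.startswith(OLD_NAMESPACE):
        return NEW_NAMESPACE + iri[len(OLD_NAMESPACE):]
    return iri
```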

Overview

This document describes the usage of the following main entities of the Application Profile:
| Agent | Catalogue | Corpus | Dataset | Data Service | Distribution | Lexical/Conceptual Resource | Language | Licence | Model | Organisation | Person | Policy | Tool/Service |

The main entities are supported by the following entities:
| Catalogue Record | Catalogued Resource | Concept | Document | Identifier | Linguistic System | Location | Media Type | Media Type Or Extent | Size | Period of Time |

And supported by these datatypes:
| boolean | date | float | integer | langString | Literal | nonNegativeInteger | string | Temporal Literal |

**TODO: CHANGE DIAGRAM** The following diagram gives an overview of LanguageDCAT-AP. For simplicity's sake, only classes whose properties are specified by the profile are displayed as separate entities.

Main entities

The main entities are those that form the core of the Application Profile.

The properties and their associated constraints that apply in the context of this profile are listed in tabular form. Each row corresponds to one property. In addition to the constraints, cross-references to DCAT-AP and MLDCAT-AP are provided. To save space, the following abbreviations are used:

This reuse qualification is assessed with respect to a specific version of the two profiles. Therefore, it may vary over time as new versions thereof are created.

Agent

Definition
An agent (person or organisation) carrying out activities related to the Catalogue and Catalogued resources.
Reference
DCAT-AP [E]
Usage Note
Class used as an abstract class. Only the subclasses Organisation and Person should be used in a data exchange.
Properties
This specification does not impose any additional requirements on properties for this entity.

Catalogue

Definition
A catalogue or repository that hosts the Catalogued Resources being described.
Reference
DCAT-AP [A]
Properties
For this entity the following properties are defined: creator, dataset, description, homepage, identifier, keyword, licence, modification date, publisher, record, release date, service, title.

| Property | Range | Cardinality | Definition | Usage | Controlled vocabulary | Reference |
| --- | --- | --- | --- | --- | --- | --- |
| creator | Agent | 0..n | An entity responsible for the creation of the Catalogue. |  |  | DCAT-AP [A] |
| dataset | Dataset | 0..n | A Dataset that is part of the Catalogue. |  |  | DCAT-AP [A] |
| description | langString | 1..n | A free-text account of the Catalogue. | This property can be repeated for parallel language versions of the description. |  | DCAT-AP [A] |
| homepage | Document | 0..1 | A web page that acts as the main page for the Catalogue. |  |  | DCAT-AP [A] |
| identifier | Literal | 0..1 | A unique identifier of the resource being described or catalogued. |  |  | MLDCAT-AP [A] |
| keyword | langString | 0..n | A keyword or tag describing the resource. |  |  | MLDCAT-AP [A] |
| licence | Licence Document | 0..1 | A licence under which the Catalogue can be used or reused. |  |  | DCAT-AP [A] |
| modification date | Temporal Literal | 0..1 | The most recent date on which the Catalogue was modified. |  |  | DCAT-AP [A] |
| publisher | Agent | 1..1 | An entity (organisation) responsible for making the Catalogue available. | In case multiple organisations are considered the publishers of the catalogue, it is recommended to use foaf:Group to bundle them into one entity. |  | DCAT-AP [A] |
| record | Catalogue Record | 0..n | A Catalogue Record that is part of the Catalogue. |  |  | DCAT-AP [A] |
| release date | Temporal Literal | 0..1 | The date of formal issuance (e.g., publication) of the Catalogue. |  |  | DCAT-AP [A] |
| service | Data Service | 0..n | A site or end-point (Data Service) that is listed in the Catalogue. |  |  | DCAT-AP [A] |
| title | langString | 1..n | A name given to the Catalogue. | This property can be repeated for parallel language versions of the name. |  | DCAT-AP [E] |
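
The cardinality column can be read as a simple minimum/maximum constraint per property. The following Python sketch shows one possible way to check a Catalogue description against some of the constraints listed above; the dictionary-based record format and the function name are assumptions made for the example, and a production setup would more likely rely on SHACL or a similar validation language.

```python
from typing import Dict, List, Optional, Tuple

# (minimum, maximum) per property, copied from the Catalogue table;
# None stands for "n" (no upper bound). Only a subset is shown.
CATALOGUE_CARDINALITY: Dict[str, Tuple[int, Optional[int]]] = {
    "description": (1, None),
    "title": (1, None),
    "publisher": (1, 1),
    "homepage": (0, 1),
    "identifier": (0, 1),
    "licence": (0, 1),
}


def check_cardinality(record: Dict[str, list],
                      rules: Dict[str, Tuple[int, Optional[int]]]) -> List[str]:
    """Return one message per property whose number of values breaks the rule."""
    problems = []
    for prop, (minimum, maximum) in rules.items():
        count = len(record.get(prop, []))
        if count < minimum:
            problems.append(f"{prop}: expected at least {minimum}, found {count}")
        elif maximum is not None and count > maximum:
            problems.append(f"{prop}: expected at most {maximum}, found {count}")
    return problems
```

A record that has a title but no description, no publisher and two homepages would yield three messages, one for each violated constraint.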

Corpus

Definition
A structured collection of pieces of data (textual, audio, video, multimodal/multimedia, etc.) typically of considerable size and selected according to criteria external to the data (e.g., size, type of language, type of text producers or expected audience, etc.) to represent as comprehensively as possible the object of study.
Reference
No reference.
Subclass of
Dataset
Usage Note
Corpus is a subclass of dcat:Dataset. It is recommended to use this instead of the Dataset class.
Properties
For this entity the following properties are defined: alternative, annotation type, anonymization details, anonymized, conforms to, corpus subclass, data protection principle applied, description, detailed language, distribution, domain, has policy, has technical organisational measure, identifier, IPR holder, is documented by, keyword, language, license, linguality type, LR type, modality type, multilinguality type, other identifier, personal data details, personal data included, pivot language, publisher, special category data details, special category data included, source language, spatial, target language, temporal, title, version

| Property | Range | Cardinality | Definition | Usage | Controlled vocabulary | Reference |
| --- | --- | --- | --- | --- | --- | --- |
| alternative | langString | 0..n | An alternative name for the resource. | It is recommended to use "dct:title" for the full name of a dataset and "dct:alternative" for the short name. This property can be repeated for parallel language versions of the property. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. |  | P |
| annotation type | AnnotationType | 0..n | Specifies the annotation type of the annotated version(s) of a resource or the annotation type a tool/service requires or produces as an output |  | OMTD annotation type vocabulary | P |
| anonymization details | langString | 0..n | If the resource has been anonymised, this field can be used for entering more information, e.g., tool or method used for the anonymisation, by whom it has been performed, whether there was any check of the results, etc. | This property can be repeated for parallel language versions of the property. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. |  | P |
| anonymized | Anonymized | 1..1 | Indicates whether the language resource has been anonymised; anonymous data is information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable |  | META-SHARE anonymised vocabulary | P |
| conforms to | Standard | 0..n | An established standard to which the described resource conforms |  | META-SHARE vocabulary of standards and best practices | DCAT-AP [E] |
| corpus subclass | CorpusSubclass | 0..1 | Introduces a classification of corpora into types (used for descriptive reasons) |  | META-SHARE corpus subclass vocabulary | P |
| data protection principle applied | DataProtectionPrinciple | 0..n | Specifies the data protection principles that have been applied in compliance with the General Data Protection Regulation (Regulation (EU) 2016/679) |  | META-SHARE Data Protection Principle vocabulary | P |
| description | langString | 1..n | An account of the resource | This property can be repeated for parallel language versions of the property. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. |  | DCAT-AP [E] |
| detailed language | Language | 1..n | Specifies the language that is used in the resource or supported by the tool/service expressed according to the BCP47 recommendation | This property takes a value in compliance with the BCP47 recommendation and, thus, allows for the detailed description of the language (e.g., British English, Brazilian Portuguese, Greek written with the Latin script, etc.). It must be conformant with the value of "dct:language". |  | P |
| distribution | Distribution | 1..n | An available distribution of the dataset |  |  | DCAT-AP [E] |
| domain | Domain | 0..n | Identifies the domain according to which an entity is classified |  | META-SHARE vocabulary of domains | P |
| has policy | Policy | 1..n | Identifies an ODRL Policy for which the identified Asset is the target Asset to all the Rules | This property is used for "policies", i.e. the machine-readable representation of the licensing terms under which a dataset/distribution is made available. The representation MUST be expressed in the ODRL vocabulary. |  | P |
| has technical organisational measure | TechnicalOrganisationalMeasure | 0..n | Indicates use or applicability of Technical or Organisational measure |  | META-SHARE Technical and Organisational Measure vocabulary | P |
| identifier | Literal | 0..1 | An unambiguous reference to the resource within a given context | The main identifier for the resource, e.g. the URI or other unique identifier in the context of the Catalogue. It MUST be automatically assigned by the system when adding the resource to the Catalogue. |  | DCAT-AP [A] |
| IPR holder | Agent | 0..n | A person or an organisation who holds the full Intellectual Property Rights (Copyright, trademark, etc.) that subsist in the resource. The IPR holder could be different from the creator that may have assigned the rights to the IPR holder (e.g., an author as a creator assigns her rights to the publisher who is the IPR holder) and the distributor that holds a specific licence (i.e. a permission) to distribute the work via a specific distributor. | The IPR holder may be identical in many cases with the Publisher of the resource (see property dct:publisher). In this case, the contact data MAY be copied from the corresponding Publisher entries. There might be also cases with non-identical entities: e.g., when one or several IPR Holders assign another entity as the Publisher responsible for producing, hosting and publishing a resource; such entities may be, for instance, a data distribution agency, or a specific partner representing a project consortium. The subproperty ms:iprHolder is preferred over dct:rightsHolder in order to differentiate with other types of rights (e.g. distribution rights). |  | P |
| is documented by | Document | 0..n | Links a language resource to a document (e.g., research paper describing its contents or its use in a project, user manual, etc.) or any other form of documentation (e.g., a URL with support information) that is related to the resource |  |  | P |
| keyword | langString | 1..n | A keyword or tag describing a resource | This property can be repeated for parallel language versions of the property. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. |  | DCAT-AP [E] |
| language | Linguistic System | 1..n | A language of the resource. | This property is used for the language of the contents of a dataset and takes a value only for the language proper, without any reference to script, regional variant, etc. It caters for semantic interoperability with other catalogues, yet fails to capture the full language details that are required for language data. For a more detailed description, the "ms:language" property is preferred, but the two values must be aligned. | The EU Authority languages vocabulary (http://publications.europa.eu/resource/dataset/language) and, if not covered, the lexvo vocabulary for languages (http://lexvo.org/) | DCAT-AP [A] |
| license | LicenseDocument | 1..n | A legal document giving official permission to do something with the resource | This property refers to a "licence", i.e. a human readable text with legal code, with which the resource/distribution is made available. This property SHOULD refer to a concrete standard or proprietary licence, so that the data users can assess the licence conditions in human-readable format before using the data. |  | P |
| linguality type | LingualityType | 1..1 | Indicates whether the resource includes one, two or more languages |  | META-SHARE linguality type vocabulary | P |
| LR type | LR type | 1..1 | Specifies the type of a language resource | This element allows for a more fine-grained categorisation of datasets into 'corpora' (collections of data files in text, audio, video and/or image modality), 'lexical/conceptual resources' (e.g., lexica, vocabularies, gazetteers, terminological lexica, etc.) and 'grammars'. Note: In the META-SHARE ontology, models and grammars are grouped as "language descriptions" and included as subclasses of datasets. Given the evolution of Machine Learning Models, this has been revisited and models are included as a separate subclass distinguished from datasets. | META-SHARE language resource type vocabulary | P |
| modality type | Modality Type | 1..n | Specifies the media type of a language resource (the physical medium of the contents representation) or of the input/output of a language processing tool/service; each media type is described through a distinctive set of technical features; a language resource may consist of different media parts |  | META-SHARE media type vocabulary | P |
| multilinguality type | Multilinguality Type | 0..1 | Indicates whether the resource (part) is parallel, comparable or mixed |  | META-SHARE linguality type vocabulary | P |
| other identifier | Identifier | 0..n | Links a resource to an adms:Identifier class. | This property MAY be used as an additional identifier for existing identifiers used for the same resource in other Catalogues (e.g. DOI, ISLRN, DataCite, Handle PIDs) |  | DCAT-AP [A] |
| personal data details | langString | 0..n | If the resource includes personal data, this field can be used for entering more information, e.g., whether special handling of the resource is required (e.g., anonymisation, further request for use, etc.) | This property can be repeated for parallel language versions of the property. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. |  | P |
| personal data included | Personal Data Included | 1..1 | Specifies whether the language resource contains personal data, i.e., any information relating to an identified or identifiable natural person (data subject); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person [Article 4(1) of the General Data Protection Regulation (Regulation (EU) 2016/679)] | This property MUST be filled in for all non-anonymised datasets. If a dataset is anonymised, it is presumed that it contains no personal and sensitive data | META-SHARE personal data included vocabulary | P |
| pivot language | Language | 0..n | The language acting as an intermediary for translations between many languages |  |  | P |
| publisher | Agent | 1..1 | An entity responsible for making the resource available | This property refers to the entity that "publishes", i.e. makes available to the specific platform, the corresponding resource. The information may be identical to the property ms:iprHolder of the resource. |  | DCAT-AP [E] |
| special category data details | langString | 0..n | If the resource includes special category data, this field can be used for entering more information, e.g., whether special handling of the resource is required (e.g., anonymisation, further request for use, etc.) | This property can be repeated for parallel language versions of the property. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. |  | P |
| special category data included | Special Category Data Included | 1..1 | Specifies whether the language resource contains special category data, i.e., personal data revealing racial or ethnic origin, political opinions, religious or philosophical beliefs, or trade union membership, and the processing of genetic data, biometric data for the purpose of uniquely identifying a natural person, data concerning health or data concerning a natural person’s sex life or sexual orientation [Article 9(1) of the General Data Protection Regulation (Regulation (EU) 2016/679)] | This property MUST be filled in for all non-anonymised datasets. If a dataset is anonymised, it is presumed that it contains no personal and sensitive data | META-SHARE Special Category Data Included vocabulary | P |
| source language | Language | 0..n | The language from which a translation is made. |  |  | P |
| spatial | Location | 0..n | Spatial characteristics of the resource | A geographic region that is covered by the dataset. For language data, this refers to the geographic region where the language of the dataset is spoken/written and not where the dataset was implemented; for instance, in the case of a German institution creating a dataset of Cypriot Greek, the geographic region is "Cyprus". | The EU Vocabularies Continents Named Authority List for continents (http://publications.europa.eu/resource/dataset/continent), countries (http://publications.europa.eu/resource/dataset/country), places (http://publications.europa.eu/resource/dataset/place) and, if not covered, GeoNames (https://www.geonames.org/) | DCAT-AP [A] |
| target language | Language | 0..n | The language into which a translation is made |  |  | P |
| temporal | PeriodOfTime | 0..n | Temporal characteristics of the resource | A temporal period that the contents of a dataset cover. For language data, this can be the time period in which the language of a dataset is spoken. |  | DCAT-AP [A] |
| title | langString | 1..n | A name given to the resource | This property can be repeated for parallel language versions of the property. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. |  | DCAT-AP [E] |
| version | string | 1..1 | The version indicator (name or identifier) of a resource | This property MUST contain a version number or other version designation of the dataset. It is recommended to follow W3C Data on the Web Best Practices [DWBP]. Version identifiers should enable comparison of versions and distinguishing major from minor versions, such as Semantic Versioning [SEMVER]. |  | DCAT-AP [E] |
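
Some Corpus properties are expected to agree with each other. As one hedged illustration, the Python sketch below derives the linguality type that would be consistent with the detailed language values of a corpus. The labels "monolingual", "bilingual" and "multilingual" are assumed stand-ins for the entries of the META-SHARE linguality type vocabulary, the function name is hypothetical, and counting distinct primary subtags (so that "en-GB" and "en-US" count as one language) is a design choice of this example rather than a rule of the profile.

```python
from typing import Iterable


def expected_linguality(detailed_languages: Iterable[str]) -> str:
    """Suggest a linguality type from the BCP 47 tags of a corpus."""
    # Compare primary subtags only, so regional or script variants of the
    # same language are counted once.
    primary = {tag.split("-")[0].lower() for tag in detailed_languages}
    if not primary:
        # "detailed language" has cardinality 1..n, so an empty list is invalid.
        raise ValueError("at least one detailed language is required")
    if len(primary) == 1:
        return "monolingual"
    if len(primary) == 2:
        return "bilingual"
    return "multilingual"
```

A catalogue could compare this suggestion with the declared linguality type and flag mismatches for manual review rather than rejecting the record outright.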

Data Service

Definition
A site or end-point providing operations related to the discovery of, access to, or processing functions on, data or related resources.
Reference
DCAT-AP [A]
Subclass of
Catalogued Resource
Properties
For this entity the following properties are defined: endpoint description, endpoint URL, title, type

| Property | Range | Cardinality | Definition | Usage | Controlled vocabulary | Reference |
| --- | --- | --- | --- | --- | --- | --- |
| endpoint description | langString | 0..n | A description of the service end-point, including its operations, parameters etc. | This property can be repeated for parallel language versions of the property. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. |  | DCAT-AP [E] |
| endpoint URL | Resource | 1..1 | The root location or primary endpoint of the service (a web-resolvable IRI). |  |  | DCAT-AP [E] |
| title | langString | 0..n | The name of the data service | This property can be repeated for parallel language versions of the property. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. |  | DCAT-AP [E] |
| type | Concept | 0..1 | Type of the data service | This property is specific to Data Spaces; it helps differentiate between connectors and real data services (e.g., SPARQL query services) and for the time being is restricted to a fixed value (http://w3id.org/lang-dcat-ap/connector) |  | P |

Dataset

Definition
A collection of data, published or curated by a single agent, and available for access or download in one or more representations.
Reference
DCAT-AP [A]
Usage Note
Class used as an abstract class. Only the subclasses Corpus and Lexical/Conceptual Resource should be used in a data exchange.
Properties
This specification does not impose any additional requirements on properties for this entity.

Distribution

Definition
A specific representation of a dataset. A dataset might be available in multiple serialisations that may differ in various ways, including natural language, media-type or format, schematic organisation, temporal and spatial resolution, level of detail or profiles (which might specify any or all of the above).
Reference
DCAT-AP [A]
Properties
For this entity the following properties are defined: access service, access URL, byte size, download URL, format, has policy, license, media type, package format, size, title

| Property | Range | Cardinality | Definition | Usage | Controlled vocabulary | Reference |
| --- | --- | --- | --- | --- | --- | --- |
| access service | DataService | 1..1 | A site or end-point that gives access to the distribution of the dataset |  |  | DCAT-AP [E] |
| access URL | Resource | 0..n | A URL of a resource that gives access to a distribution of the dataset, e.g. landing page, feed, SPARQL endpoint. | Use for all cases except for direct download links, in which case dcat:downloadURL is preferred. |  | DCAT-AP [E] |
| byte size | nonNegativeInteger | 1..1 | The size of a distribution in bytes. |  |  | DCAT-AP [E] |
| download URL | Resource | 1..1 | The URL of the downloadable file in a given format. E.g. CSV file or RDF file. The format is indicated by the distribution's dct:format and/or dcat:mediaType. |  |  | DCAT-AP [E] |
| format | Media Type Or Extent | 1..n | The file format, physical medium, or dimensions of the resource | dcat:mediaType SHOULD be used when the media type of the distribution is defined in IANA [IANA-MEDIA-TYPES], otherwise dcterms:format MAY be used with different values. | OMTD-SHARE vocabulary of data formats | DCAT-AP [E] |
| has policy | Policy | 1..n | Identifies an ODRL Policy for which the identified Asset is the target Asset to all the Rules | This property is used for "policies", i.e. the machine-readable representation of the licensing terms under which a dataset/distribution is made available. The representation MUST be expressed in the ODRL vocabulary. |  | DCAT-AP [E] |
| license | License Document | 1..n | A legal document giving official permission to do something with the resource | This property refers to a "licence", i.e. a human readable text with legal code, with which the resource/distribution is made available. This property SHOULD refer to a concrete standard or proprietary licence, so that the data users can assess the licence conditions in human-readable format before using the data. |  | DCAT-AP [E] |
| media type | Media Type | 0..n | The media type of the distribution as defined by IANA [IANA-MEDIA-TYPES]. | dcat:mediaType SHOULD be used when the media type of the distribution is defined in IANA [IANA-MEDIA-TYPES], otherwise dcterms:format MAY be used with different values. | IANA-MEDIA-TYPES vocabulary | DCAT-AP [E] |
| package format | Media Type | 0..1 | The package format of the distribution in which one or more data files are grouped together, e.g. to enable a set of related files to be downloaded together. |  | OMTD-SHARE vocabulary of package formats | DCAT-AP [E] |
| size | Size | 1..n | Specifies the size of a countable entity with regard to the SizeUnit measurement in form of a number |  |  | P |
| title | langString | 0..n | A name given to the resource | This property can be repeated for parallel language versions of the property. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. |  | DCAT-AP [E] |
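
The usage note shared by "format" and "media type" amounts to a small decision rule. The Python sketch below illustrates it under stated assumptions: the set of registered types is a tiny hand-written excerpt rather than the IANA registry, the function name is hypothetical, and the IRI pattern used for IANA media types is an assumption that should be checked against the controlled vocabulary actually adopted.

```python
from typing import Tuple

# Assumed IRI pattern for IANA media types, plus a hand-written excerpt of
# registered types. A real implementation would consult the IANA registry.
IANA_BASE = "https://www.iana.org/assignments/media-types/"
KNOWN_IANA_TYPES = {"text/csv", "application/json", "application/xml"}


def format_statement(value: str) -> Tuple[str, str]:
    """Choose the property (and value) used to state the format of a distribution.

    Registered media types go to dcat:mediaType; anything else falls back to
    dct:format with the value left as given.
    """
    if value in KNOWN_IANA_TYPES:
        return ("dcat:mediaType", IANA_BASE + value)
    return ("dct:format", value)
```

For instance, "text/csv" would be stated with dcat:mediaType, whereas a community format name that is absent from the excerpt would be stated with dct:format.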

Language

Definition
A linguistic system that follows for its encoding the BCP47 recommendation.
Reference
No reference
Subclass of
Linguistic System
Properties
For this entity the following properties are defined: language code, language tag, language variety name, region, script, variant
Property | Range | Cardinality | Definition | Usage | Controlled vocabulary | Reference
language code | Language Code | 1..1 | Used to specify the first part of a language tag according to the BCP47 recommendation which indicates the language |  | META-SHARE language vocabulary | P
language tag | string | 1..1 | The identifier of a language, according to the IETF BCP47 guidelines | This is automatically built from the values of the four language subtags (language, script, region and variant) provided by the user |  | P
language variety name | langString | 0..n | A textual string used for referring to a language variety | This property can be repeated for parallel language versions of the property. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. |  | P
region | Region | 0..1 | Specifies the geographical region where a language is used according to the BCP47 recommendation |  | META-SHARE location vocabulary | P
script | Script | 0..1 | Specifies the script used for writing a language |  | META-SHARE script vocabulary | P
variant | Variant | 0..n | Specifies a variant for a language according to the BCP47 recommendation |  | META-SHARE variants vocabulary | P
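
The usage note for "language tag" says the tag is built automatically from the four subtags. A minimal sketch of that assembly is given below, assuming the subtags are already valid values from the respective vocabularies; the function name `build_language_tag` and its parameters are hypothetical and not part of the profile. The subtag order (language, script, region, variant) and the casing conventions follow BCP47; casing is only a recommendation there, since tags are case-insensitive.

```python
def build_language_tag(language, script=None, region=None, variants=()):
    """Join BCP47 subtags in the order language-script-region-variant.

    Hypothetical helper: it only concatenates subtags and applies the
    conventional casing; it does not validate them against any registry.
    """
    if not language:
        raise ValueError("the language subtag is mandatory")
    parts = [language.lower()]            # e.g. "en"
    if script:
        parts.append(script.title())      # e.g. "Latn"
    if region:
        parts.append(region.upper())      # e.g. "GB"
    parts.extend(v.lower() for v in variants)
    return "-".join(parts)


print(build_language_tag("en", region="GB"))      # British English
print(build_language_tag("el", script="Latn"))    # Greek written with the Latin script
print(build_language_tag("pt", region="BR"))      # Brazilian Portuguese
```

The three calls correspond to the examples mentioned elsewhere in this profile (British English, Greek written with the Latin script, Brazilian Portuguese).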

Lexical/Conceptual Resource

Definition
A resource organised on the basis of lexical or conceptual entries (lexical items, terms, concepts, etc.) with their supplementary information (e.g., grammatical, semantic, statistical information, etc.)
Reference
No reference.
Subclass of
Dataset
Usage Note
Lexical/Conceptual Resource is a subclass of dcat:Dataset. It is recommended to use this class instead of the more general Dataset class when describing lexical/conceptual resources.
Properties
For this entity the following properties are defined: alternative, anonymization details, anonymized, conforms to, data protection principle applied, description, detailed language, distribution, domain, has original source, has policy, has technical organisational measure, identifier, IPR holder, is documented by, keyword, language, LCR subclass, license, linguality type, linguistic information, LR type, metalanguage, modality type, multilinguality type, original source description, other identifier, personal data details, personal data included, pivot language, publisher, special category data details, special category data included, source language, spatial, target language, temporal, title, version
Property | Range | Cardinality | Definition | Usage | Controlled vocabulary | Reference
alternative | langString | 0..n | An alternative name for the resource. | It is recommended to use "dct:title" for the full name of a dataset and "dct:alternative" for the short name. This property can be repeated for parallel language versions of the property. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. |  | P
anonymization details | langString | 0..n | If the resource has been anonymised, this field can be used for entering more information, e.g., tool or method used for the anonymisation, by whom it has been performed, whether there was any check of the results, etc. | This property can be repeated for parallel language versions of the property. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. |  | P
anonymized | Anonymized | 1..1 | Indicates whether the language resource has been anonymised; anonymous data is information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable |  | META-SHARE anonymised vocabulary | P
conforms to | Standard | 0..n | An established standard to which the described resource conforms |  | META-SHARE vocabulary of standards and best practices | DCAT-AP [E]
data protection principle applied | Data Protection Principle | 0..n | Specifies the data protection principles that have been applied in compliance with the General Data Protection Regulation (Regulation (EU) 2016/679) |  | META-SHARE Data Protection Principle vocabulary | P
description | langString | 1..n | An account of the resource | This property can be repeated for parallel language versions of the property. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. |  | DCAT-AP [E]
detailed language | Language | 1..n | Specifies the language that is used in the resource or supported by the tool/service expressed according to the BCP47 recommendation | This property takes a value in compliance with the BCP47 recommendation and, thus, allows for the detailed description of the language (e.g., British English, Brazilian Portuguese, Greek written with the Latin script, etc.). It must be conformant with the value of "dct:language". |  | P
distribution | Distribution | 1..n | An available distribution of the dataset |  |  | DCAT-AP [E]
domain | Domain | 0..n | Identifies the domain according to which an entity is classified |  | META-SHARE vocabulary of domains | P
has original source | Dataset | 0..n | Links a language resource to the original source used for its creation, from which it is derived or elicited |  |  | P
has policy | Policy | 1..n | Identifies an ODRL Policy for which the identified Asset is the target Asset to all the Rules | This property is used for "policies", i.e. the machine-readable representation of the licensing terms under which a dataset/distribution is made available. The representation MUST be expressed in the ODRL vocabulary. |  | P
has technical organisational measure | Technical Organisational Measure | 0..n | Indicates use or applicability of Technical or Organisational measure |  | META-SHARE Technical and Organisational Measure vocabulary | P
identifier | Literal | 0..1 | An unambiguous reference to the resource within a given context | The main identifier for the resource, e.g. the URI or other unique identifier in the context of the Catalogue. It MUST be automatically assigned by the system when adding the resource to the Catalogue. |  | DCAT-AP [A]
IPR holder | Agent | 0..n | A person or an organisation who holds the full Intellectual Property Rights (Copyright, trademark, etc.) that subsist in the resource. The IPR holder could be different from the creator that may have assigned the rights to the IPR holder (e.g., an author as a creator assigns her rights to the publisher who is the IPR holder) and the distributor that holds a specific licence (i.e. a permission) to distribute the work via a specific distributor. | The IPR holder may be identical in many cases with the Publisher of the resource (see property dct:publisher). In this case, the contact data MAY be copied from the corresponding Publisher entries. There might also be cases with non-identical entities: e.g., when one or several IPR Holders assign another entity as the Publisher responsible for producing, hosting and publishing a resource; such entities may be, for instance, a data distribution agency, or a specific partner representing a project consortium. The subproperty ms:iprHolder is preferred over dct:rightsHolder in order to differentiate it from other types of rights (e.g. distribution rights). |  | P
is documented by | Document | 0..n | Links a language resource to a document (e.g., research paper describing its contents or its use in a project, user manual, etc.) or any other form of documentation (e.g., a URL with support information) that is related to the resource |  |  | P
keyword | langString | 1..n | A keyword or tag describing a resource | This property can be repeated for parallel language versions of the property. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. |  | DCAT-AP [E]
language | LinguisticSystem | 1..n | A language of the resource. | This property is used for the language of the contents of a dataset and takes a value only for the language proper, without any reference to script, regional variant, etc. It caters for semantic interoperability with other catalogues, yet fails to capture the full language details that are required for language data. For a more detailed description, the "ms:language" property is preferred, but the two values must be aligned. | The EU Authority languages vocabulary (http://publications.europa.eu/resource/dataset/language) and, if not covered, the lexvo vocabulary for languages (http://lexvo.org/) | DCAT-AP [A]
LCR subclass | LCR Subclass | 0..1 | Introduces a classification of lexical/conceptual resources into types (used for descriptive reasons) |  | META-SHARE lexical subclasses vocabulary | P
license | License Document | 1..n | A legal document giving official permission to do something with the resource. | This property refers to a "licence", i.e. a human readable text with legal code, with which the resource/distribution is made available. This property SHOULD refer to a concrete standard or proprietary licence, so that the data users can assess the licence conditions in human-readable format before using the data. |  | P
linguality type | Linguality Type | 1..1 | Indicates whether the resource includes one, two or more languages |  | META-SHARE linguality type vocabulary | P
linguistic information | Microstructure Feature | 0..n | Provides a detailed account of the linguistic information contained in the lexical/conceptual resource |  | LexMeta vocabulary for microstructure features | P
LR type | LR Type | 1..1 | Specifies the type of a language resource | This element allows for a more fine-grained categorisation of datasets into 'corpora' (collections of data files in text, audio, video and/or image modality), 'lexical/conceptual resources' (e.g., lexica, vocabularies, gazetteers, terminological lexica, etc.) and 'grammars'. Note: In the META-SHARE ontology, models and grammars are grouped as "language descriptions" and included as subclasses of datasets. Given the evolution of Machine Learning Models, this has been revisited and models are included as a separate subclass distinguished from datasets. | META-SHARE language resource type vocabulary | P
metalanguage | Language | 0..n | Specifies the language that is used as support for the resource (e.g., English for a grammar of French described in English or for a French dictionary with English definitions) |  |  | P
modality type | Modality Type | 1..n | Specifies the media type of a language resource (the physical medium of the contents representation) or of the input/output of a language processing tool/service; each media type is described through a distinctive set of technical features; a language resource may consist of different media parts |  | META-SHARE media type vocabulary | P
multilinguality type | Multilinguality Type | 0..1 | Indicates whether the resource (part) is parallel, comparable or mixed |  | META-SHARE linguality type vocabulary | P
original source description | langString | 0..n | A description in free text of the source material that has been used for the creation of a language data resource | This property can be used to provide further information on the source resource, for instance the mode and timespan of collection of a dataset. |  | P
other identifier | Identifier | 0..n | Links a resource to an adms:Identifier class. | This property MAY be used as an additional identifier for existing identifiers used for the same resource in other Catalogues (e.g. DOI, ISLRN, DataCite, Handle PIDs) |  | DCAT-AP [A]
personal data details | langString | 0..n | If the resource includes personal data, this field can be used for entering more information, e.g., whether special handling of the resource is required (e.g., anonymisation, further request for use, etc.) | This property can be repeated for parallel language versions of the property. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. |  | P
personal data included | Personal Data Included | 1..1 | Specifies whether the language resource contains personal data, i.e., any information relating to an identified or identifiable natural person (data subject); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person [Article 4(1) of the General Data Protection Regulation (Regulation (EU) 2016/679)] | This property MUST be filled in for all non-anonymised datasets. If a dataset is anonymised, it is presumed that it contains no personal or sensitive data. | META-SHARE personal data included vocabulary | P
pivot language | Language | 0..n | The language acting as an intermediary for translations between many languages |  |  | P
publisher | Agent | 1..1 | An entity responsible for making the resource available | This property refers to the entity that "publishes", i.e. makes available to the specific platform, the corresponding resource. The information may be identical to the property ms:iprHolder of the resource. |  | DCAT-AP [E]
special category data details | langString | 0..n | If the resource includes special category data, this field can be used for entering more information, e.g., whether special handling of the resource is required (e.g., anonymisation, further request for use, etc.) | This property can be repeated for parallel language versions of the property. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. |  | P
special category data included | Special Category Data Included | 1..1 | Specifies whether the language resource contains special category data, i.e., personal data revealing racial or ethnic origin, political opinions, religious or philosophical beliefs, or trade union membership, and the processing of genetic data, biometric data for the purpose of uniquely identifying a natural person, data concerning health or data concerning a natural person’s sex life or sexual orientation [Article 9(1) of the General Data Protection Regulation (Regulation (EU) 2016/679)] | This property MUST be filled in for all non-anonymised datasets. If a dataset is anonymised, it is presumed that it contains no personal or sensitive data. | META-SHARE Special Category Data Included vocabulary | P
source language | Language | 0..n | The language from which a translation is made. |  |  | P
spatial | Location | 0..n | Spatial characteristics of the resource | A geographic region that is covered by the dataset. For language data, this refers to the geographic region where the language of the dataset is spoken/written and not where the dataset was implemented; for instance, in the case of a German institution creating a dataset of Cypriot Greek, the geographic region is "Cyprus". | The EU Vocabularies Continents Named Authority List for continents (http://publications.europa.eu/resource/dataset/continent), countries (http://publications.europa.eu/resource/dataset/country), places (http://publications.europa.eu/resource/dataset/place) and, if not covered, GeoNames (https://www.geonames.org/) | DCAT-AP [A]
target language | Language | 0..n | The language into which a translation is made |  |  | P
temporal | PeriodOfTime | 0..n | Temporal characteristics of the resource | A temporal period that the contents of a dataset cover. For language data, this can be the time period in which the language of a dataset is spoken. |  | DCAT-AP [A]
title | langString | 1..n | A name given to the resource | This property can be repeated for parallel language versions of the property. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. |  | DCAT-AP [E]
version | string | 1..1 | The version indicator (name or identifier) of a resource | This property MUST contain a version number or other version designation of the dataset. It is recommended to follow W3C Data on the Web Best Practices [DWBP]. Version identifiers should enable comparison of versions and distinguishing major from minor versions, such as Semantic Versioning [SEMVER]. |  | DCAT-AP [E]
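
The cardinalities listed for this class can be checked mechanically. The sketch below is one possible way to do so, assuming a metadata record is held as a plain Python dictionary keyed by the property labels used in this section; the record layout, the function names and the sample values are hypothetical, and only a subset of the properties is included for brevity.

```python
def parse_cardinality(card):
    """Turn a string such as "1..n" into (minimum, maximum); None means unbounded."""
    low, high = card.split("..")
    return int(low), (None if high == "n" else int(high))


def check_record(record, cardinalities):
    """Return a list of human-readable cardinality violations for one record."""
    problems = []
    for prop, card in cardinalities.items():
        low, high = parse_cardinality(card)
        values = record.get(prop, [])
        if not isinstance(values, list):
            values = [values]
        count = len(values)
        if count < low:
            problems.append(f"{prop}: expected at least {low} value(s), found {count}")
        elif high is not None and count > high:
            problems.append(f"{prop}: expected at most {high} value(s), found {count}")
    return problems


# Subset of the cardinalities given for Lexical/Conceptual Resource.
LCR_CARDINALITIES = {
    "title": "1..n",
    "description": "1..n",
    "keyword": "1..n",
    "version": "1..1",
    "publisher": "1..1",
    "LCR subclass": "0..1",
    "linguality type": "1..1",
}

# Hypothetical record: no keyword, and two version values.
record = {
    "title": ["Example lexicon"],
    "description": ["A small bilingual lexicon"],
    "version": ["1.0.0", "1.0.1"],
    "publisher": ["Example Organisation"],
    "linguality type": ["bilingual"],
}

for problem in check_record(record, LCR_CARDINALITIES):
    print(problem)
```

A real implementation would more likely validate RDF against SHACL shapes; the dictionary form is used here only to keep the illustration self-contained.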

Licence

Definition
A legal document giving official permission to do something with a resource
Reference
DCAT-AP [A]
Properties
For this entity the following properties are defined: alternative, description, legal code, identifier, see also, title
Property | Range | Cardinality | Definition | Usage | Controlled vocabulary | Reference
alternative | langString | 0..n | An alternative name for the resource. | This property can be repeated for parallel language versions of the property. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. |  | P
description | langString | 1..n | An account of the resource | This property can be repeated for parallel language versions of the property. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. |  | P
legal code | Resource | 1..1 | The URL of the legal text of a License. |  |  | P
identifier | Identifier | 0..n | Links a resource to an adms:Identifier class. | When a standard licence is used, this MUST be the identifier from the SPDX list of licences. |  | P
see also | Resource | 0..n |  | This property MUST be used for additional URLs that contain the licence text besides the official one. It can be used for identifying duplicates. |  | P
title | langString | 1..n | A name given to the resource | This property can be repeated for parallel language versions of the property. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. |  | P

Model

Definition
The model artifact that is created through a training process involving an ML algorithm (that is, the learning algorithm) and the training data to learn from.
Reference
No reference
Subclass of
Catalogued Resource
Properties
For this entity the following properties are defined: alternative, anonymization details, anonymized, bias, conforms to, context length, creation details, data protection principle applied, description, detailed language, distribution, domain, evaluation dataset, evaluation results, finetune dataset, has policy, has technical organisational measure, identifier, IPR holder, is documented by, keyword, language, license, limitations, linguality type, LR type, modality type, model function, model type, other identifier, parameter precision, personal data details, personal data included, publisher, quantization process, quantized, special category data details, special category data included, source language, spatial, target language, temporal, title, trained on, variant of, version
Property | Range | Cardinality | Definition | Usage | Controlled vocabulary | Reference
alternative | langString | 0..n | An alternative name for the resource. | It is recommended to use "dct:title" for the full name of a dataset and "dct:alternative" for the short name. This property can be repeated for parallel language versions of the property. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. |  | P
anonymization details | langString | 0..n | If the resource has been anonymised, this field can be used for entering more information, e.g., tool or method used for the anonymisation, by whom it has been performed, whether there was any check of the results, etc. | This property can be repeated for parallel language versions of the property. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. |  | P
anonymized | Anonymized | 1..1 | Indicates whether the language resource has been anonymised; anonymous data is information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable |  | META-SHARE anonymised vocabulary | P
bias | langString | 0..n | A description of the possible biases affecting the overall output of the model | This property can be repeated for parallel language versions of the property. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. |  | MLDCAT-AP [A]
conforms to | Standard | 0..n | An established standard to which the described resource conforms |  | META-SHARE vocabulary of standards and best practices | P
context length | integer | 0..1 | The maximum amount of text that an AI model can process and retain in memory at any given time |  |  | P
creation details | langString | 0..n | Provides additional information on the creation of a language resource | This property can be repeated for parallel language versions of the property. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. |  | P
data protection principle applied | Data Protection Principle | 0..n | Specifies the data protection principles that have been applied in compliance with the General Data Protection Regulation (Regulation (EU) 2016/679) |  | META-SHARE Data Protection Principle vocabulary | P
description | langString | 1..n | An account of the resource | This property can be repeated for parallel language versions of the property. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. |  | MLDCAT-AP [E]
distribution | Distribution | 1..n | An available distribution of the dataset |  |  | P
detailed language | Language | 1..n | Specifies the language that is used in the resource or supported by the tool/service expressed according to the BCP47 recommendation | This property takes a value in compliance with the BCP47 recommendation and, thus, allows for the detailed description of the language (e.g., British English, Brazilian Portuguese, Greek written with the Latin script, etc.). It must be conformant with the value of "dct:language". |  | P
domain | Domain | 0..n | Identifies the domain according to which an entity is classified |  | META-SHARE vocabulary of domains | P
evaluation dataset | Corpus | 0..n | The dataset used for the evaluation of the machine learning model |  |  | P
evaluation results | langString | 0..n | A description of the evaluation results against the evaluation dataset | This property can be repeated for parallel language versions of the property. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. |  | P
finetune dataset | Corpus | 0..n | The dataset used to fine-tune the machine learning model |  |  | P
has policy | Policy | 1..n | Identifies an ODRL Policy for which the identified Asset is the target Asset to all the Rules | This property is used for "policies", i.e. the machine-readable representation of the licensing terms under which a dataset/distribution is made available. The representation MUST be expressed in the ODRL vocabulary. |  | MLDCAT-AP [E]
has technical organisational measure | Technical Organisational Measure | 0..n | Indicates use or applicability of Technical or Organisational measure |  | META-SHARE Technical and Organisational Measure vocabulary | P
identifier | Literal | 0..1 | An unambiguous reference to the resource within a given context | The main identifier for the resource, e.g. the URI or other unique identifier in the context of the Catalogue. It MUST be automatically assigned by the system when adding the resource to the Catalogue. |  | MLDCAT-AP [A]
IPR holder | Agent | 0..n | A person or an organisation who holds the full Intellectual Property Rights (Copyright, trademark, etc.) that subsist in the resource. The IPR holder could be different from the creator that may have assigned the rights to the IPR holder (e.g., an author as a creator assigns her rights to the publisher who is the IPR holder) and the distributor that holds a specific licence (i.e. a permission) to distribute the work via a specific distributor. | The IPR holder may be identical in many cases with the Publisher of the resource (see property dct:publisher). In this case, the contact data MAY be copied from the corresponding Publisher entries. There might also be cases with non-identical entities: e.g., when one or several IPR Holders assign another entity as the Publisher responsible for producing, hosting and publishing a resource; such entities may be, for instance, a data distribution agency, or a specific partner representing a project consortium. The subproperty ms:iprHolder is preferred over dct:rightsHolder in order to differentiate it from other types of rights (e.g. distribution rights). |  | P
is documented by | Document | 0..n | Links a language resource to a document (e.g., research paper describing its contents or its use in a project, user manual, etc.) or any other form of documentation (e.g., a URL with support information) that is related to the resource |  |  | P
keyword | langString | 1..n | A keyword or tag describing a resource | This property can be repeated for parallel language versions of the property. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. |  | MLDCAT-AP [E]
language | LinguisticSystem | 1..n | A language of the resource. | This property is used for the input language supported by a model and takes a value only for the language proper, without any reference to script, regional variant, etc. It caters for semantic interoperability with other catalogues, yet fails to capture the full language details that are required for language data. For a more detailed description, the "ms:language" property is preferred, but the two values must be aligned. | The EU Authority languages vocabulary (http://publications.europa.eu/resource/dataset/language) and, if not covered, the lexvo vocabulary for languages (http://lexvo.org/) | P
license | License Document | 1..n | A legal document giving official permission to do something with the resource | This property refers to a "licence", i.e. a human readable text with legal code, with which the resource/distribution is made available. This property SHOULD refer to a concrete standard or proprietary licence, so that the data users can assess the licence conditions in human-readable format before using the data. |  | MLDCAT-AP [E]
limitations | langString | 0..n | The limited capabilities of the model | This property can be repeated for parallel language versions of the property. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. |  | MLDCAT-AP [A]
linguality type | Linguality Type | 1..1 | Indicates whether the resource includes one, two or more languages |  | META-SHARE linguality type vocabulary | P
LR type | LR Type | 1..1 | Specifies the type of a language resource | This element allows for a more fine-grained categorisation of datasets into 'corpora' (collections of data files in text, audio, video and/or image modality), 'lexical/conceptual resources' (e.g., lexica, vocabularies, gazetteers, terminological lexica, etc.) and 'grammars'. Note: In the META-SHARE ontology, models and grammars are grouped as "language descriptions" and included as subclasses of datasets. Given the evolution of Machine Learning Models, this has been revisited and models are included as a separate subclass distinguished from datasets. | META-SHARE language resource type vocabulary | P
modality type | Modality Type | 1..n | Specifies the media type of a language resource (the physical medium of the contents representation) or of the input/output of a language processing tool/service; each media type is described through a distinctive set of technical features; a language resource may consist of different media parts |  | META-SHARE media type vocabulary | P
model function | Model Function | 1..n | The function/task/operation a model performs |  | OMTD-SHARE vocabulary of operations | P
model type | Model Type | 0..1 | A classification of models based on their algorithm |  | META-SHARE vocabulary of model types | P
other identifier | Identifier | 0..n | Links a resource to an adms:Identifier class. | This property MAY be used as an additional identifier for existing identifiers used for the same resource in other Catalogues (e.g. DOI, ISLRN, DataCite, Handle PIDs) |  | MLDCAT-AP [A]
parameter precision | string | 0..1 | The number of bits used to represent a model's parameters, which directly impacts the model's memory usage and computational efficiency |  |  | P
personal data details | langString | 0..n | If the resource includes personal data, this field can be used for entering more information, e.g., whether special handling of the resource is required (e.g., anonymisation, further request for use, etc.) | This property can be repeated for parallel language versions of the property. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. |  | P
personal data included | Personal Data Included | 1..1 | Specifies whether the language resource contains personal data, i.e., any information relating to an identified or identifiable natural person (data subject); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person [Article 4(1) of the General Data Protection Regulation (Regulation (EU) 2016/679)] | This property MUST be filled in for all non-anonymised datasets. If a dataset is anonymised, it is presumed that it contains no personal or sensitive data. | META-SHARE personal data included vocabulary | P
publisher | Agent | 1..1 | An entity responsible for making the resource available | This property refers to the entity that "publishes", i.e. makes available to the specific platform, the corresponding resource. The information may be identical to the property ms:iprHolder of the resource. |  | P
quantization process | langString | 0..n | The process used for the quantisation | This property can be repeated for parallel language versions of the property. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. |  | P
quantized | boolean | 0..1 | Whether this version is quantised |  |  | P
special category data details | langString | 0..n | If the resource includes special category data, this field can be used for entering more information, e.g., whether special handling of the resource is required (e.g., anonymisation, further request for use, etc.) | This property can be repeated for parallel language versions of the property. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. |  | P
special category data included | Special Category Data Included | 1..1 | Specifies whether the language resource contains special category data, i.e., personal data revealing racial or ethnic origin, political opinions, religious or philosophical beliefs, or trade union membership, and the processing of genetic data, biometric data for the purpose of uniquely identifying a natural person, data concerning health or data concerning a natural person’s sex life or sexual orientation [Article 9(1) of the General Data Protection Regulation (Regulation (EU) 2016/679)] | This property MUST be filled in for all non-anonymised datasets. If a dataset is anonymised, it is presumed that it contains no personal or sensitive data. | META-SHARE Special Category Data Included vocabulary | P
source language | Language | 0..n | The language from which a translation is made. |  |  | P
spatial | Location | 0..n | Spatial characteristics of the resource | A geographic region that is covered by the dataset. For language data, this refers to the geographic region where the language of the dataset is spoken/written and not where the dataset was implemented; for instance, in the case of a German institution creating a dataset of Cypriot Greek, the geographic region is "Cyprus". | The EU Vocabularies Continents Named Authority List for continents (http://publications.europa.eu/resource/dataset/continent), countries (http://publications.europa.eu/resource/dataset/country), places (http://publications.europa.eu/resource/dataset/place) and, if not covered, GeoNames (https://www.geonames.org/) | P
target language | Language | 0..n | The language into which a translation is made |  |  | P
temporal | PeriodOfTime | 0..n | Temporal characteristics of the resource | A temporal period that the contents of a dataset cover. For language data, this can be the time period in which the language of a dataset is spoken. |  | P
title | langString | 1..n | A name given to the resource | This property can be repeated for parallel language versions of the property. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. |  | P
trained on | Corpus | 0..n | The training dataset used to train the machine learning model |  |  | MLDCAT-AP [A]
variant of | Model | 0..n | The model on which this model is based, from which it is derived, or of which it is a variation |  |  | P
version | string | 1..1 | The version indicator (name or identifier) of a resource | This property MUST contain a version number or other version designation of the dataset. It is recommended to follow W3C Data on the Web Best Practices [DWBP]. Version identifiers should enable comparison of versions and distinguishing major from minor versions, such as Semantic Versioning [SEMVER]. |  | MLDCAT-AP [E]
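
The usage note for "version" asks for identifiers that allow versions to be compared and major changes to be told apart from minor ones. The sketch below illustrates why a Semantic Versioning style identifier meets this requirement. It assumes plain "MAJOR.MINOR.PATCH" strings without pre-release or build suffixes; the function names are hypothetical.

```python
def parse_version(version):
    """Split "MAJOR.MINOR.PATCH" into a tuple of integers.

    Assumption: no pre-release or build metadata (e.g. "-rc1", "+build5").
    """
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def change_level(old, new):
    """Report which component differs first between two versions."""
    old_parts, new_parts = parse_version(old), parse_version(new)
    for name, a, b in zip(("major", "minor", "patch"), old_parts, new_parts):
        if a != b:
            return name
    return "none"


# Tuples compare numerically, so 1.10.0 is correctly newer than 1.9.3,
# whereas a plain string comparison would order them the other way round.
print(parse_version("1.10.0") > parse_version("1.9.3"))
print(change_level("1.9.3", "1.10.0"))
print(change_level("1.9.3", "2.0.0"))
```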

Organisation

Definition
An organization.
Reference
No reference
Subclass of
Agent
Properties
For this entity the following properties are defined: identifier, name
Property | Range | Cardinality | Definition | Usage | Controlled vocabulary | Reference
identifier | Identifier | 0..n | Links a resource to an adms:Identifier class. | Recommended to use ROR for research institutions |  | P
name | langString | 1..n | A name for some thing. | This property can be repeated for parallel language versions of the property. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. |  | P

Person

Definition
A person.
Reference
No reference
Subclass of
Agent
Properties
For this entity the following properties are defined: identifier, name
Property | Range | Cardinality | Definition | Usage | Controlled vocabulary | Reference
identifier | Identifier | 0..n | Links a resource to an adms:Identifier class. | Recommended to use ORCID for researchers |  | P
name | langString | 1..n | A name for some thing. | This property can be repeated for parallel language versions of the property. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. |  | P

Policy

Definition
A non-empty group of Permissions and/or Prohibitions.
Reference
DCAT-AP [E]
Properties
For this entity the following properties are defined: alternative, description, had primary source, legal code, identifier, title, was attributed to
| Property | Range | Cardinality | Definition | Usage | Controlled vocabulary | Reference |
| alternative | langString | 0..n | An alternative name for the resource | This property can be repeated for parallel language versions of the property. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. | | P |
| description | langString | 1..n | An account of the resource | This property can be repeated for parallel language versions of the property. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. | | P |
| had primary source | Entity | 0..1 | A primary source for a topic refers to something produced by some agent with direct experience and knowledge about the topic, at the time of the topic's study, without benefit from hindsight. Because of the directness of primary sources, they 'speak for themselves' in ways that cannot be captured through the filter of secondary sources. As such, it is important for secondary sources to reference those primary sources from which they were derived, so that their reliability can be investigated. A primary source relation is a particular case of derivation of secondary materials from their primary sources. It is recognised that the determination of primary sources can be up to interpretation, and should be done according to conventions accepted within the application's domain. | | | P |
| legal code | Resource | 0..1 | The URL of the legal text of a License. | | | P |
| identifier | Identifier | 0..n | Links a resource to an adms:Identifier class. | | | P |
| title | langString | 1..n | A name given to the resource | This property can be repeated for parallel language versions of the property. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. | | P |
| was attributed to | Agent | 0..1 | Attribution is the ascribing of an entity to an agent. | | | P |

Processing resource

Definition
A set of requirements posed on the resource that is input for processing by a tool/service or that is output after the processing
Reference
No reference
Properties
For this entity the following properties are defined: detailed language, format, language, media type, modality type, processing resource type
| Property | Range | Cardinality | Definition | Usage | Controlled vocabulary | Reference |
| detailed language | Language | 0..n | Specifies the language that is used in the resource or supported by the tool/service expressed according to the BCP47 recommendation | This property takes a value in compliance with the BCP47 recommendation and, thus, allows for the detailed description of the language (e.g., British English, Brazilian Portuguese, Greek written with the Latin script, etc.). It must be conformant with the value of "dct:language". | | P |
| format | Media Type Or Extent | 0..n | The file format, physical medium, or dimensions of the resource | dcat:mediaType SHOULD be used when the media type of the distribution is defined in IANA [IANA-MEDIA-TYPES], otherwise dcterms:format MAY be used with different values. | OMTD-SHARE vocabulary of data formats | P |
| language | Linguistic System | 0..n | A language of the resource. | This property is used for the language of the contents of a dataset and takes a value only for the language proper, without any reference to script, regional variant, etc. It caters for semantic interoperability with other catalogues, yet fails to capture the full language details that are required for language data. For a more detailed description, the "ms:language" property is preferred, but the two values must be aligned. | The EU Authority languages vocabulary (http://publications.europa.eu/resource/dataset/language) and, if not covered, the lexvo vocabulary for languages (http://lexvo.org/) | P |
| media type | Media Type | 0..n | The media type of the distribution as defined by IANA [IANA-MEDIA-TYPES]. | dcat:mediaType SHOULD be used when the media type of the distribution is defined in IANA [IANA-MEDIA-TYPES], otherwise dcterms:format MAY be used with different values. | IANA-MEDIA-TYPES vocabulary | P |
| modality type | Modality Type | 0..n | Specifies the media type of a language resource (the physical medium of the contents representation) or of the input/output of a language processing tool/service; each media type is described through a distinctive set of technical features; a language resource may consist of different media parts | | META-SHARE media type vocabulary | P |
| processing resource type | Processing Resource Type | 1..1 | The type of the resource that a tool/service takes as input or produces as output | | META-SHARE processing resource types vocabulary | P |
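
The usage notes for "detailed language" and "language" require the two values to be aligned. A minimal sketch of such a check is given below. It assumes a deliberately tiny, hypothetical lookup table from BCP 47 primary subtags to the codes that end the EU Authority language URIs; a real implementation would load the complete vocabulary, and the function names are illustrative only.

```python
# Hypothetical, deliberately tiny lookup from BCP 47 primary subtags to the
# codes that end EU Authority language URIs. Not part of the profile.
PRIMARY_SUBTAG_TO_EU_CODE = {"en": "ENG", "pt": "POR", "el": "ELL"}

EU_LANGUAGE_PREFIX = "http://publications.europa.eu/resource/authority/language/"


def primary_subtag(bcp47_tag: str) -> str:
    """Return the lower-cased primary language subtag of a BCP 47 tag."""
    return bcp47_tag.split("-")[0].lower()


def is_aligned(detailed_language: str, language_uri: str) -> bool:
    """Check that a BCP 47 tag and a language URI denote the same language."""
    code = PRIMARY_SUBTAG_TO_EU_CODE.get(primary_subtag(detailed_language))
    return code is not None and language_uri == EU_LANGUAGE_PREFIX + code


# British English against the URI for English, then against a mismatching URI.
print(is_aligned("en-GB", EU_LANGUAGE_PREFIX + "ENG"))
print(is_aligned("pt-BR", EU_LANGUAGE_PREFIX + "ENG"))
```

The sketch only compares the primary subtag, so script and region subtags (as in "el-Latn" or "pt-BR") are ignored, and grandfathered or private-use tags are not handled.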

Software Distribution

Definition
Any form with which software is distributed (e.g., web services, executable or code files, etc.)
Reference
No reference
Properties
For this entity the following properties are defined: software distribution form, access location, download location, execution location
| Property | Range | Cardinality | Definition | Usage | Controlled vocabulary | Reference |
| access location | any URI | 0..1 | A URL where the resource can be accessed from; it can be used for landing pages or for cases where the resource is accessible via an interface, i.e. cases where the resource itself is not provided with a direct link for downloading | | | P |
| download location | any URI | 0..1 | A URL where the language resource (mainly data but also downloadable software programmes or forms) can be downloaded from | | | P |
| execution location | any URI | 0..1 | A URL where the resource (mainly software) can be directly executed | | | P |
| software distribution form | Software Distribution Form | 1..1 | The medium, delivery channel or form (e.g., source code, API, web service, etc.) through which a software object is distributed | | https://vocabularies.ilsp.gr/showvoc/#/datasets/Software_Distribution_Form_Taxonomy/data | P |

Tool/Service

Definition
A tool/service/any piece of software that performs language processing and/or any Language Technology related operation.
Reference
No reference.
Properties
For this entity the following properties are defined: alternative, description, domain, function, has policy, identifier, input content resource, IPR holder, is documented by, keyword, language dependent, license, LR type, other identifier, output resource, publisher, software distribution, title, version
| Property | Range | Cardinality | Definition | Usage | Controlled vocabulary | Reference |
| alternative | langString | 0..n | An alternative name for the resource. | It is recommended to use "dct:title" for the full name of a resource and "dct:alternative" for the short name. This property can be repeated for parallel language versions of the property. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. | | P |
| description | langString | 1..n | An account of the resource | This property can be repeated for parallel language versions of the property. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. | | DCAT-AP [E] |
| domain | Domain | 0..n | Identifies the domain according to which an entity is classified | | META-SHARE vocabulary of domains | P |
| function | Operation | 1..n | Specifies the operation/function/task that a software object performs | | OMTD-SHARE vocabulary of operations | P |
| has policy | Policy | 1..n | Identifies an ODRL Policy for which the identified Asset is the target Asset to all the Rules | This property is used for "policies", i.e. the machine-readable representation of the licensing terms under which a dataset/distribution is made available. The representation MUST be expressed in the ODRL vocabulary. | | P |
| identifier | Literal | 0..1 | An unambiguous reference to the resource within a given context | The main identifier for the resource, e.g. the URI or other unique identifier in the context of the Catalogue. It MUST be automatically assigned by the system when adding the resource to the Catalogue. | | DCAT-AP [A] |
| input content resource | Processing Resource | 1..n | Specifies the requirements set by a tool/service for the (content) resource that it processes | | | P |
| IPR holder | Agent | 0..n | A person or an organisation who holds the full Intellectual Property Rights (Copyright, trademark, etc.) that subsist in the resource. The IPR holder could be different from the creator that may have assigned the rights to the IPR holder (e.g., an author as a creator assigns her rights to the publisher who is the IPR holder) and the distributor that holds a specific licence (i.e. a permission) to distribute the work via a specific distributor. | The IPR holder may be identical in many cases with the Publisher of the resource (see property dct:publisher). In this case, the contact data MAY be copied from the corresponding Publisher entries. There might be also cases with non-identical entities: e.g., when one or several IPR Holders assign another entity as the Publisher responsible for producing, hosting and publishing a resource; such entities may be, for instance, a data distribution agency, or a specific partner representing a project consortium. The subproperty ms:iprHolder is preferred over dct:rightsHolder in order to differentiate with other types of rights (e.g. distribution rights). | | P |
| is documented by | Document | 0..n | Links a language resource to a document (e.g., research paper describing its contents or its use in a project, user manual, etc.) or any other form of documentation (e.g., a URL with support information) that is related to the resource | | | P |
| keyword | langString | 1..n | A keyword or tag describing a resource | This property can be repeated for parallel language versions of the property. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. | | DCAT-AP [E] |
| language dependent | boolean | 1..1 | Indicates whether the operation of the tool or service is language dependent or not | | | P |
| license | LicenseDocument | 1..n | A legal document giving official permission to do something with the resource | This property refers to a "licence", i.e. a human readable text with legal code, with which the resource/distribution is made available. This property SHOULD refer to a concrete standard or proprietary licence, so that the data users can assess the licence conditions in human-readable format before using the data. | | P |
| LR type | LR Type | 1..1 | Specifies the type of a language resource | This element allows for a more fine-grained categorisation of datasets into 'corpora' (collections of data files in text, audio, video and/or image modality), 'lexical/conceptual resources' (e.g., lexica, vocabularies, gazetteers, terminological lexica, etc.) and 'grammars'. Note: In the META-SHARE ontology, models and grammars are grouped as "language descriptions" and included as subclasses of datasets. Given the evolution of Machine Learning Models, this has been revisited and models are included as a separate subclass distinguished from datasets. | META-SHARE language resource type vocabulary | P |
| other identifier | Identifier | 0..n | Links a resource to an adms:Identifier class. | This property MAY be used as an additional identifier for existing identifiers used for the same resource in other Catalogues (e.g. DOI, ISLRN, DataCite, Handle PIDs) | | DCAT-AP [A] |
| output resource | Processing Resource | 0..n | Specifies the output results of a tool/service, i.e. the features of the processed (content) resource | | | P |
| publisher | Agent | 1..1 | An entity responsible for making the resource available | This property refers to the entity that "publishes", i.e. makes available to the specific platform, the corresponding resource. The information may be identical to the property ms:iprHolder of the resource. | | DCAT-AP [E] |
| software distribution | Software Distribution | 1..n | An available distribution of the tool/service | | | P |
| title | langString | 1..n | A name given to the resource | This property can be repeated for parallel language versions of the property. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. | | DCAT-AP [E] |
| version | string | 1..1 | The version indicator (name or identifier) of a resource | This property MUST contain a version number or other version designation of the dataset. It is recommended to follow W3C Data on the Web Best Practices [DWBP]. Version identifiers should enable comparison of versions and distinguishing major from minor versions, such as Semantic Versioning [SEMVER]. | | DCAT-AP [E] |
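
The cardinalities listed for Tool/Service can be checked mechanically before a record is submitted to a catalogue. The following is a minimal sketch, not the profile's validation mechanism: it assumes a record is held as a plain dictionary keyed by the property labels used in the table above (a hypothetical in-memory layout), with each value being a list.

```python
from typing import Optional

# (min, max) per property label, transcribed from the Tool/Service table;
# None stands for the unbounded upper limit "n".
TOOL_SERVICE_CARDINALITIES: dict[str, tuple[int, Optional[int]]] = {
    "alternative": (0, None),
    "description": (1, None),
    "domain": (0, None),
    "function": (1, None),
    "has policy": (1, None),
    "identifier": (0, 1),
    "input content resource": (1, None),
    "IPR holder": (0, None),
    "is documented by": (0, None),
    "keyword": (1, None),
    "language dependent": (1, 1),
    "license": (1, None),
    "LR type": (1, 1),
    "other identifier": (0, None),
    "output resource": (0, None),
    "publisher": (1, 1),
    "software distribution": (1, None),
    "title": (1, None),
    "version": (1, 1),
}


def cardinality_violations(record: dict[str, list]) -> list[str]:
    """Return one message per property whose value count is out of range."""
    problems = []
    for prop, (low, high) in TOOL_SERVICE_CARDINALITIES.items():
        count = len(record.get(prop, []))
        if count < low:
            problems.append(f"{prop}: expected at least {low}, found {count}")
        elif high is not None and count > high:
            problems.append(f"{prop}: expected at most {high}, found {count}")
    return problems


# A record with exactly one placeholder value for every mandatory property.
minimal = {
    prop: ["placeholder"]
    for prop, (low, _) in TOOL_SERVICE_CARDINALITIES.items()
    if low > 0
}
incomplete = {prop: values for prop, values in minimal.items() if prop != "title"}

print(cardinality_violations(minimal))     # prints []
print(cardinality_violations(incomplete))  # prints ['title: expected at least 1, found 0']
```

The check covers value counts only; ranges, language tags and controlled vocabulary membership are outside its scope.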

Supportive entities

The supportive entities support the main entities in the Application Profile. They are included in the Application Profile because they form the range of properties.

In addition to the entities described in this section, there are also a number of classes used as the range of properties that take values from controlled vocabularies. No properties are defined for these classes, which are therefore not described here; instead, they are listed in the table of controlled vocabularies (see the Section Controlled vocabularies).

Catalogue Record

Definition
A description of a Catalogued Resource's entry in the Catalogue.
Reference
DCAT-AP [A]
Properties
For this entity the following properties are defined: description, description version, modification date, primary topic.
| Property | Range | Cardinality | Definition | Usage | Controlled vocabulary | Reference |
| description | langString | 0..* | A free-text account of the record. | This property can be repeated for parallel language versions of the description. | | DCAT-AP [E] |
| description version | Literal | 0..1 | It refers to the version of the description. | '1' for original version. | | MLDCAT-AP [A] |
| modification date | Temporal Literal | 1 | The most recent date on which the Catalogue entry was changed or modified. | | | DCAT-AP [A] |
| primary topic | Catalogued Resource | 1 | A link to the Dataset, Data service or Catalog described in the record. | A catalogue record will refer to one entity in a catalogue. This can be either a Dataset or a Data Service. To ensure an unambiguous reading of the cardinality, the range is set to Catalogued Resource. However, it is not the intent of this range to require the explicit use of the class Catalogued Resource. As it is an abstract class, a subclass should be used. | | DCAT-AP [A] |

Catalogued Resource

Definition
Resource published or curated by a single agent.
Reference
DCAT-AP [A]
Usage Note
Abstract class for DCAT-AP. Therefore only subclasses should be used in a data exchange.
Properties
This specification does not impose any additional requirements to properties for this entity.

Concept

Definition
An idea or notion; a unit of thought.
Reference
DCAT-AP [A]
Properties
This specification does not impose any additional requirements to properties for this entity.

Document

Definition
A document.
Reference
No reference
Properties
For this entity the following properties are defined: citation text, identifier
| Property | Range | Cardinality | Definition | Usage | Controlled vocabulary | Reference |
| citation text | langString | 1..n | The text with which a document or language resource can be cited (typically the full citation, incl. title, authors, publisher, etc.) | This property can be repeated for parallel language versions of the property. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. | | P |
| identifier | Identifier | 0..n | Links a resource to an adms:Identifier class. | It is recommended to add the DOI or a URL where the document can be accessed | | P |

Identifier

Definition
This is based on the UN/CEFACT Identifier class. It consists of: a content string which is the identifier; an optional identifier for the identifier scheme; an optional identifier for the version of the identifier scheme; an optional identifier for the agency that manages the identifier scheme.
Reference
DCAT-AP [A]
Properties
For this entity the following properties are defined: notation, schema agency
| Property | Range | Cardinality | Definition | Usage | Controlled vocabulary | Reference |
| notation | string | 1..1 | A string that is an identifier in the context of the identifier scheme referenced by its datatype. | It MAY be used to display the identifier as recommended in the original scheme (e.g. the CrossRef and DataCite display guidelines recommend displaying DOIs as a full URL link in the form https://doi.org/10.xxxx/xxxxx/). The rdfs:Literal must be typed (e.g. ^^anyURI) | | P |
| schema agency | langString | 0..n | The name of the agency that issued the identifier. | It MAY be used to represent the authority that defines the identifier scheme (e.g. the DOI foundation) when the authority has an IRI associated to it. This property can be repeated for parallel language versions of the property. The language tag is mandatory and is defined by BCP47 [[rfc5646]]. | | P |
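
The usage note for "notation" mentions displaying DOIs as full URLs. A minimal sketch of such a normalisation step follows. The function name, the accepted input prefixes and the DOI "10.1234/example" are illustrative assumptions and are not prescribed by the profile.

```python
import re

# Prefixes commonly found in front of a bare DOI (assumed list, not exhaustive).
_DOI_PREFIXES = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:)", re.IGNORECASE)


def doi_as_url(raw: str) -> str:
    """Return a DOI in the full URL form https://doi.org/10.xxxx/xxxxx."""
    bare = _DOI_PREFIXES.sub("", raw.strip())
    if not bare.startswith("10."):
        raise ValueError(f"not a DOI: {raw!r}")
    return "https://doi.org/" + bare


# "10.1234/example" is a made-up DOI used only for illustration.
print(doi_as_url("doi:10.1234/example"))  # prints https://doi.org/10.1234/example
```

The resulting string can then be used as the value of "notation", typed as anyURI as the usage note requires.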

Linguistic System

Definition
A system of signs, symbols, sounds, gestures, or rules used in communication, e.g. a language.
Reference
DCAT-AP [A]
Properties
This specification does not impose any additional requirements to properties for this entity.

Location

Definition
A spatial region or named place.
Reference
DCAT-AP [A]
Properties
This specification does not impose any additional requirements to properties for this entity.

Media Type

Definition
A file format or physical medium.
Reference
DCAT-AP [A]
Properties
This specification does not impose any additional requirements to properties for this entity.

Media Type Or Extent

Definition
A media type or extent.
Reference
DCAT-AP [A]
Properties
This specification does not impose any additional requirements to properties for this entity.

Period of Time

Definition
An interval of time that is named or defined by its start and end dates.
Reference
DCAT-AP [A]
Properties
For this entity the following properties are defined: end date, start date
| Property | Range | Cardinality | Definition | Usage | Controlled vocabulary | Reference |
| end date | date | 1..1 | The end of the period | | | DCAT-AP [E] |
| start date | date | 1..1 | The start of the period | Please note that while both properties are recommended, one of the two must be present for each instance of the class dct:PeriodOfTime, if such an instance is present. NOTE: gYear or dateTime for LDS | | DCAT-AP [E] |

Size

Definition
The size of the resource with regard to the SizeUnit measurement in form of a number.
Reference
No reference
Properties
For this entity the following properties are defined: amount, size unit
| Property | Range | Cardinality | Definition | Usage | Controlled vocabulary | Reference |
| amount | float | 1..1 | Specifies the number of units that constitute anything that can be measured (e.g. size of a data resource or cost, etc.) | | | P |
| size unit | SizeUnit | 1..1 | Specifies the unit that is used when providing information on the size of the resource or of resource parts | | META-SHARE size unit vocabulary | P |

Datatypes

The following datatypes are used within this specification.

| Class | Definition |
| boolean | Boolean has the value space required to support the mathematical concept of binary-valued logic: {true, false} |
| date | The value space of date consists of top-open intervals of exactly one day in length on the timelines of dateTime, beginning on the beginning moment of each day (in each timezone), i.e. '00:00:00', up to but not including '24:00:00' (which is identical with '00:00:00' of the next day). |
| float | It represents IEEE single-precision 32-bit floating-point numbers. It supports decimal notation, scientific notation, and special values like positive and negative infinity (INF, -INF) and not-a-number (NaN). |
| integer | Integer is derived from decimal by fixing the value of fractionDigits to be 0 and disallowing the trailing decimal point. This results in the standard mathematical concept of the integer numbers. |
| langString | The datatype of language-tagged string values. |
| Literal | A literal value such as a string or integer; Literals may be typed, e.g. as a date according to xsd:date. Literals that contain human-readable text have an optional language tag as defined by BCP 47 [[rfc5646]] |
| nonNegativeInteger | Number derived from integer by setting the value of minInclusive to be 0. |
| string | Any character string in XML. |
| Temporal Literal | rdfs:Literal encoded using the relevant [[ISO8601]] Date and Time compliant string and typed using the appropriate XML Schema datatype (xsd:gYear, xsd:gYearMonth, xsd:date, or xsd:dateTime). |
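
The definition of Temporal Literal leaves the choice among the four XML Schema datatypes to the lexical form of the value. One way to make that choice is sketched below. The function name and the regular expressions are illustrative assumptions; they inspect only the shape of the string and do not validate calendar ranges or the optional time zone that XML Schema also permits on xsd:gYear, xsd:gYearMonth and xsd:date.

```python
import re

# Lexical shapes, checked in order; each pattern must match the whole string.
_TEMPORAL_PATTERNS = [
    ("xsd:gYear", re.compile(r"\d{4}")),
    ("xsd:gYearMonth", re.compile(r"\d{4}-\d{2}")),
    ("xsd:date", re.compile(r"\d{4}-\d{2}-\d{2}")),
    (
        "xsd:dateTime",
        re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:\d{2})?"),
    ),
]


def temporal_datatype(value: str) -> str:
    """Return the XML Schema datatype name matching the shape of `value`."""
    for datatype, pattern in _TEMPORAL_PATTERNS:
        if pattern.fullmatch(value):
            return datatype
    raise ValueError(f"not a supported temporal literal: {value!r}")


print(temporal_datatype("2024"))                  # prints xsd:gYear
print(temporal_datatype("2024-05-17T10:30:00Z"))  # prints xsd:dateTime
```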

Controlled Vocabularies

Requirements for controlled vocabularies

LanguageDCAT-AP adopts the requirements set in DCAT-AP for the recommendation of controlled vocabularies. According to these, controlled vocabularies SHOULD:

The following two requirements identified for DCAT-AP are considered of less importance for LanguageDCAT-AP:

Controlled vocabularies to be used

LanguageDCAT-AP specifies a number of controlled vocabularies that MUST be used for specific properties in order to increase semantic interoperability. These vocabularies are selected from among those recommended in DCAT-AP and MLDCAT-AP, and those used in the language data/technologies and neighbouring communities. In the first case, they are controlled by an EU institution, while in the latter case, they are published and maintained by the LDS metadata working group. Requests for changes and additions to the LDS-controlled vocabularies can be made by opening a new issue at https://github.com/LanguageDCAT-AP/LanguageDCAT-AP/issues.

The following table lists the properties that take values from controlled vocabularies, as well as the classes for which these properties are used. The range of these properties is defined as a class (cf. column "Range") as well as a skos:Concept belonging to the specified controlled vocabulary.

| Property | Used for class | Range | Controlled vocabulary name | Usage |
| annotation type | Corpus | Annotation Type | Annotation Type Taxonomy | |
| anonymized | Corpus, LCR, Model | Anonymized | Anonymized Taxonomy | |
| corpus subclass | Corpus | Corpus Subclass | Corpus Subclass Taxonomy | |
| format | Distribution | Media Type Or Extent | Data Format Taxonomy | |
| data protection principle applied | Corpus, LCR, Model | Data Protection Principle | Data Protection Principle Taxonomy | |
| domain | Corpus, LCR, Model | Domain | Domain Taxonomy | |
| LR type | Corpus, LCR, Model | LR Type | Language Resource Type Taxonomy | |
| lingualityType | Corpus, LCR, Model | Linguality Type | Linguality Type Taxonomy | |
| modality type | Corpus, LCR, Model | Modality Type | Media Type Taxonomy | |
| linguistic information | LCR | Microstructure Feature | Microstructure Feature Taxonomy | |
| model type | Model | Model Type | Model Type Taxonomy | |
| multilinguality type | Corpus, LCR | Multilinguality Type | Multilinguality Type Taxonomy | |
| model function | Model | Operation | Operation Taxonomy | |
| package format | Distribution | Media Type | Package Format Taxonomy | |
| personal data included | Corpus, LCR, Model | Personal Data Included | Personal Data Included Taxonomy | |
| processing resource type | Processing resource | Processing Resource Type | Processing Resource Types Taxonomy | |
| region | Language | Region | Region Taxonomy | |
| size unit | Size | Size Unit | Size Unit Taxonomy | |
| script | Language | Script | Script Taxonomy | |
| software distribution form | Software Distribution | Software Distribution Form | Software Distribution Forms Taxonomy | |
| special category data included | Corpus, LCR, Model | Special Category Data Included | Special Category Data Included Taxonomy | |
| conforms to | Corpus, LCR, Model | Standard | Standards Best practices Taxonomy | |
| LCR subclass | LCR | LCR Subclass | Taxonomy of Subclasses of Lexical/Conceptual Resources | |
| has technical organisational measure | Corpus, LCR, Model | Technical Organisational Measure | Technical and Organisational Measure Taxonomy | |
| variant | Language | Variant | Variant Taxonomy | |
| media type | Distribution | Media Type | IANA-MEDIA-TYPES vocabulary | |
| spatial | Corpus, LCR, Model | Location | The EU Vocabularies Named Authority Lists for continents, countries, places and, if not covered, GeoNames | |
| language | Corpus, LCR, Model | Linguistic System | The EU Authority languages vocabulary and, if not covered, the lexvo vocabulary for languages | |

Support for implementation

This section provides support for implementing LanguageDCAT-AP.

JSON-LD context file

One common technical question is the format in which the data is exchanged. For conformance with LanguageDCAT-AP, it is not mandatory that the exchange happens in an RDF serialisation, but the exchanged format SHOULD be unambiguously transformable into RDF. For JSON, a popular format for exchanging data between systems, a JSON-LD context file is provided, following the same approach used for DCAT-AP and MLDCAT-AP. JSON-LD 1.1 is a W3C Recommendation that provides a standard approach to interpreting JSON structures as RDF. The provided JSON-LD context file can be used by implementers. This JSON-LD context is not normative, i.e. other JSON-LD contexts are allowed.

The JSON-LD context file can be downloaded here.
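
To illustrate what a JSON-LD context does, the sketch below maps short JSON keys to full property IRIs. The context shown is a hypothetical three-term example and is not the context file provided for LanguageDCAT-AP; the key replacement is a simplification for illustration and does not implement the JSON-LD expansion algorithm. The record values are made up.

```python
import json

# Hypothetical context and record; the IRIs are the Dublin Core Terms and DCAT
# property IRIs for title, description and keyword.
document = json.loads("""
{
  "@context": {
    "title": "http://purl.org/dc/terms/title",
    "description": "http://purl.org/dc/terms/description",
    "keyword": "http://www.w3.org/ns/dcat#keyword"
  },
  "title": {"@value": "Example tokenizer", "@language": "en"},
  "keyword": {"@value": "tokenization", "@language": "en"}
}
""")


def expand_keys(doc: dict) -> dict:
    """Replace top-level keys by the IRIs that the document's context assigns."""
    context = doc.get("@context", {})
    return {
        context.get(key, key): value
        for key, value in doc.items()
        if key != "@context"
    }


expanded = expand_keys(document)
print(json.dumps(expanded, indent=2))
```

In practice a JSON-LD processor would perform the expansion; the point here is only that the context is what makes plain JSON keys interpretable as RDF properties.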

Validation

To verify whether the data is (technically) conformant to LanguageDCAT-AP, the exchanged data can be validated using the provided SHACL shapes. SHACL is a W3C Recommendation for expressing constraints on an RDF knowledge graph.

To support checking whether a catalogue satisfies the constraints expressed in this Application Profile, the constraints in this specification are expressed using SHACL [[shacl]]. Each constraint in this specification that could be converted into a SHACL expression has been included. However, it should be noted that the SHACL shapes have been implemented and tested for Language Data Space purposes, and may include constraints specific to the Data Spaces requirements (cf. Section Context). This collection of SHACL expressions can therefore be used to build a validation check for data, but with caution.

It is up to the implementers to define the validation they expect. Each implementation happens within a context, and that context is beyond the SHACL expressions here.

The shapes can be found here.

Profile in Turtle format

All classes, properties and individuals used in the LanguageDCAT-AP profile are available in a file in Turtle format here.

Examples

Acknowledgements

The LanguageDCAT-AP profile builds upon work carried out in the framework of various initiatives, the most notable of which is the Linked Data for Language Technology (LD4LT) W3C Community Group, and infrastructural projects (META-SHARE, CLARIN, CLARIN-EL, OpenMinTeD, European Language Grid), and has been consolidated in the Language Data Space. We would like to acknowledge all persons who have contributed to this work:

Victoria Arranz, Sophie Aubin, Richard Eckart de Castilho, Khalid Choukri, Philipp Cimiano, Miltos Deligiannis, Elina Desypri, Victor Rodriguez Doncel, Gil Francopoulo, Francesca Frontini, Dimitris Galanis, Maria Gavriilidou, Maria Giagkou, Katerina Gkirtzou, Jorge Gracia, the LD4LT Community Group contributors, the META-SHARE metadata working group, Christianne Klaes, Petr Knoth, David Lindemann, Valerie Mapelli, John P. McCrae, Monica Monacchini, Claire Nedellec, Stelios Piperidis, Claudia Soria, Kossay Talmoudi, Marta Villegas, Leon Voukoutis.

We also gratefully acknowledge the guidance and feedback provided by the SEMIC group, and especially Pavlina Fragkou, Anastasia Sofou, Emidio Stani, and Ine Weyts, in our most recent endeavours.