{
  "access": {
    "embargo": {
      "active": false,
      "reason": null
    },
    "files": "public",
    "record": "public",
    "status": "open"
  },
  "created": "2019-01-08T10:22:32.409500+00:00",
  "custom_fields": {},
  "deletion_status": {
    "is_deleted": false,
    "status": "P"
  },
  "files": {
    "count": 2,
    "enabled": true,
    "entries": {
      "JOCHRE_2015.zip": {
        "checksum": "md5:670f6b06177a947140dac65acb1ad123",
        "ext": "zip",
        "id": "9aea27ba-bdae-495b-b62f-d475d6332f09",
        "key": "JOCHRE_2015.zip",
        "metadata": null,
        "mimetype": "application/zip",
        "size": 5040156
      },
      "TESS_2015.traineddata": {
        "checksum": "md5:2496c68c163afe1a8e473d81ce0d802d",
        "ext": "bin",
        "id": "7ae3b72f-5606-442f-a800-10a3609605c0",
        "key": "TESS_2015.traineddata",
        "metadata": null,
        "mimetype": "application/octet-stream",
        "size": 3182964
      }
    },
    "order": [],
    "total_bytes": 8223120
  },
  "id": "2533877",
  "is_draft": false,
  "is_published": true,
  "links": {
    "access": "https://zenodo.org/api/records/2533877/access",
    "access_grants": "https://zenodo.org/api/records/2533877/access/grants",
    "access_links": "https://zenodo.org/api/records/2533877/access/links",
    "access_request": "https://zenodo.org/api/records/2533877/access/request",
    "access_users": "https://zenodo.org/api/records/2533877/access/users",
    "archive": "https://zenodo.org/api/records/2533877/files-archive",
    "archive_media": "https://zenodo.org/api/records/2533877/media-files-archive",
    "communities": "https://zenodo.org/api/records/2533877/communities",
    "communities-suggestions": "https://zenodo.org/api/records/2533877/communities-suggestions",
    "doi": "https://doi.org/10.5281/zenodo.2533877",
    "draft": "https://zenodo.org/api/records/2533877/draft",
    "files": "https://zenodo.org/api/records/2533877/files",
    "latest": "https://zenodo.org/api/records/2533877/versions/latest",
    "latest_html": "https://zenodo.org/records/2533877/latest",
    "media_files": "https://zenodo.org/api/records/2533877/media-files",
    "parent": "https://zenodo.org/api/records/2533876",
    "parent_doi": "https://zenodo.org/doi/10.5281/zenodo.2533876",
    "parent_html": "https://zenodo.org/records/2533876",
    "requests": "https://zenodo.org/api/records/2533877/requests",
    "reserve_doi": "https://zenodo.org/api/records/2533877/draft/pids/doi",
    "self": "https://zenodo.org/api/records/2533877",
    "self_doi": "https://zenodo.org/doi/10.5281/zenodo.2533877",
    "self_html": "https://zenodo.org/records/2533877",
    "self_iiif_manifest": "https://zenodo.org/api/iiif/record:2533877/manifest",
    "self_iiif_sequence": "https://zenodo.org/api/iiif/record:2533877/sequence/default",
    "versions": "https://zenodo.org/api/records/2533877/versions"
  },
  "media_files": {
    "count": 0,
    "enabled": false,
    "entries": {},
    "order": [],
    "total_bytes": 0
  },
  "metadata": {
    "creators": [
      {
        "affiliations": [
          {
            "name": "Universit\u00e9 de Poitiers"
          }
        ],
        "person_or_org": {
          "family_name": "Marianne Vergez-Couret",
          "name": "Marianne Vergez-Couret",
          "type": "personal"
        }
      }
    ],
    "description": "<p>This dataset provides trained Tesseract (<a href=\"https://github.com/tesseract-ocr/tesseract\">https://github.com/tesseract-ocr/tesseract</a>) and Jochre (<a href=\"https://github.com/urieli/jochre\">https://github.com/urieli/jochre</a>) OCR models for Occitan ( for the standard spelling and two dialects, Gascon and Lengadocian). These models were developed in the context of the RESTAURE project, funded by the French ANR.&nbsp;</p>\n\n<p>Two models are provided. They were presented in the following article <a href=\"http://hal.archives-ouvertes.fr/hal-01252241\">https://hal.archives-ouvertes.fr/hal-01252241</a> and also re-evaluated for the creation of another corpus in <a href=\"https://www.openscience.fr/Constitution-et-annotation-d-un-corpus-ecrit-de-contes-et-recits-en-occitan\">https://www.openscience.fr/Constitution-et-annotation-d-un-corpus-ecrit-de-contes-et-recits-en-occitan</a>.</p>\n\n<p>The first model for Jochre, JOCHRE_2015, has been trained for Jochre 1.1.2b. The training images and corresponding texts were manually annotated using a Jochre online platform (excerpts from 7 different printed works, totalling about 20,000 words)</p>\n\n<p>The second model for Tesseract, TESS_2015, was trained using the jTessBoxEditor tool (<a href=\"http://vietocr.sourceforge.net/training.html\">http://vietocr.sourceforge.net/training.html</a>), Version 1.4 (2 May 2015), based on images automatically generated from the training texts (the one used for Jochre). The generation of the images used a 36pt font size, and two fonts were used (Arial and Times New Roman), with their normal and italic variants. The Tesseract model can be used with Tesseract 3.0x.</p>\n\n<p>List of words was also used for those two trainings. We conflated Occitan words found in several lexicons, dictionaries and corpora for the two dialects, Gascon and Lengadocian:</p>\n\n<ul>\n\t<li>Lexicon extracted from 60 literary works (from 29 different authors) gathered in the BaTel&Ograve;c project.</li>\n\t<li>Dictonary entries from <em>Dictionnaire Fran&ccedil;ais/Occitan Gascon Toulousain</em> de Nicolau Rei B&egrave;thv&eacute;der, 2004, IEO Edicions</li>\n\t<li>Dictonary entries from <em>Dictionnaire Fran&ccedil;ais/Occitan</em> de Cristian Laus, 2004, IEO/IDECO</li>\n\t<li>Dictonary entries from <em>Dictionnaire Fran&ccedil;ais/Occitan (Gascon)</em> de Miqu&egrave;u Grosclaude, Gilab&egrave;rt Nari&ograve;o e Patric Guilhemjoan, 2007, Per Noste Edicions</li>\n\t<li>Conjugated forms from Verb&rsquo;&Ograve;c (designed by the <em>Congr&egrave;s permanent de la lenga occitana</em> (<a href=\"http://www.locongres.org\">http://www.locongres.org</a>))</li>\n\t<li>List of proper nouns extracted from the Apertium (free/open-source machine translation platform) Occitan lexicon.</li>\n</ul>\n\n<p>The jochre model can be used with the Jochre software (<a href=\"https://github.com/urieli/jochre\">https://github.com/urieli/jochre</a>). See also Jochre wiki (https://github.com/urieli/jochre/wiki).</p>\n\n<p>The Tesseract models can be used&nbsp; for instance using the gImageReader tool (<a href=\"https://github.com/manisandro/gImageReader\">https://github.com/manisandro/gImageReader</a>), which provides a graphical user interface for the Tesseract tool.&nbsp;</p>\n\n<p>When evaluated against the same test corpus (four extracts from four different authors from two dialects, Gascon and Lengadocian), the Jochre model achieves better performance levels.</p>\n\n<p>&nbsp;</p>",
    "publication_date": "2019-01-08",
    "publisher": "Zenodo",
    "resource_type": {
      "id": "other",
      "title": {
        "de": "Sonstige",
        "en": "Other"
      }
    },
    "rights": [
      {
        "description": {
          "en": "The Creative Commons Attribution license allows re-distribution and re-use of a licensed work on the condition that the creator is appropriately credited."
        },
        "icon": "cc-by-icon",
        "id": "cc-by-4.0",
        "props": {
          "scheme": "spdx",
          "url": "https://creativecommons.org/licenses/by/4.0/legalcode"
        },
        "title": {
          "en": "Creative Commons Attribution 4.0 International"
        }
      }
    ],
    "subjects": [
      {
        "subject": "OCR module, Tesseract, Jochre, Occitan, OCR"
      }
    ],
    "title": "OCR models for Occitan (standard spelling)",
    "version": "1"
  },
  "parent": {
    "access": {
      "owned_by": {
        "user": "53563"
      }
    },
    "communities": {
      "default": "b9ed9d37-de6c-4bad-870d-502b029bfc9a",
      "entries": [
        {
          "access": {
            "member_policy": "open",
            "members_visibility": "public",
            "record_policy": "open",
            "review_policy": "open",
            "visibility": "public"
          },
          "children": {
            "allow": false
          },
          "created": "2018-02-09T14:01:43.797239+00:00",
          "custom_fields": {},
          "deletion_status": {
            "is_deleted": false,
            "status": "P"
          },
          "id": "b9ed9d37-de6c-4bad-870d-502b029bfc9a",
          "links": {},
          "metadata": {
            "curation_policy": "<p>The collection includes research output fully or partially funded by the RESTAURE project.</p>\r\n",
            "page": "<p>Resources and tools produced in the RESTAURE projet.</p>\r\n\r\n<p>Goals: The overall objective of the RESTAURE project is to provide computational resources and processing tools for three regional languages of France: Alsatian, Occitan and Picard.</p>\r\n\r\n<p>Full project title: RESsources informatis&eacute;es et Traitement AUtomatique pour les langues REgionales / Computational Resources and Processing for Regional Languages</p>\r\n\r\n<p>Funding: Project funded by the ANR, convention ANR-14-CE24-0003</p>\r\n\r\n<p>Project start: Janurary 1st, 2015</p>\r\n\r\n<p>Duration: 42 months</p>\r\n\r\n<p>&nbsp;</p>",
            "title": "RESTAURE project"
          },
          "revision_id": 0,
          "slug": "restaure",
          "updated": "2018-02-09T14:01:44.165187+00:00"
        }
      ],
      "ids": [
        "b9ed9d37-de6c-4bad-870d-502b029bfc9a"
      ]
    },
    "id": "2533876",
    "pids": {
      "doi": {
        "client": "datacite",
        "identifier": "10.5281/zenodo.2533876",
        "provider": "datacite"
      }
    }
  },
  "pids": {
    "doi": {
      "client": "datacite",
      "identifier": "10.5281/zenodo.2533877",
      "provider": "datacite"
    },
    "oai": {
      "identifier": "oai:zenodo.org:2533877",
      "provider": "oai"
    }
  },
  "revision_id": 2,
  "stats": {
    "all_versions": {
      "data_volume": 150667704.0,
      "downloads": 38,
      "unique_downloads": 30,
      "unique_views": 287,
      "views": 303
    },
    "this_version": {
      "data_volume": 150667704.0,
      "downloads": 38,
      "unique_downloads": 30,
      "unique_views": 285,
      "views": 301
    }
  },
  "status": "published",
  "updated": "2019-01-08T10:49:17.148181+00:00",
  "versions": {
    "index": 1,
    "is_latest": true
  }
}