Dataset Open Access

Unipen data set of on-line (vectorial) handwriting - train_r01_v07

Consortium

Other(s)
Several institutions and companies

/*****************************************************************************\
*                                                                             *
*                                                                             *
*   This is the first UNIPEN distribution of the iUF                          *
*                                                                             *
*   This distribution comprises NIST train_r01_v07                            *
*                                                                             *
*  http://www.unipen.org/                                                     *
*                                                                             *
*  Source code: C/Linux at                                                    *
*  http://www.sourcefiles.org/Scientific/Other_Sciences/uptools3.tar.gz       *
*                                                                             *
*                                                                             *
*           The International Unipen Foundation, December 1999                *
*                                                                             *
*                                                                             *
*******************************************************************************
*                                                                             *
*                                                                             *
*  DISCLAIMER AND COPYRIGHT NOTICE FOR ALL DATA CONTAINED ON THIS CDROM:      *
*                                                                             *
*                                                                             *
*  1) PERMISSION IS HEREBY GRANTED TO USE THE DATA FOR RESEARCH               *
*     PURPOSES. IT IS NOT ALLOWED TO DISTRIBUTE THIS DATA FOR COMMERCIAL      *
*     PURPOSES.                                                               *
*                                                                             *
*     Copyright 1999, International Unipen Foundation - All rights reserved   *
*                                                                             *
*  2) PROVIDER GIVES NO EXPRESS OR IMPLIED WARRANTY OF ANY KIND AND ANY       *
*     IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR PURPOSE ARE       *
*     DISCLAIMED.                                                             *
*                                                                             *
*  3) PROVIDER SHALL NOT BE LIABLE FOR ANY DIRECT, INDIRECT, SPECIAL,         *
*     INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF ANY USE OF THIS      *
*     DATA.                                                                   *
*                                                                             *
*  4) THE CONDITIONS OF USE REQUIRE PROPER REFERENCE TO THIS DATABASE         *
*     AS DESCRIBED IN ACCOMPANYING DOCUMENT 'unipen-conditions-of-use.html'   *
*                                                                             *
\*****************************************************************************/

Contents of the CDROM:
----------------------

1) This file, called CDROM-README
2) The NIST distribution, of which part of the directory tree is listed here.

train_r01_v07
    include
        abm  apb  app  atu  bbd  ced  gmd  ibm  kai  lou  pap  pri  sta  uqb
        aga  apc  art  bba  cea  cee  hpb  imp  kar  mot  par  rim  syn  val
        anj  apd  ata  bbb  ceb  cef  hpp  imt  lav  nic  pcl  scr  tos
        apa  ape  att  bbc  cec  dar  huj  int  lex  not  phi  sie  ugi

    data
        1a
            aga  apb  art  ceb  gmd  imp  pri  tos  val
            apa  app  cea  ced  ibm  lou  syn  uqb
        1b  1c  1d  2   3   4   5   6   7   8

All files on the CDROM were tested for UNIPEN integrity using uplib.
The description of the contents is given below:


Description of the contents:
----------------------------

For a description and examples of the UNIPEN format, see http://www.unipen.org/

The UNIPEN files contained in this release are organized in 11 categories, listed
below (categories 4 and 5 are empty in this release). The number of .SEGMENT
entries and the number of files for each category are given:

 cat   nsegm  nfiles  description
  1a   15953     634  isolated digits
  1b   28069    1423  isolated upper case
  1c   61351    2145  isolated lower case
  1d   17286    1222  isolated symbols (punctuations etc.)
  2   122628    2735  isolated characters, mixed case
  3    67352    1949  isolated characters in the context of words or texts
  4        0       0  isolated printed words, not mixed with digits and symbols
  5        0       0  isolated printed words, full character set
  6    75529    3298  isolated cursive or mixed-style words (without digits and symbols)
  7    85213    3393  isolated words, any style, full character set
  8    14544    4563  text: (minimally two words of) free text, full character set

In each directory representing a category, e.g., data/1a, a number of
sub-directories are contained. The name of a subdirectory is a
three-letter word identifying the contributor of the data.

Consider for example the UNIPEN files contributed by 'aga' of category
1a (isolated digits). The files containing .SEGMENT entries are contained
in the 'data' directory:
    data/1a/aga

Most files in this distribution contain one or more .INCLUDE statements.
The corresponding files are found in the 'include' directory, in this case:
    include/aga
Some files (such as the 'imp' contributions) use nested .INCLUDE statements.
The software contained in the uptools3 distribution contains code to find
files to be included based on an environment variable.
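
The lookup described above can be sketched in Python as follows. This is an
illustration only, not the uptools3 code: the environment variable name
UNIPEN_INCLUDE_DIR is a placeholder invented for this sketch (check the
uptools3 sources for the name the real tools read), and it is assumed that the
argument of .INCLUDE is a path relative to the 'include' directory and that the
files can be read as latin-1 text.

```python
import os
from pathlib import Path

# Hypothetical variable name used only in this sketch; it is NOT taken
# from uptools3.
INCLUDE_ENV_VAR = "UNIPEN_INCLUDE_DIR"


def expand_includes(path, include_root=None, _active=None):
    """Yield the lines of a UNIPEN file with .INCLUDE statements expanded.

    Nested .INCLUDE statements are followed recursively; a file that
    includes itself (directly or indirectly) raises ValueError.
    """
    if include_root is None:
        # Fall back to a local 'include' directory if the variable is unset.
        include_root = os.environ.get(INCLUDE_ENV_VAR, "include")
    include_root = Path(include_root)
    active = set() if _active is None else _active

    path = Path(path).resolve()
    if path in active:
        raise ValueError(f"circular .INCLUDE: {path}")
    active.add(path)

    with open(path, encoding="latin-1") as handle:
        for line in handle:
            fields = line.split()
            if len(fields) >= 2 and fields[0] == ".INCLUDE":
                # Assumption: the argument is relative to the include root.
                yield from expand_includes(include_root / fields[1],
                                           include_root, active)
            else:
                yield line.rstrip("\n")

    # Allow the same file to be included again from a different place.
    active.discard(path)
```

Under these assumptions, list(expand_includes("data/1a/aga/<file>", "include"))
would return the lines of one data file with its include files inlined, where
<file> stands for any file name in that directory.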


Distribution of categories per contributor:
-------------------------------------------

         1a    |    1b    |    1c    |    1d    |     2     |    3      |     6      |     7      |     8
--------------------------------------------------------------------------------------------------------------
abm |          |          |          |          |           |           |   628    4 |   646    4 |    7   3 |
aga |  405  14 | 1115  14 | 1063  14 |  221  14 |  2804  14 |           |            |            |  605  14 |
anj |          |          |          |          |           |           |  1435    6 |  1435    6 |          |
apa |  692  74 | 2236 247 | 7414 391 | 1953 268 | 12295 527 | 12295 527 |            |            |  527 527 |
apb | 2033 138 | 3450 466 | 8869 434 |  946 233 | 15298 590 | 15298 590 |            |            |  590 590 |
apc |          |          |          |          |           |           |  1724  441 |  1798  444 |  444 444 |
apd |          |          |          |          |           |           |  1958  453 |  2448  507 |  507 507 |
ape |          |          |          |          |           |           |  1384  286 |  1848  322 |  322 322 |
app | 1046 115 | 3010 353 |10370 556 | 2886 400 | 17312 745 | 17312 745 |            |            |  745 745 |
art |  170   6 | 1042   6 | 2301   6 |  202   6 |  3715   6 |  3715   6 |   687    6 |   933    6 |  186   6 |
att |          |          |          |          |           |           |   932   29 |  2253   29 |  819  30 |
atu |          |          |          |          |           |           |            |            |   92  92 |
bba |          |          |          |          |           |           |            |            |   63  63 |
bbb |          |          |          |          |           |           |            |            |   51  51 |
bbc |          |          |          |          |           |           |            |            |   61  61 |
bbd |          |          |          |          |           |           |            |            |  858 858 |
cea |    7   3 |   57   6 | 1402   6 |   35   6 |  1501   6 |  1501   6 |   311    6 |   345    6 |   38   6 |
ceb |   16   2 |   30   4 |  488   4 |    8   3 |   542   4 |   542   4 |   116    4 |   129    4 |   22   4 |
cec |          |          |          |          |           |           |  4880   35 |  5625   35 |  604  35 |
ced | 1369  42 | 2691  42 | 2619  43 | 1077  43 |  7756  43 |  7756  43 |            |            | 1100  43 |
cee |          |          |          |          |           |           |  3977   29 |  3978   29 |          |
dar |          |          |          |          |           |           |   277    2 |   316    2 |   36   2 |
gmd | 1145   3 |          | 2921   3 |  832   3 |  4898   3 |           |            |            |          |
hpb |          |          |          |          |           |           |  1524    7 |  2292    7 | 1832  23 |
hpp |          |          |          |          |           |           |  8323   32 | 10820   32 | 2591  29 |
huj |          |          |          |          |           |           |   104    1 |   104    1 |          |
ibm | 1571  22 | 4264  22 | 4354  22 | 1994  22 | 12183  22 |           |  1196    9 |  1196    9 |          |
imp |  257  50 |  645  50 |  656  50 |  851  50 |  2409  50 |           |  1119   22 |  1119   22 |          |
imt |          |          |          |          |           |           |   242    1 |   242    1 |          |
int |          |          |          |          |           |           |  2012    4 |  2012    4 |          |
kai |          | 1961  28 | 8663  46 | 1585  22 | 12209  57 |  8933  28 |  1013   28 |  1663   28 |          |
kar |          |          |          |          |           |           |  1809   33 |  1860   33 |          |
lav |          |          | 1324   9 |          |  1324   9 |           |   213    5 |   213    5 |          |
lex |          |          |          |          |           |           |  5660   13 |  7235   13 | 1937  13 |
lou |    7   1 |   11   1 |   15   1 |    2   1 |    35   1 |           |  1538    7 |  1599    7 |          |
mot |          |          | 2701   8 |          |  2701   8 |           |            |            |          |
nic |          |          |          |          |           |           |  6813   66 |  6813   66 |          |
not |          |          |          |          |           |           |  1452    8 |  1452    8 |          |
pap |          |          |          |          |           |           |  2203   39 |  2213   41 |          |
par |          |          |          |          |           |           |   496    8 |   512    8 |          |
pcl |          |          |          |          |           |           |   616   21 |   616   21 |          |
phi |          |          |          |          |           |           |  2506   12 |  2506   12 |   91   4 |
pri |   78  15 |  212  15 |  191  15 |  230  15 |   711  15 |           |   106    3 |   110    3 |   49  18 |
rim |          |          |          |          |           |           |   277   21 |   277   21 |          |
scr |          |          |          |          |           |           |            |            |  211  44 |
sie |          |          |  377 377 |          |   377 377 |           |  1593 1593 |  1593 1593 |          |
sta |          |          |          |          |           |           | 15808   61 | 16415   61 |  156  29 |
syn | 4554  17 |  637   8 |  589   8 |  415   8 |  6195  17 |           |            |            |          |
tos |  543 108 | 1432 108 | 1381 108 | 1660 108 |  4985 108 |           |            |            |          |
ugi |          |          |          |          |           |           |   597    3 |   597    3 |          |
uqb |  598   4 | 1514   4 |          | 1327   4 |  3439   4 |           |            |            |          |
val | 1462  20 | 3762  49 | 3653  44 | 1062  16 |  9939 129 |           |            |            |          |
--------------------------------------------------------------------------------------------------------------
    |          |          |          |          |           |           |            |            |          |
tot |15953 634 |28069 1423|61351 2145|17286 1222|122628 2735|67352 1949 | 75529 3298 | 85213 3393 |14544 4563|
--------------------------------------------------------------------------------------------------------------
         1a    |    1b    |    1c    |    1d    |     2     |    3      |     6      |     7      |     8
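
Each cell of the table above holds a pair (number of .SEGMENT entries, number
of files). As a minimal sketch, assuming the rows are available as text lines
in exactly this layout, the per-category totals can be recomputed from the
contributor rows. The function names parse_row and column_totals are
illustrative and are not part of uptools3.

```python
def parse_row(line):
    """Return (contributor, cells); a cell is (nsegm, nfiles) or None if blank."""
    fields = line.rstrip().rstrip("|").split("|")
    cells = []
    for field in fields[1:]:
        numbers = field.split()
        if not numbers:
            cells.append(None)  # contributor has no data in this category
        elif len(numbers) == 2:
            cells.append((int(numbers[0]), int(numbers[1])))
        else:
            raise ValueError(f"unexpected cell {field!r}")
    return fields[0].strip(), cells


def column_totals(rows):
    """Sum nsegm and nfiles per column over the given contributor rows."""
    totals = []
    for line in rows:
        _, cells = parse_row(line)
        if not totals:
            totals = [[0, 0] for _ in cells]
        for total, cell in zip(totals, cells):
            if cell is not None:
                total[0] += cell[0]
                total[1] += cell[1]
    return [tuple(total) for total in totals]


# Two rows copied from the table above.
SAMPLE_ROWS = [
    "cea |    7   3 |   57   6 | 1402   6 |   35   6 |  1501   6 |  1501   6 |   311    6 |   345    6 |   38   6 |",
    "ceb |   16   2 |   30   4 |  488   4 |    8   3 |   542   4 |   542   4 |   116    4 |   129    4 |   22   4 |",
]

if __name__ == "__main__":
    for nsegm, nfiles in column_totals(SAMPLE_ROWS):
        print(nsegm, nfiles)
```

Applied to all contributor rows, the 1a column sums to 15953 segments in 634
files, which agrees with the 'tot' row.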

UNIPEN Database Conditions of Use

The term user will refer to the person or institution who has obtained the UNIPEN data distribution.

Two major types of use can be identified:

  • I. Non-commercial use

  • II. Commercial use

I. Non-commercial use

Non-commercial use refers to university and institutional research which aims at public dissemination of research results. This type of usage of UNIPEN data is strongly encouraged by the International Unipen Foundation (iUF). However, there is a Publication Policy which must be taken into account (see below).

II. Commercial use

II.a Commercial use of UNIPEN data proper - the textual content and the point coordinates - is prohibited. An example would be the extraction of handwriting coordinates to sell 'script fonts'.

II.b The usage of UNIPEN data for the training of commercial handwriting recognition systems is allowed.

II.c The UNIPEN logo will be presented by the user in the final documentation of the resulting software product.

II.d Reference to individual writer identities or the identity of individual data donor companies from within the UNIPEN data distribution should be avoided at all times.

Note: Also in the case of commercial development, the user is kindly asked to present the results of the underlying research and development via an acknowledged science & technology forum (journal or conference).


Ad I. UNIPEN Publication Policy

I.1 - Reference

Users are required to mention the Unipen Release version in their publications, and are strongly urged to use the latest version available.
    Reference example: 

        "As a training set, we used UNIPEN [xx] Train-R01/V07, 
         benchmark ..., subsets ..... 
         As a test set, we used UNIPEN DevTest-R01/V02, 
         benchmark ..., subsets .... 
         To the raw UNIPEN data, the following pre-processing 
         was applied: ...."
             .
             .
             .

        [xx] Guyon, I., Schomaker, L., Plamondon, R., 
             Liberman, M. & Janet, S. (1994). 
             UNIPEN project of on-line data exchange and recognizer 
             benchmarks, Proceedings of the 12th International
             Conference on Pattern Recognition, ICPR'94, 
             pp. 29-33, Jerusalem, Israel, October 1994. IAPR-IEEE.

In this example we assume the release of the set DevTest-R01/V02, which will actually take place in the future.

In case your training set and test set are derived from within a single distribution such as Train-R01/V07, please explain in detail how your random selection of samples from within this distribution was produced. Was the process actually random? Was manual pruning involved? Improvements to the labels (truth values) can be submitted by the users in the form of .SEGMENT... entries via email to the iUF.

I.2 - Which data?

A proper distinction between training and test sets is necessary. The best possible training/test set distinction draws the two sets at random from two mutually exclusive sets of writers, so that no writer contributes data to both.
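
A writer-disjoint split of this kind can be sketched as follows. This is one possible procedure under stated assumptions, not a method prescribed by the iUF: the samples are assumed to be available as (writer_id, sample) pairs, however the writer identity was obtained, and the function name and parameters are illustrative.

```python
import random


def split_by_writer(samples, test_fraction=0.2, seed=0):
    """Split (writer_id, sample) pairs so that no writer is in both sets.

    Writers, not individual samples, are assigned at random to the test
    set; the seed makes the selection reproducible and reportable.
    """
    samples = list(samples)
    writers = sorted({writer for writer, _ in samples})
    rng = random.Random(seed)
    rng.shuffle(writers)

    n_test = max(1, round(len(writers) * test_fraction))
    test_writers = set(writers[:n_test])

    train = [pair for pair in samples if pair[0] not in test_writers]
    test = [pair for pair in samples if pair[0] in test_writers]
    return train, test
```

Reporting the seed and the test fraction alongside the results documents how the random selection was produced.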

Note that there is a problem in the use of test sets. Iterated use of a particular training / test set pair in a development process can be considered as indirect training! Even if a development set as such is not formally used for training, it is a well-known fact that all parameter adjustments, code improvements, etc., are a form of training, regardless of the type of pattern recognition algorithm which is used. Therefore, it is good practice to explain the effort spent in iterated testing in the publications. The tendency to iterate a single training/test set pair within a complete PhD project has led to inflated reported recognition rates in the past. It is good practice to generate a random selection of multiple sets at the start of such projects.
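
As a hedged illustration of the last recommendation, several writer-level partitions can be drawn and recorded before any development starts, so that later experiments do not keep reusing one training/test pair. The function name, the default number of partitions and the base seed are assumptions made for this sketch.

```python
import random


def draw_partitions(writer_ids, n_partitions=5, test_fraction=0.2, base_seed=1999):
    """Draw several independent train/test partitions of the writers.

    Each partition records its own seed, so the selection can be
    reproduced and described in a publication.
    """
    writers = sorted(set(writer_ids))
    n_test = max(1, round(len(writers) * test_fraction))

    partitions = []
    for k in range(n_partitions):
        seed = base_seed + k
        test = set(random.Random(seed).sample(writers, n_test))
        partitions.append({
            "seed": seed,
            "test": sorted(test),
            "train": sorted(set(writers) - test),
        })
    return partitions
```

Storing the returned list at the start of a project fixes the partitions in advance; the number of evaluations run on each partition can then be reported with the results.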

I.3 - Benchmark (i.e., database subset) overview

Benchmark  Description

1a         isolated digits
1b         isolated upper case
1c         isolated lower case
1d         isolated symbols (punctuations etc.)
2          isolated characters, mixed case
3          isolated characters in the context of words or texts
4          isolated printed words, not mixed with digits and symbols
5          isolated printed words, full character set
6          isolated cursive or mixed-style words (without digits and symbols)
7          isolated words, any style, full character set
8          text: (minimally two words of) free text, full character set

Note that only Benchmark #8 is a realistic, application-oriented test, because the word segmentation problem must also have been solved by the recognizer. No manual word segmentation is allowed in test Benchmark #8.


Lambert Schomaker, January 1997, October 2000.

Files (155.8 MB)

Name: unipen-CDROM-train_r01_v07.tgz
Size: 155.8 MB
md5:  1f9037c57b92592a79caa2c34ab82fdc
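
The md5 value listed for the archive can be used to check a download. The following is a minimal sketch using Python's standard hashlib module; the function names are illustrative, and the local path passed to verify_archive is whatever location the archive was saved to.

```python
import hashlib

# Checksum as given in the file listing for unipen-CDROM-train_r01_v07.tgz.
EXPECTED_MD5 = "1f9037c57b92592a79caa2c34ab82fdc"


def md5_of(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected=EXPECTED_MD5):
    """Return True if the file at 'path' matches the expected checksum."""
    return md5_of(path) == expected.lower()
```

A call such as verify_archive("unipen-CDROM-train_r01_v07.tgz") returns False for a truncated or corrupted download. MD5 serves here only as an integrity check against the published value.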
  • Guyon, I., Schomaker, L., Plamondon, R., Liberman, M. & Janet, S. (1994). UNIPEN project of on-line data exchange and recognizer benchmarks, Proceedings of the 12th International Conference on Pattern Recognition, ICPR'94, pp. 29-33, Jerusalem, Israel, October 1994. IAPR-IEEE.
