Unipen data set of on-line (vectorial) handwriting - train_r01_v07

Consortium

doi:10.5281/zenodo.1195803

Published December 1, 1999 | Version December 1999

Dataset Open

Consortium of several companies

Contributors

Other: Several institutions and companies:

abm,aga,anj,apa,apb,apc,apd,ape,app,art,att,atu,bba,bbb,bbc,bbd,cat,cea,ceb,cec,ced,cee,dar,gmd,hpb,hpp,huj,ibm,imp,imt,int,kai,kar,lav,lex,lou,mot,nic,not,pap,par,pcl,phi,pri,rim,scr,sie,sta,syn,tos,tot,ugi,uqb,val

/*****************************************************************************\
*                                                                             *
*                                                                             *
*   This is the first UNIPEN distribution of the iUF                          *
*                                                                             *
*   This distribution comprises NIST train_r01_v07                            *
*                                                                             *
* http://www.unipen.org/                                                     *
*                                                                             *
* Source code: C/Linux at                                                    *
* http://www.sourcefiles.org/Scientific/Other_Sciences/uptools3.tar.gz       *
*                                                                             *
*                                                                             *
*           The International Unipen Foundation, December 1999                *
*                                                                             *
*                                                                             *
*******************************************************************************
*                                                                             *
*                                                                             *
* DISCLAIMER AND COPYRIGHT NOTICE FOR ALL DATA CONTAINED ON THIS CDROM:      *
*                                                                             *
*                                                                             *
* 1) PERMISSION IS HEREBY GRANTED TO USE THE DATA FOR RESEARCH               *
*     PURPOSES. IT IS NOT ALLOWED TO DISTRIBUTE THIS DATA FOR COMMERCIAL      *
*     PURPOSES.                                                               *
*                                                                             *
*     Copyright 1999, International Unipen Foundation - All rights reserved   *
*                                                                             *
* 2) PROVIDER GIVES NO EXPRESS OR IMPLIED WARRANTY OF ANY KIND AND ANY       *
*     IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR PURPOSE ARE       *
*     DISCLAIMED.                                                             *
*                                                                             *
* 3) PROVIDER SHALL NOT BE LIABLE FOR ANY DIRECT, INDIRECT, SPECIAL,         *
*     INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF ANY USE OF THIS      *
*     DATA.                                                                   *
*                                                                             *
* 4) THE CONDITIONS OF USE REQUIRE PROPER REFERENCE TO THIS DATABASE         *
*     AS DESCRIBED IN ACCOMPANYING DOCUMENT 'unipen-conditions-of-use.html'   *
*                                                                             *
\*****************************************************************************/

Contents of the CDROM:
----------------------

1) This file, called CDROM-README
2) The NIST distribution, part of whose directory tree is listed here:

train_r01_v07
   include
       abm apb app atu bbd ced gmd ibm kai lou pap pri sta uqb
       aga apc art bba cea cee hpb imp kar mot par rim syn val
       anj apd ata bbb ceb cef hpp imt lav nic pcl scr tos
       apa ape att bbc cec dar huj int lex not phi sie ugi

   data
       1a
           aga apb art ceb gmd imp pri tos val
           apa app cea ced ibm lou syn uqb
       1b 1c 1d 2   3   4   5   6   7   8

All files on the CDROM were tested for UNIPEN integrity using uplib.
A description of the contents is given below.

Description of the contents:
----------------------------

For a description and examples of the UNIPEN format, see http://www.unipen.org/

The UNIPEN files contained in this release are organized in 11 categories,
listed below, with the number of .SEGMENT entries and the number of files
for each category:

cat    nsegm  nfiles  description
1a     15953     634  isolated digits
1b     28069    1423  isolated upper case
1c     61351    2145  isolated lower case
1d     17286    1222  isolated symbols (punctuations etc.)
2     122628    2735  isolated characters, mixed case
3      67352    1949  isolated characters in the context of words or texts
4          0       0  isolated printed words, not mixed with digits and symbols
5          0       0  isolated printed words, full character set
6      75529    3298  isolated cursive or mixed-style words (without digits and symbols)
7      85213    3393  isolated words, any style, full character set
8      14544    4563  text: (minimally two words of) free text, full character set

Each directory representing a category, e.g., data/1a, contains a number
of subdirectories. The name of each subdirectory is a three-letter code
identifying the contributor of the data.

Consider for example the UNIPEN files contributed by 'aga' of category
1a (isolated digits). The files containing .SEGMENT entries are contained
in the 'data' directory:
data/1a/aga

Most files in this distribution contain one or more .INCLUDE statements.
The corresponding files are found in the 'include' directory, in this case:
include/aga
Some files (such as the 'imp' contributions) use nested .INCLUDE statements.
The uptools3 distribution contains code that locates the files to be
included based on an environment variable.
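
The lookup described above can be sketched in Python as follows. This is not
the uptools3 code, only a minimal illustration under stated assumptions: the
statement is taken to have the form `.INCLUDE <path>`, paths are taken to be
relative to the 'include' directory (e.g. aga/somefile), files are read as
Latin-1, and the environment variable name UNIPEN_INCLUDE_DIR is hypothetical
(this README does not name the variable that uptools3 uses).

```python
import os
from pathlib import Path

# Hypothetical variable name; the variable used by uptools3 is not given here.
INCLUDE_ENV = "UNIPEN_INCLUDE_DIR"


def expand_includes(path, include_root=None, _seen=None):
    """Yield the lines of a UNIPEN file with .INCLUDE statements expanded.

    Nested .INCLUDE statements are followed recursively; a file that
    includes itself (directly or indirectly) raises ValueError.
    """
    if include_root is None:
        # Fall back to a local 'include' directory if the variable is unset.
        include_root = os.environ.get(INCLUDE_ENV, "include")
    include_root = Path(include_root)
    seen = _seen if _seen is not None else set()

    path = Path(path)
    key = path.resolve()
    if key in seen:
        raise ValueError(f"circular .INCLUDE: {path}")
    seen.add(key)

    with open(path, encoding="latin-1") as handle:
        for line in handle:
            parts = line.split()
            if len(parts) >= 2 and parts[0] == ".INCLUDE":
                # Assumed syntax: .INCLUDE <path relative to include_root>
                yield from expand_includes(include_root / parts[1], include_root, seen)
            else:
                yield line.rstrip("\n")

    # Allow the same file to be included again from a different branch.
    seen.discard(key)
```

With the variable pointing at the 'include' directory of this distribution,
`list(expand_includes("data/1a/aga/<file>"))` would return the flattened
contents of one data file, where `<file>` stands for any file in that
directory.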

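Distribution of categories per contributor:
-------------------------------------------

Each cell gives the number of segments followed by the number of files
(nsegm nfiles); the 'tot' row repeats the per-category totals.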

         1a    |    1b    |    1c    |    1d    |     2     |    3      |     6      |     7      |     8
--------------------------------------------------------------------------------------------------------------
abm |          |          |          |          |           |           |   628    4 |   646    4 |    7   3 |
aga | 405 14 | 1115 14 | 1063 14 | 221 14 | 2804 14 |           |            |            | 605 14 |
anj |          |          |          |          |           |           | 1435    6 | 1435    6 |          |
apa | 692 74 | 2236 247 | 7414 391 | 1953 268 | 12295 527 | 12295 527 |            |            | 527 527 |
apb | 2033 138 | 3450 466 | 8869 434 | 946 233 | 15298 590 | 15298 590 |            |            | 590 590 |
apc |          |          |          |          |           |           | 1724 441 | 1798 444 | 444 444 |
apd |          |          |          |          |           |           | 1958 453 | 2448 507 | 507 507 |
ape |          |          |          |          |           |           | 1384 286 | 1848 322 | 322 322 |
app | 1046 115 | 3010 353 |10370 556 | 2886 400 | 17312 745 | 17312 745 |            |            | 745 745 |
art | 170   6 | 1042   6 | 2301   6 | 202   6 | 3715   6 | 3715   6 |   687    6 |   933    6 | 186   6 |
att |          |          |          |          |           |           |   932   29 | 2253   29 | 819 30 |
atu |          |          |          |          |           |           |            |            |   92 92 |
bba |          |          |          |          |           |           |            |            |   63 63 |
bbb |          |          |          |          |           |           |            |            |   51 51 |
bbc |          |          |          |          |           |           |            |            |   61 61 |
bbd |          |          |          |          |           |           |            |            | 858 858 |
cea |    7   3 |   57   6 | 1402   6 |   35   6 | 1501   6 | 1501   6 |   311    6 |   345    6 |   38   6 |
ceb |   16   2 |   30   4 | 488   4 |    8   3 |   542   4 |   542   4 |   116    4 |   129    4 |   22   4 |
cec |          |          |          |          |           |           | 4880   35 | 5625   35 | 604 35 |
ced | 1369 42 | 2691 42 | 2619 43 | 1077 43 | 7756 43 | 7756 43 |            |            | 1100 43 |
cee |          |          |          |          |           |           | 3977   29 | 3978   29 |          |
dar |          |          |          |          |           |           |   277    2 |   316    2 |   36   2 |
gmd | 1145   3 |          | 2921   3 | 832   3 | 4898   3 |           |            |            |          |
hpb |          |          |          |          |           |           | 1524    7 | 2292    7 | 1832 23 |
hpp |          |          |          |          |           |           | 8323   32 | 10820   32 | 2591 29 |
huj |          |          |          |          |           |           |   104    1 |   104    1 |          |
ibm | 1571 22 | 4264 22 | 4354 22 | 1994 22 | 12183 22 |           | 1196    9 | 1196    9 |          |
imp | 257 50 | 645 50 | 656 50 | 851 50 | 2409 50 |           | 1119   22 | 1119   22 |          |
imt |          |          |          |          |           |           |   242    1 |   242    1 |          |
int |          |          |          |          |           |           | 2012    4 | 2012    4 |          |
kai |          | 1961 28 | 8663 46 | 1585 22 | 12209 57 | 8933 28 | 1013   28 | 1663   28 |          |
kar |          |          |          |          |           |           | 1809   33 | 1860   33 |          |
lav |          |          | 1324   9 |          | 1324   9 |           |   213    5 |   213    5 |          |
lex |          |          |          |          |           |           | 5660   13 | 7235   13 | 1937 13 |
lou |    7   1 |   11   1 |   15   1 |    2   1 |    35   1 |           | 1538    7 | 1599    7 |          |
mot |          |          | 2701   8 |          | 2701   8 |           |            |            |          |
nic |          |          |          |          |           |           | 6813   66 | 6813   66 |          |
not |          |          |          |          |           |           | 1452    8 | 1452    8 |          |
pap |          |          |          |          |           |           | 2203   39 | 2213   41 |          |
par |          |          |          |          |           |           |   496    8 |   512    8 |          |
pcl |          |          |          |          |           |           |   616   21 |   616   21 |          |
phi |          |          |          |          |           |           | 2506   12 | 2506   12 |   91   4 |
pri |   78 15 | 212 15 | 191 15 | 230 15 |   711 15 |           |   106    3 |   110    3 |   49 18 |
rim |          |          |          |          |           |           |   277   21 |   277   21 |          |
scr |          |          |          |          |           |           |            |            | 211 44 |
sie |          |          | 377 377 |          |   377 377 |           | 1593 1593 | 1593 1593 |          |
sta |          |          |          |          |           |           | 15808   61 | 16415   61 | 156 29 |
syn | 4554 17 | 637   8 | 589   8 | 415   8 | 6195 17 |           |            |            |          |
tos | 543 108 | 1432 108 | 1381 108 | 1660 108 | 4985 108 |           |            |            |          |
ugi |          |          |          |          |           |           |   597    3 |   597    3 |          |
uqb | 598   4 | 1514   4 |          | 1327   4 | 3439   4 |           |            |            |          |
val | 1462 20 | 3762 49 | 3653 44 | 1062 16 | 9939 129 |           |            |            |          |
--------------------------------------------------------------------------------------------------------------
    |          |          |          |          |           |           |            |            |          |
tot |15953 634 |28069 1423|61351 2145|17286 1222|122628 2735|67352 1949 | 75529 3298 | 85213 3393 |14544 4563|
--------------------------------------------------------------------------------------------------------------
         1a    |    1b    |    1c    |    1d    |     2     |    3      |     6      |     7      |     8
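
As a consistency check on the table above, the per-contributor counts of a
column can be summed and compared with the 'tot' row. The sketch below does
this for category 1a only; the numbers are copied by hand from the table, and
the names CATEGORY_1A and column_totals are illustrative.

```python
# (segments, files) per contributor for category 1a, copied from the table.
CATEGORY_1A = {
    "aga": (405, 14), "apa": (692, 74), "apb": (2033, 138), "app": (1046, 115),
    "art": (170, 6), "cea": (7, 3), "ceb": (16, 2), "ced": (1369, 42),
    "gmd": (1145, 3), "ibm": (1571, 22), "imp": (257, 50), "lou": (7, 1),
    "pri": (78, 15), "syn": (4554, 17), "tos": (543, 108), "uqb": (598, 4),
    "val": (1462, 20),
}


def column_totals(cells):
    """Sum segment and file counts over all contributors of one category."""
    segments = sum(nsegm for nsegm, _ in cells.values())
    files = sum(nfiles for _, nfiles in cells.values())
    return segments, files


print(column_totals(CATEGORY_1A))  # (15953, 634), matching the 'tot' row
```

The 17 contributors in this dictionary are the same 17 subdirectories listed
under data/1a in the directory tree.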

Notes

UNIPEN Database Conditions of Use

The term 'user' refers to the person or institution that has obtained the UNIPEN data distribution.

Two major types of use can be identified:

I. Non-commercial use

II. Commercial use

I. Non-commercial use

Non-commercial use refers to university and institutional research that aims at public dissemination of research results. This type of use of UNIPEN data is strongly encouraged by the International Unipen Foundation (iUF). However, a Publication Policy must be taken into account (see below).

II. Commercial use

II.a Commercial use of UNIPEN data proper - the textual content and the point coordinates - is prohibited. An example would be the extraction of handwriting coordinates to sell 'script fonts'.

II.b The usage of UNIPEN data for the training of commercial handwriting recognition systems is allowed.

II.c The UNIPEN logo will be presented by the user in the final documentation of the resulting software product.

II.d Reference to individual writer identities or to the identity of individual data donor companies from within the UNIPEN data distribution should be avoided at all times.

Note: Also in the case of commercial development, the user is kindly asked to present the results of the underlying research and development via an acknowledged science & technology forum (journal or conference).

Ad I. UNIPEN Publication Policy

I.1 - Reference

Users are required to mention the Unipen Release version in their publications, and are strongly urged to use the latest version available.

    Reference example: 

        "As a training set, we used UNIPEN [xx] Train-R01/V07, 
         benchmark ..., subsets ..... 
         As a test set, we used UNIPEN DevTest-R01/V02, 
         benchmark ..., subsets .... 
         To the raw UNIPEN data, the following pre-processing 
         was applied: ...."
             .
             .
             .

        [xx] Guyon, I., Schomaker, L., Plamondon, R., 
             Liberman, M. & Janet, S. (1994). 
             UNIPEN project of on-line data exchange and recognizer 
             benchmarks, Proceedings of the 12th International
             Conference on Pattern Recognition, ICPR'94, 
             pp. 29-33, Jerusalem, Israel, October 1994. IAPR-IEEE.

This example assumes the release of the set DevTest-R01/V02, which is yet to take place.

In case your training set and test set are derived from within a single distribution such as Train-R01/V07, please explain in detail how your random selection of samples from within this distribution was produced. Was the process actually random? Was manual pruning involved? Improvements to the labels (truth values) can be submitted by the users in the form of .SEGMENT... entries via email to the iUF.

I.2 - Which data?

A proper distinction between training and test sets is necessary. The best possible training/test set distinction draws the data for the two sets at random from two mutually exclusive sets of writers.

Note that there is a problem in the use of test sets. Iterated use of a particular training / test set pair in a development process can be considered as indirect training! Even if a development set as such is not formally used for training, it is a well-known fact that all parameter adjustments, code improvements, etc., are a form of training, regardless of the type of pattern recognition algorithm which is used. Therefore, it is good practice to explain the effort spent in iterated testing in the publications. The tendency to iterate a single training/test set pair within a complete PhD project has led to inflated reported recognition rates in the past. It is good practice to generate a random selection of multiple sets at the start of such projects.
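
The two recommendations above (writer-exclusive training and test sets, and several random selections fixed at the start of a project) can be sketched as follows. This is only an illustration, not a procedure prescribed by the iUF: it assumes that each sample is available as a (writer_id, sample) pair, and the function names, the 20% test fraction and the seed values are arbitrary choices.

```python
import random


def writer_disjoint_split(samples, test_fraction=0.2, seed=0):
    """Split (writer_id, sample) pairs so that no writer is in both sets."""
    writers = sorted({writer for writer, _ in samples})
    rng = random.Random(seed)  # seeded, so the split can be reproduced
    rng.shuffle(writers)
    n_test = max(1, round(len(writers) * test_fraction))
    test_writers = set(writers[:n_test])
    train = [s for s in samples if s[0] not in test_writers]
    test = [s for s in samples if s[0] in test_writers]
    return train, test


def make_splits(samples, n_splits=5, test_fraction=0.2, base_seed=1999):
    """Draw several independent writer-disjoint splits up front."""
    return [
        writer_disjoint_split(samples, test_fraction, base_seed + i)
        for i in range(n_splits)
    ]
```

Recording the seeds together with the reported results makes it possible to state in a publication exactly how the random selection was produced.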

I.3 - Benchmark (i.e. database subset) overview

Benchmark	Description
1a	isolated digits
1b	isolated upper case
1c	isolated lower case
1d	isolated symbols (punctuations etc.)
2	isolated characters, mixed case
3	isolated characters in the context of words or texts
4	isolated printed words, not mixed with digits and symbols
5	isolated printed words, full character set
6	isolated cursive or mixed-style words (without digits and symbols)
7	isolated words, any style, full character set
8	text: (minimally two words of) free text, full character set

Note that only Benchmark #8 is a realistic, application-oriented test, because the recognizer must also solve the word segmentation problem. No manual word segmentation is allowed in test Benchmark #8.

Lambert Schomaker, January 1997, October 2000.

Files

unipen-CDROM-train_r01_v07.tgz (155.8 MB)
md5: 1f9037c57b92592a79caa2c34ab82fdc
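
A downloaded copy of the archive can be checked against the MD5 checksum listed above. The following is a minimal sketch using Python's standard hashlib module; the function names are illustrative, and the archive is assumed to be stored locally under its original file name.

```python
import hashlib

# Checksum as listed for unipen-CDROM-train_r01_v07.tgz.
EXPECTED_MD5 = "1f9037c57b92592a79caa2c34ab82fdc"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected=EXPECTED_MD5):
    """Return True if the file at 'path' has the expected MD5 digest."""
    return md5_of_file(path) == expected.lower()
```

For example, `verify_archive("unipen-CDROM-train_r01_v07.tgz")` should return True for an intact download.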

Additional details

Guyon, I., Schomaker, L., Plamondon, R., Liberman, M. & Janet, S. (1994). UNIPEN project of on-line data exchange and recognizer benchmarks, Proceedings of the 12th International Conference on Pattern Recognition, ICPR'94, pp. 29-33, Jerusalem, Israel, October 1994. IAPR-IEEE.
