The Canada Trademarks Dataset

Jeremy Sheff

doi:10.5281/zenodo.7567076

Published January 27, 2023 | Version 2.0

Jeremy Sheff, The Canada Trademarks Dataset, 18 Journal of Empirical Legal Studies 908 (2021), prepublication draft available at https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3782655, published version available at https://onlinelibrary.wiley.com/share/author/CHG3HC6GTFMMRU8UJFRR?target=10.1111/jels.12303

Dataset Selection and Arrangement (c) Jeremy Sheff

Original Python and Stata Scripts (c) Jeremy Sheff

Contains data licensed by Her Majesty the Queen in right of Canada, as represented by the Minister of Industry, the minister responsible for the administration of the Canadian Intellectual Property Office.

### VERSION 2.0: JANUARY 2023 UPDATES ###

The January 2023 update brings the Canada Trademarks Dataset up to date with weekly application data published by CIPO through January 24, 2023, and includes a total of 1,916,950 application records. The Python scripts for constructing the dataset have been rewritten to allow for regular updates with new weekly application data. The new versions of these scripts require access to a mySQL server to store and update the data. Those who wish to use the current dataset without keeping it updated on their own can simply download the .csv and/or .dta files included in this distribution.

Details of Repository Contents:

This repository includes a number of .zip archives which expand into folders containing either scripts for construction of the dataset or data files comprising the dataset itself. These folders are as follows:

/csv: contains the .csv versions of the version 2.0 data files, current through January 24, 2023
/dta: contains the .dta versions of the version 2.0 data files, current through January 24, 2023
/py: contains the Python scripts used to construct and update the version 2.0 dataset

The repository also contains 3 additional files:

mysql.zip: a compressed archive containing a mySQL database dump for the Version 2.0 dataset, current through January 24, 2023.
CA_TM_csv_cleanup_2023.do: a Stata do-file that converts the .csv files generated by the Python installation scripts into .dta files. (Before running the do-file, users should search for the partial path "/mypath" and replace it with the appropriate local directory.)
downloadedupdates.txt: a text file listing all CIPO weekly update files included in the Version 2.0 dataset.
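
The search-and-replace on "/mypath" can also be scripted. The following is a minimal sketch rather than part of the distribution: the function name is hypothetical, and it assumes the do-file is UTF-8 text.

```python
from pathlib import Path


def point_do_file_at(do_file, local_dir, placeholder="/mypath"):
    """Replace every occurrence of the placeholder path in a do-file.

    Returns the number of occurrences replaced. Assumes the file is
    UTF-8 text (an assumption, not something the release notes state).
    """
    do_file = Path(do_file)
    text = do_file.read_text(encoding="utf-8")
    count = text.count(placeholder)
    do_file.write_text(text.replace(placeholder, local_dir), encoding="utf-8")
    return count
```

For example, point_do_file_at("CA_TM_csv_cleanup_2023.do", "/data/CA_TM") would rewrite the file in place; keeping a copy of the original first is prudent.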

If users wish to construct rather than download the Version 2.0 datafiles, they should run the script /py/CA_TM.py. This script will prompt the user to enter their IP Horizons SFTP credentials; these can be obtained by registering with CIPO at https://ised-isde.survey-sondage.ca/f/s.aspx?s=59f3b3a4-2fb5-49a4-b064-645a5e3a752d&lang=EN&ds=SFTP. Users may need to log in to this server with an SFTP client before running the script so that the server's SSH host key is accepted on their machine. The script will also prompt the user to enter their mySQL database credentials and identify a local directory for the data downloads and output files. Because the data archives are quite large, users are advised to create a target directory in advance and ensure they have at least 200GB of available storage on the drive where the directory is located.
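
Available storage can be checked before starting a long download. This is a minimal sketch using only the Python standard library; the function name is hypothetical, and the threshold treats 200GB as 200 × 10^9 bytes, which is an assumption about the intended unit.

```python
import shutil

# 200 GB expressed in bytes (decimal gigabytes assumed).
REQUIRED_BYTES = 200 * 10**9


def has_room(target_dir, required=REQUIRED_BYTES):
    """Return True if the drive holding target_dir has enough free space."""
    return shutil.disk_usage(target_dir).free >= required
```

Calling has_room("/path/to/target") before running CA_TM.py gives an early warning instead of a failure partway through the installation.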

The CA_TM.py script can also be used to check for new weekly updates at CIPO, download them, add them to the mySQL database, and generate new .csv files. Users will be prompted to select either a clean install of the complete dataset (including the historical snapshot) or an update of their existing installation. Once the mySQL database is created and the historical snapshot is processed, users may run this script as often as they like to keep their installation of the dataset current with CIPO's weekly releases.
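
Conceptually, an update compares the weekly files CIPO currently offers against those already processed. The sketch below illustrates that comparison only. It assumes, without confirmation from these release notes, that downloadedupdates.txt holds one file name per line; the function name is hypothetical, and the logic in CA_TM.py may differ.

```python
from pathlib import Path


def pending_updates(available, log_path):
    """Return the names in `available` that are not yet listed in the log.

    `available` is any iterable of file names; `log_path` points to a text
    file assumed to contain one already-processed file name per line.
    """
    log = Path(log_path)
    done = set()
    if log.exists():
        done = {line.strip() for line in log.read_text().splitlines() if line.strip()}
    return sorted(name for name in available if name not in done)
```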

Users who wish to keep their installation of the dataset up to date but avoid the lengthy initial installation process may instead load the mySQL database dump included in this release and run CA_TM.py periodically to keep it current. To do so, take the following steps:

1. Download and expand the mysql.zip archive in the Version 2.0 repository, and use the mysql command from your command line to load the extracted file (CA_TM_mysql_2023-01-24.sql) into your mySQL instance (replace bracketed variables with your actual credentials and path):
```
mysql --host=[yourhost] --user=[username] -p[password] --port=3306 CA_TM < [path_to_file]/CA_TM_mysql_2023-01-24.sql
```
2. Run the CA_TM.py script, provide the requested credentials and path, and then select option 3 ("Cancel installation and exit") when prompted.
3. Download the file "downloadedupdates.txt" included in the Version 2.0 repository and copy it to the /XML_updates subfolder that was created by the CA_TM script in the filepath you provided.
4. Run the CA_TM.py script again any time you wish to update your installation of the database, and select option 2 ("Update an existing dataset with the latest weekly files") when prompted.
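
The mysql restore command above can also be launched from Python. This is a sketch under stated assumptions: the function names are hypothetical, the mysql client must be on the PATH, and a database named CA_TM is assumed to exist already on the server (the client reports an error for an unknown database name). Passing -p without a value makes the client prompt for the password, so it does not appear in the process list or shell history.

```python
import subprocess


def build_restore_command(host, user, port=3306, database="CA_TM"):
    """Build the argument list for the mysql client (password is prompted)."""
    return ["mysql", f"--host={host}", f"--user={user}", f"--port={port}", "-p", database]


def restore_dump(dump_path, host, user, port=3306):
    """Feed the SQL dump to the mysql client on standard input."""
    with open(dump_path, "rb") as dump:
        subprocess.run(build_restore_command(host, user, port), stdin=dump, check=True)
```

A call such as restore_dump("CA_TM_mysql_2023-01-24.sql", "localhost", "username") corresponds to the command shown above, with the host and user as placeholders.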

Users who update their dataset installation frequently may wish to edit the config.py script to hard-code their SFTP and mySQL credentials and their local file path, and to remove or comment out the commands that install nonstandard Python libraries after the first installation. The update script begins by creating a backup of the existing mySQL database in the "/mysql_backups" folder; users who do not require backups may wish to remove these files to save space once they have confirmed that their update was successful. Users should take care, however, not to delete or modify the files that the update script generates to keep track of which weekly CIPO update files have already been incorporated into their installation of the dataset. These are:

XML_updates/downloadedupdates.txt
XML_updates/updatestobeconcatenated.txt
XML_updates/updatestobeparsed.txt
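
Before cleaning up an installation directory, it may help to confirm that these three tracking files are still in place. A minimal sketch, assuming the XML_updates folder sits directly under the local directory given to CA_TM.py; the function name is hypothetical.

```python
from pathlib import Path

# File names taken from the list above.
TRACKING_FILES = (
    "downloadedupdates.txt",
    "updatestobeconcatenated.txt",
    "updatestobeparsed.txt",
)


def missing_tracking_files(base_dir):
    """Return the tracking files that are absent from base_dir/XML_updates."""
    updates_dir = Path(base_dir) / "XML_updates"
    return [name for name in TRACKING_FILES if not (updates_dir / name).is_file()]
```

An empty list means all three files are present.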

Additional terms of use are set forth in the release notes for Version 1.0.

### VERSION 1.0 RELEASE NOTES ###

This individual-application-level dataset includes records of all applications for registered trademarks in Canada since approximately 1980, and of many preserved applications and registrations dating back to the beginning of Canada’s trademark registry in 1865, totaling over 1.6 million application records. It includes comprehensive bibliographic and lifecycle data; trademark characteristics; goods and services claims; identification of applicants, attorneys, and other interested parties (including address data); detailed prosecution history event data; and data on application, registration, and use claims in countries other than Canada. The dataset has been constructed from public records made available by the Canadian Intellectual Property Office. Both the dataset and the code used to build and analyze it are presented for public use on open-access terms.

Scripts are licensed for reuse subject to the Creative Commons Attribution License 4.0 (CC-BY-4.0), https://creativecommons.org/licenses/by/4.0/. Data files are licensed for reuse subject to the Creative Commons Attribution License 4.0 (CC-BY-4.0), https://creativecommons.org/licenses/by/4.0/, and also subject to additional conditions imposed by the Canadian Intellectual Property Office (CIPO) as described below.

Terms of Use:

As per the terms of use of CIPO's government data, all users are required to include the above-quoted attribution to CIPO in any reproductions of this dataset. They are further required to cease using any record within the datasets that has been modified by CIPO and for which CIPO has issued a notice on its website in accordance with its Terms and Conditions, and to use the datasets in compliance with applicable laws. These requirements are in addition to the terms of the CC-BY-4.0 license, which require attribution to the author (among other terms). For further information on CIPO’s terms and conditions, see https://www.ic.gc.ca/eic/site/cipointernet-internetopic.nsf/eng/wr01935.html. For further information on the CC-BY-4.0 license, see https://creativecommons.org/licenses/by/4.0/.

The following attribution statement, if included by users of this dataset, is satisfactory to the author, but the author makes no representations as to whether it may be satisfactory to CIPO:

The Canada Trademarks Dataset is (c) 2021 by Jeremy Sheff and licensed under a CC-BY-4.0 license, subject to additional terms imposed by the Canadian Intellectual Property Office. It contains data licensed by Her Majesty the Queen in right of Canada, as represented by the Minister of Industry, the minister responsible for the administration of the Canadian Intellectual Property Office. For further information, see https://creativecommons.org/licenses/by/4.0/ and https://www.ic.gc.ca/eic/site/cipointernet-internetopic.nsf/eng/wr01935.html.

Details of Repository Contents:

This repository includes a number of .zip archives which expand into folders containing either scripts for construction and analysis of the dataset or data files comprising the dataset itself. These folders are as follows:

/csv: contains the .csv versions of the data files
/do: contains Stata do-files used to convert the .csv files to .dta format and perform the statistical analyses set forth in the paper reporting this dataset
/dta: contains the .dta versions of the data files
/py: contains the Python scripts used to download CIPO’s historical trademarks data via SFTP and generate the .csv data files

If users wish to construct rather than download the datafiles, the first script that they should run is /py/sftp_secure.py. This script will prompt the user to enter their IP Horizons SFTP credentials; these can be obtained by registering with CIPO at https://ised-isde.survey-sondage.ca/f/s.aspx?s=59f3b3a4-2fb5-49a4-b064-645a5e3a752d&lang=EN&ds=SFTP. The script will also prompt the user to identify a target directory for the data downloads. Because the data archives are quite large, users are advised to create a target directory in advance and ensure they have at least 70GB of available storage on the drive where the directory is located.

The sftp_secure.py script will generate a new subfolder in the user’s target directory called /XML_raw. Users should note the full path of this directory, which they will be prompted to provide when running the remaining Python scripts. Each of the remaining scripts, the filenames of which begin with “iterparse”, corresponds to one of the data files in the dataset, as indicated in the script’s filename. After running one of these scripts, the user’s target directory should include a /csv subdirectory containing the data file corresponding to the script; after running all the iterparse scripts the user’s /csv directory should be identical to the /csv directory in this repository. Users are invited to modify these scripts as they see fit, subject to the terms of the licenses set forth above.
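
For users planning to modify these scripts, the general technique their names suggest is incremental XML parsing, which reads large files without loading them fully into memory. The sketch below illustrates that technique with the standard library only; it is not the author's code. The function name, the record tag, and the field names are hypothetical, and CIPO's actual XML uses its own (namespaced) element names.

```python
import csv
import xml.etree.ElementTree as ET


def xml_to_csv(xml_source, csv_out, record_tag, fields):
    """Stream records from an XML file into CSV rows.

    xml_source: a file name or binary file object.
    csv_out: a text file object opened with newline="".
    record_tag: tag of the element that represents one record (hypothetical).
    fields: child tags to extract from each record (hypothetical).
    Returns the number of records written.
    """
    writer = csv.writer(csv_out)
    writer.writerow(fields)
    rows = 0
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == record_tag:
            writer.writerow([elem.findtext(name, default="") for name in fields])
            rows += 1
            elem.clear()  # release the record's children to keep memory use low
    return rows
```

Clearing each record after it is written is what keeps memory use manageable on multi-gigabyte inputs.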

With respect to the Stata do-files, only one of them is relevant to construction of the dataset itself. This is /do/CA_TM_csv_cleanup.do, which converts the .csv versions of the data files to .dta format, and uses Stata’s labeling functionality to reduce the size of the resulting files while preserving information. The other do-files generate the analyses and graphics presented in the paper describing the dataset (Jeremy N. Sheff, The Canada Trademarks Dataset, 18 J. Empirical Leg. Studies (forthcoming 2021), available at https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3782655). These do-files are also licensed for reuse subject to the terms of the CC-BY-4.0 license, and users are invited to adapt the scripts to their needs.

The Python and Stata scripts included in this repository are separately maintained and updated on GitHub at https://github.com/jnsheff/CanadaTM.

This repository also includes a copy of the current version of CIPO's data dictionary for its historical XML trademarks archive as of the date of construction of this dataset.

Files (2.5 GB)

Name	MD5	Size
CA_TM_csv_cleanup_2023.do	17fd1a9672dcb3497d14897472bf40cc	8.8 kB
csv.zip	e5c018f0d42a1c9cde2de0cdf96d02cb	715.9 MB
DataDictionary_TM_XML_ST96_v2_2-e.pdf	6f1a3412c7f403ec717f1d2129216712	2.2 MB
downloadedupdates.txt	c572a1dd8fcb9ae36293b10d1511f16e	6.0 kB
dta.zip	f939ab63e31c9e2df4bcb57d7decedd7	854.3 MB
mysql.zip	1a67a0ad8e63f7a09322d970367fbea7	917.3 MB
py.zip	23e67ed7250dee83dff44e69a28f21aa	26.7 kB

Additional details

Has part: Software: https://github.com/jnsheff/CanadaTM (URL)
Is documented by: Journal article: https://github.com/jnsheff/CanadaTM (URL)
