There is a newer version of the record available.

Published November 26, 2023 | Version v1.1.0
Software Open

ID_Extractor (ID_Ex) for extracting IDs and references from .jats article files

Authors/Creators

Description

Introductory remarks

Several scientific journals edited by the German Archaeological Institute publish their articles as `jats xml` for display in an instance of eLife Lens 2.0.0 (for example _Archäologischer Anzeiger_, see: https://publications.dainst.org/journals/aa).

The articles are enhanced with bibliographic and geographic authority data as well as other references to specific information resources of the institute's information infrastructure.

Approach

ID_Ex browses the `.xml` files stored in the article repository folder and extracts the pre-defined references. The results are stored in separate `sqlite3` tables reflecting the relation of a specific record to the `doi` of the article, e. g. from

ID_Ex is written in `Python 3.12.0` and uses `bs4` from the `BeautifulSoup` library, so it can easily be adapted to your own purposes.
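As a rough illustration of the extraction step, the sketch below pulls the `doi` and the external reference links out of a JATS string. It is not the ID_Ex code: it uses the standard-library `xml.etree.ElementTree` instead of `bs4` to stay dependency-free, and the element and attribute names (`article-id`, `ext-link`, `ext-link-type`), the link type values, the function name `extract_references` and all sample values are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Expanded name of the xlink:href attribute as ElementTree reports it.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Hypothetical, heavily shortened article; real files will differ.
SAMPLE_JATS = """<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <article-id pub-id-type="doi">10.1234/example.2023.1</article-id>
    </article-meta>
  </front>
  <body>
    <p>
      <ext-link ext-link-type="gazetteer" xlink:href="https://example.org/place/100">Place A</ext-link>
      <ext-link ext-link-type="zenon" xlink:href="https://example.org/biblio/200">Title B</ext-link>
    </p>
  </body>
</article>"""


def extract_references(jats_text):
    """Return (doi, {link_type: [href, ...]}) for one article."""
    root = ET.fromstring(jats_text)

    # The DOI identifies the article every extracted record relates to.
    doi_node = root.find(".//article-id[@pub-id-type='doi']")
    doi = doi_node.text.strip() if doi_node is not None and doi_node.text else None

    # Group the link targets by their declared type.
    refs = {}
    for link in root.iter("ext-link"):
        link_type = link.get("ext-link-type", "unknown")
        href = link.get(XLINK_HREF)
        if href:
            refs.setdefault(link_type, []).append(href)
    return doi, refs


doi, refs = extract_references(SAMPLE_JATS)
```

Each key of `refs` would then correspond to one table of records related to `doi`.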

Mode of operation - and things to be done

When the tool is started for the first time, ID_Ex creates the required `sqlite3` tables in a subfolder ("db_folder") if they do not exist yet. In the initial version of ID_Ex, you have to enter the path to the repository folder containing the `.jats` files manually. ID_Ex then extracts the data and saves it in these `sqlite3` tables.

To avoid duplicates, ID_Ex uses the `doi` to check whether an article has already been recorded and, if so, skips any further processing of it.

In addition, ID_Ex generates a detailed `.txt` log file in a subfolder ("_ID_Ex_LOG") that lists the file names and the IDs extracted from them.

With minor modifications, ID_Ex can be run at regular intervals (for example as a cron job) to keep the corpus up to date automatically.
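The first-start behaviour could look roughly like the following sketch. Only the folder name "db_folder" comes from the description above; the database file name, the table names, the columns and the helper `open_database` are hypothetical.

```python
import sqlite3
from pathlib import Path


def open_database(base_dir, db_name="id_ex.db"):
    """Open the database in 'db_folder', creating folder and tables if missing."""
    db_dir = Path(base_dir) / "db_folder"
    db_dir.mkdir(parents=True, exist_ok=True)

    conn = sqlite3.connect(db_dir / db_name)
    # IF NOT EXISTS makes the call safe on every start, not only the first one.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles "
        "(doi TEXT PRIMARY KEY, file_name TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS gazetteer_ids "
        "(doi TEXT NOT NULL, ref_id TEXT NOT NULL)"
    )
    conn.commit()
    return conn
```

Calling `open_database(".")` a second time simply reuses the existing file and tables.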
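A minimal sketch of such a duplicate check, assuming a hypothetical `articles` table keyed by `doi`; the function names and sample values are illustrative and not taken from ID_Ex.

```python
import sqlite3


def is_recorded(conn, doi):
    """Return True if an article with this DOI is already in the database."""
    row = conn.execute("SELECT 1 FROM articles WHERE doi = ?", (doi,)).fetchone()
    return row is not None


def record_article(conn, doi, file_name):
    """Insert the article unless its DOI is known; return True if inserted."""
    if is_recorded(conn, doi):
        return False  # already recorded: skip all further actions
    conn.execute(
        "INSERT INTO articles (doi, file_name) VALUES (?, ?)", (doi, file_name)
    )
    conn.commit()
    return True


# In-memory database with a hypothetical schema, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (doi TEXT PRIMARY KEY, file_name TEXT)")
first = record_article(conn, "10.1234/example.1", "article_1.xml")
second = record_article(conn, "10.1234/example.1", "article_1_copy.xml")
```

Here `first` is `True` and `second` is `False`, because the second file carries a DOI that is already stored.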
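Writing such a log could be sketched as follows. Only the folder name "_ID_Ex_LOG" is taken from the description; the log file name pattern, the line layout and the helper `write_log` are assumptions.

```python
from datetime import datetime
from pathlib import Path


def write_log(base_dir, extracted):
    """Write one log file listing each file name followed by its extracted IDs.

    `extracted` maps a file name to the list of IDs found in that file.
    """
    log_dir = Path(base_dir) / "_ID_Ex_LOG"
    log_dir.mkdir(parents=True, exist_ok=True)

    # A timestamp in the name keeps logs of separate runs apart.
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    log_path = log_dir / f"ID_Ex_log_{stamp}.txt"

    lines = []
    for file_name, ids in extracted.items():
        lines.append(file_name)
        lines.extend(f"    {ref_id}" for ref_id in ids)
    log_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return log_path
```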

New in v1.1.0:

  • A menu allows you to export the records of a selected table into a `.txt` file in the log subfolder, either directly after the extraction process or as a request to a previously generated database
  • Improved handling of the parameters needed for `sqlite3` operations using a `dict` that contains all necessary information to minimize repetition
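The two changes can be illustrated together with a sketch in which a single `dict` describes the tables and drives table creation, inserts and the `.txt` export. The structure of `TABLES`, the table and column names and all function names are hypothetical; the actual `dict` used by ID_Ex may be organised differently.

```python
import sqlite3
from pathlib import Path

# One central description of the tables, so names and columns are not repeated.
TABLES = {
    "gazetteer_ids": {"columns": ("doi", "ref_id")},
    "zenon_ids": {"columns": ("doi", "ref_id")},
}


def create_tables(conn):
    for name, spec in TABLES.items():
        cols = ", ".join(f"{col} TEXT" for col in spec["columns"])
        conn.execute(f"CREATE TABLE IF NOT EXISTS {name} ({cols})")


def insert_row(conn, table, values):
    spec = TABLES[table]  # KeyError for unknown names, so only known tables reach the SQL
    cols = ", ".join(spec["columns"])
    marks = ", ".join("?" for _ in spec["columns"])
    conn.execute(f"INSERT INTO {table} ({cols}) VALUES ({marks})", values)


def export_table(conn, table, log_dir):
    """Write all records of one table to '<table>.txt', one tab-separated row per line."""
    spec = TABLES[table]
    cols = ", ".join(spec["columns"])
    rows = conn.execute(f"SELECT {cols} FROM {table} ORDER BY rowid").fetchall()

    out_path = Path(log_dir) / f"{table}.txt"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    lines = ["\t".join(spec["columns"])]
    lines.extend("\t".join(str(value) for value in row) for row in rows)
    out_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return out_path
```

Because `export_table` only needs a connection and a table name, it can be called right after an extraction run or later against an existing database file.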

To be done:

  • Enable automatic scraping of scattered repositories containing `.jats` article files
  • Add, step by step, features to export the records as `.json` files or in other formats
  • Enable ID_Ex to handle more complex queries and requests
  • Implement an autonomous mode so that ID_Ex can be used within a cron job

Technical remarks

  • `Python 3.12.0`
  • `bs4` from `BeautifulSoup`
  • `sqlite3`
  • Tested for Windows (not for Linux yet)

See also

For preparing the `.jats` files of the journals mentioned above, see the following repositories:


Additional details

Related works

Is supplement to
Software: https://github.com/pBxr/ID_Extractor (URL)