Published April 24, 2023 | Version 3.0.0
Software Open

gwu-libraries/sfm-ui: Version 3.0.0

  • 1. Stanford University Libraries
  • 2. George Washington University Libraries
  • 3. The George Washington University
  • 4. @gwu-libraries
  • 5. @googlers
  • 6. Geospatial Training Solutions
  • 7. Smith College/Libraries/Special Collections
  • 8. @commoncrawl
  • 9. Royal Library of Belgium (KBR)

Description

Bug/security fixes

  • Django upgraded to 3.2.18 (supported until 2024)

Support for Twitter API v.2

See sfm-twitter-harvester

  • Added support for v.2 API credentials, including the bearer token (recommended) and the combination of consumer key/secret and access token/secret
  • Added support (with twarc2) for harvesting and exporting from v.2 endpoints
  • Due to changes in the Twitter API access model, only the v.2 search_recent and user_timeline endpoints (accessible on the new Basic Access tier) are available in production. A new environment variable, TWITTER_COLLECTION_TYPES, specifies which of the supported Twitter API endpoints are available in the app.
  • Twitter v.1.1 endpoints have been disabled, but collections previously created via these endpoints remain available for export.
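
As a rough illustration of how an application might read TWITTER_COLLECTION_TYPES, the sketch below parses the variable and keeps only recognized entries. The comma-separated format and the identifiers in SUPPORTED_TYPES are assumptions made for this example, not values confirmed by these release notes; consult the SFM documentation for the actual ones.

```python
import os

# Hypothetical identifiers for illustration only; the real names are defined by SFM.
SUPPORTED_TYPES = ("twitter_search_2", "twitter_user_timeline_2", "twitter_filter_stream")


def enabled_collection_types(environ=os.environ):
    """Return the supported collection types listed in TWITTER_COLLECTION_TYPES.

    Assumes a comma-separated value. An unset or empty variable enables nothing,
    and unrecognized entries are ignored.
    """
    raw = environ.get("TWITTER_COLLECTION_TYPES", "")
    requested = [item.strip() for item in raw.split(",") if item.strip()]
    return [name for name in requested if name in SUPPORTED_TYPES]
```

Passing a plain dictionary instead of os.environ, for example {"TWITTER_COLLECTION_TYPES": "twitter_search_2"}, makes the function easy to try without touching the real environment.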

Outstanding issues

Streaming API

  • Streaming rules are handled as seeds; because the Streaming API supports multiple rules per request, an SFM stream collection can have multiple seeds. However, the functionality to limit exports to a subset of active/deleted seeds does not work for these collections. (The logic in SFM for seed-based export applies only to user-timeline collections.)
  • During testing, a long-running stream harvest encountered a "Read timed out" error from the Twitter API. As a result, no further Tweets could be collected until the harvest was voided in the UI and restarted. We consulted the twarc developers; the cause of the error remains unclear, but it may be related to the following:
    • Streaming harvests involve a periodic restart of the twarc.stream() process (every 30 minutes). This logic is designed to prevent excessively large WARC files (since a new WARC is created only at the start of the twarc.stream() process).
    • The twarc developers posit that this regular interruption could cause problems, since the stream is designed to run continuously. The v.2 API appears to be less responsive than the v.1 API, so the API may return a timeout error if the previous connection hasn't fully closed by the time twarc tries to open a new one.
    • If that is the problem – and it's hard to know for sure – then introducing a sleep before restarting could be effective; however, that could result in missed Tweets (a risk already posed by restarting the stream every 30 minutes).
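
The trade-off described above can be sketched as a restart loop with a short pause between segments. This is not SFM's or twarc's actual implementation: stream_once is a hypothetical callable that opens the stream, collects for up to segment_seconds, and closes the connection before returning. The 5-second pause is an arbitrary placeholder, and max_segments exists only so the loop can be bounded when trying it out.

```python
import time


def run_with_restarts(stream_once, segment_seconds=30 * 60, pause_seconds=5.0,
                      max_segments=None, sleep=time.sleep):
    """Call stream_once(segment_seconds) repeatedly, pausing between segments.

    stream_once is a hypothetical stand-in for one streaming session.
    Returns the number of segments that were run.
    """
    segments = 0
    while max_segments is None or segments < max_segments:
        stream_once(segment_seconds)
        segments += 1
        if max_segments is not None and segments >= max_segments:
            break
        # Give the previous connection time to close before reconnecting.
        # Tweets posted during this pause are not collected.
        sleep(pause_seconds)
    return segments
```

A longer pause gives the old connection more time to close but widens the window in which Tweets are missed, so any real value would need to be chosen by testing.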

Files

gwu-libraries/sfm-ui-3.0.0.zip (3.0 MB)
md5:a12758df01761e3ff14f05a9bbe31e3a