LIBTD-1314: Parse SEAMUS db dump into usable format

Merged Janice J Kim requested to merge LIBTD-1314 into dev

Created a simple rake task and a Wordpress class for parsing a WordPress XML dump from SEAMUS and extracting authors and work items.

$ bin/rake seamus:extract_items["input.xml"]
$ bin/rake seamus:extract_authors["input.xml"]
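
As a rough illustration of the kind of extraction described above, here is a minimal sketch using REXML. It is not the code from this merge request: the method names (`child_text`, `extract_authors`, `extract_items`) and the sample document are hypothetical, and it assumes the dump follows the standard WordPress export layout (`wp:author` and `item` elements under `channel`).

```ruby
require "rexml/document"

# Hypothetical helper: text of the first direct child with this prefix and name.
def child_text(element, prefix, name)
  child = element.elements.to_a.find { |c| c.prefix == prefix && c.name == name }
  child&.text
end

# Collect login and display name from each <wp:author> under <channel>.
def extract_authors(xml)
  channel = REXML::Document.new(xml).root.elements["channel"]
  authors = channel.elements.to_a.select { |el| el.prefix == "wp" && el.name == "author" }
  authors.map do |el|
    { login: child_text(el, "wp", "author_login"),
      name: child_text(el, "wp", "author_display_name") }
  end
end

# Collect the title from each <item> under <channel>.
def extract_items(xml)
  channel = REXML::Document.new(xml).root.elements["channel"]
  items = channel.elements.to_a.select { |el| el.prefix == "" && el.name == "item" }
  items.map { |el| { title: child_text(el, "", "title") } }
end

# Made-up sample input, not taken from the SEAMUS dump.
SAMPLE_XML = <<~XML
  <rss version="2.0" xmlns:wp="http://wordpress.org/export/1.2/">
    <channel>
      <wp:author>
        <wp:author_login>jdoe</wp:author_login>
        <wp:author_display_name>Jane Doe</wp:author_display_name>
      </wp:author>
      <item>
        <title>Example Work</title>
      </item>
    </channel>
  </rss>
XML

p extract_authors(SAMPLE_XML)
p extract_items(SAMPLE_XML)
```

Matching on `prefix` and `name` directly avoids relying on namespace-aware XPath lookups; the real dump likely needs more fields than the two shown here.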

This will later serve as a starting point for importing data into COMPEL.

Note: An example WordPress dump can be found here: https://webapps.es.vt.edu/jira/browse/LIBTD-1313

Activity

  • Author Contributor

    NOTE about Instrumentation parsing: the code only attempts to parse the Instrumentation field when it contains no parentheses. Otherwise, it leaves the value as an unparsed string. I did this because the field is free-form, whereas in COMPEL instruments are broken up. Parentheses currently appear in 26 fields (out of ~490 work items); if we want to clean those up, I thought we could potentially do this afterwards if there is time.
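
The rule in the note above could be sketched roughly as follows. The method name `parse_instrumentation` is hypothetical, and splitting on commas is an assumption: the note only says that values with parentheses are left unparsed, not how the other values are split.

```ruby
# Hypothetical sketch: split an Instrumentation value only when it has no parentheses.
def parse_instrumentation(raw)
  text = raw.to_s.strip
  # Free-form values such as "percussion (2 players)" are kept as a single string.
  return text if text.match?(/[()]/)

  # Assumption: instruments are comma-separated when no parentheses are present.
  text.split(",").map(&:strip).reject(&:empty?)
end

p parse_instrumentation("flute, tape")
p parse_instrumentation("percussion (2 players), tape")
```

Returning either an array or a string mirrors the "parse or leave alone" behavior; a later cleanup pass could pick out the string-valued records for manual review.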

  • Author Contributor

    I'll make some changes to replace the puts calls with an output file.

    Originally, I didn't think I would need to output the data in a different format. The puts calls were just an example of how I could use the parsed data. I thought I would do something similar to import the data into COMPEL at a later time, without creating additional files.

    However, thinking about it more, there probably is some value in writing out what we're parsing. I'll try to rework the code to create JSON files of what I'm seeing for the authors and items. This way, it'll be clearer how we're parsing things. I still plan to import data directly from the WordPress XML dump file in the future, but these JSON files could make it easier to look at the data.

  • Author Contributor

    I just updated the code to write JSON to a file rather than use puts. New usage is:

    $ bin/rake 'seamus:extract_items[input.xml,output.json]'
    $ bin/rake 'seamus:extract_authors[input.xml,output.json]'
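
For reference, writing the parsed records to a JSON file can be as small as the sketch below. The method name `write_json` and the sample record are hypothetical; the actual task in this merge request may structure its output differently.

```ruby
require "json"

# Hypothetical sketch: serialize parsed records (an array of hashes) to a JSON file.
def write_json(records, output_path)
  # pretty_generate keeps the file readable when inspecting it by hand.
  File.write(output_path, JSON.pretty_generate(records))
end
```

A rake task taking the two arguments shown above would pass the second one as `output_path`, for example `write_json(items, args[:output])` under the assumption that the task names its arguments `input` and `output`.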
  • Author Contributor

    NOTE: There's a lack of folks available for PR reviews. I'm going to merge these changes into dev for now. Once more folks are back, we can open up new PRs for any issues/feedback.
