LIBTD-1314: Parse SEAMUS db dump into usable format
Created a simple rake task and a Wordpress class that parse a WordPress XML dump from SEAMUS and extract authors and work items.
$ bin/rake seamus:extract_items["input.xml"]
$ bin/rake seamus:extract_authors["input.xml"]
This will later serve as a starting point for importing data into COMPEL.
Note: An example WordPress dump can be found here: https://webapps.es.vt.edu/jira/browse/LIBTD-1313
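For readers unfamiliar with WordPress exports, here is a minimal sketch of what extracting authors and items from such a dump can look like. It is not the code in this merge request: the `extract_authors` and `extract_items` helpers, the hash keys, and the inline sample XML are assumptions for illustration, and it uses REXML rather than whichever parser the Wordpress class actually uses.

```ruby
require "rexml/document"

# Hypothetical stand-in for a WordPress export (WXR) file; a real SEAMUS
# dump will contain many more elements per author and item.
SAMPLE_XML = <<~XML
  <rss version="2.0"
       xmlns:wp="http://wordpress.org/export/1.2/"
       xmlns:dc="http://purl.org/dc/elements/1.1/">
    <channel>
      <wp:author>
        <wp:author_login>jdoe</wp:author_login>
        <wp:author_display_name>Jane Doe</wp:author_display_name>
      </wp:author>
      <item>
        <title>Example Work</title>
        <dc:creator>jdoe</dc:creator>
      </item>
    </channel>
  </rss>
XML

# Returns one hash per <wp:author> element (assumed output shape).
def extract_authors(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, "//wp:author").map do |node|
    {
      login: node.elements["wp:author_login"]&.text,
      display_name: node.elements["wp:author_display_name"]&.text
    }
  end
end

# Returns one hash per <item> element (assumed output shape).
def extract_items(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, "//item").map do |node|
    {
      title: node.elements["title"]&.text,
      creator: node.elements["dc:creator"]&.text
    }
  end
end
```

Calling `extract_authors(File.read("input.xml"))` would be the rough equivalent of what the rake task does with its file argument.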
Activity
NOTE about Instrumentation parsing: the code only attempts to parse the Instrumentation field when it contains no parentheses; otherwise it leaves the value as an unparsed string. I did this because the field is free-form, whereas in COMPEL the instruments are broken up. Only 26 fields currently contain parentheses (out of ~490 work items), so if we want to clean those up, we could do it afterwards if there is time.
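The rule described above can be sketched roughly as follows. The helper name `parse_instrumentation` is hypothetical, and splitting on commas is an assumption; the note does not say which delimiter the real code uses.

```ruby
# Hypothetical helper illustrating the "skip anything with parentheses" rule.
# Returns an Array of instrument names when the field is simple, or the
# original String untouched when it contains parentheses.
def parse_instrumentation(raw)
  value = raw.to_s.strip
  # Free-form entries with parentheses are left unparsed.
  return value if value.match?(/[()]/)

  # Assumption: simple entries are comma-separated.
  value.split(",").map(&:strip).reject(&:empty?)
end
```

With this sketch, `parse_instrumentation("flute, piano, tape")` yields three separate instruments, while `parse_instrumentation("percussion (2 players), tape")` comes back as the original string.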
I'll make some changes to replace the puts calls with output to a file.
Originally, I was thinking I wouldn't need to output the data in a different format. The puts were just an example of how I could use the parsed data. I thought that I would just do something similar to import the data into COMPEL at a later time without creating additional files.
However, thinking about it more, there probably is some value in writing out what we're parsing. I'll try to rework the code to create JSON files of what I'm seeing for the authors and items. That way, it'll be clearer how we're parsing things. I still plan to import data directly from the WordPress XML dump file in the future, but the JSON files could make it easier to inspect the data.
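As a rough idea of what the JSON output could look like, here is a minimal sketch using Ruby's standard `json` library. The `write_json` helper, the file name, and the record shape are assumptions for illustration, not the final implementation.

```ruby
require "json"

# Hypothetical helper: writes parsed records (an Array of Hashes) to a
# pretty-printed JSON file so the parsing results are easy to review.
def write_json(records, path)
  File.write(path, JSON.pretty_generate(records))
  path
end

# Example usage with made-up data:
#   write_json([{ login: "jdoe", display_name: "Jane Doe" }], "authors.json")
```

Pretty-printing keeps the files readable in a diff, which helps when checking how a change to the parser affects the output.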