
UF Library MAI Process: Batch Process

A guide to help the MAI team organize and track the major steps, and to provide an overview of the whole process.

Steps

Upon receiving the data from the Metadata Librarian, create a new spreadsheet that lists only the bibid and vid for further work in the Steward; that is, copy the bibid and vid into a new spreadsheet. Keep the original data for analyzing the whole batch.

If bibid and vid come together as one ID, use the following steps to split the ID.

Splitting the ID 

In Excel, 

→ Use "Text to Columns" and make sure both columns are set as TEXT; otherwise Excel drops leading zeros (for example in a vid such as 00001). The second column has to be clicked manually before its field type can be set.

→ Insert the headers: add a row at the top with column 1: bibid and column 2: vid.

→ Save the file as xlsx.
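
For larger batches, the split can also be scripted. The following is a minimal sketch, not part of the established workflow. It assumes that a bibid is two letters followed by eight digits (like AA00002973), that a vid is five digits (like 00001), and that the two may be joined by a single separator such as "_" or "-". The function names and the sample IDs are hypothetical. The values are kept as text so that leading zeros survive; the result is written as CSV here, so saving as xlsx would still happen in Excel.

```python
import csv
import io
import re

# Assumption: bibid = 2 letters + 8 digits, vid = 5 digits, optionally
# joined by one separator character such as "_", "-", ":" or a space.
ID_PATTERN = re.compile(r"^([A-Za-z]{2}\d{8})[\s_:\-]?(\d{5})$")


def split_id(combined):
    """Return (bibid, vid) as strings so that leading zeros are preserved."""
    match = ID_PATTERN.match(combined.strip())
    if match is None:
        raise ValueError(f"Unrecognized ID format: {combined!r}")
    return match.group(1), match.group(2)


def to_rows(combined_ids):
    """Build the rows for the import sheet, starting with the header row."""
    return [("bibid", "vid")] + [split_id(c) for c in combined_ids]


# Hypothetical input values for illustration only.
rows = to_rows(["AA00002973_00001", "AA0004806000001"])
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

IDs that do not match the assumed pattern raise an error instead of being split silently, so they can be checked by hand.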

Log in to the Steward

→ MAW Admin

Dps → Batch Items → Import

Choose the file

Import the spreadsheet prepared during the last step. 

Choose the Format displayed by MAW: xlsx

→ submit

If there are any errors, the interface lists them. Report the issue to Xiaoli with a screenshot of the error message. For unknown issues, Xiaoli will report to Robert and copy the group members on that email so that a solution can be provided in a timely manner.

If there are no errors, preview the results and note the batch set number (e.g., 3776). The status is shown in green at the top of the page,

e.g., “Import finished, with 20 new and 0 updated batch items.”

→ Log the batch ID and other batch information in the MAI Job Track Form.

Log in to the Steward

→ MAW Admin

Dps → Api term jobs → Add API Term Job → Batch set (e.g., 3776) →

  • Low: 1. Leave High empty to process the whole batch; otherwise, set the high number (e.g., 1000)

  • Thesaurus: choose one of the following

floridageo_201910

floridathes

→ Save

Note the Api term job ID, e.g., 136.

→ Log the api term job ID and other api term job information in the MAI Job Track Form.

To follow the progress of an Api Term Job, check the log for that Api Term Job number at

Y:\Main\MARSHALING_WORK\api_terms

When the “end datetime” is populated for Api term jobs, the whole job is done. The results can be checked in Api Terms. 

MAW Admin

Api term jobs

→ Put the api term job ID in the filter on the right.

→ Log the api term job start and end times, the calculated duration, and the status in the MAI Job Track Form.
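
The duration for the MAI Job Track Form can be worked out by hand or with a few lines of code. The sketch below assumes the start and end times are copied out as "YYYY-MM-DD HH:MM:SS"; that format and the sample timestamps are assumptions, not values taken from the Steward.

```python
from datetime import datetime

# Assumption: timestamps are typed or copied as "YYYY-MM-DD HH:MM:SS".
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def job_duration(start_text, end_text):
    """Return the elapsed time between start and end as a timedelta."""
    start = datetime.strptime(start_text, TIME_FORMAT)
    end = datetime.strptime(end_text, TIME_FORMAT)
    if end < start:
        raise ValueError("End time is earlier than start time")
    return end - start


# Hypothetical start and end times for illustration only.
duration = job_duration("2020-03-02 09:15:00", "2020-03-02 11:47:30")
print(duration)  # 2:32:30
```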

Make good use of the filter on the right. 

Multiple filters can be used at the same time. To remove a filter, single-click the check mark underneath it.

To filter by Thesaurus, please use the following names: 

floridageo201910 (the current version of Floridageo)

floridathes (the current version of Jstor Thesaurus)

Sample Size

We choose roughly 1% of the batch we receive (for instance, 24 records for a batch of 2459 bibs) to evaluate the results and to learn what can be done to improve the compatibility of the Thesauri and the data.
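
As an illustration of the 1% rule, the sketch below draws a random sample from a list of bibids. Rounding down (so that 2459 gives 24) and sampling at random are assumptions made for this example; the guide only says "roughly 1%". The bibids are generated placeholders.

```python
import random


def sample_size(batch_size, fraction_percent=1):
    """Round down to whole records, but sample at least one record."""
    return max(1, batch_size * fraction_percent // 100)


def pick_sample(bibids, seed=None):
    """Pick a random sample of about 1% of the given bibids."""
    if not bibids:
        return []
    rng = random.Random(seed)  # a fixed seed makes the pick repeatable
    return rng.sample(bibids, sample_size(len(bibids)))


# Hypothetical batch of 2459 placeholder bibids.
bibids = [f"AA{n:08d}" for n in range(1, 2460)]
sample = pick_sample(bibids, seed=2020)
print(len(sample))  # 24
```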

Preparing Testing Material

If PDFs are available, we can copy the content of the PDF and paste it directly into MAIstro to test in Test MAI.

If PDFs are not available, we need to go to the resource directory and either fetch an all-page text file, or gather the single-page txt files and combine them into one all-page txt file in Adobe Acrobat.

Steps are below: 

Log in to UFDC

→ Replace the bibid in the following link with the bibid you picked to work on, and open the link in the browser

https://ufdc.ufl.edu/l/AA00048060/00001/directory

Copy the file path, like the one below, from the top of the webpage. Please note that you have to be logged in to UFDC to see the path.

\\flvc.fs.osg.ufl.edu\flvc-ufdc\resources\AA\00\00\29\73\00001

Paste the file path into Windows Explorer. After the folder loads,

→ Copy the txt file that holds all page content (usually the txt file whose name ends with “_pdf”) to your test material folder on the Shared Project drive: SharedProjects\Taxonomy\Testing Materials 2020

For example, after the folder loads at \\flvc.fs.osg.ufl.edu\flvc-ufdc\resources\AA\00\00\29\73\00001

search for “pdf”; the all-page txt file can easily be spotted in the results.

 

Agront_pdf.txt should then be saved at

SharedProjects\Taxonomy\Testing Materials 2020

→ Rename the txt file with the bibid. In this case, Agront_pdf.txt needs to be renamed to AA00002973.txt
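
The folder path in the example follows a visible pattern: the bibid AA00002973 is split into two-character pairs, and the vid 00001 is the last folder. The sketch below builds the path that way and copies the “_pdf” text file under its new name. It is only an illustration: the pairing rule is inferred from the single example above, and the function names are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path, PureWindowsPath

RESOURCE_ROOT = r"\\flvc.fs.osg.ufl.edu\flvc-ufdc\resources"


def resource_folder(bibid, vid, root=RESOURCE_ROOT):
    """Split the bibid into two-character pairs, then append the vid."""
    pairs = [bibid[i:i + 2] for i in range(0, len(bibid), 2)]
    return PureWindowsPath(root, *pairs, vid)


def copy_all_page_text(source_dir, target_dir, bibid):
    """Copy the first '*_pdf.txt' file found and name it '<bibid>.txt'."""
    matches = sorted(Path(source_dir).glob("*_pdf.txt"))
    if not matches:
        return None  # no all-page text file in this folder
    destination = Path(target_dir) / f"{bibid}.txt"
    shutil.copyfile(matches[0], destination)
    return destination


print(resource_folder("AA00002973", "00001"))

# Demonstration with temporary folders instead of the real drives.
with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    (Path(src) / "Agront_pdf.txt").write_text("all page text", encoding="utf-8")
    copied = copy_all_page_text(src, dst, "AA00002973")
    print(copied.name)  # AA00002973.txt
```

When the function returns None, there is no all-page text file and the single-page files have to be combined instead.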

If no all-page txt file is available,

→ select all the single-page txt files

→ right-click and choose "Combine files in Acrobat" (please make sure Acrobat Pro is installed).

→ After all files display in Acrobat, click "Combine".

→ After all files are combined, choose File → Save As, choose your folder, and choose Text (Plain) from the "Save as type" list.

→ Name the file [bibid]_xxxx, where xxxx is the 4-digit year in which the test takes place, to mark it as a txt file created in the MAI evaluation process.
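
Where Acrobat Pro is not at hand, plain concatenation gives a similar all-page file. This is an alternative to the Acrobat steps above, not the documented method. The sketch assumes that the single-page txt files sort into page order by file name, and it writes the result into a separate folder using the [bibid]_xxxx naming rule. The function name and the sample page files are hypothetical.

```python
import tempfile
from datetime import date
from pathlib import Path


def combine_page_texts(page_dir, out_dir, bibid, year=None):
    """Concatenate single-page txt files into '<bibid>_<year>.txt'."""
    year = year or date.today().year
    # Assumption: sorting by file name puts the pages in reading order.
    pages = sorted(Path(page_dir).glob("*.txt"))
    out_path = Path(out_dir) / f"{bibid}_{year:04d}.txt"
    with out_path.open("w", encoding="utf-8") as out:
        for page in pages:
            out.write(page.read_text(encoding="utf-8", errors="replace"))
            out.write("\n")
    return out_path


# Demonstration with temporary folders and two made-up page files.
with tempfile.TemporaryDirectory() as pages, tempfile.TemporaryDirectory() as out:
    (Path(pages) / "00001.txt").write_text("page one", encoding="utf-8")
    (Path(pages) / "00002.txt").write_text("page two", encoding="utf-8")
    combined = combine_page_texts(pages, out, "AA00002973", year=2020)
    print(combined.name)  # AA00002973_2020.txt
    combined_text = combined.read_text(encoding="utf-8")
```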

 

Decision Flow

The decision flow chart below helps to go through the whole evaluation process. 

Common Issues

The following lists common issues found during the evaluation, along with ways of investigating and resolving them.

Suggested Terms Not Found in the PDF

Poor OCR Quality

Missing Core Concepts

Terms Are Too Generic

Terms Look Out of Place

Result Discrepancy between MAIstro and the Steward 

 

Issue 1
 
Suggested Terms Not Found in the PDF
Description: This happens because the ttm (text to match) and the actual term are not the same. For instance, the ttm could be “ache”, but the suggested term is “Pain”.
Investigation Methods

MAIstro → Thesaurus Master Tab

Edit → Search (paste in the suggested term in question)

In the term section → “rules for the term” to see all the ttm of the searched term. Then double-click to open the rules and check the rule body for any issues.

 

If the rules do need work, log “Y” in the “Rule Fix (Y/N)” column of the “Term Test Log” and then discuss the rule change with Xiaoli.

 

If the rule looks fine, do a quick search for the ttm in the PDF to see whether it indeed appears there.

 

If the quick search is inconclusive, copy the full PDF and paste it into MAIstro → MAI Test to see what MAI returns, and then evaluate further whether any issue exists.

 

If MAIstro returns a totally different set of terms from Api Term Jobs, this is in fact a Poor OCR Quality issue; see Issue 2 below for details.
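
The quick search for ttm can be illustrated with a few lines of code. The sketch below counts whole-word, case-insensitive matches of each ttm in a piece of text. It only mimics a manual search; it is not how MAIstro evaluates rules, and the sample sentence is made up.

```python
import re


def find_ttm_hits(text, ttm_list):
    """Count whole-word, case-insensitive occurrences of each ttm."""
    hits = {}
    for ttm in ttm_list:
        pattern = r"\b" + re.escape(ttm) + r"\b"
        hits[ttm] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return hits


# Made-up sentence: the ttm "ache" is present, the word "pain" is not.
text = "The patient reported a dull ache. The ache worsened at night."
print(find_ttm_hits(text, ["ache", "pain"]))  # {'ache': 2, 'pain': 0}
```

In this example the suggested term “Pain” would be justified by the ttm “ache”, even though the word “pain” never appears in the text.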

 

Issue 2
 
Poor OCR Quality
Description
  • Api terms all look irrelevant to the actual topics, e.g., “Research tools, Customers, Abstracting and indexing services” were returned for an article about “diseases”;

  • Api terms returned from full text look largely different from the ones returned from Abstract; 

  • MAI suggested terms look quite different from the results produced by MAIstro. E.g., Api term jobs return “Dengue, Climate models, Housing conditions” while MAIstro returns “Rain, Modeling, Epidemics, Parametric models”. 

Investigation Methods

MAIstro →Test MAI tab

Copy the full PDF and paste it into MAIstro → MAI Test to see the suggested terms. If MAIstro returns a different set of terms, the issue lies in the OCR quality of the text files stored in the DSS Resource Directory, because Api term jobs use the text files and XMLs from the Resource Directory to do the MAI. After confirming the issue, log “Y” in the “Resources Directory Issue (Y/N)” column of the “Term Test Log”.

Issue 3
 
Missing Core Concepts
Description: Topics/concepts that are obviously covered by the materials are not reflected in the suggested MAI terms, and a Google search confirms that the concept is generally useful. Google Scholar is a good place to confirm this.
Investigation Methods

MAIstro → Thesaurus Master Tab

Edit → Search (paste/type in the term/concept in question)

Check whether any of the included terms could be a synonym of this topic/concept.

If there is one, identify the ttm that should be used in the current materials to pull out the term in question, and then add the new ttm to the existing taxonomy term.

If a similar concept exists in the Thesaurus, the term can also be added as a non-preferred term to the existing term.

If the concept/topic doesn’t exist in the Thesaurus, add the new term under BT: UF Addition.

Before adding a term, google the topic/term. Ideally, it can be found in a known authority. Add the source information (e.g., the URL of the Wikidata page) and the material’s bibid to "Editorial Notes". Add the definition to "Definition".

                        

Issue 4
 
Terms Are Too Generic
Description: Terms appear among the top 3 suggested terms but are too broad to be useful, for instance “Schools”, “Counties”, etc. More granular terms are expected.
Investigation Methods

MAIstro → Thesaurus Master Tab

Edit → Search (paste/type in the generic term)

Click to choose the term → check whether the term in question has any NT.

If so, check “Candidate”. This keeps the term in the Thesaurus but removes it from the MAI process. Put a note in “Scope Note”: “Too generic, remove from MAI process”. Log “Y” in the “Too Generic (Y/N)” column of the “Term Test Log”, one term per row.

 

                    

Issue 5
 
Terms Look Out of Place
Description: Usually these terms have rule issues. Some may look too generic, as explained in the “Terms Are Too Generic” section.
Investigation Methods

MAIstro → Thesaurus Master Tab

Edit → Search (paste/type in the term in question)

Click to choose the term → click “rules for the term” to see all the ttm of the target term.

→ Check the individual ttm/rules to see whether any rule generates too many counts of the term.

e.g., the term “Disease risks” used to have a rule under the ttm “risk”, so that whenever “risk” was near “illness” etc., “Disease risks” was returned. This rule generated too many counts of “Disease risks”. In this case, “risk” was removed as a ttm of “Disease risks”.

 

e.g., the term “Writing tablet” appears because “tablet” is the ttm, while “tablet” often means medical pills in the text. This rule needs to be changed to avoid generating wrong counts.

 

→ If rules should be changed, log “Y” in the “Term Test Log” column “Rule Fix (Y/N)” and discuss the rule change with Xiaoli.
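
To see why a broad ttm inflates the counts, consider a toy version of a proximity rule. The sketch below is not MAIstro rule syntax and does not reproduce how MAIstro evaluates rules; it is a simplified stand-in that counts a ttm whenever a context word occurs within a few words of it. The window size and the sample text are assumptions.

```python
import re


def proximity_hits(text, ttm, context_words, window=5):
    """Count ttm occurrences with a context word within `window` tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    context = set(context_words)
    count = 0
    for i, token in enumerate(tokens):
        if token != ttm:
            continue
        nearby = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context.intersection(nearby):
            count += 1
    return count


# Made-up text: only the second sentence is really about disease risk.
text = ("Financial risk rose during the illness of the chairman. "
        "The risk of illness among workers was not studied.")
print(proximity_hits(text, "risk", ["illness", "disease"]))  # 2
```

Both occurrences of “risk” are counted, although the first sentence is about finance. A ttm as general as “risk” fires in many unrelated contexts, which is the kind of over-counting to look for when checking the rules.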

 

Issue 6
 
Result Discrepancy between MAIstro and the Steward
Description: MAIstro and the Steward produce different but similar results, for instance similar concepts with different frequency counts, so that the top 3 terms differ. If MAIstro and the Steward return two groups of terms that do not overlap at all, this is more likely a Poor OCR Quality issue.
Investigation Methods

MAIstro → Test MAI → paste in the text or load a txt file → run MAI → check the results

The Steward → Api Terms → search by bibid → check the results

 

Compare the two sets of results. If the top 3 are totally different, log "Y" in the “Term Test Log” column "Big Results Discrepancy Between MAIstro and the Steward (Y/N)" and describe the details of the discrepancy in the "Discrepancy Note" column. Then follow up with Xiaoli in the meeting.
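
The comparison can be sketched as a small check. The code below assumes both term lists are ordered from most to least frequent; the function name and the result keys are hypothetical. It flags a big discrepancy when the top 3 share no term, and a possible Poor OCR Quality issue when the two lists share no term at all. The sample terms are the ones quoted under Issue 2.

```python
def normalize(terms):
    """Lower-case and trim terms so that the comparison ignores case."""
    return [t.strip().lower() for t in terms]


def compare_top_terms(maistro_terms, steward_terms, top_n=3):
    """Compare two ranked term lists (most frequent term first)."""
    maistro = normalize(maistro_terms)
    steward = normalize(steward_terms)
    top_overlap = set(maistro[:top_n]) & set(steward[:top_n])
    any_overlap = set(maistro) & set(steward)
    return {
        "big_discrepancy": "N" if top_overlap else "Y",
        "possible_ocr_issue": not any_overlap,
        "shared_top_terms": sorted(top_overlap),
    }


steward = ["Dengue", "Climate models", "Housing conditions"]
maistro = ["Rain", "Modeling", "Epidemics", "Parametric models"]
result = compare_top_terms(maistro, steward)
print(result)
```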

After fixing the issues identified in Step 5, the Api Term Job needs to be re-run over the same batch. That is, repeat Step 3 to get a new batch of MAI results.

Only re-run the Api Term Job when changes have been made in MAIstro. If nothing needs to be fixed for the Api Term Job in question, this step should be skipped.

The Mets Edit Term Job writes the terms to the XMLs and prepares the updated XMLs to be re-ingested by Sobek.

Mets Edit Term Job

Put in the Api term job ID obtained in Step 6

→ Max top terms: set the number of terms to be written to each XML (usually 3)

→ Save

→ Log the Mets Edit Term job ID in the MAI Job Track Form.

Check XMLs Results

→ Locate the XMLs by the Mets edit term job ID (two digits) under

DLC\Main\MARSHALING_WORK\mets_edit_term_job

In the Windows Explorer search box, type XML to list all the XMLs from all folders with their size information.

 

Based on the size, choose the biggest XMLs to check. Depending on the batch size, check 1-5 XMLs: the bigger the batch, the more XMLs.
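
Picking the biggest XMLs can also be done with a short script instead of the Explorer search. The sketch below walks a job folder, including its subfolders, and returns the largest XML files. The function name is hypothetical, and the demonstration uses a temporary folder with made-up files rather than the real job directory.

```python
import tempfile
from pathlib import Path


def largest_xml_files(job_dir, how_many=3):
    """Return the largest .xml files under job_dir, biggest first."""
    files = [p for p in Path(job_dir).rglob("*.xml") if p.is_file()]
    files.sort(key=lambda p: p.stat().st_size, reverse=True)
    return files[:how_many]


# Demonstration with a temporary folder and made-up file sizes.
with tempfile.TemporaryDirectory() as job:
    for name, size in [("a.xml", 10), ("b.xml", 300), ("c.xml", 50)]:
        folder = Path(job) / name[0]
        folder.mkdir()
        (folder / name).write_bytes(b"x" * size)
    biggest = [p.name for p in largest_xml_files(job, how_many=2)]
    print(biggest)  # ['b.xml', 'c.xml']
```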

In Oxygen

→ Open the largest XML file

→ Ctrl + F to open up the search window

→ Search for “jstor” or “local” to locate the newly added MAI terms. (If “local” pulls out other content besides floridageo terms, use “Next” to move down to the Subject section.)

→ Check that the right number of MAI terms has been written to the XMLs in the right place.
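
The count can be double-checked with a script. The sketch below rests on an assumption that is not confirmed by this guide: that the MAI terms are written as MODS subject elements carrying an authority attribute of "jstor" or "local". The sample XML is made up, and real files should be inspected in Oxygen first to confirm the structure. As noted above, "local" may also match subjects that are not floridageo terms.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"


def count_mai_subjects(xml_text, authorities=("jstor", "local")):
    """Count MODS subject elements per authority value (assumed layout)."""
    root = ET.fromstring(xml_text)
    counts = {name: 0 for name in authorities}
    for subject in root.iter(f"{{{MODS_NS}}}subject"):
        authority = subject.get("authority")
        if authority in counts:
            counts[authority] += 1
    return counts


# Made-up sample record for illustration only.
sample_xml = """<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
  <mods:subject authority="jstor"><mods:topic>Dengue</mods:topic></mods:subject>
  <mods:subject authority="jstor"><mods:topic>Epidemics</mods:topic></mods:subject>
  <mods:subject authority="local"><mods:geographic>Gainesville</mods:geographic></mods:subject>
  <mods:subject authority="lcsh"><mods:topic>Agriculture</mods:topic></mods:subject>
</mods:mods>"""
counts = count_mai_subjects(sample_xml)
print(counts)  # {'jstor': 2, 'local': 1}
```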

Copy the XMLs to the Metadata Inbound Folder at DLC\Main\INCOMING\Metadata

→ Make sure the Metadata Inbound Folder is empty

→ Copy the XML folders there

→ Once the Metadata folder is empty again, which means all folders have been processed, check the error folder at DLC\Main\INCOMING\MetadataFailures

→ Report unsolvable errors to Xiaoli

Check that the XML displays correctly on the UFDC pages. Floridageo now uses "local" as the code in the parentheses.

 

Xiaoli will back up the thesaurus terms and rules biweekly. A rule backup for one thesaurus usually takes about 1 hour of machine time.

Quick Form Access

MAI Process with Step Number


Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.