The Magazine of Digital Library Research

D-Lib Magazine

July/August 2017
Volume 23, Number 7/8

 

Massive Newspaper Migration — Moving 22 Million Records from CONTENTdm to Solphal

Alan Witkowski, Anna Neatrour, Jeremy Myntti and Brian McBride
J. Willard Marriott Library, University of Utah
{alan.witkowski, anna.neatrour, jeremy.myntti, brian.mcbride} [at] utah.edu

 

https://doi.org/10.1045/july2017-witkowski

 

Abstract

Utah Digital Newspapers is a pioneering digital newspapers program at the University of Utah J. Willard Marriott Library. Recently, a small project team completed a successful migration away from CONTENTdm onto a home-grown system called Solphal, built using open-source applications. The migration process is detailed along with examples of scripts used to prepare and enhance metadata. Transitioning away from a limiting vendor-based solution to a home-grown system has enabled the Utah Digital Newspapers program to be more responsive to user requests and to realize greater efficiencies in hardware and software. The platform has opened up new possibilities for the future as the collection continues to grow.

Keywords: Utah Digital Newspapers, Migrating Digital Content, Solphal, Bash Scripting, CONTENTdm

 

1 History of Digital Newspapers at the J. Willard Marriott Library

The Utah Digital Newspapers (UDN) program is one of the largest and longest running statewide programs for making newspapers publicly available online. The digital asset management system that made UDN content available was originally CONTENTdm. Due to the large amount of content being made available in UDN, it was determined that CONTENTdm no longer met the needs of the program and a migration to a new system was necessary. This article will briefly discuss the history of the UDN program, along with the process and issues of migrating the content to a home-grown system built on open-source tools. The new repository system, called Solphal, is built on the high-performance PHP framework Phalcon, Solr for indexing and metadata storage, and the NGINX web server. While many of the processes used in this migration and detailed in this article are unique to how CONTENTdm handled newspaper repository data and the improvements needed to migrate to a new system, the information and processes detailed can be translated to other systems to help other libraries in their own migrations.

The UDN program was launched in 2001 with the receipt of a Library Services and Technology Act (LSTA) grant to digitize three newspapers (Arlitsch, Yapp, & Edge, 2003). Subsequent participation in the National Digital Newspaper Program (NDNP) and National Endowment for the Humanities (NEH) grants raised the total number of articles in UDN to over 20 million in 2016. Early in the history of the newspapers program at the University of Utah, article level metadata was identified as an essential part of the architecture, allowing users to browse and search for advertisements, births, weddings, and obituaries (Herbert & Arlitsch, 2004). This was accomplished by using a custom indexing program written for the J. Willard Marriott Library by iArchives, where NDNP batches were split at either the article or page level and prepared for CONTENTdm ingest.

Newspaper digitization continues to be a popular project for LSTA, Utah State Historical Records Advisory Board (USHRAB), and privately funded grants, with recent projects including the Daily Herald, Park Record, American Fork Citizen, and Vernal Express. In addition to grant projects, several cities or counties in Utah have committed funds towards newspaper digitization as well as the processing of current born digital newspaper content for inclusion in UDN. As of June 15, 2017, there were 141 newspaper titles, 191,599 issues, 2,116,565 pages, and 20,475,912 articles available in UDN.

 

2 Scalability Issues, Configuration Issues, and Customizations Required to Host Newspapers Content on CONTENTdm

Large newspaper collections on the UDN CONTENTdm server needed to be split into multiple sub collections for better performance. For example, the Salt Lake Tribune was split into over thirty different sub collections. Custom queries were created on the front page of UDN to allow users to browse across multiple collections for a single paper by year. In addition, custom search forms were developed for each newspaper title. Another important issue affecting patrons was slow page response times. It was quite common for page load times to reach or exceed ten seconds, and this issue adversely affected user experience. In addition, lack of control over the image viewer in CONTENTdm often caused UDN users frustration when they wanted to download or save snippets of news articles. Maintaining the CONTENTdm servers required approximately 0.75 FTE of staff time, covering both systems and application support and maintenance. Additional work beyond this was required whenever there was a new CONTENTdm software update.

The J. Willard Marriott Library began a Digital Asset Management Systems Review in 2013 when it was clear that there were scalability, performance, reliability, and user interface issues with CONTENTdm, both for UDN and other digital collections (Masood & Neatrour, 2014). As part of this process, various digital asset management systems were reviewed, including ChronAm for newspapers. After the systems review, ChronAm proved not to be a viable option for newspapers content in our digital library program, primarily because implementing ChronAm locally would have required further customization and development work to accommodate the article level display of UDN newspapers. With limited staff to devote to custom programming for both a newspaper system and a digital repository, the library also made the strategic decision to support a single digital library system for newspapers as well as cultural heritage content such as images, videos, and other multimedia files. In the migration timeline to Solphal, Utah Digital Newspapers was migrated first because newspaper metadata and formatting were relatively standardized compared to our digital collections data. The newspapers content was also easily made static, as newspaper collections are generally digitized in bulk, without the need for ongoing metadata correction and clean-up.

 

3 Preparing UDN for Migration

Prior to migration, the existing metadata in UDN needed to be reviewed and standardized before moving to Solphal. Even though digitization for UDN was completed by vendors and NDNP batches are standardized, there were still variations in file types, formats, and metadata in UDN, since the newspapers program had been running for such a long time.

Project management for the UDN migration was handled by the lead application developer who developed scripts to handle data assessment, remediation, and enhancement. When additional manual work was necessary for data clean-up, metadata librarians and staff were consulted. E-mail updates on the status of the project were sent frequently, ensuring that all team members were on the same page. The core team for the migration consisted of the lead developer, the supervisor for the Digital Library Services Department, the supervisor for Digital Infrastructure Development, a metadata librarian, and the assistant head of digital operations for the J. Willard Marriott Library. The relatively small size of the team contributed to the migration process functioning in a lightweight, quasi-agile way, with the majority of the work for the project being completed in a six month timeframe.

Metadata preparation for migration followed the general practices of assessment, remediation, and enhancement which are often required for any digital library systems migration. Custom scripts developed to handle the migration are detailed throughout this article, illustrating the different aspects of the migration process. Bash scripting was particularly useful, as it allowed us to target particular aspects of metadata that needed to be cleaned up in CONTENTdm first prior to the migration.

 

3.1 Metadata template assessment and standardization

Before work on fixing metadata began, we needed to normalize the schema for all 361 UDN collections. The command in Figure 1 was used to get field counts across all collections to assist with this step of metadata assessment. Field counts were used to get a sense of the varying metadata templates used to define the fields for each newspaper collection. While the metadata was largely standardized, some newspaper collections still had empty or non-standard fields.

find -iname "config.txt" -exec sh -c 'echo {}; cat {} | wc -l' \;

Figure 1: A bash command to show field counts for each collection. This finds every file called "config.txt", which contains the CONTENTdm schema for a collection, and outputs the full path to the file with a line count.

Based on the field counts, a subsequent review set a standardized newspapers schema across all collections, ensuring that no fields had name variants. Core fields in UDN included title, type, date, year, month, day, rights, publisher, creator, and page. Empty fields in CONTENTdm were noted and not migrated into Solphal.
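
The assessment in Figure 1 can be extended to spot field-name variants directly. The sketch below is illustrative rather than one of the actual migration scripts; it assumes collection directories live under a col/ path and compares every collection's config.txt against a chosen reference collection.

REF="col/de2/config.txt"                     # reference schema (illustrative path)
find col -iname "config.txt" | while read -r cfg; do
    echo "== $cfg =="
    diff "$REF" "$cfg" | head -n 20          # show only the first few differences per collection
done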

 

3.2 CONTENTdm metadata overview

The metadata for a CONTENTdm collection is stored in a flat file named desc.all (Figure 2); these files totaled approximately 65 GB pre-migration. The digital library programming staff had developed a variety of processes over the years that took advantage of the ability to script against a CONTENTdm desc.all file for large scale metadata enhancement. For example, an enhancement project to add geographic values to a CONTENTdm digital collection was accomplished by adding latitude and longitude values via scripting and reindexing the desc.all file (Neatrour, Morrow, Rockwell, & Witkowski, 2011). This type of scripting and reindexing of the desc.all file is possible for institutions that self-host CONTENTdm. In preparing metadata for migration, the metadata in the desc.all file was corrected and enhanced through an iterative process.

<dmcreated>2004-02-27</dmcreated>
<dmmodified>2004-02-27</dmmodified>
<dmrecord>0</dmrecord>
<title>American Eagle, 1897-05-08 Page 1</title>
<subjec></subjec>
<descri></descri>
<creato>American Eagle Publishing Co.</creato>
<publis>Digitized by: Univ. of Utah</publis>
<contri></contri>
<dateor>1897-05-08</dateor>
<date></date>
<type>page</type>
<format>text/PDF</format>
<identi></identi>
<source></source>
<langua>eng</langua>
<relati></relati>
<covera></covera>
<rights>Material in the public domain. No restrictions on use.</rights>
<itemye>1897</itemye>
<itemmo>May</itemmo>
<itemda>08</itemda>
<itempa>Page 1</itempa>
<itemtr></itemtr>
<genre>newspaper</genre>
<fullrs></fullrs>
<find>2.pdf</find>
<dmaccess></dmaccess>
<dmimage></dmimage>
<dmad1></dmad1>
<dmad2></dmad2>
<dmoclcno></dmoclcno>
<dmcreated>2004-02-27</dmcreated>
<dmmodified>2004-02-27</dmmodified>
<dmrecord>1</dmrecord>
<title>American Eagle, 1897-05-08 Page 2</title>
<subjec></subjec>
<descri></descri>

Figure 2: An excerpt from the desc.all metadata file. One complete record in UDN spans from a <dmcreated> element through the following <dmoclcno> element.

 

3.3 Metadata enhancement

One of the priorities for metadata enhancement in Solphal was developing new fields to improve search functionality that had previously been accomplished through workarounds in CONTENTdm. For example, in Solphal it was desirable to browse by paper name. Previously, this type of functionality was accomplished through custom queries developed to compensate for the split newspaper collections required to support such large collections on a single CONTENTdm server. Because this architecture would not carry over to Solphal, the same functionality had to be preserved another way post-migration. Since there was no dedicated field for the newspaper name, only the existing title field, which also carried issue, date, page, or article information, the paper name for each record had to be extracted from the title and inserted into a new field. The format of each title depends on the type of record, and the new paper name field needed to be added to every record type, at the issue, page, and article level, across the entire newspaper repository. Most records followed the formats below.

Record Type    Title Format
Issue          Paper_Name, Issue_Date
Page           Paper_Name, Issue_Date, Page_Number
Article        Paper_Name, Issue_Date, Article_Title

To evaluate the state of each paper name in UDN, the bash command in Figure 3 was used to build a list of names. Sometimes there wasn't a comma directly after the paper name, so a more complicated regular expression had to be built to handle all cases, with "Mt. Pleasant Pyramid" as a special case.

find -iname "desc.all" -exec grep -P "(?<=<title>)(Mt. Pleasant Pyramid|.+?)(?=[.,:(])" {} -o \; | sed 's/ *$//' | sort | uniq

Figure 3: A bash command to build a list of paper names by searching every title. The find program searches for files called desc.all, then executes grep on each one. grep searches the title field for all characters leading up to a period, comma, colon, or opening parenthesis, then returns the match. The sed program is used to trim trailing whitespace in the match. sort and uniq are used to sort the output and return only unique lines.
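
Once the list of paper names was verified, the extracted name could be written back into the metadata. The following is a minimal sketch rather than the production script; it reuses the regular expression from Figure 3 and appends a new <paper> element after each title line (the element name is illustrative).

perl -ne 'print;
          if (/<title>(Mt\. Pleasant Pyramid|.+?)(?=[.,:(])/) {
              my $name = $1; $name =~ s/\s+$//;     # trim trailing whitespace, as in Figure 3
              print "<paper>$name</paper>\n";       # hypothetical new field element
          }' desc.all > desc.all.paper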

 
 

3.4 Metadata assessment and manual enhancement

In some cases, record titles didn't contain a paper name or date. These had to be fixed manually. By assessing the desc.all file, the following commands were used to build reports to make the manual editing process easier and more efficient. Reports like the one in Figure 4 were used for directing manual work, where digital library metadata staff were able to get context for missing data by seeing the paper name, date, and title alongside articles that were missing this information. These targeted manual edits were then added back into the desc.all file so all articles across the repository had consistent paper names and dates.

gawk 'BEGIN {FS=OFS="\t"} NR > 1 { print $1, $2 }' desc.all.tsv | egrep -v $'\t'"(Utah Enquirer|Provo Daily Enquirer|Territorial Enquirer)" -C 2

------

5637    Provo Daily Enquirer, 1892-11-16, The World Enriched
5638    Provo Daily Enquirer, 1892-11-16, Mission Notes
5639    The Prizes of Literary Work
5640    Provo Daily Enquirer, 1892-11-16, The India Rubber Worm
5641    Provo Daily Enquirer, 1892-11-16, Too Powerful
--
108185  Provo Daily Enquirer, 1896-02-06, Masthead
108187  Provo Daily Enquirer, 1896-02-06, Age of Consent
108189  A Happy New Year
108190  Provo Daily Enquirer, 1896-02-06, Notice
108191  Provo Daily Enquirer, 1896-02-06, Foreign Gatherings
--
111431  Provo Daily Enquirer, 1896-04-24, Combative Congressmen
111432  Provo Daily Enquirer, 1896-04-24, Utah Public Buildings
111433  Again in Public
111434  Provo Daily Enquirer, 1896-04-24, Utah Men Get Positions
111435  Provo Daily Enquirer, 1896-04-24, Gets a Third Term

Figure 4: A bash command to list bad titles with surrounding record titles for context. The gawk program parses the desc.all.tsv file and prints the first two columns, which are piped into egrep -v to display only the titles that do not begin with Utah Enquirer, Provo Daily Enquirer, or Territorial Enquirer, along with two lines of context before and after each match.

Another method of dealing with manual fixes in a shared work environment was exporting reports and placing them in Google Sheets, along with links for manual editing. The report of bad titles listed below illustrates a piece of that process.

gawk 'BEGIN {FS=OFS="\t"} NR > 1 { print $1, $2 }' desc.all.tsv | egrep -v $'\t'"(Utah Enquirer|Provo Daily Enquirer|Territorial Enquirer)" | sed 's|^|http://udn6.lib.utah.edu:81/cgi-bin/admin/edittxt.exe\?CISOROOT=/de2\&CISOPTR=|g'

------

http://udn6.lib.utah.edu:81/cgi-bin/admin/edittxt.exe?CISOROOT=/de2&CISOPTR=5639    The Prizes of Literary Work
http://udn6.lib.utah.edu:81/cgi-bin/admin/edittxt.exe?CISOROOT=/de2&CISOPTR=108189  A Happy New Year
http://udn6.lib.utah.edu:81/cgi-bin/admin/edittxt.exe?CISOROOT=/de2&CISOPTR=111433  Again in Public

Figure 5: A bash command to generate a list of bad titles with links to the metadata editor in CONTENTdm. This command works similarly to the one in Figure 4; the main difference is that each line is prepended with a URL. Below it is the output, which can be copied into a spreadsheet program.

 
 

3.5 Scripting metadata enhancements

Multiple passes through the desc.all file were needed as part of the clean-up process prior to migration. While some tasks were completed through manual editing, many cleanup processes were automated through a series of scripts. For example, some records had titles with the page number or article title listed in front of the paper name. Variations of the following command were used to handle the cases in which this occurred.

cat desc.all | perl -pe 's|<title>(?!Page)(.+) (Provo Daily Enquirer), ([0-9]{4}-[0-9]{2}-[0-9]{2})</title>|<title>\2, \3, \1</title>|g' > desc.all.fixed

Figure 6: A bash command to change title from [Page# Paper, Date] to [Paper, Date, Page#] and generate a fixed desc.all file. Perl is used to run a regular expression on the title line with pattern groups, then each pattern group is rearranged in the substitution.

 
 

3.6 Fixing title/date inconsistencies

Once all the titles were in the correct format, other metadata fields were looked at more closely. For example, many records had different dates in the Title field versus the Date field. The following command lists date mismatches for each collection. Using this report, the desc.all file could be manually edited and re-uploaded.

find -iname "desc.all" -exec sh -c 'echo {}; egrep "(<title>|<dateor>)" {} | egrep "[0-9]{4}-[0-9]{2}-[0-9]{2}" -o | sed "N;s/\n/\t/g" | gawk "BEGIN{FS=\"\t\"} { if(\$1 != \$2) { print \$0 } }" ' \;

Figure 7: A command to print out inconsistent dates in title and date fields. The command works by running grep on each desc.all file to find the title or date line. Another regular expression finds dates in the format YYYY-MM-DD within those lines. sed converts each title-date pair into a tab-delimited line, then gawk prints only the pairs whose dates differ.

 
 

3.7 Fixing miscellaneous fields

Before migration there were around 1,500 different values for the Type field (Figure 8). These were condensed down to exactly nine. The Type field in UDN does not follow the standard usage someone might expect from the Dublin Core Type Vocabulary. Instead, in the context of this repository, the Type field was drawn from the article level metadata associated with newspapers content and allowed users to perform more advanced searches for types of newspaper content: advertisement, article, birth, death, issue, masthead, page, wedding, and unclassified. Variants in the Type field, with a sample listed below, were condensed for better faceting in the new UDN.

article; local performances; technology
article; local performances; theater
article; local performances; theater; music
article; loca news
article; Logan Leader
article; logging; accidents, injuries
article; logging
article; logging; local businesses
article; logging;local businesses
article; logging; mining; colonization and settlement
article;l technology
article;l theater
...

Figure 8: Examples of bad Type values
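
A scripted pass along the following lines condensed such variants. This is a hedged sketch, assuming that a leading keyword identifies the intended category; values matching none of the nine target types were left for manual review rather than guessed.

# Collapse any Type value that begins with one of the target categories down to just that category.
perl -pe 's{<type>\s*(advertisement|article|birth|death|issue|masthead|page|wedding)\b[^<]*</type>}{<type>$1</type>}i' desc.all > desc.all.types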

 
 

4 Migrating to Solphal

 

4.1 Build desc.all.tsv and determine parent record

Compound objects are records with a parent record that contains the bulk of the metadata and multiple child records. In CONTENTdm, this relationship is not stored in the desc.all file. Instead there is a supp directory stored alongside the images directory which contains a subdirectory for every record, each of which contains a single XML file storing the parent record id. Given UDN's massive size, this led to a large number of inodes (the data structures that describe filesystem objects) being consumed on the ext4 filesystem. This hinders scalability if the filesystem runs out of inodes, since the inode count cannot be adjusted after the filesystem has been created.
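
Inode pressure of this kind is easy to check on a running system; the mount point below is illustrative.

# Report inode usage for the filesystem holding the CONTENTdm collections (path is illustrative).
# An IUse% near 100% means no new files can be created even if free disk space remains.
df -i /data/contentdm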

To keep the compound record structure from CONTENTdm intact, a special parent field which contains the id of the parent record had to be created. Issue records have a parent value of 0, indicating no parent. The script in Appendix A builds a tsv file that contains the parent record value.

 

4.2 File storage architecture

CONTENTdm stores files inside directories named after the collection alias, with all PDFs and thumbnails in a single subdirectory named image. Storing files this way can slow down performance, as reading directories that contain a large number of files is slow. The largest collection in UDN pre-migration, slherald19, contained 529,837 files in a single directory.

In Solphal, files were hashed using the sha1sum program, then renamed to [sha1sum_hash].[file_extension]. Each file is stored in buckets by taking the first two characters of the hash as the first bucket, and the next two characters as the second bucket. For example afde05e975201511a29a511819eed3fc453764ae.pdf would be stored as /dlstorage/udn_files/af/de/afde05e975201511a29a511819eed3fc453764ae.pdf on the server. Thumbnails are stored in a separate directory with the same naming convention. With each bucket having 256 subdirectories, there are a total of 65,536 subdirectories. With around 22 million files, this gives about 330 files per subdirectory and allows for balanced growth without human intervention. Another benefit to this system is that duplicate files aren't stored twice and are easily detectable by faceting the filename field in Solr.
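
A minimal sketch of the bucketing scheme described above, using the storage root from the example path; the input filename is illustrative.

# Compute the bucketed destination for a file from its SHA-1 hash and copy it into place.
f="issue.pdf"                                        # illustrative input file
hash=$(sha1sum "$f" | cut -c1-40)                    # first 40 characters are the hash
dest="/dlstorage/udn_files/${hash:0:2}/${hash:2:2}/${hash}.${f##*.}"
mkdir -p "$(dirname "$dest")" && cp "$f" "$dest"
echo "$dest"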

 

4.3 Converting metadata into Solr format

To reduce the amount of indexing and duplication of data, Solr is used as the primary data store for UDN. Metadata is backed up twice a day, and metadata that is in the process of being ingested is also backed up in a workflow management tool. To get data into Solr, XML ingest docs (Figure 9) were created from the desc.all metadata files using a Python script. Instead of defining a static schema, we use Solr dynamic fields to determine the field type based on the field name's suffix, shown in the table below.

Suffix    Field Type
_t        General text fields that are tokenized by spaces and other characters in the index.
_s        String fields that aren't tokenized at all in the index.
_tdt      Date fields that can be compared and used in ranged searches.
_i        Integer fields.

This lets us add any arbitrary field in a Solr doc without having to define it first in a schema file.

<field name="id">1</field>
<field name="thumb_s">/28/35/283593ad79aebad753f81b700f46bff7179b5d7d.jpg</field>
<field name="file_s"></field>
<field name="parent_i">0</field>
</doc>
<doc>
<field name="paper_t">American Eagle</field>
<field name="title_t">American Eagle, 1897-05-08 Page 1</field>
<field name="creator_t">American Eagle Publishing Co.</field>
<field name="publisher_t">Digitized by: Univ. of Utah</field>
<field name="year_t">1897</field>
<field name="month_t">May</field>
<field name="day_t">08</field>
<field name="date_tdt">1897-05-08T00:00:00Z</field>
<field name="type_t">page</field>
<field name="rights_t">Material in the public domain. No restrictions on use.</field>
<field name="page_t">Page 1</field>
<field name="oldid_t">americaneagle 1</field>  
<field name="id">2</field>
<field name="thumb_s">/28/35/283593ad79aebad753f81b700f46bff7179b5d7d.jpg</field>
<field name="file_s">/6a/59/6a59096aa1b9d06b821625e030bec919a3be43c0.pdf</field>
<field name="parent_i">1</field>
</doc>
<doc>
<field name="paper_t">American Eagle</field>
<field name="title_t">American Eagle, 1897-05-08 Page 2</field>
<field name="creator_t">American Eagle Publishing Co.</field>

Figure 9: An excerpt from a Solr XML ingest file. One complete record spans from <doc> to </doc>.

The paper_t field is created in the script using the regular expression from Figure 3. The field oldid_t was created to handle redirects from old CONTENTdm URLs, ensuring that cached links from the old repository still resolve successfully. It contains the original collection alias and CONTENTdm record number, which are extracted from old CONTENTdm reference URLs by NGINX and translated to Solphal's details page handler.

Once all of the Solr XML docs are created, Solr's DataImportHandler module is used to ingest the data. Indexing UDN's 55 GB of metadata takes around 2.75 hours.
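
Triggering the import from the command line looks roughly like the following. The host, port, and core name are illustrative; full-import, clean, and status are standard DataImportHandler parameters.

# Kick off a full import, then poll its status (host, port, and core name are illustrative).
curl "http://localhost:8983/solr/udn/dataimport?command=full-import&clean=true"
curl "http://localhost:8983/solr/udn/dataimport?command=status"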

 

4.4 Page caching with NGINX

Since newspaper content doesn't change often, UDN utilizes NGINX's fastcgi cache module to speed up page requests (Figure 10). When a URL is visited for the first time, NGINX generates a static HTML version of the page. Subsequent requests to the same URL pull from that static file, avoiding the processing time associated with generating HTML from PHP, performing Solr queries, and making database calls. This reduces page response times from up to 700 milliseconds down to 5-10 milliseconds.

fastcgi_cache_path /var/cache/NGINX levels=1:2 keys_zone=DEFAULT:1000m inactive=5000m;
fastcgi_cache_key "$scheme$request_method$host$request_uri";

server {

	...
	
	set $fastcgi_skipcache 0;
	if ($uri ~ "^/login") {
		set $fastcgi_skipcache 1;
	}

	location ~ \.php$ {
		
		...
		
		fastcgi_cache DEFAULT;
		fastcgi_cache_valid 200 24h;
		fastcgi_cache_bypass $cookie_PHPSESSID $arg_nocache $fastcgi_skipcache;
		fastcgi_no_cache $cookie_PHPSESSID $fastcgi_skipcache;
	}

	add_header X-Cache $upstream_cache_status;
}

Figure 10: Relevant parts of the NGINX config to enable page caching.
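
The X-Cache header added in the configuration above gives a quick way to verify that caching is working: the first request to a URL reports MISS and repeat requests report HIT. The URL below is illustrative.

# Fetch the same page twice and inspect the X-Cache header (URL is illustrative).
curl -s -o /dev/null -D - "https://newspapers.lib.utah.edu/" | grep -i x-cache
curl -s -o /dev/null -D - "https://newspapers.lib.utah.edu/" | grep -i x-cache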

 

5 Performance

Slow performance, scalability, and sustainability concerns were the main factors in choosing to migrate from CONTENTdm to Solphal. In CONTENTdm, a full reindex of the repository took approximately 1,440 hours with the site online and 240 hours with the site offline. In contrast, the equivalent reindex in Solphal takes 2.75 hours for all 55 GB of newspapers metadata, with incremental indexing taking milliseconds. Page load times significantly affected user experience on CONTENTdm, with wait times sometimes reaching up to 10 seconds. In Solphal, average page load times are 150 milliseconds for cached content and 1 second for uncached content. Another user experience issue in the previous version of UDN was the inconsistency of the CONTENTdm PDF viewer for newspapers content across different browsers and operating systems. Solphal uses PDF.js, an open-source library that uses JavaScript to render PDFs for a consistent look and feature set across all browsers. User requests for support in accessing image files were frequent when UDN was on CONTENTdm, but have diminished significantly since moving to the new system. User feedback on Solphal for UDN has been positive, with people commenting on the ease and speed of searching the site for genealogical information. By far, the most common request from users is that more content be added to UDN. Another benefit of moving to a home-grown system is the ability to act directly on user support and feature requests. User feedback is discussed and incorporated into monthly enhancements and updates in Solphal.

The UDN migration had a variety of benefits for the digital library program at the J. Willard Marriott Library beyond providing a better experience for UDN users. Licensing, hardware, and personnel costs decreased. The server footprint previously needed to run CONTENTdm shrank, resulting in significant savings after migrating to Solphal. UDN alone previously required a server with dual processors containing 8 cores and 96 GB of memory; the new system runs on two virtualized servers. The indexing server, which is also used for other digital library collections, uses 10 GB of memory and eight cores, and the front-end web server uses 4 GB of memory and four cores. In addition, the UDN migration served as a test case for migrating the library's digital collections to Solphal; those collections are now available at https://collections.lib.utah.edu.

 

6 Future Directions

With the change in infrastructure for UDN, we are now able to pursue more large newspaper projects to continue making historical newspaper content from Utah freely available online.

For the majority of content that has been ingested into UDN, the newspapers have been segmented at the article level. After conducting several tests and reviewing the user benefits of article level vs. page level newspaper content, it was decided that while article level segmentation was preferred, it wasn't always possible due to its high cost. Page level content still allows users to search the full text of the newspaper; they are simply taken to the page where the search terms appear rather than to an individual article. By choosing to ingest page level content, it is now possible for many of our smaller partners to add more of their newspaper content to UDN even if they cannot fund article segmentation.

One new area that we have started to explore is ingesting born digital newspapers. In a project with the Park City Historical Society, we were able to take twelve years of born digital PDF pages from the Park Record (http://bit.ly/2saJhwL) and successfully ingest them into UDN. With the success of this project, we are investigating how we can obtain more born digital newspaper content to ingest and make available. At the same time, we want to explore ways of harmonizing the blended content of both article level and page level newspapers.

 

7 Conclusion

In retrospect, there are a few things that could have been done differently during migration. In UDN, the title field contained information already found elsewhere in the metadata, such as paper name and date. The article title could have been pulled from the title field into a separate field, and the title field then deleted entirely. This would have saved space in the index and made metadata remediation easier, because metadata would not be duplicated.

It would also have been beneficial to remove extraneous metadata, such as duplicate rights and publisher values at the article level, prior to the migration. After the experience of performing metadata enhancements to compensate for workarounds that relied on software functionality rather than well-structured metadata, prioritizing well-structured metadata is going to be a core value for the digital library team moving forward. In newspaper projects, exploring better ways to model the data for related titles is also a good project to pursue, for example ensuring a consistent user experience for people reading papers that have morning, midday, and evening editions.

By transitioning to an open-source framework, UDN is well positioned to continue to build on previous success in making Utah Digital Newspapers freely available to the public. With Solphal, the newspapers program is no longer encumbered by slow page load times and inefficient system architecture. The transition to virtual machines and cloud-based architecture, as opposed to a dedicated server with a large amount of memory, has also freed up resources that can be better spent elsewhere in service of the digital library program at the J. Willard Marriott Library. Using bash scripting and a lightweight, iterative approach to newspapers metadata remediation and preparation, this large scale migration was completed in a timely fashion. The UDN migration also informed subsequent digital library migration work as both CONTENTdm repositories were migrated to Solphal. Having seen the significant gains in systems architecture, staffing needs, and responsiveness to patron feature requests, our only regret about migrating away from a vendor-based solution is that we didn't pursue it sooner.

 

Bibliography

[1] Arlitsch, K., Yapp, L., & Edge, K. (2003). The Utah Digital Newspapers Project. D-Lib Magazine, 9 (3). https://doi.org/10.1045/march2003-arlitsch
[2] Herbert, J., & Arlitsch, K. (2004). digitalnewspapers.org. The Serials Librarian, 47 (1-2), 99-115. https://doi.org/10.1300/J123v47n01_07
[3] Masood, K., & Neatrour, A. (2014). Digital Asset Management Systems Options: Report of the University of Utah Libraries DAM Review Task Force. Mountain West Digital Library Webinar.
[4] Neatrour, A., Morrow, A., Rockwell, K., & Witkowski, A. (2011). Automating the Production of Map Interfaces for Digital Collections Using Google APIs. D-Lib Magazine, 17 (9/10). https://doi.org/10.1045/september2011-neatrour

Appendix A

Appendix A shows the desc2table.py Python script. It converts a few key fields from desc.all into a tab-delimited file and determines the parent id number for each record.

#!/usr/bin/env python3
import re
import glob
import csv
import sys
import os

# list of fields to use
fields = ['dmrecord', 'title', 'find', 'identi', 'fullrs', 'ark']

# parse index.xml or newsindex.xml in supp directory and grab node value
def get_field(path, field):
    for line in open(path, 'r'):
        match = re.search("<" + field + ">([^>]*)</" + field + ">", line)
        if match:
            return match.group(1)
            
    return "-1"

# check for supp record and return parent dmrecord
def get_parent(dmrecord):
    global cache_path
    for supp_bucket in supp_buckets:
        index_path = cache_path + "supp/" + supp_bucket + dmrecord + "/index.xml"
        if os.path.exists(index_path):
            return get_field(index_path, "parent")

        # check for newsindex.xml
        newsindex_path = cache_path + "supp/" + supp_bucket + dmrecord + "/newsindex.xml"
        if os.path.exists(newsindex_path):

            # Check type
            if get_field(newsindex_path, "itemtype") == "Page":
                return get_field(newsindex_path, "issue")
            elif get_field(newsindex_path, "itemtype") == "Article":
                page = get_field(newsindex_path, "page")
                pageindex_path = cache_path + "supp/" + supp_bucket + page + "/newsindex.xml"
                if os.path.exists(pageindex_path):
                    return get_field(pageindex_path, "issue")
    
    return "-1"

def write_buffer():
    global record_buffer, file_out, line_count
    column = 0

    # get field data
    field_data = {}
    for field in fields:
        match = re.search("<" + field + ">([^>]*)</" + field + ">", record_buffer)
        
        if match:
            data = re.sub('[\r\n\t]', '', match.group(1))
            field_data[field] = data
        else:
            field_data[field] = ""
            
        column += 1

    # check for a dmrecord
    if field_data['dmrecord'] == "":
        record_buffer = ""
        return
        
    # write out line
    column = 0
    for field in fields:
        if column > 0:
            file_out.write("\t")
        file_out.write(field_data[field])
            
        column += 1

    # try to find the parent object
    find_data = field_data['find']
    find_match = re.search("\.cpd$", find_data)
    
    # make sure it's not a cpd record
    has_parent = "-1"
    if not find_match:
        has_parent = get_parent(field_data['dmrecord'])
    file_out.write("\t" + has_parent)
    
    # end line
    file_out.write("\n")
    
    record_buffer = ""

# check arguments
if len(sys.argv) < 2:
    print("Usage: ./desc2table.py col_path [extra_fields]")
    exit(1)

# get filenames
col_path = sys.argv[1]
cache_path = "col" + col_path + "/"
file_xml = cache_path + "desc.all"
file_xml_new = cache_path + "desc.all.new"
if os.path.isfile(file_xml_new):
    file_xml = file_xml_new

# add extra fields
if len(sys.argv) > 2:
    for field in sys.argv[2 :]:
        fields.append(field)

# open files for read/write
file_in = open(file_xml, 'r', encoding='utf-8')
file_out = open(file_xml + '.tsv', 'w', encoding='utf-8')

# write header
column = 0
for field in fields:
    if column > 0:
        file_out.write("\t")
    file_out.write(field)
    column += 1
 

About the Authors

Alan Witkowski graduated from the University of Utah in 2006 with a degree in Computer Science. He has worked in the J. Willard Marriott Library's digital infrastructure department for eight years.

 

Anna Neatrour is the Digital Initiatives Librarian at the University of Utah J. Willard Marriott Library. She previously was a metadata librarian at the Marriott Library and at the Mountain West Digital Library. She received her MLIS from the University of Illinois at Urbana-Champaign.

 

Jeremy Myntti is Head of Digital Library Services at the University of Utah J. Willard Marriott Library. This position allows him the opportunity to work closely with faculty and staff who are digitizing, creating metadata, and digitally preserving library and partner content. He received his MLIS from the University of Alabama.

 

Brian McBride is Head of Digital Infrastructure Development at the University of Utah J. Willard Marriott Library. This position is responsible for managing a team of developers and helping create the tools and systems to support the digital library at the University of Utah. He received his degree in Economics from the University of Utah.