Digitisation of textual data
ESDS Qualidata does not generally preserve non-digital material. However, some
key social science data collections that are acquired on paper are digitised,
using scanning and Optical Character Recognition (OCR) software. Criteria for
deciding what is appropriate for digitisation include:
- the suitability of the material for digitisation (paper condition; colour;
  size; quality; typeface; content; etc.)
- the size of the paper collection to be digitised
- how the collection should be prepared with respect to physical, organisational
  or intellectual considerations
- the extent to which text should be made machine-readable (i.e. whether to apply
  some level of OCR or simply to scan to image files)
The most critical consideration is whether files should be created, stored and
disseminated as simple raw images or whether they should be converted to fully
searchable text. Due to the complex nature of some data collections, which can
include printed paper questionnaires and interview schedules with typed and
hand-written comments, some materials may not be suited to OCR.
Preparing paper for scanning
If digital versions of paper documents are to be OCR'ed, correct preparation is
important, and the material should be handled following accepted conservation
standards that respect the original media. Key stages of preparation include:
- assessment of material and its condition
- selection of material suitable for scanning
- photocopying of papers to improve readability and protect originals (where
  necessary)
- resizing of pages when copying (if non-metric sized)
- removal of items such as staples and paperclips
- collation of original papers using archive quality paper clips and folders post
  scanning
- creation of inventory of materials (if absent)
Photocopying of certain pages or sections before they are scanned has a dual
purpose. First, it can safeguard fragile papers, since scanning, which may
involve several passes, is not done on the paper originals. Second, the time
and expense of adjusting poor quality bulk scans may be avoided if improvements
in tone or brightness are introduced when photocopying. If it is intended that
the paper copies will also be made available as a resource, there is the
opportunity to copy onto high quality conservation-grade paper.
Scanning
The copy (not the original) is then scanned and saved in Tagged Image File
Format (TIFF). The original is retained at this stage for reference purposes
during proofing and processing. Similarly, if any editing is required, perhaps
due to confidentiality issues, the changes are made to the working copies. In
most cases, an approach to editing is recommended that clearly shows that
something has been removed, rather than the seamless removal of text.
Following image scanning comes full capture of text. OCR packages are used to
convert the scanned images to text.
Searchable PDF
The use of Portable Document Format files (PDF) allows the look of the original
papers to be preserved. The usefulness of a PDF version may be further extended
by creating one which is text searchable. This can be a relatively quick
process.
For each document, all the constituent TIFFs are collated and converted to a PDF
using 'Paper Capture' in Adobe Acrobat. This process can be performed in batch
mode running in the background behind other operations, and hence is reasonably
quick and efficient. The degree of searchability is much lower than that of
fully OCR'ed text, but it is secured at a much lower cost of time and labour.
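Before a batch conversion, the constituent TIFFs of each document have to be
gathered in page order. The following is a minimal sketch of that collation
step only, not of the conversion itself. It assumes a hypothetical file naming
convention of the form `<document id>_p<page number>.tif`, which is not
described in this guide; the function name `collate_pages` is likewise
illustrative.

```python
import re
from collections import defaultdict

# Hypothetical naming convention (an assumption, not taken from this guide):
#   <document id>_p<page number>.tif  or  .tiff
PAGE_NAME = re.compile(r"^(?P<doc>.+)_p(?P<page>\d+)\.tiff?$", re.IGNORECASE)


def collate_pages(filenames):
    """Group page images by document and order them by page number.

    Returns a dict mapping document id to an ordered list of file names,
    plus a list of names that did not fit the naming convention.
    """
    grouped = defaultdict(list)
    unmatched = []
    for name in filenames:
        match = PAGE_NAME.match(name)
        if match is None:
            unmatched.append(name)
            continue
        # Store the page number as an integer so that page 10 sorts after page 2.
        grouped[match.group("doc")].append((int(match.group("page")), name))
    batches = {
        doc: [name for _, name in sorted(pages)]
        for doc, pages in grouped.items()
    }
    return batches, unmatched


if __name__ == "__main__":
    files = ["int001_p2.tif", "int001_p10.tif", "int001_p1.tif", "notes.txt"]
    batches, unmatched = collate_pages(files)
    print(batches)
    print(unmatched)
```

Sorting on the numeric page number rather than on the file name avoids the
common problem of page 10 being placed before page 2. Files that do not fit the
convention are reported rather than silently dropped, so they can be checked by
hand.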
There is, however, a drawback to using PDFs. When a document is fully
digitised, it is straightforward to anonymise through simple text editing.
With PDFs, other strategies are necessary to ensure anonymisation. One option
is to edit copies of the documents first and then create the PDFs. In cases
where PDFs have been created from un-anonymised interviews, attempting to
anonymise them afterwards can be problematic. In most cases the location and
nature of any disclosive information need to be considered to determine the
most appropriate and efficient strategy.
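For fully digitised text, anonymisation by text editing can follow the
recommendation made under Scanning that an edit should clearly show that
something has been removed. The sketch below illustrates the idea under stated
assumptions: the list of disclosive terms is supplied by the person doing the
editing, and the function name `redact` and the marker text are hypothetical
choices, not ESDS Qualidata conventions.

```python
import re


def redact(text, terms, marker="[text removed]"):
    """Replace each listed term with a visible marker.

    Returns the edited text and the number of replacements made, so the
    count can be checked against what the editor expected to remove.
    """
    if not terms:
        return text, 0
    # Try longer terms first so that 'Mary Smith' is matched before 'Mary'.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(term) for term in ordered) + r")\b",
        flags=re.IGNORECASE,
    )
    return pattern.subn(marker, text)


if __name__ == "__main__":
    sample = "Mary Smith lived in Colchester. Mary left in 1968."
    edited, count = redact(sample, ["Mary", "Mary Smith", "Colchester"])
    print(edited)
    print(count)
```

A simple term list will not catch indirect identifiers such as a distinctive
job title or event, so this kind of script can support, but not replace, a
careful reading of the text.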
Bookmarking the PDF
An additional strength of working with PDF documents is the ability to bookmark
them. This creates a series of embedded links within a document, which is very
useful for navigating through long documents.
Creating a rich text version
Even with advanced OCR software it is still time-consuming to produce an
accurate text document. Where original materials are of poor print quality,
contain handwriting, tables or drawings, the Rich Text Format (RTF) files
produced through OCR require considerable editing and quality control.
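One way to direct editing effort is to flag tokens in the OCR output that do
not look like ordinary words or numbers, and to review the worst pages first.
The following is a rough heuristic sketch, assuming the OCR output has already
been exported to plain text. The rules and the function names are illustrative
assumptions and are deliberately simple: they only recognise unaccented
English letters.

```python
import re

# A token is treated as plausible if it is a word (letters with optional
# internal apostrophes or hyphens) or a number. ASCII letters only, so
# accented words would be flagged; this is a limitation of the sketch.
WORD = re.compile(r"^[A-Za-z]+(?:['-][A-Za-z]+)*$")
NUMBER = re.compile(r"^\d+(?:[.,]\d+)*$")
SURROUNDING_PUNCTUATION = ".,;:!?\"'()[]"


def suspect_tokens(text):
    """Return the tokens that look like possible OCR errors."""
    suspects = []
    for raw in text.split():
        token = raw.strip(SURROUNDING_PUNCTUATION)
        if not token:
            continue  # the token was punctuation only
        if WORD.match(token) or NUMBER.match(token):
            continue
        suspects.append(raw)
    return suspects


def suspect_ratio(text):
    """Share of whitespace-separated tokens flagged as suspect."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return len(suspect_tokens(text)) / len(tokens)


if __name__ == "__main__":
    page = "The interv1ew took place in 1968, at h0me."
    print(suspect_tokens(page))
    print(suspect_ratio(page))
```

A high ratio suggests a page that may be quicker to retype than to correct. A
low ratio does not guarantee accuracy, since OCR errors that happen to form
real words pass unnoticed.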
Creating an XML marked-up version
This level of enhancement builds on the previous steps of digitisation. Once a
file is in a clean electronic 'good transcript' format, it goes through two
steps. First, even if the file has speaker tags, it still requires a manual
check for multiple speakers or any other variation. Second, pre-processing
scripts are applied to the text document to add Extensible Markup Language
(XML) tags. The output of these scripts is very quick to proof, taking a few
seconds per file, so even with the manual review, marking up files that have
speaker identifications with XML tags is relatively quick.
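A pre-processing script of this kind might look like the sketch below. It is
not the script used by ESDS Qualidata. It assumes that each turn in the
transcript begins with a speaker tag followed by a colon (for example `INT:`),
and it uses a `u` element with a `who` attribute for each turn, in the style of
TEI transcriptions of speech. Both the input convention and the function name
`mark_up_turns` are assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Assumed input convention: a turn starts with a short speaker tag and a colon.
SPEAKER_LINE = re.compile(
    r"^(?P<who>[A-Za-z][A-Za-z0-9 ]{0,20}):\s*(?P<speech>.*)$"
)


def mark_up_turns(transcript):
    """Wrap each turn of speech in a <u who="..."> element.

    Returns the XML body, the sorted list of speaker tags found, and the
    line numbers of text that appeared before any speaker tag.
    """
    body = ET.Element("body")
    current = None
    orphans = []
    for number, line in enumerate(transcript.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue
        match = SPEAKER_LINE.match(line)
        if match:
            current = ET.SubElement(body, "u", who=match.group("who").strip())
            current.text = match.group("speech")
        elif current is not None:
            # A line without a speaker tag continues the current turn.
            current.text = (current.text + " " + line).strip()
        else:
            orphans.append(number)
    speakers = sorted({turn.get("who") for turn in body.iter("u")})
    return body, speakers, orphans


if __name__ == "__main__":
    text = "INT: When did you move here?\nRES: In 1968.\nWe came by train."
    body, speakers, orphans = mark_up_turns(text)
    print(ET.tostring(body, encoding="unicode"))
    print(speakers, orphans)
```

The returned list of speaker tags supports the manual check: an unexpected
entry usually indicates either an additional speaker or an ordinary sentence
containing a colon that the pattern has mistaken for a speaker tag.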
Currently, ESDS Qualidata uses a very limited set of XML elements. Essentially,
the mark-up consists of distinguishing turns of speech. Tags are also used to
identify basic demographic characteristics of an interviewee (pseudonym,
gender, year of birth, residence, occupation, and so forth). The Text Encoding
Initiative (TEI) guidelines are used for this. In addition, an XML schema,
QuDEx, is being developed that enables the representation of coding, researcher
annotations and other linkages between segments of data.
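As an illustration of how an interviewee's basic characteristics could be
recorded, the sketch below builds a TEI-style `person` element. The element
names (`persName`, `sex`, `birth`, `residence`, `occupation`) exist in TEI, but
their selection and arrangement here are assumptions and do not represent the
actual ESDS Qualidata template. The function name and the sample values are
hypothetical.

```python
import xml.etree.ElementTree as ET


def person_element(pseudonym, gender=None, year_of_birth=None,
                   residence=None, occupation=None):
    """Build a TEI-style <person> element for one interviewee.

    Only the characteristics that are supplied are written out, so
    missing information is left absent rather than recorded as empty.
    """
    person = ET.Element("person")
    ET.SubElement(person, "persName", type="pseudonym").text = pseudonym
    if gender is not None:
        ET.SubElement(person, "sex").text = gender
    if year_of_birth is not None:
        # The year is stored in an attribute, as with TEI dating attributes.
        ET.SubElement(person, "birth", when=str(year_of_birth))
    if residence is not None:
        ET.SubElement(person, "residence").text = residence
    if occupation is not None:
        ET.SubElement(person, "occupation").text = occupation
    return person


if __name__ == "__main__":
    person = person_element("Mrs A", gender="female", year_of_birth=1921,
                            residence="Essex", occupation="weaver")
    print(ET.tostring(person, encoding="unicode"))
```

Recording only a pseudonym and a year of birth, rather than a real name and a
full date, keeps the header consistent with the anonymisation applied to the
transcript itself.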
For further advice or specific queries, contact acquisitions@esds.ac.uk
See also
digitisation of audio materials