ESDS has moved to the UK Data Service. Users registered with ESDS - what you need to know



Digitisation of textual data

ESDS Qualidata does not generally preserve non-digital material. However, some key social science data collections that are acquired on paper are digitised, using scanning and Optical Character Recognition (OCR) software. Criteria for deciding what is appropriate for digitisation include:

  • the suitability of the material for digitisation (paper condition; colour; size; quality; type face; content; etc.)
  • the size of the paper collection to be digitised
  • how the collection should be prepared with respect to physical, organisational or intellectual considerations
  • the extent to which text should be made machine-readable (i.e. choose some level of OCR or simply scan as image files)

The most critical consideration is whether files should be created, stored and disseminated as simple raw images or whether they should be converted to fully searchable text. Due to the complex nature of some data collections, which can include printed paper questionnaires and interview schedules with typed and hand-written comments, some materials may not be suited to OCR.

Preparing paper for scanning

If digital versions of paper documents are to be OCR'ed, correct preparation is important. The material should be handled according to accepted conservation standards that respect the original media. Key stages of preparation include:

  • assessment of material and its condition
  • selection of material suitable for scanning
  • photocopying of papers to improve readability and protect originals (where necessary)
  • resizing of pages when copying (if non-metric sized)
  • removal of items such as staples and paperclips
  • collation of original papers using archive quality paper clips and folders after scanning
  • creation of inventory of materials (if absent)
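
The inventory stage lends itself to a simple script. The following is a minimal sketch in Python, not a description of ESDS Qualidata's own procedure: the field names (item_id, description, pages, condition, suitable_for_ocr) and the sample records are hypothetical, chosen only to show how the assessment and selection stages might be recorded alongside each item.

```python
import csv
import io

# Hypothetical inventory fields; the guidance does not specify a layout.
FIELDS = ["item_id", "description", "pages", "condition", "suitable_for_ocr"]

def write_inventory(items, stream):
    """Write one CSV row per paper item, ordered by item_id."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    for item in sorted(items, key=lambda i: i["item_id"]):
        writer.writerow(item)

# Illustrative records only.
items = [
    {"item_id": "int002", "description": "Interview schedule, handwritten notes",
     "pages": 12, "condition": "fragile", "suitable_for_ocr": "no"},
    {"item_id": "int001", "description": "Typed interview transcript",
     "pages": 30, "condition": "good", "suitable_for_ocr": "yes"},
]

buffer = io.StringIO()
write_inventory(items, buffer)
inventory_csv = buffer.getvalue()
```

Writing to a text stream keeps the sketch independent of any file location; in practice the stream could be a file opened with newline="" as the csv module recommends.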

Photocopying certain pages or sections before they are scanned has a dual purpose. First, it can safeguard fragile papers, since scanning, which may involve several passes, is not done on the paper originals. Second, the time and expense of adjusting poor quality bulk scans may be avoided if improvements in tone or brightness are introduced when photocopying. If it is intended that the paper copies will also be made available as a resource, there is the opportunity to copy onto high quality conservation-grade paper.

Scanning

The copy (not the original) is then scanned and saved in Tagged Image File Format (TIFF). The original is retained at this stage for reference purposes during proofing and processing. Similarly, if any editing is required, perhaps due to confidentiality issues, the changes are made to the working copies. In most cases, an approach to editing is recommended that clearly shows something has been removed, rather than the seamless removal of text.

Following image scanning comes full capture of text. OCR packages are used to convert the scanned images to text.

Searchable PDF

The use of Portable Document Format files (PDF) allows the look of the original papers to be preserved. The usefulness of a PDF version may be further extended by creating one which is text searchable. This can be a relatively quick process.

For each document, all the constituent TIFFs are collated and converted to a PDF using 'Paper Capture' in Adobe Acrobat. This process can be performed in batch mode running in the background behind other operations, and hence is reasonably quick and efficient. Of course, the degree of searchability is much lower than fully OCR'ed text but it is secured at a much lower cost of time and labour.
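
Before conversion, the page images for each document have to be gathered in the right order. The sketch below illustrates that collation step only; it does not call Adobe Acrobat. It assumes a hypothetical file naming convention of the form docid_pN.tif, which is not taken from the guidance above.

```python
import re
from collections import defaultdict

# Assumed naming convention: <document id>_p<page number>.tif (or .tiff)
PAGE_PATTERN = re.compile(r"^(?P<doc>.+)_p(?P<page>\d+)\.tiff?$", re.IGNORECASE)

def collate_pages(filenames):
    """Group page images by document and order them by page number."""
    pages = defaultdict(list)
    for name in filenames:
        match = PAGE_PATTERN.match(name)
        if match is None:
            continue  # not a page image under the assumed convention
        pages[match.group("doc")].append((int(match.group("page")), name))
    # Sort numerically so that page 2 comes before page 10.
    return {doc: [name for _, name in sorted(entries)]
            for doc, entries in pages.items()}

scans = ["int001_p10.tif", "int001_p2.tif", "int002_p1.tiff", "notes.txt"]
batches = collate_pages(scans)
```

Each ordered list in batches corresponds to one document and could then be handed to whatever tool performs the PDF conversion.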

Nevertheless there is a drawback in using PDFs. When a document is fully digitised it is straightforward to anonymise through simple text editing. However, with PDFs other strategies are necessary to ensure anonymisation. One option is to edit copies of documents, then create the PDFs. However, in cases where PDFs are created from un-anonymised interviews, attempting to anonymise them afterwards can be problematic. In most cases the location and nature of any disclosure information need to be considered to determine the most appropriate and efficient strategy.
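
For fully digitised text, anonymisation through text editing might look like the following minimal sketch. In line with the recommendation that edits should clearly show that something has been removed, each disclosive term is replaced with a visible bracketed marker. The terms, the markers and the function name anonymise are hypothetical; deciding what counts as disclosive remains a manual judgement.

```python
import re

def anonymise(text, replacements):
    """Replace each disclosive term with a visible bracketed marker."""
    for term, label in replacements.items():
        # Word boundaries avoid matching the term inside a longer word.
        pattern = re.compile(r"\b" + re.escape(term) + r"\b")
        text = pattern.sub("[" + label + "]", text)
    return text

# Hypothetical disclosive terms and the markers that replace them.
replacements = {"Mary Smith": "name removed", "Colchester": "town removed"}

original = "Mary Smith moved to Colchester in 1962."
edited = anonymise(original, replacements)
```

A sketch like this only catches exact matches, so misspellings, nicknames and indirect identifiers would still need to be found by reading the text.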

Bookmarking the PDF

An additional strength of working with PDF documents is the ability to bookmark them. This creates a series of embedded links within a document which can, of course, be very useful for navigating through long documents.

Creating a rich text version

Even with advanced OCR software it is still time-consuming to produce an accurate text document. Where original materials are of poor print quality, contain handwriting, tables or drawings, the Rich Text Format (RTF) files produced through OCR require considerable editing and quality control.

Creating an XML marked-up version

This level of enhancement builds on the previous steps of digitisation. Once a file is in a clean electronic 'good transcript' format, it goes through one of two procedures. If the file has speaker tags, then it still requires a manual check for multiple speakers or any other variation. Secondly, pre-processing scripts can be applied to the text document to add Extensible Markup Language (XML) tags. However, these scripts are very quick to proof (a few seconds per file), so even with the manual review, processing files that have speaker identifications with XML tags is relatively quick.
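
A pre-processing script of this kind could be sketched as follows. This is an illustration under stated assumptions, not the script ESDS Qualidata uses: it assumes each turn begins with a speaker tag followed by a colon (for example INT: or RESP:), uses a hypothetical root element named transcript, and wraps each turn in a u element with a who attribute, following the TEI convention for utterances.

```python
import re
import xml.etree.ElementTree as ET

# Assumed transcript convention: each turn starts with "TAG:" at line start.
TURN_PATTERN = re.compile(r"^(?P<speaker>[A-Za-z0-9]+):\s*(?P<speech>.*)$")

def mark_up_turns(transcript):
    """Wrap each turn of speech in a <u who="..."> element."""
    root = ET.Element("transcript")  # hypothetical root element
    current = None
    for line in transcript.splitlines():
        line = line.strip()
        if not line:
            continue
        match = TURN_PATTERN.match(line)
        if match:
            current = ET.SubElement(root, "u", who=match.group("speaker"))
            current.text = match.group("speech")
        elif current is not None:
            # Continuation line: belongs to the previous speaker's turn.
            current.text += " " + line
        # Untagged lines before the first turn are skipped in this sketch.
    return root

sample = """INT: When did you start work there?
RESP: In 1958, straight after school.
It was the only factory in town.
INT: And how long did you stay?"""

root = mark_up_turns(sample)
xml_text = ET.tostring(root, encoding="unicode")
```

A line of speech that happens to begin with a single word and a colon would be mistaken for a new speaker, which is one reason the manual check for speaker variations is still needed.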

Currently, ESDS Qualidata uses a very limited set of XML elements. Essentially, the mark-up consists of distinguishing turns of speech. Tags are also used to identify basic demographic characteristics of an interviewee (pseudonym, gender, year of birth, residence, occupation, and so forth). The Text Encoding Initiative (TEI) is used for this. In addition, an XML schema, QuDEx, is being developed that enables the representation of coding, researcher annotations and other linkages between segments of data.
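
The demographic tagging could be sketched along these lines. The guidance does not list the exact elements used, so the mapping below is an assumption: it uses the TEI elements person, persName, sex, birth, residence and occupation as plausible counterparts to the characteristics named above, and all values are invented for illustration.

```python
import xml.etree.ElementTree as ET

def describe_interviewee(pseudonym, sex, year_of_birth, residence, occupation):
    """Build a TEI-style <person> element for one interviewee."""
    person = ET.Element("person")
    name = ET.SubElement(person, "persName", type="pseudonym")
    name.text = pseudonym
    ET.SubElement(person, "sex").text = sex
    # The year is carried in the 'when' attribute rather than as text.
    ET.SubElement(person, "birth", when=str(year_of_birth))
    ET.SubElement(person, "residence").text = residence
    ET.SubElement(person, "occupation").text = occupation
    return person

# Invented example values.
person = describe_interviewee("Margaret", "female", 1936, "Essex", "machinist")
person_xml = ET.tostring(person, encoding="unicode")
```

Recording the pseudonym rather than the real name keeps the description consistent with the anonymised transcript it accompanies.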

For further advice or specific queries, contact acquisitions@esds.ac.uk

See also digitisation of audio materials

ESDS is now part of the UK Data Service.

These ESDS web pages will remain during the transition, but may not be up to date.

