
Data enhancement procedures

During this year, ESDS Qualidata has successfully piloted procedures for enhancing qualitative data and has brought several collections through (or partway through) these new procedures. In the context of our efforts to archive qualitative data, "enhancement" has two meanings that are worth distinguishing. The first form of enhancement is electronic: we digitise collections, that is, convert them from paper to some form of electronic record. Many of our most valuable collections (e.g., classic sociology studies) exist only on paper and require this type of enhancement to become web-enabled. Currently, we are digitising work to three different levels: searchable PDF, digitised for download, and XML-tagged for online access. The second form, contextual enhancement (described in detail below), involves augmenting a data collection with additional content to make the collection more useful to potential researchers.


Electronic enhancement


Searchable PDF

Creating a searchable PDF is a relatively quick process. First, materials are copied onto conservation-grade paper, and then they are scanned. Once we have the TIFF files from this scanning, we use Paper Capture in Adobe to create searchable PDFs. This process can be done in batch mode running in the background behind other operations, and hence is reasonably quick and efficient. Of course, the degree of "searchability" is much lower than with complete digitisation. Adobe indicates a maximum accuracy of 70% and even this assumes high quality originals. We haven't tested for precise levels because the time cost of preparing materials this way is so low that even if the search accuracy is limited, it is still better than no enhancement at all.

The other main issue with producing PDFs is anonymisation. When an interview is fully digitised (scanned and put through OCR, see below), it is straightforward to anonymise because the entire text is editable. With PDFs, other strategies are necessary: a PDF is not digitised text, so the words cannot be edited, and automatic search is not reliable enough for anonymisation purposes. One option is to apply anonymisation to copies of the original interviews and then create the PDFs. However, where we create PDFs from un-anonymised interviews, anonymising the PDF afterwards can be problematic. The simplest case is when the only disclosing information (often the interviewee's name) is located somewhere on the page where it can be easily cropped, such as in a header. The location and nature of the disclosing information determine the most appropriate and efficient strategy.
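For fully digitised text, the editing step can be as simple as a search and replace over the transcript. The following is only a minimal sketch of that idea, not a description of the procedure we actually use: the function name `anonymise`, the mapping of names to pseudonyms, and the sample sentence are all hypothetical.

```python
import re

def anonymise(text, replacements):
    """Replace each disclosing name with its pseudonym.

    `replacements` maps a real name to the pseudonym that should stand in
    for it. Matching is on whole words and ignores case.
    """
    # Handle the longest names first so that "Mary Smith" is replaced
    # before the shorter "Mary" gets a chance to match inside it.
    for name in sorted(replacements, key=len, reverse=True):
        pattern = re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
        pseudonym = replacements[name]
        # A function is used as the replacement so that any backslashes in
        # the pseudonym are inserted literally.
        text = pattern.sub(lambda match, p=pseudonym: p, text)
    return text

sample = "Mary Smith said that Mary's sister lived in Leith."
print(anonymise(sample, {"Mary Smith": "Interviewee A", "Mary": "A"}))
```

A replacement of this kind only catches the names it is given, so a manual read-through for indirect identifiers (places, employers, relatives) would still be needed.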


Digitised for web download

This level of enhancement is what is most typically associated with the term "digitisation". In this process, materials are first scanned and then put through OCR (optical character recognition). As the OCR is never perfect, a great deal of labour is often still required to proof and format these files into readable text. Although some OCR software is trainable, training often requires quantities of material greater than a typical qualitative data set provides. We are still experimenting with whether we get efficiency gains from OCR training. Once materials are fully edited, we produce both Word and RTF versions, which are available for download through the UK Data Archive Download Service, via the ESDS authenticated web-access system.


XML mark-up for online access

This level of enhancement builds on the previous steps of digitisation. Once a file is in a clean electronic "good transcript" format, it goes through one of two procedures, depending on whether it has speaker identifications. If it does, it still requires a manual check for multiple speakers: we apply Perl scripts to add XML tags, but the identifications of all speakers in the interview must be available to the script before it can be applied. However, these scripts are very quick (a few seconds per file), so even with the manual review, adding XML tags to files that have speaker identifications is relatively fast.
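To illustrate the kind of tagging involved, here is a minimal sketch written in Python rather than Perl. It is not our production script. It assumes a transcript in which each turn starts with a speaker label followed by a colon, and it borrows the TEI convention of a `u` (utterance) element with a `who` attribute; the root element name `interview`, the function name `tag_turns`, and the sample labels are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

def tag_turns(transcript, speakers):
    """Wrap each labelled turn of speech in a <u who="..."> element."""
    # The label pattern is built only from the speakers supplied, mirroring
    # the requirement that all speaker identifications are known up front.
    labels = "|".join(re.escape(s) for s in speakers)
    label = re.compile(r"^(" + labels + r"):\s*(.*)$")
    root = ET.Element("interview")
    current = None
    for line in transcript.splitlines():
        line = line.strip()
        if not line:
            continue
        match = label.match(line)
        if match:
            # A known label starts a new turn.
            current = ET.SubElement(root, "u", who=match.group(1))
            current.text = match.group(2)
        elif current is not None:
            # Any other line is treated as a continuation of the current turn.
            current.text = (current.text + " " + line).strip()
        else:
            raise ValueError("text before first speaker label: " + line)
    return root

sample = """INT: Where were you born?
RES: In Dundee,
in 1901."""
root = tag_turns(sample, ["INT", "RES"])
print(ET.tostring(root, encoding="unicode"))
```

Note that a line beginning with a label that was not supplied is silently folded into the previous turn, which is one reason the manual check for additional speakers has to come first.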

Our remaining problem case is files without speaker identifications (and sadly, we have hundreds of them). For these files, we have to apply a partially automated Excel macro that uses obvious cues (such as whether a sentence ends in "?" or ".") to distinguish interviewer and interviewee utterances. Again, the macros speed up this test, but extensive manual review is still required to produce XML-tagged text when the originals lacked speaker identifications. Estimates vary greatly, but it takes approximately 10-12 hours to take one 75-page interview of medium quality through the entire process from scanning to XML mark-up.
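The punctuation cue can be sketched as follows. This is a Python illustration of the general idea, not a translation of our Excel macro: the function name `guess_speakers` and the word-count thresholds used to flag doubtful cases are assumptions chosen purely for the example.

```python
def guess_speakers(paragraphs):
    """Guess a role for each paragraph and flag doubtful cases for review.

    Returns a list of (role, needs_review, text) tuples.
    """
    results = []
    for paragraph in paragraphs:
        text = paragraph.strip()
        if not text:
            continue
        # Cue: a paragraph ending in "?" is assumed to be the interviewer.
        is_question = text.endswith("?")
        role = "interviewer" if is_question else "interviewee"
        # Hypothetical thresholds: a very long question or a very short
        # statement is atypical, so it is marked for manual review.
        words = len(text.split())
        needs_review = (is_question and words > 40) or (not is_question and words < 4)
        results.append((role, needs_review, text))
    return results

sample = [
    "Where did you work?",
    "I worked in the jute mill for about ten years.",
    "Yes.",
]
for role, needs_review, text in guess_speakers(sample):
    flag = " (review)" if needs_review else ""
    print(role + flag + ": " + text)
```

A cue this simple misattributes rhetorical questions from the interviewee and prompts from the interviewer that are phrased as statements, which is why the manual review remains the larger part of the work.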

Currently, we are using a very limited set of XML elements. Essentially, the mark-up consists of distinguishing turns of speech. We are also using tags to identify basic demographic characteristics of an interviewee (pseudonym, gender, year of birth, residence, occupation, and so on). We are continuing to develop a DTD, based on integrating the DDI and TEI standards, that will enable much more complex mark-up, such as thematic codes, researcher annotations, and geo-spatial references.
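As a rough illustration of the demographic tags, the sketch below builds a small block of interviewee details. The element names (`interviewee`, `pseudonym`, `gender`, `yearOfBirth`, `residence`, `occupation`) and the sample values are hypothetical placeholders; they are not taken from our DTD, which is still being developed.

```python
import xml.etree.ElementTree as ET

def interviewee_header(details):
    """Build an <interviewee> element with one child element per characteristic."""
    person = ET.Element("interviewee")
    for field, value in details.items():
        # Each key becomes an element name, so keys must be valid XML names.
        ET.SubElement(person, field).text = str(value)
    return person

header = interviewee_header({
    "pseudonym": "Mrs A",
    "gender": "female",
    "yearOfBirth": 1898,
    "residence": "Aberdeen",
    "occupation": "fish worker",
})
print(ET.tostring(header, encoding="unicode"))
```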


Contextual enhancement

Once all this digital enhancement is complete and we are preparing to add a collection to the UK Data Archive catalogue, we also identify selected collections for contextual enhancement. Typically, the contextual materials are assembled into a user guide and made available for download via the catalogue with the data. The extent of this enhancement varies greatly, as it depends both on the nature of the collection (the complexity of the methodology, for example) and on what materials we have available from the depositors of the research. In general, however, we are focusing on materials that reveal both the context and the process of the original research. Content that enriches context or explains in detail how the original research was actually done is extremely valuable to researchers embarking on secondary analysis. Several examples of contextual enhancement are provided below.

Because it was used as a pilot collection, the additional materials for the Edwardians collection are the most comprehensive. They include:

  • Original funding application
  • Interview schedule
  • Methodology description (including notes to interviewers)
  • Classification schemes (geographic and thematic)
  • Researcher biography
  • In-depth interviews with depositors
  • Press and book reviews

For the most recently released study, Blaxter's Mothers and Daughters, customised materials were prepared. ESDS Qualidata holds a collection of interviews, conducted by Paul Thompson, with many of the sociologists whose data collections we are processing. Extracts of his interview with Prof. Blaxter cover topics such as how she funded the study, how she created the study design, and how she analysed large quantities of qualitative data. Extracts from her book that describe methods and sampling are also included. Finally, a brief Scots dialect glossary has been prepared to help researchers translate terminology that appears in these interviews, all of which were conducted with Scottish grandmothers.

For the Marsden and Jackson collection, Education and the Working Class, we have exemplars of the researchers' analytical notes, such as a memo revealing their thoughts on how sibling relations factored into the study. We also have several examples of correspondence between the two authors revealing their thoughts and plans as the research progressed.


