XML application for qualitative data

This development work was started in 2001. The following provides an introduction to the earlier work.

Aims and objectives
Why is a uniform format required?
Why an XML format?
Advantages of using standards
Development of an XML application for qualitative data
Making use of available XML applications
Applying the TEI and DDI in an application for qualitative data
The Text Encoding Initiative (TEI)
TEI guidelines and DTDs for encoding texts
Edwardians Phase I: applying the TEI guidelines to transcriptions of speech
Edwardians Phase II
The Data Documentation Initiative (DDI)
An XML schema for qualitative data


Aims and objectives

The main aim of this development work is to produce a comprehensive application in an Extensible Markup Language (XML) format appropriate for interchange that will enable sophisticated online searching and information retrieval from encoded texts, and which is potentially applicable to other qualitative datasets. Ideally, the application should meet a number of specific objectives, namely:

  • support the encoding of the content of various types of documents produced in qualitative research, including: interview transcriptions; research diaries; survey questionnaires; case notes; transcriptions of focus groups; and contextual documentation such as newspaper clippings, letters of correspondence, and researcher's notes
  • support the encoding of the researcher's original analysis of the datasets and any annotations that may have been added to the primary materials
  • provide formalised links between the texts and associated audio and video materials, with a view to providing, in the long term, integrated, multimedia resources
  • represent metadata at the individual file, or interview level, and for the entire collection

Why is a uniform format required?

A uniform format for encoding the content of datasets is useful for both data providers and users because it:

  • ensures consistency across datasets
  • supports the development of common publishing and search tools
  • facilitates data interchange and comparison between datasets

Why an XML format?

The internationally defined standard for data exchange, XML, is a potentially useful technology for making the features of qualitative data explicit in machine-readable form.

XML is, in proper terms, a 'meta-language' that formally comprises the rules for defining a mark-up system. The standard allows for the specification of mark-up languages that make structural elements explicit in a document using a system of ordinary textual tags that are embedded in the text in an ordered hierarchy or 'tree' system. In other words, XML permits descriptive mark-up systems in which nested pairs of mark-up codes are used simply to provide names to categorise or classify parts of a document. The power of XML lies in the fact that it is extensible. That is, it allows for the creation of descriptive mark-up systems based upon a common vocabulary, but this vocabulary can be extended to accommodate the special requirements of a particular user or domain.

Various programs designed for different processing purposes can be applied to a descriptively marked-up document. Furthermore, processing can be restricted to certain designated sections of the marked-up document, according to the individual requirements of the user, so the same document can be re-used for different purposes in different ways.
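As a minimal sketch of these two ideas, the Python fragment below parses a small descriptively marked-up interview, prints its element hierarchy, and then restricts processing to one designated part of the document. The element and attribute names (interview, turn, speaker) are invented for illustration and are not taken from any ESDS Qualidata DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical mark-up: all tag and attribute names here are illustrative only.
DOCUMENT = """
<interview id="int001">
  <metadata>
    <title>Example interview</title>
  </metadata>
  <body>
    <turn speaker="interviewer">Where did you grow up?</turn>
    <turn speaker="respondent">In a small village near the coast.</turn>
    <turn speaker="interviewer">Did you go to school there?</turn>
    <turn speaker="respondent">Yes, until I was twelve.</turn>
  </body>
</interview>
""".strip()

root = ET.fromstring(DOCUMENT)


def outline(element, depth=0):
    """Return the ordered hierarchy ('tree') as a list of indented tag names."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


def respondent_turns(tree_root):
    """Restrict processing to the respondent's turns inside <body>."""
    path = "./body/turn[@speaker='respondent']"
    return [turn.text for turn in tree_root.findall(path)]


if __name__ == "__main__":
    print("\n".join(outline(root)))
    for text in respondent_turns(root):
        print(text)
```

A different program could read the same file and look only at the metadata element, which is the sense in which one marked-up document can be re-used for different purposes.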


Advantages of using standards

Using a standard such as XML has a number of advantages for data developers, data providers and users in general. Firstly, there is a wide range of non-proprietary tools and related languages for manipulating and processing text from XML documents, and new tools continue to appear regularly. Among the more established tools for processing XML are style sheets; the standard being developed for XML is the Extensible Stylesheet Language (XSL). Style sheets are popular among publishers because they support the notion of 'writing once and publishing everywhere'. It is therefore relatively straightforward to provide different audiences with different 'views' of an XML-encoded text. Secondly, other standards, frameworks, protocols and applications make use of generic standards such as XML. By adopting XML, ESDS Qualidata therefore joins a wider community of users and increases opportunities for compatibility and interchange.
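The idea of several 'views' of one encoded text can be sketched as follows. In practice this would normally be done with XSL style sheets; the fragment below is only a Python analogue that uses the standard library, and the tag names and the two view functions are hypothetical.

```python
import xml.etree.ElementTree as ET
from html import escape

# One hypothetical source document; tag names are illustrative only.
SOURCE = """<interview id="int001">
  <title>Example interview</title>
  <turn speaker="interviewer">Where did you grow up?</turn>
  <turn speaker="respondent">In a small village near the coast.</turn>
</interview>"""

root = ET.fromstring(SOURCE)


def plain_text_view(doc):
    """Render the document as a plain-text transcript."""
    lines = [doc.findtext("title", default="")]
    for turn in doc.iter("turn"):
        lines.append(f"{turn.get('speaker')}: {turn.text}")
    return "\n".join(lines)


def html_view(doc):
    """Render the same document as a simple HTML fragment."""
    title = escape(doc.findtext("title", default=""))
    items = "".join(
        f"<li><b>{escape(turn.get('speaker'))}</b>: {escape(turn.text)}</li>"
        for turn in doc.iter("turn")
    )
    return f"<h1>{title}</h1><ul>{items}</ul>"


if __name__ == "__main__":
    print(plain_text_view(root))
    print(html_view(root))
```

Both functions read the same source, so a correction made once in the XML appears in every view.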


Development of an XML application for qualitative data

XML and related tools for creating and processing XML documents have rapidly been adopted by communities of users for whom semantic tagging for their own application areas is essential. Examples in which XML tag sets have been specially adapted to mark up the types of information specific to a user community include the Data Documentation Initiative (DDI) for the social sciences and the Text Encoding Initiative (TEI).

With increasing recognition of the benefits of XML in creating non-proprietary, cross-platform applications, there has been interest in, and calls for, the development of a qualitative data XML mark-up language from members of the social science research community who are eager to encourage the re-use of data. Indeed, there has been some progress in applications for exporting coded data produced by Computer Assisted Qualitative Data Analysis (CAQDAS) software in an XML format (specifically Atlas.ti). However, further work is required in the definition of a common XML framework and associated Document Type Definitions (DTDs).

The development of an XML application for marking up the content of qualitative datasets ideally requires support and contributions from various members of the social science community: data creators; CAQDAS software developers; data providers and end users.

Areas in which a community effort is of particular importance include:

  • agreement on the types of documents and structures to be marked up
  • formal definition of an XML vocabulary and DTD for describing these structures
  • specification of publishing and analysis tools
  • test applications based upon 'real' datasets

The Edwardians Online pilot, which was undertaken in 2001-03 and formed the basis of what is now ESDS Qualidata Online, aimed to provide the foundations for such an initiative.


Making use of available XML applications

The option of 'going back to the drawing board' to create a new application of XML - specifically for the purpose of marking up transcribed interviews and other types of qualitative material - was initially considered. There are, however, two already existing XML applications, TEI and the DDI, which are particularly relevant to these objectives.

The TEI and DDI are used in a wide range of projects, in the UK, the US and Europe, and an application for qualitative data would benefit from the expertise and experiences of these user communities. Furthermore, both applications have detailed documentation, and making use of these standards would create opportunities for using existing and forthcoming application tools.

A cost-effective and generally advantageous option is to develop an application tailored for qualitative data, but one which is compliant with these models. In Edwardians Online, it made sense to adapt and integrate the TEI and DDI DTDs to form a prototype DTD for qualitative data.


Applying the TEI and DDI in an application for qualitative data

The Text Encoding Initiative (TEI)

The TEI, now an international consortium, provides a Standard Generalized Markup Language (SGML) application and, more recently, a sophisticated and comprehensive XML application, together with guidelines for the mark-up of different types of texts, including prose, drama, dictionaries, verse and transcriptions of speech.

The guidelines, DTDs, a list of international scholarly projects using TEI-conformant mark-up and links to TEI software are available online at the TEI home page.


TEI guidelines and DTDs for encoding texts

The TEI guidelines are based upon the concept of an all-inclusive DTD for encoding all types of text. Users may select elements from the full TEI DTD in a customised application, thus compiling their own TEI compliant DTD.

Different tag sets are provided for different document types. Thus, in any particular application, users may select one of the main subsets of the full DTD, a 'base' tag set, according to the particular type of text they are interested in encoding, for example a drama, a dictionary, verse or a transcribed record of speech.

Of course, a document may contain different types of text, for example, a book may contain verse and drama. To encode this type of work, users would select a 'mixed base', in other words, a combination of the basic tag sets.

TEI applications will also make use of a 'core' tag set, which includes a number of mandatory elements for encoding a TEI document, for example, the TEI header for encoding metadata and elements for describing features common to documents in general.

In addition, a number of specialised elements or 'additional' tag sets can be selected according to the specific content requirements of an application. These include tags for identifying features for specific analytical purposes or tags for marking up specific content features such as person names, place names, dates, monetary amounts, tables or figures. The TEI is therefore useful for encoding datasets with a mixed format, for example, collections comprising text, tabular and graphical data.

The generality of the TEI is thus particularly useful for ESDS Qualidata's purposes since it can accommodate the encoding of the various types of qualitative data as described in the Aims and objectives section.

Moreover, the flexibility of the TEI is such that a TEI-conformant DTD can be easily extended or modified to include other specialised element sets from the TEI, to meet different research requirements. For example, analytical elements for linguistic analysis could be added to the DTD for the mark-up of datasets in which a more detailed grammatical and transcription analysis is required. This might be important to a secondary study interested in, say, conversation analysis. The TEI is therefore especially relevant to ESDS Qualidata's objective of providing qualitative data in a format suitable for re-use.


Edwardians Phase I: applying the TEI guidelines to transcriptions of speech

Interview transcripts are perhaps the most common document type within the class of qualitative data. In the Edwardians Online project, an early stage of ESDS Qualidata Online, the aim was to develop a prototype DTD for qualitative data by working with the example of the Edwardians collection of interview texts. Of particular interest are the TEI guidelines and DTD components for transcribed spoken material. Initial research showed that these can be used to represent the main structural elements in the text, and the content of qualitative interview texts in general, in a transparent and straightforward manner.

Accordingly, basic mark-up of the main interview texts was carried out using a TEI-conformant DTD that was created and documented for the project. The web-based system that was built handled one data collection and focused on systematic searching, retrieval and browsing of textual transcripts.
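To give a flavour of this kind of mark-up, the sketch below encodes a few invented interview turns using the TEI element for an utterance, u, with its who attribute, plus the pause element, and then runs a simple search over them. The fragment is not taken from the Edwardians collection, it is not a complete or validated TEI document (there is no header and no namespace), and the speaker codes and helper functions are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented sample text, loosely following TEI conventions for transcribed speech.
FRAGMENT = """<text>
  <body>
    <u who="INT">What work did your father do?</u>
    <u who="RES">He was a farm labourer.</u>
    <u who="INT">And your mother?</u>
    <u who="RES">She took in washing, <pause/> when she could get it.</u>
  </body>
</text>"""

root = ET.fromstring(FRAGMENT)


def utterances(doc):
    """Yield (speaker, text) for each <u>, with whitespace normalised."""
    for u in doc.iter("u"):
        text = " ".join("".join(u.itertext()).split())
        yield u.get("who"), text


def search(doc, term):
    """Return the utterances whose text contains term, ignoring case."""
    needle = term.lower()
    return [(who, text) for who, text in utterances(doc) if needle in text.lower()]


if __name__ == "__main__":
    print(Counter(who for who, _ in utterances(root)))
    print(search(root, "washing"))
```

Because each utterance carries its speaker explicitly, a search can be limited to what respondents said, or can report who said a matching passage, without any guesswork about the layout of the transcript.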

Edwardians Phase II

Phase II of the project enabled searching across collections of documents and expanding the level of mark-up. This phase aimed to generalise from the exemplars to specify a framework for a full DTD for qualitative data.

Including the full set of TEI elements for encoding features of transcribed speech in this DTD allows for any future encoding of a more sophisticated transcription.


The Data Documentation Initiative (DDI)

The DDI is a framework that aims to "establish an international criterion and methodology for the content, presentation, transport, and preservation of 'metadata' about datasets in the social and behavioral sciences". A useful introduction to the DDI and a list of projects implementing this framework are available at the DDI home page.

With the DDI's XML-based DTD for the mark-up of social science metadata or 'codebooks', metadata can now be created in a uniform, highly structured format that is easily and precisely searchable via web-based interfaces.

The DDI has received support across the social science community and is already in place in major European and US social science archives. For example, the Council of European Social Science Data Archives (CESSDA) Data Portal utilises the DDI format. The DDI is also supported by data tools such as Nesstar.


An XML schema for qualitative data

To date, neither the DDI nor the TEI has specifically addressed the problem of describing qualitative material, although it would be advantageous for the social science community if guidelines and an appropriate format were established within this general framework.

ESDS Qualidata has investigated how elements from the DDI, TEI and other metadata schemas may be adapted for a qualitative model, using the example of the Edwardians collection.

Using the TEI, this work also involved research into the integration and mapping of elements from the two DTDs. In this way the prototype application for qualitative data is compliant with the different DTDs, making use of particular elements in both whilst eliminating redundancy or 'overlap'.
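The kind of element mapping described here can be illustrated with a small sketch that copies two descriptive fields from a TEI-style header into a DDI-style codebook structure. The element paths are simplified from the public TEI and DDI Codebook schemas as an assumption for illustration; the mapping table, sample values and function names are hypothetical and do not reproduce the mapping developed in the project.

```python
import xml.etree.ElementTree as ET

# Invented TEI-style header; a real header carries far more information.
TEI_HEADER = """<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>Example interview</title>
      <author>Example Researcher</author>
    </titleStmt>
  </fileDesc>
</teiHeader>"""

# Illustrative mapping only: TEI-style path -> DDI-style path.
MAPPING = {
    "fileDesc/titleStmt/title": "stdyDscr/citation/titlStmt/titl",
    "fileDesc/titleStmt/author": "stdyDscr/citation/rspStmt/AuthEnty",
}


def ensure_path(parent, path):
    """Find or create the chain of child elements named in path; return the last."""
    node = parent
    for tag in path.split("/"):
        child = node.find(tag)
        if child is None:
            child = ET.SubElement(node, tag)
        node = child
    return node


def tei_header_to_codebook(tei_header):
    """Copy each mapped field once, so shared ancestors are not duplicated."""
    codebook = ET.Element("codeBook")
    for tei_path, ddi_path in MAPPING.items():
        source = tei_header.find(tei_path)
        if source is not None and source.text:
            ensure_path(codebook, ddi_path).text = source.text.strip()
    return codebook


codebook = tei_header_to_codebook(ET.fromstring(TEI_HEADER))

if __name__ == "__main__":
    print(ET.tostring(codebook, encoding="unicode"))
```

Keeping the correspondence in a single table means each piece of information is recorded once and derived elsewhere, which is one way of avoiding the redundancy or 'overlap' between the two DTDs.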

Phase III is enhancing the ESDS Qualidata Online system and working to specify and test a data exchange standard and model. A dedicated project funded by ESRC from 2005 to 2006, Smart Qualitative Data: Methods and Community Tools for Data Mark-Up (SQUAD), helped to research and test a draft DTD and also explored the use of natural language processing tools to semi-automate mark-up. A more recent JISC-funded project, Data Exchange Tools (DExT), is developing, refining and testing an XML schema for data exchange, together with tools for data import and export.


