September 17, 2009, 8:44 AM — by Neal Fishman -- Changing a data structure can be one of the simplest activities to physically perform. However, when considering the number of services and other programs that act upon a data structure, along with the physical amount of data that an organization persists (potentially, terabytes and petabytes), this can make any type of change to a data structure one of the most complex activities IT has to orchestrate.
To help illustrate this point, Figure 1 shows a sample definition for a relational table. Each name, length, and data type is essentially hard-coded. When a new business requirement warrants a change to the concept, length, or data type, workarounds are often sought -- which may increase the potential for introducing viral data; a condition associated with some aspect of poor data quality in a shared service.
Because lengths and data types generally represent hard-coded information, developing a technique to add flexibility can be essential to prevent viral data. Generally speaking, each discrete value of structured business information fits onto a single line -- whether that line appears on a screen or a printable report. For example, each of the columns in Figure 1 readily fit onto a single line.
Using the premise that when new lengths for business information are required, the business information can still fit onto a single line as the basis for an axiom, the following heuristic can be developed. A landscaped legal-size piece of paper can generally accommodate up to 260 characters of 8-point font (depending on the actual font and combination of letters and numbers used). Therefore, as a generalized starting point, all character-based columns may be defaulted to 254 characters or VARCHAR(254).
The length of 254 would accommodate most changes in length to a business field. As a default, if the length of 254 seems somewhat too long for all alphanumeric-based fields, a shorter length can be substituted. The value 254 represents a length that can accommodate the vast majority of structured concepts requiring an alphanumeric data type and is suggested as means to artificially create an unbound length.
When a concept requires a longer length, adjustments should be made. Figure 2 shows an example where default lengths are set to 2000 bytes, and certain columns, such as Character, represent denormalized concepts (in this case, a comma-separated list of character roles in a movie). In most cases, however, a business concept can safely use a shorter length than 254. Regardless, the chosen length should equal or exceed any length that might ever be required for a field.
The value of 254 represents a reasonable maximum length for a single field printed on a single line of landscaped U.S. legal-size paper. The rationale for adopting a larger length than a business concept appears to warrant is based on the maxim that changing the length of a column is simple in principle, but is usually extremely difficult and costly in practice.
Changing the length of a column can also have reverberations on fine-grained services that are often associated with CRUD (create, retrieve, update, and destroy) operations. For some organizations, the effort to change a single VARCHAR2(18) column to VARCHAR2(20) can be insurmountable. The expense and effort associated with the year 2000 rewrites and replacements attests to the complexities involved with field length changes.
OSAPI, pronounced oh-sap-ee, is an acronym that represents a heuristic intended to assist a data administrator, data modeler, or a data designer in the applicability of using an artificial field length. OSAPI is an acronym for
Ownership, stability, abstraction, performance, and interfaced are the five characteristics that form the artificial field heuristic. Each characteristic is associated with a simple yes or no question. Each yes or no answer is then associated with a numeric value that, when summed up, provides rudimentary guidance to model a data type length.
The characteristic ownership explores an organization's control over all aspects of an attribute or column. Stability inquires into the mutability of the column's domain over time. Abstraction explores the degree of modeled generalization. Performance analyzes latency issues and interfaced is used to look at whether the column is shared.
The OSAPI questions are phrased as follows. The use of the term concept covers elements, attributes, columns, and fields:
Ownership - Is the concept owned, controlled, and governed by our organization?
Stability - Has the domain set for the concept been stable over time (past, present, and the anticipated future)?
Abstraction - Is the concept represented in an abstract form (not concrete -- i.e., last name is a concrete concept, and a party name that accommodates both people and organizations would be an abstract concept)?
Performance - Does this concept have distinct performance requirements?
Interfaced - Is this concept shared beyond this system or application?
Each question should be answered yes or no for each concept in a table. Each yes or no answer is awarded a point value (see Figure 3). By summing the points awarded for each concept, the total can be evaluated against the gauge shown in Figure 4.
The closer an OSAPI score is to zero, the stronger the recommendation is to make use of an artificial length. For example, all automobiles are identified through a vehicle identification number (VIN). The 17-digit VIN uniquely identifies a particular vehicle and also contains information that describes the vehicle itself.
VINs are a global concept used by automobile manufacturers, insurance companies, government agencies, law enforcement agencies, vehicle distributors, banks, identification and credentialing verification service providers, and others.
The VIN structure was designated by the Society of Automotive Engineers (SAE) and is currently controlled by the International Standards Organization (ISO). Therefore, a VIN is not owned by any particular community that has business operations to create or maintain the number. In addressing the OSAPI ownership question for a VIN, the answer would be no and would be scored as a zero.
Historically, the VIN structure has never been a stable concept. Prior to 1980, different manufacturers used different formats, and in October 1980, the SAE published a standard to guarantee unique numbers for 30 years. Soon after the year 2000, the SAE was obliged to readdress the standard because duplicate VIN numbers would be unacceptable to many organizations, especially insurance companies and law enforcement agencies.
In 1980, the VIN standard was organized around a 17-byte composite field. While exploring options to revise the standard, the SAE discovered that extending the length was not an option -- not even by a single character. The change in length was insurmountable. A substantial number of organizations would be confronted with too many systems to change.
The resolution adopted by the SAE involved adjusting some of the embedded characteristics of the 17-byte composite field, which extended the VIN life span by another 30 years. In the lingua franca of IT, the SAE simply applied a temporary patch. Ultimately, the VIN problem remains unresolved and cannot be viewed as a stable concept. The OSAPI stability question for VIN, therefore, would be answered in the negative and would therefore be scored as a zero.
Germane to abstraction, if a VIN is modeled and instantiated as a vehicle identification number, the concept is concrete. A VIN is a real-world business concept, and an element called VIN_id would simply reflect a real-world view. In addressing the abstraction question, the answer would be no and would also be scored as a zero.
Performance requirements for using a VIN would in all likelihood depend on whether the VIN was going to be used as a single field or decomposed into its constituent or atomic parts. Even when processing the VIN in its atomic parts, additional latencies or overhead are unlikely should the VIN be cast with an artificial length that exceeds 17 bytes. Additional performance considerations can be taken into account if the column was included as part of an index. Therefore, for the performance question for VIN, the answer would be no and would also score a zero.
The final OSAPI question addresses whether a concept is actively shared between services. As a major identifier, the VIN is likely to be shared by many services and applications. Consequently, addressing the interfaced OSAPI question, the answer would be yes and would score as zero. In the example of VIN, all five OSAPI questions are scored as zero. Summing each score obviously results in the grand total of zero. Grading the final score against the scoring gauge denotes a recommendation that an artificial length is used for implementation.
The gauge values range from zero to five. The gauge values are associated with a proposition. Zero and one are associated with a rating of recommended. Two and three are associated with a rating of suggested, and four or five are associated with a rating of indifferent.
A gauge reading of recommended means that a default length value should exceed the maximum length used or currently known. A gauge reading of indifferent indicates that fixing the data type length to the current maximum is acceptable, whereas a gauge reading of suggested implies that using a larger length is prudent.
Should a service require knowledge of a concept's actual length, a meta-driven framework consisting of a persistent data store, data, and an access method can be instantiated.
This tip is adapted from the book, Viral Data in SOA: An Enterprise Pandemic, authored by Neal Fishman, published by IBM Press, July 2009, ISBN 0137001800; a complete sample chapter is available on the publisher site, www.ibmpressbooks.com/title/0137001800, including a direct-access link for Safari Books Online subscribers