Gartner’s 6 Rs model proposed that each application can take a different path to the cloud and suggested six possible migration strategies. Can the same principle be applied to data migration? Gilad David Maayan, CEO and founder of Agile SEO, shares how to evaluate each of your datasets or databases and how to migrate it to the cloud – by rehosting, replatforming, refactoring, repurchasing, retiring, or retaining.
Data migration is the activity of transferring data between data formats, computer systems, or data storage systems.
A data migration project is carried out for different purposes, including:
- Moving data to a third-party cloud provider
- Upgrading or replacing servers or storage equipment
- Website consolidation
- Database or application migration
- Infrastructure maintenance
- Company merger
- Datacenter relocation
- Software upgrades
In this article, I’ll take a fresh look at data migration, identify key types and approaches to data migration, and discuss how Gartner’s 6 Rs model, which has become popular for analyzing workload migration to the cloud, can be applied to data migration as well. For example, I’ll discuss what it means to rehost, replatform, or repurchase on-premise data while moving it to the cloud.
4 Types of Data
1. Structured data
Structured data, also known as quantitative data, is well organized and readily understood by machines. Structured query language (SQL) is the standard language for working with structured data. Through a relational SQL database, business users can search, input, and manipulate structured data quickly and efficiently.
The following are examples of structured data: names, addresses, dates, and credit card numbers. The advantages of this type of data include:
- Readily used by automated systems and machine learning (ML) algorithms—the organized nature of structured data makes querying and manipulating data automatically easier.
- Readily used by business users—structured data does not need an in-depth appreciation of various types of data and the way they function. Basic knowledge of the topic related to the data allows users to interpret and access the data easily.
- Rich ecosystem of tools—structured data has been around for many years, and there are more tools at hand to process and analyze it.
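As a minimal illustration of why structured data is so easy to query, the following sketch uses Python's built-in SQLite driver to store and search a small customer table (the table name and records are made up for the example):

```python
import sqlite3

# An in-memory relational database holding structured customer records
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, city TEXT, signup_date TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [("Alice", "Berlin", "2023-01-15"), ("Bob", "Paris", "2023-02-20")],
)

# Because the schema is fixed, a declarative SQL query answers the
# question directly, with no custom parsing logic
rows = conn.execute(
    "SELECT name FROM customers WHERE city = ?", ("Berlin",)
).fetchall()
print(rows)  # [('Alice',)]
```

The same query pattern works whether the database holds two rows or two billion, which is what makes structured data so amenable to automation.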
2. Unstructured data
Unstructured data features an internal structure, yet it does not have a schema or predefined data model. It can be non-textual or textual, machine-generated or human-generated. It may be stored in a non-relational database such as a NoSQL database or in an unstructured data store like a data lake.
There are many unstructured data formats. Unstructured data is typically stored in its native (“original”) format. It can include video files (MP4, WMV, etc.), audio files (WAV, MP3, OGG, etc.), images (JPEG, PNG, etc.), PDF or Microsoft Word documents, social media posts, emails, and sensor data.
The advantages of unstructured data include:
- Native format—unstructured data that is stored in its native format stays undefined until it is required. This makes more data readily available for analysis by analysts and data scientists using a data lake model.
- Faster ingestion—as there is no requirement to predefine the data, it may be collected easily and quickly.
- Reduced costs—unstructured data can be stored using cloud-based object storage, which has a low cost, is easily scalable, and is billed per actual usage.
3. Semi-structured data
Semi-structured data lacks a rigid schema and cannot be easily organized in relational databases. However, it does have certain loose organizational frameworks or structured properties. A text organized into a hierarchy of topics, for example, is semi-structured: apart from the classification into topics, the content itself is free-form and open-ended.
For instance, emails are semi-structured by subject, date, sender, and recipient, and, with the assistance of machine learning, can be automatically assigned to folders such as spam, inbox, or promotions.
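The email example can be sketched with Python's standard library: the headers behave like structured fields, while the body stays free-form (the addresses and subject below are made up):

```python
import email

# A raw email: headers are structured fields, the body is free-form text
raw = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Quarterly report\n"
    "\n"
    "Hi Bob, the numbers look good this quarter."
)
msg = email.message_from_string(raw)

# The structured portion can be queried like fields in a record...
print(msg["Subject"])  # Quarterly report
print(msg["From"])     # alice@example.com

# ...while the payload remains unstructured text
print(msg.get_payload())
```

This mix of queryable metadata and free-form content is the defining trait of semi-structured data.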
4. Sensitive data
Sensitive data is information that is classified and must be protected. Third parties cannot access it unless special permission is granted. The data may be in electronic or physical form. There may be legal, ethical, or business reasons to protect sensitive data and place stricter limits on who can access it, and in which context. This protection is particularly important for intellectual property (IP), which can have very high value, for personally identifiable information (PII) and protected health information (PHI), which are covered by many regulations.
Learn More: Public Cloud Total Cost of Ownership: What You Should Know
Two Approaches to Data Migration
There are two common approaches to data migration: big bang migration and trickle migration.
1. Big bang data migration
In a big bang approach, you transfer all data assets, in a single operation, from source to target environment. This is done in a short window of time.
While the data is transferred and transformed to meet the requirements of the target system, the application is down and cannot be accessed by users. This type of migration is often carried out during off-hours, when customers are unlikely to be using the application.
The big bang scenario lets you carry out the migration in the shortest time possible and avoids having to work across new and old systems simultaneously. Yet, in the age of big data, even smaller companies can accumulate huge volumes of data, while the throughput of networks and API gateways is limited. This restriction should be considered from the outset.
When is it suitable? The big bang approach suits businesses or companies dealing with small volumes of data. It isn’t feasible for mission-critical applications that have to be available 24/7.
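Conceptually, a big bang migration is a single read-everything, write-everything pass performed inside the downtime window. A minimal sketch, with both systems modeled as SQLite databases for illustration:

```python
import sqlite3

# Hypothetical source and target systems, modeled as two SQLite databases
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, total REAL)")
source.executemany("INSERT INTO orders VALUES (?, ?)",
                   [(1, 9.99), (2, 25.0), (3, 7.5)])
target.execute("CREATE TABLE orders (id INTEGER, total REAL)")

# Big bang: during the downtime window, read all data from the source
# and write all of it to the target in one operation
rows = source.execute("SELECT id, total FROM orders").fetchall()
target.executemany("INSERT INTO orders VALUES (?, ?)", rows)
target.commit()

count = target.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(count)  # 3
```

The simplicity is the appeal: one pass, one cutover. The catch is that the downtime window grows with the data volume and is bounded by network and gateway throughput.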
2. Trickle data migration
Also called iterative or phased migration, this strategy brings the agile approach to data transfer. The entire process is broken down into sub-migrations, each having its own timelines, goals, quality checks, and scope.
Trickle migration requires the parallel running of new and old systems and transferring data in discrete amounts. This provides the advantage of zero downtime and 24/7 application availability.
However, the iterative approach takes more time and increases the complexity of the project. Your migration team must track which data has already been migrated so that users can seamlessly access it on either system.
Another aspect of trickle migration is that the old application remains fully operational until migration ends. Customers can continue using the old system and switch to the new application once all the data has been successfully transferred to the target environment.
Yet, this approach is more difficult for engineers. They need to keep data synchronized in real time across the two platforms as it is created or modified: every change in the source system should trigger an update in the target system.
When is it suitable? Trickle migration is a solid choice for medium to large businesses that can’t incur long downtime but have the expertise needed to meet the technical challenge.
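A trickle migration can be sketched as a checkpointed sync loop: each sub-migration copies only the rows modified since the previous pass, so the old system stays live throughout. The schema and column names below are invented for the example:

```python
import sqlite3

# Hypothetical source with a last-modified column; the target is synced in phases
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
for db in (source, target):
    db.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT, modified INTEGER)")
source.executemany("INSERT INTO docs VALUES (?, ?, ?)",
                   [(1, "a", 100), (2, "b", 150), (3, "c", 200)])

def trickle_sync(last_seen):
    """Copy only rows modified since the previous sub-migration."""
    rows = source.execute(
        "SELECT id, body, modified FROM docs WHERE modified > ? ORDER BY modified",
        (last_seen,),
    ).fetchall()
    target.executemany("INSERT OR REPLACE INTO docs VALUES (?, ?, ?)", rows)
    target.commit()
    return max((r[2] for r in rows), default=last_seen)

checkpoint = trickle_sync(0)           # first phase migrates all three rows
source.execute("UPDATE docs SET body = 'b2', modified = 300 WHERE id = 2")
checkpoint = trickle_sync(checkpoint)  # next phase picks up only the change
print(target.execute("SELECT body FROM docs WHERE id = 2").fetchone()[0])  # b2
```

Real systems typically replace the timestamp column with change-data-capture from the database log, but the checkpoint-and-catch-up shape is the same.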
Learn More: 4 Reasons Why Data Virtualization Might Not Solve Your Migration Problem
Gartner’s 6 Rs
A decade ago, Gartner defined the “5 Rs”, five ways to migrate applications to the cloud. Stephen Orban of AWS added a sixth R. Together, they are: rehosting, replatforming, repurchasing, refactoring, retiring, and retaining.
Let’s see how each of these approaches can be applied to data migration.
Rehosting, also known as “lift and shift”, involves moving resources as-is to the cloud. Here is how rehosting can be applied to each type of data:
- Structured data—the simplest way to rehost structured data is to move a database, as is, to the cloud, either in the form of a VM or an export-import migration to an identical, cloud-based database.
- Unstructured data—data lakes are compatible with the rehosting approach because they let you dump any type of unstructured data directly to cloud storage in its original format.
- Sensitive data—rehosting is usually not appropriate for sensitive data because the cloud is usually assumed to be less secure than on-premises. Typically, sensitive data will be masked or otherwise treated before being transferred to the cloud.
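The export-import flavor of rehosting moves data as-is, with no schema changes. A minimal sketch, using SQLite on both sides and CSV as a stand-in for a `pg_dump`-style export file (the table and records are invented):

```python
import csv
import io
import sqlite3

# On-premises database (modeled here with SQLite)
onprem = sqlite3.connect(":memory:")
onprem.execute("CREATE TABLE users (id INTEGER, email TEXT)")
onprem.executemany("INSERT INTO users VALUES (?, ?)",
                   [(1, "a@x.com"), (2, "b@x.com")])

# Export as-is to a portable dump (CSV stands in for a database dump file)
buf = io.StringIO()
writer = csv.writer(buf)
for row in onprem.execute("SELECT id, email FROM users"):
    writer.writerow(row)

# Import unchanged into an identical schema on the cloud-hosted database
cloud = sqlite3.connect(":memory:")
cloud.execute("CREATE TABLE users (id INTEGER, email TEXT)")
buf.seek(0)
cloud.executemany("INSERT INTO users VALUES (?, ?)", csv.reader(buf))
cloud.commit()
print(cloud.execute("SELECT COUNT(*) FROM users").fetchone()[0])  # 2
```

Nothing about the data changes in transit, which is exactly what "lift and shift" means: the schema, values, and semantics on the cloud side are identical to the on-premises original.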
Replatform refers to taking a resource, modifying it to make it more suitable to the cloud, and then migrating it. Here is how to apply it to each type of data:
- Structured data—structured data can be optimized to take advantage of cloud-based data processing services. For example, SQL tables can be moved from a local database to a cloud-based data warehouse like Amazon Redshift to enable massive-scale distributed processing.
- Unstructured data—unstructured data often includes data or metadata that is less useful. It is a good idea to remove this data and copy a clean version of the unstructured data to the cloud, to conserve storage costs. Unlike on-premises storage, cloud-based data lakes have an ongoing per-GB storage cost.
- Sensitive data—sensitive data can be masked or filtered to ensure that only a subset of the data that is suitable for a cloud environment is copied over.
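The masking step for sensitive data can be as simple as replacing the sensitive field with a one-way hash before the record leaves the on-premises environment. A sketch with invented customer records:

```python
import hashlib

# Hypothetical customer records containing a sensitive field (email)
records = [
    {"id": 1, "email": "alice@example.com", "plan": "pro"},
    {"id": 2, "email": "bob@example.com", "plan": "free"},
]

def mask(record):
    """Replace the sensitive field with a one-way hash before cloud transfer."""
    cleaned = dict(record)
    cleaned["email"] = hashlib.sha256(record["email"].encode()).hexdigest()[:12]
    return cleaned

# Only the masked copy is shipped to the cloud environment
cloud_copy = [mask(r) for r in records]
print(cloud_copy[0]["email"] != records[0]["email"])  # True
```

The hashed value still works as a join key for analytics in the cloud, but the original email never leaves the premises. Production masking would normally use a salted or keyed hash to resist dictionary attacks.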
Repurchasing involves buying a cloud-based service that replaces the existing on-premise resource. There are many opportunities for repurchasing data on the cloud.
For example, organizations can leverage cloud-based resources or APIs to enrich customer data, obtain data about new prospects for sales or marketing efforts, gain access to creative assets like images or videos, or get large, standardized datasets to train machine learning models. Cloud-based datasets may represent a compelling alternative to existing on-premise assets.
Refactoring involves completely rebuilding a resource to make more effective use of a cloud-native environment. Here is how to apply refactoring to each type of data:
- Structured data—when migrating to the cloud, many organizations change their data format from traditional SQL tables to NoSQL key-value pairs. This lets them leverage massively scalable, high-performance databases like Cassandra, MongoDB, or Elasticsearch.
- Unstructured data—all cloud providers offer value-added services that can be used to prepare, process, and extract insight from unstructured data. By applying these services while ingesting the data, organizations can derive much value from their existing assets. For example, instead of just copying video assets to the cloud, an organization can use AI services or video APIs to extract tags, concepts, transcripts, and other data from video content.
- Sensitive data—sensitive data can be restructured to make it easier to manage and secure in the cloud. For example, a large table containing both sensitive and non-sensitive data can be broken up into separate tables, or even completely separate databases, to enable stronger control over access and authorization.
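The SQL-to-NoSQL refactoring mentioned above boils down to turning column-positional rows into self-describing key-value documents. A sketch with invented columns and rows:

```python
# Hypothetical relational rows (column tuples) refactored into key-value
# documents of the kind stored in NoSQL databases such as MongoDB
columns = ("id", "name", "city")
rows = [(1, "Alice", "Berlin"), (2, "Bob", "Paris")]

# Refactor: each row becomes a self-describing document keyed by its id,
# so the schema travels with the data instead of living in the table definition
documents = {row[0]: dict(zip(columns, row)) for row in rows}
print(documents[1]["city"])  # Berlin
```

Because each document carries its own field names, different documents can later gain or drop fields independently, which is the flexibility that makes the NoSQL model scale across cloud nodes.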
Retiring involves simply letting go of a resource and decommissioning it as part of cloud migration.
Retaining involves deciding to leave the resource on-premises in its original form. Here is how to apply the retire/retain strategy to each type of data:
- Structured data—a common use case for retaining data sets on-premises is online transaction processing (OLTP). Transactional databases require very high performance and low latency and can find it difficult to function in a cloud environment. If an OLTP system is not expected to scale rapidly, it is often better to retain it on-premises.
- Unstructured data—data that is frequently accessed on-premises, such as email databases or commonly used business documents, is often retained on-premises to ensure fast access by employees.
- Sensitive data—this is the most common strategy employed for sensitive data in cloud migration, retaining it on-premises to avoid the security risks of a cloud environment.