Are you dealing with duplicate data?
Do your records fail to line up under exact matching?
Are the duplicates in your data too inconsistent for an exact match to catch?
Are you struggling to cleanse different types of duplicate data?
If you answered yes to most or all of these questions, the solution to your problem is fuzzy matching. Fuzzy matching lets you deal with these problems easily and efficiently.
What is Data Matching?
Data Matching is the process of discovering records that refer to the same entity. When records come from multiple data sets and do not share a common key identifier, we can use data matching techniques to link them; the same techniques also detect duplicate records within a single dataset.
We perform the following steps:
Standardize the dataset
Pick unique and standard attributes
Break dataset into similar sized blocks
Match and assign weights to the matches
Add it all up to get a TOTAL weight
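The steps above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the attribute names (name, city, phone), the integer weights, and the first-letter blocking key are all hypothetical and are not taken from any particular tool.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

class RecordMatchSketch {
    // Hypothetical attribute weights; real weights depend on how reliable each attribute is.
    static final Map<String, Integer> WEIGHTS = new LinkedHashMap<>();
    static {
        WEIGHTS.put("name", 5);
        WEIGHTS.put("city", 2);
        WEIGHTS.put("phone", 3);
    }

    // Standardize: trim, lowercase, and collapse repeated whitespace.
    static String standardize(String value) {
        return value == null ? "" : value.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
    }

    // Blocking: only records sharing this key are compared (assumed key: first letter of the name).
    static String blockKey(Map<String, String> record) {
        String name = standardize(record.get("name"));
        return name.isEmpty() ? "" : name.substring(0, 1);
    }

    // Match and weigh: add the weight of every attribute that agrees after standardization.
    static int totalWeight(Map<String, String> a, Map<String, String> b) {
        int total = 0;
        for (Map.Entry<String, Integer> w : WEIGHTS.entrySet()) {
            String left = standardize(a.get(w.getKey()));
            String right = standardize(b.get(w.getKey()));
            if (!left.isEmpty() && left.equals(right)) {
                total += w.getValue();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, String> a = Map.of("name", " John  Smith", "city", "Delhi", "phone", "12345");
        Map<String, String> b = Map.of("name", "john smith", "city", "Noida", "phone", "12345");
        if (blockKey(a).equals(blockKey(b))) {
            System.out.println("TOTAL weight = " + totalWeight(a, b));
        }
    }
}
```

A pair whose TOTAL weight exceeds a chosen threshold would then be treated as a likely duplicate; the threshold itself is something you would tune on your own data.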
What is Fuzzy matching?
Fuzzy matching allows you to identify non-exact matches in your dataset. It is the foundation of many search engine frameworks, and it helps you get relevant search results even if your query has a typo or uses a different verb tense.
Many algorithms can be used for fuzzy searching on text, but virtually all search engine frameworks (including bleve) primarily use the Levenshtein Distance for fuzzy string matching:
Levenshtein Distance: Also known as Edit Distance, it is the number of transformations (deletions, insertions, or substitutions) required to transform a source string into the target one. For example, if the target term is “book” and the source is “back”, you will need to change the first “o” to “a” and the second “o” to “c”, which will give us a Levenshtein Distance of 2.
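The definition above can be sketched as a small dynamic-programming routine. This is a minimal illustration, not the code used by any search framework; the class and method names are made up for the example.

```java
class Levenshtein {
    // Returns the number of single-character insertions, deletions,
    // or substitutions needed to turn source into target.
    static int distance(String source, String target) {
        int[] prev = new int[target.length() + 1];
        int[] curr = new int[target.length() + 1];

        // Turning an empty source into the first j characters of target takes j insertions.
        for (int j = 0; j <= target.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= source.length(); i++) {
            curr[0] = i; // deleting the first i characters of source
            for (int j = 1; j <= target.length(); j++) {
                int cost = source.charAt(i - 1) == target.charAt(j - 1) ? 0 : 1;
                int deletion = prev[j] + 1;
                int insertion = curr[j - 1] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(deletion, insertion), substitution);
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[target.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("back", "book")); // prints 2
    }
}
```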
Additionally, some frameworks also support the Damerau-Levenshtein distance:
Damerau-Levenshtein distance: An extension of the Levenshtein Distance that allows one extra operation, the transposition of two adjacent characters.
Ex: TSAR to STAR
Damerau-Levenshtein distance = 1 (switching the S and T positions costs only one operation)
Levenshtein distance = 2 (replace S by T and T by S)
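The TSAR/STAR example can be checked with a short sketch. Note the assumption: this implements the restricted variant often called optimal string alignment, which adds the adjacent-transposition step to the usual Levenshtein table but never edits the same substring twice. It can differ from the unrestricted Damerau-Levenshtein distance on some inputs, and the class name is made up for the example.

```java
class DamerauLevenshtein {
    // Optimal string alignment distance: Levenshtein plus adjacent transpositions.
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];

        for (int i = 0; i <= a.length(); i++) {
            d[i][0] = i; // i deletions
        }
        for (int j = 0; j <= b.length(); j++) {
            d[0][j] = j; // j insertions
        }

        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(
                        Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + cost);

                // Extra operation: swap two adjacent characters for a cost of 1.
                if (i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1);
                }
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("TSAR", "STAR")); // prints 1
    }
}
```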
How to Use Fuzzy Matching in TALEND?
Step 1: Create an Excel “Sample Data” with 2 columns “Demo Event 1” and “Demo Event 2”.
Demo Event 1: This column contains the records on which we need to apply Fuzzy Logic.
Demo Event 2: This column contains the records that need to be compared with the Column 1 for Fuzzy match.
Step 2: In TALEND, use the above Excel file as input in a tFileInputExcel component, and provide the same file again as input to a second tFileInputExcel component, as shown in the diagram.
Step 3: In the tFuzzyMatch component, choose the configuration shown in the diagram below.
Step 4: In the tMap, select the following columns for the output.
Demo_Events_1
MATCHING
VALUE
Step 5: Finally, select a tFileOutputExcel component for the desired output.
In the final extracted file, the “VALUE” column shows the distance between each record and its match, and the “MATCHING” column shows the duplicate it was matched to.
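To make the idea behind the MATCHING and VALUE output columns concrete, here is a plain-Java sketch of a fuzzy lookup between two lists. It is not how the Talend component is implemented. It assumes a Levenshtein comparison, a case-insensitive match, and a hypothetical maximum distance, and the sample event names are invented.

```java
import java.util.List;
import java.util.Locale;

class FuzzyLookupSketch {
    // One output row: the input record, its closest reference record, and their distance.
    static final class Match {
        final String input;
        final String matching;
        final int value;

        Match(String input, String matching, int value) {
            this.input = input;
            this.matching = matching;
            this.value = value;
        }
    }

    static int levenshtein(String s, String t) {
        int[] prev = new int[t.length() + 1];
        int[] curr = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= s.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(prev[j] + 1, curr[j - 1] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[t.length()];
    }

    // Returns the closest reference value within maxDistance, or null if none qualifies.
    static Match closest(String input, List<String> reference, int maxDistance) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String candidate : reference) {
            int d = levenshtein(input.toLowerCase(Locale.ROOT), candidate.toLowerCase(Locale.ROOT));
            if (d < bestDistance) {
                best = candidate;
                bestDistance = d;
            }
        }
        return bestDistance <= maxDistance ? new Match(input, best, bestDistance) : null;
    }

    public static void main(String[] args) {
        List<String> demoEvent1 = List.of("Salesforce Sumit", "Webinar");       // records to check
        List<String> demoEvent2 = List.of("Salesforce Summit", "Dreamforce");   // records to compare against
        for (String record : demoEvent1) {
            Match m = closest(record, demoEvent2, 3);
            System.out.println(record + " | MATCHING=" + (m == null ? "" : m.matching)
                    + " | VALUE=" + (m == null ? "" : String.valueOf(m.value)));
        }
    }
}
```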
Conclusion:
In a nutshell, TALEND’s fuzzy matching helps ensure the quality of source data against a reference data source by identifying and removing duplicates created by inconsistent data. The technique is also useful for complex data matching and duplicate analysis.
About Girikon
Girikon is a reputed provider of high-quality IT services including but not limited to Salesforce consulting, Salesforce implementation and Salesforce support.
Recently I created a panelization calculator, written in Perl and packaged as an .exe, to calculate the panels required to manufacture a printed circuit board. I did it for one of our customers. The output screen for this calculator looks like below:
I wanted to integrate the same panelization calculator into one of our projects, which is written in Java.
So I wrote the following code to call the .exe from Java.
package panelcalculator;

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;

public class PanelCalculator {

    // Runs the given command line and returns everything it prints to standard output.
    public String calculate(String strParms) {
        StringBuilder result = new StringBuilder();
        try {
            Runtime runtime = Runtime.getRuntime();
            Process process = runtime.exec(strParms);
            InputStream inStream = process.getInputStream();
            BufferedReader reader = new BufferedReader(new InputStreamReader(inStream));
            String data;
            // Join the output lines with single spaces.
            while ((data = reader.readLine()) != null) {
                result.append(data).append(" ");
            }
            reader.close();
            inStream.close();
            process.destroy();
        } catch (Exception e) {
            // On failure, return the exception text instead of the calculator output.
            return e.toString().trim();
        }
        return result.toString().trim();
    }

    public static void main(String[] args) {
        PanelCalculator pc = new PanelCalculator();
        String result = pc.calculate("D:\\MyWebapp\\PanelizationCalculator\\PanelCalculator.exe PanelX=12 PanelY=18 NLayers=2 Spec=\"IPC-6012 Class 2\" BoardX=12 BoardY=2.25 Spacing=0.2 LaserDrill=N Flex=N Impedance=N ImpSingleEnd=0 ImpDiff=0 EdgePlating=N");
        if (result != null) {
            System.out.println(result);
        }
    }
}
And finally I got the following output, exactly as I wanted:
UsableX=10.2 UsableY=14.6 MaxUp=4
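Since the result comes back as one string of key=value pairs, the calling project will probably want the individual numbers. Here is a minimal sketch of one way to split them out. The class name PanelOutputParser is hypothetical, and it assumes the values themselves never contain spaces, as in the output above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class PanelOutputParser {
    // Splits "Key=Value Key=Value ..." into an ordered map; tokens without '=' are ignored.
    static Map<String, String> parse(String output) {
        Map<String, String> values = new LinkedHashMap<>();
        for (String token : output.trim().split("\\s+")) {
            int eq = token.indexOf('=');
            if (eq > 0) {
                values.put(token.substring(0, eq), token.substring(eq + 1));
            }
        }
        return values;
    }

    public static void main(String[] args) {
        Map<String, String> values = parse("UsableX=10.2 UsableY=14.6 MaxUp=4");
        double usableX = Double.parseDouble(values.get("UsableX"));
        int maxUp = Integer.parseInt(values.get("MaxUp"));
        System.out.println("UsableX=" + usableX + ", MaxUp=" + maxUp);
    }
}
```

If the calculator fails, the calculate method returns the exception text instead, so the map will not contain the expected keys; checking for them before parsing the numbers would be a sensible guard.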
Thank you.
Narendra
If you ask, ‘What is Windows Azure?’ the answer is:
Windows Azure is an open and flexible cloud platform for building, deploying, and managing applications and workloads hosted on a global network of Microsoft-managed data centers. It is a foundation for running applications and storing data in the cloud.
To support cloud applications and data, Windows Azure has five components:
Compute: Runs applications in the cloud. These applications largely see a Windows Server environment, although the Windows Azure programming model isn’t exactly the same as the on-premises Windows Server model.
Storage: Stores binary and structured data in the cloud.
Fabric Controller: Deploys, manages, and monitors applications. The fabric controller also handles updates to system software throughout the platform.
Content Delivery Network (CDN): Speeds up global access to binary data in Windows Azure storage by maintaining cached copies of that data around the world.
Connect: Allows creating IP-level connections between on-premises computers and Windows Azure applications.
Windows Azure is based on three cloud service models: SaaS, PaaS and IaaS.
Software as a Service: Software as a Service (SaaS) vendors build custom applications that provide solutions tailored to specific needs, developing services that are hosted in the cloud and consumed by end users.
Platform as a Service: Platform as a Service (PaaS) vendors provide an end-to-end cloud computing platform with capabilities for application design, development, testing, deployment and hosting.
Infrastructure as a Service: Infrastructure as a Service (IaaS) vendors provide virtualized computing and storage resources in the cloud as a service.
In the current version of Windows Azure, developers can choose from three kinds of roles:
Web Role: The web role is just like a normal web server. It runs IIS7 and allows you to host up to five HTTP/S ports. You can host several web applications with the same role using host headers.
Worker Role: Worker roles are designed to run a variety of Windows-based code. The biggest difference between a Web role and a Worker role is that Worker roles don’t have IIS configured inside them, so the code they run isn’t hosted by IIS. Worker roles are commonly used for back-end processes and for hosting many web services.
VM Role: The VM Role differs from the web and worker roles. The VM role is any server image that you create and upload, and that you can further customize as per your needs. It must run Windows Server 2008 R2.
Why is Windows Azure important?
The Azure Services Platform (Azure) is an internet-scale cloud services platform hosted in Microsoft data centers, which provides an operating system and a set of developer services that can be used individually or together.
Azure’s flexible and interoperable platform can be used to build new applications to run from the cloud or enhance existing applications with cloud-based capabilities.
Its open architecture gives developers the choice to build web applications, applications running on connected devices, PCs, servers, or hybrid solutions offering the best of online and on-premises.
Azure simplifies maintaining and operating applications by providing on-demand compute and storage to host, scale, and manage web and connected applications.
Infrastructure management is automated with a platform that is designed for high availability and dynamic scaling to match usage needs with the option of a pay-as-you-go pricing model.
How to use Windows Azure?
Five steps to create a Windows Azure application:
Installation of Windows Azure SDK
Developing First Windows Azure Web Application
Deploying application locally in Development Storage Fabric
Registration for free Windows Azure Trial
Deployment of the Application in Microsoft Data Center
Hurray!!
Ashutosh