Are you dealing with duplicate data?
Does your data not lend itself to an exact match?
Are the duplicates in your data too inconsistent for an exact match?
Are you struggling to cleanse different types of data duplicates?
If you have answered yes to most or all of the questions above, then the solution to your problem is Fuzzy Matching. Fuzzy matching lets you deal with these problems easily and efficiently.
What is Data Matching?
Data Matching is the process of identifying records that refer to the same entity. When records come from multiple data sets that share no common key identifier, or when duplicates exist within a single dataset, data matching techniques help detect them.
We perform the following steps (a simple sketch of these steps in code follows the list):
Standardize the dataset
Pick unique and standard attributes
Break the dataset into similar-sized blocks
Match records and assign weights to the matches
Add it all up to get a TOTAL weight
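As a rough illustration of these steps, here is a minimal Java sketch. The record fields, blocking key and weights are made-up assumptions for the example, not Talend code: it standardizes two records, blocks on a simple key, compares attributes, assigns a weight to each matching attribute and sums them into a total weight.

```java
public class SimpleRecordMatcher {
    // Hypothetical record with illustrative attributes only (requires Java 16+ for records).
    record Person(String name, String city, String zip) {}

    // Step 1: standardize (trim, lower-case) so comparisons are consistent.
    static Person standardize(Person p) {
        return new Person(p.name().trim().toLowerCase(),
                          p.city().trim().toLowerCase(),
                          p.zip().trim());
    }

    // Step 3: a simple blocking key (first letter of the name plus a zip prefix).
    static String blockKey(Person p) {
        String zipPrefix = p.zip().substring(0, Math.min(3, p.zip().length()));
        return p.name().isEmpty() ? zipPrefix : p.name().charAt(0) + zipPrefix;
    }

    // Steps 4-5: compare attributes, assign weights to the matches and add them up.
    static double totalWeight(Person a, Person b) {
        double w = 0.0;                        // assumed weights, for illustration only
        if (a.name().equals(b.name())) w += 0.5;
        if (a.city().equals(b.city())) w += 0.3;
        if (a.zip().equals(b.zip()))   w += 0.2;
        return w;
    }

    public static void main(String[] args) {
        Person a = standardize(new Person(" John Smith ", "Phoenix", "85001"));
        Person b = standardize(new Person("john smith",   "phoenix", "85001"));
        // Only compare records that fall into the same block.
        if (blockKey(a).equals(blockKey(b))) {
            System.out.println("TOTAL weight = " + totalWeight(a, b)); // 1.0 here
        }
    }
}
```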
What is Fuzzy matching?
Fuzzy matching allows you to identify non-exact matches in your dataset. It is the foundation of many search engine frameworks and helps you get relevant search results even if your query has a typo or uses a different verb tense.
There are many algorithms that can be used for fuzzy searching on text, but virtually all search engine frameworks (including bleve) primarily use the Levenshtein Distance for fuzzy string matching:
Levenshtein Distance: Also known as Edit Distance, it is the number of transformations (deletions, insertions, or substitutions) required to transform a source string into the target one. For example, if the target term is “book” and the source is “back”, you need to change the first “o” to “a” and the second “o” to “c”, which gives a Levenshtein Distance of 2.
Additionally, some frameworks also support the Damerau-Levenshtein distance:
Damerau-Levenshtein distance: It is an extension to Levenshtein Distance, allowing one extra operation: Transposition of two adjacent characters:
Ex: TSAR to STAR
Damerau-Levenshtein distance = 1 (switching the positions of S and T costs only one operation)
Levenshtein distance = 2 (replace S with T and T with S)
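To make the two distances concrete, here is a small, self-contained Java sketch of the classic dynamic-programming algorithms (not tied to any particular search framework; the Damerau variant shown is the optimal string alignment form). It reproduces the examples above: Levenshtein("book","back") = 2, Levenshtein("TSAR","STAR") = 2 and Damerau-Levenshtein("TSAR","STAR") = 1.

```java
public class EditDistances {

    // Classic Levenshtein distance: insertions, deletions, substitutions.
    static int levenshtein(String s, String t) {
        int[][] d = new int[s.length() + 1][t.length() + 1];
        for (int i = 0; i <= s.length(); i++) d[i][0] = i;
        for (int j = 0; j <= t.length(); j++) d[0][j] = j;
        for (int i = 1; i <= s.length(); i++) {
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,      // deletion
                                            d[i][j - 1] + 1),     // insertion
                                   d[i - 1][j - 1] + cost);       // substitution
            }
        }
        return d[s.length()][t.length()];
    }

    // Damerau-Levenshtein (optimal string alignment variant):
    // adds transposition of two adjacent characters as a single operation.
    static int damerauLevenshtein(String s, String t) {
        int[][] d = new int[s.length() + 1][t.length() + 1];
        for (int i = 0; i <= s.length(); i++) d[i][0] = i;
        for (int j = 0; j <= t.length(); j++) d[0][j] = j;
        for (int i = 1; i <= s.length(); i++) {
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
                if (i > 1 && j > 1
                        && s.charAt(i - 1) == t.charAt(j - 2)
                        && s.charAt(i - 2) == t.charAt(j - 1)) {
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1); // transposition
                }
            }
        }
        return d[s.length()][t.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("book", "back"));          // 2
        System.out.println(levenshtein("TSAR", "STAR"));          // 2
        System.out.println(damerauLevenshtein("TSAR", "STAR"));   // 1
    }
}
```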
How to Use Fuzzy Matching in TALEND?
Step 1: Create an Excel file “Sample Data” with 2 columns, “Demo Event 1” and “Demo Event 2”.
Demo Event 1: This column contains the records on which we need to apply Fuzzy Logic.
Demo Event 2: This column contains the records that need to be compared with Column 1 for the fuzzy match.
Step 2: In TALEND, use the above Excel file as input in a tFileInputExcel component, and provide the same file again as input via a second tFileInputExcel component, as shown in the diagram.
Step 3: In the tFuzzyMatch component, choose the configurations shown in the diagram below.
Step 4: In the tMap, choose the following columns for the output:
Demo_Events_1
MATCHING
VALUE
Step 5: Finally, select a tFileOutputExcel component for the desired output.
In the final extracted file, the “VALUE” column shows the difference between the records and maps each record to its duplicate.
Conclusion:
In a nutshell, TALEND’s Fuzzy Matching helps ensure the data quality of any source data against a reference data source by identifying and removing any duplication created by inconsistent data. The technique is also useful for complex data matching and data duplicate analysis.
About Girikon
Girikon is a reputed provider of high-quality IT services including but not limited to Salesforce consulting, Salesforce implementation and Salesforce support.
What is Snowflake?
How to Use Snowflake data warehouse?
Snowflake Integration with Talend
Talend Snowflake Components
Related ELT Components for Snowflake
What is Snowflake?
Snowflake is an analytic cloud data warehouse built on top of Amazon Web Services (AWS) cloud infrastructure and is a true SaaS (Software as a Service) offering. There is no hardware to select, install, configure or manage, and there is no software to install, configure or manage either. It cannot be run on any private cloud or hosted infrastructure. All ongoing management, maintenance and tuning is handled by Snowflake itself.
Key Capabilities
AWS and Azure only (as of this writing)
No PK/FK Enforcement (Informational only)
Snowflake SQL
DDL/DML
SQL Functions
User-Defined Functions/Stored Procedures (using JavaScript)
Views/Materialized Views
ACID Transactions
Analytical Aggregations, Windowing and Hierarchical Queries
Integration Supports
Data integration Tools (Talend, Informatica, Pentaho etc.)
Self-Services BI Tools (Tableau, QlikView, Spotfire etc.)
Big Data Tools (Kafka, Spark, Databricks etc.)
JDBC/ODBC Drivers
Native Language Connectors (Python, Go, Node.js, etc.)
SQL Interface & Client
Snowflake web Interface
Snowflake CLI
DBeaver
Unique Features
Scalability (Storage and Compute)
User Experience (Performance)
No Indexing
No Performance Tuning
No Partitioning
No Physical Storage Design
Tunable pay per use
How to use Snowflake data warehouse
Snowflake is fairly easy to learn and handle. A free 30-day trial version is available; provide the required details and sign up for the first time. After logging in, you will see a screen like this:
From the menu on the right, you can switch the Role, Warehouse, Database and Schema. To create any of these, you can run a simple CREATE SQL statement in the worksheet provided. The top menu has tabs for Databases, Warehouses, Worksheets, History and Account: all created databases appear on the Databases tab, and warehouse and history details can be found on their respective tabs. If you are interested in detailed information about the Snowflake architecture, visit the Snowflake documentation at https://docs.snowflake.com/en/.
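If you prefer to drive the same worksheet commands from code, the sketch below uses Snowflake's JDBC driver from Java. This assumes the snowflake-jdbc driver is on the classpath; the account URL, credentials and object names (DEMO_WH, DEMO_DB) are placeholders, and the CREATE statements are only examples of what you might run in the worksheet.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;

public class SnowflakeWorksheetDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details - replace with your own account, user and password.
        String url = "jdbc:snowflake://<your_account>.snowflakecomputing.com/";
        Properties props = new Properties();
        props.put("user", "<your_user>");
        props.put("password", "<your_password>");
        props.put("warehouse", "DEMO_WH");
        props.put("db", "DEMO_DB");
        props.put("schema", "PUBLIC");

        try (Connection con = DriverManager.getConnection(url, props);
             Statement stmt = con.createStatement()) {

            // The same kind of simple CREATE statements you would type into the worksheet.
            stmt.execute("CREATE WAREHOUSE IF NOT EXISTS DEMO_WH");
            stmt.execute("CREATE DATABASE IF NOT EXISTS DEMO_DB");
            stmt.execute("CREATE SCHEMA IF NOT EXISTS DEMO_DB.PUBLIC");

            // Quick sanity check of the current context.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT CURRENT_WAREHOUSE(), CURRENT_DATABASE(), CURRENT_SCHEMA()")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " / "
                            + rs.getString(2) + " / " + rs.getString(3));
                }
            }
        }
    }
}
```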
Snowflake Integration with Talend
Creating a metadata connection to Snowflake
Inside the project Repository, find the Metadata section
Locate the Snowflake icon
Right-click and select the Snowflake connection menu option
In the connection definition dialog box, fill in the details such as Name, Account, User ID, Password, Warehouse, Schema and Database name
Initially, fill in the parameter details manually to establish a valid connection; best practice, however, is to convert them into a context group with variables assigned for the connection parameters
In the Advanced settings, ensure that Login Timeout, Tracing and Role are set as per your requirements
Click OK, then click Test connection to verify that the connection is valid
Once a valid connection is made, click Next
Select the tables as per your requirements, then click Finish
Talend Snowflake Components
tSnowflakeConnection:
Used to manage a connection to Snowflake that can be reused by other Snowflake components.
You can either drag the Snowflake connection from the Metadata section and select the tSnowflakeConnection component, or type tSnowflakeConnection and configure all the properties manually.
Standard properties
Basic Settings
Database
Select the database from the list and apply accordingly
Property type
Select how the connection details are stored: Built-In or Repository
Account
Enter the account name provided to you, within double quotes
Snowflake Region
Select the AWS or Azure region from the drop-down list
User Id and password
Enter the login name provided, within double quotes, and click the […] button to enter the password, also within double quotes
Warehouse
Enter the warehouse name to be used for your project, within double quotes
Schema
Enter the schema name to be used, within double quotes
Database
Enter the database name to be used, within double quotes
Advanced settings
Use custom Snowflake Region
To use a custom Snowflake region, select the check box and provide the region ID within double quotes, for example eu-west-1 or east-us
Login Timeout
Specify how long to wait for a response when connecting to Snowflake before returning an error
Role
Provide the Snowflake role you have been assigned; if left empty, the automatically granted PUBLIC role is used
tStatCatcher Statistics
Select this check box to gather the Job processing metadata at the Job level as well as the component level
Global variables
ERROR_MESSAGE
The error message generated by the component when an error occurs
Usage
Usage Rule
It creates a Snowflake connection that other subJobs can reuse via a trigger link
tSnowflakeInput:
Used to read data from a Snowflake table into the data flow of your Job based on an SQL query.
Standard properties
Basic settings
Database
Select the database from the list and apply accordingly
Property type
Select how the connection details are stored: Built-In or Repository
Connection Component
Select the connection component to reuse an existing connection
Account
Enter the account name provided to you, within double quotes
User Id and password
Enter the login name provided, within double quotes, and click the […] button to enter the password, also within double quotes
Warehouse
Enter the warehouse name to be used for your project, within double quotes
Schema
Enter the schema name to be used, within double quotes
Database
Enter the database name to be used, within double quotes
Table
Click the […] button and select the Snowflake table to be used
Schema and Edit Schema
Create and store the schema locally for this component, or use a schema already created in the Repository. A Repository schema can be reused across various projects and Job designs.
Manual query
Select the check box and write the SQL query to be used, within double quotes
Condition
Enter, within double quotes, the Boolean expression used to filter data from your Snowflake table
Advanced settings
Login Timeout
Specify how long to wait for a response when connecting to Snowflake before returning an error
Role
Provide the Snowflake role you have been assigned; if left empty, the automatically granted PUBLIC role is used
Tracing
Used to enable logging for the Snowflake JDBC driver
Allow Snowflake to convert columns to uppercase
Select the check box to convert lowercase letters in the defined table name to uppercase.
This property is not available if the Manual query check box is selected
tStatCatcher Statistics
Select this check box to gather the Job processing metadata at the Job level as well as the component level
Global variables
ERROR_MESSAGE
The error message generated by the component when an error occurs
NB_LINE
The number of rows processed, returned as an integer
Usage
Usage Rule
This component sends data to other components via a Row link
tSnowflakeOutput:
Used to insert, update, upsert or delete data in an output Snowflake table, using the data coming from its preceding component. It uses Snowflake's bulk loader for high-performance database operations.
Standard properties
Basic settings
Database
Select the database from the list and apply accordingly
Property type
Select how the connection details are stored: Built-In or Repository
Connection Component
Select the connection component to reuse an existing connection
Account
Enter the account name provided to you, within double quotes
User Id and password
Enter the login name provided, within double quotes, and click the […] button to enter the password, also within double quotes
Warehouse
Enter the warehouse name to be used for your project, within double quotes
Schema
Enter the schema name to be used, within double quotes
Database
Enter the database name to be used, within double quotes
Table
Click the […] button and select the Snowflake table to be used
Schema and Edit Schema
Create and store the schema locally for this component, or use a schema already created in the Repository. A Repository schema can be reused across various projects and Job designs.
Output Action
Select the operation to perform: insert, delete, update or merge (upsert) data in the Snowflake table
Advanced settings
Use custom Snowflake region
To use a custom Snowflake region, select the check box and provide the region ID within double quotes, for example eu-west-1 or east-us
Login Timeout
Specify how long to wait for a response when connecting to Snowflake before returning an error
Role
Provide the Snowflake role you have been assigned; if left empty, the automatically granted PUBLIC role is used
Tracing
Used to enable logging for the Snowflake JDBC driver
Allow Snowflake to convert columns to uppercase
Select the check box to convert lowercase letters in the defined table name to uppercase.
This property is not available if the Manual query check box is selected
tStatCatcher Statistics
Select this check box to gather the Job processing metadata at the Job level as well as the component level
Global variables
ERROR_MESSAGE
The error message generated by the component when an error occurs
NB_LINE
The number of rows processed, returned as an integer
NB_REJECT
The number of rows rejected, returned as an integer
NB_SUCCESS
The number of rows successfully processed, returned as an integer
Usage
Usage Rule
It is the end component of a data flow in your Job. It receives data from another component via a Row > Main link and provides error information such as:
the name of the column in error, the number of the row in error, the category of the error (such as a parsing error or a conversion error), the character offset in the line in error, the SQLSTATE for the error, or the Snowflake error code
tSnowflakeClose:
Used to close an active Snowflake connection and release the occupied resources.
Standard properties
Basic settings
Database
Select the database from the list and apply accordingly
Connection Component
Select, from the drop-down list, the component that opens the connection you need to close
Advanced settings
tStatCatcher Statistics
Select this check box to gather the Job processing metadata at the Job level as well as the component level
Global variables
ERROR_MESSAGE
The error message generated by the component when an error occurs
Usage
Usage Rule
This component is most commonly used together with other Snowflake components, in particular tSnowflakeConnection
Related ELT Components for Snowflake
tELTInput:
Used to add as many tables as required for the SQL statement to be executed. The standard tELTInput component belongs to the ELT family.
Standard properties
Basic settings
Schema and Edit Schema
Create and store the schema locally for this component, or use a schema already created in the Repository. A Repository schema can be reused across various projects and Job designs
Default table name
Enter the default table name within double quotes
Default Schema name
Enter the default schema name within double quotes
Mapping
Specify the metadata mapping file for the database to be used. The mapping file is used for data type conversion between the database and Java
Advanced settings
tStatCatcher Statistics
Select this check box to gather the Job processing metadata at the Job level as well as the component level
Global variables
ERROR_MESSAGE
The error message generated by the component when an error occurs
To fill a field or expression with a variable, press Ctrl+Space to access the variable list and choose the variable to use
Usage
Usage Rule
This component sends schema information to tELTMap via a Row link
Note that the output link used with these components must correspond strictly to the syntax of the table name.
Note:
The ELT components do not handle actual data flow but only the schema information
tELTMap:
Similar to the tMap component, it uses the tables provided as input to feed the parameters of the SQL statement being built. The statement can include inner or outer joins between tables or between one table and its aliases. It belongs to the ELT family. (A sketch of the kind of statement these ELT components push down to the database appears at the end of this section.)
Standard properties
Basic settings
Use an existing Connection
Select this check box and choose the relevant connection component to reuse an existing connection
ELT Map Editor
The ELT Map editor allows you to define the output schema and graphically build the SQL statement to be executed. The column names of the schema can differ from the column names in the database.
User Id and password
Enter the login name provided, within double quotes, and click the […] button to enter the password, also within double quotes
Property type
Either Built-In or Repository.
Built-In: no property data is stored centrally.
Repository: select the repository file where the properties are stored
Mapping
Specify the metadata mapping file for the database to be used. The mapping file is used for data type conversion between the database and Java
Advanced settings
Additional JDBC parameter
Specify additional connection properties for the DB connection you are creating. This is not needed if you selected Use an existing connection in the Basic settings.
tStatCatcher Statistics
Select this check box to gather the Job processing metadata at the Job level as well as the component level
Global variables
ERROR_MESSAGE
The error message generated by the component when an error occurs
To fill a field or expression with a variable, press Ctrl+Space to access the variable list and choose the variable to use
Usage
Usage Rule
This component sends schema information to tELTOutput via a Row link
Note that the output link used with these components must correspond strictly to the syntax of the table name.
Note:
The ELT components do not handle actual data flow but only the schema information
tELTOutput:
Used to insert data according to the output schema defined in the tELTMap component, including the action and clauses on the specified table.
Standard properties
Basic settings
Action on Table
Select an operation from the drop-down menu as per your requirements: None, Drop and create table, Create table, Create table if does not exist, Drop table if exists and create, or Truncate table
Action on Data
Select the operation to perform on the data: Insert, Update or Delete
Schema and Edit Schema
Create and store the schema locally for this component, or use a schema already created in the Repository. A Repository schema can be reused across various projects and Job designs.
Where clause (for Update and Delete only)
Enter a clause to filter the data to be updated or deleted during the update or delete operations.
Default Table Name
Enter the default table name within double quotes
Default Schema Name
Enter the default schema name within double quotes
User Defined Table Name
Select the check box to define a different output table name, within double quotes, in the Table Name field that appears
Mapping
Specify the metadata mapping file for the database to be used. The mapping file is used for data type conversion between the database and Java
Advanced settings
Use Update Statement without Subqueries
Select this option to generate an UPDATE statement without subqueries for the database. The option is available only when Update is selected for Action on data in the Basic settings.
tStatCatcher Statistics
Select this check box to gather the Job processing metadata at the Job level as well as the component level
Global variables
ERROR_MESSAGE
The error message generated by the component when an error occurs
NB_LINE
The number of rows processed, returned as an integer
NB_LINE_INSERTED
The number of rows inserted, returned as an integer
Usage
Usage Rule
This component is used with tELTMap via a Row link
Note that the output link used with these components must correspond strictly to the syntax of the table name.
Note:
The ELT components do not handle actual data flow but only the schema information
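To illustrate what this push-down means, here is a hypothetical Java sketch of the kind of single statement a tELTInput / tELTMap / tELTOutput chain would have the database execute: the join and the insert both run inside Snowflake rather than in the Job's JVM. The table and column names are made up, the connection details are placeholders, and the exact SQL Talend generates will differ.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.Properties;

public class EltPushDownSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder Snowflake connection details (snowflake-jdbc driver assumed on the classpath).
        String url = "jdbc:snowflake://<your_account>.snowflakecomputing.com/";
        Properties props = new Properties();
        props.put("user", "<your_user>");
        props.put("password", "<your_password>");
        props.put("warehouse", "DEMO_WH");
        props.put("db", "DEMO_DB");
        props.put("schema", "PUBLIC");

        // A representative ELT statement: join two source tables and load the
        // result directly into the target table, entirely inside the database.
        String eltStatement =
            "INSERT INTO CUSTOMER_ORDERS (CUSTOMER_ID, CUSTOMER_NAME, ORDER_TOTAL) " +
            "SELECT c.ID, c.NAME, o.TOTAL " +
            "FROM CUSTOMERS c " +
            "INNER JOIN ORDERS o ON o.CUSTOMER_ID = c.ID " +
            "WHERE o.STATUS = 'COMPLETED'";

        try (Connection con = DriverManager.getConnection(url, props);
             Statement stmt = con.createStatement()) {
            int rows = stmt.executeUpdate(eltStatement);
            System.out.println("Rows inserted by the push-down statement: " + rows);
        }
    }
}
```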
Final Thoughts:
Traditional data warehousing methods and technologies face a major challenge in providing the kind of service, simplicity and value required to survive in today’s demanding and rapidly changing business environment.
The combination of Snowflake and Talend, a cloud data integration leader, is not just easy and enjoyable to work with but is also an attractive proposition as a cloud data warehousing solution.
About Girikon:
Girikon is a reputed name in the Salesforce consulting, Salesforce implementation and Salesforce support space, and offers end-to-end IT services to cater to the needs of clients across different industry verticals.
Performance:
Usually, to increase the speed of the execution process or improve the performance of a Job, developers should follow some performance techniques. Performance is a key metric of a Job’s success or failure.
We can speed up the Job execution process in the following ways:
Global Context: Define a context group and access context details through the appropriate context variables throughout the Job; the same context can also be passed to subJobs.
If we have a more complex Job, we can divide it into multiple subJobs to improve processing speed.
For parallel Job execution, we can use the Multithread option to improve processing speed.
(Job -> Extra -> Multithread Execution)
Increasing the JVM parameters (allocating more memory to the Jobs) is another performance technique in Talend.
Use Database Bulk Components (while loading huge datasets into a database from a Talend Job, it is recommended to use the bulk components provided by Talend, which exist for almost all databases).
Ex: tMSSqlBulkExec, tMSSqlOutputBulk, tMSSqlOutputBulkExec, etc.
Use a SELECT query to retrieve data from the database.
Ex: tOracleInput, tMysqlInput, etc.
Removing unnecessary data/rows using tFilterRow
Removing unnecessary fields/columns using tFilterColumns
We can also increase the performance by changing the commit size and batch size.
Avoid using the tMap component for a single simple piece of work; this also improves performance and processing time.
Reusability:
Talend provides many ways to make our code reusable.
Custom Code: Talend provides a few custom-code components such as tJava, tJavaRow and tJavaFlex. These components are used to integrate our custom Java code into a Talend program. Using these custom-code components, we can write code for specific needs and access it very quickly (see the sketch at the end of this section).
Job: With the help of tRunJob, we can call another Job.
Metadata: Data that gives information about other data is called metadata. In Talend Open Studio, metadata refers to reusable configurations that describe the data and its attributes.
Once we have defined the metadata, it can be used across multiple Jobs. For example, if the username or password of a DB account changes and this DB connection is used in 5 different Jobs, the details would have to be updated 5 times. However, if we store this configuration in a single metadata component, it only needs to be updated once; there is no need to change it in the different Jobs.
Context Groups: Contexts are user-defined parameters. Context variables hold values that may change from one environment to another, such as Development to Testing or Testing to Production.
For these cases, Talend provides the Context Group facility: a group of context variables that can be added to one or more Jobs from the Metadata section. If we change any value in the context group, the change is reflected in every Job that uses that same context group (context variables also appear in the sketch at the end of this section).
Joblets: We can use Joblets for reusability. Any piece of a Job that recurs can be put into a Joblet.
SQL Templates: Talend ships with some default SQL templates. If we are using databases in our Jobs, we can create our own SQL templates or use the default ones.
Use Repository mode for storing the information that you want to use repeatedly in multiple components or Jobs, such as a database connection. Avoid the built-in mode.
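As a small illustration of the custom-code and context-group points above, the snippet below is the kind of code you might type into the Code box of a tJavaRow component. The generated input_row / output_row variables and the context object are Talend's; the schema columns (name, email, country) and the context variable country_default are made-up assumptions for this sketch.

```java
// Contents of the "Code" box of a tJavaRow component (hypothetical example).
// input_row / output_row are generated by Talend from the component's schema;
// the columns (name, email, country) and the context variable country_default
// are assumptions for this sketch.

// Standardize the incoming columns.
output_row.name  = input_row.name  == null ? "" : input_row.name.trim().toUpperCase();
output_row.email = input_row.email == null ? "" : input_row.email.trim().toLowerCase();

// Fall back to a context-group variable when the source value is missing,
// so the same Job works unchanged in Development, Testing and Production.
output_row.country = (input_row.country == null || input_row.country.isEmpty())
        ? context.country_default
        : input_row.country;
```

In the same spirit, tJava can hold one-off setup code, while tJavaFlex lets you add start, main and end code around the data flow.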
About GIRIKON:
Girikon is a reputed provider of end-to-end IT services including but not limited to Salesforce consulting, Salesforce implementation and Salesforce support.
Are you facing the following challenges while migrating data to Salesforce?
Tired of deleting the duplicates manually in Excel?
Does Excel keep hanging because of large data volumes?
Applying VLOOKUP again and again for the same data set?
Duplicating the whole dataset again for the UAT and Production environments?
Unable to apply complex transformations to your data set?
If you have answered yes to most or all of the questions above, then the solution to your problem is TALEND.
What is TALEND?
TALEND is an open source software integration platform which helps you to turn data into business insights effortlessly.
When we hear the term “Data Migration”, questions like – What is data migration? Why is it needed? How is it done? etc., pop up in our mind.
This article addresses all the basic queries on Data Migration along with the reasons for choosing TALEND as an ETL tool for Salesforce migration.
Which tool should be used in Salesforce Data Migration?
Selecting the right data migration tool depends largely on your needs. There are several data migration tools, but they won’t be of help if they fail to meet the specific goals and objectives of your company. Listed below are some points that you might need to consider while choosing an ETL tool.
Let’s understand the Data Migration process before we go ahead and address the aforementioned issues.
What is Data Migration and How Does TALEND Fit in?
As the name itself suggests, data migration is the process in which data is transferred from one system to the other. These transfer systems can be data storage types or file formats. Data from the old system is transferred to a new system through a particular mapping pattern.
The above diagram shows a job in TALEND, illustrating how simple it is to extract data from a source, apply any transformation logic and push it into different target destinations.
How is it done?
Source Data: TALEND can be used to import raw data from different sources and can process multiple types and formats of source data, for example File, Database, CRM or API.
Data Cleansing: In this process we perform the following steps (see the sketch after these steps):
Detect and correct (or remove) corrupt or inaccurate records from the dataset.
Identify incomplete, incorrect, inaccurate or irrelevant parts of the data.
Replace, modify, or delete the dirty or coarse data.
Data Deduplication: After cleansing the data, duplicate copies are removed from the dataset based on a pre-approved field set.
Data Transformation: In this process we transform the data from the source to match the target.
Data Upload: After the transformation, the data is uploaded to the target system; TALEND supports multiple target platforms, for example Excel, SQL Server or a Salesforce org.
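The cleansing and deduplication steps above can be prototyped outside of Talend in a few lines of Java. The sketch below uses made-up columns, with email as the hypothetical pre-approved deduplication field: it trims and normalizes each record, drops rows with a missing email, and keeps only the first occurrence of each email.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class CleanseAndDedup {
    public static void main(String[] args) {
        // Hypothetical raw source rows: name, email (the "pre-approved field" for dedup).
        List<String[]> raw = List.of(
            new String[]{" Alice ", " ALICE@EXAMPLE.COM "},
            new String[]{"Bob",     "bob@example.com"},
            new String[]{"alice",   "alice@example.com"},   // duplicate after cleansing
            new String[]{"Carol",   ""}                      // incomplete record
        );

        Set<String> seenEmails = new LinkedHashSet<>();
        List<String[]> clean = new ArrayList<>();

        for (String[] row : raw) {
            // Data cleansing: trim and normalize, drop incomplete records.
            String name  = row[0].trim();
            String email = row[1].trim().toLowerCase();
            if (email.isEmpty()) continue;

            // Data deduplication: keep only the first record per email.
            if (seenEmails.add(email)) {
                clean.add(new String[]{name, email});
            }
        }

        clean.forEach(r -> System.out.println(r[0] + " | " + r[1]));
        // Prints: "Alice | alice@example.com" and "Bob | bob@example.com"
    }
}
```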
Why choose TALEND?
Open Source Software: There are no cost implications for using “Talend Open Studio” or “TOS” since it is open-source software.
Costs apply only if there is a demand for joint data collaboration.
Multiple Source/Target Formats supported: We can extract data directly from Salesforce, transform it, and push it back into Salesforce without the need for Excel or any database. If we do not have an org-to-org connection, we can still extract the data using Excel, manipulate it, and load it into Salesforce using TALEND, since it supports all the Excel formats.
Faster: We can process and transform the data much faster than in Excel, as TALEND provides functionality to change formats, append new data to existing data, and create or remove columns on the go.
Reusable: In TALEND, once a job is created, the same job can be used for different target locations; we do not need to transform the whole dataset again for different environments (such as UAT or Production).
In the case of any future modification, we just need to amend the same job, sparing us from going through the whole process of transforming the data again.
Supports Complex Data Transformation: TALEND provides the tools to perform highly complex transformations and manipulations on the data. It also provides the functionality to write custom scripts in Java wherever custom transformations are needed.
It can also process huge amounts of data that would be difficult to manage in Excel.
Conclusion:
While there are several other tools for data migration in the market, TALEND has become a preferred choice as it provides the functionality to access data more easily, process it more efficiently, and move it across various target systems with ease.
About Girikon:
Girikon is a Salesforce consulting and development company offering quality IT services to clients across the globe. Based in Phoenix, Arizona, the company has a presence across Australia and India.