Friday, May 14, 2021

Semantic Search: Too many or too few matching pairs? A dynamically determined selection threshold for matched query pairs

In my recent projects applying natural language processing (NLP) methods, a large part of the work is based on semantic search. In a nutshell, we have queries (phrases or sentences) on one side and a set of texts on the other, and our goal is to find the texts that best match each query. Simply put, we need to perform semantic search on our data set.

For those who are less familiar with semantic search, let me define the term as:
a kind of comparison of two texts in which the dominant part is understanding the content of words and phrases, and the relations between the words or phrases in the texts being compared.

While working with semantic search, I ran into the problem of defining an acceptance threshold for the matches found. The problem becomes significant when the texts being compared differ substantially in length and/or in the amount of content they carry. In other words, it becomes serious when we deal with so-called asymmetric semantic search (https://www.sbert.net/examples/applications/semantic-search/README.html#symmetric-vs-asymmetric-semantic-search).

In the following, I would like to share a method for dynamically determining the acceptance threshold for the matched pairs. This method may serve as the final solution or as a starting point for a more refined version. The project code is available on my GitHub account: "https://github.com/Lobodzinski/Semantic_Search__dynamical_estimation_of_the_cut_of_for_selected_matches".

Let's start describing the method.
  • The data:
    As an experiment I will use the Reuters data set known as Reuters-21578 (https://paperswithcode.com/dataset/reuters-21578). When searching for an answer to a query, we should try to formulate the question as precisely as possible. However, this is not always possible. For the purpose of this mini-project, let's formulate our queries in a general form:
    'Behavior of the precious metals market',
    'What is the situation in metal mines',
    'Should fuel prices expect to rise ?',
    'Will food prices rise in the near future ?',
    'I am looking for information about food crops.',
    'Information on the shipbuilding industry'
  • Generation of matched pairs between the queries and the Reuters texts:
    Our goal is to perform a semantic search. First, we need to generate matching text pairs. In the following I will use the code that is part of the sentence-transformers package "https://github.com/UKPLab/sentence-transformers/tree/master/examples/applications/semantic-search".
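The matching step can be pictured as ranking the texts by the cosine similarity between embedding vectors. Below is a minimal sketch of that idea, assuming the embeddings have already been computed (in the actual project they would come from a sentence-transformers model). The toy 3-dimensional vectors and the helper names `cosine` and `rank_matches` are hypothetical and serve only as an illustration; the row keys mirror the 'query', 'sentence' and 'score' columns used later in this post.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_matches(query, query_vec, corpus, top_k=3):
    """Return the top_k corpus texts for one query, best match first.

    corpus is a list of (text, embedding) pairs; the embeddings are
    assumed to come from the same model as query_vec.
    """
    rows = [
        {"query": query, "sentence": text, "score": cosine(query_vec, vec)}
        for text, vec in corpus
    ]
    rows.sort(key=lambda row: row["score"], reverse=True)
    return rows[:top_k]

# Toy 3-dimensional "embeddings", invented purely for illustration.
corpus = [
    ("gold mine accident", [0.9, 0.1, 0.0]),
    ("wheat harvest forecast", [0.0, 0.2, 0.9]),
    ("copper smelter strike", [0.7, 0.6, 0.1]),
]
matches = rank_matches("What is the situation in metal mines",
                       [1.0, 0.2, 0.0], corpus)
for row in matches:
    print(f"{row['score']:.3f}  {row['sentence']}")
```

Sorting by decreasing score matters here: the threshold selection discussed in this post operates on exactly such a ranked list of similarities.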
  • Similarities and the threshold calculation:
    Having calculated the similarity values, we can move to the main point: choosing the similarity threshold. First, let's look at the plot of similarity as a function of the matched text for a fixed query ('What is the situation in metal mines ?').
    Clearly, not all matches shown in the figure are good (acceptable). So how do we choose the threshold value of similarity?
    The proposed method is fully heuristic and is based on finding the elbow point of the curve that shows similarity as a function of the matched text, with the texts ordered by decreasing similarity. If we look at this curve, we can see that it (almost always) has an elbow point beyond which the similarity between the found texts and our query changes very slowly. To calculate the cut-off point (elbow point) I used the kneed package ("https://pypi.org/project/kneed/"). Its KneeLocator class ("https://kneed.readthedocs.io/en/stable/parameters.html#s") has a sensitivity parameter S which can be used to tune the selection of the elbow point.
    The following code and its output show the details of the calculation and its results. For details, please check "https://github.com/Lobodzinski/Semantic_Search__dynamical_estimation_of_the_cut_of_for_selected_matches".

    For each query, this part reads all matched sentences gathered from the Reuters data together with their calculated similarities. The threshold is calculated by KneeLocator (see the KneeLocator call in the code below).
    
    from kneed import KneeLocator

    # loop over our list of queries:
    for query in result_df['query'].unique():
        matches = result_df[result_df['query'] == query]

        # shortened texts (first 60 characters), e.g. for use as plot labels:
        x = [s[:60] + '...' for s in matches['sentence'].values]

        # similarities between the query and the matched Reuters texts
        # (assumed to be sorted in decreasing order):
        y1 = matches['score'].values

        # determine the elbow value:
        x0 = list(range(len(y1)))
        kn = KneeLocator(x0, y1, S=1., curve='convex', direction='decreasing')
        elbow_1 = kn.knee

        # kn.knee is None when no elbow is detected, so guard the lookup:
        if elbow_1 is None:
            print('No elbow point found for query:', query)
        else:
            print('Elbow point values:\n text_id=', elbow_1,
                  '; threshold value=', y1[elbow_1])
        
        
     
    The resulting value (the threshold point) is shown in the next figure:

    So, for our query ('What is the situation in metal mines ?'), 14 texts from the Reuters set were accepted. Below I have copied the first 3 and the last 2 texts from the set of accepted texts (the whole set is too long to present here). Readers can judge for themselves the similarity between the query and each text.
    For comparison, I have also added the first two texts that were not accepted by this method (15 and 16).
    Accepted texts:
    1
    SIX KILLED IN SOUTH AFRICAN GOLD MINE ACCIDENT Six black miners have been killed and two injured in a rock fall three km underground at a South African gold mine, the owners said on Sunday. lt Rand Mines Properties Ltd>, one of South Africa s big six mining companies, said in a statement that the accident occurred on Saturday morning at the lt East Rand Proprietary Mines Ltd> mine at Boksburg, 25 km east of Johannesburg. A company spokesman could not elaborate on the short statement.
    2
    NORANDA BEGINS SALVAGE OPERATIONS AT MURDOCHVILLE lt Noranda Inc> said it began salvage operations at its Murdochville, Quebec, mine, where a fire last week killed one miner and caused 10 mln dlrs in damage. Another 56 miners were trapped underground for as long as 24 hours before they were brought to safety. Noranda said the cause and full extent of the damage is still unknown but said it does know that the fire destroyed 6,000 feet of conveyor belt. Noranda said work crews have begun securing the ramp leading into the zone where the fire was located. The company said extreme heat from the fire caused severe rock degradation along several ramps and drifts in the mine. Noranda estimated that the securing operation for the zone will not be completed before the end of April. Noranda said the Quebec Health and Safety Commission, the Quebec Provincial Police and Noranda itself are each conducting an investigation into the fire. Production at the mine has been suspended until the investigations are complete. The copper mine and smelter produced 72,000 tons of copper anodes in 1986 and employs 680 people. The smelter continues to operate with available concentrate from stockpiled supplies, Noranda said. Reuter
    3
    NORTHGATE QUEBEC GOLD WORKERS END STRIKE Northgate Exploration Ltd said hourly paid workers at its two Chibougamau, Quebec mines voted on the weekend to accept a new three year contract offer and returned to work today after a one month strike. It said the workers, represented by United Steelworkers of America, would receive a 1.21 dlr an hour pay raise over the life of the new contract and improved benefits. Northgate, which produced 23,400 ounces of gold in first quarter, said that while the strike slowed production, We are still looking forward to a very satisfactory performance. The Chibougamau mines produced 81,500 ounces of gold last year.
    ....
    13
    NORANDA BRUNSWICK MINERS VOTE MONDAY ON CONTRACT Noranda Inc said 1,100 unionized workers at its 63 pct owned Brunswick Mining and Smelter Corp lead zinc mine in New Brunswick would start voting Monday on a tentative contract pact. Company official Andre Fortier said We are hopeful that we can settle without any kind of work interruption. Fortier added that Brunswick s estimated 500 unionized smelter workers were currently meeting about a Noranda contract proposal and would probably vote next week. The mine s contract expires July 1 and the smelter s on July 21. The Brunswick mine produced 413,800 tonnes of zinc and 206,000 tonnes of lead last year at a recovery rate of 70.5 pct zinc and 55.6 pct lead. Concentrates produced were 238,000 tonnes of zinc and 81,000 tonnes of lead.
    14
    COMINCO lt CLT> SETS TENTATIVE TALKS ON STRIKE Cominco Ltd said it set tentative talks with three striking union locals that rejected on Saturday a three year contract offer at Cominco s Trail and Kimberley, British Columbia lead zinc operations. The locals, part of United Steelworkers of America, represent 2,600 production and maintenance workers. No date has been set for the talks, the spokesman replied to a query. The spokesman said talks were still ongoing with the two other striking locals, representing 600 office and technical workers. Production at Trail and Kimberley has been shut down since the strike started May 9. Each of the five locals has a separate contract that expired April 30, but the main issues are similar. The Trail smelter produced 240,000 long tons of zinc and 110,000 long tons of lead last year, while the Sullivan mine at Kimberley produced 2.2 mln long tons of ore last year, most for processing at Trail. Revenues from Trail s smelter totaled 356 mln Canadian dlrs in 1986.


    Not Accepted texts:
    15
    VESSEL LOST IN PACIFIC WAS CARRYING LEAD The 37,635 deadweight tonnes bulk carrier Cumberlande, which sank in the South Pacific last Friday, was carrying a cargo which included lead as well as magnesium ore, a Lloyds Shipping Intelligence spokesman said. He was unable to confirm the tonnages involved. Trade reports circulating the London Metal Exchange said the vessel, en route to New Orleans from Newcastle, New South Wales, had been carrying 10,000 tonnes of lead concentrates. Traders said this pushed lead prices higher in early morning trading as the market is currently sensitive to any fundamental news due to its finely balanced supply demand position and low stocks. Trade sources said that 10,000 tonnes of lead concentrates could convert to around 5,000 tonnes of metal, although this depended on the quality of the concentrates. A loss of this size could cause a gap in the supply pipeline, particularly in North America, they noted. Supplies there have been very tight this year and there is a strike at one major producer, Cominco, and labour talks currently being held at another, Noranda subsidiary Brunswick Mining and Smelting Ltd.
    16
    LTV lt QLTV> TO NEGOTIATE WITH STEELWORKERS LTV Corp s LTV Steel Corp said it agreed to resume negotiations with the United Steelworkers of America at the local plant levels, to discuss those provisions of its proposal that require local implementation. The local steelworker union narrowly rejected a tentative agreement with the company on May 14, it said. LTV also said it agreed to reopen its offer contained in the tentative agreement reached with the union s negotiating committee as part of a plan to resolve problems through local discussions.


    As you can see, the texts that were not accepted are not directly related to mining, which is what our query asks about.
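For readers who want to experiment with the idea of an elbow cut-off without installing kneed, here is a minimal sketch of a simpler geometric heuristic: pick the point that lies farthest below the straight line (chord) joining the first and last points of the sorted similarity curve. This is a swapped-in illustration and not the algorithm that KneeLocator implements; the function name `chord_elbow` and the synthetic scores are hypothetical. Treating the elbow point itself as accepted is also an assumption of this sketch.

```python
def chord_elbow(scores):
    """Index of the point farthest below the chord from first to last score.

    scores must be sorted in decreasing order. This is a simple geometric
    stand-in for an elbow detector, not the algorithm used by kneed.
    """
    n = len(scores)
    if n < 3:
        return None  # too few points to define an elbow
    first, last = scores[0], scores[-1]
    best_idx, best_gap = None, 0.0
    for i, s in enumerate(scores):
        # Height of the chord at position i (linear interpolation).
        chord = first + (last - first) * i / (n - 1)
        gap = chord - s  # positive where the curve sags below the chord
        if gap > best_gap:
            best_idx, best_gap = i, gap
    return best_idx

# Synthetic, decreasing similarity scores invented for illustration.
scores = [0.80, 0.62, 0.50, 0.42, 0.37, 0.35, 0.34, 0.33, 0.32, 0.31]
elbow = chord_elbow(scores)
if elbow is None:
    accepted, rejected = scores, []
else:
    # Assumption: the elbow point itself counts as accepted.
    accepted, rejected = scores[: elbow + 1], scores[elbow + 1 :]
print("elbow index:", elbow)
print("accepted:", accepted)
print("rejected:", rejected)
```

On a curve that drops quickly and then flattens, the chord heuristic places the cut where the flat tail begins, which is the same intuition behind the threshold used in this post. It has no sensitivity parameter, so it cannot be tuned the way S allows with KneeLocator.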


Thanks for reading. Please feel free to comment and ask questions if anything is unclear.