On the performance of sequential Bayesian update for database of diverse tsunami scenarios

Authors: Reika Nomura, Louise A. Hirao Vermare, Saneiki Fujita, Donsub Rim, Shuji Moriguchi, Randall J. LeVeque, Kenjiro Terada

Abstract: Although the sequential tsunami scenario detection framework was validated in our previous work, several tasks remain to be resolved from a practical point of view. This study aims to evaluate the performance of the previous tsunami scenario detection framework using a diverse database consisting of complex fault rupture patterns with heterogeneous slip distributions. Specifically, we compare the effectiveness of scenario superposition to that of the previous most likely scenario detection method. Additionally, how the length of the observation time window influences the accuracy of both methods is analyzed. We utilize an existing database comprising 1771 tsunami scenarios targeting the city Westport (WA, U.S.), which includes synthetic wave height records and inundation distributions as the result of fault rupture in the Cascadia subduction zone. The heterogeneous patterns of slips used in the database increase the diversity of the scenarios and thus make it a proper database for evaluating the performance of scenario superposition. To assess the performance, we consider various observation time windows shorter than 15 minutes and divide the database into five testing and learning sets. The evaluation accuracy of the maximum offshore wave, inundation depth, and its distribution is analyzed to examine the advantages of the scenario superposition method over the previous method. We introduce the dynamic time warping (DTW) method as an additional benchmark and compare its results to that of the Bayesian scenario detection method.

Submitted 4 July, 2024; originally announced July 2024.

Comments: 15 pages, 12 figures

arXiv:2308.02926 [pdf, other]

Towards Consistency Filtering-Free Unsupervised Learning for Dense Retrieval

Authors: Haoxiang Shi, Sumio Fujita, Tetsuya Sakai

Abstract: Domain transfer is a prevalent challenge in modern neural Information Retrieval (IR). To overcome this problem, previous research has utilized domain-specific manual annotations and synthetic data produced by consistency filtering to finetune a general ranker and produce a domain-specific ranker. However, training such consistency filters are computationally expensive, which significantly reduces the model efficiency. In addition, consistency filtering often struggles to identify retrieval intentions and recognize query and corpus distributions in a target domain. In this study, we evaluate a more efficient solution: replacing the consistency filter with either direct pseudo-labeling, pseudo-relevance feedback, or unsupervised keyword generation methods for achieving consistent filtering-free unsupervised dense retrieval. Our extensive experimental evaluations demonstrate that, on average, TextRank-based pseudo relevance feedback outperforms other methods. Furthermore, we analyzed the training and inference efficiency of the proposed paradigm. The results indicate that filtering-free unsupervised learning can continuously improve training and inference efficiency while maintaining retrieval performance. In some cases, it can even improve performance based on particular datasets.

Submitted 5 August, 2023; originally announced August 2023.

arXiv:2208.14210 [pdf, other]

Learned k-NN Distance Estimation

Authors: Daichi Amagata, Yusuke Arai, Sumio Fujita, Takahiro Hara

Abstract: Big data mining is well known to be an important task for data science, because it can provide useful observations and new knowledge hidden in given large datasets. Proximity-based data analysis is particularly utilized in many real-life applications. In such analysis, the distances to k nearest neighbors are usually employed, thus its main bottleneck is derived from data retrieval. Much efforts have been made to improve the efficiency of these analyses. However, they still incur large costs, because they essentially need many data accesses. To avoid this issue, we propose a machine-learning technique that quickly and accurately estimates the k-NN distances (i.e., distances to the k nearest neighbors) of a given query. We train a fully connected neural network model and utilize pivots to achieve accurate estimation. Our model is designed to have useful advantages: it infers distances to the k-NNs at a time, its inference time is O(1) (no data accesses are incurred), but it keeps high accuracy. Our experimental results and case studies on real datasets demonstrate the efficiency and effectiveness of our solution.

Submitted 27 November, 2022; v1 submitted 29 August, 2022; originally announced August 2022.

Comments: Accepted to SIGSPATIAL2022 (as short paper)

arXiv:2104.06646 [pdf]

Influenza Surveillance using Search Engine, SNS, On-line Shopping, Q&A Service and Past Flu Patients

Authors: Taichi Murayama, Nobuyuki Shimizu, Sumio Fujita, Shoko Wakamiya, Eiji Aramaki

Abstract: Influenza, an infectious disease, causes many deaths worldwide. Predicting influenza victims during epidemics is an important task for clinical, hospital, and community outbreak preparation. On-line user-generated contents (UGC), primarily in the form of social media posts or search query logs, are generally used for prediction for reaction to sudden and unusual outbreaks. However, most studies rely only on the UGC as their resource and do not use various UGCs. Our study aims to solve these questions about Influenza prediction: Which model is the best? What combination of multiple UGCs works well? What is the nature of each UGC? We adapt some models, LASSO Regression, Huber Regression, Support Vector Machine regression with Linear kernel (SVR) and Random Forest, to test the influenza volume prediction in Japan during 2015 - 2018. For that, we use on-line five data resources: (1) past flu patients, (2) SNS (Twitter), (3) search engines (Yahoo! Japan), (4) shopping services (Yahoo! Shopping), and (5) Q&A services (Yahoo! Chiebukuro) as resources of each model. We then validate respective resources contributions using the best model, Huber Regression, with all resources except one resource. Finally, we use Bayesian change point method for ascertaining whether the trend of time series on any resources is reflected in the trend of flu patient count or not. Our experiments show Huber Regression model based on various data resources produces the most accurate results. Then, from the change point analysis, we get the result that search query logs and social media posts for three years represents these resources as a good predictor. Conclusions: We show that Huber Regression based on various data resources is strong for outliers and is suitable for the flu prediction. Additionally, we indicate the characteristics of each resource for the flu prediction.

Submitted 14 April, 2021; originally announced April 2021.

Comments: 18 pages, 3 figures

arXiv:2011.09140 [pdf, other]

Diverse and Non-redundant Answer Set Extraction on Community QA based on DPPs

Authors: Shogo Fujita, Tomohide Shibata, Manabu Okumura

Abstract: In community-based question answering (CQA) platforms, it takes time for a user to get useful information from among many answers. Although one solution is an answer ranking method, the user still needs to read through the top-ranked answers carefully. This paper proposes a new task of selecting a diverse and non-redundant answer set rather than ranking the answers. Our method is based on determinantal point processes (DPPs), and it calculates the answer importance and similarity between answers by using BERT. We built a dataset focusing on a Japanese CQA site, and the experiments on this dataset demonstrated that the proposed method outperformed several baseline methods.

Submitted 18 November, 2020; originally announced November 2020.

Comments: COLING2020, 12 pages

arXiv:2011.04241 [pdf, other]

Pointing to Subwords for Generating Function Names in Source Code

Authors: Shogo Fujita, Hidetaka Kamigaito, Hiroya Takamura, Manabu Okumura

Abstract: We tackle the task of automatically generating a function name from source code. Existing generators face difficulties in generating low-frequency or out-of-vocabulary subwords. In this paper, we propose two strategies for copying low-frequency or out-of-vocabulary subwords in inputs. Our best performing model showed an improvement over the conventional method in terms of our modified F1 and accuracy on the Java-small and Java-large datasets.

Submitted 9 November, 2020; originally announced November 2020.

Comments: 12 pages, accepted to COLING2020

arXiv:2004.10100 [pdf, other]

doi 10.1038/s41598-020-75771-6

Syndromic surveillance using search query logs and user location information from smartphones against COVID-19 clusters in Japan

Authors: Shohei Hisada, Taichi Murayama, Kota Tsubouchi, Sumio Fujita, Shuntaro Yada, Shoko Wakamiya, Eiji Aramaki

Abstract: [Background] Two clusters of coronavirus disease 2019 (COVID-19) were confirmed in Hokkaido, Japan in February 2020. To capture the clusters, this study employs Web search query logs and user location information from smartphones. [Material and Methods] First, we anonymously identified smartphone users who used a Web search engine (Yahoo! JAPAN Search) for the COVID-19 or its symptoms via its companion application for smartphones (Yahoo Japan App). We regard these searchers as Web searchers who are suspicious of their own COVID-19 infection (WSSCI). Second, we extracted the location of the WSSCI via the smartphone application. The spatio-temporal distribution of the number of WSSCI are compared with the actual location of the known two clusters. [Result and Discussion] Before the early stage of the cluster development, we could confirm several WSSCI, which demonstrated the basic feasibility of our WSSCI-based approach. However, it is accurate only in the early stage, and it was biased after the public announcement of the cluster development. For the case where the other cluster-related resources, such as fine-grained population statistics, are not available, the proposed metric would be helpful to catch the hint of emerging clusters.

Submitted 21 April, 2020; originally announced April 2020.

arXiv:1910.10410 [pdf, other]

BanditRank: Learning to Rank Using Contextual Bandits

Authors: Phanideep Gampa, Sumio Fujita

Abstract: We propose an extensible deep learning method that uses reinforcement learning to train neural networks for offline ranking in information retrieval (IR). We call our method BanditRank as it treats ranking as a contextual bandit problem. In the domain of learning to rank for IR, current deep learning models are trained on objective functions different from the measures they are evaluated on. Since most evaluation measures are discrete quantities, they cannot be leveraged by directly using gradient descent algorithms without an approximation. BanditRank bridges this gap by directly optimizing a task-specific measure, such as mean average precision (MAP), using gradient descent. Specifically, a contextual bandit whose action is to rank input documents is trained using a policy gradient algorithm to directly maximize the reward. The reward can be a single measure, such as MAP, or a combination of several measures. The notion of ranking is also inherent in BanditRank, similar to the current listwise approaches. To evaluate the effectiveness of BanditRank, we conducted a series of experiments on datasets related to three different tasks, i.e., web search, community, and factoid question answering. We found that it performs better than state-of-the-art methods when applied on the question answering datasets. On the web search dataset, we found that BanditRank performed better than four strong listwise baselines including LambdaMART, AdaRank, ListNet and Coordinate Ascent.

Submitted 23 October, 2019; originally announced October 2019.

Comments: 9 pages

arXiv:1908.06664 [pdf, ps, other]

Safe sets in digraphs

Authors: Yandong Bai, Jørgen Bang-Jensen, Shinya Fujita, Anders Yeo

Abstract: A non-empty subset $S$ of the vertices of a digraph $D$ is called a safe set if (i) for every strongly connected component $M$ of $D-S$, there exists a strongly connected component $N$ of $D[S]$ such that there exists an arc from $M$ to $N$; and (ii) for every strongly connected component $M$ of $D-S$ and every strongly connected component $N$ of $D[S]$, we have $|M|\leq |N|$ whenever there exists an arc from $M$ to $N$. In the case of acyclic digraphs a set $X$ of vertices is a safe set precisely when $X$ is an in-dominating set, that is, every vertex not in $X$ has at least one arc to $X$. We prove that, even for acyclic digraphs which are traceable (have a hamiltonian path) it is NP-hard to find a minimum cardinality in-dominating set. Then we show that the problem is also NP-hard for tournaments and give, for every positive constant $c$, a polynomial algorithm for finding a minimum cardinality safe set in a tournament on $n$ vertices in which no strong component has size more than $c\log(n)$. Under the so called Exponential Time Hypothesis (ETH) this is close to best possible in the following sense: If ETH holds, then, for every $ε>0$ there is no polynomial time algorithm for finding a minimum cardinality safe set for the class of tournaments in which the largest strong component has size at most $\log^{1+ε}(n)$. We also discuss bounds on the cardinality of safe sets in tournaments.

Submitted 19 August, 2019; originally announced August 2019.

arXiv:1902.10895 [pdf]

What you get is not always what you see: pitfalls in solar array assessment using overhead imagery

Authors: Wei Hu, Kyle Bradbury, Jordan M. Malof, Boning Li, Bohao Huang, Artem Streltsov, K. Sydny Fujita, Ben Hoen

Abstract: Effective integration planning for small, distributed solar photovoltaic (PV) arrays into electric power grids requires access to high quality data: the location and power capacity of individual solar PV arrays. Unfortunately, national databases of small-scale solar PV do not exist; those that do are limited in their spatial resolution, typically aggregated up to state or national levels. While several promising approaches for solar PV detection have been published, strategies for evaluating the performance of these models are often highly heterogeneous from study to study. The resulting comparison of these methods for practical applications for energy assessments becomes challenging and may imply that the reported performance evaluations are overly optimistic. The heterogeneity comes in many forms, each of which we explore in this work: the level of spatial aggregation, the validation of ground truth, inconsistencies in the training and validation datasets, and the degree of diversity of the locations and sensors from which the training and validation data originate. For each, we discuss emerging practices from the literature to address them or suggest directions of future research. As part of our investigation, we evaluate solar PV identification performance in two large regions. Our findings suggest that traditional performance evaluation of the automated identification of solar PV from satellite imagery may be optimistic due to common limitations in the validation process. The takeaways from this work are intended to inform and catalyze the large-scale practical application of automated solar PV assessment techniques by energy researchers and professionals.

Submitted 25 July, 2022; v1 submitted 28 February, 2019; originally announced February 2019.

Comments: 25 pages

arXiv:1808.09648 [pdf, other]

Adapting Visual Question Answering Models for Enhancing Multimodal Community Q&A Platforms

Authors: Avikalp Srivastava, Hsin Wen Liu, Sumio Fujita

Abstract: Question categorization and expert retrieval methods have been crucial for information organization and accessibility in community question & answering (CQA) platforms. Research in this area, however, has dealt with only the text modality. With the increasing multimodal nature of web content, we focus on extending these methods for CQA questions accompanied by images. Specifically, we leverage the success of representation learning for text and images in the visual question answering (VQA) domain, and adapt the underlying concept and architecture for automated category classification and expert retrieval on image-based questions posted on Yahoo! Chiebukuro, the Japanese counterpart of Yahoo! Answers. To the best of our knowledge, this is the first work to tackle the multimodality challenge in CQA, and to adapt VQA models for tasks on a more ecologically valid source of visual questions. Our analysis of the differences between visual QA and community QA data drives our proposal of novel augmentations of an attention method tailored for CQA, and use of auxiliary tasks for learning better grounding features. Our final model markedly outperforms the text-only and VQA model baselines for both tasks of classification and expert retrieval on real-world multimodal CQA data.

Submitted 25 May, 2019; v1 submitted 29 August, 2018; originally announced August 2018.

Comments: Submitted for review at CIKM 2019

arXiv:1709.03260 [pdf, ps, other]

A Short Note on Proximity-based Scoring of Documents with Multiple Fields

Authors: Tomohiro Manabe, Sumio Fujita

Abstract: The BM25 ranking function is one of the most well known query relevance document scoring functions and many variations of it are proposed. The BM25F function is one of its adaptations designed for modeling documents with multiple fields. The Expanded Span method extends a BM25-like function by taking into considerations of the proximity between term occurrences. In this note, we combine these two variations into one scoring method in view of proximity-based scoring of documents with multiple fields.

Submitted 11 September, 2017; originally announced September 2017.

Comments: 2 pages

arXiv:1703.00073 [pdf, ps, other]

doi 10.7567/JJAP.56.04CF13

Physically unclonable function using initial waveform of ring oscillators on 65 nm CMOS technology

Authors: Tetsufumi Tanamoto, Satoshi Takaya, Nobuaki Sakamoto, Hirotsugu Kasho, Shinichi Yasuda, Takao Marukame, Shinobu Fujita, Yuichiro Mitani

Abstract: A silicon physically unclonable function (PUF) using ring oscillators (ROs) has the advantage of easy application in both an application specific integrated circuit (ASIC) and a field-programmable gate array (FPGA). Here, we provide a RO-PUF using the initial waveform of the ROs based on 65 nm CMOS technology. Compared with the conventional RO-PUF, the number of ROs is greatly reduced and the time needed to generate an ID is within a couple of system clocks.

Submitted 10 February, 2017; originally announced March 2017.

Comments: 5 pages, 9 figures

Journal ref: Jpn. J. Appl. Phys. 56, 04CF13 (2017)

arXiv:1606.03147 [pdf, ps, other]

High-Speed Magnetoresistive Random-Access Memory Random Number Generator Using Error-Correcting Code

Authors: Tetsufumi Tanamoto, Naoharu Shimomura, Sumio Ikegawa, Mari Matsumoto, Shinobu Fujita, Hiroaki Yoda

Abstract: A high-speed random number generator (RNG) circuit based on magnetoresistive random-access memory (MRAM) using an error-correcting code (ECC) post processing circuit is presented. ECC post processing increases the quality of randomness by increasing the entropy of random number. We experimentally show that a small error-correcting capability circuit is sufficient for this post processing. It is shown that the ECC post processing circuit powerfully improves the quality of randomness with minimum overhead, ending up with high-speed random number generation. We also show that coupling with a linear feedback shift resistor is effective for improving randomness.

Submitted 9 June, 2016; originally announced June 2016.

Comments: 5 pages, 11 figures

Journal ref: Jpn. J. Appl. Phys. 50, 04DM01 (2011)

arXiv:1605.03290 [pdf, ps, other]

doi 10.1109/TCSII.2016.2602828

Physically Unclonable Function using Initial Waveform of Ring Oscillators

Authors: Tetsufumi Tanamoto, Shinich Yasuda, Satoshi Takaya, Shinobu Fujita

Abstract: A silicon physically unclonable function (PUF) is considered to be one of the key security system solutions for local devices in an era in which the internet is pervasive. Among many proposals, a PUF using ring oscillators (RO-PUF) has the advantage of easy application to FPGA. In the conventional RO-PUF, frequency difference between two ROs is used as one bit of ID. Thus, in order to obtain an ID of long bit length, the corresponding number of RO pairs are required and consequently power consumption is large, leading to difficulty in implementing RO-PUF in local devices. Here, we provide a RO-PUF using the initial waveform of the ROs. Because a waveform constitutes a part of the ID, the number of ROs is greatly reduced and the time needed to generate the ID is finished in a couple of system clocks. We also propose a solution to a change of PUF performance attributable to temperature or voltage change.

Submitted 11 May, 2016; originally announced May 2016.

Comments: 11 pages, 10 figures

Journal ref: IEEE Transactions on Circuits and Systems II: Express Briefs Vol. 64, pp827 - 831 (2017)

arXiv:1510.02138 [pdf]

doi 10.5121/ijcnc.2015.7502

A Scheme for Maximal Resource Utilization in Peer-to-Peer Live Streaming

Authors: Bahaa Aldeen Alghazawy, Satoshi Fujita

Abstract: Peer-to-Peer streaming technology has become one of the major Internet applications as it offers the opportunity of broadcasting high quality video content to a large number of peers with low costs. It is widely accepted that with the efficient utilization of peers and server's upload capacities, peers can enjoy watching a high bit rate video with minimal end-to-end delay. In this paper, we present a practical scheduling algorithm that works in the challenging condition where no spare capacity is available, i.e., it maximally utilizes the resources and broadcasts the maximum streaming rate. Each peer contacts with only a small number of neighbours in the overlay network and autonomously subscribes to sub-streams according to a budget-model in such a way that the number of peers forwarding exactly one sub-stream will be maximized. The hop-count delay is also taken into account to construct a short depth trees. Finally, we show through simulation that peers dynamically converge to an efficient overlay structure with a short hop-count delay. Moreover, the proposed scheme gives nice features in the homogeneous case and overcomes SplitStream in all simulated scenarios.

Submitted 7 October, 2015; originally announced October 2015.

Comments: 16 pages in International Journal of Computer Networks & Communications (IJCNC) Vol.7, No.5, September 2015

Journal ref: International Journal of Computer Networks & Communications (IJCNC) Vol.7, No.5, September 2015

arXiv:1204.2712 [pdf, ps, other]

Learning to Rank Query Recommendations by Semantic Similarities

Authors: Sumio Fujita, Georges Dupret, Ricardo Baeza-Yates

Abstract: Logs of the interactions with a search engine show that users often reformulate their queries. Examining these reformulations shows that recommendations that precise the focus of a query are helpful, like those based on expansions of the original queries. But it also shows that queries that express some topical shift with respect to the original query can help user access more rapidly the information they need. We propose a method to identify from the query logs of past users queries that either focus or shift the initial query topic. This method combines various click-based, topic-based and session based ranking strategies and uses supervised learning in order to maximize the semantic similarities between the query and the recommendations, while at the same diversifying them. We evaluate our method using the query/click logs of a Japanese web search engine and we show that the combination of the three methods proposed is significantly better than any of them taken individually.

Submitted 12 April, 2012; originally announced April 2012.

Comments: 2nd International Workshop on Usage Analysis and the Web of Data (USEWOD2012) in the 21st International World Wide Web Conference (WWW2012), Lyon, France, April 17th, 2012

Report number: WWW2012USEWOD/2012/fuduba ACM Class: H.3.3; H.3.5

Showing 1–17 of 17 results for author: Fujita, S