Angluin showed that the class of regular languages could be learned from a Minimally Adequate Teacher (MAT) providing membership and equivalence queries. Clark and Eyraud (2007) showed that some context-free grammars can be identified in the limit from positive data alone by identifying the congruence classes of the language. In this paper we consider learnability of context-free languages ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 24-37, 2010]]>This paper presents an efficient algorithm that identifies a rich subclass of multiple context-free languages in the limit from positive data and membership queries by observing where each tuple of strings may occur in sentences of the language of the learning target. Our technique is based on Clark et al.’s work (ICGI 2008) on learning of a subclass ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 230-244, 2010]]>The Syntactic Concept Lattice is a residuated lattice based on the distributional structure of a language; the natural representation based on this is a context-sensitive formalism. Here we examine the possibility of basing a context-free grammar (CFG) on the structure of this lattice; in particular by choosing non-terminals to correspond to concepts in this lattice. We present ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 38-51, 2010]]>We introduce a new algorithm for sequential learning of Mealy automata by congruence generator extension (CGE). Our approach makes use of techniques from term rewriting theory and universal algebra for compactly representing and manipulating automata using finite congruence generator sets represented as string rewriting systems (SRS). We prove that the CGE algorithm correctly learns in the limit.

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 148-162, 2010]]>We recapitulate inference from membership and equivalence queries, positive and negative samples. Regular languages cannot be learned from one of those information sources only [1,2,3]. Combinations of two sources allowing regular (polynomial) inference are MQs and EQs [4], MQs and positive data [5,6], positive and negative data [7,8]. We sketch a meta-algorithm fully presented in [...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 288-292, 2010]]>Conference: International Colloquium on Grammatical Inference - ICGI, pp. 122-134, 2010]]>

While Grammar Inference (GI) has been successfully applied to many diverse domains such as speech recognition and robotics, its application to software engineering has been limited, despite wide use of context-free grammars in software systems. This paper reports current developments and future directions in the applicability of GI to software engineering, where GI is seen to offer innovative solutions ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 276-279, 2010]]>We report results on applying the OIL (Order Independent Language) grammar inference algorithm to predict cleavage sites in polyproteins from translation of the Potyvirus genome. This non-deterministic algorithm is used to generate a group of models which vote to predict the occurrence of the pattern. We built nine models, one for each cleavage site in this kind of virus genome ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 267-270, 2010]]>Although MATLAB has become one of the mainstream languages for the machine learning community, there is still skepticism among the Grammatical Inference (GI) community regarding the suitability of MATLAB for implementing and running GI algorithms. In this paper we will present implementation results of several GI algorithms, e.g., RPNI (Regular Positive and Negative Inference), EDSM (Evidence Driven State Merging), ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 262-266, 2010]]>Tree rewriting systems are sets of tree rewriting rules used to compute by repeatedly replacing equal trees in a given formula until the simplest possible form (normal form) is obtained. The Church-Rosser property is certainly one of the most fundamental properties of a tree rewriting system. In this system the simplest form of a given tree is unique since the ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 284-287, 2010]]>Grammatical inference is typically defined as the task of finding a compact representation of a language given a subset of sample sequences from that language. Many different aspects, paradigms and settings can be investigated, leading to different proofs of language learnability or practical systems. The general problem can be seen as a one class classification or discrimination task. In this ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 245-257, 2010]]>In this article, we propose the use of suffix arrays to efficiently implement n-gram language models with practically unlimited size n. This approach, which is used with synchronous back-off, allows us to distinguish between alternative sequences using large contexts. We also show that we can build models of this kind with additional information for each symbol, such as ...
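To illustrate why a suffix array makes n-gram counts of arbitrary length cheap to look up, here is a minimal sketch. It is not the authors' implementation: the function names `build_suffix_array` and `count_ngram` are hypothetical, the construction is a naive sort rather than an efficient suffix-array algorithm, and synchronous back-off is not modelled. It assumes the corpus is a Python list of tokens.

```python
def build_suffix_array(tokens):
    """Return the start positions of all suffixes of `tokens`, sorted
    lexicographically. Naive construction, for illustration only."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def count_ngram(tokens, suffix_array, ngram):
    """Count occurrences of `ngram` in `tokens` with two binary searches
    over the suffix array. `tokens` must be a list."""
    ngram = list(ngram)
    n = len(ngram)

    # Lower bound: first suffix whose length-n prefix is >= ngram.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if tokens[start:start + n] < ngram:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose length-n prefix is > ngram.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if tokens[start:start + n] <= ngram:
            lo = mid + 1
        else:
            hi = mid
    return lo - first


corpus = "a b a b c a b".split()
sa = build_suffix_array(corpus)
print(count_ngram(corpus, sa, ["a", "b"]))
```

Because every n-gram is a prefix of some suffix, the same index answers queries for any n without storing a separate table per order.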

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 305-308, 2010]]>We develop a framework based on Hölder norms that allows us to easily transfer learnability results. This idea is concretized by applying it to Classical Categorial Grammars (CCG).

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 280-283, 2010]]>In this paper, we present a general framework for supervised classification. This framework provides methods like boosting and only needs the definition of a generalisation operator called lgg. For sequence classification tasks, lgg is a learner that only uses positive examples. We show that grammatical inference has already defined such learners for automata classes like reversible automata or k-TSS ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 189-202, 2010]]>We prove in this work that, under certain conditions, an algorithm that arbitrarily merges states in the prefix tree acceptor of the sample in a consistent way, converges to the minimum DFA for the target language in the limit. This fact is used to learn automata teams, which use the different automata output by this algorithm to classify the test. ...
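For readers unfamiliar with the prefix tree acceptor (PTA) that state-merging algorithms start from, the following is a minimal sketch of its construction from positive samples. The names `build_pta` and `accepts` and the dictionary-based representation are assumptions made for illustration; the merging procedure studied in the paper is not shown.

```python
def build_pta(samples):
    """Build a prefix tree acceptor from an iterable of positive strings.

    States are integers and state 0 is the root (the empty prefix).
    Returns (transitions, accepting), where `transitions` maps
    (state, symbol) to the successor state.
    """
    transitions = {}
    accepting = set()
    next_state = 1
    for word in samples:
        state = 0
        for symbol in word:
            key = (state, symbol)
            if key not in transitions:
                # Unseen prefix: allocate a fresh state for it.
                transitions[key] = next_state
                next_state += 1
            state = transitions[key]
        accepting.add(state)
    return transitions, accepting


def accepts(transitions, accepting, word):
    """Return True if the PTA accepts `word`."""
    state = 0
    for symbol in word:
        state = transitions.get((state, symbol))
        if state is None:
            return False
    return state in accepting


pta_transitions, pta_accepting = build_pta(["ab", "abc", "b"])
print(accepts(pta_transitions, pta_accepting, "abc"))
```

The PTA accepts exactly the sample; generalization only arises once states are merged.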

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 52-65, 2010]]>Pattern language learning algorithms within the inductive inference model and query learning setting have been of great interest. In this paper an algorithm to learn a parallel communicating grammar system in which the master component is a regular grammar and the other components are pure pattern grammars is given.

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 301-304, 2010]]>In this paper we extend the PAC learning algorithm due to Clark and Thollard for learning distributions generated by PDFA to automata whose transitions may take varying time lengths, governed by exponential distributions.

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 271-275, 2010]]>Unambiguous Non-Terminally Separated (UNTS) grammars have good learnability properties but are too restrictive to be used for natural language parsing. We present a generalization of UNTS grammars called Unambiguous Weakly NTS (UWNTS) grammars that preserve the learnability properties. Then, we study the problem of using them to parse natural language and evaluating against a gold treebank. If the target ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 135-147, 2010]]>Molecular biology is full of linguistic metaphors, from the language of DNA to the genome as “book of life.” Certainly the organization of genes and other functional modules along the DNA sequence invites a syntactic view, which can be seen in certain tools used in bioinformatics such as hidden Markov models. It has also been shown that folding of ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 5-10, 2010]]>We adapt an algorithm (RTI) for identifying (learning) a deterministic real-time automaton (DRTA) to the setting of positive timed strings (or time-stamped event sequences). A DRTA can be seen as a deterministic finite state automaton (DFA) with time constraints. Because DRTAs model time using numbers, they can be exponentially more compact than equivalent DFA models that model time ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 203-216, 2010]]>This paper revisits a problem of the evaluation of computational grammatical inference (GI) systems and discusses what role complexity measures can play for the assessment of GI. We provide a motivation for using the Rademacher complexity and give an example showing how this complexity measure can be used in practice.

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 293-296, 2010]]>We introduce a formal paradigm to study global adaptive behavior of organizations of collaborative agents with local learning capabilities. Our model is based on an extension of the classical language learning setting in which a teacher provides examples to a student that must guess a correct grammar. In our model the teacher is transformed into a workload dispatcher and ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 163-177, 2010]]>We show that within the Gold paradigm for language learning an informer for a superfinite set can cause an optimal MDL learner to make an infinite number of mind changes. In this setting an optimal learner can make an infinite number of wrong choices without approximating the right solution. This result helps us to understand the relation between MDL and ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 258-261, 2010]]>Workflows are an important knowledge representation used to understand and automate processes in diverse task domains. Past work has explored the problem of learning workflows from traces of processing. In this paper, we are concerned with learning workflows from interleaved traces captured during the concurrent processing of multiple task instances. We first present an abstraction of the problem of recovering ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 80-93, 2010]]>We present an exact algorithm for identification of deterministic finite automata (DFA) which is based on satisfiability (SAT) solvers. Despite the size of the low level SAT representation, our approach is competitive with alternative techniques. Our contributions are fourfold: First, we propose a compact translation of DFA identification into SAT. Second, we reduce the SAT search space by adding lower ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 66-79, 2010]]>Grammatical Inference has recently been applied successfully to bioinformatic tasks such as protein domain prediction. In this work we present a new approach to infer regular languages. Although used in a biological task, our results may be useful not only in bioinformatics, but also in many applied tasks. To test the algorithm we consider the transmembrane domain prediction task. A preprocessing ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 178-188, 2010]]>This paper is concerned with a subclass of finite state transducers, called strict prefix deterministic finite state transducers (SPDFST’s for short), and studies a problem of identifying the subclass in the limit from positive data. After providing some properties of languages accepted by SPDFST’s, we show that the class of SPDFST’s is polynomial time identifiable in the limit from positive ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 313-316, 2010]]>Conference: International Colloquium on Grammatical Inference - ICGI, 2010]]>

A class $\mathcal{L}$ is called mitotic if it admits a splitting $\mathcal{L}_0,\mathcal{L}_1$ such that $\mathcal{L},\mathcal{L}_0,\mathcal{L}_1$ are all equivalent with respect to a certain reducibility. Such a splitting might be called a symmetric splitting. In this paper we investigate the possibility of ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 109-121, 2010]]>In this paper, a heuristic algorithm for the inference of an arbitrary context-free grammar is presented. The input data consist of a finite set of representative words chosen from a (possibly infinite) context-free language and of a finite set of counterexamples—words which do not belong to the language. The time complexity of the algorithm is polynomially bounded. The experiments have been ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 217-229, 2010]]>This paper discusses the potential synergy between research in grammatical inference and research in artificial intelligence applied to games. There are two aspects to this: the potential as a rich source of challenging and engaging test problems, and the potential for real applications.

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 1-4, 2010]]>This paper takes up the topic of a task of learning fuzzy context-free grammar from data. The induction process is divided into two phases: first the generic grammar is derived from the positive sentences, next the membership grades are assigned to the productions taking into account the occurrences of productions in a learning set. The problem of predicting the ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 309-312, 2010]]>In this work, we give an algorithm that infers Regular Trace Languages. Trace languages can be seen as regular languages that are closed under a partial commutation relation called the independence relation. This algorithm is similar to the RPNI algorithm, but it is based on Asynchronous Cellular Automata. For this purpose, we define Asynchronous Cellular Moore Machines and implement the ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 11-23, 2010]]>A learning algorithm is developed for a class of regular expressions equivalent to the class of all unionless unambiguous regular expressions of loop depth 2. The learner uses one representative example of the target language (where every occurrence of every loop in the target expression is unfolded at least twice) and a number of membership queries. The algorithm works in ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 94-108, 2010]]>Paraphrasing normally involves sophisticated linguistic resources for pre-processing. In the present work Modern Greek paraphrases are automatically generated using statistical significance testing in a novel manner for the extraction of applicable reordering schemata of syntactic constituents. Next, supervised filtering helps remove erroneously generated paraphrases, taking into account the context surrounding the reordering position. The proposed process is knowledge-poor, ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 297-300, 2010]]>Conference: International Colloquium on Grammatical Inference - ICGI, 2009]]>

We present a polynomial algorithm for the inductive inference of a large class of context free languages, that includes all regular languages. The algorithm uses a representation which we call Binary Feature Grammars based on a set of features, capable of representing richly structured context free languages as well as some context sensitive languages. More precisely, we focus on a ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 29-42, 2008]]>We generalize a learning algorithm by Drewes and Hogberg (1) for regular tree languages based on a learning model proposed by Angluin (2) to recognizable tree languages of arbitrarily many dimensions, so-called multi-dimensional trees. Trees over multi-dimensional tree domains have been defined by Rogers (3,4). However, since the algorithm by Drewes and Hogberg relies ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 111-124, 2008]]>Recently Clark and Eyraud (2005, 2007) have shown that substitutable context-free languages are polynomial-time identifiable in the limit from positive data. Substitutability in context-free languages can be thought of as the analogue of reversibility in regular languages. While reversible languages admit a hierarchy, namely k-reversible regular languages for each nonnegative integer k, Clark and Eyraud targeted ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 266-279, 2008]]>Recently, an algorithm - DEES - was proposed for learning rational stochastic tree languages. Given a sample of trees independently and identically drawn according to a distribution defined by a rational stochastic language, DEES outputs a linear representation of a rational series which converges to the target. DEES can then be used to identify in the limit with probability one rational stochastic ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 57-70, 2008]]>Empirical grammatical inference systems are practical systems that learn structure from sequences, in contrast to theoretical grammatical inference systems, which prove learnability of certain classes of grammars. All current empirical grammatical inference evaluation methods are problematic, i.e. dependency on language experts, appropriateness and quality of an underlying grammar of the data, and influence of the parameters of the evaluation ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 301-303, 2008]]>The accuracy of an inferred grammar is commonly computed by measuring the percentage of sequences that are correctly classified from a random sample of sequences produced by the target grammar. This approach is problematic because (a) it is unlikely that a random sample of sequences will adequately test the grammar and (b) the use of a single probability value provides ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 252-265, 2008]]>Comparison of standard language learning paradigms (identification in the limit, query learning, PAC learning) has always been a complex question. Moreover, when to the question of converging to a target one adds computational constraints, the picture becomes even less clear: how much do queries or negative examples help? Can we find good algorithms that change their minds very little ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 43-56, 2008]]>Based on the ideas suggested in [5], the following model for learning from a variant of correction queries to an oracle is proposed: being asked a membership query, the oracle, in the case of negative answer, returns also a correction – a positive datum (that has not been seen in the learning process yet) with the smallest edit distance from the ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 125-138, 2008]]>This paper presents PCFG-BCL, an unsupervised algorithm that learns a probabilistic context-free grammar (PCFG) from positive samples. The algorithm acquires rules of an unknown PCFG through iterative biclustering of bigrams in the training corpus. Our analysis shows that this procedure uses a greedy approach to adding rules such that each set of rules that is added to the ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 224-237, 2008]]>We present a simple computational model that takes into account semantics for language learning, as motivated by readings in the literature of children's language acquisition and by a desire to incorporate a robust notion of semantics in the field of Grammatical Inference. We argue that not only is it more natural to take into account semantics, but also that ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 1-14, 2008]]>We present an improvement of an algorithm due to Clark and Thollard (Journal of Machine Learning Research, 2004) for PAC-learning distributions generated by Probabilistic Deterministic Finite Automata (PDFA). Our algorithm is an attempt to keep the rigorous guarantees of the original one but use sample sizes that are not as astronomical as predicted by the theory. We prove that ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 163-174, 2008]]>Pattern language learning algorithms within the inductive inference model and query learning setting have been of great interest. In this paper, we study the problem of learning pure pattern languages using queries and examples.

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 280-282, 2008]]>In this article we study the inference of commutative regular languages. We first show that commutative regular languages are not inferable from positive samples, and then we study the possible improvement of inference from positive and negative samples. We propose a polynomial algorithm to infer commutative regular languages from positive and negative samples, and we show, from experimental results, that ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 71-83, 2008]]>We study the complexity of identifying (learning) timed automata in the limit from data. Timed automata are finite state models that model time explicitly, i.e., using numbers. Because timed automata use numbers to represent time, they can be exponentially more compact than models that model time implicitly, i.e., using states. We show three results that are essential in ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 238-251, 2008]]>In this paper, we interpret in terms of operations applying on extended finite state automata some algorithms that have been specified on categorial grammars to learn subclasses of context-free languages. The algorithms considered implement specialization strategies. This new perspective also helps to understand how it is possible to control the combinatorial explosion that specialization techniques have to face, thanks ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 200-212, 2008]]>Multiplicity Automata are devices that implement functions from a string space to a field. Usually the real number field is used. From a learning point of view there exist some algorithms that are able to identify any multiplicity automaton from membership and equivalence queries. In this work we show that those algorithms can also be used if the alge- ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 154-162, 2008]]>The left-to-right and right-to-left iterative languages are previously unnoticed subclasses of the regular languages of infinite size that are identifiable in the limit from positive data. Essentially, these language classes are the ones obtained by merging final states in a prefix tree and initial states in a suffix tree of the observed sample, respectively. Strikingly, these ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 84-97, 2008]]>In this paper, we study a learning procedure from positive data for bounded unions of certain class of languages. Our key tools are the notion of characteristic sets and hypergraphs. We generate hypergraphs from given positive data and exploit them in order to find characteristic sets.

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 98-110, 2008]]>In [3], we have provided an algorithm to infer a few subclasses of linear languages through labeled extended Petri nets. The family of equal matrix languages [6] meets both the families of context sensitive languages and context-free languages. In this paper, we prove that an equal matrix language is a Petri net language. We construct labeled extended Petri nets ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 289-291, 2008]]>The induction of monadic node selecting queries from partially annotated XML-trees is a key task in Web information extraction. We show how to integrate schema guidance into an RPNI-based learning algorithm, in which monadic queries are represented by pruning node selecting tree transducers. We present experimental results on schema guidance by the DTD of HTML. We study ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 15-28, 2008]]>Conference: International Colloquium on Grammatical Inference - ICGI, 2008]]>

Standard state-merging DFA induction algorithms, such as RPNI or Blue-Fringe, aim at inferring a regular language from positive and negative strings. In particular, the negative information prevents merging incompatible states: merging those states would lead to produce an inconsistent DFA. Whenever available, domain knowledge can also be used to extend the set of incompatible states. We introduce here ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 139-153, 2008]]>Conference: International Colloquium on Grammatical Inference - ICGI, pp. 295-297, 2008]]>

Stochastic graph grammars are probabilistic models suitable for modeling relational data, complex organic molecules, social networks, and various other data distributions [1]. In this paper, we demonstrate that such grammars can be used to reveal useful information about the underlying distribution. In particular, we demonstrate techniques for estimating the expected number of nodes, the expected number of edges, and the ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 292-294, 2008]]>Computational approaches to learning aspects of language typically reduce the problem to learning syntax alone, or learning a lexicon alone. These simplifications have led to disconnected solutions and some unreasonable assumptions about inputs to their algorithms. In this paper, we present an approach that exploits a grammar learning algorithm to learn its own alphabet, or lexicon. We present empirical results ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 283-285, 2008]]>Within the data mining community there has been a lot of interest in mining and learning from graphs (see [1] for a recent overview). Most work in this area has focussed on finding algorithms that help solve real-world problems. Although useful and interesting results have been obtained, more fundamental issues like learnability properties have hardly been addressed yet. ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 286-288, 2008]]>We show the probabilistic learnability of a subclass of linear languages with queries. Learning via queries is an important problem in grammatical inference, but the power of queries for probabilistic learnability is not yet clear. In the probabilistic learning model, the PAC (Probably Approximately Correct) criterion is an important one, and many results have been shown in this model. Angluin has shown ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 187-199, 2008]]>This paper takes up the topic of a task of training Grammar-based Classifier System (GCS) to regular grammars from data. GCS is a new model of Learning Classifier Systems in which the population of classifiers has a form of a context-free grammar rule set in a Chomsky Normal Form. Near-optimal solutions or better than reported in the ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 298-300, 2008]]>The adult-child interaction which takes place during the child’s language acquisition process has been the inspiration for Angluin’s teacher-learner model [1], the forerunner of today’s active learning field. But the initial types of queries have some drawbacks: equivalence queries are both unrealistic and computationally costly; membership queries, on the other hand, are not informative enough, not being able ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 213-223, 2008]]>In this work we propose a method to infer context-sensitive languages from positive structural examples produced by linear grammars. Our approach is based on a representation theorem induced by two operations over strings: duplication and reversal. The inference method produces an acceptor device which is an unconventional model of computation based on biomolecules (DNA computing). We prove that a ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 175-186, 2008]]>In active learning, membership queries and equivalence queries have established themselves as the standard combination to be used. However, they are quite “unnatural” for real learning environments (membership queries are oversimplified and equivalence queries do not have a correspondence in a real life setting). Based on several linguistic arguments that support the presence of corrections in children'...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 281-292, 2006]]>This paper presents a new grammar induction algorithm for probabilistic context-free grammars (PCFGs). There is an approach to PCFG induction that is based on parameter estimation. Following this approach, we apply the variational Bayes to PCFGs. The variational Bayes (VB) is an approximation of Bayesian learning. It has been empirically shown that VB is less likely to cause ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 84-96, 2006]]>Strings can be mapped into Hilbert spaces using feature maps such as the Parikh map. Languages can then be defined as the pre-image of hyperplanes in the feature space, rather than using grammars or automata. These are the planar languages. In this paper we show that using techniques from kernel-based learning, we can represent and efficiently learn, ...
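As a concrete illustration of the idea of defining a language as the pre-image of a hyperplane under the Parikh map, here is a small sketch. The names `parikh_map` and `in_planar_language`, and the choice of a single explicit hyperplane given by `weights` and `offset`, are assumptions for illustration; the kernel-based learning techniques mentioned in the abstract are not shown.

```python
from collections import Counter


def parikh_map(word, alphabet):
    """Map a string to its vector of symbol counts, ordered by `alphabet`."""
    counts = Counter(word)
    return [counts[symbol] for symbol in alphabet]


def in_planar_language(word, alphabet, weights, offset):
    """Return True if the Parikh vector of `word` lies on the hyperplane
    sum(weights[i] * x[i]) == offset."""
    vector = parikh_map(word, alphabet)
    return sum(w * x for w, x in zip(weights, vector)) == offset


# Hyperplane x_a - x_b = 0: strings with equally many a's and b's.
print(parikh_map("abba", "ab"))
print(in_planar_language("aabb", "ab", [1, -1], 0))
```

With `weights = [1, -1]` and `offset = 0`, membership depends only on symbol counts, so both `"aabb"` and `"abab"` are in the language while `"aab"` is not.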

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 148-160, 2006]]>In this paper we study the application of the Minimum Description Length principle (or two-part-code optimization) to grammar induction in the light of recent developments in Kolmogorov complexity theory. We focus on issues that are important for construction of effective compression algorithms. We define an independent measure for the quality of a theory given a data set: the ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 293-306, 2006]]>In this paper we will present opportunities for applying graph based linguistic formalisms for computer automatic understanding of meaning of wrist medical images. Thanks to the proposed method we can understand the merit content of the image even if the form of the image is very different from any known pattern. It seems that in the near future such technique ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 349-350, 2006]]>Non-terminally separated (NTS) languages are a subclass of deterministic context free languages where there is a stable relationship between the substrings of the language and the non-terminals of the grammar. We show that when the distribution of samples is generated by a PCFG, based on the same grammar as the target language, the class of unambiguous NTS languages ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 59-71, 2006]]>In this paper, we present a theoretical approach for the problem of learning multiplicity tree automata. These automata allow one to define functions which compute a number for each tree. They can be seen as a strict generalization of stochastic tree automata since they allow one to define functions over any field K. A multiplicity automaton admits a support ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 268-280, 2006]]>The class of very simple grammars is known to be polynomial-time identifiable in the limit from positive data. This paper introduces an extension of very simple grammars called right-unique simple grammars, and presents an algorithm that identifies right-unique simple grammars in the limit from positive data. The learning algorithm possesses the following three properties. It computes a ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 45-58, 2006]]>To study the problem of learning from noisy data, the common approach is to use a statistical model of noise. The influence of the noise is then considered according to pragmatic or statistical criteria, by using a paradigm taking into account a distribution of the data. In this article, we study the noise as a nonstatistical phenomenon, by defining ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 19-31, 2006]]>Natural languages contain regular, context-free, and context-sensitive syntactic constructions, yet none of these classes of formal languages can be identified in the limit from positive examples. Mildly context-sensitive languages are able to represent some context-sensitive constructions, those most common in natural languages, such as multiple agreement, crossed agreement, and duplication. These languages are attractive for natural ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 137-147, 2006]]>Many real-world applications such as spell-checking or DNA analysis use the Levenshtein edit-distance to compute similarities between strings. In practice, the costs of the primitive edit operations (insertion, deletion and substitution of symbols) are generally hand-tuned. In this paper, we propose an algorithm to learn these costs. The underlying model is a probabilistic ...
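
To make the role of the edit costs concrete, here is a minimal sketch of the standard dynamic-programming edit distance with the three primitive costs exposed as parameters. It only shows how hand-tuned costs enter the computation; it is not the cost-learning algorithm proposed in the paper, and the function and parameter names are assumptions chosen for the example.

```python
def weighted_edit_distance(x, y, ins_cost=1.0, del_cost=1.0, sub_cost=1.0):
    """Edit distance between x and y with one cost per primitive operation."""
    m, n = len(x), len(y)
    # d[i][j] = cheapest way to turn x[:i] into y[:j]
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + del_cost       # delete all of x[:i]
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + ins_cost       # insert all of y[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            match = 0.0 if x[i - 1] == y[j - 1] else sub_cost
            d[i][j] = min(
                d[i - 1][j] + del_cost,        # delete x[i-1]
                d[i][j - 1] + ins_cost,        # insert y[j-1]
                d[i - 1][j - 1] + match,       # substitute or keep
            )
    return d[m][n]

# Unit costs give the classical Levenshtein distance.
print(weighted_edit_distance("kitten", "sitting"))
# Making substitution expensive changes which edit script is cheapest.
print(weighted_edit_distance("kitten", "sitting", sub_cost=3.0))
```

A per-symbol cost table (for example a dictionary keyed by symbol pairs) would be the natural next step if costs depend on which symbols are involved.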

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 240-252, 2006]]>The notions of iso-arrays, iso-pictures, local iso-picture languages and recognizable iso-picture languages have been introduced and studied in [6]. In [6] we have provided an algorithm to learn local iso-picture languages through identification in the limit using positive data. In this paper, we construct a two-dimensional on-line tessellation automaton to recognize iso-picture ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 327-339, 2006]]>Analysis by reduction is a linguistically motivated method for checking correctness of a sentence. It can be modelled by restarting automata. In this paper we propose a method for learning restarting automata which are strictly locally testable (SLT-R-automata). The method is based on the concept of identification in the limit from positive examples only. Also we characterize the ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 125-136, 2006]]>In this paper we address the problem of grammatical inference in the programming language domain. The grammar of a programming language is an important asset because it is used in developing many software engineering tools. Sometimes, grammars of languages are not available and have to be inferred from the source code; especially in the case of programming language dialects. We ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 201-213, 2006]]>This paper describes the Tenjinno Machine Translation Competition held as part of the International Colloquium on Grammatical Inference 2006. The competition aimed to promote the development of new and better practical grammatical inference algorithms used in machine translation. Tenjinno focuses on formal models used in machine translation. We discuss design issues and decisions made when creating the competition. For the ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 214-226, 2006]]>We propose 10 different open problems in the field of grammatical inference. In all cases, problems are theoretically oriented but correspond to practical questions. They cover the areas of polynomial learning models, learning from ordered alphabets, learning deterministic POMDPs, learning negotiation processes, learning from context-free background knowledge.

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 32-44, 2006]]>In this paper we discuss an approach to named entity recognition (NER) based on grammatical inference (GI). Previous GI approaches have aimed at constructing a grammar underlying a given text source. It has been noted that the rules produced by GI can also be interpreted semantically [16] where a non-terminal describes interchangeable elements which are the instances of the ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 188-200, 2006]]>This paper describes novel methods of learning general context free grammars from sample strings, which are implemented in the Synapse system. The main features of the system are incremental learning, rule generation based on bottom-up parsing of positive samples, and search for rule sets. From the results of parsing, a rule generation process, called “bridging,” synthesizes the production rules that ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 72-83, 2006]]>- Extended Abstract - The general goal of query-based learning algorithms for finite-state machines is to identify a machine, usually of minimum size, that agrees with an a priori fixed (class of) machines. For this, queries on how the underlying system behaves may be issued. A popular setup is that of Angluin's L* algorithm (Ang87), here adapted to the ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 344-345, 2006]]>We present the first algorithm for learning n-ary node selection queries in trees from completely annotated examples by methods of grammatical inference. We propose to represent n-ary queries by deterministic n-ary node selecting tree transducers (n-NSTTs). These are tree automata that capture the class of monadic second-order definable n-ary queries. We show ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 253-267, 2006]]>We survey the foundations of kernel methods and the recent developments of kernels for variable-length strings, in the context of biological sequence analysis.

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 7-18, 2006]]>The rapid growth of protein sequence databases is exceeding the capacity of biochemically and structurally characterizing new proteins. Therefore, the development of tools to locate, within protein sequences, those subsequences with an associated function or specific feature is very important. In our work, we propose a method to predict one of those functional motifs (coiled coil), related ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 175-187, 2006]]>We discuss the problem of large scale grammatical inference in the context of the Tenjinno competition, with reference to the inference of deterministic finite state transducers, and discuss the design of the algorithms and the design and implementation of the program that solved the first problem. Though the OSTIA algorithm has good asymptotic guarantees for this class of ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 227-239, 2006]]>k-reversible languages are regular languages that offer interesting properties from the point of view of identification of formal languages in the limit. Different methods have been proposed to identify k-reversible languages in the limit from positive samples. Non-regular language classes have been reduced to regular reversible languages in order to solve their associated learning problems. In this ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 354-355, 2006]]>Below I show that the class of strings that can be learned by a deterministic DEC grammar is exactly the class of rational numbers between 0 and 1. I call this the class of semi-periodic or rational strings. Dynamically Expanding Context (DEC) grammars were introduced by Kohonen in order to model speech signals ([8]). They can be learned in ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 320-326, 2006]]>In this paper, we present a directed Markov random field model that integrates trigram models, structural language models (SLM) and probabilistic latent semantic analysis (PLSA) for the purpose of statistical language modeling. The SLM is essentially a generalization of shift-reduce probabilistic push-down automata and is thus more complex and powerful than probabilistic context free grammars (PCFGs). The added context-...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 97-111, 2006]]>A tree pattern p is a first-order term in formal logic, and the language of p is the set of all the tree patterns obtainable by replacing each variable in p with a tree pattern containing no variables. We consider the inductive inference of the unions of these languages from positive examples using strategies that guarantee some forms of ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 307-319, 2006]]>Conference: International Colloquium on Grammatical Inference - ICGI, 2006]]>

We propose a new methodology for ethology in terms of automata induction. Recent studies on the Bengalese finch reported unique features of its songs. As opposed to most other songbirds, the songs of the Bengalese finch are neither monotonous nor random; they can be represented by a finite automaton, which we call song syntax [3]. Juvenile finches learn songs from their ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 351-353, 2006]]>We are concerned with a unified algorithm for extending classes of languages identifiable in the limit from positive data. Let $\mathcal{L}$ be a class of languages to be based on and let $\mathcal{X}$ be a class of finite subsets of strings. The extended class of $\mathcal{L}$, denoted by $\mathcal{C}(\mathcal{L}, \mathcal{X})$ ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 161-174, 2006]]>XML is a data format for storing structured data. With the fast-growing usage of XML documents there is a demand to natively store the documents in XML databases and process them using XML-optimized tools. With the growing complexity of the stored XML data structures, the complexity of the queries used to retrieve the required information also grows. This ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 342-343, 2006]]>In this article we present a syntax-based translation system, called TABL (Translation using Alignment-Based Learning). It translates natural language sentences by mapping grammar rules (which are induced by the Alignment-Based Learning grammatical inference framework) of the source language to those of the target language. By parsing a sentence in the source language, the grammar rules ...

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 356-357, 2006]]>A preliminary experimental result is reported on language identification tasks by Recurrent Self-Organization Maps (RSOM) with a context map layer, using English part-of-speech strings of variable length. With subsymbolic processing, RSOM suprasymbolically sublimed syntactic rules into a topological configuration.

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 346-348, 2006]]>The aim of this paper is to present MRIA, a new state-merging algorithm for the inference of Residual Finite State Automata.

Conference: International Colloquium on Grammatical Inference - ICGI, pp. 340-341, 2006]]>